LEXIS Airflow Provider#

As stated on the official website of the Apache Airflow:

Apache Airflow 2 is built in modular way. The “Core” of Apache Airflow provides core scheduler functionality which allow you to write some basic tasks, but the capabilities of Apache Airflow can be extended by installing additional packages, called providers.

Providers can contain operators, hooks, sensor, and transfer operators to communicate with a multitude of external systems, but they can also extend Airflow core with new capabilities.

LEXIS Airflow Provider extends the Airflow by new hooks, models, and operators (with sensors) needed for using the Airflow along with the LEXIS Platform. Here, a brief summary of all is provided.

LEXIS Operators#

The definition of an Operator is the following:

An Operator is conceptually a template for a predefined Task, that you can just define declaratively inside your DAG (Directed Acyclic Graph).

Airflow has a very extensive set of operators available, with some built-in to the core or pre-installed providers. Some popular operators from core include:

  • BashOperator – executes a bash command

  • PythonOperator – calls an arbitrary Python function

  • EmailOperator – sends an email

  • Use the @task decorator to execute an arbitrary Python function. It doesn’t support rendering jinja templates passed as arguments.

It is also important to recall the following note:

Inside Airflow’s code, we often mix the concepts of Tasks and Operators, and they are mostly interchangeable. However, when we talk about a Task, we mean the generic “unit of execution” of a DAG; when we talk about an Operator, we mean a reusable, pre-made Task template whose logic is all done for you and that just needs some arguments.

The sensors operate in a manner analogous to operators, but they are repeatedly invoked within a loop until a predefined condition is met.

Here, a list of existing LEXIS operators/sensors defined in the LEXIS Airflow Provider:

Base Operators#

  • lexis.operators.base.LexisOperatorBase – a simple base operator base.

  • lexis.operators.base.LexisAAIOperatorBase – includes base attributes and functions to handle LEXIS AAI authentization.

  • lexis.operators.base.LexisStagingOperatorBaseV2 – includes base attributes to handle DDI staging.

  • lexis.operators.base.HEAppEOperatorBase – includes base attributes and functions to handle HEAppE session.

AAI Operators#

  • lexis.operators.LexisAAIOperator – used to exchange access token for tokens of other services used by the Airflow instance, and store them in the database.

  • lexis.operators.aai.LexisAAITokenKeeperSensor – to periodically check and refresh (if needed) tokens, until the all of the task related to the session are done or failed.

  • lexis.operators.aai.LexisAAIStopKeeperOperator – to stop LexisAAITokenKeeperSensor once all tasks are done.

DDI Operators#

  • lexis.operators.ddi_v2.LexisStagingOperatorV2 – used to trigger staging operations on the DDI, e.g. from DDI to HPC system or other way around.

  • lexis.operators.ddi_v2.LexisWaitStagingRequestSensorV2 – used to wait for staging operation to be finished or failed.

HEAppE Operators#

Current HEAppE Operators supports the following major versions of the HEAppE: v4, v5 and v6.

  • lexis.operators.heappe.common.HEAppESessionLEXISOperator – to manage HEAppE session.

  • lexis.operators.heappe.common.HEAppEPrepareJobOperatorFullContext – to prepare a HEAppE job context by job specifications.

  • lexis.operators.heappe.common.HEAppEPrepareJobSchedulerOperator – similar to HEAppEPrepareJobOperatorFullContext, but the job specification is passed as a single JSON (or Python dict) as a post-processed data by the Smart Scheduler.

  • lexis.operators.heappe.common.HEAppESubmitJobOperator – to submit the HEAppE job.

  • lexis.operators.heappe.common.HEAppEDeleteJobOperator – to delete HEAppE job once the job is finished (or failed).

  • lexis.operators.heappe.common.HEAppEWaitSubmittedJobLEXISSensor – to wait for submitted job until it is finished (or failed).

  • lexis.operators.heappe.common.HEAppESessionLEXISKeeperSensor – periodically check and refresh (if needed) the HEAppE session, until the all of the task related to the session are done or failed.

  • lexis.operators.heappe.common.HEAppESessionStopKeeperOperator – to stop HEAppESessionKeeperSensor` all tasks are done.

Smart Scheduler Operators#

  • lexis.operators.scheduler.hpc_job.XcomStorageOperator – serves as storage for necessary info while using e.g. ‘failover’ policy to keep the info ‘in memory’.

  • lexis.operators.scheduler.hpc_job.SelectOperator – tries to select some location/cluster based on definition. The input job requirements are defined by the lexis.models.pydantic.hpc.hpcjobrequirements.HPCJobRequirementsModel, but passed as dict or JSON string.

  • lexis.operators.scheduler.hpc_job.CheckJobStatusOperator – check the status of the job (success/failed) and update the status in the Smart Scheduler database. Also, restarts the job using another available HPC Cluster if, e.g., ‘failover’ policy is used.

  • lexis.operators.scheduler.hpc_job.FinishOperator – check if the whole workflow was successful.

LEXIS Hooks#

The definition of a hook is the following:

A Hook is a high-level interface to an external platform that lets you quickly and easily talk to them without having to write low-level code that hits their API or uses special libraries. They’re also often the building blocks that Operators are built out of.

They integrate with Connections to gather credentials, and many have a default conn_id; for example, the PostgresHook automatically looks for the Connection with a conn_id of postgres_default if you don’t pass one in.

In short, the hooks are the base elements providing API calls.

Within the LEXIS Airflow provider, the following hooks are implemented:

  • lexis.hooks.aai.KeycloakHook – to handle requests to the Keycloak service (authentication/token management).

  • lexis.hooks.heappe.heappe_v4.HEAppEV4Hook – to handle requests to the HEAppE (version 4) service.

  • lexis.hooks.heappe.heappe_v5.HEAppEV4Hook – to handle requests to the HEAppE (version 5) service.

  • lexis.hooks.heappe.heappe_v6.HEAppEV4Hook – to handle requests to the HEAppE (version 6) service.

  • lexis.hooks.userorg.userorg_v3.UserOrgV3Hook – to handle requests to the UserOrg service.

  • lexis.hooks.smart_scheduler.smart_scheduler_v2.SmartSchedulerV2Hook – to handle requests to the Smart Scheduler service.