.. _`lexis-workflow-definition`:

=========================
LEXIS Workflow Definition
=========================

Overview
--------

LEXIS Workflow Definition (LWD) is a YAML-based format for defining computational workflows with job dependencies, data flow, and resource requirements. Workflows consist of multiple jobs that can run sequentially or in parallel across different computing clusters.

This documentation covers how to create, structure, and deploy LEXIS workflows using LWD.

Usage
-----

Creating a workflow involves these steps:

1. **Create**: Use the Applications -> User Workflows menu to upload your workflow or create it in place
2. **Translation**: The file is translated into a new workflow in the form of an Airflow DAG
3. **Execution**: The workflow is executed on the target computing infrastructure with its parameters

Basic Structure
---------------

A workflow file contains:

- **id**: Unique workflow identifier
- **desc**: Human-readable description
- **project_shortname**: LEXIS Project identifier
- **jobs**: Dictionary of job definitions
- **metadata**: Optional workflow-level settings

Job Definition
--------------

Each job specifies the following:

Requirements
~~~~~~~~~~~~

.. code-block:: yaml

   requirements:
     command_template_name: preprocessor
     locations:
       - location_name: preprocess_cluster
     walltime_limit: 1800
     max_retries: 2

- **command_template_name**: Command template to be executed
- **locations**: Target computing clusters
- **walltime_limit**: Maximum execution time in seconds
- **max_retries**: Optional retry attempts

Other optional requirements are listed in the LEXIS Datatypes specification.

Data Inputs
~~~~~~~~~~~

.. code-block:: yaml

   data_inputs:
     - source: ddi://~/data.tar
       storage:
         location_resource: irods
         location_name: it4i_irods
       target: raw/
     - source: job://preprocessing/cleaned_dataset
       target: analysis_input/

Data sources can be:

- **ddi://**: External data repository
- **job://**: Output from another job

The optional **storage** block specifies the storage resource and storage location used when fetching the dataset. The names can be found in the dataset metadata, e.g. in the LEXIS portal. Both values are optional, since they can be determined when the workflow execution is created.

Data Outputs
~~~~~~~~~~~~

.. code-block:: yaml

   data_outputs:
     - source: cleaned_data/
       metadata:
         $name: cleaned_dataset
     - source: final_report/
       target: ddi://~
       storage:
         location_resource: irods
         location_name: it4i_irods
       metadata:
         title: Analysis Report
         access: public
         datacite:
           titles:
             - title: test100GBdataset
           creators:
             - name: jan@vsb.cz
           publicationYear: 2025
           publisher:
             name: jan@vsb.cz
           types:
             resourceType: dataset
             resourceTypeGeneral: Dataset
           schemaVersion: http://datacite.org/schema/kernel-4

- **source**: Local output directory
- **target**: Optional destination (defaults to job storage)
- **storage**: Specifies the storage resource and storage location used to store the output dataset. The names can be found in your LEXIS project in the LEXIS portal. Both values are optional, since they can be selected when the workflow execution is created.
- **metadata**: Output annotations, including the name, the access level, and optionally the names of the storage resource and location

  - For *public* access, *datacite* must be specified; *project* and *user* access do not require it (see the sketch below).

**Important**: Properties beginning with **$** (like **$name**) are internal references used only for job-to-job data transfer within the workflow. These properties are not translated to the final DAG and will not appear in the dataset metadata after workflow execution.

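For comparison with the *public* output above, the following minimal sketch shows an output kept at *project* access, where no *datacite* block has to be specified (the directory name and title are illustrative, not taken from the platform):

.. code-block:: yaml

   data_outputs:
     - source: summary/               # local output directory (illustrative name)
       metadata:
         title: Intermediate Summary  # appears in the final dataset metadata
         access: project              # project-level access: datacite not required
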
Dependencies
~~~~~~~~~~~~

.. code-block:: yaml

   depends_on:
     - preprocessing
     - data_validation

Explicit job dependencies (optional, as data flow creates implicit dependencies).

Simple Sequential Example
-------------------------

.. code-block:: yaml

   id: simple_pipeline
   desc: Basic three-step processing pipeline
   project_shortname: myproject

   jobs:
     preprocess:
       requirements:
         command_template_name: cleaner
         locations:
           - location_name: cpu_cluster
         walltime_limit: 1800
       data_inputs:
         - source: ddi://~/raw_data.tar
           target: input/
       data_outputs:
         - source: clean/
           metadata:
             $name: cleaned_data

     analyze:
       requirements:
         command_template_name: analyzer
         locations:
           - location_name: gpu_cluster
         walltime_limit: 3600
       data_inputs:
         - source: job://preprocess/cleaned_data
           target: data/
       data_outputs:
         - source: results/
           metadata:
             $name: analysis_results

     report:
       requirements:
         command_template_name: reporter
         locations:
           - location_name: cpu_cluster
         walltime_limit: 1200
       data_inputs:
         - source: job://analyze/analysis_results
           target: input/
       data_outputs:
         - source: report.pdf
           target: ddi://~/final_report.pdf

Parallel Processing Example
---------------------------

.. code-block:: yaml

   id: parallel_workflow
   desc: Parallel data processing with final merge
   project_shortname: parallel_proj

   jobs:
     split_data:
       requirements:
         command_template_name: splitter
         locations:
           - location_name: prep_cluster
         walltime_limit: 1800
       data_inputs:
         - source: ddi://~/big_dataset.tar
           target: source/
       data_outputs:
         - source: chunk1/
           metadata:
             $name: data_chunk_1
         - source: chunk2/
           metadata:
             $name: data_chunk_2
         - source: chunk3/
           metadata:
             $name: data_chunk_3

     process_chunk1:
       requirements:
         command_template_name: processor
         locations:
           - location_name: worker1
         walltime_limit: 2400
       data_inputs:
         - source: job://split_data/data_chunk_1
           target: input/
       data_outputs:
         - source: output/
           metadata:
             $name: result_1

     process_chunk2:
       requirements:
         command_template_name: processor
         locations:
           - location_name: worker2
         walltime_limit: 2400
       data_inputs:
         - source: job://split_data/data_chunk_2
           target: input/
       data_outputs:
         - source: output/
           metadata:
             $name: result_2

     process_chunk3:
       requirements:
         command_template_name: processor
         locations:
           - location_name: worker3
         walltime_limit: 2400
       data_inputs:
         - source: job://split_data/data_chunk_3
           target: input/
       data_outputs:
         - source: output/
           metadata:
             $name: result_3

     merge_results:
       requirements:
         command_template_name: merger
         locations:
           - location_name: merge_cluster
         walltime_limit: 1800
       data_inputs:
         - source: job://process_chunk1/result_1
           target: input1/
         - source: job://process_chunk2/result_2
           target: input2/
         - source: job://process_chunk3/result_3
           target: input3/
       data_outputs:
         - source: merged/
           target: ddi://~/final_results.tar

Data Flow Patterns
------------------

**External Input**::

    - source: ddi://~/input_file.data
      target: local_dir/

**Job-to-Job Transfer**::

    - source: job://previous_job/output_name
      target: input_dir/

**Nested Output Access**::

    - source: job://preprocessing/cleaned_dataset/inner_directory
      target: analysis_input/

**External Output**::

    - source: results/
      target: ddi://~/final_output.tar

Metadata Options
----------------

**Workflow Metadata**::

    metadata:
      start_date: "2024-01-01T00:00:00"
      catchup: false

**Output Metadata**::

    metadata:
      $name: dataset_identifier    # Internal reference only
      title: Human Readable Title  # Appears in final metadata
      access: public               # Appears in final metadata

Access levels: *public*, *project*, *private*

**Note**: Only properties without the **$** prefix become part of the actual dataset metadata. Properties like **$name** are used solely for internal workflow references and data flow definitions.

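As a quick orientation, the minimal skeleton below (with illustrative workflow and job names) shows where the workflow-level ``metadata`` block sits relative to the other top-level keys described in Basic Structure; the job content follows the same patterns as the examples above:

.. code-block:: yaml

   id: example_workflow             # unique workflow identifier (illustrative)
   desc: Minimal skeleton workflow  # human-readable description
   project_shortname: myproject     # LEXIS project identifier
   metadata:                        # optional workflow-level settings
     start_date: "2024-01-01T00:00:00"
     catchup: false
   jobs:
     single_job:                    # illustrative job name
       requirements:
         command_template_name: cleaner
         locations:
           - location_name: cpu_cluster
         walltime_limit: 1800
       data_inputs:
         - source: ddi://~/raw_data.tar
           target: input/
       data_outputs:
         - source: results/
           target: ddi://~/results.tar
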
Integration with Airflow
------------------------

Workflows integrate with Apache Airflow using the translator::

    import logging

    from lexis.operators.aai import LexisAAIOperator
    from yaml2dag.lexis_workflow_definition import LexisWorkflowTranslator

    t = LexisWorkflowTranslator(
        yaml_path="workflow.yaml",
        log_level=logging.DEBUG  # standard logging level constant
    )
    dag = t.build(override_tags=["resourcetest"])

Best Practices
--------------

1. Use descriptive job and output names
2. Use **$name** metadata for internal data flow references
3. Test workflows with small datasets first

Conditional Execution
~~~~~~~~~~~~~~~~~~~~~

Use Airflow's branching capabilities by structuring your workflow appropriately and using task dependencies.

This documentation provides a comprehensive guide to creating LEXIS workflows. For specific implementation details or advanced use cases, consult the LEXIS Platform documentation or contact support@lexis.tech.