LEXIS Workflow Definition

Overview

LEXIS Workflow Definition (LWD) is a YAML-based format for defining computational workflows with job dependencies, data flow, and resource requirements. Workflows consist of multiple jobs that can run sequentially or in parallel across different computing clusters. This documentation covers how to create, structure, and deploy LEXIS workflows using the LWD format.

Usage

Creating a workflow involves these steps:

  1. Create: Use the Applications -> User Workflows menu to upload your workflow file or create it in place

  2. Translation: The uploaded file is translated into a new workflow in the form of an Airflow DAG

  3. Execution: The workflow is executed on the target computing infrastructure with the supplied parameters

Basic Structure

A workflow file contains the following top-level keys (see the skeleton after this list):

  • id: Unique workflow identifier

  • desc: Human-readable description

  • project_shortname: LEXIS Project identifier

  • jobs: Dictionary of job definitions

  • metadata: Optional workflow-level settings
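A minimal skeleton using these keys might look as follows (the job name, command template, and location below are illustrative placeholders):

id: my_workflow
desc: Short human-readable description
project_shortname: myproject
metadata:
  start_date: "2024-01-01T00:00:00"
jobs:
  my_job:
    requirements:
      command_template_name: my_template
      locations:
        - location_name: my_cluster
      walltime_limit: 3600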

Job Definition

Each job specifies:

Requirements

requirements:
  command_template_name: preprocessor
  locations:
    - location_name: preprocess_cluster
  walltime_limit: 1800
  max_retries: 2

  • command_template_name: Command template to be executed

  • locations: Target computing clusters

  • walltime_limit: Maximum execution time in seconds

  • max_retries: Optional retry attempts

Other optional requirements are listed in the LEXIS Datatypes specification.

Data Inputs

data_inputs:
  - source: ddi://~/data.tar
    storage:
      location_resource: irods
      location_name: it4i_irods
    target: raw/
  - source: job://preprocessing/cleaned_dataset
    target: analysis_input/

Data sources can be:

  • ddi://: External data repository

  • job://: Output from another job

  • storage: Specifies the storage resource and storage location used when fetching the dataset. The resource and location names can be found in the dataset metadata, e.g. in the LEXIS Portal. These values are optional, since they can also be determined when the workflow execution is created.

Data Outputs

data_outputs:
  - source: cleaned_data/
    metadata:
      $name: cleaned_dataset
  - source: final_report/
    target: ddi://~
    storage:
      location_resource: irods
      location_name: it4i_irods
    metadata:
      title: Analysis Report
      access: public
      datacite:
        titles:
          - title: test100GBdataset
        creators:
          - name: jan@vsb.cz
        publicationYear: 2025
        publisher:
          name: jan@vsb.cz
        types:
          resourceType: dataset
          resourceTypeGeneral: Dataset
        schemaVersion: http://datacite.org/schema/kernel-4

  • source: Local output directory

  • target: Optional destination (defaults to job storage)

  • storage: Specifies the storage resource and storage location used to store the output dataset. The names can be found in your LEXIS project in the LEXIS Portal. These values are optional, since they can also be selected when the workflow execution is created.

  • metadata: Output annotations, including the dataset name, access level, and optionally the storage resource and location
    • Public access requires datacite metadata to be specified; project and user access do not.

Important: Properties beginning with $ (like $name) are internal references used only for job-to-job data transfer within the workflow. These properties are not translated to the final DAG and will not appear in the dataset metadata after workflow execution.
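For example, a producing job can tag one of its outputs with $name, and a consuming job can then reference it through the job:// scheme (job names and directories below are illustrative; requirements are omitted for brevity):

jobs:
  preprocessing:
    data_outputs:
      - source: clean/
        metadata:
          $name: cleaned_dataset
  analysis:
    data_inputs:
      - source: job://preprocessing/cleaned_dataset
        target: analysis_input/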

Dependencies

depends_on:
  - preprocessing
  - data_validation

Explicit job dependencies (optional, as data flow creates implicit dependencies).
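Inside a job definition, depends_on sits alongside the other job keys. A minimal sketch, with illustrative names, of a job that waits for two other jobs without consuming their outputs:

jobs:
  final_cleanup:
    requirements:
      command_template_name: cleanup
      locations:
        - location_name: cpu_cluster
      walltime_limit: 600
    depends_on:
      - preprocessing
      - data_validation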

Simple Sequential Example

id: simple_pipeline
desc: Basic three-step processing pipeline
project_shortname: myproject
jobs:
  preprocess:
    requirements:
      command_template_name: cleaner
      locations:
        - location_name: cpu_cluster
      walltime_limit: 1800
    data_inputs:
      - source: ddi://~/raw_data.tar
        target: input/
    data_outputs:
      - source: clean/
        metadata:
          $name: cleaned_data

  analyze:
    requirements:
      command_template_name: analyzer
      locations:
        - location_name: gpu_cluster
      walltime_limit: 3600
    data_inputs:
      - source: job://preprocess/cleaned_data
        target: data/
    data_outputs:
      - source: results/
        metadata:
          $name: analysis_results

  report:
    requirements:
      command_template_name: reporter
      locations:
        - location_name: cpu_cluster
      walltime_limit: 1200
    data_inputs:
      - source: job://analyze/analysis_results
        target: input/
    data_outputs:
      - source: report.pdf
        target: ddi://~/final_report.pdf

Parallel Processing Example

id: parallel_workflow
desc: Parallel data processing with final merge
project_shortname: parallel_proj
jobs:
  split_data:
    requirements:
      command_template_name: splitter
      locations:
        - location_name: prep_cluster
      walltime_limit: 1800
    data_inputs:
      - source: ddi://~/big_dataset.tar
        target: source/
    data_outputs:
      - source: chunk1/
        metadata:
          $name: data_chunk_1
      - source: chunk2/
        metadata:
          $name: data_chunk_2
      - source: chunk3/
        metadata:
          $name: data_chunk_3

  process_chunk1:
    requirements:
      command_template_name: processor
      locations:
        - location_name: worker1
      walltime_limit: 2400
    data_inputs:
      - source: job://split_data/data_chunk_1
        target: input/
    data_outputs:
      - source: output/
        metadata:
          $name: result_1

  process_chunk2:
    requirements:
      command_template_name: processor
      locations:
        - location_name: worker2
      walltime_limit: 2400
    data_inputs:
      - source: job://split_data/data_chunk_2
        target: input/
    data_outputs:
      - source: output/
        metadata:
          $name: result_2

  process_chunk3:
    requirements:
      command_template_name: processor
      locations:
        - location_name: worker3
      walltime_limit: 2400
    data_inputs:
      - source: job://split_data/data_chunk_3
        target: input/
    data_outputs:
      - source: output/
        metadata:
          $name: result_3

  merge_results:
    requirements:
      command_template_name: merger
      locations:
        - location_name: merge_cluster
      walltime_limit: 1800
    data_inputs:
      - source: job://process_chunk1/result_1
        target: input1/
      - source: job://process_chunk2/result_2
        target: input2/
      - source: job://process_chunk3/result_3
        target: input3/
    data_outputs:
      - source: merged/
        target: ddi://~/final_results.tar

Data Flow Patterns

External Input:

- source: ddi://~/input_file.data
  target: local_dir/

Job-to-Job Transfer:

- source: job://previous_job/output_name
  target: input_dir/

Nested Output Access:

- source: job://preprocessing/cleaned_dataset/inner_directory
  target: analysis_input/

External Output:

- source: results/
  target: ddi://~/final_output.tar

Metadata Options

Workflow Metadata:

metadata:
  start_date: "2024-01-01T00:00:00"
  catchup: false

Output Metadata:

metadata:
  $name: dataset_identifier          # Internal reference only
  title: Human Readable Title       # Appears in final metadata
  access: public                     # Appears in final metadata

Access levels: public, project, private
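As noted in the Data Outputs section, only public datasets require a datacite block; a project-scoped output can carry just a title and an access level (values below are illustrative):

data_outputs:
  - source: results/
    metadata:
      title: Intermediate Results
      access: project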

Note: Only properties without the $ prefix become part of the actual dataset metadata. Properties like $name are used solely for internal workflow references and data flow definitions.

Integration with Airflow

Workflows integrate with Apache Airflow using the translator:

from logging import DEBUG

from lexis.operators.aai import LexisAAIOperator
from yaml2dag.lexis_workflow_definition import LexisWorkflowTranslator

# Translate the YAML workflow definition into an Airflow DAG
t = LexisWorkflowTranslator(
    yaml_path="workflow.yaml",
    log_level=DEBUG
)
dag = t.build(override_tags=["resourcetest"])

Best Practices

  1. Use descriptive job and output names

  2. Use $name metadata for internal data flow references

  3. Test workflows with small datasets first

  4. Use Airflow’s branching capabilities by structuring your workflow appropriately and using task dependencies

This documentation provides a comprehensive guide to creating LEXIS workflows. For specific implementation details or advanced use cases, consult the LEXIS Platform documentation or contact support@lexis.tech.