LEXIS Workflow Definition
Overview
LEXIS Workflow Definition (LWD) is a YAML-based format for defining computational workflows with job dependencies, data flow, and resource requirements. Workflows consist of multiple jobs that can run sequentially or in parallel across different computing clusters. This documentation covers how to create, structure, and deploy LEXIS workflows using the LWD.
Usage
Workflow creation proceeds in these steps:
Create: Use the Applications -> User Workflows menu to upload your workflow or create it in place
Translation: The file is translated into a new workflow in the form of an Airflow DAG
Execution: The workflow is executed on the target computing infrastructure with the chosen parameters
Basic Structure
A workflow file contains:
id: Unique workflow identifier
desc: Human-readable description
project_shortname: LEXIS Project identifier
jobs: Dictionary of job definitions
metadata: Optional workflow-level settings
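A minimal skeleton illustrating this top-level layout is sketched below; all names (my_workflow, example_job, example_template, example_cluster) are placeholders:
id: my_workflow                 # unique workflow identifier
desc: Example workflow skeleton
project_shortname: myproject    # LEXIS project shortname
metadata:                       # optional workflow-level settings
  start_date: "2024-01-01T00:00:00"
  catchup: false
jobs:
  example_job:
    requirements:
      command_template_name: example_template
      locations:
        - location_name: example_cluster
      walltime_limit: 600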
Job Definition
Each job specifies:
Requirements
requirements:
  command_template_name: preprocessor
  locations:
    - location_name: preprocess_cluster
  walltime_limit: 1800
  max_retries: 2
command_template_name: Command template to be executed
locations: Target computing clusters
walltime_limit: Maximum execution time in seconds
max_retries: Optional retry attempts
Other optional requirements are listed in the LEXIS Datatypes specification.
Data Inputs
data_inputs:
  - source: ddi://~/data.tar
    storage:
      location_resource: irods
      location_name: it4i_irods
    target: raw/
  - source: job://preprocessing/cleaned_dataset
    target: analysis_input/
Data sources can be:
ddi://: External data repository
job://: Output from another job
storage: Specifies the storage resource and storage location used when fetching the dataset. The corresponding names can be found in the dataset metadata, e.g. in the LEXIS portal. Both values are optional, since they can be determined when the workflow execution is created.
Data Outputs
data_outputs:
  - source: cleaned_data/
    metadata:
      $name: cleaned_dataset
  - source: final_report/
    target: ddi://~
    storage:
      location_resource: irods
      location_name: it4i_irods
    metadata:
      title: Analysis Report
      access: public
      datacite:
        titles:
          - title: test100GBdataset
        creators:
          - name: jan@vsb.cz
        publicationYear: 2025
        publisher:
          name: jan@vsb.cz
        types:
          resourceType: dataset
          resourceTypeGeneral: Dataset
        schemaVersion: http://datacite.org/schema/kernel-4
source: Local output directory
target: Optional destination (defaults to job storage)
storage: Specifies the storage resource and storage location used to store the output dataset. The available names can be found in your LEXIS project in the LEXIS portal. Both values are optional, since they can also be selected when the workflow execution is created.
metadata: Output annotations, including the internal name, access level, and optionally the storage resource and location names.
Public access requires the datacite metadata to be specified; project and user access do not (see the example after the note below).
Important: Properties beginning with $ (like $name) are internal references used only for job-to-job data transfer within the workflow. These properties are not translated to the final DAG and will not appear in the dataset metadata after workflow execution.
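As a minimal sketch of a non-public output (directory and title are illustrative), a project-level dataset can omit the datacite block entirely:
data_outputs:
  - source: intermediate/
    metadata:
      title: Intermediate results
      access: project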
Dependencies
depends_on:
- preprocessing
- data_validation
Explicit job dependencies (optional, as data flow creates implicit dependencies).
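For instance, a job that consumes no data from another job but must still run after it can declare the dependency explicitly; the job and cluster names below are illustrative:
jobs:
  cleanup:
    requirements:
      command_template_name: cleaner
      locations:
        - location_name: cpu_cluster
      walltime_limit: 600
    depends_on:
      - preprocessing
      - data_validation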
Simple Sequential Example
id: simple_pipeline
desc: Basic three-step processing pipeline
project_shortname: myproject
jobs:
  preprocess:
    requirements:
      command_template_name: cleaner
      locations:
        - location_name: cpu_cluster
      walltime_limit: 1800
    data_inputs:
      - source: ddi://~/raw_data.tar
        target: input/
    data_outputs:
      - source: clean/
        metadata:
          $name: cleaned_data
  analyze:
    requirements:
      command_template_name: analyzer
      locations:
        - location_name: gpu_cluster
      walltime_limit: 3600
    data_inputs:
      - source: job://preprocess/cleaned_data
        target: data/
    data_outputs:
      - source: results/
        metadata:
          $name: analysis_results
  report:
    requirements:
      command_template_name: reporter
      locations:
        - location_name: cpu_cluster
      walltime_limit: 1200
    data_inputs:
      - source: job://analyze/analysis_results
        target: input/
    data_outputs:
      - source: report.pdf
        target: ddi://~/final_report.pdf
Parallel Processing Example
id: parallel_workflow
desc: Parallel data processing with final merge
project_shortname: parallel_proj
jobs:
  split_data:
    requirements:
      command_template_name: splitter
      locations:
        - location_name: prep_cluster
      walltime_limit: 1800
    data_inputs:
      - source: ddi://~/big_dataset.tar
        target: source/
    data_outputs:
      - source: chunk1/
        metadata:
          $name: data_chunk_1
      - source: chunk2/
        metadata:
          $name: data_chunk_2
      - source: chunk3/
        metadata:
          $name: data_chunk_3
  process_chunk1:
    requirements:
      command_template_name: processor
      locations:
        - location_name: worker1
      walltime_limit: 2400
    data_inputs:
      - source: job://split_data/data_chunk_1
        target: input/
    data_outputs:
      - source: output/
        metadata:
          $name: result_1
  process_chunk2:
    requirements:
      command_template_name: processor
      locations:
        - location_name: worker2
      walltime_limit: 2400
    data_inputs:
      - source: job://split_data/data_chunk_2
        target: input/
    data_outputs:
      - source: output/
        metadata:
          $name: result_2
  process_chunk3:
    requirements:
      command_template_name: processor
      locations:
        - location_name: worker3
      walltime_limit: 2400
    data_inputs:
      - source: job://split_data/data_chunk_3
        target: input/
    data_outputs:
      - source: output/
        metadata:
          $name: result_3
  merge_results:
    requirements:
      command_template_name: merger
      locations:
        - location_name: merge_cluster
      walltime_limit: 1800
    data_inputs:
      - source: job://process_chunk1/result_1
        target: input1/
      - source: job://process_chunk2/result_2
        target: input2/
      - source: job://process_chunk3/result_3
        target: input3/
    data_outputs:
      - source: merged/
        target: ddi://~/final_results.tar
Data Flow Patterns
External Input:
- source: ddi://~/input_file.data
  target: local_dir/
Job-to-Job Transfer:
- source: job://previous_job/output_name
  target: input_dir/
Nested Output Access:
- source: job://preprocessing/cleaned_dataset/inner_directory
  target: analysis_input/
External Output:
- source: results/
  target: ddi://~/final_output.tar
Metadata Options
Workflow Metadata:
metadata:
  start_date: "2024-01-01T00:00:00"
  catchup: false
Output Metadata:
metadata:
  $name: dataset_identifier    # Internal reference only
  title: Human Readable Title  # Appears in final metadata
  access: public               # Appears in final metadata
Access levels: public, project, private
Note: Only properties without the $ prefix become part of the actual dataset metadata. Properties like $name are used solely for internal workflow references and data flow definitions.
Integration with Airflow
Workflows integrate with Apache Airflow using the translator:
from logging import DEBUG

from lexis.operators.aai import LexisAAIOperator
from yaml2dag.lexis_workflow_definition import LexisWorkflowTranslator

t = LexisWorkflowTranslator(
    yaml_path="workflow.yaml",
    log_level=DEBUG,
)
dag = t.build(override_tags=["resourcetest"])
Best Practices
Use descriptive job and output names
Use $name metadata for internal data flow references
Test workflows with small datasets first
Use Airflow’s branching capabilities by structuring your workflow appropriately and using task dependencies.
This documentation provides a comprehensive guide to creating LEXIS workflows. For specific implementation details or advanced use cases, consult the LEXIS Platform documentation or contact support@lexis.tech.