.. _`lexis-workflow-definition`:

=========================
LEXIS Workflow Definition
=========================

Overview
--------

LEXIS Workflow Definition (LWD) is a YAML-based format for defining computational workflows with job dependencies, data flow, and resource requirements. Workflows consist of multiple jobs that can run sequentially or in parallel across different computing clusters.

This documentation covers how to create, structure, and deploy LEXIS workflows using LWD.

Usage
-----

Creating a workflow involves these steps:

1. **Create**: Use the Applications -> User Workflows menu to upload your workflow or create it in place
2. **Translation**: The file is translated into a new workflow in the form of an Airflow DAG
3. **Execution**: The workflow is executed on the target computing infrastructure with its parameters

Basic Structure
---------------

A workflow file contains:

- **id**: Unique workflow identifier
- **desc**: Human-readable description
- **project_shortname**: LEXIS Project identifier
- **jobs**: Dictionary of job definitions
- **metadata**: Optional workflow-level settings

Job Definition
--------------

Each job specifies the following:

Requirements
~~~~~~~~~~~~

.. code-block:: yaml

   requirements:
     command_template_name: preprocessor
     locations:
       - location_name: preprocess_cluster
     walltime_limit: 1800
     max_retries: 2

- **command_template_name**: Command template to be executed
- **locations**: Target computing clusters
- **walltime_limit**: Maximum execution time in seconds
- **max_retries**: Optional retry attempts

Other optional requirements are listed in the LEXIS Datatypes specification.

Data Inputs
~~~~~~~~~~~

.. code-block:: yaml

   data_inputs:
     - source: ddi://~/data.tar
       storage:
         location_resource: irods
         location_name: it4i_irods
       target: raw/
     - source: job://preprocessing/cleaned_dataset
       target: analysis_input/

Data sources can be:

- **ddi://**: External data repository
- **job://**: Output from another job

The optional **storage** block specifies the storage resource and storage location used when fetching the dataset. The names can be found in the dataset metadata, e.g. in the LEXIS portal. Both values are optional, since they can be determined when the workflow execution is created.

Data Outputs
~~~~~~~~~~~~

.. code-block:: yaml

   data_outputs:
     - source: cleaned_data/
       metadata:
         $name: cleaned_dataset
     - source: final_report/
       target: ddi://~
       storage:
         location_resource: irods
         location_name: it4i_irods
       metadata:
         title: Analysis Report
         access: public
         datacite:
           titles:
             - title: test100GBdataset
           creators:
             - name: jan@vsb.cz
           publicationYear: 2025
           publisher:
             name: jan@vsb.cz
           types:
             resourceType: dataset
             resourceTypeGeneral: Dataset
           schemaVersion: http://datacite.org/schema/kernel-4

- **source**: Local output directory
- **target**: Optional destination (defaults to job storage)
- **storage**: Specifies the storage resource and storage location used to store the output dataset. The names can be found in your LEXIS project in the LEXIS portal. Both values are optional, since they can be selected when the workflow execution is created.
- **metadata**: Output annotations, including the name, the access level, and optionally the names of the storage resource and location

  - For *public* access, *datacite* must be specified; *project* and *user* access do not require it (see the sketch below).

**Important**: Properties beginning with **$** (like **$name**) are internal references used only for job-to-job data transfer within the workflow. These properties are not translated to the final DAG and will not appear in the dataset metadata after workflow execution.

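For comparison with the *public* output above, the following minimal sketch shows an output kept at *project* access, where no *datacite* block has to be specified (the directory name and title are illustrative, not taken from the platform):

.. code-block:: yaml

   data_outputs:
     - source: summary/               # local output directory (illustrative name)
       metadata:
         title: Intermediate Summary  # appears in the final dataset metadata
         access: project              # project-level access: datacite not required
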
Dependencies
~~~~~~~~~~~~

.. code-block:: yaml

   depends_on:
     - preprocessing
     - data_validation

Explicit job dependencies (optional, as data flow creates implicit dependencies).

Simple Sequential Example
-------------------------

.. code-block:: yaml

   id: simple_pipeline
   desc: Basic three-step processing pipeline
   project_shortname: myproject

   jobs:
     preprocess:
       requirements:
         command_template_name: cleaner
         locations:
           - location_name: cpu_cluster
         walltime_limit: 1800
       data_inputs:
         - source: ddi://~/raw_data.tar
           target: input/
       data_outputs:
         - source: clean/
           metadata:
             $name: cleaned_data

     analyze:
       requirements:
         command_template_name: analyzer
         locations:
           - location_name: gpu_cluster
         walltime_limit: 3600
       data_inputs:
         - source: job://preprocess/cleaned_data
           target: data/
       data_outputs:
         - source: results/
           metadata:
             $name: analysis_results

     report:
       requirements:
         command_template_name: reporter
         locations:
           - location_name: cpu_cluster
         walltime_limit: 1200
       data_inputs:
         - source: job://analyze/analysis_results
           target: input/
       data_outputs:
         - source: report.pdf
           target: ddi://~/final_report.pdf

Parallel Processing Example
---------------------------

.. code-block:: yaml

   id: parallel_workflow
   desc: Parallel data processing with final merge
   project_shortname: parallel_proj

   jobs:
     split_data:
       requirements:
         command_template_name: splitter
         locations:
           - location_name: prep_cluster
         walltime_limit: 1800
       data_inputs:
         - source: ddi://~/big_dataset.tar
           target: source/
       data_outputs:
         - source: chunk1/
           metadata:
             $name: data_chunk_1
         - source: chunk2/
           metadata:
             $name: data_chunk_2
         - source: chunk3/
           metadata:
             $name: data_chunk_3

     process_chunk1:
       requirements:
         command_template_name: processor
         locations:
           - location_name: worker1
         walltime_limit: 2400
       data_inputs:
         - source: job://split_data/data_chunk_1
           target: input/
       data_outputs:
         - source: output/
           metadata:
             $name: result_1

     process_chunk2:
       requirements:
         command_template_name: processor
         locations:
           - location_name: worker2
         walltime_limit: 2400
       data_inputs:
         - source: job://split_data/data_chunk_2
           target: input/
       data_outputs:
         - source: output/
           metadata:
             $name: result_2

     process_chunk3:
       requirements:
         command_template_name: processor
         locations:
           - location_name: worker3
         walltime_limit: 2400
       data_inputs:
         - source: job://split_data/data_chunk_3
           target: input/
       data_outputs:
         - source: output/
           metadata:
             $name: result_3

     merge_results:
       requirements:
         command_template_name: merger
         locations:
           - location_name: merge_cluster
         walltime_limit: 1800
       data_inputs:
         - source: job://process_chunk1/result_1
           target: input1/
         - source: job://process_chunk2/result_2
           target: input2/
         - source: job://process_chunk3/result_3
           target: input3/
       data_outputs:
         - source: merged/
           target: ddi://~/final_results.tar

Data Flow Patterns
------------------

**External Input**::

    - source: ddi://~/input_file.data
      target: local_dir/

**Job-to-Job Transfer**::

    - source: job://previous_job/output_name
      target: input_dir/

**Nested Output Access**::

    - source: job://preprocessing/cleaned_dataset/inner_directory
      target: analysis_input/

**External Output**::

    - source: results/
      target: ddi://~/final_output.tar

Metadata Options
----------------

**Workflow Metadata**::

    metadata:
      start_date: "2024-01-01T00:00:00"
      catchup: false

**Output Metadata**::

    metadata:
      $name: dataset_identifier    # Internal reference only
      title: Human Readable Title  # Appears in final metadata
      access: public               # Appears in final metadata

Access levels: *public*, *project*, *private*

**Note**: Only properties without the **$** prefix become part of the actual dataset metadata. Properties like **$name** are used solely for internal workflow references and data flow definitions.

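As a quick orientation, the minimal skeleton below (with illustrative workflow and job names) shows where the workflow-level ``metadata`` block sits relative to the other top-level keys described in Basic Structure; the job content follows the same patterns as the examples above:

.. code-block:: yaml

   id: example_workflow             # unique workflow identifier (illustrative)
   desc: Minimal skeleton workflow  # human-readable description
   project_shortname: myproject     # LEXIS project identifier
   metadata:                        # optional workflow-level settings
     start_date: "2024-01-01T00:00:00"
     catchup: false
   jobs:
     single_job:                    # illustrative job name
       requirements:
         command_template_name: cleaner
         locations:
           - location_name: cpu_cluster
         walltime_limit: 1800
       data_inputs:
         - source: ddi://~/raw_data.tar
           target: input/
       data_outputs:
         - source: results/
           target: ddi://~/results.tar
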
Integration with Airflow
------------------------

Workflows integrate with Apache Airflow using the translator::

    import logging

    from lexis.operators.aai import LexisAAIOperator
    from yaml2dag.lexis_workflow_definition import LexisWorkflowTranslator

    t = LexisWorkflowTranslator(
        yaml_path="workflow.yaml",
        log_level=logging.DEBUG  # standard logging level constant
    )
    dag = t.build(override_tags=["resourcetest"])

Best Practices
--------------

1. Use descriptive job and output names
2. Use **$name** metadata for internal data flow references
3. Test workflows with small datasets first

Conditional Execution
~~~~~~~~~~~~~~~~~~~~~

Use Airflow's branching capabilities by structuring your workflow appropriately and using task dependencies.

This documentation provides a comprehensive guide to creating LEXIS workflows. For specific implementation details or advanced use cases, consult the LEXIS Platform documentation or contact support@lexis.tech.