Snakemake Rules
We have chosen to use Snakemake as the EasyLink workflow manager. This module is responsible for generating the Snakemake rules to be run as well as writing them to the Snakefile.
Note we have adopted the Snakemake term “rule” to refer to a singular component in a Snakefile (i.e. in a Snakemake pipeline) that defines input files, output files, and the command to run to create those output files. These rules are generated dynamically as strings and appended to the Snakefile.
- class easylink.rule.Rule[source]
Bases: ABC
An abstract class used to generate Snakemake rules.
- write_to_snakefile(snakefile_path)[source]
Writes the rule to the Snakefile.
- Return type:
- Parameters:
snakefile_path – Path to the Snakefile to write the rule to.
- abstract build_rule()[source]
Builds the Snakemake rule to be written to the Snakefile.
This is an abstract method and must be implemented by concrete subclasses.
- Return type:
- _abc_impl = <_abc._abc_data object>
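The pattern described above (rules built as strings and appended to the Snakefile) can be sketched roughly as follows. This is a minimal illustration, not the EasyLink implementation: SketchRule and EchoRule are hypothetical names, and the rule body that EchoRule emits is invented for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


class SketchRule(ABC):
    """Hypothetical stand-in for easylink.rule.Rule (not the real class)."""

    def write_to_snakefile(self, snakefile_path: Path) -> None:
        # Open in append mode so rules written earlier are preserved.
        with open(snakefile_path, "a") as f:
            f.write(self.build_rule())

    @abstractmethod
    def build_rule(self) -> str:
        """Return the text of one Snakemake rule."""


@dataclass
class EchoRule(SketchRule):
    """Hypothetical concrete rule that writes a greeting to its output file."""

    name: str
    output: str

    def build_rule(self) -> str:
        # The doubled braces render as a literal {output} for Snakemake to fill in.
        return (
            f"\nrule {self.name}:\n"
            f"    output: '{self.output}'\n"
            f"    shell: 'echo hello > {{output}}'\n"
        )
```

Calling write_to_snakefile on several such objects in turn appends one rule after another to the same file. SketchRule itself cannot be instantiated because build_rule is abstract.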
- class easylink.rule.TargetRule(target_files, validation, requires_spark)[source]
Bases: Rule
A Rule that defines the final output of the pipeline.
Snakemake will determine the directed acyclic graph (DAG) based on this target.
- validation: str
Name of file created by InputValidationRule.
- _abc_impl = <_abc._abc_data object>
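As a rough illustration of a target rule, the sketch below renders a Snakemake rule that lists the final outputs and the validation file as inputs, which is what lets Snakemake work backwards to build the DAG. The helper name build_target_rule and the exact layout of the generated text are assumptions made for this example. Naming the rule `all` follows the common Snakemake convention and is not confirmed by this page.

```python
def build_target_rule(target_files: list[str], validation: str) -> str:
    """Hypothetical helper: render a Snakemake target rule as a string."""
    # Positional inputs (the final outputs) come before the keyword input.
    inputs = ",\n".join(f"        '{path}'" for path in target_files)
    return (
        "rule all:\n"
        "    input:\n"
        f"{inputs},\n"
        f"        validation='{validation}',\n"
    )
```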
- class easylink.rule.ImplementedRule(name, step_name, implementation_name, input_slots, validations, output, resources, envvars, diagnostics_dir, image_path, script_cmd, requires_spark, is_auto_parallel=False)[source]
Bases: Rule
A Rule that defines the execution of an Implementation.
- Parameters:
- validations: list[str]
Names of files created by InputValidationRule. These files are empty but are used by Snakemake to build the graph edges that make this rule depend on the validation rules.
- is_auto_parallel: bool = False
Whether or not this Implementation is to be automatically run in parallel.
- _abc_impl = <_abc._abc_data object>
- class easylink.rule.InputValidationRule(name, input_slot_name, input, output, validator)[source]
Bases: Rule
A Rule that validates input files.
Each file coming into the pipeline via an InputSlot must be validated against a specific validator function. This rule is responsible for running those validations as well as creating the (empty) validation output files that are used by Snakemake to build the graph edge from this rule to the next.
- Parameters:
- _abc_impl = <_abc._abc_data object>
- build_rule()[source]
Builds the Snakemake rule for this validation.
This rule runs the appropriate validator function on each input file and then creates an empty file. Because the validations themselves produce no output, Snakemake uses this empty file to build the graph edge from this rule to the next.
- Return type:
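The behaviour described for this rule (validate every input, then create an empty marker file) can be sketched as below. The function names run_validation and validate_not_empty are hypothetical, and the validator contract (raise an exception on invalid input) is an assumption for the example.

```python
from pathlib import Path
from typing import Callable


def run_validation(
    input_files: list[str], output: str, validator: Callable[[str], None]
) -> None:
    """Hypothetical sketch: validate each input, then create an empty marker."""
    for path in input_files:
        # Assumed contract: the validator raises if the file is invalid.
        validator(path)
    # Only reached when every file passed, so the marker signals success.
    marker = Path(output)
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()


def validate_not_empty(path: str) -> None:
    """Hypothetical validator: reject files that contain no bytes."""
    if Path(path).stat().st_size == 0:
        raise ValueError(f"{path} is empty")
```

Because the marker is only created after all validators return, a downstream rule that lists it as an input cannot start until validation has succeeded.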
- class easylink.rule.CheckpointRule(name, input_files, splitter_func_name, output_dir, checkpoint_filepath)[source]
Bases: Rule
A Rule that defines a checkpoint.
When running an Implementation in an auto-parallel way, we do not know until runtime how many parallel jobs there will be (e.g. we don't know beforehand how many chunks a large incoming dataset will be split into, since the incoming dataset isn't created until runtime). The Snakemake mechanism for handling this is a checkpoint rule with a directory as its output.
Notes
There is a known Snakemake bug which prevents the use of multiple checkpoints in a single Snakefile. We work around this by generating an empty checkpoint.txt file as part of this rule. If this file does not yet exist when trying to run the AggregationRule, it means that the checkpoint has not yet been executed for the particular wildcard value(s). In this case, we manually raise a Snakemake IncompleteCheckpointException, which Snakemake handles automatically by re-evaluating after the checkpoint has successfully passed.
TODO [MIC-5658]: Thoroughly test this workaround when implementing caching.
- Parameters:
- _abc_impl = <_abc._abc_data object>
- checkpoint_filepath: str
Path to the checkpoint file. This is only needed for the bugfix workaround.
- build_rule()[source]
Builds the Snakemake rule for this checkpoint.
Checkpoint rules are a special type of rule in Snakemake that allow for dynamic generation of output files. This rule is responsible for splitting the input files into chunks. Note that the output of this rule is a Snakemake directory object as opposed to a specific file like typical rules have.
- Return type:
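The splitting step and the empty checkpoint file can be sketched together as follows. This is an illustrative stand-in, not EasyLink's splitter: the function name split_into_chunks, the line-based splitting strategy, and the chunk_N.txt naming are all assumptions for the example.

```python
from pathlib import Path


def split_into_chunks(
    input_file: str,
    output_dir: str,
    checkpoint_filepath: str,
    lines_per_chunk: int = 2,
) -> list[Path]:
    """Hypothetical splitter: write line-based chunks, then an empty marker."""
    lines = Path(input_file).read_text().splitlines(keepends=True)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    chunks = []
    # The number of chunks depends on the data, so it is unknown until runtime.
    for start in range(0, len(lines), lines_per_chunk):
        chunk_path = out / f"chunk_{start // lines_per_chunk}.txt"
        chunk_path.write_text("".join(lines[start : start + lines_per_chunk]))
        chunks.append(chunk_path)

    # Empty marker recording that the split has completed.
    Path(checkpoint_filepath).touch()
    return chunks
```

Since the number of files written is only known after the input has been read, a rule wrapping such a function declares the whole directory as its output rather than a fixed list of files.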
- class easylink.rule.AggregationRule(name, input_files, aggregated_output_file, aggregator_func_name, checkpoint_filepath, checkpoint_rule_name)[source]
Bases: Rule
A Rule that aggregates the processed chunks of output data.
When running an Implementation in an auto-parallel way, we need to aggregate the output files from each parallel job into a single output file.
- Parameters:
- _abc_impl = <_abc._abc_data object>
- checkpoint_filepath: str
Path to the checkpoint file. This is only needed for the bugfix workaround.
- build_rule()[source]
Builds the Snakemake rule for this aggregator.
When running an AutoParallelStep, we need to aggregate the output files from each parallel job into a single output file. This rule relies on a dynamically generated aggregation function which returns all of the processed chunks (from running the AutoParallelStep's container in parallel) and uses them as inputs to the actual aggregation rule.
- Return type:
Notes
There is a known Snakemake bug which prevents the use of multiple checkpoints in a single Snakefile. We work around this by generating an empty checkpoint.txt file in the CheckpointRule. If this file does not yet exist when trying to aggregate, it means that the checkpoint has not yet been executed for the particular wildcard value(s). In this case, we manually raise a Snakemake IncompleteCheckpointException, which Snakemake handles automatically by re-evaluating after the checkpoint has successfully passed, i.e. we replicate Snakemake's behavior.
- _define_input_function()[source]
Builds the input function.
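The logic of such an input function, including the checkpoint-file workaround described in the notes, might look roughly like the sketch below. To stay runnable without Snakemake installed, it raises a local stand-in exception (IncompleteCheckpointSketch) where the real code would raise Snakemake's IncompleteCheckpointException. The function name gather_processed_chunks and the chunk_*.txt file pattern are assumptions for the example.

```python
from pathlib import Path


class IncompleteCheckpointSketch(Exception):
    """Local stand-in for Snakemake's IncompleteCheckpointException."""


def gather_processed_chunks(processed_dir: str, checkpoint_filepath: str) -> list[str]:
    """Hypothetical input function: list processed chunks once the split ran."""
    if not Path(checkpoint_filepath).exists():
        # The checkpoint has not run yet, so signal that evaluation must be
        # retried after it completes.
        raise IncompleteCheckpointSketch(checkpoint_filepath)
    # Sort for a deterministic aggregation order.
    return sorted(str(p) for p in Path(processed_dir).glob("chunk_*.txt"))
```

The existence check stands in for asking Snakemake whether the checkpoint has finished, which is the behaviour the workaround replicates.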