Data Splitting Utilities

This module contains utility functions for splitting datasets into smaller datasets. One primary use case is running sections of the pipeline in an auto-parallel manner.

Note that it is critical that all data splitting utility functions are defined in this module; easylink will not be able to find them otherwise.

easylink.utilities.splitter_utils.split_data_by_size(input_files, output_dir, desired_chunk_size_mb)[source]

Splits the data (from a single input slot) into chunks of desired size.

This function takes all datasets from a single input slot, concatenates them, and then splits the resulting dataset into chunks of the desired size. The data is split as evenly as possible, but the final chunk may be smaller than the desired size if the input data does not divide evenly; the function makes no effort to redistribute the leftover data.

Return type:

None

Parameters:

input_files (list[str]) – A list of input file paths to be concatenated and split.
output_dir (str) – The directory where the resulting chunks will be saved.
desired_chunk_size_mb (int | float) – The desired size of each chunk, in megabytes.
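
The chunking behavior described above (even chunks, with a possibly smaller final chunk) can be illustrated with a little arithmetic. The following is a minimal sketch and not easylink's implementation: `chunk_bounds` and its parameters are hypothetical names, and it assumes the number of chunks is derived by dividing the total data size by the desired chunk size and rounding up.

```python
import math


def chunk_bounds(n_rows, total_size_mb, desired_chunk_size_mb):
    """Return (start, end) row ranges for chunks of roughly the desired size.

    Hypothetical helper for illustration only; it is not part of easylink.
    """
    if desired_chunk_size_mb <= 0:
        raise ValueError("desired_chunk_size_mb must be positive")
    if n_rows == 0:
        return []
    # Assumption: the chunk count is the total size over the desired size, rounded up.
    num_chunks = max(1, math.ceil(total_size_mb / desired_chunk_size_mb))
    # Every chunk gets the same row count except possibly the last one.
    rows_per_chunk = math.ceil(n_rows / num_chunks)
    return [
        (start, min(start + rows_per_chunk, n_rows))
        for start in range(0, n_rows, rows_per_chunk)
    ]
```

For 10 rows totalling 1.0 MB and a desired chunk size of 0.3 MB, this sketch yields four ranges, the last of which covers a single row; the leftover row is not redistributed across the earlier chunks.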

easylink.utilities.splitter_utils.split_data_in_two(input_files, output_dir, *args, **kwargs)[source]

Splits the data (from a single input slot) into two chunks of roughly equal size.

This function takes all datasets from a single input slot, concatenates them, and then splits the resulting dataset into two chunks of similar size.

Return type:

None

Parameters:

input_files (list[str]) – A list of input file paths to be concatenated and split.
output_dir (str) – The directory where the resulting chunks will be saved.
desired_chunk_size_mb – The desired size of each chunk, in megabytes.
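
Splitting into two chunks of similar size amounts to cutting the concatenated data at its midpoint. The sketch below shows that idea on a plain list of rows; `split_rows_in_two` is a hypothetical name, and placing the extra row of an odd-length input in the first half is an assumption made here, not documented easylink behavior.

```python
import math


def split_rows_in_two(rows):
    """Split a list of rows into two halves of similar size.

    Hypothetical helper for illustration only; it is not part of easylink.
    """
    # Assumption: with an odd number of rows, the first half gets the extra row.
    midpoint = math.ceil(len(rows) / 2)
    return rows[:midpoint], rows[midpoint:]
```

With an even number of rows the two halves are the same length; with an odd number they differ by exactly one row.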