Data Validation Utilities

This module contains utility functions for validating datasets, such as the functions that check processed data as it passes from one pipeline step to the next.

easylink.utilities.validation_utils._read_file(filepath)[source]

Reads a file.

Return type: DataFrame
Parameters: filepath (str) – The path to the file to read.
Returns: The loaded DataFrame.
Raises: NotImplementedError – If the file type is not supported.
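
The behavior described above can be sketched roughly as follows. This is not the module's actual implementation: the helper name read_tabular_file and the set of supported suffixes (.parquet and .csv) are assumptions made for illustration, since the supported file types are not listed here.

```python
from pathlib import Path

import pandas as pd


def read_tabular_file(filepath: str) -> pd.DataFrame:
    """Hypothetical stand-in for _read_file; the supported suffixes are assumed."""
    suffix = Path(filepath).suffix.lower()
    if suffix == ".parquet":
        return pd.read_parquet(filepath)
    if suffix == ".csv":
        return pd.read_csv(filepath)
    # Any other suffix is treated as unsupported, mirroring the documented error.
    raise NotImplementedError(f"Unsupported file type '{suffix}' for {filepath}")
```

Dispatching on the file suffix keeps the reader small, and any format outside the assumed list fails loudly instead of being parsed incorrectly.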

easylink.utilities.validation_utils._validate_required_columns(filepath, required_columns)[source]

Validates that the file at filepath contains all columns in required_columns.

Return type:

None

Parameters:

filepath (str) – The path to the file to validate.
required_columns (set[str]) – The set of required column names.

Raises:

NotImplementedError – If the file type is not supported.
LookupError – If any required columns are missing.
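
As a minimal sketch of the required-columns check, assuming the file has already been loaded into a DataFrame: the helper name check_required_columns and the wording of the error message are hypothetical, and the real function takes a file path rather than a DataFrame.

```python
import pandas as pd


def check_required_columns(
    df: pd.DataFrame, required_columns: set[str], filepath: str
) -> None:
    """Hypothetical helper: raise LookupError if any required column is absent."""
    missing = required_columns - set(df.columns)
    if missing:
        raise LookupError(
            f"{filepath} is missing required column(s): {sorted(missing)}"
        )


# Example: a frame lacking "counter" fails the check.
frame = pd.DataFrame({"foo": [1], "bar": [2]})
try:
    check_required_columns(frame, {"foo", "bar", "counter"}, "input.csv")
except LookupError as error:
    print(error)
```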

easylink.utilities.validation_utils._validate_unique_column(df, column_name, filepath)[source]

Validates that a column in a DataFrame has unique values.

Return type:

None

Parameters:

df (pandas.DataFrame) – The DataFrame to validate.
column_name (str) – The name of the column to check.
filepath (str) – The path to the file being validated.

Raises:

ValueError – If the column contains duplicate values.

easylink.utilities.validation_utils._validate_unique_column_set(df, columns, filepath)[source]

Validates that the combination of columns in columns is unique in the DataFrame.

Return type:

None

Parameters:

df (pandas.DataFrame) – The DataFrame to validate.
columns (set[str]) – The set of column names to check for uniqueness as a group.
filepath (str) – The path to the file being validated.

Raises:

ValueError – If duplicate rows exist for the given columns.
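
One possible way to express the grouped uniqueness check with pandas is shown below. The helper name check_unique_column_set and the error message are assumptions; only the ValueError on duplicates comes from the description above.

```python
import pandas as pd


def check_unique_column_set(
    df: pd.DataFrame, columns: set[str], filepath: str
) -> None:
    """Hypothetical helper: the listed columns, taken together, must be unique."""
    subset = sorted(columns)
    # keep=False marks every member of a duplicated group, not just the repeats.
    duplicated_rows = df[df.duplicated(subset=subset, keep=False)]
    if not duplicated_rows.empty:
        raise ValueError(
            f"{filepath} has {len(duplicated_rows)} rows with non-unique "
            f"values for columns {subset}"
        )
```

Note that individual columns may repeat freely under this check; only a repeated combination of values is rejected.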

easylink.utilities.validation_utils.validate_input_file_dummy(filepath)[source]

Validates an input file to a dummy Step.

The file must contain the columns: “foo”, “bar”, and “counter”.

Return type: None
Parameters: filepath (str) – The path to the input file.
Raises: LookupError – If the file is missing required columns.

easylink.utilities.validation_utils.validate_input_dataset_or_known_clusters(filepath)[source]

Validates a dataset or clusters file based on its filename.

Return type: None
Parameters: filepath (str) – The path to the input file.
Raises: LookupError, ValueError – If the file fails validation as a dataset or clusters file.

easylink.utilities.validation_utils.validate_dataset(filepath)[source]

Validates a dataset file.

The file must be in a tabular format and contain a “Record ID” column.
The “Record ID” column must have unique integer values.

Parameters:

filepath (str) – The path to the input dataset file.

Raises:

LookupError – If the file is missing the required “Record ID” column.
ValueError – If the “Record ID” column is not unique or not integer dtype.

Return type:

None
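
The “Record ID” rules can be illustrated with the sketch below. It operates on an already-loaded DataFrame, and the helper name check_record_ids, the order of the checks, and the messages are all assumptions, not the module's code.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype


def check_record_ids(df: pd.DataFrame, filepath: str) -> None:
    """Hypothetical helper applying the documented "Record ID" rules."""
    if "Record ID" not in df.columns:
        raise LookupError(f"{filepath} is missing the 'Record ID' column")
    if not is_integer_dtype(df["Record ID"]):
        raise ValueError(f"'Record ID' in {filepath} must have an integer dtype")
    if not df["Record ID"].is_unique:
        raise ValueError(f"'Record ID' in {filepath} contains duplicate values")
```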

easylink.utilities.validation_utils.validate_datasets_directory(filepath)[source]

Validates a directory of input dataset files.

Each file in the directory must be in a tabular format and contain a “Record ID” column.
The “Record ID” column must have unique values.

Parameters:

filepath (str) – The path to the directory containing input dataset files.

Raises:

NotADirectoryError – If the provided path is not a directory.
LookupError – If any file is missing the required “Record ID” column.
ValueError – If the “Record ID” column is not unique in any file or if a non-file is present.

Return type:

None

easylink.utilities.validation_utils.validate_clusters(filepath)[source]

Validates a file containing cluster information.

The file must contain three columns: “Input Record Dataset”, “Input Record ID”, and “Cluster ID”.
“Input Record Dataset” and “Input Record ID”, considered as a pair, must have unique values.

Parameters:

filepath (str) – The path to the file containing cluster data.

Raises:

LookupError – If the file is missing required columns.
ValueError – If the (“Input Record Dataset”, “Input Record ID”) pair is not unique.

Return type:

None

easylink.utilities.validation_utils.validate_links(filepath)[source]

Validates a file containing link information.

The file must contain five columns: “Left Record Dataset”, “Left Record ID”, “Right Record Dataset”, “Right Record ID”, and “Probability”.
“Left Record ID” and “Right Record ID” cannot be equal in a row where “Left Record Dataset” also equals “Right Record Dataset”.
Rows must be unique, ignoring the Probability column.
“Left Record Dataset” must be alphabetically before (or equal to) “Right Record Dataset”.
“Left Record ID” must be less than “Right Record ID” if “Left Record Dataset” equals “Right Record Dataset”.
“Probability” values must be between 0 and 1 (inclusive).

Parameters:

filepath (str) – The path to the file containing link data.

Raises:

LookupError – If the file is missing required columns.
ValueError – If any of the following holds:
- “Left Record ID” equals “Right Record ID” in any row where datasets match.
- Duplicate rows exist with the same “Left Record Dataset”, “Left Record ID”, “Right Record Dataset”, and “Right Record ID”.
- “Left Record Dataset” is not alphabetically before or equal to “Right Record Dataset”.
- “Left Record ID” is not less than “Right Record ID” when datasets match.
- Values in the “Probability” column are not between 0 and 1 (inclusive).

Return type:

None
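
The link rules listed above can be expressed as vectorized pandas checks, as in the following sketch. It is an illustration only: the name find_link_violations and the short rule labels are hypothetical, the sketch returns a list of violated rules where the documented function raises ValueError, and "alphabetically before" is assumed to mean ordinary case-sensitive string comparison.

```python
import pandas as pd

PAIR_COLUMNS = [
    "Left Record Dataset",
    "Left Record ID",
    "Right Record Dataset",
    "Right Record ID",
]


def find_link_violations(df: pd.DataFrame) -> list[str]:
    """Hypothetical helper: return labels for each documented rule that fails."""
    problems = []
    same_dataset = df["Left Record Dataset"] == df["Right Record Dataset"]
    if (same_dataset & (df["Left Record ID"] == df["Right Record ID"])).any():
        problems.append("self-link")
    # Uniqueness ignores the "Probability" column.
    if df.duplicated(subset=PAIR_COLUMNS).any():
        problems.append("duplicate pair")
    if (df["Left Record Dataset"] > df["Right Record Dataset"]).any():
        problems.append("dataset order")
    if (same_dataset & (df["Left Record ID"] >= df["Right Record ID"])).any():
        problems.append("id order")
    # Series.between is inclusive on both ends by default.
    if not df["Probability"].between(0, 1).all():
        problems.append("probability range")
    return problems
```

Because the ID-order rule requires a strictly smaller left ID, a row that links a record to itself fails both the self-link and the ID-order checks in this sketch.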

easylink.utilities.validation_utils._validate_pairs(df, filepath)[source]

Validates pairs in a DataFrame for link or pairs files.

Return type:

None

Parameters:

df (pandas.DataFrame) – The DataFrame to validate.
filepath (str) – The path to the file being validated.

Raises:

ValueError – If any validation rule for pairs is violated.

easylink.utilities.validation_utils.validate_ids_to_remove(filepath)[source]

Validates a file containing IDs to remove.

The file must contain a single column: “Input Record ID”.
“Input Record ID” must have unique values.

Parameters:

filepath (str) – The path to the file containing IDs to remove.

Raises:

LookupError – If the file is missing the “Input Record ID” column.
ValueError – If the “Input Record ID” column is not unique.

Return type:

None

easylink.utilities.validation_utils.validate_records(filepath)[source]

Validates a file containing records.

The file must be in a tabular format.
The file may have any number of columns.
Two columns must be called “Input Record Dataset” and “Input Record ID” and they must have unique values as a pair.

Parameters:

filepath (str) – The path to the file containing records.

Raises:

LookupError – If required columns are missing.
ValueError – If the (“Input Record Dataset”, “Input Record ID”) pair is not unique.

Return type:

None

easylink.utilities.validation_utils.validate_blocks(filepath)[source]

Validates a directory containing blocks.

Each block subdirectory must contain exactly two files: a records file and a pairs file, both in tabular format.

Validation checks include:
- The parent directory must exist and be a directory.
- Each block subdirectory must contain exactly one records file (filename contains “records”) and one pairs file (filename contains “pairs”).
- The records file must have columns “Input Record Dataset” and “Input Record ID” with unique pairs.
- The pairs file must have columns “Left Record Dataset”, “Left Record ID”, “Right Record Dataset”, and “Right Record ID”.
- All values in (“Left Record Dataset”, “Left Record ID”) and (“Right Record Dataset”, “Right Record ID”) must exist in the records file.
- No row in the pairs file may have “Left Record Dataset” == “Right Record Dataset” and “Left Record ID” == “Right Record ID”.
- All rows in the pairs file must be unique with respect to (“Left Record Dataset”, “Left Record ID”, “Right Record Dataset”, “Right Record ID”).
- “Left Record Dataset” must be alphabetically before or equal to “Right Record Dataset”.
- “Left Record ID” must be less than “Right Record ID” if datasets match.
- No extra files are allowed in block subdirectories.

Return type:

None

Parameters:

filepath (str) – Path to the directory containing block subdirectories.

Raises:

NotADirectoryError – If the provided path is not a directory.
FileNotFoundError – If a required records or pairs file is missing in any block.
LookupError – If required columns are missing in records or pairs files.
ValueError – If any of the following holds:
- (“Input Record Dataset”, “Input Record ID”) is not unique in the records file.
- (“Left Record Dataset”, “Left Record ID”) or (“Right Record Dataset”, “Right Record ID”) in the pairs file do not exist in the records file.
- “Left Record Dataset” == “Right Record Dataset” and “Left Record ID” == “Right Record ID” in any row of the pairs file.
- Duplicate rows exist in the pairs file.
- “Left Record Dataset” is not alphabetically before or equal to “Right Record Dataset” in any row.
- “Left Record ID” is not less than “Right Record ID” when datasets match.
- Extra files are present in a block subdirectory.
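
The directory-layout portion of these checks (not the column or pair rules) might look like the sketch below, using only the standard library. The names find_block_files and check_block_layout are hypothetical, and how the real function treats ambiguous cases, such as two files whose names both contain “records”, is an assumption here.

```python
from pathlib import Path


def find_block_files(block_dir: Path) -> tuple[Path, Path]:
    """Hypothetical helper: locate the records and pairs files of one block."""
    files = [p for p in block_dir.iterdir() if p.is_file()]
    records = [p for p in files if "records" in p.name]
    pairs = [p for p in files if "pairs" in p.name]
    if not records or not pairs:
        raise FileNotFoundError(f"{block_dir} needs a records file and a pairs file")
    # Assumed: anything other than one distinct file of each kind is rejected.
    if len(files) != 2 or len(records) != 1 or len(pairs) != 1 or records[0] == pairs[0]:
        raise ValueError(f"{block_dir} must contain exactly one records and one pairs file")
    return records[0], pairs[0]


def check_block_layout(filepath: str) -> dict[str, tuple[Path, Path]]:
    """Map each block subdirectory name to its (records, pairs) file paths."""
    root = Path(filepath)
    if not root.is_dir():
        raise NotADirectoryError(f"{filepath} is not a directory")
    return {d.name: find_block_files(d) for d in sorted(root.iterdir()) if d.is_dir()}
```

Returning the located paths lets a caller go on to load each file and apply the column and pair checks separately.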

easylink.utilities.validation_utils.validate_dir(filepath)[source]

Validates that the given path is a directory.

Return type: None
Parameters: filepath (str) – The path to check.
Raises: NotADirectoryError – If the path is not a directory.

easylink.utilities.validation_utils.validate_dataset_dir(filepath)[source]

Validates a directory containing a single dataset file.

Return type:

None

Parameters:

filepath (str) – The path to the directory.

Raises:

NotADirectoryError – If the path is not a directory.
ValueError – If the directory contains more than one file.
FileNotFoundError – If the directory does not contain any files.
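
A minimal sketch of the three documented failure modes, using only the standard library, follows. The helper name find_single_dataset_file is hypothetical, and unlike the documented function (which returns None) it returns the path of the single file for convenience.

```python
from pathlib import Path


def find_single_dataset_file(filepath: str) -> Path:
    """Hypothetical helper: the directory must hold exactly one file."""
    directory = Path(filepath)
    if not directory.is_dir():
        raise NotADirectoryError(f"{filepath} is not a directory")
    files = [p for p in directory.iterdir() if p.is_file()]
    if not files:
        raise FileNotFoundError(f"{filepath} does not contain any files")
    if len(files) > 1:
        raise ValueError(f"{filepath} contains more than one file")
    return files[0]
```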

easylink.utilities.validation_utils.dont_validate(filepath)[source]

Placeholder function that performs no validation.

Return type: None
Parameters: filepath (str) – The path to the file (not used).