Data Validation Utilities

This module contains utility functions for validating datasets, such as the functions that validate processed data as it passes out of one pipeline step and into the next.

easylink.utilities.validation_utils._read_file(filepath)[source]

Reads a file.

Return type:

DataFrame

Parameters:

filepath (str) – The path to the file to read.

Returns:

The loaded DataFrame.

Raises:

NotImplementedError – If the file type is not supported.
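
As a rough illustration of the behavior described above, a reader of this kind can dispatch on the file extension and raise NotImplementedError for anything it does not recognize. The function name read_file_sketch and the set of supported extensions (.csv and .parquet) are assumptions made for this example, not details taken from the module.

```python
from pathlib import Path

import pandas as pd


def read_file_sketch(filepath: str) -> pd.DataFrame:
    """Load a tabular file, dispatching on its extension (hypothetical helper)."""
    suffix = Path(filepath).suffix.lower()
    # Assumed set of supported formats; the real module may support others.
    if suffix == ".csv":
        return pd.read_csv(filepath)
    if suffix == ".parquet":
        return pd.read_parquet(filepath)
    # Anything else is rejected before any I/O is attempted.
    raise NotImplementedError(f"Unsupported file type {suffix!r} for {filepath}")
```

Rejecting unknown extensions up front keeps the failure explicit instead of letting a parser fail on a file it was never meant to read.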

easylink.utilities.validation_utils._validate_required_columns(filepath, required_columns)[source]

Validates that the file at filepath contains all columns in required_columns.

Return type:

None

Parameters:
  • filepath (str) – The path to the file to validate.

  • required_columns (set[str]) – The set of required column names.

Raises:

LookupError – If the file is missing any of the required columns.

easylink.utilities.validation_utils._validate_unique_column(df, column_name, filepath)[source]

Validates that a column in a DataFrame has unique values.

Return type:

None

Parameters:
  • df (pandas.DataFrame) – The DataFrame to validate.

  • column_name (str) – The name of the column to check.

  • filepath (str) – The path to the file being validated.

Raises:

ValueError – If the column contains duplicate values.
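
A uniqueness check like the one described above can be sketched with pandas' Series.is_unique. The name check_unique_column and the wording of the error message are hypothetical; only the parameters and the ValueError come from the description.

```python
import pandas as pd


def check_unique_column(df: pd.DataFrame, column_name: str, filepath: str) -> None:
    """Raise ValueError if `column_name` holds any repeated value (hypothetical helper)."""
    if not df[column_name].is_unique:
        # Collect the repeated values so the error message is actionable.
        repeated = df.loc[df[column_name].duplicated(), column_name].unique().tolist()
        raise ValueError(
            f"Column {column_name!r} in {filepath} has duplicate values: {repeated}"
        )
```

The filepath argument is used only for the message, which is presumably why the documented function accepts it alongside the DataFrame.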

easylink.utilities.validation_utils._validate_unique_column_set(df, columns, filepath)[source]

Validates that the combination of columns in columns is unique in the DataFrame.

Return type:

None

Parameters:
  • df (pandas.DataFrame) – The DataFrame to validate.

  • columns (set[str]) – The set of column names to check for uniqueness as a group.

  • filepath (str) – The path to the file being validated.

Raises:

ValueError – If duplicate rows exist for the given columns.
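
Checking a group of columns for uniqueness differs from checking each column separately: a value may repeat within one column as long as the combination is distinct. A minimal sketch using DataFrame.duplicated follows; the name check_unique_column_set and the message text are assumptions for illustration.

```python
import pandas as pd


def check_unique_column_set(df: pd.DataFrame, columns: set[str], filepath: str) -> None:
    """Raise ValueError if any combination of `columns` repeats (hypothetical helper)."""
    # duplicated() expects a list-like subset; sorting keeps the message stable.
    subset = sorted(columns)
    # keep=False flags every row that takes part in a duplicate group.
    repeated_rows = df.duplicated(subset=subset, keep=False)
    if repeated_rows.any():
        raise ValueError(
            f"{int(repeated_rows.sum())} rows in {filepath} share values for {subset}"
        )
```

For example, the same “Input Record ID” may appear under two different “Input Record Dataset” values without tripping this check.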

easylink.utilities.validation_utils.validate_input_file_dummy(filepath)[source]

Validates an input file to a dummy Step.

The file must contain the columns: “foo”, “bar”, and “counter”.

Return type:

None

Parameters:

filepath (str) – The path to the input file.

Raises:

LookupError – If the file is missing required columns.

easylink.utilities.validation_utils.validate_input_dataset_or_known_clusters(filepath)[source]

Validates a dataset or clusters file based on its filename.

Return type:

None

Parameters:

filepath (str) – The path to the input file.

Raises:

LookupError, ValueError – If the file fails validation as a dataset or clusters file.

easylink.utilities.validation_utils.validate_dataset(filepath)[source]

Validates a dataset file.

  • Must be in a tabular format and contain a “Record ID” column.

  • The “Record ID” column must have unique integer values.

Parameters:

filepath (str) – The path to the input dataset file.

Raises:
  • LookupError – If the file is missing the required “Record ID” column.

  • ValueError – If the “Record ID” column is not unique or not integer dtype.

Return type:

None
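
The dataset rules above can be sketched against an already-loaded DataFrame. The function name check_dataset_frame and the order of the checks are assumptions; the exception types follow the Raises list.

```python
import pandas as pd


def check_dataset_frame(df: pd.DataFrame, filepath: str) -> None:
    """Apply the documented “Record ID” rules to a loaded dataset (hypothetical helper)."""
    if "Record ID" not in df.columns:
        raise LookupError(f"{filepath} is missing the required 'Record ID' column")
    record_ids = df["Record ID"]
    # is_integer_dtype accepts NumPy and nullable pandas integer dtypes.
    if not pd.api.types.is_integer_dtype(record_ids):
        raise ValueError(f"'Record ID' in {filepath} must have an integer dtype")
    if not record_ids.is_unique:
        raise ValueError(f"'Record ID' in {filepath} must have unique values")
```

Testing the dtype before uniqueness means a file with string IDs is reported as a type problem first, which is usually the more fundamental issue.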

easylink.utilities.validation_utils.validate_datasets_directory(filepath)[source]

Validates a directory of input dataset files.

  • Each file in the directory must be in a tabular format and contain a “Record ID” column.

  • The “Record ID” column must have unique values.

Parameters:

filepath (str) – The path to the directory containing input dataset files.

Raises:
  • NotADirectoryError – If the provided path is not a directory.

  • LookupError – If any file is missing the required “Record ID” column.

  • ValueError – If the “Record ID” column is not unique in any file or if a non-file is present.

Return type:

None

easylink.utilities.validation_utils.validate_clusters(filepath)[source]

Validates a file containing cluster information.

  • The file must contain three columns: “Input Record Dataset”, “Input Record ID”, and “Cluster ID”.

  • “Input Record Dataset” and “Input Record ID”, considered as a pair, must have unique values.

Parameters:

filepath (str) – The path to the file containing cluster data.

Raises:
  • LookupError – If the file is missing required columns.

  • ValueError – If the (“Input Record Dataset”, “Input Record ID”) pair is not unique.

Return type:

None

Validates a file containing link information.

  • The file must contain five columns: “Left Record Dataset”, “Left Record ID”, “Right Record Dataset”, “Right Record ID”, and “Probability”.

  • “Left Record ID” and “Right Record ID” cannot be equal in a row where “Left Record Dataset” also equals “Right Record Dataset”.

  • Rows must be unique, ignoring the Probability column.

  • “Left Record Dataset” must be alphabetically before (or equal to) “Right Record Dataset”.

  • “Left Record ID” must be less than “Right Record ID” if “Left Record Dataset” equals “Right Record Dataset”.

  • “Probability” values must be between 0 and 1 (inclusive).

Parameters:

filepath (str) – The path to the file containing link data.

Raises:
  • LookupError – If the file is missing required columns.

  • ValueError – If any of the following is true:

      • “Left Record ID” equals “Right Record ID” in any row where datasets match.

      • Duplicate rows exist with the same “Left Record Dataset”, “Left Record ID”, “Right Record Dataset”, and “Right Record ID”.

      • “Left Record Dataset” is not alphabetically before or equal to “Right Record Dataset”.

      • “Left Record ID” is not less than “Right Record ID” when datasets match.

      • Values in the “Probability” column are not between 0 and 1 (inclusive).

Return type:

None
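
The link-file rules listed above translate fairly directly into vectorized pandas checks. The sketch below works on an already-loaded DataFrame; the name check_links_frame, the constant names, and the error messages are assumptions for illustration, and the real function may order or phrase its checks differently.

```python
import pandas as pd

# Column names are taken from the rules above; the constant names are hypothetical.
ID_COLUMNS = [
    "Left Record Dataset",
    "Left Record ID",
    "Right Record Dataset",
    "Right Record ID",
]
LINK_COLUMNS = ID_COLUMNS + ["Probability"]


def check_links_frame(df: pd.DataFrame) -> None:
    """Apply the documented link rules to a loaded links table (hypothetical helper)."""
    missing = set(LINK_COLUMNS) - set(df.columns)
    if missing:
        raise LookupError(f"Missing required columns: {sorted(missing)}")

    same_dataset = df["Left Record Dataset"] == df["Right Record Dataset"]
    if (same_dataset & (df["Left Record ID"] == df["Right Record ID"])).any():
        raise ValueError("A record is linked to itself")
    # Uniqueness ignores the Probability column.
    if df.duplicated(subset=ID_COLUMNS).any():
        raise ValueError("Duplicate links found")
    # Elementwise string comparison gives the alphabetical ordering rule.
    if (df["Left Record Dataset"] > df["Right Record Dataset"]).any():
        raise ValueError("Left dataset must be alphabetically before or equal to right")
    if (same_dataset & (df["Left Record ID"] >= df["Right Record ID"])).any():
        raise ValueError("Left ID must be less than right ID within one dataset")
    # between() is inclusive on both ends by default; NaN fails the check.
    if not df["Probability"].between(0, 1).all():
        raise ValueError("Probability values must lie between 0 and 1")
```

Taken together, the ordering rules give each unordered pair of records exactly one valid representation, which is what makes the duplicate-row check meaningful.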

easylink.utilities.validation_utils._validate_pairs(df, filepath)[source]

Validates pairs in a DataFrame for link or pairs files.

Return type:

None

Parameters:
  • df (pandas.DataFrame) – The DataFrame to validate.

  • filepath (str) – The path to the file being validated.

Raises:

ValueError – If any validation rule for pairs is violated.

easylink.utilities.validation_utils.validate_ids_to_remove(filepath)[source]

Validates a file containing IDs to remove.

  • The file must contain a single column: “Input Record ID”.

  • “Input Record ID” must have unique values.

Parameters:

filepath (str) – The path to the file containing IDs to remove.

Raises:
  • LookupError – If the file is missing the “Input Record ID” column.

  • ValueError – If the “Input Record ID” column is not unique.

Return type:

None

easylink.utilities.validation_utils.validate_records(filepath)[source]

Validates a file containing records.

  • The file must be in a tabular format.

  • The file may have any number of columns.

  • Two columns must be called “Input Record Dataset” and “Input Record ID” and they must have unique values as a pair.

Parameters:

filepath (str) – The path to the file containing records.

Raises:
  • LookupError – If required columns are missing.

  • ValueError – If the (“Input Record Dataset”, “Input Record ID”) pair is not unique.

Return type:

None

easylink.utilities.validation_utils.validate_blocks(filepath)[source]

Validates a directory containing blocks.

Each block subdirectory must contain exactly two files: a records file and a pairs file, both in tabular format.

Validation checks include:

  • The parent directory must exist and be a directory.

  • Each block subdirectory must contain exactly one records file (filename contains “records”) and one pairs file (filename contains “pairs”).

  • The records file must have columns “Input Record Dataset” and “Input Record ID” with unique pairs.

  • The pairs file must have columns “Left Record Dataset”, “Left Record ID”, “Right Record Dataset”, and “Right Record ID”.

  • All values in (“Left Record Dataset”, “Left Record ID”) and (“Right Record Dataset”, “Right Record ID”) must exist in the records file.

  • No row in the pairs file may have “Left Record Dataset” == “Right Record Dataset” and “Left Record ID” == “Right Record ID”.

  • All rows in the pairs file must be unique with respect to (“Left Record Dataset”, “Left Record ID”, “Right Record Dataset”, “Right Record ID”).

  • “Left Record Dataset” must be alphabetically before or equal to “Right Record Dataset”.

  • “Left Record ID” must be less than “Right Record ID” if datasets match.

  • No extra files are allowed in block subdirectories.

Return type:

None

Parameters:

filepath (str) – Path to the directory containing block subdirectories.

Raises:
  • NotADirectoryError – If the provided path is not a directory.

  • FileNotFoundError – If a required records or pairs file is missing in any block.

  • LookupError – If required columns are missing in records or pairs files.

  • ValueError – If any of the following is true:

      • (“Input Record Dataset”, “Input Record ID”) is not unique in the records file.

      • (“Left Record Dataset”, “Left Record ID”) or (“Right Record Dataset”, “Right Record ID”) in the pairs file do not exist in the records file.

      • “Left Record Dataset” == “Right Record Dataset” and “Left Record ID” == “Right Record ID” in any row of the pairs file.

      • Duplicate rows exist in the pairs file.

      • “Left Record Dataset” is not alphabetically before or equal to “Right Record Dataset” in any row.

      • “Left Record ID” is not less than “Right Record ID” when datasets match.

      • Extra files are present in a block subdirectory.
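
The directory-layout part of these checks (one records file, one pairs file, nothing else) can be sketched with pathlib alone. The helper name check_block_layout is hypothetical, and picking the first matching filename when several match is a simplification made for this example; the column-level rules are not covered here.

```python
from pathlib import Path


def check_block_layout(block_dir: Path) -> tuple[Path, Path]:
    """Return (records_file, pairs_file) for one block subdirectory (hypothetical helper)."""
    entries = sorted(block_dir.iterdir())
    # Files are identified by a substring of their name, as described above.
    records = [p for p in entries if "records" in p.name]
    pairs = [p for p in entries if "pairs" in p.name]
    if not records:
        raise FileNotFoundError(f"No records file found in {block_dir}")
    if not pairs:
        raise FileNotFoundError(f"No pairs file found in {block_dir}")
    # Anything beyond the chosen records and pairs files counts as extra.
    extra = [p for p in entries if p not in (records[0], pairs[0])]
    if extra:
        raise ValueError(f"Extra files in {block_dir}: {[p.name for p in extra]}")
    return records[0], pairs[0]
```

A caller would typically run this for each subdirectory of the blocks directory and then apply the records and pairs rules to the two returned paths.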

easylink.utilities.validation_utils.validate_dir(filepath)[source]

Validates that the given path is a directory.

Return type:

None

Parameters:

filepath (str) – The path to check.

Raises:

NotADirectoryError – If the path is not a directory.

easylink.utilities.validation_utils.validate_dataset_dir(filepath)[source]

Validates a directory containing a single dataset file.

Return type:

None

Parameters:

filepath (str) – The path to the directory.

Raises:

easylink.utilities.validation_utils.dont_validate(filepath)[source]

Placeholder function that performs no validation.

Return type:

None

Parameters:

filepath (str) – The path to the file (not used).