
EasyLink

EasyLink is a tool that allows users to build and run highly configurable record linkage/entity resolution pipelines. Its configurability enables users to “mix and match” different pieces of record linkage software by ensuring that each piece of the pipeline conforms to standard patterns.

For example, users at the Census Bureau could easily evaluate whether using a more sophisticated “blocking” method would improve results in a certain pipeline, without having to rewrite the entire pipeline.

In its current state, EasyLink provides only one or two implementations for each step, does not yet have documentation to support users in creating their own implementations, and is not yet stable enough to be recommended as a tool for production pipelines.

Installation

Supported Python versions: 3.11, 3.12

NOTE: This package requires an AMD64 CPU architecture; it is not compatible with Apple’s ARM64 architecture (e.g. M1 and newer Macs).

There are a few things to install in order to use this package:

  • Set up Linux.

    Singularity (and thus EasyLink) requires Linux to run. If you are not already using Linux, you will need to set up a virtual machine; refer to the Singularity documentation for installing on Windows or Mac.

  • Install Singularity.

    First check whether Singularity is already installed by running the command singularity --version. If it is, the command prints the installed version number.

    If Singularity is not yet installed, you will need to install it; refer to the Singularity docs for installing on Linux.

    Note that this requires administrator privileges; you may need to request installation from your system admin if you are working in a shared computing environment.

  • Install conda.

    We recommend miniforge. You can check whether conda is already installed by running the command conda --version. If it is, the command prints the installed version.

  • Create a conda environment with Python and Graphviz installed.

    $ conda create --name easylink -c conda-forge python=3.12 graphviz 'gcc<14' -y
    $ conda activate easylink
    
  • Install easylink in the environment.

    Option 1 - Install from PyPI with pip:

    $ pip install easylink
    

    Option 2 - Build from source with pip:

    $ pip install git+https://github.com/ihmeuw/easylink.git
    

Once you have EasyLink installed, see the Getting Started tutorial for how to use it.
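
The prerequisites listed above can be checked with a short script before installing. The following is a minimal sketch, not part of EasyLink: the function name check_prerequisites is hypothetical, and the script only looks for the singularity and conda executables on the PATH and inspects the platform. Run it inside the activated conda environment so that the Python version check reflects that environment.

```python
import platform
import shutil
import sys


def check_prerequisites():
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper for illustration only; it is not provided by EasyLink.
    """
    problems = []

    # Supported Python versions are 3.11 and 3.12.
    if sys.version_info[:2] not in [(3, 11), (3, 12)]:
        problems.append(
            f"Python {platform.python_version()} is not a supported version (3.11, 3.12)"
        )

    # Singularity (and thus EasyLink) requires Linux.
    if platform.system() != "Linux":
        problems.append(f"Operating system is {platform.system()}, not Linux")

    # An AMD64 CPU is required; it is usually reported as x86_64 or AMD64.
    if platform.machine().lower() not in ("x86_64", "amd64"):
        problems.append(f"CPU architecture is {platform.machine()}, not AMD64")

    # Both command-line tools must be discoverable on the PATH.
    for tool in ("singularity", "conda"):
        if shutil.which(tool) is None:
            problems.append(f"{tool} was not found on the PATH")

    return problems


if __name__ == "__main__":
    found = check_prerequisites()
    for problem in found:
        print("WARNING:", problem)
    if not found:
        print("All prerequisite checks passed.")
```

An empty result only means the tools were found; it does not confirm that they are configured correctly.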

Motivation

Imagine the Census Bureau has a record linkage pipeline that links people between datasets. One step in this pipeline, called “blocking,” groups records into “blocks” so that only pairs of records that might really be links are compared. The current pipeline uses a simple blocking mechanism, which won’t compare two records unless they match exactly on at least one of a few key attributes. The Census Bureau wants to explore whether more sophisticated blocking methods would improve results, without changing anything else in the pipeline.
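
To make the idea of simple blocking concrete, here is a minimal sketch in plain Python. It is not EasyLink code or the Census Bureau's method: the function name exact_match_blocking, the attribute names, and the records are all made up for illustration, and it assumes that matching "on any of a few key attributes" means agreeing exactly on at least one of them.

```python
from collections import defaultdict
from itertools import combinations


def exact_match_blocking(records, key_attributes):
    """Return candidate pairs (as index tuples) of records that agree
    exactly on at least one of the key attributes.

    Hypothetical helper for illustration only.
    """
    candidate_pairs = set()
    for attribute in key_attributes:
        # Group record indices by their value for this attribute.
        blocks = defaultdict(list)
        for index, record in enumerate(records):
            value = record.get(attribute)
            if value is not None:  # a missing value never forms a block
                blocks[value].append(index)
        # Every pair of records within a block becomes a candidate.
        for members in blocks.values():
            candidate_pairs.update(combinations(members, 2))
    return candidate_pairs


# Made-up records, for illustration only.
records = [
    {"last_name": "Smith", "zip": "98101"},
    {"last_name": "Smith", "zip": "98052"},
    {"last_name": "Jones", "zip": "98101"},
    {"last_name": "Nguyen", "zip": "10001"},
]

pairs = exact_match_blocking(records, ["last_name", "zip"])
print(sorted(pairs))  # [(0, 1), (0, 2)]
```

Only two of the six possible pairs are kept for comparison: records 0 and 1 share a last name, and records 0 and 2 share a ZIP code. A more sophisticated method might also keep pairs whose attributes are similar but not identical.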

Currently, software for record linkage is mostly created by researchers. Each researcher uses the technologies familiar to them and frames the record linkage task in the way that is most natural for their own examples, making it hard to use multiple software modules together. As a result, trying a new blocking method is too expensive for the Census Bureau to undertake without knowing what the benefit will be.

EasyLink aims to solve this problem by standardizing the record linkage pipeline and providing an “ecosystem” of compatible record linkage implementations. With EasyLink, switching from one piece of software to another requires only a change to a configuration file.


© Copyright 2024, The EasyLink developers.
