Getting Started

Introduction 

EasyLink is a tool that allows users to build and run highly configurable record linkage pipelines. Its configurability enables users to “mix and match” different pieces of record linkage software by ensuring that each piece of the pipeline conforms to standard patterns.

For example, users at the Census Bureau could easily evaluate whether using a more sophisticated “blocking” method would improve results in a certain pipeline, without having to rewrite the entire pipeline.

Overview 

This tutorial introduces EasyLink concepts and features by demonstrating the software’s usage. Covered concepts include the EasyLink record linkage “pipeline schema,” EasyLink pipeline configuration, running pipelines, changing record linkage step implementations, changing input data, evaluating and comparing results, and more.

Audience 

This tutorial is intended for people familiar with record linkage practices, who are interested in easily comparing linkage results across different methods. This tutorial will not include introductory information about record linkage, though it demonstrates a simple example of it.

Tutorial prerequisites 

Install EasyLink if you haven’t already.

The tutorial uses the Splink Python package for record linkage implementations. You do not need to install Splink. Splink knowledge is not required to complete the tutorial but may be helpful when configuring Splink models.

Simulated input data 

Our first demonstration of running an EasyLink pipeline will configure a simple, “naive” record linkage model with implementations written using the Splink package. Our pipeline will link two simulated datasets generated by our pseudopeople package: simulated Social Security Administration records and simulated W-2 and 1099 employment tax forms. These datasets are about entirely simulated people, who we call “simulants,” but they contain realistic data, including “noise” such as typos, for an authentic record linkage challenge.

Naive model - running a pipeline 

Let’s start by using the easylink run command to run a pipeline that configures a simple record linkage model.

First we need to download the configuration files we will pass to the command line: input_data_demo.yaml and pipeline_demo_naive.yaml. Save them into the directory from which you will execute the easylink run command.

input_data_demo.yaml additionally references a few input files which we will save as well. Save known_clusters.parquet to the same directory as the other files, then create a subdirectory called 2020 and save input_file_ssa.parquet and input_file_w2.parquet into it.

Now we can run the pipeline. Note that if this is your first time running EasyLink, the command will first download the required Singularity container images. These files contain the code EasyLink will run for each step in the record linkage pipeline. The total download is approximately 5 GB, so we recommend starting the command below and then reading the explanation of it while the files download. Hopefully the download will be complete by the time you reach the next interactive section! The progress of your image downloads will be displayed in the console.

$ easylink run -p pipeline_demo_naive.yaml -i input_data_demo.yaml
 2025-06-30 14:17:58 | 00:00:01 | Running pipeline
 2025-06-30 14:17:58 | 00:00:01 | Results directory: /mnt/share/homes/tylerdy/easylink/docs/source/user_guide/tutorials/results/2025_06_26_10_13_31
 ... Downloading Images ...
 2025-06-30 14:18:21 | 00:00:24 | Running Snakemake
 2025-06-30 14:18:22 | 00:00:25 | Validating determining_exclusions_and_removing_records_clone_2_removing_records_default_removing_records input slot input_datasets
 ...
 2025-06-30 14:18:24 | 00:00:27 | Running clusters_to_links implementation: default_clusters_to_links
 2025-06-30 14:18:24 | 00:00:27 | Running determining_exclusions implementation: default_determining_exclusions
 2025-06-30 14:18:24 | 00:00:27 | Running determining_exclusions implementation: default_determining_exclusions
 ...
 2025-06-30 14:18:39 | 00:00:42 | Validating splink_blocking_and_filtering input slot records
 2025-06-30 14:18:42 | 00:00:45 | Running blocking_and_filtering implementation: splink_blocking_and_filtering
 2025-06-30 14:18:50 | 00:00:53 | Validating splink_evaluating_pairs input slot blocks
 2025-06-30 14:18:53 | 00:00:56 | Running evaluating_pairs implementation: splink_evaluating_pairs
 ...
 2025-06-30 14:19:19 | 00:01:22 | Running canonicalizing_and_downstream_analysis implementation: save_clusters
 2025-06-30 14:19:21 | 00:01:24 | Validating results input slot analysis_output
 2025-06-30 14:19:23 | 00:01:26 | Grabbing final output
 2025-06-30 14:19:26 | 00:01:29 | Pipeline finished running - full log saved to: /mnt/share/homes/tylerdy/easylink/docs/source/user_guide/tutorials/results/2025_06_26_10_13_31/pipeline.log

Success! Our pipeline has linked the input data and written out the results: the clusters of records it found. We’ll take a look at these results later and see how the model performed.

Naive model - command line arguments 

This section will explain the command line arguments and show the file we pass to each one, including the pipeline specification YAML and how it relates to the EasyLink pipeline schema. That file can look a little complicated at first, so feel free to skip ahead to the Naive model - results section, where the interactive part of the tutorial continues, and come back later.

Input data 

The --input-data (-i) argument to easylink run accepts a YAML file specifying a list of paths to files or directories containing input data to be used by the pipeline. We passed input_data_demo.yaml, the contents of which are shown below:

input_file_ssa: 2020/input_file_ssa.parquet
input_file_w2: 2020/input_file_w2.parquet
known_clusters: known_clusters.parquet

Here we have defined the locations of the three input files we will use: the 2020 versions of the two pseudopeople datasets, and an empty known_clusters file, since no clusters are known to us before running this pipeline.
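Before running the pipeline, it can help to confirm that every file named in the input YAML is where the YAML says it is. The following is a minimal sketch, not part of EasyLink: parse_flat_yaml and missing_inputs are hypothetical helper names, and the parser assumes the flat "name: path" layout shown above rather than handling general YAML.

```python
from pathlib import Path


def parse_flat_yaml(text):
    """Parse a flat 'name: path' mapping like input_data_demo.yaml.

    Assumption: one pair per line, no nesting or quoting. Use a real
    YAML parser for anything more complex.
    """
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name, _, path = line.partition(":")
        mapping[name.strip()] = path.strip()
    return mapping


def missing_inputs(mapping, base_dir="."):
    """Return the names whose files do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name, rel in mapping.items() if not (base / rel).exists()]


INPUT_DATA_YAML = """\
input_file_ssa: 2020/input_file_ssa.parquet
input_file_w2: 2020/input_file_w2.parquet
known_clusters: known_clusters.parquet
"""

if __name__ == "__main__":
    inputs = parse_flat_yaml(INPUT_DATA_YAML)
    print("Missing input files:", missing_inputs(inputs) or "none")
```

Run it from the directory where you saved the configuration files; any name it reports is a file that still needs to be downloaded or moved.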

Note

To meet the input specifications for Datasets defined by the pipeline schema (see the next section), the SSA and W2 datasets, after being generated by pseudopeople, were modified to add the required Record ID column. Separately, for data cleaning rather than specification reasons, SSA death records were removed, leaving only SSN creation records.

Pipeline specification 

The --pipeline-specification (-p) argument to easylink run accepts a YAML file specifying the implementations and other configuration options for the pipeline being run. We passed pipeline_demo_naive.yaml, the contents of which are shown below:

  steps:
    entity_resolution:
      substeps:
        determining_exclusions_and_removing_records:
          clones:
            - determining_exclusions:
                implementation:
                  name: default_determining_exclusions
                  configuration:
                    INPUT_DATASET: input_file_ssa
              removing_records:
                implementation:
                  name: default_removing_records
                  configuration:
                    INPUT_DATASET: input_file_ssa
            - determining_exclusions:
                implementation:
                  name: default_determining_exclusions
                  configuration:
                    INPUT_DATASET: input_file_w2
              removing_records:
                implementation:
                  name: default_removing_records
                  configuration:
                    INPUT_DATASET: input_file_w2
        clustering:
          substeps:
            clusters_to_links:
              implementation:
                name: default_clusters_to_links
            linking:
              substeps:
                pre-processing:
                  clones:
                    - implementation:
                        name: middle_name_to_initial
                        configuration: 
                          INPUT_DATASET: input_file_ssa
                    - implementation:
                        name: no_pre-processing
                        configuration: 
                          INPUT_DATASET: input_file_w2
                schema_alignment:
                  implementation:
                    name: default_schema_alignment
                blocking_and_filtering:
                  implementation:
                    name: splink_blocking_and_filtering
                    configuration:
                      LINK_ONLY: true
                      BLOCKING_RULES: "l.first_name == r.first_name,l.last_name == r.last_name"
                evaluating_pairs:
                  implementation:
                    name: splink_evaluating_pairs
                    configuration:
                      LINK_ONLY: true
                      BLOCKING_RULES_FOR_TRAINING: "l.first_name == r.first_name,l.last_name == r.last_name"
                      COMPARISONS: "ssn:exact,first_name:exact,middle_initial:exact,last_name:exact"
                      PROBABILITY_TWO_RANDOM_RECORDS_MATCH: 0.0001  # == 1 / len(w2)
            links_to_clusters:
              implementation:
                name: one_to_many_links_to_clusters
                configuration:
                  NO_DUPLICATES_DATASET: input_file_ssa
                  THRESHOLD_MATCH_PROBABILITY: 0.996
        updating_clusters:
          implementation:
            name: default_updating_clusters
    canonicalizing_and_downstream_analysis:
      implementation:
        name: save_clusters

The pipeline specification follows the structure defined in the Pipeline Schema, a very important part of EasyLink. The EasyLink pipeline schema enforces the standard patterns that implementations of each step in the linkage process must follow. These standard patterns enable easy configuration and swapping.

There are some flexible sections in the pipeline schema, such as cloneable sections, which allow a pipeline to create multiple copies of that section and use different implementations or inputs for each copy. We’ll see one of those soon.

Important

Before proceeding, it’s important to understand the relationship between a pipeline, a pipeline specification (YAML file), and the pipeline schema:

A pipeline consists of a complete set of software which can perform a whole record linkage task, taking in record datasets as inputs and outputting a result such as clusters of records or some analysis on those clusters. EasyLink makes it simple to define and run many different pipelines in order to experiment with what methods yield the best results for a task.
A pipeline specification is a YAML file that defines a pipeline which can be run with EasyLink. The specification defines the implementation which will be run for each step, and performs any necessary configuration for those implementations. An example specification is shown above.
The EasyLink Pipeline Schema defines the universe of pipelines that can be constructed using EasyLink, including steps, inputs and outputs, and operators, as described above. All pipelines must adhere to the pipeline schema and implement all its steps!

Top-level steps 

Let’s take a closer look at the pipeline specification YAML bit by bit. We’ll start at the top level.

steps:
  entity_resolution:
    substeps:
      ...
  canonicalizing_and_downstream_analysis:
    implementation:
      name: save_clusters

This code block shows the same file, but with all the substeps of entity_resolution hidden, like in this diagram of the pipeline schema. Each time we link to one of these diagrams, the text below will also describe what each of the substeps involved does.

The children of the steps key are the top-level steps in the pipeline - as you can see, there are only two. We can see our first example of a step being configured if we look at canonicalizing_and_downstream_analysis. The children of the implementation key define and configure the code we will run for the canonicalizing and downstream analysis step. This step is intended to be used for determining the best representative (“canonical”) record for each cluster, and/or doing some kind of summary data analysis (such as a linear regression) within EasyLink. In this case, we won’t do either of these things, and will simply save the resolved clusters with no additional processing. We use the name key to choose the save_clusters implementation of canonicalizing_and_downstream_analysis. save_clusters corresponds to one of the images that were downloaded the first time you ran the pipeline.

Entity resolution substeps 

Next we will show the ellipsed part of the above code block, which corresponds to this diagram in the pipeline schema.

determining_exclusions_and_removing_records:
  clones:
    - determining_exclusions:
        implementation:
          name: default_determining_exclusions
          configuration:
            INPUT_DATASET: input_file_ssa
      removing_records:
        implementation:
          name: default_removing_records
          configuration:
            INPUT_DATASET: input_file_ssa
    - determining_exclusions:
        implementation:
          name: default_determining_exclusions
          configuration:
            INPUT_DATASET: input_file_w2
      removing_records:
        implementation:
          name: default_removing_records
          configuration:
            INPUT_DATASET: input_file_w2
clustering:
  substeps:
    ...
updating_clusters:
  implementation:
    name: default_updating_clusters

The last step shown, updating_clusters, looks similar to canonicalizing_and_downstream_analysis above; it simply chooses an implementation for the step using the name key.

The substeps of clustering are hidden – we’ll look at them next.

The complicated part is determining_exclusions_and_removing_records and its clones key:

As described in the pipeline schema, the steps “determining exclusions and removing records” identify and remove records that can be excluded from this linking pass to save computational time, generally because they have already been assigned to clusters.

The schema can define cloneable sections, which allow a pipeline to create multiple copies of that section and use different implementations or inputs for each copy. We can see that the entity resolution sub-steps schema section defines determining_exclusions and removing_records as cloneable in the diagram (blue dashed box).

In the YAML, the cloneable superstep determining_exclusions_and_removing_records is expanded using the clones key, and two copies are made of its substeps, determining_exclusions and removing_records. The - denotes the beginning of an item in a YAML collection.

We can see that the only difference between the two copies is what filename is passed to the INPUT_DATASET configuration key for each step. In the first copy, the ssa dataset files are used as inputs for both steps, while in the second copy, the w2 dataset files are the inputs. In practice, this means that records to exclude will be identified and removed separately for each input file, as required by the schema since each input file has different data. This cloneable section also allows different implementations to be used for each dataset if desired.
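To make the effect of the clones key concrete, the sketch below rebuilds the cloneable section as plain Python data and flattens it into one row per step copy. It only illustrates how the YAML is structured; it is not how EasyLink itself parses a specification, and make_clone and list_clone_steps are hypothetical names.

```python
def make_clone(dataset):
    """Build one copy of the cloneable section for a dataset (hypothetical helper)."""
    return {
        step: {
            "implementation": {
                "name": f"default_{step}",
                "configuration": {"INPUT_DATASET": dataset},
            }
        }
        for step in ("determining_exclusions", "removing_records")
    }


# Mirrors determining_exclusions_and_removing_records in pipeline_demo_naive.yaml.
section = {"clones": [make_clone("input_file_ssa"), make_clone("input_file_w2")]}


def list_clone_steps(section):
    """Flatten a 'clones' list into (clone number, step, implementation, dataset) rows."""
    rows = []
    for number, clone in enumerate(section["clones"], start=1):
        for step_name, step in clone.items():
            implementation = step["implementation"]
            dataset = implementation.get("configuration", {}).get("INPUT_DATASET")
            rows.append((number, step_name, implementation["name"], dataset))
    return rows


if __name__ == "__main__":
    for row in list_clone_steps(section):
        print(row)
```

Each clone contributes its own copy of both substeps, so two clones yield four step copies, two per input dataset.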

Note

All the steps listed here use default implementations. Much of the time, steps with default implementations aren’t very interesting to change, and the defaults will do whatever operation is the common or simple case. The pipeline schema section linked above the code block describes the behavior of each of these default implementations.

Clustering substeps 

Next we will show the ellipsed part of the above code block, which corresponds to this diagram in the pipeline schema.

clusters_to_links:
  implementation:
    name: default_clusters_to_links
linking:
  substeps:
    ...
links_to_clusters:
  implementation:
    name: one_to_many_links_to_clusters
    configuration:
      NO_DUPLICATES_DATASET: input_file_ssa
      THRESHOLD_MATCH_PROBABILITY: 0.996

We will show the hidden linking substeps in the next section.

In links_to_clusters we see a more interesting example of configuring an implementation. NO_DUPLICATES_DATASET specifies which dataset is assumed not to contain duplicates within it. THRESHOLD_MATCH_PROBABILITY allows the user to define the probability at or above which a pair of records will be considered part of the same cluster. Our one_to_many_links_to_clusters implementation performs this step by filtering out links below the threshold, and then choosing the single best-matching SSA record for each W-2 record (since linking a W-2 to multiple SSA records would imply those SSA records were duplicates). The name of the implementation reflects that in the resulting clusters, one SSA record can have many W-2 records (but not vice versa).

While this implementation doesn’t use the Splink package, the Splink docs have helpful info on how to choose a probability threshold.
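The threshold-then-best-match logic described for links_to_clusters can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the code inside the one_to_many_links_to_clusters image. The record IDs and probabilities are made up, and keeping links exactly at the threshold is an assumption.

```python
def one_to_many_clusters(links, threshold):
    """Group W-2 records under their best-matching SSA record.

    links: iterable of (ssa_id, w2_id, probability) tuples.
    Returns {ssa_id: [w2_id, ...]}.
    """
    best = {}  # w2_id -> (probability, ssa_id)
    for ssa_id, w2_id, probability in links:
        if probability < threshold:
            continue  # drop links below the threshold
        if w2_id not in best or probability > best[w2_id][0]:
            best[w2_id] = (probability, ssa_id)

    clusters = {}
    for w2_id, (_, ssa_id) in sorted(best.items()):
        clusters.setdefault(ssa_id, []).append(w2_id)
    return clusters


# Hypothetical links, for illustration only.
links = [
    ("ssa_a", "w2_1", 0.9999),
    ("ssa_b", "w2_1", 0.9970),  # above threshold, but a weaker match for w2_1
    ("ssa_b", "w2_2", 0.9990),
    ("ssa_b", "w2_3", 0.9985),
    ("ssa_c", "w2_4", 0.5000),  # below threshold, discarded
]

if __name__ == "__main__":
    print(one_to_many_clusters(links, threshold=0.996))
```

Because each W-2 record keeps only its single best SSA match, no W-2 record can appear in two clusters, while an SSA record such as ssa_b can collect several W-2 records.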

Linking substeps 

Next we will show the ellipsed part of the above code block, which corresponds to this diagram in the pipeline schema.

pre-processing:
  clones:
    - implementation:
        name: middle_name_to_initial
        configuration:
          INPUT_DATASET: input_file_ssa
    - implementation:
        name: no_pre-processing
        configuration:
          INPUT_DATASET: input_file_w2
schema_alignment:
  implementation:
    name: default_schema_alignment
blocking_and_filtering:
  implementation:
    name: splink_blocking_and_filtering
    configuration:
      LINK_ONLY: true
      BLOCKING_RULES: "l.first_name == r.first_name,l.last_name == r.last_name"
evaluating_pairs:
  implementation:
    name: splink_evaluating_pairs
    configuration:
      LINK_ONLY: true
      BLOCKING_RULES_FOR_TRAINING: "l.first_name == r.first_name,l.last_name == r.last_name"
      COMPARISONS: "ssn:exact,first_name:exact,middle_initial:exact,last_name:exact"
      PROBABILITY_TWO_RANDOM_RECORDS_MATCH: 0.0001  # == 1 / len(w2)

We see that pre-processing is another cloneable step, allowing us to select different pre-processing implementations for different input datasets. In this case, we leave the w2 dataset unchanged, while changing the middle_name column in the ssa dataset to a middle_initial column to match the w2 data.

Finally, we will configure the two Splink implementations.

For splink_blocking_and_filtering, we set:

LINK_ONLY: true
BLOCKING_RULES: "l.first_name == r.first_name,l.last_name == r.last_name"

The first variable instructs Splink to link records between datasets without deduplicating within datasets. The second is used by the Splink implementation to define which pairs of records will be considered as possible matches: a pair is kept if it satisfies either rule, so only records with matching first names or matching last names are compared.
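The effect of these two settings can be sketched in plain Python. This is only an illustration of the blocking logic: Splink evaluates blocking rules as SQL and does not need to enumerate every possible pair the way this loop does. The records and the candidate_pairs name are hypothetical.

```python
from itertools import product


def candidate_pairs(left, right):
    """Link-only blocking: pair records across datasets, never within one.

    A pair is kept if either blocking rule holds: equal first names
    or equal last names.
    """
    pairs = []
    for left_rec, right_rec in product(left, right):
        same_first = left_rec["first_name"] == right_rec["first_name"]
        same_last = left_rec["last_name"] == right_rec["last_name"]
        if same_first or same_last:
            pairs.append((left_rec["id"], right_rec["id"]))
    return pairs


# Hypothetical records, for illustration only.
ssa = [
    {"id": "ssa_1", "first_name": "Evelyn", "last_name": "Smith"},
    {"id": "ssa_2", "first_name": "George", "last_name": "Jones"},
]
w2 = [
    {"id": "w2_1", "first_name": "Evelyn", "last_name": "Smyth"},
    {"id": "w2_2", "first_name": "Marcus", "last_name": "Jones"},
    {"id": "w2_3", "first_name": "Analia", "last_name": "Lee"},
]

if __name__ == "__main__":
    print(candidate_pairs(ssa, w2))
```

Of the six possible cross-dataset pairs here, only two survive blocking, which is why blocking rules have such a large effect on how many pairs the model must evaluate.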

For splink_evaluating_pairs, we set:

LINK_ONLY: true
BLOCKING_RULES_FOR_TRAINING: "l.first_name == r.first_name,l.last_name == r.last_name"
COMPARISONS: "ssn:exact,first_name:exact,middle_initial:exact,last_name:exact"
PROBABILITY_TWO_RANDOM_RECORDS_MATCH: 0.0001  # == 1 / len(w2)

The first two variables are used similarly to the previous implementation. BLOCKING_RULES_FOR_TRAINING is specifically used for estimating parameters in the model. COMPARISONS defines the columns which will be compared by the Splink model, and how Splink will evaluate whether the column values match (exact comparisons). The fourth is a parameter used in training the model and making predictions (see the Splink docs for more info).
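To see how PROBABILITY_TWO_RANDOM_RECORDS_MATCH and the comparisons interact, here is a minimal sketch of the Fellegi-Sunter calculation that Splink's model is based on: the prior is converted to odds, multiplied by one Bayes factor per comparison, and converted back to a probability. The m and u values below are invented for illustration; in the tutorial pipeline they are estimated during training.

```python
def bayes_factor(m, u, agrees):
    """Bayes factor for one exact-match comparison.

    m: probability the column agrees given the records are a true match.
    u: probability the column agrees given the records are not a match.
    """
    return m / u if agrees else (1 - m) / (1 - u)


def match_probability(prior, bayes_factors):
    """Combine a prior match probability with per-comparison Bayes factors."""
    odds = prior / (1 - prior)
    for factor in bayes_factors:
        odds *= factor
    return odds / (1 + odds)


# Hypothetical (m, u) values per column, for illustration only.
PARAMS = {"ssn": (0.95, 0.0001), "first_name": (0.9, 0.01), "last_name": (0.9, 0.005)}
PRIOR = 0.0001  # PROBABILITY_TWO_RANDOM_RECORDS_MATCH

all_agree = match_probability(
    PRIOR, [bayes_factor(m, u, True) for m, u in PARAMS.values()]
)
none_agree = match_probability(
    PRIOR, [bayes_factor(m, u, False) for m, u in PARAMS.values()]
)

if __name__ == "__main__":
    print(f"all columns agree:  {all_agree:.6f}")
    print(f"no columns agree:   {none_agree:.2e}")
```

With a prior this small, a pair needs strong agreement on several columns before its probability can clear a threshold such as 0.996.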

And that’s the whole pipeline specification for our naive Splink model! Next let’s take a look at the results from when we ran the pipeline earlier.

Naive model - results 

Input and output data is stored in Parquet files. For example, to see our original records, we can view the contents of the input files listed in input_data_demo.yaml using Python:

$ # Activate your EasyLink conda environment!
$ python
>>> import pandas as pd
>>> pd.read_parquet("2020/input_file_ssa.parquet")
      simulant_id          ssn first_name    middle_name  ...     sex event_type event_date Record ID
0         0_19979  786-77-6454     Evelyn  Granddaughter  ...  Female   creation   19191204         0
1          0_6846  688-88-6377     George         Robert  ...    Male   creation   19210616         1
2         0_19983  651-33-9561   Beatrice         Jennie  ...  Female   creation   19220113         2
3           0_262  665-25-7858       Eura         Nadine  ...  Female   creation   19220305         3
4         0_12473  875-10-2359    Roberta           Ruth  ...  Female   creation   19220306         4
...           ...          ...        ...            ...  ...     ...        ...        ...       ...
16492     0_20687  183-90-0619    Matthew        Michael  ...  Female   creation   20201229     16492
16493     0_20686  803-81-8527     Jermey          Tyler  ...    Male   creation   20201229     16493
16494     0_20692  170-62-5253  Brittanie         Lauren  ...  Female   creation   20201229     16494
16495     0_20662  281-88-9330     Marcus         Jasper  ...    Male   creation   20201230     16495
16496     0_20673  547-99-7034     Analia        Brielle  ...  Female   creation   20201231     16496
[15984 rows x 10 columns]

>>> pd.read_parquet("2020/input_file_w2.parquet")
    simulant_id household_id employer_id          ssn  ... mailing_address_zipcode tax_form tax_year Record ID
0            0_4          0_8          95  584-16-0130  ...                   00000       W2     2020         0
1            0_5          0_8          29  854-13-6295  ...                   00000       W2     2020         1
2            0_5          0_8          30  854-13-6295  ...                   00000       W2     2020         2
3         0_5621       0_2289          46  674-27-1745  ...                   00000       W2     2020         3
4         0_5623       0_2289          83  794-23-1522  ...                   00000       W2     2020         4
...          ...          ...         ...          ...  ...                     ...      ...      ...       ...
9898     0_18936       0_7621          23  006-92-7857  ...                   00000       W2     2020      9898
9899     0_18936       0_7621          90  006-92-7857  ...                   00000       W2     2020      9899
9900     0_18937       0_7621           1  182-82-5017  ...                   00000     1099     2020      9900
9901     0_18937       0_7621         105  182-82-5017  ...                   00000     1099     2020      9901
9902     0_18939       0_7621           9  283-97-5940  ...                   00000       W2     2020      9902
[9903 rows x 25 columns]

>>> pd.read_parquet("known_clusters.parquet")
Empty DataFrame
Columns: [Input Record Dataset, Input Record ID, Cluster ID]
Index: []

It can also be useful to set up an alias to more easily preview parquet files. Run the following line to do so. (If you want this alias to persist across terminal restarts, you can add it to your .bashrc or .bash_aliases in your home directory.)

pqprint() { python -c "import pandas as pd; print(pd.read_parquet('$1'))" ; }

Let’s use the alias to print the results parquet, the location of which was printed when we ran the pipeline.

$ pqprint results/2025_06_26_10_13_31/result.parquet
      Input Record Dataset  Input Record ID  Cluster ID
         input_file_ssa             7371           1
          input_file_w2                0           1
         input_file_ssa             7037           2
          input_file_w2                2           2
          input_file_w2                1           2
...                    ...              ...         ...
      input_file_w2              997        6693
     input_file_ssa             5883        6693
      input_file_w2              999        6694
     input_file_ssa             6358        6694
      input_file_w2              998        6694

[15815 rows x 3 columns]

As we can see, the pipeline has successfully outputted a Cluster ID for every input record it was able to link to another record for our probability threshold of 99.6%.

Note

Running the pipeline also generates a DAG.svg file in the results directory which shows the implementations, data dependencies and input validations present in the pipeline. Due to the large number of steps, the figure is not very readable when embedded in this page, but can be opened in a new tab to allow for zooming in.

To see how the model linked pairs of records before resolving them into clusters, we can look at the intermediate output produced by the splink_evaluating_pairs implementation:

$ pqprint results/2025_06_26_10_13_31/intermediate/splink_evaluating_pairs/result.parquet
      Left Record Dataset  Left Record ID Right Record Dataset  Right Record ID   Probability
0           input_file_ssa           16314        input_file_w2             7604  5.593631e-06
1           input_file_ssa           16318        input_file_w2             7604  5.593631e-06
2           input_file_ssa           16326        input_file_w2             6049  5.593631e-06
3           input_file_ssa           16351        input_file_w2             3549  5.593631e-06
4           input_file_ssa           16353        input_file_w2             7434  5.593631e-06
...                    ...             ...                  ...              ...           ...
515790      input_file_ssa            8586        input_file_w2              943  3.526073e-04
515791      input_file_ssa            8591        input_file_w2             3326  7.227902e-07
515792      input_file_ssa            8595        input_file_w2             3369  7.227902e-07
515793      input_file_ssa            8596        input_file_w2             6458  3.526073e-04
515794      input_file_ssa            8597        input_file_w2             3248  7.227902e-07

[515795 rows x 5 columns]

The record pairs displayed in the preview are all far below the match threshold, but the full results could be investigated further using pandas.read_parquet() in a Python session.

The Splink implementations in our pipeline also produce some diagnostic charts which can be useful for evaluating results, such as the match weights chart (Splink docs) and comparison viewer tool (Splink docs). These charts are saved in the diagnostics/splink_evaluating_pairs subdirectory of the results directory for each pipeline run.

Finally, since we are using simulated input datasets, and therefore know the ground truth of which records are truly links, we can directly see how our naive model performed with the help of a script to evaluate false positives and false negatives, print_fp_fn_w2_ssa.py. Download and run it:

$ python print_fp_fn_w2_ssa.py results/2025_06_26_10_13_31
12509 true links
len(false_positives)=31; len(false_negatives)=555

In other words, with a threshold probability of 99.6%, out of 12,509 true links to be found, our model missed 555 (false negatives), and additionally linked 31 pairs that shouldn’t have been linked (false positives).
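The counts above can be turned into precision and recall, which are often easier to compare across runs. The sketch below is not the code of print_fp_fn_w2_ssa.py (which is not shown here). It illustrates how false positives and false negatives follow from comparing a set of predicted links to a set of true links, and then applies the standard formulas to the reported numbers.

```python
def confusion_counts(predicted_links, true_links):
    """Count true positives, false positives, and false negatives.

    Both arguments are iterables of hashable link identifiers,
    for example (ssa_record_id, w2_record_id) tuples.
    """
    predicted, truth = set(predicted_links), set(true_links)
    return {
        "true_positives": len(predicted & truth),
        "false_positives": len(predicted - truth),
        "false_negatives": len(truth - predicted),
    }


def precision_recall(n_true_links, false_positives, false_negatives):
    """Derive precision and recall from the counts the script prints."""
    true_positives = n_true_links - false_negatives
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / n_true_links
    return precision, recall


if __name__ == "__main__":
    # Numbers reported for the naive model above.
    precision, recall = precision_recall(12509, 31, 555)
    print(f"precision={precision:.4f} recall={recall:.4f}")
```

Lowering the threshold would generally trade some precision for recall, which is the tradeoff discussed next.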

Depending on our goals with the linked data, we might decrease the threshold to reduce false negatives, at the cost of increased false positives. But this was a simple linkage model. Let’s improve it to see if we can get a better performance tradeoff!

Configuring an improved pipeline 

Next, let’s modify our naive pipeline configuration YAML to try to improve our results. Primarily, we will change the COMPARISONS we pass to splink_evaluating_pairs to use flexible comparison methods rather than exact matches, allowing us to link records which have typos or other noise in them. We’ll use a new pipeline configuration YAML, pipeline_demo_improved.yaml, with these changes.

In splink_evaluating_pairs, we make the following change:

   LINK_ONLY: true
   BLOCKING_RULES_FOR_TRAINING: "l.first_name == r.first_name,l.last_name == r.last_name"
-  COMPARISONS: "ssn:exact,first_name:exact,middle_initial:exact,last_name:exact"
+  COMPARISONS: "ssn:levenshtein,first_name:name,middle_initial:exact,last_name:name"
   PROBABILITY_TWO_RANDOM_RECORDS_MATCH: 0.0001  # == 1 / len(w2)

COMPARISONS now uses Levenshtein comparisons for ssn, and Name comparisons for first_name and last_name, to link similar but not identical SSNs and names.
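For readers unfamiliar with it, Levenshtein distance counts the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into another. The sketch below is a plain-Python illustration of that definition, not Splink's implementation, and it says nothing about the distance thresholds the Splink comparison applies. The SSNs are made up.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or keep if equal)
                )
            )
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # A single mistyped digit is one edit away from the original.
    print(levenshtein("123-45-6789", "123-45-6780"))
    # Two swapped digits count as two substitutions under this definition.
    print(levenshtein("123-45-6789", "123-45-6798"))
```

An exact comparison treats both of these noisy SSNs as complete disagreements, whereas a distance-based comparison can still give partial credit for a near match.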

By re-running the pipeline with these changes and then running the evaluation script, we can see how our results compare:

$ easylink run -p pipeline_demo_improved.yaml -i input_data_demo.yaml
$ python print_fp_fn_w2_ssa.py results/2025_06_26_11_08_57
12509 true links
len(false_positives)=34; len(false_negatives)=488

We eliminated 67 false negatives compared to the naive results, thanks to our model linking more records with columns that are similar but don’t exactly match. At the cost of only three additional false positives, this seems like a good improvement!

Linking 2030 datasets using improved pipeline 

Let’s run this same “improved” pipeline, but using input_data_demo_2030.yaml as the input YAML, which uses the SSA and W-2 datasets from 2030 rather than 2020. Like we did for 2020, we’ll create a 2030 directory and save input_file_ssa.parquet and input_file_w2.parquet into it.

We can run the same pipeline on different data by changing only the input parameter:

$ easylink run -p pipeline_demo_improved.yaml -i input_data_demo_2030.yaml
$ python print_fp_fn_w2_ssa.py results/2025_06_26_11_17_52
13888 true links
len(false_positives)=33; len(false_negatives)=547

We get similar, but not identical, results with the 2030 data.

Linking with an iterative “cascade”

Cascading is an iterative approach to entity resolution used by the US Census Bureau (and possibly other organizations too) to deal with the computational challenge of linking billions of records. In cascading, multiple passes are made to find clusters, starting with faster techniques (such as exact matching) that can solve some “easy” cases and make the problem smaller. As the focus narrows to only the records that are hardest to cluster, more sophisticated and computationally expensive techniques can be used.

Cascading isn’t found very often in the scientific literature, and its statistical properties are under-theorized.

Cascading depends on having some way to determine, from an initial/provisional linkage result, which records are “done” and do not need to be considered any longer. In our case, because we know (or are willing to assume) that there are no duplicates in the SSA dataset, any W-2 record that has already linked to one SSA record has found its only match and does not need to be compared against any more SSA records.

Cascading can involve any number of iterative “passes,” but for simplicity we will consider only two. In the first pass, we’ll use deterministic linkage, linking records that match exactly on SSN, first name, and last name. In the second pass, we’ll use our improved Splink model on the remaining records.
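The two-pass idea can be sketched in plain Python before we express it in a pipeline specification. This is a conceptual illustration only, not how EasyLink executes iterations. The records are made up, and ssn_only_pass is a hypothetical stand-in for the improved Splink model used in the second pass.

```python
def exact_pass(ssa, w2):
    """Pass 1: deterministic links on exact SSN, first name, and last name.

    Assumes the SSA dataset has no duplicates, so each key maps to one record.
    Returns {w2_id: ssa_id}.
    """
    index = {(r["ssn"], r["first_name"], r["last_name"]): r["id"] for r in ssa}
    links = {}
    for rec in w2:
        key = (rec["ssn"], rec["first_name"], rec["last_name"])
        if key in index:
            links[rec["id"]] = index[key]
    return links


def ssn_only_pass(ssa, w2):
    """Hypothetical stand-in for a more expensive model: match on SSN alone."""
    by_ssn = {r["ssn"]: r["id"] for r in ssa}
    return {rec["id"]: by_ssn[rec["ssn"]] for rec in w2 if rec["ssn"] in by_ssn}


def cascade(ssa, w2, second_pass):
    """Run the exact pass, then send only unlinked W-2 records to second_pass."""
    links = exact_pass(ssa, w2)
    remaining = [rec for rec in w2 if rec["id"] not in links]
    links.update(second_pass(ssa, remaining))
    return links, len(remaining)


# Hypothetical records, for illustration only.
ssa = [
    {"id": "ssa_1", "ssn": "111", "first_name": "Ann", "last_name": "Lee"},
    {"id": "ssa_2", "ssn": "222", "first_name": "Bob", "last_name": "Ray"},
]
w2 = [
    {"id": "w2_1", "ssn": "111", "first_name": "Ann", "last_name": "Lee"},
    {"id": "w2_2", "ssn": "222", "first_name": "Bob", "last_name": "Rae"},  # typo
    {"id": "w2_3", "ssn": "333", "first_name": "Cy", "last_name": "Orr"},
]

if __name__ == "__main__":
    links, n_second_pass = cascade(ssa, w2, ssn_only_pass)
    print(links, "records sent to second pass:", n_second_pass)
```

The second pass still compares against every SSA record, because an SSA record may match many W-2 records; only the already-linked W-2 records are removed from consideration.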

We don’t expect cascading to make our results any more accurate – in fact, it seems likely that doing things in two steps might lead to a few more mistakes. Since cascading is a computational optimization, let’s get a baseline of how much computation we are doing without it: how many pairs we evaluated with the improved Splink model. For this model, we’ll return to 2020 data, so you’ll want to go back and find the timestamp of the last model before you ran on 2030 data:

$ pqprint results/2025_06_26_10_13_31/intermediate/splink_evaluating_pairs/result.parquet
      Left Record Dataset  Left Record ID Right Record Dataset  Right Record ID   Probability
0           input_file_ssa           16314        input_file_w2             7604  5.593631e-06
1           input_file_ssa           16318        input_file_w2             7604  5.593631e-06
2           input_file_ssa           16326        input_file_w2             6049  5.593631e-06
3           input_file_ssa           16351        input_file_w2             3549  5.593631e-06
4           input_file_ssa           16353        input_file_w2             7434  5.593631e-06
...                    ...             ...                  ...              ...           ...
515790      input_file_ssa            8586        input_file_w2              943  3.526073e-04
515791      input_file_ssa            8591        input_file_w2             3326  7.227902e-07
515792      input_file_ssa            8595        input_file_w2             3369  7.227902e-07
515793      input_file_ssa            8596        input_file_w2             6458  3.526073e-04
515794      input_file_ssa            8597        input_file_w2             3248  7.227902e-07

[515795 rows x 5 columns]

We ran over half a million pairs of records through our Splink model. Let’s see if cascading can decrease this number.

We’ll add cascading to our pipeline specification by giving the entity_resolution step multiple iterations. entity_resolution is what’s called a “loop-able section”, which works similarly to a cloneable section. The main difference in syntax is that the iterations key is used instead of clones. When the pipeline is run, rather than running in parallel like clones, these iterations will be executed in order, with the output from one iteration being passed as input to the next.

steps:
  entity_resolution:
    iterations:
      - substeps:
          ...
      - substeps:
          ...
  canonicalizing_and_downstream_analysis:
    implementation:
      name: save_clusters
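
The ordering implied by the iterations key can be pictured with a few lines of Python. This is a sketch of the data flow only, under our own assumptions: run_iterations, first_pass, and second_pass are hypothetical names, and clusters are represented as a simple dictionary rather than EasyLink's actual file formats.

```python
def run_iterations(iterations, initial_clusters):
    """Run each iteration in order, passing its output clusters to the next."""
    clusters = initial_clusters
    for iteration in iterations:
        clusters = iteration(clusters)
    return clusters


def first_pass(clusters):
    # Stand-in for a cheap deterministic pass that links one W-2 record.
    return {**clusters, "w2_0": "ssa_0"}


def second_pass(clusters):
    # Stand-in for a costlier pass; it can see what the first pass found.
    already_linked = set(clusters)
    new_links = {"w2_0": "ssa_9", "w2_2": "ssa_2"}
    kept = {k: v for k, v in new_links.items() if k not in already_linked}
    return {**clusters, **kept}


final_clusters = run_iterations([first_pass, second_pass], {})
print(final_clusters)
```

Unlike clones, which are independent of one another, the second function here depends on the first one's result, so the two cannot run in parallel.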

Within the first elided substeps section (the specification for our first cascading pass) we will copy the entire substeps section from our improved model, but make some changes to make the linkage deterministic:

substeps:
  determining_exclusions_and_removing_records:
    clones:
      - determining_exclusions:
          implementation:
            name: default_determining_exclusions
            configuration:
              INPUT_DATASET: input_file_ssa
        removing_records:
          implementation:
            name: default_removing_records
            configuration:
              INPUT_DATASET: input_file_ssa
      - determining_exclusions:
          implementation:
            name: default_determining_exclusions
            configuration:
              INPUT_DATASET: input_file_w2
        removing_records:
          implementation:
            name: default_removing_records
            configuration:
              INPUT_DATASET: input_file_w2
  clustering:
    substeps:
      clusters_to_links:
        implementation:
          name: default_clusters_to_links
      linking:
        substeps:
          pre-processing:
            clones:
            - implementation:
-               name: middle_name_to_initial
+               name: no_pre-processing
                configuration:
                  INPUT_DATASET: input_file_ssa
            - implementation:
                name: no_pre-processing
                configuration:
                  INPUT_DATASET: input_file_w2
          schema_alignment:
            implementation:
              name: default_schema_alignment
          blocking_and_filtering:
            implementation:
              name: splink_blocking_and_filtering
              configuration:
                LINK_ONLY: true
-               BLOCKING_RULES: "l.first_name == r.first_name,l.last_name == r.last_name"
+               BLOCKING_RULES: "l.ssn == r.ssn and l.first_name == r.first_name and l.last_name == r.last_name"
          evaluating_pairs:
            implementation:
-             name: splink_evaluating_pairs
-             configuration:
-               LINK_ONLY: true
-               BLOCKING_RULES_FOR_TRAINING: "l.first_name == r.first_name,l.last_name == r.last_name"
-               COMPARISONS: "ssn:levenshtein,first_name:name,middle_initial:exact,last_name:name"
-               PROBABILITY_TWO_RANDOM_RECORDS_MATCH: 0.0001  # == 1 / len(w2)
+             name: accept_all_pairs
      links_to_clusters:
        implementation:
          name: one_to_many_links_to_clusters
          configuration:
            NO_DUPLICATES_DATASET: input_file_ssa
-           THRESHOLD_MATCH_PROBABILITY: 0.996
+           # All come with certainty from accept_all_pairs anyway, so this doesn't matter
+           THRESHOLD_MATCH_PROBABILITY: 0.9

To do our strict deterministic linkage, we change BLOCKING_RULES to contain just one rule: an exact match on SSN, first name, and last name. Then we replace our Splink evaluating-pairs model with an implementation called accept_all_pairs which, as the name suggests, accepts every pair that passes blocking as a match with probability 1.
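
To see why the threshold in this pass has no effect, here is a minimal Python sketch of the same logic on toy records. The function names (block_exact, accept_all_pairs_sketch, links_above) and the record layout are hypothetical, and the real implementations may compare against the threshold differently; the point is only that every blocked pair carries probability 1.

```python
KEYS = ("ssn", "first_name", "last_name")


def block_exact(ssa, w2, keys=KEYS):
    """Return (ssa_id, w2_id) pairs that agree exactly on every key (an AND rule)."""
    by_key = {}
    for s in ssa:
        by_key.setdefault(tuple(s[k] for k in keys), []).append(s["id"])
    return [
        (ssa_id, w["id"])
        for w in w2
        for ssa_id in by_key.get(tuple(w[k] for k in keys), [])
    ]


def accept_all_pairs_sketch(pairs):
    """Score every blocked pair as a certain match; no model is trained."""
    return [
        {"Left Record ID": left, "Right Record ID": right, "Probability": 1.0}
        for left, right in pairs
    ]


def links_above(scored, threshold):
    """Keep pairs whose probability meets the threshold."""
    return [
        (p["Left Record ID"], p["Right Record ID"])
        for p in scored
        if p["Probability"] >= threshold
    ]


ssa = [
    {"id": 0, "ssn": "111", "first_name": "ann", "last_name": "lee"},
    {"id": 1, "ssn": "222", "first_name": "bob", "last_name": "ray"},
]
w2 = [
    {"id": 0, "ssn": "111", "first_name": "ann", "last_name": "lee"},
    {"id": 1, "ssn": "111", "first_name": "ann", "last_name": "lee"},  # second W-2, same person
    {"id": 2, "ssn": "222", "first_name": "rob", "last_name": "ray"},  # first name differs
]

scored = accept_all_pairs_sketch(block_exact(ssa, w2))
print(links_above(scored, 0.9) == links_above(scored, 0.996))
```

Both thresholds keep the same two links, and the W-2 record whose first name differs is left for a later pass.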

For the second iteration, we can paste in another copy of our original improved model, with the following changes:

substeps:
  determining_exclusions_and_removing_records:
    clones:
      - determining_exclusions:
          implementation:
-           name: default_determining_exclusions
+           name: exclude_none
            configuration:
              INPUT_DATASET: input_file_ssa
        removing_records:
          implementation:
            name: default_removing_records
            configuration:
              INPUT_DATASET: input_file_ssa
      - determining_exclusions:
          implementation:
-           name: default_determining_exclusions
+           name: exclude_clustered
            configuration:
              INPUT_DATASET: input_file_w2
        removing_records:
          implementation:
            name: default_removing_records
            configuration:
              INPUT_DATASET: input_file_w2
  clustering:
    substeps:
      clusters_to_links:
        implementation:
          name: default_clusters_to_links
      linking:
        substeps:
          pre-processing:
            clones:
            - implementation:
                name: middle_name_to_initial
                configuration:
                  INPUT_DATASET: input_file_ssa
            - implementation:
                name: no_pre-processing
                configuration:
                  INPUT_DATASET: input_file_w2
          schema_alignment:
            implementation:
              name: default_schema_alignment
          blocking_and_filtering:
            implementation:
              name: splink_blocking_and_filtering
              configuration:
                LINK_ONLY: true
                BLOCKING_RULES: "l.first_name == r.first_name,l.last_name == r.last_name"
          evaluating_pairs:
            implementation:
              name: splink_evaluating_pairs
              configuration:
                LINK_ONLY: true
                BLOCKING_RULES_FOR_TRAINING: "l.first_name == r.first_name,l.last_name == r.last_name"
                COMPARISONS: "ssn:levenshtein,first_name:name,middle_initial:exact,last_name:name"
                PROBABILITY_TWO_RANDOM_RECORDS_MATCH: 0.0001  # == 1 / len(w2)
      links_to_clusters:
        implementation:
          name: one_to_many_links_to_clusters
          configuration:
            NO_DUPLICATES_DATASET: input_file_ssa
            THRESHOLD_MATCH_PROBABILITY: 0.996

The only difference here is that we use new implementations for determining_exclusions. The determining exclusions step exists in order to facilitate cascading, but the default implementation (default_determining_exclusions) implements the simplest case: no cascading. If it is used for anything other than the first cascade pass, it will throw an error.

In its place, we’ve used exclude_none for SSA, which, as the name suggests, does not exclude any records (since all SSA records remain eligible to match regardless of what is found in the first cascade pass). For W-2, we use exclude_clustered, which excludes records that have already been clustered with any other records; this implements the rule described above, that we can drop W-2 records that have already linked to an SSA record. All SSA records remain eligible, while some W-2 records are excluded, because there can be duplicate W-2 records for the same SSA record, but not the other way around.
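
The two exclusion rules can be sketched as follows. This is an illustration under our own assumptions, not the code of the real implementations: clusters are modeled as sets of (dataset, record ID) tuples, and the function names carry a _sketch suffix to mark them as hypothetical.

```python
# Clusters from a previous pass, as sets of (dataset, record_id) tuples.
clusters = [
    {("input_file_ssa", 0), ("input_file_w2", 0)},
    {("input_file_ssa", 1), ("input_file_w2", 1), ("input_file_w2", 3)},
    {("input_file_w2", 2)},  # singleton: not yet linked to anything
]


def exclude_none_sketch(dataset, clusters):
    """Exclude nothing: every record in the dataset stays eligible."""
    return set()


def exclude_clustered_sketch(dataset, clusters):
    """Exclude records that already share a cluster with at least one other record."""
    return {
        record_id
        for cluster in clusters
        if len(cluster) > 1
        for (ds, record_id) in cluster
        if ds == dataset
    }


ssa_excluded = exclude_none_sketch("input_file_ssa", clusters)
w2_excluded = exclude_clustered_sketch("input_file_w2", clusters)

w2_ids = [0, 1, 2, 3]
w2_remaining = [i for i in w2_ids if i not in w2_excluded]
print(ssa_excluded, w2_excluded, w2_remaining)
```

Only W-2 record 2 continues to the second pass, while both SSA records stay available so that a later W-2 duplicate can still join an existing cluster.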

The full pipeline specification YAML resulting from these changes is pipeline_demo_improved_cascade.yaml.

Now we’re ready to run the cascading pipeline on the 2020 data and check our accuracy results!

$ easylink run -p pipeline_demo_improved_cascade.yaml -i input_data_demo.yaml -e environment_local.yaml
$ python print_fp_fn_w2_ssa.py results/2025_06_26_11_32_15
12509 true links
len(false_positives)=47; len(false_negatives)=505

As we guessed, accuracy didn’t get better; it actually got a bit worse. We had 13 more false positives than the base improved model on 2020 data, and 17 more false negatives as well. When we look at how many pairs we evaluated though, we see the benefits:

$ pqprint results/2025_06_26_11_32_15/intermediate/entity_resolution_loop_2_clustering_linking_linking_evaluating_pairs_splink_evaluating_pairs/result.parquet
      Left Record Dataset  Left Record ID Right Record Dataset  Right Record ID  Probability
           input_file_ssa               0        input_file_w2             9815     0.000759
           input_file_ssa               1        input_file_w2             8608     0.000403
           input_file_ssa               5        input_file_w2             9177     0.000202
           input_file_ssa               6        input_file_w2             6195     0.001840
           input_file_ssa               7        input_file_w2             9177     0.000202
                      ...             ...                  ...              ...          ...
           input_file_ssa           16156        input_file_w2              466     0.000037
           input_file_ssa           16244        input_file_w2              466     0.000037
           input_file_ssa           16270        input_file_w2              466     0.000037
           input_file_ssa           16353        input_file_w2              466     0.000037
           input_file_ssa           16421        input_file_w2              466     0.000037

[91425 rows x 5 columns]

We evaluated only 17.7% as many pairs with our complex Splink model as when we weren’t using cascading! Clearly, the first pass with deterministic linkage is quite effective in reducing the size of the problem. Though there is no noticeable runtime speedup with these small data, this difference could be enormous for large data, where evaluating pairs becomes the bottleneck.
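
The percentage comes directly from the two row counts printed above; a quick check in Python:

```python
pairs_without_cascade = 515_795  # rows scored by Splink in the non-cascading run
pairs_with_cascade = 91_425      # rows scored by Splink in the second cascade pass

ratio = pairs_with_cascade / pairs_without_cascade
pairs_avoided = pairs_without_cascade - pairs_with_cascade

print(f"{ratio:.1%} as many pairs; {pairs_avoided} pairs avoided")
```

Note that this counts only pairs scored by the Splink model; the pairs handled in the first pass were accepted without any model evaluation.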

Wrapping Up 

In this tutorial, we’ve introduced EasyLink and demonstrated how to configure and run EasyLink pipelines, change step implementations, change input data, and evaluate and compare results between pipelines.

Not everything EasyLink can do has been covered in this tutorial. EasyLink currently includes a few more implementations we haven’t used here, can run pipelines on a computational cluster managed by Slurm or distribute work using Apache Spark, and has additional flexibility in the pipeline schema that we haven’t demonstrated here.

In its current state, EasyLink provides only one or two implementations for each step, does not yet have documentation to support users in creating their own implementations, and is not yet stable enough to be recommended as a tool for production pipelines. However, interested users are encouraged to utilize the provided implementations to their full potential by creating more pipelines, changing how implementations are configured, and linking different datasets.

We hope to be able to add more features in the future, including:

A full suite of implementations reflecting a range of common record linkage techniques
Documentation supporting users in creating their own implementations
User-experience improvements, especially for writing pipeline specifications and implementations
Auto-parallel sections for processing large-scale data