.. _getting_started:

===============
Getting Started
===============

.. contents:: :local:

Introduction
============
EasyLink is a tool that allows users to build and run highly configurable record linkage pipelines. 
Its configurability enables users to "mix and match" different pieces of record 
linkage software by ensuring that each piece of the pipeline conforms to standard patterns. 

For example, users at the Census Bureau could easily evaluate whether using a more sophisticated "blocking" 
method would improve results in a certain pipeline, without having to rewrite the entire pipeline.

Overview
--------
This tutorial introduces EasyLink concepts and features by demonstrating the software's usage. Covered 
concepts include the EasyLink record linkage "pipeline schema," EasyLink pipeline configuration, running 
pipelines, changing record linkage step implementations, changing input data, evaluating and comparing 
results, and more. 

Audience
--------
This tutorial is intended for people familiar with record linkage practices, who are interested
in easily comparing linkage results across different methods. This tutorial will *not* include 
introductory information about record linkage, though it demonstrates a simple example of it.

Tutorial prerequisites
----------------------
`Install EasyLink <https://github.com/ihmeuw/easylink?tab=readme-ov-file#installation>`_ if you haven't already. 

The tutorial uses the `Splink <https://moj-analytical-services.github.io/splink/index.html>`_ Python package 
for record linkage implementations. You do **not** need to install Splink. Splink knowledge is not 
required to complete the tutorial but may be helpful when configuring Splink models.


Simulated input data
--------------------
Our first demonstration of running an EasyLink pipeline will configure a simple, "naive" record linkage
model with implementations written using the Splink package. Our pipeline will link
two simulated datasets generated by our `pseudopeople <https://pseudopeople.readthedocs.io/en/latest/>`_
package: simulated `Social Security Administration records <https://pseudopeople.readthedocs.io/en/latest/datasets/index.html#social-security-administration>`_
and simulated `W-2 and 1099 employment tax forms <https://pseudopeople.readthedocs.io/en/latest/datasets/index.html#tax-forms-w-2-1099>`_.
These datasets describe entirely simulated people, whom we call "simulants,"
but they contain realistic data, including "noise" such as typos,
for an authentic record linkage challenge.


Naive model - running a pipeline
================================
Let's start by using the ``easylink run`` :ref:`command <cli>` to run a pipeline that configures a simple 
record linkage model.

First we need to download the configuration files we will pass to the command line: 
:download:`input_data_demo.yaml` and :download:`pipeline_demo_naive.yaml`. Save them into the directory
from which you will execute the ``easylink run`` command. 

``input_data_demo.yaml`` additionally references a few 
input files which we will save as well. Save :download:`known_clusters.parquet` to the same directory as
the other files, then create a subdirectory called ``2020`` and save :download:`input_file_ssa.parquet <2020/input_file_ssa.parquet>` and 
:download:`input_file_w2.parquet <2020/input_file_w2.parquet>` into it.

Now we can run the pipeline. Note that if this is your first time running EasyLink, the command will first
download the required `Singularity <https://docs.sylabs.io/guides/latest/user-guide/introduction.html>`_
container images. These files contain the code EasyLink will run for each step in the record linkage 
pipeline. The total amount to be downloaded is approximately 5 GB, so we recommend first running the command 
below, then reading the information about it while the files download. Hopefully the download will be 
complete by the time you reach the next interactive section! The progress of your image downloads will be 
displayed in the console.

.. code-block:: console

  $ easylink run -p pipeline_demo_naive.yaml -i input_data_demo.yaml
   2025-06-30 14:17:58 | 00:00:01 | Running pipeline
   2025-06-30 14:17:58 | 00:00:01 | Results directory: /mnt/share/homes/tylerdy/easylink/docs/source/user_guide/tutorials/results/2025_06_26_10_13_31
   ... Downloading Images ...
   2025-06-30 14:18:21 | 00:00:24 | Running Snakemake
   2025-06-30 14:18:22 | 00:00:25 | Validating determining_exclusions_and_removing_records_clone_2_removing_records_default_removing_records input slot input_datasets
   ...
   2025-06-30 14:18:24 | 00:00:27 | Running clusters_to_links implementation: default_clusters_to_links
   2025-06-30 14:18:24 | 00:00:27 | Running determining_exclusions implementation: default_determining_exclusions
   2025-06-30 14:18:24 | 00:00:27 | Running determining_exclusions implementation: default_determining_exclusions
   ...
   2025-06-30 14:18:39 | 00:00:42 | Validating splink_blocking_and_filtering input slot records
   2025-06-30 14:18:42 | 00:00:45 | Running blocking_and_filtering implementation: splink_blocking_and_filtering
   2025-06-30 14:18:50 | 00:00:53 | Validating splink_evaluating_pairs input slot blocks
   2025-06-30 14:18:53 | 00:00:56 | Running evaluating_pairs implementation: splink_evaluating_pairs
   ...
   2025-06-30 14:19:19 | 00:01:22 | Running canonicalizing_and_downstream_analysis implementation: save_clusters
   2025-06-30 14:19:21 | 00:01:24 | Validating results input slot analysis_output
   2025-06-30 14:19:23 | 00:01:26 | Grabbing final output
   2025-06-30 14:19:26 | 00:01:29 | Pipeline finished running - full log saved to: /mnt/share/homes/tylerdy/easylink/docs/source/user_guide/tutorials/results/2025_06_26_10_13_31/pipeline.log

Success! Our pipeline has linked the input data and written out its results: the clusters of records it found. We'll take a look 
at these results later and see how the model performed.

Naive model - command line arguments
====================================
This section will explain the command line arguments and show the file we pass to each one, including the 
pipeline specification YAML and how it relates to the EasyLink pipeline schema. That file can look a 
little complicated at first, so feel free to skip ahead to the :ref:`naive_results` section, where the 
interactive part of the tutorial continues, and come back later.

..
   TODO: possibly move this elsewhere

   Computing Environment
   ---------------------
   The ``--computing-environment`` (``-e``) argument to ``easylink run`` accepts a YAML file specifying 
   information about the computing environment which will execute the steps of the 
   pipeline. We passed ``environment_local.yaml``, the contents of which are shown below:

   .. code-block:: yaml

      computing_environment: local
      container_engine: singularity

   It specifies a ``local`` computing environment using ``singularity`` as the container engine. These parameters indicate that no new compute resources will 
   be used to execute the pipeline steps, and that the Singularity container for each implementation will run within the context where ``easylink run`` is being executed.
   For example, if you ran the ``easylink run`` command on your laptop, the implementations would run on your laptop;
   if you ran the ``easylink run`` command on a cloud (e.g. EC2) instance that you were connected to with SSH, the implementations would run on that instance,
   and so on.

Input data
----------
The ``--input-data`` (``-i``) argument to ``easylink run`` accepts a YAML file specifying a list 
of paths to files or directories containing input data to be used by the pipeline. 
We passed ``input_data_demo.yaml``, the contents of which are shown below:

.. code-block:: yaml

  input_file_ssa: 2020/input_file_ssa.parquet
  input_file_w2: 2020/input_file_w2.parquet
  known_clusters: known_clusters.parquet

Here we have defined the locations of the three input files we will use: the 2020 versions of the 
two pseudopeople datasets, and an empty ``known_clusters`` file, since no
clusters are known to us before running this pipeline. 
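
Before launching a run, it can be handy to confirm that every path listed in the input YAML 
actually resolves to a file. The sketch below is not part of EasyLink: ``missing_inputs`` is a 
hypothetical helper, and the mapping is copied by hand from ``input_data_demo.yaml`` rather than 
parsed from it.

.. code-block:: python

  from pathlib import Path

  # Copied by hand from input_data_demo.yaml (dataset name -> relative path).
  INPUT_DATA = {
      "input_file_ssa": "2020/input_file_ssa.parquet",
      "input_file_w2": "2020/input_file_w2.parquet",
      "known_clusters": "known_clusters.parquet",
  }


  def missing_inputs(input_data, base_dir="."):
      """Return the sorted dataset names whose files do not exist under base_dir."""
      base = Path(base_dir)
      return sorted(
          name for name, rel_path in input_data.items() if not (base / rel_path).is_file()
      )

Calling ``missing_inputs(INPUT_DATA)`` from the directory where you saved the tutorial files 
should return an empty list if everything is in place.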

.. note::
    To meet the input specifications for :ref:`datasets` defined by the pipeline schema (see the next section),
    the SSA and W2 datasets, after being generated by pseudopeople, were modified
    to add the required ``Record ID`` column. Separately, for data cleaning rather than specification reasons, 
    SSA death records were removed, leaving only SSN creation records.
  

Pipeline specification
----------------------
The ``--pipeline-specification`` (``-p``) argument to ``easylink run`` accepts a YAML file specifying 
the implementations and other configuration options for the pipeline being run. We passed 
``pipeline_demo_naive.yaml``, the contents of which can be seen by clicking below:

.. raw:: html

   <details>
   <summary>Show pipeline_demo_naive.yaml</summary>

.. literalinclude:: pipeline_demo_naive.yaml
   :language: YAML

.. raw:: html

  </details>

The pipeline specification follows the structure defined in the :ref:`pipeline_schema`, a very important
part of EasyLink. The EasyLink pipeline schema enforces the standard patterns that implementations of each step in the linkage process must follow.
These standard patterns enable easy configuration and swapping.

There are some flexible sections in the pipeline schema, such as :ref:`cloneable sections <cloneable_sections>`, which allow a pipeline to create multiple copies of that section and use different 
implementations or inputs for each copy. We'll see one of those soon.

.. important::

  Before proceeding, it's important to understand the relationship between a pipeline, a pipeline 
  specification (YAML file), and the pipeline schema:

  - A `pipeline <https://easylink.readthedocs.io/en/latest/concepts/pipeline_schema/index.html#pipelines>`_ 
    consists of a complete set of software which can perform a whole record linkage task, taking in record datasets as inputs and outputting 
    a result such as clusters of records or some analysis on those clusters. EasyLink makes it simple to define and run 
    many different pipelines in order to experiment with what methods yield the best results for a task.
  - A pipeline specification is a YAML file that defines a pipeline which can be run with EasyLink. The specification selects the 
    implementation which will be run for each step, and supplies any necessary configuration for those implementations. An 
    example specification can be expanded above.
  - The EasyLink :ref:`pipeline_schema` defines the universe of pipelines that can be constructed using EasyLink, including
    steps, inputs and outputs, and operators, as described above. All pipelines must adhere to the pipeline schema and implement all its steps! 

Top-level steps
^^^^^^^^^^^^^^^

Let's take a closer look at the pipeline specification YAML bit by bit. We'll start at the top level.

.. code-block:: yaml

  steps:
    entity_resolution:
      substeps:
        ...
    canonicalizing_and_downstream_analysis:
      implementation:
        name: save_clusters

This code block shows the same file, but with all the substeps of ``entity_resolution`` hidden, 
like in :ref:`this diagram <easylink_pipeline_schema>`
of the pipeline schema. Each time we link to one of these diagrams, the text below will also describe what 
each of the substeps involved does.

The children of the ``steps`` key are the top-level steps in the pipeline - as you can see, there are 
only two. We can see our first example of a step being configured if we look at ``canonicalizing_and_downstream_analysis``. 
The children of the ``implementation`` key define and configure the code we will run for 
:ref:`the canonicalizing and downstream analysis step <canonicalizing>`.
This step is intended to be used for determining best representative ("canonical") records for each cluster, and/or
doing some kind of summary data analysis (such as a linear regression) within EasyLink.
In this case, we won't do either of these things, and simply save the resolved clusters with no additional processing.
We use the ``name`` key to choose the ``save_clusters`` implementation of ``canonicalizing_and_downstream_analysis``.
``save_clusters`` corresponds to one of the images which was downloaded the first time you ran the pipeline.

Entity resolution substeps
^^^^^^^^^^^^^^^^^^^^^^^^^^

Next we will show the elided part of the above code block, which corresponds to 
:ref:`this diagram <entity_resolution_sub_steps>`
in the pipeline schema.

.. code-block:: yaml

  determining_exclusions_and_removing_records:
    clones:
      - determining_exclusions:
          implementation:
            name: default_determining_exclusions
            configuration:
              INPUT_DATASET: input_file_ssa
        removing_records:
          implementation:
            name: default_removing_records
            configuration:
              INPUT_DATASET: input_file_ssa
      - determining_exclusions:
          implementation:
            name: default_determining_exclusions
            configuration:
              INPUT_DATASET: input_file_w2
        removing_records:
          implementation:
            name: default_removing_records
            configuration:
              INPUT_DATASET: input_file_w2
  clustering:
    substeps:
      ...
  updating_clusters:
    implementation:
      name: default_updating_clusters

The last step shown, ``updating_clusters``, looks similar to ``canonicalizing_and_downstream_analysis`` above; it simply chooses 
an implementation for the step using the ``name`` key. 

The substeps of ``clustering`` are hidden -- we'll look at them next. 

The complicated part is ``determining_exclusions_and_removing_records`` and its ``clones`` key:

As described :ref:`in the pipeline schema <entity_resolution_sub_steps>`, the steps "determining exclusions and removing records" identify and remove
records that can be excluded from this linking pass to save computational time, generally because they have 
already been assigned to clusters.

The schema can define :ref:`cloneable sections <cloneable_sections>`, which allow a pipeline to create 
multiple copies of that section and use different implementations or inputs
for each copy. We can see that the :ref:`entity resolution sub-steps <entity_resolution_sub_steps>` schema section defines
``determining_exclusions`` and ``removing_records`` as cloneable in the diagram 
(blue dashed box).

In the YAML, the cloneable superstep ``determining_exclusions_and_removing_records`` is expanded 
using the ``clones`` key, and two copies are made of its substeps, 
``determining_exclusions`` and ``removing_records``. The ``-`` denotes the beginning
of an item in a `YAML collection <https://yaml.org/spec/1.2.2/#21-collections>`_.

We can see that the only difference between the two copies is which dataset name is passed 
to the ``INPUT_DATASET`` configuration key for each step. In 
the first copy, ``input_file_ssa`` is used as the input for both steps, 
while in the second copy, ``input_file_w2`` is the input. In practice, 
this means that records to exclude will be identified and removed separately for 
each input file, as required by the schema since each input file has different data. 
This cloneable section also allows different implementations to be used for each dataset 
if desired.

.. note::
  All the steps listed here use ``default`` implementations.
  Much of the time, steps with default implementations aren't very interesting to change,
  and the defaults will do whatever operation is the common or simple case.
  The pipeline schema section linked above the code block describes the behavior 
  of each of these default implementations.

Clustering substeps
^^^^^^^^^^^^^^^^^^^

Next we will show the elided part of the above code block, which corresponds to 
`this diagram <https://easylink.readthedocs.io/en/latest/concepts/pipeline_schema/index.html#clustering-sub-steps>`__
in the pipeline schema.

.. code-block:: yaml

  clusters_to_links:
    implementation:
      name: default_clusters_to_links
  linking:
    substeps:
      ...
  links_to_clusters:
    implementation:
      name: one_to_many_links_to_clusters
      configuration:
        DUPLICATE_FREE_DATASET: input_file_ssa
        THRESHOLD_MATCH_PROBABILITY: 0.996

We will show the hidden linking substeps in the next section. 

In ``links_to_clusters`` we see a more interesting example of configuring an implementation.
``DUPLICATE_FREE_DATASET`` specifies which dataset is assumed not to contain duplicates within it.
``THRESHOLD_MATCH_PROBABILITY`` sets the minimum match probability at which a pair of records 
will be considered part of the same cluster.
Our ``one_to_many_links_to_clusters`` implementation implements `this step <https://easylink.readthedocs.io/en/latest/concepts/pipeline_schema/index.html#links-to-clusters>`_ by filtering out links
below the threshold, and then choosing the single *best-matching* SSA record for each
W-2 record (since linking a W-2 to multiple SSA records would imply those SSA records were duplicates).
The name of the implementation reflects that in the resulting clusters, *one* SSA record can have *many*
W-2 records (but not vice versa).
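
To make the one-to-many logic concrete, here is a toy sketch of the two operations just described: 
dropping links below the threshold, then keeping only the best SSA link for each W-2 record. It is 
an illustration only, not the code inside ``one_to_many_links_to_clusters``; the function name, the 
tuple layout, and the tie-breaking rule (the first link seen wins) are all assumptions.

.. code-block:: python

  def one_to_many_clusters(links, threshold):
      """Keep, for each W-2 record, its best SSA link at or above the threshold.

      links: iterable of (ssa_id, w2_id, probability) tuples.
      Returns a dict mapping each SSA record ID to a sorted list of W-2 record IDs.
      """
      best = {}  # w2_id -> (probability, ssa_id) of the best link seen so far
      for ssa_id, w2_id, prob in links:
          if prob < threshold:
              continue  # links below the threshold are filtered out
          if w2_id not in best or prob > best[w2_id][0]:
              best[w2_id] = (prob, ssa_id)

      clusters = {}
      for w2_id, (_, ssa_id) in best.items():
          clusters.setdefault(ssa_id, []).append(w2_id)
      return {ssa_id: sorted(w2_ids) for ssa_id, w2_ids in clusters.items()}


  # Made-up links for illustration.
  links = [
      (10, 0, 0.999),  # W-2 0 links strongly to SSA 10
      (11, 0, 0.997),  # ...and slightly less strongly to SSA 11, which loses
      (10, 1, 0.998),  # W-2 1 also links to SSA 10: one SSA record, many W-2 records
      (12, 2, 0.500),  # below the threshold, so W-2 2 stays unclustered
  ]
  print(one_to_many_clusters(links, threshold=0.996))  # {10: [0, 1]}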

While this implementation doesn't use the Splink package,
the Splink docs have
`helpful info <https://moj-analytical-services.github.io/splink/topic_guides/evaluation/edge_overview.html#choosing-a-threshold>`__ on
how to choose a probability threshold.

Linking substeps
^^^^^^^^^^^^^^^^

Next we will show the elided part of the above code block, which corresponds to 
`this diagram <https://easylink.readthedocs.io/en/latest/concepts/pipeline_schema/index.html#linking-sub-steps>`__
in the pipeline schema.

.. code-block:: yaml

  pre-processing:
    clones:
      - implementation:
          name: middle_name_to_initial
          configuration: 
            INPUT_DATASET: input_file_ssa
      - implementation:
          name: no_pre-processing
          configuration: 
            INPUT_DATASET: input_file_w2
  schema_alignment:
    implementation:
      name: default_schema_alignment
  blocking_and_filtering:
    implementation:
      name: splink_blocking_and_filtering
      configuration:
        LINK_ONLY: true
        BLOCKING_RULES: "l.first_name == r.first_name,l.last_name == r.last_name"
  evaluating_pairs:
    implementation:
      name: splink_evaluating_pairs
      configuration:
        LINK_ONLY: true
        BLOCKING_RULES_FOR_TRAINING: "l.first_name == r.first_name,l.last_name == r.last_name"
        COMPARISONS: "ssn:exact,first_name:exact,middle_initial:exact,last_name:exact"
        PROBABILITY_TWO_RANDOM_RECORDS_MATCH: 0.0001  # == 1 / len(w2)

We see that ``pre-processing`` is another cloneable step, allowing us to select different pre-processing implementations for different
input datasets. In this case, we leave the ``w2`` dataset unchanged, while changing the ``middle_name`` column in the ``ssa`` dataset 
to a ``middle_initial`` column to match the ``w2`` data.

Finally, we will configure the two Splink implementations.

For ``splink_blocking_and_filtering``, we set:

.. code-block:: yaml

    LINK_ONLY: true
    BLOCKING_RULES: "l.first_name == r.first_name,l.last_name == r.last_name"

The first variable instructs Splink to link records between datasets without deduplicating within 
datasets.
The second is used by the Splink implementation to define which pairs of records 
will be considered as possible matches (only records with matching first or last names).
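
As a rough picture of what these two blocking rules select, the sketch below applies the same 
"first names match *or* last names match" test to a handful of made-up records. Splink evaluates 
blocking rules as SQL rather than by looping in Python, so this is only an illustration of the 
resulting set of pairs; ``candidate_pairs`` and the record layout are assumptions.

.. code-block:: python

  from itertools import product


  def candidate_pairs(left, right):
      """Return (left_id, right_id) pairs that satisfy at least one blocking rule."""
      return [
          (l_rec["id"], r_rec["id"])
          for l_rec, r_rec in product(left, right)
          if l_rec["first_name"] == r_rec["first_name"]
          or l_rec["last_name"] == r_rec["last_name"]
      ]


  # Made-up records for illustration.
  ssa = [
      {"id": 0, "first_name": "Evelyn", "last_name": "Smith"},
      {"id": 1, "first_name": "George", "last_name": "Jones"},
  ]
  w2 = [
      {"id": 0, "first_name": "Evelyn", "last_name": "Smyth"},  # first name matches SSA 0
      {"id": 1, "first_name": "Marcus", "last_name": "Jones"},  # last name matches SSA 1
      {"id": 2, "first_name": "Analia", "last_name": "Brown"},  # matches nothing
  ]
  print(candidate_pairs(ssa, w2))  # [(0, 0), (1, 1)]: 2 of the 6 possible pairs

Only the pairs that survive blocking are passed on to be scored, which is what keeps the number of 
comparisons manageable on larger datasets.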

For ``splink_evaluating_pairs``, we set:

.. code-block:: yaml

  LINK_ONLY: true
  BLOCKING_RULES_FOR_TRAINING: "l.first_name == r.first_name,l.last_name == r.last_name"
  COMPARISONS: "ssn:exact,first_name:exact,middle_initial:exact,last_name:exact"
  PROBABILITY_TWO_RANDOM_RECORDS_MATCH: 0.0001  # == 1 / len(w2)

The first two variables are used similarly to the previous implementation.
``BLOCKING_RULES_FOR_TRAINING`` is specifically used for `estimating parameters in the model <https://moj-analytical-services.github.io/splink/demos/tutorials/04_Estimating_model_parameters.html#estimating-with-expectation-maximisation>`_.
``COMPARISONS``
defines the columns which will be compared by the Splink model, and how Splink will evaluate
whether the column values match (exact comparisons). The fourth is a parameter used in training
the model and making predictions
(`see the Splink docs for more info <https://moj-analytical-services.github.io/splink/demos/tutorials/04_Estimating_model_parameters.html#estimation-of-probability_two_random_records_match>`__). 


And that's the whole pipeline specification for our naive Splink model! Next let's take a look at the results from when we ran the 
pipeline earlier.

.. _naive_results:

Naive model - results
=====================

Input and output data is stored in Parquet files. For example, to see our original records, 
we can view the contents of the input files listed in ``input_data_demo.yaml`` using Python:

.. code-block:: console

  $ # Activate your EasyLink conda environment!
  $ python
  >>> import pandas as pd
  >>> pd.read_parquet("2020/input_file_ssa.parquet")
        simulant_id          ssn first_name    middle_name  ...     sex event_type event_date Record ID
  0         0_19979  786-77-6454     Evelyn  Granddaughter  ...  Female   creation   19191204         0
  1          0_6846  688-88-6377     George         Robert  ...    Male   creation   19210616         1
  2         0_19983  651-33-9561   Beatrice         Jennie  ...  Female   creation   19220113         2
  3           0_262  665-25-7858       Eura         Nadine  ...  Female   creation   19220305         3
  4         0_12473  875-10-2359    Roberta           Ruth  ...  Female   creation   19220306         4
  ...           ...          ...        ...            ...  ...     ...        ...        ...       ...
  16492     0_20687  183-90-0619    Matthew        Michael  ...  Female   creation   20201229     16492
  16493     0_20686  803-81-8527     Jermey          Tyler  ...    Male   creation   20201229     16493
  16494     0_20692  170-62-5253  Brittanie         Lauren  ...  Female   creation   20201229     16494
  16495     0_20662  281-88-9330     Marcus         Jasper  ...    Male   creation   20201230     16495
  16496     0_20673  547-99-7034     Analia        Brielle  ...  Female   creation   20201231     16496
  [15984 rows x 10 columns]

  >>> pd.read_parquet("2020/input_file_w2.parquet")
      simulant_id household_id employer_id          ssn  ... mailing_address_zipcode tax_form tax_year Record ID
  0            0_4          0_8          95  584-16-0130  ...                   00000       W2     2020         0
  1            0_5          0_8          29  854-13-6295  ...                   00000       W2     2020         1
  2            0_5          0_8          30  854-13-6295  ...                   00000       W2     2020         2
  3         0_5621       0_2289          46  674-27-1745  ...                   00000       W2     2020         3
  4         0_5623       0_2289          83  794-23-1522  ...                   00000       W2     2020         4
  ...          ...          ...         ...          ...  ...                     ...      ...      ...       ...
  9898     0_18936       0_7621          23  006-92-7857  ...                   00000       W2     2020      9898
  9899     0_18936       0_7621          90  006-92-7857  ...                   00000       W2     2020      9899
  9900     0_18937       0_7621           1  182-82-5017  ...                   00000     1099     2020      9900
  9901     0_18937       0_7621         105  182-82-5017  ...                   00000     1099     2020      9901
  9902     0_18939       0_7621           9  283-97-5940  ...                   00000       W2     2020      9902
  [9903 rows x 25 columns]

  >>> pd.read_parquet("known_clusters.parquet")
  Empty DataFrame
  Columns: [Input Record Dataset, Input Record ID, Cluster ID]
  Index: []

It can also be useful to define a small shell function for previewing Parquet files.
Run the following line to do so.
(If you want the function to persist across terminal restarts, you can add it to your ``.bashrc`` or ``.bash_aliases`` in your home directory.)

.. code-block:: console

   pqprint() { python -c "import pandas as pd; print(pd.read_parquet('$1'))" ; }

Let's use ``pqprint`` to print the results Parquet file, the location of which was printed when we ran the pipeline.

.. code-block:: console

  $ pqprint results/2025_06_26_10_13_31/result.parquet 
        Input Record Dataset  Input Record ID  Cluster ID
  0           input_file_ssa             7371           1
  1            input_file_w2                0           1
  2           input_file_ssa             7037           2
  3            input_file_w2                2           2
  4            input_file_w2                1           2
  ...                    ...              ...         ...
  15810        input_file_w2              997        6693
  15811       input_file_ssa             5883        6693
  15812        input_file_w2              999        6694
  15813       input_file_ssa             6358        6694
  15814        input_file_w2              998        6694

  [15815 rows x 3 columns]

As we can see, the pipeline has successfully assigned a ``Cluster ID`` to every 
input record it was able to link to another record at our probability threshold 
of 99.6%.
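
A quick way to get a feel for the clusters is to count how many there are of each size. The 
helper below is a sketch written for this tutorial, not something EasyLink provides; it works on 
plain dictionaries so that it can be tried on the first five rows of the preview above.

.. code-block:: python

  from collections import Counter


  def cluster_size_counts(rows):
      """Map each cluster size to the number of clusters of that size.

      rows: iterable of dicts with a "Cluster ID" key, one per clustered record.
      """
      records_per_cluster = Counter(row["Cluster ID"] for row in rows)
      return dict(sorted(Counter(records_per_cluster.values()).items()))


  # The first five rows of the result preview above.
  rows = [
      {"Input Record Dataset": "input_file_ssa", "Input Record ID": 7371, "Cluster ID": 1},
      {"Input Record Dataset": "input_file_w2", "Input Record ID": 0, "Cluster ID": 1},
      {"Input Record Dataset": "input_file_ssa", "Input Record ID": 7037, "Cluster ID": 2},
      {"Input Record Dataset": "input_file_w2", "Input Record ID": 2, "Cluster ID": 2},
      {"Input Record Dataset": "input_file_w2", "Input Record ID": 1, "Cluster ID": 2},
  ]
  print(cluster_size_counts(rows))  # {2: 1, 3: 1}

To run it on a full results file, one option is to pass 
``pd.read_parquet(path).to_dict("records")`` as ``rows``.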

.. note::

  Running the pipeline also generates a :download:`DAG.svg <DAG-naive-pipeline.svg>` file in 
  the results directory which shows the implementations, data dependencies and 
  input validations present in the pipeline. Due to the large number of steps, the figure is 
  not very readable when embedded in this page, but can be opened in a new tab to allow for
  zooming in.

To see how the model linked pairs of records before resolving them into clusters, we can 
look at the intermediate output produced by the ``splink_evaluating_pairs`` 
implementation::

  $ pqprint results/2025_06_26_10_13_31/intermediate/splink_evaluating_pairs/result.parquet 
        Left Record Dataset  Left Record ID Right Record Dataset  Right Record ID   Probability
  0           input_file_ssa           16314        input_file_w2             7604  5.593631e-06
  1           input_file_ssa           16318        input_file_w2             7604  5.593631e-06
  2           input_file_ssa           16326        input_file_w2             6049  5.593631e-06
  3           input_file_ssa           16351        input_file_w2             3549  5.593631e-06
  4           input_file_ssa           16353        input_file_w2             7434  5.593631e-06
  ...                    ...             ...                  ...              ...           ...
  515790      input_file_ssa            8586        input_file_w2              943  3.526073e-04
  515791      input_file_ssa            8591        input_file_w2             3326  7.227902e-07
  515792      input_file_ssa            8595        input_file_w2             3369  7.227902e-07
  515793      input_file_ssa            8596        input_file_w2             6458  3.526073e-04
  515794      input_file_ssa            8597        input_file_w2             3248  7.227902e-07

  [515795 rows x 5 columns]

The record pairs displayed in the preview are all far below the match threshold, but the full results could 
be investigated further using ``pandas.read_parquet()`` in a Python session.

The Splink implementations in our pipeline also produce some diagnostic charts which can be useful 
for evaluating results, such as the :download:`match weights chart <naive_match_weights.html>` 
(`Splink docs <https://moj-analytical-services.github.io/splink/charts/match_weights_chart.html>`__) and 
:download:`comparison viewer tool <naive_comparison_viewer.html>` 
(`Splink docs <https://moj-analytical-services.github.io/splink/charts/comparison_viewer_dashboard.html>`__). 
These charts are from the 
``diagnostics/splink_evaluating_pairs`` subdirectory of the results directory for each pipeline run.

Finally, since we are using simulated input datasets, and therefore know the ground truth of 
which record pairs are true links, we can directly see how our naive model performed with the help of 
a script to evaluate false positives and false negatives, :download:`print_fp_fn_w2_ssa.py`.
Download and run it::

  $ python print_fp_fn_w2_ssa.py results/2025_06_26_10_13_31
  12509 true links
  len(false_positives)=31; len(false_negatives)=555

In other words, with a threshold 
probability of 99.6%, out of 12,509 true links to be found, our model missed 555 (false negatives),
and additionally linked 31 pairs that shouldn't have been linked (false positives). 
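
These counts can also be summarized as precision and recall. The sketch below assumes that the 
"true links" figure counts every true link, whether found or missed, so that the number of true 
positives is the true links minus the false negatives. ``precision_recall`` is a helper written 
for this tutorial and is not part of the evaluation script.

.. code-block:: python

  def precision_recall(true_links, false_positives, false_negatives):
      """Compute precision and recall from the evaluation script's counts.

      Assumes true_links = true positives + false negatives.
      """
      true_positives = true_links - false_negatives
      precision = true_positives / (true_positives + false_positives)
      recall = true_positives / true_links
      return precision, recall


  # Counts from the naive run above.
  precision, recall = precision_recall(
      true_links=12509, false_positives=31, false_negatives=555
  )
  print(f"precision={precision:.4f} recall={recall:.4f}")
  # precision=0.9974 recall=0.9556

Under that assumption, the naive model's links are almost always correct, while it finds a little 
under 96% of the true links.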


Depending on our goals with the linked data, we might decrease the threshold to reduce false negatives,
at the cost of increased false positives.
But this was a simple linkage model.
Let's improve it to see if we can get a better performance tradeoff!


Configuring an improved pipeline
================================
Next, let's modify our naive pipeline configuration YAML to try to improve our results. Primarily, we 
will change the ``COMPARISONS`` we pass to ``splink_evaluating_pairs`` to use flexible comparison 
methods rather than exact matches, allowing us to link records which have typos or other noise in them. We'll 
use a new pipeline configuration YAML, :download:`pipeline_demo_improved.yaml`, with these changes.

In ``splink_evaluating_pairs``, we make the following change:

.. code-block:: diff

     LINK_ONLY: true
     BLOCKING_RULES_FOR_TRAINING: "l.first_name == r.first_name,l.last_name == r.last_name"
  -  COMPARISONS: "ssn:exact,first_name:exact,middle_initial:exact,last_name:exact"
  +  COMPARISONS: "ssn:levenshtein,first_name:name,middle_initial:exact,last_name:name"
     PROBABILITY_TWO_RANDOM_RECORDS_MATCH: 0.0001  # == 1 / len(w2)


``COMPARISONS`` now uses 
`Levenshtein <https://moj-analytical-services.github.io/splink/api_docs/comparison_library.html#splink.comparison_library.LevenshteinAtThresholds>`_
comparisons for ``ssn``, and 
`Name <https://moj-analytical-services.github.io/splink/api_docs/comparison_library.html#splink.comparison_library.NameComparison>`_
comparisons for ``first_name`` and ``last_name``, to link similar but not identical SSNs and names.
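
For intuition about why a Levenshtein comparison helps with noisy SSNs, the sketch below computes 
the edit distance between two strings using the textbook dynamic-programming approach. It only 
illustrates the metric: Splink computes the distance in its own backend and then groups distances 
into comparison levels, neither of which is shown here. The second and third SSNs are made-up 
variations on one from the preview earlier.

.. code-block:: python

  def levenshtein(a, b):
      """Return the minimum number of single-character edits turning a into b."""
      previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
      for i, char_a in enumerate(a, start=1):
          current = [i]  # distance from a[:i] to ""
          for j, char_b in enumerate(b, start=1):
              current.append(
                  min(
                      previous[j] + 1,  # delete char_a
                      current[j - 1] + 1,  # insert char_b
                      previous[j - 1] + (char_a != char_b),  # substitute (free if equal)
                  )
              )
          previous = current
      return previous[-1]


  print(levenshtein("786-77-6454", "786-77-6454"))  # 0: identical SSNs
  print(levenshtein("786-77-6454", "786-77-6459"))  # 1: one mistyped digit
  print(levenshtein("786-77-6454", "786-7-6454"))  # 1: one dropped digit

An exact comparison treats the last two pairs as plain mismatches, whereas a distance-based 
comparison can recognize them as near matches.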

By re-running the pipeline with these changes and then running the evaluation script, we can see how our results compare::

  $ easylink run -p pipeline_demo_improved.yaml -i input_data_demo.yaml
  $ python print_fp_fn_w2_ssa.py results/2025_06_26_11_08_57
  12509 true links
  len(false_positives)=34; len(false_negatives)=488

We eliminated 67 false negatives compared to the naive results, thanks to our model linking more records with columns that 
are similar but don't exactly match.
At the cost of only three additional false positives, this seems like a good improvement!

Linking 2030 datasets using improved pipeline
=============================================
Let's run this same "improved" pipeline, but using :download:`input_data_demo_2030.yaml` 
as the input YAML, which uses the SSA and W-2 datasets from 2030 rather than 2020.
Like we did for 2020, we'll create a ``2030`` directory and save :download:`input_file_ssa.parquet <2030/input_file_ssa.parquet>` and 
:download:`input_file_w2.parquet <2030/input_file_w2.parquet>` into it.

We can run the same pipeline on different data by changing only the input parameter::

  $ easylink run -p pipeline_demo_improved.yaml -i input_data_demo_2030.yaml
  $ python print_fp_fn_w2_ssa.py results/2025_06_26_11_17_52
  13888 true links
  len(false_positives)=33; len(false_negatives)=547

We get similar, but not identical, results with the 2030 data.

Linking with an iterative "cascade"
===================================

*Cascading* is an iterative approach to entity resolution
used by the US Census Bureau (and possibly other organizations too)
to deal with the computational challenge of linking billions of records.
In cascading, multiple passes are made to find clusters, starting with
faster techniques (such as exact matching) that
can solve some "easy" cases and make the problem smaller.
As the focus narrows to only the records that
are hardest to cluster,
more sophisticated and computationally expensive
techniques become feasible.

Cascading isn't found very often in the scientific literature, and its statistical
properties are under-theorized.

Cascading depends on having some way to determine, from an initial/provisional
linkage result, which records are "done" and do not need to be considered any longer.
In our case, because we know (or are willing to assume) that there are no duplicates in
the SSA dataset, that means that any W-2 record that has already linked to one SSA record
has found its only match and does not need to be compared against any more SSA records.

Cascading can involve any number of iterative "passes," but for simplicity we will consider
only two.
In the first pass, we'll use deterministic linkage, linking records that match exactly on SSN, first name,
and last name.
In the second pass, we'll use our improved Splink model on the remaining records.
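
The two-pass idea can be sketched in plain Python. This is a toy illustration only, not how
EasyLink or Splink implement it: the records are made up, the function names are hypothetical,
and ``difflib.SequenceMatcher`` with an arbitrary threshold stands in for a real probabilistic model.

.. code-block:: python

  from difflib import SequenceMatcher

  KEY_FIELDS = ("ssn", "first_name", "last_name")


  def exact_pass(ssa, w2):
      """Pass 1: link W-2 records that match an SSA record exactly on all key fields."""
      # Safe to index by key because we assume SSA has no duplicates.
      index = {tuple(rec[f] for f in KEY_FIELDS): ssa_id for ssa_id, rec in ssa.items()}
      links = {}
      for w2_id, rec in w2.items():
          ssa_id = index.get(tuple(rec[f] for f in KEY_FIELDS))
          if ssa_id is not None:
              links[w2_id] = ssa_id
      return links


  def similarity(a, b):
      """Average string similarity across the key fields (stand-in for a real model)."""
      return sum(SequenceMatcher(None, a[f], b[f]).ratio() for f in KEY_FIELDS) / len(KEY_FIELDS)


  def fuzzy_pass(ssa, w2, threshold=0.85):
      """Pass 2: compare each remaining W-2 record against every SSA record."""
      links, pairs_evaluated = {}, 0
      for w2_id, w2_rec in w2.items():
          best_id, best_score = None, threshold
          for ssa_id, ssa_rec in ssa.items():
              pairs_evaluated += 1
              score = similarity(ssa_rec, w2_rec)
              if score >= best_score:
                  best_id, best_score = ssa_id, score
          if best_id is not None:
              links[w2_id] = best_id
      return links, pairs_evaluated


  # Made-up records for illustration
  ssa = {
      1: {"ssn": "123-45-6789", "first_name": "Maria", "last_name": "Lopez"},
      2: {"ssn": "987-65-4321", "first_name": "John", "last_name": "Smith"},
  }
  w2 = {
      "a": {"ssn": "123-45-6789", "first_name": "Maria", "last_name": "Lopez"},
      "b": {"ssn": "987-65-4321", "first_name": "Jon", "last_name": "Smith"},
  }

  first_links = exact_pass(ssa, w2)
  # W-2 records linked in pass 1 are "done" and are dropped before pass 2.
  remaining_w2 = {k: v for k, v in w2.items() if k not in first_links}
  second_links, pairs_evaluated = fuzzy_pass(ssa, remaining_w2)
  all_links = {**first_links, **second_links}
  print(all_links, pairs_evaluated)

In this toy example the expensive comparison runs on two pairs instead of four, because
W-2 record ``a`` was already resolved by the exact pass.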

We don't expect cascading to make our results any more accurate -- in fact, it seems likely
that doing things in two steps might lead to a few more mistakes.
Since cascading is a computational optimization, let's get a baseline of how much computation we
are doing without it: how many pairs we evaluated with the improved Splink model.
For this baseline, we'll return to the 2020 data, so you'll want to go back and find the timestamp of
your last run before you switched to the 2030 data:

.. code::

   $ pqprint results/2025_06_26_10_13_31/intermediate/splink_evaluating_pairs/result.parquet 
         Left Record Dataset  Left Record ID Right Record Dataset  Right Record ID   Probability
   0           input_file_ssa           16314        input_file_w2             7604  5.593631e-06
   1           input_file_ssa           16318        input_file_w2             7604  5.593631e-06
   2           input_file_ssa           16326        input_file_w2             6049  5.593631e-06
   3           input_file_ssa           16351        input_file_w2             3549  5.593631e-06
   4           input_file_ssa           16353        input_file_w2             7434  5.593631e-06
   ...                    ...             ...                  ...              ...           ...
   515790      input_file_ssa            8586        input_file_w2              943  3.526073e-04
   515791      input_file_ssa            8591        input_file_w2             3326  7.227902e-07
   515792      input_file_ssa            8595        input_file_w2             3369  7.227902e-07
   515793      input_file_ssa            8596        input_file_w2             6458  3.526073e-04
   515794      input_file_ssa            8597        input_file_w2             3248  7.227902e-07

   [515795 rows x 5 columns]

We ran over half a million pairs of records through our Splink model.
Let's see if cascading can decrease this number.

We'll add cascading to our pipeline specification by giving the ``entity_resolution`` step multiple
iterations.
``entity_resolution`` is what's called a :ref:`"loop-able section" <loopable_sections>`, which works
similarly to a cloneable section.
The main difference in syntax is that the ``iterations`` key is used instead of ``clones``.
When the pipeline is run, rather than running in parallel like clones, these iterations will be executed
in order, with the output from one iteration being passed as input to the next.

.. code-block:: yaml

  steps:
    entity_resolution:
      iterations:
        - substeps:
            ...
        - substeps:
            ...
    canonicalizing_and_downstream_analysis:
      implementation:
        name: save_clusters
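
The difference between clones and iterations can be illustrated with a small conceptual sketch.
This is not EasyLink code: ``run_clones`` and ``run_iterations`` are hypothetical names, the
"steps" are ordinary Python functions, and the clones are run one after another here purely
for simplicity.

.. code-block:: python

  from functools import reduce


  def run_clones(step_fns, inputs):
      """Clones are independent: each one processes its own input."""
      return [fn(data) for fn, data in zip(step_fns, inputs)]


  def run_iterations(step_fns, initial_input):
      """Iterations are chained: each one consumes the previous one's output."""
      return reduce(lambda data, fn: fn(data), step_fns, initial_input)


  def add_one(records):
      return [r + 1 for r in records]


  def double(records):
      return [r * 2 for r in records]


  clone_outputs = run_clones([add_one, double], [[1, 2], [1, 2]])
  loop_output = run_iterations([add_one, double], [1, 2])
  print(clone_outputs)  # [[2, 3], [2, 4]]
  print(loop_output)    # [4, 6]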

Within the first elided ``substeps`` section (the specification for our first cascading pass),
we will copy the entire substeps section from our improved model, but make some changes to
make the linkage deterministic:

.. code-block:: diff

  substeps:
    determining_exclusions_and_removing_records:
      clones:
        - determining_exclusions:
            implementation:
              name: default_determining_exclusions
              configuration:
                INPUT_DATASET: input_file_ssa
          removing_records:
            implementation:
              name: default_removing_records
              configuration:
                INPUT_DATASET: input_file_ssa
        - determining_exclusions:
            implementation:
              name: default_determining_exclusions
              configuration:
                INPUT_DATASET: input_file_w2
          removing_records:
            implementation:
              name: default_removing_records
              configuration:
                INPUT_DATASET: input_file_w2
    clustering:
      substeps:
        clusters_to_links:
          implementation:
            name: default_clusters_to_links
        linking:
          substeps:
            pre-processing:
              clones:
              - implementation:
  -               name: middle_name_to_initial
  +               name: no_pre-processing
                  configuration:
                    INPUT_DATASET: input_file_ssa
              - implementation:
                  name: no_pre-processing
                  configuration:
                    INPUT_DATASET: input_file_w2
            schema_alignment:
              implementation:
                name: default_schema_alignment
            blocking_and_filtering:
              implementation:
                name: splink_blocking_and_filtering
                configuration:
                  LINK_ONLY: true
  -               BLOCKING_RULES: "l.first_name == r.first_name,l.last_name == r.last_name"
  +               BLOCKING_RULES: "l.ssn == r.ssn and l.first_name == r.first_name and l.last_name == r.last_name"
            evaluating_pairs:
              implementation:
  -             name: splink_evaluating_pairs
  -             configuration:
  -               LINK_ONLY: true
  -               BLOCKING_RULES_FOR_TRAINING: "l.first_name == r.first_name,l.last_name == r.last_name"
  -               COMPARISONS: "ssn:levenshtein,first_name:name,middle_initial:exact,last_name:name"
  -               PROBABILITY_TWO_RANDOM_RECORDS_MATCH: 0.0001  # == 1 / len(w2)
  +             name: accept_all_pairs
        links_to_clusters:
          implementation:
            name: one_to_many_links_to_clusters
            configuration:
              NO_DUPLICATES_DATASET: input_file_ssa
  -           THRESHOLD_MATCH_PROBABILITY: 0.996
  +           # All come with certainty from accept_all_pairs anyway, so this doesn't matter
  +           THRESHOLD_MATCH_PROBABILITY: 0.9

To do our strict deterministic linkage, we change ``BLOCKING_RULES`` to contain just one
rule: our deterministic linkage rule (exact match on SSN, first name, and last name).
Then we replace our Splink evaluating-pairs model with an implementation called ``accept_all_pairs``,
which, as the name suggests, accepts every pair that passes blocking as a match with probability 1.

For the second iteration, we can paste in *another* copy of our original improved model, with
the following changes:

.. code-block:: diff

  substeps:
    determining_exclusions_and_removing_records:
      clones:
        - determining_exclusions:
            implementation:
  -           name: default_determining_exclusions
  +           name: exclude_none
              configuration:
                INPUT_DATASET: input_file_ssa
          removing_records:
            implementation:
              name: default_removing_records
              configuration:
                INPUT_DATASET: input_file_ssa
        - determining_exclusions:
            implementation:
  -           name: default_determining_exclusions
  +           name: exclude_clustered
              configuration:
                INPUT_DATASET: input_file_w2
          removing_records:
            implementation:
              name: default_removing_records
              configuration:
                INPUT_DATASET: input_file_w2
    clustering:
      substeps:
        clusters_to_links:
          implementation:
            name: default_clusters_to_links
        linking:
          substeps:
            pre-processing:
              clones:
              - implementation:
                  name: middle_name_to_initial
                  configuration:
                    INPUT_DATASET: input_file_ssa
              - implementation:
                  name: no_pre-processing
                  configuration:
                    INPUT_DATASET: input_file_w2
            schema_alignment:
              implementation:
                name: default_schema_alignment
            blocking_and_filtering:
              implementation:
                name: splink_blocking_and_filtering
                configuration:
                  LINK_ONLY: true
                  BLOCKING_RULES: "l.first_name == r.first_name,l.last_name == r.last_name"
            evaluating_pairs:
              implementation:
                name: splink_evaluating_pairs
                configuration:
                  LINK_ONLY: true
                  BLOCKING_RULES_FOR_TRAINING: "l.first_name == r.first_name,l.last_name == r.last_name"
                  COMPARISONS: "ssn:levenshtein,first_name:name,middle_initial:exact,last_name:name"
                  PROBABILITY_TWO_RANDOM_RECORDS_MATCH: 0.0001  # == 1 / len(w2)
        links_to_clusters:
          implementation:
            name: one_to_many_links_to_clusters
            configuration:
              NO_DUPLICATES_DATASET: input_file_ssa
              THRESHOLD_MATCH_PROBABILITY: 0.996

The only difference here is that we use new implementations for ``determining_exclusions``.
The determining exclusions step exists in order to facilitate cascading,
but the default implementation (``default_determining_exclusions``) implements the simplest
case: no cascading.
If it is used for anything other than the first cascade pass, it will throw an error.

In its place, we've used ``exclude_none`` for SSA, which, as the name suggests, does not exclude
any records (since all SSA records remain eligible to match regardless of what is found in the first cascade pass).
``exclude_clustered``, which we use for W-2, excludes records that have already been clustered with
any other records. This implements the rule described above: we can drop W-2 records that have
already linked to an SSA record.
All SSA records remain eligible, while some W-2 records are excluded,
because there can be duplicate W-2 records for the same SSA record, but not the other way around.
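
The logic of these two exclusion rules can be sketched as follows. This is an illustration
under assumptions, not EasyLink's implementation: the in-memory ``clusters`` mapping, the
cluster IDs, and the function signatures are hypothetical, and EasyLink's actual file formats may differ.

.. code-block:: python

  from collections import Counter


  def exclude_none(clusters, dataset):
      """Keep every record of ``dataset`` eligible for the next pass."""
      return set()


  def exclude_clustered(clusters, dataset):
      """Return IDs in ``dataset`` that share a cluster with at least one other record."""
      cluster_sizes = Counter(clusters.values())
      return {
          record_id
          for (record_dataset, record_id), cluster_id in clusters.items()
          if record_dataset == dataset and cluster_sizes[cluster_id] > 1
      }


  # Hypothetical clusters from a first pass: (dataset, record ID) -> cluster ID
  clusters = {
      ("input_file_ssa", 10): "c1",
      ("input_file_w2", 500): "c1",
      ("input_file_w2", 501): "c1",
      ("input_file_ssa", 11): "c2",  # unmatched SSA record, alone in its cluster
      ("input_file_w2", 502): "c3",  # unmatched W-2 record, alone in its cluster
  }

  ssa_excluded = exclude_none(clusters, "input_file_ssa")
  w2_excluded = exclude_clustered(clusters, "input_file_w2")
  print(ssa_excluded, w2_excluded)

Here W-2 records 500 and 501 are excluded from the second pass because they are already linked to
SSA record 10, while W-2 record 502 and every SSA record remain eligible.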

The full pipeline specification YAML resulting from these changes is :download:`pipeline_demo_improved_cascade.yaml`.

Now we're ready to run the cascading pipeline on the 2020 data and check our accuracy results!

.. code::

   $ easylink run -p pipeline_demo_improved_cascade.yaml -i input_data_demo.yaml -e environment_local.yaml
   $ python print_fp_fn_w2_ssa.py results/2025_06_26_11_32_15
   12509 true links
   len(false_positives)=47; len(false_negatives)=505

As we guessed, accuracy didn't get better; it actually got a bit worse.
We had 13 more false positives than the base improved model on 2020 data,
and 17 more false negatives as well.
When we look at how many pairs we evaluated, though, we see the benefit:

.. code::

   $ pqprint results/2025_06_26_11_32_15/intermediate/entity_resolution_loop_2_clustering_linking_linking_evaluating_pairs_splink_evaluating_pairs/result.parquet
         Left Record Dataset  Left Record ID Right Record Dataset  Right Record ID  Probability
   0          input_file_ssa               0        input_file_w2             9815     0.000759
   1          input_file_ssa               1        input_file_w2             8608     0.000403
   2          input_file_ssa               5        input_file_w2             9177     0.000202
   3          input_file_ssa               6        input_file_w2             6195     0.001840
   4          input_file_ssa               7        input_file_w2             9177     0.000202
   ...                   ...             ...                  ...              ...          ...
   91420      input_file_ssa           16156        input_file_w2              466     0.000037
   91421      input_file_ssa           16244        input_file_w2              466     0.000037
   91422      input_file_ssa           16270        input_file_w2              466     0.000037
   91423      input_file_ssa           16353        input_file_w2              466     0.000037
   91424      input_file_ssa           16421        input_file_w2              466     0.000037

   [91425 rows x 5 columns]

We evaluated only 17.7% as many pairs with our complex Splink model as when we weren't using
cascading!
Clearly, the first pass with deterministic linkage is quite effective in reducing the size of
the problem.
Though there is no noticeable runtime speedup with these small data, this difference could
be enormous for large data, where evaluating pairs becomes the bottleneck.
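
The percentage above comes directly from the row counts of the two ``pqprint`` outputs.
A short check, with variable names chosen here for illustration:

.. code-block:: python

  baseline_pairs = 515_795  # pairs evaluated by the improved pipeline without cascading
  cascade_pairs = 91_425    # pairs evaluated by Splink in the second cascade pass

  fraction = cascade_pairs / baseline_pairs
  pairs_saved = baseline_pairs - cascade_pairs
  print(f"{fraction:.1%} of the baseline pairs")  # 17.7% of the baseline pairs
  print(f"{pairs_saved} fewer pairs evaluated")   # 424370 fewer pairs evaluated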


Wrapping Up
===========

In this tutorial, we've introduced EasyLink and demonstrated how to configure and run EasyLink pipelines, change step
implementations, change input data, and evaluate and compare results between pipelines.

Not everything EasyLink can do has been covered in this tutorial. EasyLink currently includes a few more implementations 
we haven't used here, can run pipelines on a computational cluster managed by `Slurm <https://slurm.schedmd.com/documentation.html>`_
or distribute work using `Apache Spark <https://spark.apache.org/docs/latest/quick-start.html>`_, and has additional flexibility in
the pipeline schema that we haven't demonstrated here.

In its current state, EasyLink provides only one or two implementations for each step, does not yet have documentation 
to support users in creating their own implementations, and is not yet stable enough to be recommended as a tool for production pipelines.
However, interested users are encouraged to utilize the provided implementations to their full potential
by creating more pipelines, changing how implementations are configured, and linking different datasets. 

We hope to be able to add more features in the future, including:

- Full suite of implementations reflecting a range of common record linkage techniques
- Documentation supporting users in creating their own implementations
- User-experience improvements, especially regarding writing pipeline specifications and implementations
- Auto-parallel sections for processing large-scale data