Record Linkage Software: Connecting Records Without Shared Identifiers

Record linkage software identifies and connects records from different data sources that refer to the same real-world entity, without requiring those sources to share a common unique identifier. 

The approach was formalized by Newcombe et al. in 1959 and given a mathematical model by Fellegi and Sunter in 1969. It uses probabilistic comparison of quasi-identifiers (names, dates of birth, addresses, phone numbers) to estimate how likely it is that two records belong to the same person, organization, or entity.

It's the foundational technology behind government census programs, epidemiological studies, healthcare patient matching, and any scenario where data from independent systems has to be combined for analysis.

Record linkage sits inside the broader discipline of data matching, which spans both deterministic and probabilistic methods. Record linkage is the part that emphasizes the probabilistic framework and the challenge of connecting records when no shared key exists.

Key Takeaways

  • Record linkage connects records across databases without shared unique identifiers using probabilistic comparison of quasi-identifiers.
  • The Fellegi-Sunter model (1969) is the mathematical foundation, assigning agreement/disagreement weights based on field discriminating power.
  • Record linkage classifies pairs into three categories: match, non-match, and possible match (requiring manual review).
  • Key applications include government census programs, epidemiological research, healthcare enterprise master patient index (EMPI) systems, and cross-agency citizen data integration.
  • Privacy-preserving record linkage (PPRL) enables linking records across organizations without sharing the underlying PII.
  • On-premise record linkage ensures that quasi-identifiers (names, DOBs, addresses) used for linking never leave your secured infrastructure.

What Is the Fellegi-Sunter Model for Record Linkage?

The Fellegi-Sunter model is the mathematical framework that underpins most modern record linkage software. For each pair of records being compared, and for each field in the comparison, the model calculates two probabilities: the m-probability (the probability that the field values agree given that the records are a true match) and the u-probability (the probability that they agree by coincidence given that the records are not a match).
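To make the two definitions concrete, here is a minimal Python sketch that estimates m and u for one field from a small hand-labeled sample of record pairs. The function name `estimate_m_u`, the record layout, and the sample data are all hypothetical. Labeled pairs are assumed only to keep the example short; in practice these probabilities are often estimated without labels.

```python
def estimate_m_u(labeled_pairs, field):
    """Estimate (m, u) for `field` from (record_a, record_b, is_match) triples.

    m = share of true matches whose field values agree
    u = share of non-matches whose field values agree
    """
    agree_match = total_match = 0
    agree_non = total_non = 0
    for rec_a, rec_b, is_match in labeled_pairs:
        agrees = rec_a.get(field) is not None and rec_a.get(field) == rec_b.get(field)
        if is_match:
            total_match += 1
            agree_match += int(agrees)
        else:
            total_non += 1
            agree_non += int(agrees)
    if total_match == 0 or total_non == 0:
        raise ValueError("need at least one match and one non-match")
    return agree_match / total_match, agree_non / total_non


# Hypothetical labeled sample: 4 true matches, 4 non-matches.
sample = [
    ({"last_name": "Okafor"}, {"last_name": "Okafor"}, True),
    ({"last_name": "Lindqvist"}, {"last_name": "Lindqvist"}, True),
    ({"last_name": "Nguyen"}, {"last_name": "Nguyen"}, True),
    ({"last_name": "Smith"}, {"last_name": "Smyth"}, True),    # typo in a true match
    ({"last_name": "Smith"}, {"last_name": "Smith"}, False),   # coincidental agreement
    ({"last_name": "Okafor"}, {"last_name": "Nguyen"}, False),
    ({"last_name": "Lindqvist"}, {"last_name": "Smith"}, False),
    ({"last_name": "Nguyen"}, {"last_name": "Smyth"}, False),
]

m, u = estimate_m_u(sample, "last_name")
print(f"m = {m}, u = {u}")
```

A real sample would need far more pairs than this, and far more non-matches than matches, for the estimates to be stable.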

The agreement weight for a field is the log ratio of these probabilities: log2(m/u). When the field disagrees, the corresponding disagreement weight is log2((1-m)/(1-u)), which is normally negative. Fields with high discriminating power (rare values, like an unusual last name) produce higher agreement weights. Fields with low discriminating power (common values, like gender) produce lower weights. Summing the weights across all fields, which treats the fields as conditionally independent, produces an overall match score.
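The weight arithmetic can be sketched in a few lines of Python. The m and u values below are made up for illustration, and the function names are hypothetical. The sketch uses the agreement weight log2(m/u) together with its standard counterpart for disagreeing fields, log2((1-m)/(1-u)), and sums the per-field weights into one score.

```python
import math

# Hypothetical m- and u-probabilities, chosen only for illustration.
FIELD_PARAMS = {
    "last_name": {"m": 0.95, "u": 0.01},
    "birth_date": {"m": 0.97, "u": 0.003},
    "gender": {"m": 0.98, "u": 0.50},
}


def agreement_weight(m, u):
    """log2(m/u): evidence added when the field agrees."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """log2((1-m)/(1-u)): evidence (normally negative) when the field disagrees."""
    return math.log2((1 - m) / (1 - u))


def match_score(agreements):
    """Sum per-field weights; `agreements` maps field name -> True/False."""
    total = 0.0
    for field, agrees in agreements.items():
        params = FIELD_PARAMS[field]
        if agrees:
            total += agreement_weight(params["m"], params["u"])
        else:
            total += disagreement_weight(params["m"], params["u"])
    return total


for name, params in FIELD_PARAMS.items():
    print(name, round(agreement_weight(params["m"], params["u"]), 2))

print(round(match_score({"last_name": True, "birth_date": True, "gender": False}), 2))
```

With these assumed values, agreement on last name is worth roughly 6.6 bits while agreement on gender is worth about 1 bit, which is the discriminating-power effect described above.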

This score is then compared against two thresholds: an upper threshold (above which the pair is classified as a match) and a lower threshold (below which it is classified as a non-match). Pairs scoring between the thresholds are flagged for manual review. This three-way classification (match, non-match, possible match) is the defining feature of the Fellegi-Sunter approach and remains the standard in government and healthcare record linkage programs.
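As a rough illustration, the three-way decision reduces to two comparisons. The threshold values here are placeholders, not recommendations; real programs tune them against acceptable false-match and false-non-match rates.

```python
def classify(score, lower=0.0, upper=10.0):
    """Three-way classification of a pair score (thresholds are hypothetical)."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score > upper:
        return "match"
    if score < lower:
        return "non-match"
    return "possible match"  # falls between the thresholds: manual review


for s in (14.2, 4.5, -6.0):
    print(s, "->", classify(s))
```

Widening the gap between the two thresholds sends more pairs to manual review; narrowing it automates more decisions at the cost of more classification errors.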

How Does Record Linkage Differ from Deterministic Matching?

Both approaches sit inside the broader category of data matching techniques, but they diverge sharply on how they treat agreement, missing data, and the classification step. The table below summarizes the core differences.

Dimension      | Deterministic                  | Probabilistic Record Linkage
Logic          | Binary: match or no match.     | Continuous: weighted scores per field.
Missing Data   | Rule fails. No match possible. | Zero weight. Other fields can still produce match.
Classification | Two categories.                | Three categories: match, non-match, possible.
Best For       | Clean data with unique IDs.    | Messy data without shared identifiers.
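The Missing Data row is the easiest difference to see in code. The sketch below is illustrative only: the field names, the weights, and both function names are assumptions. It compares an exact-match rule with a weighted score when one record has no phone number.

```python
# Hypothetical (agreement, disagreement) weights in bits.
WEIGHTS = {
    "last_name": (6.5, -4.25),
    "birth_date": (8.25, -5.0),
    "phone": (9.0, -3.0),
}


def deterministic_match(a, b, keys=("last_name", "birth_date", "phone")):
    """Exact rule: every key must be present in both records and equal."""
    return all(a.get(k) is not None and a.get(k) == b.get(k) for k in keys)


def probabilistic_score(a, b):
    """Weighted score; a missing value on either side contributes zero weight."""
    score = 0.0
    for field, (agree_w, disagree_w) in WEIGHTS.items():
        va, vb = a.get(field), b.get(field)
        if va is None or vb is None:
            continue  # missing data: neither evidence for nor against
        score += agree_w if va == vb else disagree_w
    return score


rec_a = {"last_name": "Okafor", "birth_date": "1984-03-12", "phone": None}
rec_b = {"last_name": "Okafor", "birth_date": "1984-03-12", "phone": "415-555-0100"}

print(deterministic_match(rec_a, rec_b))   # the rule fails on the missing phone
print(probabilistic_score(rec_a, rec_b))   # name and birth date still contribute
```

The deterministic rule rejects the pair outright, while the weighted score still accumulates evidence from the two fields that are present.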

Record linkage also sits one layer below entity resolution in the stack. Linkage produces scored pairs; entity resolution then extends those pairs by adding clustering (grouping all records that refer to one entity) and canonicalization (building a single golden record from the group). In practice the two run on the same probabilistic foundation, with entity resolution layered on top.
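A minimal sketch of those two extra steps follows. It assumes the simplest possible choices: clustering by connected components (any chain of matched pairs ends up in one entity) and a golden record built from the most frequent non-empty value per field. Production entity resolution usually applies stricter clustering and survivorship rules, and the function names here are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_pairs(record_ids, matched_pairs):
    """Group records into entities: connected components via union-find."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for rid in record_ids:
        clusters[find(rid)].append(rid)
    return sorted(sorted(members) for members in clusters.values())


def golden_record(records):
    """Canonicalize: most frequent non-empty value per field (ties: first seen)."""
    fields = sorted({f for rec in records for f in rec})
    golden = {}
    for f in fields:
        values = [rec[f] for rec in records if rec.get(f)]
        golden[f] = Counter(values).most_common(1)[0][0] if values else None
    return golden


ids = ["a", "b", "c", "d"]
pairs = [("a", "b"), ("b", "c")]  # pairs already classified as matches
print(cluster_pairs(ids, pairs))

group = [
    {"name": "Jon Smith", "phone": None},
    {"name": "John Smith", "phone": "555-0100"},
    {"name": "John Smith"},
]
print(golden_record(group))
```

Connected components will merge two distinct entities if a single false match links them, which is one reason the upstream thresholds matter.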

Where Is Record Linkage Software Used?

Government: Census and Cross-Agency Linkage

The U.S. Census Bureau developed many of the foundational record linkage techniques still in use today. Census programs link survey responses across years to track population changes. Cross-agency programs link tax, benefits, health, and housing records to detect fraud, measure program effectiveness, and improve service delivery. The FAA/SSA pilot matching case (40,000 Northern California pilots matched against disability records, yielding 40 arrests) remains a classic demonstration of cross-agency record linkage.

Healthcare: Patient Matching Across Systems

Record linkage is the foundation of EMPI systems. Hospitals, clinics, labs, and pharmacies each assign their own patient IDs, and record linkage uses probabilistic comparison of name, date of birth, address, and phone to connect patient records across them without a shared key. The name side of that comparison typically runs through fuzzy name matching software, with the per-field scores feeding the overall linkage score. Consider a large hospital system reconciling several million patient records: probabilistic record linkage with multi-pass blocking can take a duplicate rate that started in the double digits down to well under one percent.
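Multi-pass blocking can be sketched as follows. The two blocking keys (name prefix plus birth year, and the last four phone digits) are hypothetical choices for illustration, as are the function names. Each pass only compares records that share a key, and the passes are unioned so that a typo that breaks one key can still be caught by another.

```python
from collections import defaultdict
from itertools import combinations


def block_key_name_year(rec):
    """Pass 1 (hypothetical key): first 3 letters of last name + birth year."""
    if not rec.get("last_name") or not rec.get("dob"):
        return None
    return (rec["last_name"][:3].upper(), rec["dob"][:4])


def block_key_phone(rec):
    """Pass 2 (hypothetical key): last 4 digits of the phone number."""
    digits = "".join(ch for ch in (rec.get("phone") or "") if ch.isdigit())
    return digits[-4:] if len(digits) >= 4 else None


def candidate_pairs(records, key_funcs):
    """Union of within-block pairs over all passes; `records` is {id: record}."""
    pairs = set()
    for key_func in key_funcs:
        blocks = defaultdict(list)
        for rid, rec in records.items():
            key = key_func(rec)
            if key is not None:  # records without a key skip this pass
                blocks[key].append(rid)
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


patients = {
    1: {"last_name": "Okafor", "dob": "1984-03-12", "phone": "415-555-0100"},
    2: {"last_name": "Okafor", "dob": "1984-03-12", "phone": None},
    3: {"last_name": "Okofor", "dob": "1984-03-12", "phone": "(415) 555-0100"},
    4: {"last_name": "Lindqvist", "dob": "1990-07-01", "phone": "212-555-0199"},
}

print(sorted(candidate_pairs(patients, [block_key_name_year, block_key_phone])))
```

Here only two of the six possible pairs reach the scoring stage: the first pass pairs records 1 and 2, and the phone pass recovers record 3 despite the misspelled name.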

Epidemiology and Public Health Research

Researchers link health records, vital statistics, environmental exposure data, and census information to study disease patterns, treatment outcomes, and population health trends. Record linkage enables longitudinal studies that track individuals across datasets collected over decades, connecting childhood health records to adult outcomes without a universal patient identifier.

Linked 4.6 million records across seven agencies with a defensible audit trail

"Fellegi-Sunter scoring meant every one of the 4.6 million linkages carried a documented weight and threshold. The program review board cleared us on the first pass, and we cut manual adjudication from 9 percent of pairs to 1.2 percent."

Dr. Priya Sundararaj, Lead Data Scientist, North Atlantic Public Health Institute

What Is Privacy-Preserving Record Linkage (PPRL)?

Privacy-preserving record linkage lets two organizations link records about the same individuals without either organization sharing the underlying PII with the other. The main techniques are:

  • Bloom filter encoding: hashes identifiers into bit arrays that can be compared for similarity without revealing the original values.
  • Secure multi-party computation: lets parties compute match results on encrypted data without exposing their records to each other.
  • Trusted third-party linkage: routes encrypted records through a neutral intermediary that returns only the linked identifiers, not the underlying data.
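The first technique can be illustrated with a small sketch. It assumes character bigrams, a keyed HMAC-SHA256 hash, and a 256-bit filter, all chosen for illustration, and it represents the filter as the set of positions that are switched on. It is not a hardened encoding: production PPRL schemes add further protections against frequency and pattern attacks.

```python
import hashlib
import hmac


def bigrams(value):
    """Character bigrams of a normalized, padded string."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, secret, size=256, num_hashes=4):
    """Return the set of bit positions set for `value` (a sparse bit array).

    `secret` is a bytes key that both parties share but the linker does not see.
    """
    bits = set()
    for gram in bigrams(value):
        for i in range(num_hashes):
            digest = hmac.new(secret, f"{i}|{gram}".encode("utf-8"), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice_similarity(bits_a, bits_b):
    """Dice coefficient of two encodings: 2|A∩B| / (|A| + |B|)."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


key = b"shared-secret-for-illustration"
katherine = bloom_encode("Katherine", key)
catherine = bloom_encode("Catherine", key)
oluwaseun = bloom_encode("Oluwaseun", key)

print(round(dice_similarity(katherine, catherine), 2))  # similar names, high overlap
print(round(dice_similarity(katherine, oluwaseun), 2))  # unrelated names, low overlap
```

Because similar strings share most of their bigrams, their encodings share most of their set bits, so the comparison tolerates typos without either party seeing the other's plaintext values.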

PPRL is increasingly important for healthcare research (linking hospital records across health systems), government inter-agency programs (connecting tax and benefits records), and cross-organizational data sharing where privacy regulations prohibit direct PII exchange. For organizations running database matching software on MatchLogic's on-premise architecture, PPRL can be implemented by running the matching engine against both organizations' data inside a controlled, audited environment.

What Should You Look For in Record Linkage Software?

The criteria below cover what record linkage specifically demands. Broader procurement and vendor-evaluation factors are covered in our data matching software buyer's guide.

  • Fellegi-Sunter implementation: does the tool implement the full probabilistic model with configurable m/u probabilities, field weights, and three-way classification (match, non-match, possible)?
  • Quasi-identifier support: can it compare names with fuzzy matching techniques, dates with windowed comparison, and addresses with standardization in the same comparison pass?
  • Blocking for scale: does it provide multi-pass blocking so probabilistic comparison stays computationally feasible at millions of records?
  • Audit trail: does it log the weights, scores, and threshold classification for every record pair? Government and healthcare programs require full auditability.
  • PPRL capability: does it support privacy-preserving techniques (Bloom filters, secure computation) for cross-organizational linkage?
  • On-premise deployment: record linkage operates on the most sensitive identifiers in your data (names, DOBs, SSN fragments), so on-premise processing is essential for HIPAA, GDPR, and government data handling requirements.
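As an illustration of the audit trail criterion, the sketch below writes one JSON line per compared pair, recording the per-field weights, the total score, the thresholds in force, and the resulting classification. The record layout, field names, and function names are assumptions for the example, not any product's log format.

```python
import json
from datetime import datetime, timezone


def audit_entry(pair_id, field_weights, lower, upper):
    """Build an audit record for one compared pair (layout is hypothetical)."""
    score = sum(field_weights.values())
    if score > upper:
        decision = "match"
    elif score < lower:
        decision = "non-match"
    else:
        decision = "possible match"
    return {
        "pair_id": pair_id,
        "field_weights": field_weights,
        "score": score,
        "thresholds": {"lower": lower, "upper": upper},
        "decision": decision,
        "decided_at": datetime.now(timezone.utc).isoformat(),
    }


def append_audit_log(path, entry):
    """Append one entry to a JSON Lines file, one decision per line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")


entry = audit_entry(
    "A17|B42",
    {"last_name": 6.5, "birth_date": 8.25, "phone": -3.0},
    lower=0.0,
    upper=10.0,
)
print(json.dumps(entry, indent=2))
```

Storing the thresholds alongside each decision means a reviewer can reproduce the classification later even if the thresholds have since been retuned.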

Probabilistic Linkage Inside a Single On-Premise Pipeline

MatchLogic provides probabilistic record linkage within a unified on-premise platform: Fellegi-Sunter scoring with configurable field weights, multi-pass blocking for enterprise scale, integrated name and address standardization, and complete audit trails for every linkage decision. The same engine handles the broader fuzzy matching software workload, so record linkage and general matching can share one pipeline rather than living in separate tools.

Frequently Asked Questions

What is record linkage software?

Record linkage software connects records from different databases that refer to the same entity without requiring shared unique identifiers. It uses probabilistic comparison of quasi-identifiers (names, DOBs, addresses) to estimate match likelihood, based on the Fellegi-Sunter mathematical framework.

How does record linkage differ from data matching?

Data matching is the broader category that includes both deterministic (rule-based) and probabilistic methods. Record linkage specifically refers to the probabilistic framework for linking records across independent databases without shared keys. All record linkage is data matching; not all data matching is record linkage.

What is privacy-preserving record linkage?

PPRL enables two organizations to link records about the same individuals without sharing the underlying PII. Techniques include Bloom filter encoding, secure multi-party computation, and trusted third-party linkage. It is critical for healthcare research and government inter-agency programs.

Can record linkage software run on-premise?

Yes. Record linkage operates on the most sensitive identifiers (names, DOBs, SSN fragments). MatchLogic processes all record linkage on-premise with full audit trails, meeting HIPAA, GDPR, and government data handling requirements.
