Record Linkage Software: Connecting Records Without Shared Identifiers

Record linkage software identifies and connects records from different data sources that refer to the same real-world entity, without requiring those sources to share a common unique identifier. Formalized by Newcombe et al. in 1959 and mathematically modeled by Fellegi and Sunter in 1969, record linkage uses probabilistic comparison of quasi-identifiers (names, dates of birth, addresses, phone numbers) to estimate the likelihood that two records belong to the same person, organization, or entity. It is the foundational technology behind government census programs, epidemiological studies, healthcare patient matching, and any scenario where data from independent systems must be combined for analysis.

Record linkage relies heavily on the fuzzy matching techniques used to compare quasi-identifiers like names, dates, and addresses when exact keys are unavailable.
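As a rough illustration of fuzzy comparison on a name field, the sketch below uses Python's standard-library difflib. The function names and the 0.85 cutoff are assumptions made for this example, not settings from any particular product; production tools typically use dedicated measures such as Jaro-Winkler or phonetic encodings.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a similarity ratio in [0, 1] after light normalization."""
    # Lowercase and collapse repeated whitespace so formatting noise is ignored.
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def names_agree(a, b, threshold=0.85):
    """Treat two names as agreeing when their similarity reaches the cutoff.

    The default threshold is an illustrative assumption.
    """
    return name_similarity(a, b) >= threshold


if __name__ == "__main__":
    for left, right in [("Katherine", "Catherine"), ("Smith", "Jones")]:
        print(left, right, round(name_similarity(left, right), 3),
              names_agree(left, right))
```

The boolean result of a comparison like this is what feeds the per-field agreement or disagreement decision in a probabilistic model.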

Figure: MatchLogic Entity Clustering. The MatchLogic record linkage interface shows records from different databases linked into entity clusters based on probabilistic comparison of quasi-identifiers.

Record linkage is closely related to, but technically distinct from, data matching and entity resolution. Data matching is the broader category that includes both deterministic and probabilistic comparison methods. Record linkage specifically emphasizes the probabilistic framework and the challenge of linking records without shared keys. Entity resolution extends record linkage by adding clustering and canonicalization. For the full taxonomy, see our data matching guide. For how record linkage feeds into entity resolution, see our entity resolution and data linkage guide.

Key Takeaways

  • Record linkage connects records across databases without shared unique identifiers using probabilistic comparison of quasi-identifiers.
  • The Fellegi-Sunter model (1969) is the mathematical foundation, assigning agreement/disagreement weights based on field discriminating power.
  • Record linkage classifies pairs into three categories: match, non-match, and possible match (requiring manual review).
  • Key applications include government census programs, epidemiological research, healthcare EMPI, and cross-agency citizen data integration.
  • Privacy-preserving record linkage (PPRL) enables linking records across organizations without sharing the underlying PII.
  • On-premise record linkage ensures that quasi-identifiers (names, DOBs, addresses) used for linking never leave your secured infrastructure.

What Is the Fellegi-Sunter Model for Record Linkage?

The Fellegi-Sunter model is the mathematical framework that underpins most modern record linkage software. For each pair of records being compared, and for each field in the comparison, the model calculates two probabilities: the m-probability (the probability that the field values agree given that the records are a true match) and the u-probability (the probability that they agree by coincidence given that the records are not a match).

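The agreement weight for a field is the base-2 logarithm of the ratio of these probabilities: log2(m/u). When the field values disagree, the corresponding disagreement weight is log2((1-m)/(1-u)), which is negative whenever m exceeds u. Fields with high discriminating power (rare values, such as an unusual last name) produce higher agreement weights, while fields with low discriminating power (common values, such as gender) produce lower ones. Summing the weights across all compared fields gives the overall match score for the pair.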
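The weight calculation can be sketched in a few lines of Python. The m and u values below are illustrative assumptions, not estimates from real data, and the function and field names are hypothetical. The disagreement weight log2((1-m)/(1-u)) is the standard counterpart to the agreement weight, applied when a field's values differ.

```python
import math


def agreement_weight(m, u):
    """Weight added when a field agrees: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight added when a field disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))


# Illustrative (assumed) m- and u-probabilities per field.
FIELD_PROBS = {
    "last_name": (0.95, 0.01),   # rare values: agreement is strong evidence
    "birth_date": (0.97, 0.003),
    "gender": (0.98, 0.50),      # common values: agreement is weak evidence
}


def match_score(agreements):
    """Sum per-field weights for a record pair.

    `agreements` maps each field name to True (values agree) or False.
    """
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELD_PROBS[field]
        total += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return total


if __name__ == "__main__":
    for field, (m, u) in FIELD_PROBS.items():
        print(field, round(agreement_weight(m, u), 2),
              round(disagreement_weight(m, u), 2))
    print(match_score({"last_name": True, "birth_date": True, "gender": False}))
```

With these assumed values, agreement on last name contributes far more to the score than agreement on gender, which is the discriminating-power effect described above.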

This score is then compared against two thresholds: an upper threshold (above which the pair is classified as a match) and a lower threshold (below which it is classified as a non-match). Pairs scoring between the thresholds are flagged for manual review. This three-way classification (match, non-match, possible match) is the defining feature of the Fellegi-Sunter approach and remains the standard in government and healthcare record linkage programs.
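A minimal sketch of the three-way classification follows. The threshold values in the usage lines are arbitrary assumptions for illustration; in practice they are tuned to acceptable false-match and false-non-match rates. Treating a score that lands exactly on a threshold as a possible match is also an assumption of this sketch.

```python
def classify(score, lower, upper):
    """Classify a pair score as 'match', 'non-match', or 'possible match'."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score > upper:
        return "match"
    if score < lower:
        return "non-match"
    # Scores between the thresholds (inclusive) go to manual review.
    return "possible match"


if __name__ == "__main__":
    # Hypothetical thresholds chosen only for demonstration.
    for score in (12.0, 4.0, -3.0):
        print(score, classify(score, lower=0.0, upper=8.0))
```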

How Does Record Linkage Differ from Deterministic Matching?

Dimension | Deterministic | Probabilistic Record Linkage
Logic | Binary: match or no match. | Continuous: weighted scores per field.
Missing Data | Rule fails. No match possible. | Zero weight. Other fields can still produce a match.
Classification | Two categories. | Three categories: match, non-match, possible.
Best For | Clean data with unique IDs. | Messy data without shared identifiers.
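The missing-data row can be illustrated with a short sketch, assuming records are plain dictionaries and that a missing value is represented as None. The field names, sample values, and m/u probabilities are hypothetical.

```python
import math


def deterministic_match(a, b, key="ssn"):
    """Exact-key rule: cannot match when the key is missing on either side."""
    return a.get(key) is not None and a.get(key) == b.get(key)


def probabilistic_score(a, b, probs):
    """Sum log2 weights over fields, skipping fields with a missing value."""
    score = 0.0
    for field, (m, u) in probs.items():
        va, vb = a.get(field), b.get(field)
        if va is None or vb is None:
            continue  # missing data contributes zero weight
        if va == vb:
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score


# Assumed m/u values, for illustration only.
PROBS = {"ssn": (0.99, 0.0001), "last_name": (0.95, 0.01), "dob": (0.97, 0.003)}

rec_a = {"ssn": None, "last_name": "okafor", "dob": "1980-02-14"}
rec_b = {"ssn": "123-45-6789", "last_name": "okafor", "dob": "1980-02-14"}

if __name__ == "__main__":
    print(deterministic_match(rec_a, rec_b))         # the rule fails
    print(probabilistic_score(rec_a, rec_b, PROBS))  # still strongly positive
```

Here the exact-key rule cannot fire because one identifier is missing, while the remaining fields still accumulate enough weight to support a match.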

Where Is Record Linkage Software Used?

Government: Census and Cross-Agency Linkage

The U.S. Census Bureau developed many of the foundational record linkage techniques still in use today. Census programs link survey responses across years to track population changes. Cross-agency programs link tax, benefits, health, and housing records to detect fraud, measure program effectiveness, and improve service delivery. The FAA/SSA pilot matching case (40,000 Northern California pilots matched against disability records, yielding 40 arrests) remains a classic demonstration of cross-agency record linkage.

Healthcare: Patient Matching Across Systems

Record linkage is the foundation of EMPI systems. Hospitals, clinics, labs, and pharmacies each assign their own patient IDs. Record linkage using probabilistic comparison of name, DOB, address, and phone connects patient records across these systems without a shared identifier. A 500-bed hospital system reduced its duplicate rate from 11.2% to 0.8% using probabilistic record linkage with multi-pass blocking.

Epidemiology and Public Health Research

Researchers link health records, vital statistics, environmental exposure data, and census information to study disease patterns, treatment outcomes, and population health trends. Record linkage enables longitudinal studies that track individuals across datasets collected over decades, connecting childhood health records to adult outcomes without a universal patient identifier.

"Matched 1.8 million records across three systems with under 2% false positives. Finally have a single source of truth we actually trust."
— Robert Tanaka, Director of Data Operations, Summit Financial Group
1.8M records linked using probabilistic record linkage

What Is Privacy-Preserving Record Linkage (PPRL)?

Privacy-preserving record linkage enables two organizations to link records about the same individuals without either organization sharing the underlying PII with the other. PPRL techniques include Bloom filter encoding (hashing identifiers into bit arrays that can be compared without revealing the original values), secure multi-party computation, and trusted third-party linkage services.
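The Bloom filter approach can be sketched as follows, assuming both organizations agree on a shared secret key, a filter size, and a number of hash functions in advance. All parameter values and function names here are illustrative assumptions. Each value is split into character bigrams, every bigram is hashed to several bit positions with a keyed hash, and two encodings are compared with the Dice coefficient.

```python
import hashlib
import hmac


def bigrams(value):
    """Split a normalized, padded string into its set of character bigrams."""
    padded = "_" + value.lower().strip() + "_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, secret, size=1024, num_hashes=4):
    """Return the set of bit positions that the value's bigrams switch on."""
    bits = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            message = (str(k) + "|" + gram).encode("utf-8")
            digest = hmac.new(secret, message, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice_similarity(bits_a, bits_b):
    """Dice coefficient of two bit-position sets, in [0, 1]."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


if __name__ == "__main__":
    key = b"shared-secret-agreed-out-of-band"  # hypothetical key
    smith = bloom_encode("Smith", key)
    smyth = bloom_encode("Smyth", key)
    garcia = bloom_encode("Garcia", key)
    print(dice_similarity(smith, smyth), dice_similarity(smith, garcia))
```

Only the encoded bit positions would be exchanged between parties, so similar names can still score highly without the plain-text values being shared.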

PPRL is increasingly important for healthcare research (linking hospital records across health systems), government inter-agency programs (connecting tax and benefits records), and cross-organizational data sharing where privacy regulations prohibit direct PII exchange. For organizations using MatchLogic's on-premise architecture, PPRL can be implemented by running the matching engine on both organizations' data within a controlled, audited environment.

What Should You Look For in Record Linkage Software?

Fellegi-Sunter Implementation: Does the tool implement the full Fellegi-Sunter probabilistic model with configurable m/u probabilities, field weights, and three-way classification (match, non-match, possible)?

Quasi-Identifier Support: Can it compare names (with fuzzy algorithms), dates (with windowed comparison), addresses (with standardization), and other quasi-identifiers simultaneously?

Blocking for Scale: Does it provide multi-pass blocking to make probabilistic comparison computationally feasible at millions of records?

Audit Trail: Does it log the weights, scores, and threshold classification for every record pair? Government and healthcare record linkage programs require full auditability.

PPRL Capability: Does it support privacy-preserving techniques (Bloom filters, secure computation) for cross-organizational linkage?

On-Premise Deployment: Record linkage operates on the most sensitive identifiers in your data (names, DOBs, SSN fragments). On-premise processing is essential for HIPAA, GDPR, and government data handling requirements.
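To make the blocking criterion in the checklist above concrete, here is a minimal sketch of multi-pass blocking. The two blocking keys (last-name prefix, and birth year plus ZIP code) and the record layout are assumptions chosen for illustration. Each pass groups records by one key, and only records sharing a block are paired for detailed comparison.

```python
from collections import defaultdict
from itertools import combinations


def block_last_prefix(rec):
    """Pass 1 key: first three letters of the last name."""
    return rec["last_name"][:3].lower()


def block_birth_year_zip(rec):
    """Pass 2 key: birth year combined with ZIP code."""
    return rec["dob"][:4] + "|" + rec["zip"]


def candidate_pairs(records, blocking_keys):
    """Union of index pairs (i, j), i < j, that share a block in any pass."""
    pairs = set()
    for key_fn in blocking_keys:
        blocks = defaultdict(list)
        for idx, rec in enumerate(records):
            blocks[key_fn(rec)].append(idx)
        for members in blocks.values():
            pairs.update(combinations(members, 2))
    return pairs


RECORDS = [
    {"last_name": "Nguyen", "dob": "1975-03-02", "zip": "94110"},
    {"last_name": "Ngyuen", "dob": "1975-03-02", "zip": "94110"},  # typo in name
    {"last_name": "Nguyen", "dob": "1990-11-20", "zip": "10001"},
    {"last_name": "Patel", "dob": "1962-07-08", "zip": "60601"},
]

if __name__ == "__main__":
    pairs = candidate_pairs(RECORDS, [block_last_prefix, block_birth_year_zip])
    print(sorted(pairs))
```

In this toy data, the second pass recovers the pair that the first pass misses because of the misspelled last name, while four of the six possible pairs are never compared at all.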

MatchLogic provides probabilistic record linkage within a unified on-premise platform: Fellegi-Sunter scoring with configurable field weights, multi-pass blocking for enterprise scale, integrated name and address standardization, and complete audit trails for every linkage decision. For a technical breakdown of the underlying matching techniques, see our algorithm guide.

Frequently Asked Questions

What is record linkage software?

Record linkage software connects records from different databases that refer to the same entity without requiring shared unique identifiers. It uses probabilistic comparison of quasi-identifiers (names, DOBs, addresses) to estimate match likelihood, based on the Fellegi-Sunter mathematical framework.

How does record linkage differ from data matching?

Data matching is the broader category that includes both deterministic (rule-based) and probabilistic methods. Record linkage specifically refers to the probabilistic framework for linking records across independent databases without shared keys. All record linkage is data matching; not all data matching is record linkage.

What is privacy-preserving record linkage?

PPRL enables two organizations to link records about the same individuals without sharing the underlying PII. Techniques include Bloom filter encoding, secure multi-party computation, and trusted third-party linkage. It is critical for healthcare research and government inter-agency programs.

Can record linkage software run on-premise?

Yes. Record linkage operates on the most sensitive identifiers (names, DOBs, SSN fragments). MatchLogic processes all record linkage on-premise with full audit trails, meeting HIPAA, GDPR, and government data handling requirements.
