Record Linkage Software: Connecting Records Without Shared Identifiers
Record linkage software identifies and connects records from different data sources that refer to the same real-world entity, without requiring those sources to share a common unique identifier. Formalized by Newcombe et al. in 1959 and mathematically modeled by Fellegi and Sunter in 1969, record linkage uses probabilistic comparison of quasi-identifiers (names, dates of birth, addresses, phone numbers) to estimate the likelihood that two records belong to the same person, organization, or entity. It is the foundational technology behind government census programs, epidemiological studies, healthcare patient matching, and any scenario where data from independent systems must be combined for analysis.
Record linkage is closely related to, but technically distinct from, data matching and entity resolution. Data matching is the broader category that includes both deterministic and probabilistic comparison methods. Record linkage specifically emphasizes the probabilistic framework and the challenge of linking records without shared keys. Entity resolution extends record linkage by adding clustering and canonicalization. For the full taxonomy, see our [INTERNAL LINK: Pillar 1, data matching guide]. For how record linkage feeds into entity resolution, see our [INTERNAL LINK: 2B, entity resolution and data linkage guide].
Key Takeaways
- Record linkage connects records across databases without shared unique identifiers using probabilistic comparison of quasi-identifiers.
- The Fellegi-Sunter model (1969) is the mathematical foundation, assigning agreement/disagreement weights based on field discriminating power.
- Record linkage classifies pairs into three categories: match, non-match, and possible match (requiring manual review).
- Key applications include government census programs, epidemiological research, healthcare enterprise master patient index (EMPI) systems, and cross-agency citizen data integration.
- Privacy-preserving record linkage (PPRL) enables linking records across organizations without sharing the underlying PII.
- On-premise record linkage ensures that quasi-identifiers (names, DOBs, addresses) used for linking never leave your secured infrastructure.

What Is the Fellegi-Sunter Model for Record Linkage?
The Fellegi-Sunter model is the mathematical framework that underpins most modern record linkage software. For each pair of records being compared, and for each field in the comparison, the model calculates two probabilities: the m-probability (the probability that the field values agree given that the records are a true match) and the u-probability (the probability that they agree by coincidence given that the records are not a match).
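The agreement weight for a field is the base-2 logarithm of the ratio of these probabilities: log2(m/u). When the values disagree, the field contributes a disagreement weight instead, log2((1-m)/(1-u)), which is typically negative. Fields with high discriminating power (many possible values, such as last name or date of birth, so chance agreement is rare) produce higher agreement weights. Fields with low discriminating power (few possible values, such as gender) produce lower weights. Summing the weights across all compared fields, on the assumption that fields agree or disagree independently of one another, produces the overall match score.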
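To make the weight formula concrete, here is a minimal sketch in Python. The m and u values are assumptions chosen for illustration, not estimates from any real dataset, and the function names are hypothetical. The disagreement weight, log2((1-m)/(1-u)), is the standard Fellegi-Sunter counterpart to the agreement weight.

```python
import math

# Illustrative (m, u) pairs; assumed values, not estimated from real data.
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "birth_date": (0.97, 0.003),
    "gender": (0.98, 0.5),
}


def agreement_weight(m, u):
    """Weight added to the score when the field values agree: log2(m/u)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight added when the field values disagree: log2((1-m)/(1-u))."""
    return math.log2((1 - m) / (1 - u))


for field, (m, u) in FIELD_PARAMS.items():
    print(
        f"{field:10s} agree={agreement_weight(m, u):+.2f} "
        f"disagree={disagreement_weight(m, u):+.2f}"
    )
```

With these assumed values, agreement on gender adds less than one bit of evidence, while agreement on birth date adds more than eight, which is the discriminating-power effect described above.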
This score is then compared against two thresholds: an upper threshold (above which the pair is classified as a match) and a lower threshold (below which it is classified as a non-match). Pairs scoring between the thresholds are flagged for manual review. This three-way classification (match, non-match, possible match) is the defining feature of the Fellegi-Sunter approach and remains the standard in government and healthcare record linkage programs.
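The three-way classification can be sketched as follows. The m/u values and both thresholds are assumptions picked for illustration; in practice thresholds are tuned against the error rates a program can tolerate. The names `pair_score` and `classify` are hypothetical.

```python
import math

# Assumed (m, u) values per field, for illustration only.
PARAMS = {
    "last_name": (0.95, 0.01),
    "birth_date": (0.97, 0.003),
    "gender": (0.98, 0.5),
}

UPPER_THRESHOLD = 10.0  # assumed: scores above this are matches
LOWER_THRESHOLD = 0.0   # assumed: scores below this are non-matches


def pair_score(agreements):
    """Sum agreement or disagreement weights over the compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = PARAMS[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(score, lower=LOWER_THRESHOLD, upper=UPPER_THRESHOLD):
    """Map a score to match, non-match, or possible match (manual review)."""
    if score > upper:
        return "match"
    if score < lower:
        return "non-match"
    return "possible match"


examples = {
    "all agree": {"last_name": True, "birth_date": True, "gender": True},
    "birth date differs": {"last_name": True, "birth_date": False, "gender": True},
    "all disagree": {"last_name": False, "birth_date": False, "gender": False},
}
for label, agreements in examples.items():
    score = pair_score(agreements)
    print(f"{label}: score={score:+.2f} -> {classify(score)}")
```

Moving the two thresholds closer together shrinks the manual review queue at the cost of more automatic errors; moving them apart does the opposite.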
How Does Record Linkage Differ from Deterministic Matching?
Comparison Logic
- Deterministic Matching: Binary: field values match or they don't. Rules are explicit.
- Probabilistic Record Linkage: Continuous: each field comparison produces a weighted score. Weights reflect discriminating power.
Missing Data Handling
- Deterministic Matching: Missing field = rule cannot fire = no match possible for that rule.
- Probabilistic Record Linkage: Missing field contributes zero weight. Other fields can still produce a match score above threshold.
Classification
- Deterministic Matching: Two categories: match or no match.
- Probabilistic Record Linkage: Three categories: match, non-match, possible match (manual review).
Transparency
- Deterministic Matching: Full: rules are explicit and auditable.
- Probabilistic Record Linkage: High: weights and scores are visible and auditable. Thresholds are configurable.
Best For
- Deterministic Matching: Clean data with reliable unique identifiers.
- Probabilistic Record Linkage: Messy data without shared identifiers. Government, healthcare, epidemiological research.
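The missing-data row of the comparison is the easiest one to see in code. The sketch below contrasts a hypothetical deterministic rule (exact match on phone and last name) with a probabilistic score in which a missing field contributes zero weight. The records, the rule, and the m/u values are all invented for illustration.

```python
import math

# Assumed (m, u) values per field, for illustration only.
PARAMS = {
    "phone": (0.90, 0.001),
    "last_name": (0.95, 0.01),
    "birth_date": (0.97, 0.003),
    "zip": (0.90, 0.05),
}

# Two made-up records for the same person; the first has no phone on file.
REC_A = {"phone": None, "last_name": "okafor", "birth_date": "1984-03-12", "zip": "94110"}
REC_B = {"phone": "555-0142", "last_name": "okafor", "birth_date": "1984-03-12", "zip": "94110"}


def deterministic_match(a, b, keys=("phone", "last_name")):
    """The rule fires only if every key is present on both sides and equal."""
    return all(a.get(k) is not None and a.get(k) == b.get(k) for k in keys)


def probabilistic_score(a, b, params=PARAMS):
    """Sum field weights; a field missing on either side adds zero weight."""
    score = 0.0
    for field, (m, u) in params.items():
        va, vb = a.get(field), b.get(field)
        if va is None or vb is None:
            continue  # missing value: neither agreement nor disagreement
        if va == vb:
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score


print("deterministic rule fires:", deterministic_match(REC_A, REC_B))
print("probabilistic score:", round(probabilistic_score(REC_A, REC_B), 2))
```

Here the deterministic rule cannot fire because the phone is missing, while the remaining three fields still accumulate enough weight to clear a typical upper threshold.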
Where Is Record Linkage Software Used?
Government: Census and Cross-Agency Linkage
The U.S. Census Bureau developed many of the foundational record linkage techniques still in use today. Census programs link survey responses across years to track population changes. Cross-agency programs link tax, benefits, health, and housing records to detect fraud, measure program effectiveness, and improve service delivery. The FAA/SSA pilot matching case (40,000 Northern California pilots matched against disability records, yielding 40 arrests) remains a classic demonstration of cross-agency record linkage.
Healthcare: Patient Matching Across Systems
Record linkage is the foundation of enterprise master patient index (EMPI) systems. Hospitals, clinics, labs, and pharmacies each assign their own patient IDs. Probabilistic comparison of name, DOB, address, and phone connects patient records across these systems without a shared identifier. A 500-bed hospital system reduced its duplicate rate from 11.2% to 0.8% using probabilistic record linkage with multi-pass blocking.
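Multi-pass blocking, mentioned above, can be sketched as follows. Each pass groups records by a cheap key and only records sharing a key become candidate pairs; running several passes with different keys lets a pair survive an error in any single field. The blocking keys and sample records below are assumptions for illustration, not a recommended configuration.

```python
from collections import defaultdict
from itertools import combinations


def name_year_key(rec):
    """Pass 1 (assumed key): first three letters of last name plus birth year."""
    if not rec.get("last_name") or not rec.get("dob"):
        return None
    return rec["last_name"][:3].lower() + "|" + rec["dob"][:4]


def zip_dob_key(rec):
    """Pass 2 (assumed key): ZIP code plus full date of birth."""
    if not rec.get("zip") or not rec.get("dob"):
        return None
    return rec["zip"] + "|" + rec["dob"]


def candidate_pairs(records, key_functions):
    """Union of the within-block pairs produced by every blocking pass."""
    pairs = set()
    for key_fn in key_functions:
        blocks = defaultdict(list)
        for rec_id, rec in records.items():
            key = key_fn(rec)
            if key is not None:  # records without a key skip this pass
                blocks[key].append(rec_id)
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up patient records; record 3 has a typo in the last name.
RECORDS = {
    1: {"last_name": "Nguyen", "dob": "1975-06-01", "zip": "80202"},
    2: {"last_name": "Nguyen", "dob": "1975-06-10", "zip": "80301"},
    3: {"last_name": "Ngyuen", "dob": "1975-06-01", "zip": "80202"},
    4: {"last_name": "Lopez", "dob": "1990-01-01", "zip": "10001"},
}

pairs = candidate_pairs(RECORDS, [name_year_key, zip_dob_key])
print(sorted(pairs))
```

Only the candidate pairs go on to full probabilistic scoring, so the number of comparisons grows with block sizes rather than with the square of the record count.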
Epidemiology and Public Health Research
Researchers link health records, vital statistics, environmental exposure data, and census information to study disease patterns, treatment outcomes, and population health trends. Record linkage enables longitudinal studies that track individuals across datasets collected over decades, connecting childhood health records to adult outcomes without a universal patient identifier.
"Matched 1.8 million records across three systems with under 2% false positives. Finally have a single source of truth we actually trust."
— Robert Tanaka, Director of Data Operations, Summit Financial Group
What Is Privacy-Preserving Record Linkage (PPRL)?
Privacy-preserving record linkage enables two organizations to link records about the same individuals without either organization sharing the underlying PII with the other. PPRL techniques include Bloom filter encoding (hashing identifiers into bit arrays that can be compared without revealing the original values), secure multi-party computation, and trusted third-party linkage services.
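As a rough illustration of Bloom filter encoding, the sketch below splits a name into character bigrams, hashes each bigram with a keyed hash into positions of a bit array, and compares two encodings with the Dice coefficient. The shared secret, filter size, and number of hash functions are assumptions, and the function names are hypothetical. A real PPRL deployment would need additional hardening, since plain Bloom filter encodings are known to be open to frequency-based attacks.

```python
import hashlib
import hmac


def bigrams(value):
    """Character bigrams of a padded, lower-cased string."""
    padded = "_" + value.strip().lower() + "_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, secret, size=512, num_hashes=4):
    """Return the set of bit positions that would be set in the Bloom filter."""
    positions = set()
    for gram in bigrams(value):
        for i in range(num_hashes):
            message = f"{i}:{gram}".encode("utf-8")
            digest = hmac.new(secret, message, hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % size)
    return positions


def dice_similarity(a, b):
    """Dice coefficient of two encodings: 2 * |A and B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Both parties must use the same secret; this value is a placeholder.
SECRET = b"shared-linkage-key"

enc_a = bloom_encode("Katherine", SECRET)
enc_b = bloom_encode("Catherine", SECRET)
enc_c = bloom_encode("Rodriguez", SECRET)

print("similar names:  ", round(dice_similarity(enc_a, enc_b), 2))
print("different names:", round(dice_similarity(enc_a, enc_c), 2))
```

Each organization encodes its own identifiers locally and exchanges only the encodings, so similar names still score as similar without either side seeing the other's original values.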
PPRL is increasingly important for healthcare research (linking hospital records across health systems), government inter-agency programs (connecting tax and benefits records), and cross-organizational data sharing where privacy regulations prohibit direct PII exchange. For organizations using MatchLogic's on-premise architecture, PPRL can be implemented by running the matching engine on both organizations' data within a controlled, audited environment.
What Should You Look For in Record Linkage Software?
Fellegi-Sunter Implementation: Does the tool implement the full Fellegi-Sunter probabilistic model with configurable m/u probabilities, field weights, and three-way classification (match, non-match, possible)?
Quasi-Identifier Support: Can it compare names (with fuzzy algorithms), dates (with windowed comparison), addresses (with standardization), and other quasi-identifiers simultaneously?
Blocking for Scale: Does it provide multi-pass blocking to keep probabilistic comparison computationally feasible across millions of records?
Audit Trail: Does it log the weights, scores, and threshold classification for every record pair? Government and healthcare record linkage programs require full auditability.
PPRL Capability: Does it support privacy-preserving techniques (Bloom filters, secure computation) for cross-organizational linkage?
On-Premise Deployment: Record linkage operates on the most sensitive identifiers in your data (names, DOBs, SSN fragments). On-premise processing is essential for HIPAA, GDPR, and government data handling requirements.
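When evaluating quasi-identifier support, it helps to have a picture of what a fuzzy name comparison and a windowed date comparison do. The sketch below uses Python's standard-library `difflib.SequenceMatcher` as a stand-in for the name-specific algorithms (such as Jaro-Winkler) that linkage tools typically ship. The 0.85 cutoff and the 31-day window are assumed values, and the function names are hypothetical.

```python
from datetime import date
from difflib import SequenceMatcher

NAME_CUTOFF = 0.85     # assumed similarity cutoff for "names agree"
DATE_WINDOW_DAYS = 31  # assumed tolerance for "dates agree"


def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def dates_within_window(a, b, days=DATE_WINDOW_DAYS):
    """True when two dates are no more than `days` apart."""
    return abs((a - b).days) <= days


def compare_pair(rec_a, rec_b):
    """Field-level agreement flags that a scoring step could consume."""
    return {
        "name": name_similarity(rec_a["name"], rec_b["name"]) >= NAME_CUTOFF,
        "dob": dates_within_window(rec_a["dob"], rec_b["dob"]),
    }


# Made-up records for illustration.
rec_a = {"name": "Jonathan", "dob": date(1984, 3, 12)}
rec_b = {"name": "Jonathon", "dob": date(1984, 3, 21)}

print(compare_pair(rec_a, rec_b))
```

The agreement flags from a comparison like this are what the Fellegi-Sunter weights are applied to, and logging them alongside the resulting score is one way to satisfy the audit trail requirement.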
MatchLogic provides probabilistic record linkage within a unified on-premise platform: Fellegi-Sunter scoring with configurable field weights, multi-pass blocking for enterprise scale, integrated name and address standardization, and complete audit trails for every linkage decision. For a technical breakdown of the underlying [INTERNAL LINK: 1A, matching techniques], see our algorithm guide.
Frequently Asked Questions
What is record linkage software?
Record linkage software connects records from different databases that refer to the same entity without requiring shared unique identifiers. It uses probabilistic comparison of quasi-identifiers (names, DOBs, addresses) to estimate match likelihood, based on the Fellegi-Sunter mathematical framework.
How does record linkage differ from data matching?
Data matching is the broader category that includes both deterministic (rule-based) and probabilistic methods. Record linkage specifically refers to the probabilistic framework for linking records across independent databases without shared keys. All record linkage is data matching; not all data matching is record linkage.
What is privacy-preserving record linkage?
PPRL enables two organizations to link records about the same individuals without sharing the underlying PII. Techniques include Bloom filter encoding, secure multi-party computation, and trusted third-party linkage. It is critical for healthcare research and government inter-agency programs.
Can record linkage software run on-premise?
Yes. Record linkage operates on the most sensitive identifiers (names, DOBs, SSN fragments). MatchLogic processes all record linkage on-premise with full audit trails, meeting HIPAA, GDPR, and government data handling requirements.


