What is data matching and why do enterprises need it?

Data matching is the process of comparing records across datasets to identify entries that refer to the same real-world entity. Enterprises need it because fragmented records create duplicates that inflate costs, weaken analytics, and create compliance risk. According to Gartner, poor data quality costs organizations an average of $12.9 million per year.

What is the difference between deterministic and probabilistic data matching?

Deterministic matching compares fields for exact equality and works well when unique identifiers are present. Probabilistic matching assigns weighted scores to field comparisons and calculates overall match probability, making it effective when data is incomplete or inconsistent. Most enterprise implementations use both approaches.

How accurate is fuzzy matching for enterprise data?

With proper threshold tuning, fuzzy matching typically achieves F1 scores between 0.88 and 0.95. Combining fuzzy matching with probabilistic weighting across multiple fields pushes accuracy higher. Accuracy depends on the algorithm, threshold, and input data quality.

Can data matching run on-premise for regulated industries?

Yes. On-premise data matching platforms process all data within your secured infrastructure, ensuring sensitive records never leave your network. This addresses data residency requirements under HIPAA, GDPR, SOX, and industry-specific mandates.

How do you measure data matching quality?

Three metrics matter most: Precision (percentage of declared matches that are correct), Recall (percentage of true matches found), and F1 Score (harmonic mean of precision and recall). Enterprise benchmarks target F1 above 0.95.

What is blocking in data matching and why is it necessary?

Blocking partitions records into subsets sharing a common attribute so the system only compares records within the same block. Without it, 10 million records would require 50 trillion comparisons. Blocking reduces this by 99%+ while preserving high recall.

Record Linkage Software: Connecting Records Without Shared Identifiers

Record linkage software identifies and connects records from different data sources that refer to the same real-world entity, without requiring those sources to share a common unique identifier. Formalized by Newcombe and colleagues in 1959 and mathematically modeled by Fellegi and Sunter in 1969, it uses probabilistic comparison of quasi-identifiers (names, dates of birth, addresses, phone numbers) to estimate how likely it is that two records belong to the same person, organization, or entity.

It is the foundational technology behind government census programs, epidemiological studies, healthcare patient matching, and any scenario where data from independent systems has to be combined for analysis.

Record linkage sits inside the broader discipline of data matching, which spans both deterministic and probabilistic methods. Record linkage is the part that emphasizes the probabilistic framework and the challenge of connecting records when no shared key exists. This guide covers the Fellegi-Sunter model, how linkage differs from deterministic matching, the main applications, and what to evaluate in record linkage software.

Key Takeaways

✓Record linkage connects records across databases without shared unique identifiers using probabilistic comparison of quasi-identifiers.
✓The Fellegi-Sunter model (1969) is the mathematical foundation, assigning agreement/disagreement weights based on field discriminating power.
✓Record linkage classifies pairs into three categories: match, non-match, and possible match (requiring manual review).
✓Key applications include government census programs, epidemiological research, healthcare EMPI, and cross-agency citizen data integration.
✓Privacy-preserving record linkage (PPRL) enables linking records across organizations without sharing the underlying PII.
✓On-premise record linkage ensures that quasi-identifiers (names, DOBs, addresses) used for linking never leave your secured infrastructure.

What Is the Fellegi-Sunter Model for Record Linkage?

The Fellegi-Sunter model is the mathematical framework that underpins most modern record linkage software, set out in the 1969 paper. For each pair of records, and for each field in the comparison, it calculates two probabilities: the m-probability (that the field values agree given a true match) and the u-probability (that they agree by coincidence given a non-match).

The agreement weight for a field is the log ratio of these probabilities, log2(m/u). Fields with high discriminating power, such as an unusual last name, produce higher agreement weights, while low-discriminating fields like gender produce lower weights. The combined weight across all fields produces an overall match score.

That score is compared against two thresholds: an upper threshold above which the pair is a match, and a lower threshold below which it is a non-match. Pairs scoring between the thresholds are flagged for manual review, and this three-way classification (match, non-match, possible match) is the defining feature of the Fellegi-Sunter approach and the standard in government and healthcare linkage programs.

How Does Record Linkage Differ from Deterministic Matching?

Both approaches sit inside the broader category of data matching, but they diverge sharply on how they treat agreement, missing data, and the classification step. The table summarizes the core differences.

Dimension	Deterministic	Probabilistic Record Linkage
Logic	Binary: match or no match.	Continuous: weighted scores per field.
Missing Data	Rule fails. No match possible.	Zero weight. Other fields can still produce match.
Classification	Two categories.	Three categories: match, non-match, possible.
Best For	Clean data with unique IDs.	Messy data without shared identifiers.

Record linkage also sits one layer below entity resolution in the stack. Linkage produces scored pairs, and entity resolution then extends those pairs with clustering (grouping all records that refer to one entity) and canonicalization (building a single golden record from the group). The two run on the same probabilistic foundation, with entity resolution layered on top.

Where Is Record Linkage Software Used?

Government: Census and Cross-Agency Linkage

The U.S. Census Bureau developed many of the foundational record linkage techniques still in use today, linking survey responses across years to track population change. Cross-agency programs link tax, benefits, health, and housing records to detect fraud, measure program effectiveness, and improve service delivery. The Operation Safe Pilot case, which matched about 40,000 Northern California pilots against disability records, remains a classic demonstration of cross-agency record linkage.

Healthcare: Patient Matching Across Systems

Record linkage is the foundation of enterprise master patient index (EMPI) systems. Hospitals, clinics, labs, and pharmacies each assign their own patient IDs, and record linkage uses probabilistic comparison of name, date of birth, address, and phone to connect records across them without a shared key. The name side of that comparison typically runs through fuzzy name matching software, with the per-field scores feeding the overall linkage score. A large hospital system reconciling several million patient records with probabilistic linkage and multi-pass blocking can take a duplicate rate that started in the double digits down to well under one percent.

Epidemiology and Public Health Research

Researchers link health records, vital statistics, environmental exposure data, and census information to study disease patterns, treatment outcomes, and population health trends. Record linkage enables longitudinal studies that track individuals across datasets collected over decades, connecting childhood health records to adult outcomes without a universal patient identifier.

Linked 4.6 million records across seven agencies with a defensible audit trail

"Fellegi-Sunter scoring meant every one of the 4.6 million linkages carried a documented weight and threshold. The program review board cleared us on the first pass, and we cut manual adjudication from 9 percent of pairs to 1.2 percent."

Dr. Priya Sundararaj, Lead Data Scientist, North Atlantic Public Health Institute

What Is Privacy-Preserving Record Linkage (PPRL)?

Privacy-preserving record linkage lets two organizations link records about the same individuals without either one sharing the underlying PII, an approach used in national health and statistical linkage programs (see the AIHW Bloom-filter method). The main techniques are listed below.

Bloom filter encoding: Hashes identifiers into bit arrays that can be compared for similarity without revealing the original values.
Secure multi-party computation: Lets parties compute match results on encrypted data without exposing their records to each other.
Trusted third-party linkage: Routes encrypted records through a neutral intermediary that returns only the linked identifiers, not the underlying data.

PPRL matters for healthcare research, government inter-agency programs, and any cross-organizational data sharing where privacy rules prohibit direct PII exchange. For organizations running database matching software on MatchLogic's on-premise architecture, PPRL can be implemented by running the matching engine against both organizations' data inside a controlled, audited environment.

What Should You Look For in Record Linkage Software?

The criteria below cover what record linkage specifically demands. Broader procurement and vendor-evaluation factors are covered in our data matching software buyer's guide.

Fellegi-Sunter implementation: does the tool implement the full probabilistic model with configurable m/u probabilities, field weights, and three-way classification (match, non-match, possible)?
Quasi-identifier support: can it compare names with fuzzy matching techniques, dates with windowed comparison, and addresses with standardization in the same comparison pass?
Blocking for scale: does it provide multi-pass blocking so probabilistic comparison stays computationally feasible at millions of records?
Audit trail: does it log the weights, scores, and threshold classification for every record pair? Government and healthcare programs require full auditability.
PPRL capability: does it support privacy-preserving techniques (Bloom filters, secure computation) for cross-organizational linkage?
On-premise deployment: record linkage operates on the most sensitive identifiers in your data (names, DOBs, SSN fragments), so on-premise processing is essential for HIPAA, GDPR, and government data handling requirements.

Probabilistic Linkage Inside a Single On-Premise Pipeline

MatchLogic provides probabilistic record linkage within a unified on-premise platform: Fellegi-Sunter scoring with configurable field weights, multi-pass blocking for enterprise scale, integrated name and address standardization, and complete audit trails for every linkage decision. The same engine handles the broader fuzzy matching software workload, so record linkage and general matching can share one pipeline rather than living in separate tools.

Frequently Asked Questions

What is record linkage software?

Record linkage software connects records from different databases that refer to the same entity without requiring shared unique identifiers. It uses probabilistic comparison of quasi-identifiers such as names, dates of birth, and addresses to estimate match likelihood, based on the Fellegi-Sunter mathematical framework that classifies each pair as a match, non-match, or possible match.

How does record linkage differ from data matching?

Data matching is the broader category that includes both deterministic (rule-based) and probabilistic methods. Record linkage specifically refers to the probabilistic framework for linking records across independent databases without shared keys. All record linkage is data matching, but not all data matching is record linkage, since deterministic key-based matching is also data matching.

What are quasi-identifiers in record linkage?

Quasi-identifiers are fields that are not unique on their own but become identifying in combination: name, date of birth, address, phone, and sex. Record linkage weights each by its discriminating power, so a rare surname carries more weight than a common one, and combines the field weights into a single match score. They are the inputs that make linkage possible without a shared key.

What is the difference between record linkage and entity resolution?

Record linkage produces scored pairs of records that likely refer to the same entity. Entity resolution takes those pairs further by clustering all related records into one group and building a single golden record from the group. Linkage is the pairwise scoring step; entity resolution adds clustering and canonicalization on top of the same probabilistic foundation.

What is privacy-preserving record linkage?

PPRL lets two organizations link records about the same individuals without sharing the underlying PII. Techniques include Bloom filter encoding, secure multi-party computation, and trusted third-party linkage. It is critical for healthcare research and government inter-agency programs where regulations prohibit direct exchange of names, dates of birth, and other identifiers.

Why does record linkage need to run on-premise?

Record linkage operates on the most sensitive identifiers in your data, including names, dates of birth, and SSN fragments, often drawn from purchased, partner, or cross-agency sources. On-premise processing keeps those quasi-identifiers inside your secured infrastructure, which is what HIPAA, GDPR, and government data-handling rules require, and it is also the basis for controlled PPRL between organizations.