What is data matching and why do enterprises need it?

Data matching is the process of comparing records across datasets to identify entries that refer to the same real-world entity. Enterprises need it because fragmented records create duplicates that inflate costs, weaken analytics, and create compliance risk. According to Gartner, poor data quality costs organizations an average of $12.9 million per year.

What is the difference between deterministic and probabilistic data matching?

Deterministic matching compares fields for exact equality and works well when unique identifiers are present. Probabilistic matching assigns weighted scores to field comparisons and calculates overall match probability, making it effective when data is incomplete or inconsistent. Most enterprise implementations use both approaches.

How accurate is fuzzy matching for enterprise data?

With proper threshold tuning, fuzzy matching typically achieves F1 scores between 0.88 and 0.95. Combining fuzzy matching with probabilistic weighting across multiple fields pushes accuracy higher. Accuracy depends on the algorithm, threshold, and input data quality.

Can data matching run on-premise for regulated industries?

Yes. On-premise data matching platforms process all data within your own secured infrastructure, so sensitive records never leave your network. This helps address data residency and data protection requirements under regulations such as HIPAA, GDPR, and SOX, as well as industry-specific mandates.

How do you measure data matching quality?

Three metrics matter most: Precision (percentage of declared matches that are correct), Recall (percentage of true matches found), and F1 Score (harmonic mean of precision and recall). Enterprise benchmarks target F1 above 0.95.
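As a minimal sketch of how these three metrics relate, the snippet below computes them from pair-level counts. The function name and the example counts are hypothetical and chosen only for illustration; they are not benchmark results.

```python
def matching_quality(true_positives: int, false_positives: int, false_negatives: int) -> dict:
    """Compute precision, recall, and F1 from counts of record-pair decisions."""
    declared_matches = true_positives + false_positives  # pairs the system called a match
    actual_matches = true_positives + false_negatives    # pairs that really are matches

    precision = true_positives / declared_matches if declared_matches else 0.0
    recall = true_positives / actual_matches if actual_matches else 0.0

    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)

    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts from a manually reviewed sample of record pairs.
scores = matching_quality(true_positives=900, false_positives=50, false_negatives=100)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the weaker of the two metrics, so a system cannot reach a high F1 by trading away recall for precision or the reverse.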

What is blocking in data matching and why is it necessary?

Blocking partitions records into subsets sharing a common attribute so the system only compares records within the same block. Without it, comparing 10 million records against each other would require roughly 50 trillion pairwise comparisons. Blocking typically reduces this by more than 99% while preserving high recall.
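The arithmetic behind these figures can be checked with a short sketch. The block layout below (10,000 equally sized blocks of 1,000 records) is a hypothetical simplification; real blocks are uneven, and the function names are illustrative.

```python
def pair_count(n: int) -> int:
    """Number of distinct record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def blocked_pair_count(block_sizes: list[int]) -> int:
    """Pairs compared when records are only compared within their own block."""
    return sum(pair_count(size) for size in block_sizes)


full = pair_count(10_000_000)                    # every record against every other
blocked = blocked_pair_count([1_000] * 10_000)   # assumed: 10,000 blocks of 1,000 records
reduction = 1 - blocked / full

print(f"Without blocking: {full:,} comparisons")
print(f"With blocking:    {blocked:,} comparisons")
print(f"Reduction:        {reduction:.2%}")
```

Under this assumed layout, the comparison count drops from about 50 trillion to about 5 billion. The trade-off is that true matches placed in different blocks are never compared, which is why the choice of blocking attribute affects recall.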

Record Linkage Software: Connecting Records Without Shared Identifiers

Record linkage software identifies and connects records from different data sources that refer to the same real-world entity, without requiring those sources to share a common unique identifier. Formalized by Newcombe et al. in 1959 and mathematically modeled by Fellegi and Sunter in 1969, record linkage uses probabilistic comparison of quasi-identifiers (names, dates of birth, addresses, phone numbers) to estimate the likelihood that two records belong to the same person, organization, or entity. It is the foundational technology behind government census programs, epidemiological studies, healthcare patient matching, and any scenario where data from independent systems must be combined for analysis.

Record linkage is closely related to, but technically distinct from, data matching and entity resolution. Data matching is the broader category that includes both deterministic and probabilistic comparison methods. Record linkage specifically emphasizes the probabilistic framework and the challenge of linking records without shared keys. Entity resolution extends record linkage by adding clustering and canonicalization. For the full taxonomy, see our data matching guide. For how record linkage feeds into entity resolution, see our entity resolution and data linkage guide.

Key Takeaways

Record linkage connects records across databases without shared unique identifiers using probabilistic comparison of quasi-identifiers.
The Fellegi-Sunter model (1969) is the mathematical foundation, assigning agreement/disagreement weights based on field discriminating power.
Record linkage classifies pairs into three categories: match, non-match, and possible match (requiring manual review).
Key applications include government census programs, epidemiological research, healthcare enterprise master patient indexes (EMPI), and cross-agency citizen data integration.
Privacy-preserving record linkage (PPRL) enables linking records across organizations without sharing the underlying PII.
On-premise record linkage ensures that quasi-identifiers (names, DOBs, addresses) used for linking never leave your secured infrastructure.

Figure: MatchLogic Entity Clustering. The record linkage interface shows records from different databases linked into entity clusters based on probabilistic comparison of quasi-identifiers.

What Is the Fellegi-Sunter Model for Record Linkage?

The Fellegi-Sunter model is the mathematical framework that underpins most modern record linkage software. For each pair of records being compared, and for each field in the comparison, the model calculates two probabilities: the m-probability (the probability that the field values agree given that the records are a true match) and the u-probability (the probability that they agree by coincidence given that the records are not a match).

The agreement weight for a field is the log ratio of these probabilities: log2(m/u). When the field values disagree, the field instead contributes a disagreement weight, log2((1-m)/(1-u)), which is typically negative. Fields with high discriminating power (rare values, like an unusual last name) produce higher agreement weights. Fields with low discriminating power (common values, like gender) produce lower weights. Summing the weights across all fields produces an overall match score.
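To make the weighting concrete, here is a minimal sketch of Fellegi-Sunter scoring. It uses log2(m/u) when a field agrees and the complementary disagreement weight log2((1-m)/(1-u)) when it does not. The m and u values in `FIELD_PROBS` are hypothetical, chosen only to illustrate the contrast between a discriminating field and a weak one; in practice they are estimated from the data. Treating a missing value as zero weight is an assumption that mirrors how probabilistic linkage commonly handles missing fields.

```python
import math

# Hypothetical (m, u) probabilities per field, for illustration only.
FIELD_PROBS = {
    "last_name": (0.95, 0.01),   # rarely agrees by coincidence: high discriminating power
    "dob": (0.97, 0.003),
    "gender": (0.98, 0.5),       # agrees by coincidence half the time: low power
}


def agreement_weight(m: float, u: float) -> float:
    """Weight added when the field values agree."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Weight added (usually negative) when the field values disagree."""
    return math.log2((1 - m) / (1 - u))


def match_score(comparison: dict) -> float:
    """Sum field weights. Values are True (agree), False (disagree), or None (missing)."""
    score = 0.0
    for field, agrees in comparison.items():
        if agrees is None:
            continue  # assumption: a missing field contributes zero weight
        m, u = FIELD_PROBS[field]
        score += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return score


for field, (m, u) in FIELD_PROBS.items():
    print(f"{field}: agree {agreement_weight(m, u):+.2f}, disagree {disagreement_weight(m, u):+.2f}")

print(f"score: {match_score({'last_name': True, 'dob': None, 'gender': True}):.2f}")
```

With these assumed probabilities, agreement on last name adds far more to the score than agreement on gender, which is the behavior the model is designed to produce.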

This score is then compared against two thresholds: an upper threshold (above which the pair is classified as a match) and a lower threshold (below which it is classified as a non-match). Pairs scoring between the thresholds are flagged for manual review. This three-way classification (match, non-match, possible match) is the defining feature of the Fellegi-Sunter approach and remains the standard in government and healthcare record linkage programs.
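The three-way decision rule can be sketched in a few lines. The threshold values below are hypothetical; real programs tune them against acceptable false-match and false-non-match rates. Sending scores that fall exactly on a threshold to review is also an assumption of this sketch.

```python
def classify_pair(score: float, lower: float, upper: float) -> str:
    """Classify a record pair by comparing its match score to two thresholds."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score > upper:
        return "match"
    if score < lower:
        return "non-match"
    # Everything between the thresholds (inclusive) goes to manual review.
    return "possible match"


# Hypothetical thresholds and scores, for illustration only.
LOWER, UPPER = 0.0, 10.0
for score in (14.2, 4.7, -6.3):
    print(f"{score:+.1f} -> {classify_pair(score, LOWER, UPPER)}")
```

Widening the gap between the two thresholds sends more pairs to manual review and reduces automatic errors, while narrowing it does the opposite.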

How Does Record Linkage Differ from Deterministic Matching?

Comparison Logic

Deterministic Matching: Binary: field values match or they don't. Rules are explicit.
Probabilistic Record Linkage: Continuous: each field comparison produces a weighted score. Weights reflect discriminating power.

Missing Data Handling

Deterministic Matching: Missing field = rule cannot fire = no match possible for that rule.
Probabilistic Record Linkage: Missing field contributes zero weight. Other fields can still produce a match score above threshold.

Classification

Deterministic Matching: Two categories: match or no match.
Probabilistic Record Linkage: Three categories: match, non-match, possible match (manual review).

Transparency

Deterministic Matching: Full: rules are explicit and auditable.
Probabilistic Record Linkage: High: weights and scores are visible and auditable. Thresholds are configurable.

Best For

Deterministic Matching: Clean data with reliable unique identifiers.
Probabilistic Record Linkage: Messy data without shared identifiers. Government, healthcare, epidemiological research.

Where Is Record Linkage Software Used?

Government: Census and Cross-Agency Linkage

The U.S. Census Bureau developed many of the foundational record linkage techniques still in use today. Census programs link survey responses across years to track population changes. Cross-agency programs link tax, benefits, health, and housing records to detect fraud, measure program effectiveness, and improve service delivery. The FAA/SSA pilot matching case (40,000 Northern California pilots matched against disability records, yielding 40 arrests) remains a classic demonstration of cross-agency record linkage.

Healthcare: Patient Matching Across Systems

Record linkage is the foundation of enterprise master patient index (EMPI) systems. Hospitals, clinics, labs, and pharmacies each assign their own patient IDs. Record linkage using probabilistic comparison of name, DOB, address, and phone connects patient records across these systems without a shared identifier. A 500-bed hospital system reduced its duplicate rate from 11.2% to 0.8% using probabilistic record linkage with multi-pass blocking.

Epidemiology and Public Health Research

Researchers link health records, vital statistics, environmental exposure data, and census information to study disease patterns, treatment outcomes, and population health trends. Record linkage enables longitudinal studies that track individuals across datasets collected over decades, connecting childhood health records to adult outcomes without a universal patient identifier.

"Matched 1.8 million records across three systems with under 2% false positives. Finally have a single source of truth we actually trust."

— Robert Tanaka, Director of Data Operations, Summit Financial Group

1.8M records linked using probabilistic record linkage

What Is Privacy-Preserving Record Linkage (PPRL)?

Privacy-preserving record linkage enables two organizations to link records about the same individuals without either organization sharing the underlying PII with the other. PPRL techniques include Bloom filter encoding (hashing identifiers into bit arrays that can be compared without revealing the original values), secure multi-party computation, and trusted third-party linkage services.
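As an illustration of the Bloom filter approach, the sketch below hashes the character bigrams of a name into a set of bit positions using a keyed hash, then compares two encodings with the Dice coefficient. All names and parameters here (`bloom_encode`, the 1,024-bit size, four hash functions, the shared secret) are assumptions made for the example. It is a teaching sketch, not a hardened PPRL scheme and not a description of any specific product's implementation.

```python
import hashlib
import hmac


def bigrams(value: str) -> set[str]:
    """Split a normalized value into overlapping two-character tokens."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value: str, secret: bytes, size: int = 1024, num_hashes: int = 4) -> set[int]:
    """Return the set of bit positions switched on for this value."""
    bits = set()
    for gram in bigrams(value):
        for i in range(num_hashes):
            # A keyed hash (HMAC) so positions cannot be recomputed without the secret.
            digest = hmac.new(secret, f"{i}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice_similarity(a: set[int], b: set[int]) -> float:
    """Dice coefficient of two bit-position sets: 2 * |A and B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Both organizations would need to agree on the same secret and parameters.
SECRET = b"hypothetical-shared-secret"
enc_a = bloom_encode("Katherine", SECRET)
enc_b = bloom_encode("Catherine", SECRET)
enc_c = bloom_encode("Robert", SECRET)

print(f"Katherine vs Catherine: {dice_similarity(enc_a, enc_b):.2f}")
print(f"Katherine vs Robert:    {dice_similarity(enc_a, enc_c):.2f}")
```

Each party would share only the encoded bit positions, never the names themselves. Similar spellings still score close together because they share most of their bigrams, which is what allows fuzzy comparison on encoded values.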

PPRL is increasingly important for healthcare research (linking hospital records across health systems), government inter-agency programs (connecting tax and benefits records), and cross-organizational data sharing where privacy regulations prohibit direct PII exchange. For organizations using MatchLogic's on-premise architecture, PPRL can be implemented by running the matching engine on both organizations' data within a controlled, audited environment.

What Should You Look For in Record Linkage Software?

Fellegi-Sunter Implementation: Does the tool implement the full Fellegi-Sunter probabilistic model with configurable m/u probabilities, field weights, and three-way classification (match, non-match, possible)?

Quasi-Identifier Support: Can it compare names (with fuzzy algorithms), dates (with windowed comparison), addresses (with standardization), and other quasi-identifiers simultaneously?

Blocking for Scale: Does it provide multi-pass blocking to make probabilistic comparison computationally feasible at millions of records?

Audit Trail: Does it log the weights, scores, and threshold classification for every record pair? Government and healthcare record linkage programs require full auditability.

PPRL Capability: Does it support privacy-preserving techniques (Bloom filters, secure computation) for cross-organizational linkage?

On-Premise Deployment: Record linkage operates on the most sensitive identifiers in your data (names, DOBs, SSN fragments). On-premise processing is essential for HIPAA, GDPR, and government data handling requirements.

MatchLogic provides probabilistic record linkage within a unified on-premise platform: Fellegi-Sunter scoring with configurable field weights, multi-pass blocking for enterprise scale, integrated name and address standardization, and complete audit trails for every linkage decision. For a technical breakdown of the underlying matching techniques, see our algorithm guide.

Frequently Asked Questions

What is record linkage software?

Record linkage software connects records from different databases that refer to the same entity without requiring shared unique identifiers. It uses probabilistic comparison of quasi-identifiers (names, DOBs, addresses) to estimate match likelihood, based on the Fellegi-Sunter mathematical framework.

How does record linkage differ from data matching?

Data matching is the broader category that includes both deterministic (rule-based) and probabilistic methods. Record linkage specifically refers to the probabilistic framework for linking records across independent databases without shared keys. All record linkage is data matching; not all data matching is record linkage.

What is privacy-preserving record linkage?

PPRL enables two organizations to link records about the same individuals without sharing the underlying PII. Techniques include Bloom filter encoding, secure multi-party computation, and trusted third-party linkage. It is critical for healthcare research and government inter-agency programs.

Can record linkage software run on-premise?

Yes. Record linkage operates on the most sensitive identifiers (names, DOBs, SSN fragments). MatchLogic processes all record linkage on-premise with full audit trails, meeting HIPAA, GDPR, and government data handling requirements.
