What is data matching and why do enterprises need it?

Data matching is the process of comparing records across datasets to identify entries that refer to the same real-world entity. Enterprises need it because fragmented records create duplicates that inflate costs, weaken analytics, and create compliance risk. According to Gartner, poor data quality costs organizations an average of $12.9 million per year.

What is the difference between deterministic and probabilistic data matching?

Deterministic matching compares fields for exact equality and works well when unique identifiers are present. Probabilistic matching assigns weighted scores to field comparisons and calculates overall match probability, making it effective when data is incomplete or inconsistent. Most enterprise implementations use both approaches.

How accurate is fuzzy matching for enterprise data?

With proper threshold tuning, fuzzy matching typically achieves F1 scores between 0.88 and 0.95. Combining fuzzy matching with probabilistic weighting across multiple fields pushes accuracy higher. Accuracy depends on the algorithm, threshold, and input data quality.

Can data matching run on-premise for regulated industries?

Yes. On-premise data matching platforms process all data within your secured infrastructure, so sensitive records never leave your network. This helps satisfy data residency and data protection requirements under HIPAA, GDPR, SOX, and industry-specific mandates.

How do you measure data matching quality?

Three metrics matter most: Precision (percentage of declared matches that are correct), Recall (percentage of true matches found), and F1 Score (harmonic mean of precision and recall). Enterprise benchmarks target F1 above 0.95.
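As a minimal sketch of how these three metrics are computed, assume the declared matches and the true matches are both available as lists of record-ID pairs. The function name `match_quality` and the sample IDs are illustrative, not part of any particular product.

```python
def match_quality(predicted_pairs, true_pairs):
    """Return (precision, recall, f1) for a set of declared matches."""
    # Store each pair as a frozenset so (a, b) and (b, a) count as the same pair.
    predicted = {frozenset(p) for p in predicted_pairs}
    truth = {frozenset(p) for p in true_pairs}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data: 4 declared matches, 5 true matches, 3 in common.
predicted = [("r1", "r2"), ("r3", "r4"), ("r5", "r6"), ("r7", "r8")]
truth = [("r2", "r1"), ("r3", "r4"), ("r5", "r6"), ("r9", "r10"), ("r11", "r12")]
precision, recall, f1 = match_quality(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.75 recall=0.60 f1=0.67
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are high, which is why it is the usual single-number benchmark.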

What is blocking in data matching and why is it necessary?

Blocking partitions records into subsets sharing a common attribute so the system only compares records within the same block. Without it, deduplicating 10 million records would require roughly 50 trillion pairwise comparisons (n(n-1)/2). With well-chosen keys, blocking cuts this by more than 99% while preserving high recall.

Entity Resolution and Data Linkage: Connecting the Dots Across Databases

Entity resolution data linkage is the process of identifying records across two or more separate databases that refer to the same real-world entity (a person, organization, product, or location) and connecting them into a unified view, even when no shared unique identifier exists. Data linkage is the matching mechanism; entity resolution is the broader process that uses linkage results to cluster, merge, and produce a single trusted profile for each entity.

For enterprises operating across multiple CRM, ERP, billing, and operational systems, cross-database linkage is the technical foundation for single-customer views, master data management, and regulatory compliance. This guide covers the relationship between linkage and entity resolution, the techniques that make cross-database matching work, and how to run linkage, the matching foundation beneath entity resolution, at scale.

Key Takeaways

✓ Data linkage connects records across separate databases that describe the same entity; entity resolution adds clustering and golden record creation on top of linkage.
✓ Record linkage originated in 1946 with Halbert Dunn's vital statistics work and was formalized by the Fellegi-Sunter probability model in 1969.
✓ Blocking reduces the computational cost of cross-database linkage from quadratic (n²) to near-linear by partitioning records into comparable groups.
✓ Privacy-preserving record linkage (PPRL) enables cross-organization linkage without exposing raw PII, using techniques like Bloom filters and secure multi-party computation.
✓ Transitive closure connects records that were never directly compared: if A matches B and B matches C, all three are linked to the same entity.

How Are Data Linkage and Entity Resolution Related?

The terms data linkage, record linkage, and entity resolution are often used interchangeably, but they describe different scopes of the same problem, and the distinction determines which capabilities you need from your tooling.

Record linkage (also called data linkage) compares records from one or more datasets and determines which pairs refer to the same entity, outputting linked pairs with confidence scores. The term was coined in 1946 by Halbert Dunn in the context of linking vital statistics across US state registries, and the mathematical foundation was set in 1969 by Fellegi and Sunter, whose probabilistic model remains the basis of most modern implementations.

Entity resolution starts where record linkage ends. It takes the linked pairs, applies transitive closure to form entity clusters, resolves conflicting values with survivorship rules, and produces a canonical golden record. Linkage answers “do these two records describe the same thing?”, while entity resolution answers “who is this entity, given everything we know?” Choosing the right entity resolution software comes down to how well a platform performs that clustering and survivorship.

Aspect	Record Linkage (Data Linkage)	Entity Resolution
Scope	Pairwise comparison across datasets	End-to-end: linkage, clustering, survivorship, golden record
Output	Linked pairs with classification and confidence	Unified entity profiles with full lineage
Handles transitivity	No; pairs evaluated independently	Yes; transitive closure or graph clustering
Conflict resolution	Not included; returns linked pairs only	Survivorship rules choose source values
Use-case fit	Research, one-time integration, epidemiology	MDM, Customer 360, patient matching, KYC and AML
Origin	Dunn (1946), Fellegi-Sunter (1969)	Evolved from record linkage in the 2000s

How Does Cross-Database Record Linkage Work?

Cross-database linkage follows a structured pipeline that finds true matches efficiently while minimizing false positives and false negatives. It addresses a core computational challenge: comparing every record in Database A against every record in Database B produces n times m candidate pairs, so two systems of 5 million records each would generate 25 trillion comparisons. The same pipeline underpins database matching software across operational systems.

Step 1: Schema Harmonization

Before records can be compared, the matching fields must be mapped across schemas. One database may store “First_Name” and “Last_Name” while another uses a single “Full_Name,” and date and address formats differ. Schema harmonization creates a common field structure without modifying the source data.
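The mapping described above can be sketched as a pair of lookup tables plus one derivation rule. The column names (`DOB`, `BirthDate`), the common field names, and the `harmonize` helper are assumptions for illustration. Splitting a full name on whitespace is a deliberate simplification that will mishandle multi-part surnames.

```python
# Hypothetical field mappings: source column -> common field name.
CRM_MAPPING = {"First_Name": "first_name", "Last_Name": "last_name", "DOB": "birth_date"}
BILLING_MAPPING = {"Full_Name": "full_name", "BirthDate": "birth_date"}


def harmonize(record, mapping):
    """Return a new record in the common schema; the source record is not modified."""
    common = {target: record.get(source) for source, target in mapping.items()}
    # Derive first/last name when the source only stores a single full-name field.
    full_name = common.pop("full_name", None)
    if full_name:
        parts = full_name.split()
        common["first_name"] = parts[0]
        common["last_name"] = parts[-1] if len(parts) > 1 else ""
    return common


crm_row = {"First_Name": "Maria", "Last_Name": "Lopez", "DOB": "1984-03-07"}
billing_row = {"Full_Name": "Maria Lopez", "BirthDate": "1984-03-07"}

# Both rows now expose the same fields and can be compared directly.
print(harmonize(crm_row, CRM_MAPPING) == harmonize(billing_row, BILLING_MAPPING))
```

Building new dictionaries rather than editing rows in place mirrors the requirement that harmonization leaves the source data untouched.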

Step 2: Data Standardization

Standardization normalizes values to reduce noise: expanding nicknames, removing honorifics, resolving address abbreviations, and normalizing phone formats. The quality of this step directly affects match accuracy, and the standard reference text on the subject (Christen, Data Matching, Springer 2012) shows that standardization alone can meaningfully improve match recall.
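A rough sketch of these normalization rules follows, using only the standard library. The lookup tables are tiny illustrative samples, and the function names are assumptions. Real reference tables are far larger and must handle ambiguity (for example, "St" can mean "Street" or "Saint").

```python
import re

# Small illustrative lookup tables; a production system would use far larger ones.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}
HONORIFICS = {"mr", "mrs", "ms", "dr", "prof"}
ADDRESS_ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road", "apt": "apartment"}


def _tokens(text):
    # Lowercase, replace punctuation with spaces, and split on whitespace.
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def standardize_name(name):
    """Drop honorifics and expand known nicknames."""
    kept = [t for t in _tokens(name) if t not in HONORIFICS]
    return " ".join(NICKNAMES.get(t, t) for t in kept)


def standardize_address(address):
    """Expand common address abbreviations."""
    return " ".join(ADDRESS_ABBREVIATIONS.get(t, t) for t in _tokens(address))


def standardize_phone(phone):
    """Keep digits only so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", phone)


print(standardize_name("Mr. Bob Smith"))          # robert smith
print(standardize_address("12 Main St., Apt 4"))  # 12 main street apartment 4
print(standardize_phone("(555) 010-2345"))        # 5550102345
```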

Step 3: Blocking

Blocking partitions records into groups that share a blocking-key value, so only records in the same block are compared, which reduces a 25-trillion-comparison space to a manageable fraction. The trade-off is coverage, since overly restrictive keys miss matches where the blocking attribute itself has errors. Techniques such as sorted neighborhood, canopy clustering, and locality-sensitive hashing generate overlapping candidate groups rather than strict partitions, which raises recall at a modest cost in extra comparisons.
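To make the idea concrete, here is a minimal sketch of standard key-based blocking between two small datasets. The blocking key (first three letters of the last name plus postal code) and the record layout are assumptions chosen for illustration.

```python
from collections import defaultdict


def blocking_key(record):
    """Hypothetical key: first three letters of the last name plus postal code."""
    return (record["last_name"][:3].lower(), record["postal_code"])


def candidate_pairs(db_a, db_b, key_fn=blocking_key):
    """Yield (id_a, id_b) only for records that share a blocking-key value."""
    blocks = defaultdict(list)
    for rec in db_b:
        blocks[key_fn(rec)].append(rec["id"])
    for rec in db_a:
        for id_b in blocks.get(key_fn(rec), []):
            yield rec["id"], id_b


db_a = [
    {"id": "a1", "last_name": "Ferreira", "postal_code": "10115"},
    {"id": "a2", "last_name": "Nguyen", "postal_code": "94107"},
    {"id": "a3", "last_name": "Ferreria", "postal_code": "10115"},  # typo after the key prefix
]
db_b = [
    {"id": "b1", "last_name": "Ferreira", "postal_code": "10115"},
    {"id": "b2", "last_name": "Nguyen", "postal_code": "94107"},
    {"id": "b3", "last_name": "Okafor", "postal_code": "60601"},
]

pairs = list(candidate_pairs(db_a, db_b))
full_space = len(db_a) * len(db_b)
print(pairs)  # [('a1', 'b1'), ('a2', 'b2'), ('a3', 'b1')]
print(f"{len(pairs)} of {full_space} pairs compared")  # 3 of 9 pairs compared
```

The misspelled "Ferreria" still lands in the right block because the error falls after the three-letter prefix. A typo inside the prefix or the postal code would drop the pair, which is the coverage risk described above.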

Step 4: Pairwise Comparison

Within each block, record pairs are compared field by field with similarity functions suited to the data type: Jaro-Winkler similarity or Double Metaphone phonetic codes for names, exact or within-tolerance comparison for dates and phones, and token-based cosine similarity for addresses. Each comparison produces a score between 0 and 1, the per-field output of the entity matching software engine.
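The sketch below shows field-by-field comparison using only the Python standard library. Note the substitution: `difflib.SequenceMatcher` stands in for Jaro-Winkler, which the standard library does not provide, so the name scores will differ from a true Jaro-Winkler implementation. The record fields and function names are assumptions.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def name_similarity(a, b):
    # Stand-in for Jaro-Winkler: difflib's ratio also returns a value in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_similarity(a, b):
    return 1.0 if a == b else 0.0


def token_cosine(a, b):
    """Cosine similarity between token-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    # Clamp to 1.0 to absorb floating-point rounding.
    return min(1.0, dot / norm) if norm else 0.0


def compare(rec_a, rec_b):
    """Return one similarity score per field for a candidate pair."""
    return {
        "name": name_similarity(rec_a["name"], rec_b["name"]),
        "birth_date": exact_similarity(rec_a["birth_date"], rec_b["birth_date"]),
        "address": token_cosine(rec_a["address"], rec_b["address"]),
    }


rec_a = {"name": "Jon Smith", "birth_date": "1984-03-07", "address": "12 Main Street Springfield"}
rec_b = {"name": "John Smith", "birth_date": "1984-03-07", "address": "12 Main Street"}
scores = compare(rec_a, rec_b)
print({field: round(score, 3) for field, score in scores.items()})
```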

Step 5: Classification

Field scores combine into a composite using deterministic rules, the probabilistic Fellegi-Sunter model, or an AI classifier trained on labeled pairs. Pairs are then classified as match (auto-link), non-match (auto-reject), or possible match (manual review).
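The three-way decision can be sketched with the simplest of the options above, a deterministic weighted-average rule. The field weights and both thresholds are hypothetical values that would need tuning against real data. This is not the Fellegi-Sunter model.

```python
# Hypothetical field weights; they do not need to sum to 1 because the score is normalized.
WEIGHTS = {"name": 0.5, "birth_date": 0.3, "address": 0.2}


def classify(field_scores, weights=WEIGHTS, upper=0.85, lower=0.60):
    """Combine per-field scores (each 0..1) and return (label, composite score)."""
    total_weight = sum(weights.values())
    composite = sum(field_scores[f] * w for f, w in weights.items()) / total_weight
    if composite >= upper:
        return "match", composite            # auto-link
    if composite < lower:
        return "non-match", composite        # auto-reject
    return "possible match", composite       # route to manual review


print(classify({"name": 0.95, "birth_date": 1.0, "address": 0.9})[0])  # match
print(classify({"name": 0.90, "birth_date": 0.0, "address": 1.0})[0])  # possible match
print(classify({"name": 0.80, "birth_date": 0.0, "address": 0.9})[0])  # non-match
```

Widening the gap between the two thresholds sends more pairs to manual review. Narrowing it automates more decisions at the cost of more errors.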

What Is the Fellegi-Sunter Model and Why Does It Matter?

The Fellegi-Sunter model, published in the Journal of the American Statistical Association in 1969, provides the mathematical foundation for probabilistic record linkage. It formalizes the intuition that some fields are more informative than others when deciding whether two records represent the same entity.

The model calculates two probabilities per field comparison. The m-probability is the probability that the values agree given a true match, and the u-probability is the probability that they agree by coincidence among non-matches. The ratio m/u, usually taken as a logarithm, determines that field's agreement weight, and the ratio (1 - m)/(1 - u) determines its disagreement weight. A rare last name that agrees carries far more weight than a common first name such as “Michael.”

Per-field weights sum into a composite score: above an upper threshold is a match, below a lower threshold is a non-match, and the middle band is a possible match for review. The model can be estimated without labeled training data using Expectation-Maximization, which makes it practical when ground truth is unavailable.
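A minimal sketch of the weight calculation follows. The m- and u-probabilities and the two thresholds are made-up values for illustration. In practice they would be estimated, for example with Expectation-Maximization, and the thresholds chosen from acceptable error rates. Binary agree/disagree comparisons and base-2 logarithms are simplifying assumptions.

```python
import math

# Hypothetical m- and u-probabilities per field.
FIELD_PARAMS = {
    "last_name": {"m": 0.95, "u": 0.01},
    "first_name": {"m": 0.90, "u": 0.05},
    "birth_year": {"m": 0.98, "u": 0.02},
}


def field_weight(m, u, agrees):
    """Log2 likelihood ratio: agreement adds weight, disagreement subtracts it."""
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))


def composite_weight(agreement):
    """Sum per-field weights for a dict of field -> True/False agreement."""
    return sum(field_weight(p["m"], p["u"], agreement[f]) for f, p in FIELD_PARAMS.items())


def fs_classify(weight, upper=10.0, lower=0.0):
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


all_agree = composite_weight({"last_name": True, "first_name": True, "birth_year": True})
mixed = composite_weight({"last_name": True, "first_name": False, "birth_year": True})
none_agree = composite_weight({"last_name": False, "first_name": False, "birth_year": False})

for label, w in [("all agree", all_agree), ("mixed", mixed), ("none agree", none_agree)]:
    print(f"{label}: weight={w:.2f} -> {fs_classify(w)}")
```

With these numbers, agreement on last name (u = 0.01) contributes more weight than agreement on first name (u = 0.05), which matches the intuition that agreement on a field that rarely agrees by coincidence is stronger evidence.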

How Does Transitive Closure Affect Data Linkage Quality?

Pairwise linkage evaluates each pair independently, which creates a transitivity problem. If linkage finds that A matches B and B matches C, the logical conclusion is that A, B, and C are the same entity, but pairwise linkage does not infer this automatically because A and C were never directly compared.

Entity resolution applies transitive closure, or graph clustering such as connected components and correlation clustering, to form complete entity clusters. This is the point where data linkage becomes entity resolution: the transition from pairs to unified entities.

Transitive closure introduces a risk of error propagation: if A-B is a false positive and B-C is a true match, closure wrongly links A into C's cluster and contaminates the golden record. Enterprise platforms mitigate this with cluster-level validation: maximum cluster size limits, minimum intra-cluster similarity thresholds, and automatic flagging of clusters where any link falls below a confidence threshold.
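The clustering and validation steps can be sketched with a union-find structure, which computes connected components over the linked pairs. The size limit, the confidence floor, and the function names are hypothetical. The check shown covers only two of the validations mentioned above (maximum cluster size and weakest-link confidence).

```python
from collections import defaultdict


def cluster(links):
    """Group record ids into entity clusters via union-find (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, _score in links:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for rec in list(parent):
        clusters[find(rec)].add(rec)
    return list(clusters.values())


def flag_suspicious(clusters, links, max_size=50, min_link_score=0.80):
    """Return clusters that are too large or contain any link below the confidence floor."""
    flagged = []
    for members in clusters:
        weakest = min(s for a, b, s in links if a in members and b in members)
        if len(members) > max_size or weakest < min_link_score:
            flagged.append(members)
    return flagged


# (record_a, record_b, confidence) triples; A and C are never compared directly.
links = [("A", "B", 0.95), ("B", "C", 0.91), ("D", "E", 0.97), ("E", "F", 0.72)]
clusters = cluster(links)
flagged = flag_suspicious(clusters, links)
print(sorted(sorted(c) for c in clusters))  # [['A', 'B', 'C'], ['D', 'E', 'F']]
print(sorted(sorted(c) for c in flagged))   # [['D', 'E', 'F']]
```

The D-E-F cluster is flagged because its E-F link scores 0.72, so a reviewer can confirm or break that link before it shapes a golden record.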

What Is Privacy-Preserving Record Linkage?

Not all linkage happens inside one organization's perimeter. Cross-organization linkage is common in public health, government, and financial services, where organizations need to know whether they share records about the same entity without exposing the underlying personal data.

Privacy-preserving record linkage addresses this with several techniques. Bloom filter encoding converts field values into binary vectors that preserve approximate similarity but are designed to be hard to reverse without the encoding secret, secure multi-party computation lets two parties compute match scores on encrypted data, and trusted third-party models route encrypted records through a neutral intermediary that returns only linked identifiers.
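The Bloom filter idea can be sketched as follows: each party splits a value into character bigrams, hashes every bigram with a shared secret into a fixed-size bit array, and exchanges only the bit positions. The filter size, number of hash functions, secret, and function names are illustrative assumptions. Production PPRL schemes add further hardening that this sketch omits.

```python
import hashlib
import hmac


def bigrams(value):
    """Character bigrams of a padded, lowercased string."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, secret, size=256, num_hashes=4):
    """Encode a string as the set of bit positions set by keyed hashes of its bigrams."""
    bits = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            digest = hmac.new(secret, f"{k}|{gram}".encode("utf-8"), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice_similarity(bits_a, bits_b):
    """Dice coefficient on set bit positions; 1.0 means identical encodings."""
    if not bits_a and not bits_b:
        return 0.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


SECRET = b"shared-secret-agreed-out-of-band"  # hypothetical key known to both parties only

katherine = bloom_encode("Katherine", SECRET)
catherine = bloom_encode("Catherine", SECRET)
roberto = bloom_encode("Roberto", SECRET)

print(f"Katherine vs Catherine: {dice_similarity(katherine, catherine):.2f}")
print(f"Katherine vs Roberto:   {dice_similarity(katherine, roberto):.2f}")
```

Because similar strings share most of their bigrams, their encodings share most of their set bits, so approximate matching still works on the encoded form.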

This is not theoretical. The Australian Institute of Health and Welfare uses Bloom-filter linkage to connect patient records across state health systems without centralizing PII, and the US Census Bureau uses similar privacy-preserving methods across administrative datasets. For enterprises bound by GDPR or HIPAA, these methods enable linkage that data-sharing restrictions would otherwise prohibit.

What Does Entity Resolution Data Linkage Look Like at Enterprise Scale?

Consider a global manufacturer running 22 plants across 8 countries, whose procurement function uses three regional ERP instances, each with its own vendor master. The same raw-material supplier appears as “BASF SE” in Europe, as “BASF Corporation” in North America, and under its Chinese-language name in Asia-Pacific, with different addresses and bank details in each system.

Without cross-database linkage, the manufacturer cannot consolidate spend analysis, negotiate volume discounts, or enforce consistent payment terms. The procurement team estimates 15 to 20 percent vendor duplication across the three ERPs, representing $4 million to $6 million a year in missed volume discounts and duplicate processing overhead.

An entity resolution platform ingests vendor records from all three ERPs, standardizes company names (expanding abbreviations, transliterating non-Latin characters), applies multi-field probabilistic matching across name, address, tax ID, and bank account, and produces entity clusters. The BASF cluster links all three regional records at a composite confidence of 97.2 percent, producing one golden vendor record with survivorship rules that keep each subsidiary's regional address while unifying the parent.

One supplier record per parent across three regional ERPs

“We linked the same supplier across three regional ERPs, even across a non-Latin name, and consolidated to one vendor record per parent, which finally let us negotiate on real global volume.”

Joaquim Ferreira, Head of Procurement Data, Ardent Materials Group

How Do You Maintain Data Linkages as Source Data Changes?

Most implementations focus on the initial matching project, but the harder problem is keeping linkages accurate as source systems generate new records, update existing ones, and delete records that should propagate as unlinks.

Batch re-linkage runs the full process on a schedule. It is simple to implement but leaves a window where new records exist unlinked, which is unacceptable for real-time use cases such as fraud detection. Incremental linkage evaluates each new or updated record against the existing entity index as it arrives, eliminating the delay but requiring a persistent index and real-time comparison infrastructure.
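Incremental linkage can be sketched as an index that each arriving record is checked against. The `EntityIndex` class below is an in-memory stand-in for the persistent index described above, and its blocking key and match rule (postal code and case-insensitive email equality) are simplifying assumptions. It attaches a record to the first matching entity and does not handle updates, deletes, or merging two existing entities.

```python
from collections import defaultdict


class EntityIndex:
    """In-memory stand-in for a persistent entity index keyed by blocking value."""

    def __init__(self, key_fn, match_fn):
        self.key_fn = key_fn
        self.match_fn = match_fn
        self.blocks = defaultdict(list)  # blocking key -> list of (entity_id, record)
        self.next_entity_id = 1

    def link(self, record):
        """Attach the record to an existing entity or create a new one; return the entity id."""
        key = self.key_fn(record)
        for entity_id, existing in self.blocks[key]:
            if self.match_fn(record, existing):
                self.blocks[key].append((entity_id, record))
                return entity_id
        entity_id = self.next_entity_id
        self.next_entity_id += 1
        self.blocks[key].append((entity_id, record))
        return entity_id


index = EntityIndex(
    key_fn=lambda r: r["postal_code"],
    match_fn=lambda a, b: a["email"].lower() == b["email"].lower(),
)

# Records arrive one at a time and are linked immediately.
first = index.link({"email": "m.lopez@example.com", "postal_code": "10115"})
second = index.link({"email": "j.chen@example.com", "postal_code": "10115"})
third = index.link({"email": "M.Lopez@Example.com", "postal_code": "10115"})
print(first, second, third)  # 1 2 1
```

Only the records in the arriving record's block are compared, so the cost per new record depends on block size rather than on the size of the whole index.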

Hybrid approaches combine both: incremental linkage handles day-to-day updates, while periodic batch re-linkage catches edge cases and splits clusters on new contradictory evidence. MatchCore supports both batch and incremental matching, and MatchSense adds pre-trained, explainable AI entity resolution for the clustering and golden-record stages, both running on-premise so linkage of sensitive records never leaves your network.

Frequently Asked Questions

What is the difference between record linkage and entity resolution?

Record linkage compares records to determine which pairs refer to the same entity, producing linked pairs with confidence scores. Entity resolution is the broader end-to-end process that adds transitive closure to form clusters, survivorship rules to resolve conflicts, and golden-record creation. Record linkage is one step within the entity resolution pipeline.

What is the Fellegi-Sunter model?

The Fellegi-Sunter model is a probabilistic framework for record linkage published in 1969. It calculates match weights per field from the ratio of the m-probability (agreement given a true match) to the u-probability (agreement by chance). Records are classified as matches, non-matches, or possible matches against composite thresholds, and it can be estimated without labeled data using Expectation-Maximization.

How does blocking work in data linkage?

Blocking partitions records into groups that share a blocking-key value, so only records in the same block are compared, reducing cost from quadratic to near-linear. Common keys include the first few characters of a last name plus a ZIP code or birth year. Sorted neighborhood and locality-sensitive hashing generate overlapping candidate groups instead of strict partitions, which improves recall.

What is transitive closure in entity resolution?

Transitive closure links records that were never directly compared: if A matches B and B matches C, all three are assigned to the same entity. It turns independent linked pairs into complete entity clusters, and enterprise platforms add cluster-level validation to prevent a single false link from contaminating a cluster.

Can data linkage be performed across organizations without sharing raw data?

Yes. Privacy-preserving record linkage uses Bloom filter encoding, secure multi-party computation, and trusted third-party intermediaries to compare records in encrypted or tokenized form. These methods let organizations identify shared entities without exposing personal data, and they are used by public health agencies, census bureaus, and financial institutions.

How many records can modern data linkage handle?

Enterprise platforms routinely process datasets of 10 million to 100 million records. The key constraint is not record count but the number of candidate pairs the blocking strategy generates, so effective blocking reduces trillions of potential comparisons to a manageable subset. Scalability depends on blocking algorithms, parallelization, and indexing.

What happens when data linkage produces incorrect matches?

False positives are managed with configurable confidence thresholds, manual review queues for borderline cases, and cluster-level validation that flags suspicious entity groups. Enterprise platforms keep full audit trails, so false links can be identified, reviewed, and corrected without rebuilding the entire linkage.