LinkSolv 8.3 Help Pages and User Guide

Record Linkage FAQs

 
What is Record Linkage? Information from two data files can be treated as if reported together if corresponding records for the same person or other entity can be paired together accurately (linked or matched). Three or more files can be linked by linking two at a time. If records to be linked lack unique identifiers or data errors are common, you can never be completely sure about the true link status of any pair of records -- is it matched or unmatched? Probabilistic record linkage techniques are statistical procedures for finding true linked pairs by comparing data values on candidate pairs of records and calculating the probability that each pair is a true link (or true match) given comparison outcomes -- agreement, disagreement, or missing. Statistical details can vary substantially between different implementations of these techniques. The goal is to choose implementation details and data values for comparison that work well for your files so that most true links are found and most probability estimates are accurate.
 
Any data values that are reported in both files can be compared -- name, address, birth date, age, sex, work phone, salary level, marriage status, date of a hospital visit, etc. In general, data values agree on most true matched pairs unless, by chance, one of the records happens to have incorrect data. Data values disagree on most true unmatched pairs unless, by chance, two unrelated records happen to have the same value. Consequently, each agreement increases the probability that a candidate pair is a true match while each disagreement decreases the probability. Agreement in record linkage can mean equal reported values, equal transformed values such as SOUNDEX codes for names, or values that are close in some sense such as two dates that are one day apart. Data values to be compared usually have to be standardized so that equivalent information is coded in the same way. With accurate probabilities it is easy to distinguish good links from the rest. However, it is often the case that some true matched pairs have low probabilities and some true unmatched pairs have high probabilities. Deciding how to handle these outliers is an important implementation detail.
 
Typical results from a record linkage project are shown in the following extract from a LinkSolv display. Note that this linkage was done with fake data created so that Record IDs would be equal for the same patient in different hospitals -- true links can be determined by inspection. This was purely for demonstration purposes and would not be the case with real data.
 
Match Probability
Source
Record
ID
First
Soundex
Home
Zip
Last
Initial
Age
Birth
MM/DD
Sex
 
 
 
 
 
 
 
 
 
0.0827
Hospital A
A60016715
 
03053
P
7
03/31
F
0.0827
Hospital B
B60016715
F342
03051
P
7
03/31
F
 
 
 
 
 
 
 
 
 
0.9984
Hospital A
A60016713
K135
03455
G
14
04/30
M
0.9984
Hospital B
B60016713
 
03455
G
14
04/30
M
 
 
 
 
 
 
 
 
 
0.9747
Hospital A
A60016712
C624
03051
J
10
07/31
M
0.9747
Hospital B
B60016712
C624
03051
J
30
07/31
M
 
 
 
 
 
 
 
 
 
0.1658
Hospital A
A60016711
A416
03751
 
9
11/08
M
0.1658
Hospital B
B60016711
 
03751
N
29
11/08
M
 
 
 
 
 
 
 
 
 
0.9971
Hospital A
A60016567
J263
03235
W
11
11/06
M
0.9971
Hospital B
B60016662
J263
03235
R
11
11/06
M
 
Probabilistic record linkage with LinkSolv is not for everyone. It is harder to develop a statistical model for linkage that fits your data than to simply pick pairs of complete records that agree on most fields as true links. Suppose you have two or more independently-built data files describing different populations and you want to draw valid conclusions about all cases in both populations, from simply estimating the number of true links to listing all links above a cutoff probablility, estimating the number of false positive or false negative links in a sample, or more sophisticated studies such as regression analyses. Three situations can signal that simpler linkage techniques may not work for your files. First, there are no accurate unique identifiers like SSN for all records in both files. Second, misclassification and non-response are common enough that true linkage status is uncertain for many candidate record pairs. Third, expert manual review that can determine true link status for all marginal pairs is not feasible. For example, the files might describe all occupants of vehicles in crashes reported by police and all patients treated in hospitals for injuries. You might want to estimate total hospital costs for occupants of SUVs in reported crashes, predict next year’s costs, or infer the change in hospital costs if all occupants had been using a seat belt or no drivers had been texting on cell phones.
 
It will take extra time to evaluate alternate linkage models for good fit before choosing your final model, possibly including selection and linkage of sample real data, creation and linkage of fake data statistically like your real data, evaluation of different comparison methods and tolerances, evaluation of comparison outcomes for which one outcome is predicted by another, and evaluation of estimates of model parameters.
 
How do you use LinkSolv to link records? You work through a series of interactive LinkSolv dialogs. There are dialogs for managing projects, preparing data for linkage, specifying linkage models, performing matches, and reviewing match results. For example, the Specify Match Fields tab, step two in the Specify Match dialog is shown below in a screen shot from LinkSolv. This is where you describe the linkage model -- which fields to compare and how to compare them in order to calculate match probabilities for candidate pairs. Upstream dialogs are for making reported data better for linkage purposes by standardizing or converting reported values. Downstream dialogs are for fitting and checking the model. Often you learn something about your data in a downstream step that means revisiting an upstream step for a better linkage.
 
LinkSolv can accept equal values or close values as agreements. To accept close values, choose a built-in compare method (Hours, Days, Miles, Typos, etc.) and an acceptable tolerance. If two data values can be correct but different on true matched pairs (like Bob vs. Robert) then enter the probability of being different (i.e., outside the acceptable tolerance) as part of the linkage model. In general, try to include all information common to both files as match fields but avoid including fields where one comparison outcome is predicted by another field. For example, agreement on age strongly predicts agreement on birth year so you should not use both as match fields. Birth day and birth month could have dependent disagreements if they were derived from the same original birth date -- errors in the original date could propagate to both date parts on the same record, causing simultaneous disagreements. You can compare pairs of match fields with dependent outcomes using LinkSolv but you should estimate the strength of the dependency with a logistic regression model and make it part of the linkage model.
 
 
LinkSolv is a Microsoft Access® - based application. You must have an up-to-date version of Access installed on your computer before installing LinkSolv. You should be familiar with using Access databases and queries to make the best use of LinkSolv. LinkSolv imports your raw data files into Access or Microsoft SQL Server® databases, creates standardized tables for linkage purposes, and creates queries to perform all linkage calculations. All linkage results are tabulated in Access or SQL Server tables and you have to export them to another system such as SAS/STAT® or R for analysis. If you do not plan to use SQL Server then files selected for linkage should be limited to a few million records.
 
What is Bayesian Record Linkage? LinkSolv uses Bayesian statistical analysis to calculate match probabilities. In any Bayesian statistical analysis you describe parameters of interest with probability distributions instead of specific values whenever you are uncertain about their true values. When new information about a parameter becomes available then it can be used in a statistical formula called Bayes' Rule to update your prior distribution to a posterior distribution that reflects the new data. The posterior distribution becomes your new, more accurate, description of the parameter. For record linkage the most important parameter of interest is the status of each record pair -- true match (matched) or not (unmatched). The true status is uncertain so it is described by a probability that a pair is a true match -- the match probability. Bayes' Rule for record linkage can be written in terms of either probabilities or odds and the following example uses odds. Remember that probability = 1 / (1 + odds) and odds = probability / (1 - probability) so it is easy to convert between probability and odds. Probabilities are between 0 and 1 but odds can be any positive number.
 
To see how Bayes' Rule works, suppose two files A and B are to be linked and each has 1,000 records. There are 1,000 X 1,000 = 1,000,000 different record pairs. Also, suppose we know from prior knowledge about the data files that most records in file A have true matches in file B -- that is, there are about 1,000 true matched pairs. If we select a record pair at random then
 
(1)     Prior Odds that a pair is a true match = True Pairs / (All Pairs - True Pairs) = 1,000 / (1,000,000 - 1,000) = 0.001.
 
Now suppose we compare Age on the record pair and find that Age agrees. This is new information that can be used in Bayes' Rule to convert the prior odds to posterior odds. According to Bayes' Rule,
 
(2)     Posterior Odds = Prior Odds X (Probability Age agrees for a true matched pair / Probability Age agrees for a true unmatched pair)
 
Suppose we know from prior knowledge about the data files that age is reported incorrectly on about 1% of all A records and correctly on all B records.
 
(3)     Probability Age agrees for a true matched pair = 1 - Probability Age is incorrect = 1 - 0.01 = 0.99.
 
This term in the formula for Bayes' Rule describes matched pairs and is referred to as the m probability.
 
Suppose we find that there are 50 different ages in the files with about the same number of records for each age.
 
(4)     Probability Age agrees (by chance) for a true unmatched pair = 1 / 50 = 0.02.
 
This term in the formula for Bayes' Rule describes unmatched pairs and is referred to as the u probability.
 
(5)     Posterior Odds = Prior Odds X (m / u) = 0.001 X (0.99 / 0.02) = 0.0495.
 
The term (m / u) in the formula for Bayes' Rule is referred to as the likelihood ratio or odds ratio. In record linkage applications, the logarithm of the likelihood ratio is referred to as the match weight. Agreement on Age gives a likelihood ratio > 1 (match weight > 0) so the odds and match probability increase.
 
Posterior Odds = 0.0495, becoming the prior odds in Bayes' Rule when the next common data element is compared. Suppose the next element is Sex and Sex disagrees with m = 0.001 and u = 0.50.
 
(6)     Posterior Odds = Prior Odds X (m / u) = 0.0495 X (0.001 / 0.50) = 0.0001.
 
Disagreement on Sex gives a likelihood ratio < 1 (match weight < 0) so the odds and match probability decrease. The process is continued with all other data elements used for matching -- the order in which data elements are compared does not matter. Every independent common data element can have something important to contribute to match probability calculations. This is why you should try to compare all independent common information if possible. Testing just a few good fields like First and Last Name is like calling a baseball game after eight innings. You don't know what might have happened in the bottom of the ninth.
 
What is Hierarchical Bayesian Record Linkage? In practice, the true values of parameters used in applying Bayes' Rule for record linkage are uncertain. Consequently, each parameter must be described with it's own prior and posterior distributions and hyper-parameters, creating a hierarchical Bayesian model for record linkage. LinkSolv applies hierarchical Bayesian statistical procedures as described in Gelman and Hill, 2007, Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge University Press. Hyper-parameters include the number of true matches, error probabilities for each field, and the population proportion for each field value so that m and u probabilities can be calculated as decribed in What is Bayesian Record Linkage? Other record linkage implementations treat the number of true links among all candidate pairs and the m and u probabilities themselves as the hyper-parameters, and do not calculate m and u from other parameters.
 
Regardless of the choice of hyper-parameters, the next step is to fit the hierarchical model to your data. LinkSolv uses a statistical technique called Markov Chain Monte Carlo (MCMC) simulation, a general-purpose procedure for estimating the paramaters in any statistical model. This is an iterative process. Starting estimates for all hyper-parameters are chosen. Estimates for all hyper-parameters are used in Bayes' Rule for record linkage to obtain posterior estimates of match probabilities. Posterior estimates of match probabilities are used in Bayes' Rule for each hyper-parameter to obtain posterior estimates. This equals one iteration and iterations are repeated until MCMC test statistics indicate that stable estimates have been reached, say after 10 to 100 iterations. All of these iterations constitute one imputation. The process is repeated for several imputations, say 5, each with different starting estimates. Multiple imputations are used to measure stability of the MCMC process and to capture uncertainty associated with fitting the hierarchical model to your data. Each linkage imputation identifies a set of linked pairs. Most high probability pairs are selected in all imputations, most low probability pairs are skipped in all imputations, and intermediate probability pairs are selected in some but not all imputations. For statistically valid, unbiased results you just analyze each imputation as if it were true and then average the multiple results as described by Gelman and Hill. You can use SAS/STAT® PROC MIANALYZE to combine results from multiple imputations. Or, once multiple imputations have been used to get accurate match probability estimates, you can select a single, most-likely set of linked pairs. LinkSolv includes a summary report for checking your linkage model following recommendations in Gelman and Hill.
 
LinkSolv’s MCMC statistical methodology for fitting a record linkage model to your data is similar to that used in SAS/STAT® PROC MI or IVEware for fitting a missing value imputation model to your data for incomplete datasets. After your linkage is complete, you may want to build a model and impute missing values using one of these software packages, or using R and WinBUGS as described by Gelman and Hill. LinkSolv does not impute missing values.
 
Authored with help of Dr.Explain