Planning a Linkage Project

Learn about record linkage. Information from multiple data files can be combined and studied if records for the same person can be paired together (linked or matched). LinkSolv finds linked pairs by comparing data values on candidate pairs of records and calculating the probabilities that the pairs are true links given all comparison outcomes. Any data reported in both files can be compared, including person names, birthdates, ages, sexes, residence locations, event dates, and event locations. With accurate probabilities it is easy to distinguish good links from the rest. Bayesian statistical models are the basis for probability calculations: agreements increase likelihoods and probabilities, while disagreements decrease them. The precise form of the statistical models varies substantially between linkage practitioners, but all aim to compare the probability of a particular outcome for a true matched pair versus an unmatched pair. Data values to be compared in linkage models must be standardized so that equivalent information is coded in the same way.
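
The Bayesian update described above can be illustrated with a minimal Python sketch. It is not LinkSolv's actual model: it assumes comparison outcomes are conditionally independent, and the function name, field names, and probability values are all hypothetical. For each field, m is the assumed probability of agreement for a true matched pair and u is the assumed probability of agreement for an unmatched pair.

```python
def match_probability(prior, outcomes, m_probs, u_probs):
    """Posterior probability that a candidate pair is a true link.

    prior    -- prior probability that a candidate pair is a true link
    outcomes -- dict of field name -> True (agree) or False (disagree)
    m_probs  -- dict of field name -> P(agree | true matched pair)
    u_probs  -- dict of field name -> P(agree | unmatched pair)

    Assumption: outcomes are independent given the true link status.
    """
    odds = prior / (1.0 - prior)
    for field, agrees in outcomes.items():
        m, u = m_probs[field], u_probs[field]
        if agrees:
            odds *= m / u                      # agreement raises the odds
        else:
            odds *= (1.0 - m) / (1.0 - u)      # disagreement lowers the odds
    return odds / (1.0 + odds)


# Hypothetical parameter values, for illustration only.
M = {"last_name": 0.95, "birthdate": 0.97, "sex": 0.98}
U = {"last_name": 0.01, "birthdate": 0.003, "sex": 0.50}
PRIOR = 0.01

p_all_agree = match_probability(
    PRIOR, {"last_name": True, "birthdate": True, "sex": True}, M, U)
p_one_disagree = match_probability(
    PRIOR, {"last_name": True, "birthdate": False, "sex": True}, M, U)
```

With these made-up values, three agreements push the probability well above the prior, and a single birthdate disagreement pulls it back down.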

Become familiar with LinkSolv. Become familiar with record linkage using LinkSolv by flying the Auto Pilot and reviewing linkage results. The Auto Pilot follows a predefined flight plan to create fake data and complete a real linkage project. It displays a description of each step in the process. Descriptions are logged automatically and can be reviewed in a report after the Auto Pilot flight is complete. You can Pause/Run or Stop the Auto Pilot while any description is displayed by clicking on the appropriate command button.

Learn the characteristics of your data. Become familiar with the data files to be linked and the data fields reported on each record. Review documentation for data files and interview data owners. Learn about coding standards, completeness, and accuracy of reported data. Investigate how you might standardize and compare different data fields.

Select and link sample data. Select and link sample records from your data files, for example all records for events in June from a year of records, or all records for people born in June. Because a sample links faster than the full files, you can evaluate different linkage models more quickly to find one with good fit. Measure the effects of comparisons with tolerances and comparisons with dependent outcomes. Inaccurate estimates of model parameters can lead to inaccurate values for linkage probabilities. Linking a sample of your records helps you improve the parameter estimates used for the full linkage because you can compare prior and posterior estimates for the sample linkage and detect poor model fit.

Create and link fake data. Become familiar with the LinkSolv fake data simulator and create fake data similar to your sample real data. The great advantage of linking fake data is that you can tell by inspection which record pairs are true links because true pairs have the same record id numbers. Investigate different linkage models until you find one that fits your data, that is, one for which most true matches are found and calculated probabilities are approximately equal to actual probabilities. Fake data will never be exactly the same as your real data, but this model will be a good starting point for your real linkage. LinkSolv fake data consists of three files: police reports for motor vehicle crashes, EMS ambulance run reports for injured people, and hospital treatment records for injured people. Fake files include fields found in typical real files of these types. You can specify probability of missing values and probability of incorrect values for each field.
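
The idea behind fake data can be sketched in a few lines of Python. This is not the LinkSolv simulator: the function names, the two fields, and the two-file layout are hypothetical and chosen only to show how a shared record id marks true links while per-field probabilities control missing and incorrect values.

```python
import random


def corrupt(value, p_missing, p_error, domain, rng):
    """Return None (missing), a wrong value, or the true value."""
    r = rng.random()
    if r < p_missing:
        return None
    if r < p_missing + p_error:
        # Pick any value from the field's domain other than the true one.
        return rng.choice([v for v in domain if v != value])
    return value


def make_fake_files(n, p_missing, p_error, seed=0):
    """Create two fake files whose true links share the same record_id.

    p_missing, p_error -- dicts of field name -> probability (hypothetical)
    """
    rng = random.Random(seed)
    domains = {"sex": ["F", "M"], "birth_month": list(range(1, 13))}
    file_a, file_b = [], []
    for record_id in range(n):
        truth = {field: rng.choice(values) for field, values in domains.items()}
        for out in (file_a, file_b):
            record = {"record_id": record_id}
            for field, value in truth.items():
                record[field] = corrupt(
                    value, p_missing[field], p_error[field], domains[field], rng)
            out.append(record)
    return file_a, file_b


# Example: 5% missing and 2% incorrect values in each field.
fake_a, fake_b = make_fake_files(
    1000,
    p_missing={"sex": 0.05, "birth_month": 0.05},
    p_error={"sex": 0.02, "birth_month": 0.02},
)
```

Because both files carry the true record_id, any linkage model run on them can be scored directly against the known true links.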

Select and link all data. Select and link all records from your data files. You may have to make small revisions to your sample linkage model because your complete data files have slightly different characteristics than your sample data files. You will always have to change your estimate of total true links and your search criteria for candidate pairs.

Produce linked data files. A simple way to produce a linked data file is to keep the best links (those with the highest match probabilities) and drop the rest. The number of links selected will depend on how many false positive and false negative matches you are willing to accept. You can do this with LinkSolv, but the linked data file will be incomplete and possibly biased as a sample of the total population of true links. Administrative data files linked for observational studies usually have unintended missing or incorrect values in fields of interest. Missing or incorrect values in linkage comparison fields result in low probabilities for some true links and high probabilities for some false links. The best statistical technique for analyzing such datasets is known as multiple imputation. LinkSolv treats the unknown true link status of record pairs as missing data in the first step of a hierarchical imputation model. Missing values for each linkage imputation can be imputed using standard hierarchical models in SAS/STAT® PROC MI, IVEWare, or an equivalent. These tools give you multiply-imputed complete linked datasets for analysis.
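
The trade-off in the simple keep-the-best-links approach can be made concrete with a short Python sketch. It is an illustration only, not a LinkSolv feature: the function name and the pair tuples are hypothetical, and the expected counts are meaningful only if the match probabilities are accurate.

```python
def select_links(pairs, cutoff):
    """Keep pairs whose match probability is at least the cutoff.

    pairs -- list of (id_a, id_b, probability) tuples (hypothetical layout)

    Returns the kept pairs plus the expected numbers of false positives
    (kept pairs that are not true links) and false negatives (dropped
    pairs that are true links), assuming the probabilities are accurate.
    """
    kept = [p for p in pairs if p[2] >= cutoff]
    dropped = [p for p in pairs if p[2] < cutoff]
    expected_fp = sum(1.0 - p[2] for p in kept)
    expected_fn = sum(p[2] for p in dropped)
    return kept, expected_fp, expected_fn


# Made-up candidate pairs, for illustration only.
candidates = [("a1", "b1", 0.99), ("a2", "b2", 0.60), ("a3", "b3", 0.05)]
kept, expected_fp, expected_fn = select_links(candidates, cutoff=0.90)
```

Raising the cutoff lowers the expected false positives but raises the expected false negatives, which is why a single thresholded file is incomplete as a sample of the true links.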

Analyze linked data files. Analyze each of your multiply-imputed complete linked datasets using standard techniques such as SAS/STAT® PROC REG, SAS/STAT® PROC LOGISTIC, or their equivalents. Combine multiple analysis results such as regression coefficients or population proportions using SAS/STAT® PROC MIANALYZE or its equivalent.
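
The combining step can be illustrated with the standard multiple imputation combining rules (often called Rubin's rules) for a single scalar estimate. The Python sketch below is a simplified stand-in for what a combining tool does, not a replacement for one: the function name is hypothetical, and degrees of freedom and confidence intervals are omitted.

```python
from statistics import mean, variance


def combine_estimates(estimates, variances):
    """Combine one scalar estimate across m imputed datasets.

    estimates -- the point estimate from each imputed dataset
    variances -- the squared standard error from each imputed dataset

    Returns the combined estimate and its total variance:
        total = within + (1 + 1/m) * between
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")
    combined = mean(estimates)        # average of the point estimates
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # sample variance across imputations
    total = within + (1.0 + 1.0 / m) * between
    return combined, total


# Made-up regression coefficients from three imputed datasets.
coef, total_var = combine_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what carries the uncertainty about true link status and missing values into the final standard error.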