Find a Linkage Model that Fits Your Data - Guidelines
Develop Linkage Models that Fit Your Data. It is important to confirm that your linkage model is accurate, meaning that all true links are included as candidate record pairs and that the calculated probability of being a true match is accurate for all candidate pairs. Following these guidelines is how you get there.
The Bayesian Model Check report applies to both real and simulated data. Use this report to confirm that posterior estimates for model parameters are not very different from your prior estimates. If you do find differences, make sure they are not due to errors in your linkage model.
Linking simulated data can help establish that a linkage model fits the data well because you can tell by inspection which links are true and which are false. The first goodness of fit test is to confirm that almost all (say, at least 95%) of the true links are included as candidate pairs. If so, you can proceed to the second goodness of fit test. If not, revise your linkage specs to pick up true links that were skipped or were dropped because they fell below the cutoff. You might have to add a match pass, add a match field, increase a tolerance, etc.
Calculated probabilities must be accurate because they are used to select imputed pairs for outcome studies. CODES2000 also uses imputed pairs to revise your prior estimates of match parameters during the Markov Chain process. In the end, high probability links must be almost all true, medium probability links must be an accurate mixture of true and false links, and low probability links must be almost all false. Otherwise, imputations will not be accurate. This is the reason for the second goodness of fit test.
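The first goodness of fit test can be illustrated with a minimal Python sketch. This is not CODES2000 code: the function name true_link_coverage, the list-of-UniqueIDs inputs, and the (row_in_a, row_in_b) pair format are all assumptions made for the example. It treats any pair of records with equal UniqueIDs as a true link and reports the share of those links that appear among the candidate pairs.

```python
from collections import defaultdict


def true_link_coverage(ids_a, ids_b, candidate_pairs):
    """Return the share of true links that appear among the candidate pairs.

    ids_a, ids_b    : lists of UniqueID values, one per record in each file
    candidate_pairs : iterable of (row_in_a, row_in_b) index pairs
    (Hypothetical input layout, chosen for illustration only.)
    """
    # Index file B by UniqueID so the true links can be enumerated.
    rows_b_by_id = defaultdict(list)
    for row_b, uid in enumerate(ids_b):
        rows_b_by_id[uid].append(row_b)

    # In simulated data, a true link is any pair whose UniqueIDs are equal.
    true_links = {
        (row_a, row_b)
        for row_a, uid in enumerate(ids_a)
        for row_b in rows_b_by_id.get(uid, [])
    }
    if not true_links:
        raise ValueError("no true links in the simulated data")

    found = true_links & set(candidate_pairs)
    return len(found) / len(true_links)


# Tiny made-up example: three true links, two of them picked up as candidates.
ids_a = [101, 102, 103, 104]
ids_b = [103, 101, 999, 104]
candidates = [(0, 1), (2, 0), (1, 2)]  # (1, 2) is a false candidate pair
coverage = true_link_coverage(ids_a, ids_b, candidates)
print(f"coverage = {coverage:.2f}; passes 95% test: {coverage >= 0.95}")
```

A coverage below 0.95 in a check like this would point back to the linkage specs (match passes, match fields, tolerances, cutoff) rather than to the probability calculation.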
How the Goodness of Fit Table is calculated. To get the table shown, all candidate pairs above the cutoff were ranked by probability (low to high) and divided into deciles, that is, 10 groups with approximately the same number of record pairs in each. Ties are ranked randomly. It is easy to count the actual number of true links in each decile because UniqueIDs are equal for true links in the simulated data. The expected number of true links is determined by summing the match probabilities of all candidate pairs in each decile. Because the linked data are multiply imputed, the counts shown here are the averages of the counts in the tables for each imputation. This is why actual counts are not integers.
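The calculation described above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the CODES2000 implementation: the function name goodness_of_fit_table and the (probability, is_true_link) input format are hypothetical, the demo data are randomly generated rather than real linkage output, and the averaging across imputations is left out, so the table corresponds to a single imputation.

```python
import random


def goodness_of_fit_table(pairs, n_groups=10, seed=0):
    """Build a decile table of actual versus expected true links.

    pairs : list of (probability, is_true_link) for candidate pairs above
            the cutoff (hypothetical layout, for illustration only)
    """
    rng = random.Random(seed)
    # Rank low to high by probability; the random second key breaks ties randomly.
    ranked = sorted(pairs, key=lambda p: (p[0], rng.random()))
    n = len(ranked)
    table = []
    for g in range(n_groups):
        # Integer arithmetic yields groups whose sizes differ by at most one.
        start, stop = g * n // n_groups, (g + 1) * n // n_groups
        group = ranked[start:stop]
        table.append({
            "decile": g + 1,
            "pairs": len(group),
            # Actual: count of pairs known to be true links.
            "actual_true": sum(1 for _, is_true in group if is_true),
            # Expected: sum of the match probabilities in the group.
            "expected_true": sum(prob for prob, _ in group),
        })
    return table


# Made-up demo data: each pair is a true link with exactly its stated
# probability, so the probabilities are accurate by construction.
demo_rng = random.Random(42)
demo_pairs = []
for _ in range(1000):
    prob = demo_rng.random()
    demo_pairs.append((prob, demo_rng.random() < prob))

table = goodness_of_fit_table(demo_pairs)
for row in table:
    print(row["decile"], row["pairs"], row["actual_true"],
          round(row["expected_true"], 1))
```

With multiply imputed data, one such table would be built per imputation and the counts averaged cell by cell. When the probabilities are accurate, actual and expected counts should be close in every decile; a large gap in the middle deciles is the pattern to look for.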