LinkSolv 8.3 Help Pages and User Guide

How to Correct Dependent Outcomes

 
 
Correct for Dependent Outcomes. Try to select match fields with independent comparison outcomes, so that agreements and disagreements for one match field do not predict agreements and disagreements for a second match field on either true matched pairs or true unmatched pairs. Sometimes this is not practical because you would lose too much information.

One way to address dependent outcomes is to combine two dependent match fields into one. If you combine first and last name into full name for linkage purposes, then you don't have to worry about dependent outcomes for first and last name. Of course, you would also lose partial agreement whenever first or last name agrees and the other name disagrees. Often it is not clear which strategy will work better with your data, so you should test both.

Another way to address dependent outcomes is to add a logistic regression to the linkage model as described below. Logistic regression is a standard statistical model for describing dependency. In a logistic regression, agreements for one or more independent match fields predict the logarithm of the odds of agreement for a dependent match field. Logistic regression models correct for the overall dependency but are not necessarily better on a case-by-case basis.
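To illustrate the trade-off of combining first and last name into full name, here is a minimal Python sketch. It is not LinkSolv code: the record layout, the field names first and last, and the exact-match comparison rule are all assumptions made only for this illustration.

```python
def agrees(a: str, b: str) -> bool:
    """Exact match after trimming and case-folding (an assumed, simplistic rule)."""
    return a.strip().casefold() == b.strip().casefold()


def compare_separately(rec_a: dict, rec_b: dict) -> dict:
    """Compare first and last name as two separate match fields."""
    return {
        "first": agrees(rec_a["first"], rec_b["first"]),
        "last": agrees(rec_a["last"], rec_b["last"]),
    }


def compare_combined(rec_a: dict, rec_b: dict) -> dict:
    """Compare a single full-name match field built from first and last name."""
    full_a = f'{rec_a["first"]} {rec_a["last"]}'
    full_b = f'{rec_b["first"]} {rec_b["last"]}'
    return {"full": agrees(full_a, full_b)}


# Hypothetical pair of records: same first name, last name spelled differently.
rec_a = {"first": "Maria", "last": "Lopez"}
rec_b = {"first": "Maria", "last": "Lopes"}

separate = compare_separately(rec_a, rec_b)
combined = compare_combined(rec_a, rec_b)

print(separate)  # {'first': True, 'last': False}
print(combined)  # {'full': False}
```

With separate fields the pair still earns an agreement on first name. With the combined field the whole name disagrees, so that partial agreement is lost.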
 
Derived Fields with Dependent Disagreements. Often, two fields derived from the same reported field will have dependent disagreements; that is, the derived fields will tend to disagree together on true matched pairs. For example, suppose day of birth and month of birth are split from date of birth. Then both derived fields are likely to disagree together whenever the complete date of birth disagrees on a true matched pair, so the disagreement weight is effectively counted twice.
 
Three or More Fields with Dependent Outcomes. Currently, LinkSolv can only correct for pairs of fields with dependent outcomes, so you should try to avoid using three or more fields derived from the same original field as match fields.
 
How to Specify a Logistic Regression.
 
Suppose you have 1,000 true matched pairs from an earlier linkage and match fields include Month and Day derived from Birth Date. For simplicity, we assume no missing values. You can tabulate the number of pairs for each possible comparison outcome in a 2X2 contingency table: Agree/Agree, Agree/Disagree, Disagree/Agree, and Disagree/Disagree. The marginal totals divided by the total number of pairs are the observed probabilities of agreement on true matched pairs for Month and Day, called the m probabilities for Month and Day.
 
 
                   Day Agrees   Day Disagrees   Totals
Month Agrees              875              75      950
Month Disagrees            25              25       50
Totals                    900             100     1000
 
Observed m probability for Month = 950 / 1000 = 0.95.
 
Observed m probability for Day = 900 / 1000 = 0.90.
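The marginal calculations can be reproduced with a few lines of Python. The sketch below also compares the observed share of pairs where both fields disagree with the share that would be expected if Month and Day outcomes were independent. The dictionary layout and variable names are assumptions for illustration only.

```python
# Counts of true matched pairs from the 2X2 table.
# Keys are (month_outcome, day_outcome); 1 = agree, 0 = disagree.
counts = {(1, 1): 875, (1, 0): 75, (0, 1): 25, (0, 0): 25}
total = sum(counts.values())

# Marginal totals divided by the total give the observed m probabilities.
m_month = (counts[(1, 1)] + counts[(1, 0)]) / total
m_day = (counts[(1, 1)] + counts[(0, 1)]) / total

# Share of pairs where both fields disagree: observed vs. expected under independence.
observed_both_disagree = counts[(0, 0)] / total
expected_if_independent = (1 - m_month) * (1 - m_day)

print(f"m probability for Month = {m_month:.2f}")
print(f"m probability for Day   = {m_day:.2f}")
print(f"Both disagree, observed = {observed_both_disagree:.3f}")
print(f"Both disagree, expected if independent = {expected_if_independent:.3f}")
```

In this table, joint disagreement is observed on 25 of 1000 pairs, while independence would predict about 5 of 1000. That gap is the dependency the logistic regression is meant to describe.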
 
Either field can be used as the dependent field. We code agreements as 1 and disagreements as 0. If Month agreement predicts Day agreement then the logistic regression is
 
ln(Probability for Day Agreement / Probability for Day Disagreement) = Intercept + (Regression Coefficient) X (Month Outcome)
 
This logistic regression formula can be fit to the data using any of several statistical analysis packages. With a single 2X2 table, the model can also be fit by hand. Enter the estimated Intercept and Coefficient in the appropriate columns for the independent match field on the Specify Match Fields tab of the Specify Match dialog. Enter the Intercept in both columns for the dependent match field. Two fields with exactly the same Intercept are assumed to have dependent outcomes.
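As a rough illustration of what a statistical package does when it fits this model, the following Python sketch runs Newton-Raphson iterations for a logistic regression with one binary predictor, using the grouped counts from the table. It is a sketch under stated assumptions, not the algorithm of any particular package or of LinkSolv; the function name fit_logistic and the data layout are hypothetical.

```python
import math

# Grouped data from the 2X2 table:
# (month_outcome, number of pairs, number of those pairs where Day agrees)
groups = [(0, 50, 25), (1, 950, 875)]


def fit_logistic(groups, iterations=25):
    """Maximum-likelihood fit of ln(odds of Day agreement) = a + b * month_outcome."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = 0.0            # gradient of the log-likelihood
        h_aa = h_ab = h_bb = 0.0   # information matrix (negative Hessian)
        for x, n, y in groups:
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            residual = y - n * p
            weight = n * p * (1.0 - p)
            g_a += residual
            g_b += x * residual
            h_aa += weight
            h_ab += weight * x
            h_bb += weight * x * x
        # Solve the 2x2 system (information matrix) * step = gradient.
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


intercept, coefficient = fit_logistic(groups)
print(f"Intercept = {intercept:.3f}, Regression Coefficient = {coefficient:.3f}")
```

Because the model has two parameters and the table has two rows, the fitted values reproduce the observed odds in each row, so the result should agree with a hand calculation from the same counts.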
 
If Month Disagrees (Month Outcome = 0):
 
Probability Day Agreement = 25 / 50 =  0.50.
 
Probability Day Disagreement = 25 / 50 = 0.50.
 
ln(0.50 / 0.50) = 0.00 = Intercept + 0 X Regression Coefficient = Intercept.
 
If Month Agrees (Month Outcome = 1):
 
Probability Day Agreement = 875 / 950 = 0.921.

Probability Day Disagreement = 75 / 950 = 0.079.

ln(0.921 / 0.079) = 2.46 = Intercept + Regression Coefficient.

So, Regression Coefficient = 2.46 because Intercept = 0.00.

ln(Odds for Day Agreement) = 2.46 * Month Outcome

Odds for Day Agreement = exp(2.46 * Month Outcome)
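The hand calculation above can be written as a short Python sketch that works from the unrounded counts and then converts the fitted odds back into probabilities. The function names and the nested-dictionary table layout are assumptions for illustration.

```python
import math


def fit_by_hand(table):
    """Return (intercept, coefficient) from a 2X2 table of counts.

    table[month_outcome][day_outcome] is a count; 1 = agree, 0 = disagree.
    """
    odds_when_month_disagrees = table[0][1] / table[0][0]
    odds_when_month_agrees = table[1][1] / table[1][0]
    intercept = math.log(odds_when_month_disagrees)
    coefficient = math.log(odds_when_month_agrees) - intercept
    return intercept, coefficient


def day_agreement_probability(intercept, coefficient, month_outcome):
    """Convert ln(odds) = intercept + coefficient * month_outcome to a probability."""
    odds = math.exp(intercept + coefficient * month_outcome)
    return odds / (1.0 + odds)


observed = {1: {1: 875, 0: 75}, 0: {1: 25, 0: 25}}
intercept, coefficient = fit_by_hand(observed)

print(f"Intercept = {intercept:.2f}")
print(f"Regression Coefficient = {coefficient:.3f}")
for month_outcome in (0, 1):
    p = day_agreement_probability(intercept, coefficient, month_outcome)
    print(f"Month Outcome = {month_outcome}: Probability Day Agreement = {p:.3f}")
```

This sketch divides by cell counts, so it fails if any cell of the table is zero. In that case the odds are undefined and the table would need to be adjusted before fitting.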
 
If you have no true matched pairs to start with, you can calculate the Intercept and Regression Coefficient for independent outcomes. For simplicity, we assume no missing values. First, set the marginal totals so that they reflect the prior m probabilities for each field given your prior estimate of Total Matches. Second, prorate the marginal totals into the 2X2 contingency table: Agree/Agree = 1000 X (950 / 1000) X (900 / 1000) = 855, Agree/Disagree = 1000 X (950 / 1000) X (100 / 1000) = 95, Disagree/Agree = 1000 X (50 / 1000) X (900 / 1000) = 45, and Disagree/Disagree = 1000 X (50 / 1000) X (100 / 1000) = 5. If you wish, you can also adjust the counts to show dependency with the same marginal totals. For example, you could add 5 to each diagonal cell and subtract 5 from each off-diagonal cell, giving 860, 90, 40, and 10.
 
 
                   Day Agrees   Day Disagrees   Totals
Month Agrees              855              95      950
Month Disagrees            45               5       50
Totals                    900             100     1000
 
If Month Disagrees (Month Outcome = 0):
 
Probability Day Agreement = 45 / 50 =  0.90.
 
Probability Day Disagreement = 5 / 50 = 0.10.
 
ln(0.90 / 0.10) = 2.2 = Intercept + 0 X Regression Coefficient = Intercept.
 
If Month Agrees (Month Outcome = 1):
 
Probability Day Agreement = 855 / 950 =  0.90.
 
Probability Day Disagreement = 95 / 950 = 0.10.
 
ln( 0.90 / 0.10) = 2.2 = Intercept + Regression Coefficient
 
So, Regression Coefficient = 0.00 because Intercept = 2.2.
 
ln(Odds for Day Agreement) = 2.2, independent of Month Outcome as assumed.
 
Odds for Day Agreement = exp(2.20) = 9.0.
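The prior-based procedure can be sketched in Python as follows: prorate the marginal totals into an independent table, optionally shift counts toward the diagonal to express dependency, and recompute the Intercept and Regression Coefficient. The function names, the table layout, and the shift of 5 pairs are assumptions for illustration, not LinkSolv functions.

```python
import math


def independent_table(total, m_month, m_day):
    """Prorate marginal totals into a 2X2 table assuming independent outcomes.

    table[month_outcome][day_outcome] is a count; 1 = agree, 0 = disagree.
    """
    return {
        1: {1: total * m_month * m_day, 0: total * m_month * (1 - m_day)},
        0: {1: total * (1 - m_month) * m_day, 0: total * (1 - m_month) * (1 - m_day)},
    }


def shift_toward_dependence(table, delta):
    """Add delta to diagonal cells and subtract it from off-diagonal cells.

    Row and column totals are unchanged by this adjustment.
    """
    return {
        1: {1: table[1][1] + delta, 0: table[1][0] - delta},
        0: {1: table[0][1] - delta, 0: table[0][0] + delta},
    }


def intercept_and_coefficient(table):
    """Hand fit of ln(odds of Day agreement) = intercept + coefficient * month_outcome."""
    intercept = math.log(table[0][1] / table[0][0])
    coefficient = math.log(table[1][1] / table[1][0]) - intercept
    return intercept, coefficient


prior = independent_table(1000, 0.95, 0.90)
adjusted = shift_toward_dependence(prior, 5)

for label, table in (("Independent", prior), ("Adjusted", adjusted)):
    intercept, coefficient = intercept_and_coefficient(table)
    print(f"{label}: Intercept = {intercept:.2f}, Regression Coefficient = {coefficient:.2f}")
```

For the independent table the coefficient comes out as zero, apart from floating-point rounding. For the adjusted table (860, 90, 40, 10) it becomes positive, which expresses a prior belief that Month and Day tend to agree or disagree together.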
 
 
 