Thursday, August 13, 2009

Questions about the Data

Please post any questions about the data here!


  1. Hi Claudia:
    I have a comment about cases where RecordID is null. I ran queries on both data sets: one has 11 entries where RecordID is null and the other has 4. It seems there was a shift in the data, i.e. RecordID is shifted to PatientID, PatientID to record count, etc. I assume it is some glitch in my data load, but I thought I'd let you know.

    Tanya Silva

  2. Tanya,

    I cannot find such cases (looking under Cygwin with the usual Unix commands), so maybe it is a problem with how you process or open the file.
    If you want me to look into it, I would need to see an example record.
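
    One way to tell a shifted row from a loader glitch is to count the fields on each raw line before any database import. The following is a minimal sketch, assuming a comma-delimited text file; the delimiter, the toy rows and the function names are assumptions for illustration, not properties of the competition files.

```python
import csv
import io
from collections import Counter


def field_count_histogram(lines, delimiter=","):
    """Return a Counter mapping 'number of fields' -> 'number of rows'."""
    reader = csv.reader(lines, delimiter=delimiter)
    return Counter(len(row) for row in reader)


def rows_with_unexpected_width(lines, expected, delimiter=","):
    """Return (line_number, row) pairs whose field count differs from `expected`."""
    reader = csv.reader(lines, delimiter=delimiter)
    return [(i, row) for i, row in enumerate(reader, start=1) if len(row) != expected]


# Toy input (made up): the third line is missing one field.
toy = "RecordID,PatientID,RecordCount\n1,100,3\n100,3\n2,101,1\n"
histogram = field_count_histogram(io.StringIO(toy))
suspects = rows_with_unexpected_width(io.StringIO(toy), expected=3)
```

    If every raw line has the same width, the shift most likely happens in the load step rather than in the file itself.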


  3. Question 1.) In the Other-Dx-Code columns, some codes have V or E as a prefix. Do these prefixes have any meaning?
    For example, are E4842 and 4842 related?

    No - there is no relation, they should be treated as different!

    Question 2.) From the definition " define a relevant 'current' visit as the earliest visit that started 90 days AFTER the last discharge date", can we say:
    The time interval between "Current" visit and the last discharge date is more than 90 days. Plus, there is no visit during this time interval?

    Not sure how to answer this. I think it really is of no relevance to you. We just explained what we did. I do not promise that there is NO visit during this interval, nor that there is at least a 90 day interval. The point is that the end that you are seeing is not really the end. I have cut off the end, so there may be 0, 1, 2, ... hospital visits that you do not see.
    I just look at the last discharge date in the original data. Let's say it was May 5, 2004. I will then go back to February 5, 2004 and look for the first hospital stay with admission after February 5. The discharge code of this admission is the target. The admission could fall on any day between February 6 and May 4, and there could be a number of additional visits between February and May that got cut off.

    Question 3.) There are 14 Other-Dx-Code items. Are they listed chronologically? They happened before the "principal DX-code" or after?

    No - the order should not matter. There is a slight chance that the recoding has some pattern, but I am not aware of it.

    Question 4.) "the relative time intervals between hospital stays and the duration of a hospital visit are available." I am sorry, but I cannot find this information.

    The interval between stays is in column 5, called 'Interval', and the duration is in column 11, 'Length of stay'.

  4. The training data have 15 patients with records after their supposed deaths (according to Sequence_Number), including one with two records where patient_disposition=20. The list of patient ids is:
    21459, 50278, 54532, 61378, 95011, 120130, 228228, 260122, 278316, 284120, 284243, 285068, 309647, 318640.

    Ordering problem, or worse?
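
    A check of this kind can be reproduced with a few lines of code. The sketch below assumes that patient_disposition = 20 marks a death (as the comment above implies) and that each record is reduced to a (patient id, Sequence_Number, patient_disposition) tuple; the function name and the toy records are made up for illustration.

```python
DEATH_DISPOSITION = 20  # assumption: disposition code that marks a death


def patients_with_records_after_death(records):
    """Return sorted patient ids that have a record sequenced after a death record.

    `records` is an iterable of (patient_id, sequence_number, disposition).
    """
    records = list(records)

    # First pass: earliest sequence number at which each patient is recorded as dead.
    first_death_seq = {}
    for patient_id, seq, disposition in records:
        if disposition == DEATH_DISPOSITION:
            previous = first_death_seq.get(patient_id)
            if previous is None or seq < previous:
                first_death_seq[patient_id] = seq

    # Second pass: flag patients with any record after that sequence number.
    flagged = set()
    for patient_id, seq, _ in records:
        death_seq = first_death_seq.get(patient_id)
        if death_seq is not None and seq > death_seq:
            flagged.add(patient_id)
    return sorted(flagged)


# Toy records (made up): patient 1 has a visit after death,
# patient 2 dies on the last record, patient 3 has two death records.
toy_records = [
    (1, 1, 1), (1, 2, 20), (1, 3, 1),
    (2, 1, 1), (2, 2, 20),
    (3, 1, 20), (3, 2, 20),
]
flagged_patients = patients_with_records_after_death(toy_records)
```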

  5. What does it mean when Interval is negative?

  6. Claudia,

    I find I have 10 patients with disposition=43 in the training set, but this code does not appear in the data description Excel spreadsheet.

  7. There are also a large number of diagnoses and procedures that appear in the data, but are not documented in the spreadsheet.

  8. Early death: OK, the likely scenario is the following: while being in the hospital (count n) for a long while, the guy was transferred to another hospital (count n+1), returned, and seems to have died in the original (count n). So you see a death in the next-to-last record.

  9. Negative interval: Probably the same as the early death. Somebody gets transferred from one hospital to another and back, and the previous stay is not recorded as having ended. The interval is calculated from the end of the first stay to the beginning of the next and is therefore negative.
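
    The arithmetic behind this explanation can be illustrated with a small sketch. The dates below are invented (the competition data only contain relative intervals), and the function name is hypothetical; the point is only that a transfer which starts before the original stay ends produces a negative difference.

```python
from datetime import date


def intervals_between_stays(stays):
    """Days from the discharge of each stay to the admission of the next one.

    `stays` is a list of (admission_date, discharge_date) in recorded order.
    """
    return [
        (next_admission - previous_discharge).days
        for (_, previous_discharge), (next_admission, _) in zip(stays, stays[1:])
    ]


# Made-up example: a long stay at the original hospital (count n) with a
# transfer to another hospital (count n+1) that begins before the first ends.
stays = [
    (date(2004, 1, 1), date(2004, 3, 1)),   # original hospital, long stay
    (date(2004, 2, 1), date(2004, 2, 10)),  # transfer in the middle of it
]
intervals = intervals_between_stays(stays)  # the single interval is negative
```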

  10. Lack of description for diagnosis and discharge: Sorry - cannot help with this, the description is the best I can get.

  11. DX Codes in the data files have had leading zeros stripped. Can we have them back or was this intentional?

  12. 1. The geometric mean and arithmetic mean for Length of Stay given in the data description do not match the values we calculate from the data. Ideally we would expect them to match. Do you know why they differ?

    2. Cause_E_code and Admit_type are interrelated according to our understanding. Cause_E_code is populated for only 14% of the records, and 15% of those have Admit_type = Emergent. Of the remaining 86% where Cause_E_code is missing, 80% have Admit_type = Emergent. Is this a data error?

    3. Our understanding is that Emergency_Dept_Ind should be populated when Admit_type = Emergent. However, Emergency_Dept_Ind is missing for 20% of the records with Admit_type = Emergent. Can we recode Emergency_Dept_Ind = E for those records?
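
    For the first point, a minimal sketch of how the two means could be computed from a list of Length of Stay values is given here, so that the comparison with the data description is reproducible. The function names and toy values are illustrative only. Note that the geometric mean is undefined for zero or negative values, so how such stays are handled is one thing to check when comparing.

```python
import math


def arithmetic_mean(values):
    """Plain average of the values."""
    return sum(values) / len(values)


def geometric_mean(values):
    """Exponential of the average log; requires strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Toy lengths of stay in days (made up).
toy_lengths = [1, 4, 16]
toy_arithmetic = arithmetic_mean(toy_lengths)
toy_geometric = geometric_mean(toy_lengths)
```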

  13. In the test data, I have 15 records with disposition = 20. I don't understand why these records are included - can you explain?

    In the training data, there are records showing patient disposition = 43 and 65 - I don't see these codes documented anywhere. Can you explain their significance?

  14. I am still very confused by the description of how the data were collected, and wonder if there's an error in "Data Description.doc"

    From what you say above it sounds like you have for each patient a sequence of hospital visits. The current visit is defined to be the earliest visit that is no more than 90 days prior to the last visit. All visits later than the current visit are censored.

    But in the .doc file you say, "we define a current visit as the earliest visit that started 90 days AFTER the last discharge date." Should this be, "that started no more than 90 days BEFORE the last discharge date"?

    Please help me understand.

  15. Yes, I have the same question as Gordon V.

  16. The data appear to be consistent with my interpretation. This is a retrospective study of patients who visited the hospital several times during some (unspecified) time interval.

    For a particular patient, number the visits 1,2, ... n in chronological order by admission date, where a(k) and d(k) denote the admission and discharge dates of the kth visit.

    Define the last visit as

    l = argmax {d(k)}

    (note that it is common but not essential that l = n).

    Define the current visit c to be

    c = argmin {a(k) | d(l) - a(k) <= 90 days}

    The records in the dataset are for visits 1..c. The records for visits c+1..n are censored.
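
    The definitions above can be sketched in code as follows. This is only an illustration of the interpretation in this comment, not the organizers' procedure. It assumes admission and discharge are given as numeric day offsets, uses 0-based indices instead of the 1-based numbering in the text, and the function name and toy visits are made up.

```python
def last_and_current_visit(visits, window_days=90):
    """Return (l, c) as 0-based indices into `visits`.

    `visits` is a list of (admission_day, discharge_day) sorted by admission.
    l is the visit with the latest discharge; c is the earliest-admitted visit
    whose admission is at most `window_days` before that discharge.
    c is None if no visit falls inside the window.
    """
    # l = argmax d(k)
    l = max(range(len(visits)), key=lambda k: visits[k][1])
    last_discharge = visits[l][1]

    # c = argmin { a(k) | d(l) - a(k) <= window_days }
    candidates = [
        k for k, (admission, _) in enumerate(visits)
        if last_discharge - admission <= window_days
    ]
    if not candidates:
        return l, None
    c = min(candidates, key=lambda k: visits[k][0])
    return l, c


# Toy patient (made up): four visits as (admission_day, discharge_day).
toy_visits = [(0, 5), (100, 110), (150, 160), (170, 200)]
l_index, c_index = last_and_current_visit(toy_visits)
kept = toy_visits[: c_index + 1]      # visits up to and including the current one
censored = toy_visits[c_index + 1 :]  # later visits, not visible in the data
```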

  17. Claudia,

    Most of the diagnosis codes in the competition data set are not listed in ICD10 or ICD9.
    I was looking at


    Am I looking at wrong places? Is there a link which gives the classification for the codes in our dataset?

    Best Regards,

  18. Since the leading zeros are deleted for 'PrincipalPRCode' (and for some other variables), there could be two interpretations for the same code.

    For example, when PrincipalPRCode = 101, we would not know if it represents 'Cisternal Puncture' or 'Conjunctiva Incision Nec'

    Does anyone know how to address this issue?
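
    The ambiguity can at least be made explicit by listing every documented code that collapses to the same value once leading zeros are removed. The sketch below assumes codes are stored as digit strings without a decimal point and that a lookup table of documented codes is available; the table, its keys (0101 and 101 are assumed spellings for the two procedures named above) and the function name are hypothetical.

```python
def candidates_for(stripped_code, code_table):
    """Return the entries of `code_table` that match `stripped_code`
    after leading zeros are removed from the documented code."""
    target = str(stripped_code)
    return {
        code: description
        for code, description in code_table.items()
        if code.lstrip("0") == target
    }


# Hypothetical lookup table; the undotted keys are an assumption.
toy_table = {
    "0101": "Cisternal Puncture",
    "101": "Conjunctiva Incision Nec",
    "0202": "(placeholder entry)",
}
ambiguous = candidates_for(101, toy_table)  # two candidates for the same value
```

    A value with more than one candidate cannot be resolved from the code alone, so it may be safer to treat the stripped value as its own category than to guess.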