Case Based Pediatrics For Medical Students and Residents
Department of Pediatrics, University of Hawaii John A. Burns School of Medicine
Chapter XXII.3. Epidemiology and Research Methodology
Loren G. Yamamoto, MD, MPH, MBA
February 2003

You are seeing an obese 10 year old for a school physical. A history of his overall activity level indicates that he does not participate in sports, he stays indoors all the time and watches TV during the entire weekend. He doesn't know how to ride a bicycle. His only physical exercise is at school during recess and physical education classes. Because he is obese, the other kids make fun of him, so he prefers to just sit in the shade during recess.

His family history is significant for: 1) obesity in both parents; 2) cigarette smoking, coronary artery disease and hypertension in his father; 3) death from acute myocardial infarction in his paternal grandfather at age 45.

Exam: VS are normal except for his blood pressure which is 130/85. His height is at the 50th percentile. His weight is 84 kg (>95th percentile). He is very obese, in no distress. His examination is normal except for the findings associated with obesity.

You advise his parents that he is at risk for heart disease in his early adult life if his obesity continues. You recommend a physical exercise program and suggest that his father should not smoke inside the home. However, his mother and father state that they are unable to comply because they live in an apartment. They don't believe that indoor exercises would help. They are skeptical and say that they would like to see some proof that exercise has some benefit. His father shows you a magazine article (from your waiting room) which states that cigarette smoking does not cause lung cancer.

You decide to look up some studies on the effect of exercise on obesity and cardiovascular disease. However, you find that there are many different types of studies, that they are hard to compare, and that it is difficult to determine their quality. The statement about smoking is puzzling. The article states that although cigarette smoking is associated with lung cancer, it has not been shown to cause lung cancer. You decide to find out how experts determine if an association is truly due to cause and effect.

Epidemiology includes methods that describe the occurrence of disease. Descriptive and inferential statistics are discussed in a separate chapter. Many epidemiologic measures are special descriptive statistics which help to summarize the occurrence of disease within a population. In clinical research, several types of studies exist. Understanding the differences between these study methods enables one to assess how good a study is in contributing to the clinical question at hand. This chapter will cover some basic epidemiology and focus on research methodology to develop an ability to critically appraise the medical literature.

Study design types (method of study) can be categorized into: 1) Experimental design, 2) Clinical trial (placebo controlled, blinded), 3) Cohort study, and 4) Case control study.

Recognizing what "type" of study one is reading is not nearly as important as recognizing the actual weakness of the data and its conclusions. Many studies do not fit neatly into one specific study type.

These four study types can be further classified as prospective, longitudinal, or retrospective based on the time sequence of the data observations. A prospective study generally looks at some type of exposure (a risk factor) and then determines, at some future time, whether a disease condition develops. Retrospective studies look at those who have developed a disease and then determine if any risk factors were present in the patients at some time in the past. Longitudinal studies make observations in the study group at several points in time moving forward.

Prospective and longitudinal studies are the most difficult to do because they require a long period of time to complete. Retrospective studies are easier to do; however, they are subject to numerous methodological flaws. Prospective and longitudinal studies are less subject to methodological flaws, so the quality of their conclusions is usually superior to that of a retrospective study.

1) The experimental design type of study is not common in clinical medicine using actual patients. This type of study is usually done in a lab using models or study subjects who are subjected to different treatments. It is nearly flawless from a methodological standpoint if it is done correctly. Example: Does a 20-minute application of EMLA cream (a topical anesthetic) reduce the pain of starting an IV? Healthy study subject volunteers are recruited who are willing to have two IVs started on them. EMLA cream is applied to one hand and placebo cream is applied to the other hand with the study subject blindfolded. An IV is started on one hand and then the other hand. The study subject must now rank which hand was more painful. Note that this type of study could not be done on actual patients, because they would not ordinarily require two elective IVs.

2) The clinical trial is a study type that generally appears in the New England Journal of Medicine. Because such studies are very expensive to undertake, consume enormous resources, and take a long time to complete, it is unlikely that anyone else will have the resources to repeat them, so such studies are often fairly definitive in drawing conclusions. In clinical trials, new treatments must be compared to some type of control. The control could be an older treatment or it can be a placebo (placebo controlled). If patients know which treatment they are getting (the new treatment or the control), then the study is not blinded. This is a problem because patients may perceive they have gotten better if they got the new treatment and those who got the control (placebo or older treatment) may be less likely to feel like they have gotten better. The study design should somehow measure the patient's clinical response to treatment. It is best if this measurement is highly objective (e.g., blood sugar at 6:00 a.m.), but often it is a rather subjective measurement such as tympanic membrane abnormality or subjective pain relief. Even a measurement such as blood pressure may be subject to bias to some degree. If the measurement of clinical outcome has any degree of subjectivity, then it may be subject to bias if those making the measurements know whether the patient received the new treatment or the control. If those making the measurement are blinded as to whether the patient received the new treatment or the control, this removes the bias. This is blinding the study investigator. If possible, it is best to blind both the subjects (patients) and the study investigators (double blinded). Double blinding can be accomplished by assigning codes to pre-measured treatment vials. After the clinical outcome measurements are made, the code is revealed to determine which study subjects received the new treatment versus the control.
It is not always possible to blind the patient or the investigator. For example, in a study of the effect of a jogging program on weight control 2 years later, it would not be possible to blind the patients as to whether they were assigned to the jogging or non-jogging group.

3) In a cohort study, a group (cohort) is identified. Disease outcome and risk factors are assessed within the cohort. This can be done prospectively, longitudinally, or retrospectively. Examples:

Prospective: All patients in the emergency department (ED) arriving with wheezing are treated in an unrestricted fashion by the ED physician on duty. A data sheet is completed during the ED visit which indicates the patient's initial oxygen saturation, the number of bronchodilator treatments received, and whether the patient required hospitalization. This data can be analyzed to see if hospitalized patients have lower initial oxygen saturations compared to patients who were discharged home from the ED.

Longitudinal: The Hawaii Heart and Cancer Study, located at Kuakini Medical Center, enrolled all Japanese males of a certain age group on Oahu using selective service (military draft) registration data. Since the 1960s, this group of men has undergone periodic screening physical examinations, histories, lifestyle surveys, and laboratory studies. Since this cohort has been followed for a long period of time, many of the men have developed heart disease, cancer, etc. Such large longitudinal cohorts have yielded substantial information regarding the role of risk factors for such diseases. Other large cohorts are followed similarly in other research centers in the U.S.

Retrospective: 120 inpatient cases of intussusception were reviewed. Each chart was reviewed (retrospectively) to look for documentation of currant jelly stool and whether the diagnosis of intussusception was missed on the initial medical evaluation. The study concluded that the presence of frank currant jelly stool was associated with a low likelihood of missing the diagnosis of intussusception while there was a higher likelihood of missing the diagnosis in the absence of frank currant jelly stool.

4) A case-control type of study is always retrospective. Patients with a particular disease condition are identified (cases). A set of matched controls without the disease are found. Risk factors (exposure factors) are then compared in the two groups. For example, to examine the relationship between bicycle helmet use and brain injuries, we could identify 30 patients hospitalized with brain injuries due to bicycle accidents. We could then go to a different hospital ward and find 30 patients of similar age and sex, but without a brain injury. We could then survey each group for bicycle helmet use. If the frequency of bicycle helmet use is 30% in the head injury group and 75% in the control group, and if p<0.05 for this difference, then an association exists to suggest that bicycle helmet use may lower the risk of bicycle related head injuries. However, there are other possible explanations for this difference, so the conclusion is not a strong one.
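
One common way to summarize a case-control comparison such as the helmet example is the odds ratio, which this chapter does not itself compute. The sketch below is only an illustration under that assumption: it uses the 30% and 75% helmet-use figures from the example, and the function names are hypothetical.

```python
def odds(p):
    """Convert a proportion (0 < p < 1) into odds."""
    if not 0 < p < 1:
        raise ValueError("proportion must be strictly between 0 and 1")
    return p / (1 - p)


def exposure_odds_ratio(p_exposed_cases, p_exposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    return odds(p_exposed_cases) / odds(p_exposed_controls)


# Figures from the example: helmet use in 30% of the head injury cases
# and in 75% of the controls.
helmet_or = exposure_odds_ratio(0.30, 0.75)
print(f"Odds ratio for helmet use (cases vs. controls): {helmet_or:.2f}")
```

An odds ratio well below 1 is consistent with a protective association, but it does not address the other possible explanations for the difference.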

Clinical trials are the most difficult and expensive to do. Case-control studies are the easiest and least expensive to carry out. Cohort studies are in between. While case-control studies are the easiest to perform, they are the most subject to methodological flaws. In studying rare diseases, case-control studies are the most efficient means of epidemiological research. For example, in the Hanta virus epidemic in New Mexico, the most efficient way to determine risk factors for the deaths due to the unknown cause was to examine the background history of all the patients who died mysteriously compared to a group of matched controls.

Consider the example in the case in which we are trying to determine if lack of childhood exercise is associated with coronary artery disease (CAD) in later adult life. Using the four different study types, this is how it would have to be done:

For an experimental design, we would obtain 1000 young mice, since we could not do this study in humans. 500 mice would be exercised and the other 500 would be forced to be sedentary. When the mice become adults, we would look at their hearts to see how many in each group have CAD.

For a clinical trial, we would obtain 1000 child volunteers. We would then randomize them to childhood exercise or childhood non-exercise groups. Study volunteers would have no choice as to what group they would be in. Those assigned to the exercise group would have to exercise all through childhood. Those assigned to the non-exercise group would be forbidden from exercise throughout childhood. 40 years later as adults, we could survey them for CAD using angiography or a more sophisticated future method of imaging the coronaries. We would not be able to do a double blinded study since the study volunteers would know whether they were randomized to the exercise or non-exercise group. Such a clinical trial would be unethical and impossible.

In a cohort study, we would identify 1000 children who will be observed longitudinally for the next 50 years. There would be no restrictions on their lifestyle. Some will have a lot of exercise, some will have none and some will have intermediate levels of exercise. Somehow, this will have to be quantitated. Other factors associated with CAD must also be monitored, such as blood pressure, cholesterol, diet etc. In late adulthood, the cohort could be evaluated for CAD to determine whether more CAD is present in the groups who had less childhood exercise.

In a case control study, we would go to the coronary care unit (CCU) to find 50 patients hospitalized with ischemic cardiac disease. We would find 50 age and sex matched controls from the general hospital ward (without heart disease). We would then obtain a detailed history of their childhood exercise habits to see if there is less childhood exercise in the CCU patients compared to the controls.

In reviewing the medical literature, there are several common pitfalls in the interpretation of published data. Some of these include:
. . . . . 1. Gold standard.
. . . . . 2. Statistical significance versus clinical importance.
. . . . . 3. "Fishing" for a significant result: Data dredging.
. . . . . 4. Confounding variables.
. . . . . 5. Association versus cause & effect.
. . . . . 6. Case reports.

The gold standard refers to a method that truly identifies a particular condition. Some disease conditions have gold standards and others do not. Examine the following examples. Which of these are good gold standards and which of these are "bronze" standards?
. . . . . Pregnancy: Positive beta-HCG, rising beta-HCG, baby/fetus/fetal tissue is eventually passed.
. . . . . Meningitis: Elevated white cell count on CSF analysis.
. . . . . Pneumonia: Characteristic CXR findings.
. . . . . Appendicitis: Exploratory laparotomy examination of the appendix confirmed by histopathology.

In the above examples, the only one that is a poor standard (not "gold") is the CXR for pneumonia, since for a mild pneumonia, some radiologists would read the CXR as normal, while others would read it as a slight pneumonia.

Think of gold standards for the following examples that have been studied frequently in the literature: otitis media, strep pharyngitis, periorbital cellulitis, scaphoid fracture, bacteremia. The diagnostic standard for otitis media is simply that an examiner states that the patient has otitis media. This is a poor standard. The conclusions from many of the studies on otitis media are weak because it is not convincing that all the study subjects actually have otitis media. The diagnosis of strep pharyngitis is based on a throat swab. However, we know that some positive throat cultures are due to tonsillar colonization and not necessarily to strep pharyngitis. Periorbital cellulitis is a clinical diagnosis. Some would call a swollen bug bite on the upper eyelid a cellulitis. Scaphoid fractures are not always radiographically apparent. Bacteremia relies on a positive blood culture, but many blood cultures grow out contaminants and a negative blood culture may occur in bacteremia if the bacteremia is low grade. Thus, none of these diagnostic clinical entities are well defined by gold standards. If a gold standard is lacking, one cannot be certain what clinical entity is being studied.

Statistical significance versus clinical importance has been described in the chapter on statistics. Just because something is statistically significant does not necessarily mean that it is clinically important. For example, small differences in white blood counts, oxygen saturation values, or pain scores may be statistically significant, but the differences need to be larger for them to be clinically important.

"Fishing" or "data dredging" for a significant result has been discussed in the statistics chapter. This phenomenon occurs when too many statistical tests are performed, which increases the likelihood that you will find a difference due to chance (which is not a true difference) leading you to an incorrect conclusion. Worded another way, the likelihood of a type 1 statistical error, increases with the number of statistical tests that are performed.

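The growth in type 1 error with the number of tests described above can be sketched numerically. This is a minimal illustration that assumes the tests are independent and that each uses a significance threshold of 0.05; the function name is hypothetical.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests is
    'significant' by chance alone when no true difference exists."""
    return 1 - (1 - alpha) ** num_tests


for k in (1, 5, 10, 20):
    risk = chance_of_false_positive(k)
    print(f"{k:2d} tests: {risk:.0%} chance of at least one false positive")
```

Under these assumptions, a study that runs 20 tests has roughly a 64% chance of producing at least one spurious "significant" result.
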
Confounding variables are variables that are related to the study risk factor and the outcome. Such factors must be matched or compensated for in order to compare two groups.

Example: Does watching Japanese language Samurai movies on TV cause GI cancer? We studied 50 men with GI cancer and found that 60% of them frequently watched Samurai movies. We then found a control group of teenagers from a high school who presumably do not have GI cancer. We found that 0% of the controls frequently watched Samurai movies. We thus conclude that watching Samurai movies puts one at greater risk for GI cancer. What is wrong with this conclusion? Age is a confounding variable. Age is related to GI cancer since GI cancer tends to occur in older adults and not teenagers. Age is also related to the likelihood of watching Japanese language Samurai movies. While older Japanese men are somewhat likely to watch Japanese language Samurai movies, young teenagers are very unlikely to frequently watch these. Thus, to have a better control group, a confounding factor such as age must be matched for (i.e., the age distribution in the GI cancer group must be the same as that of the control group).

Age and sex are almost always confounding variables. Thus, comparison groups must always be matched for age and sex. Other confounding variables are more difficult to identify. Ethnicity is a significant confounding variable in the example above, since non-Japanese men are much less likely to watch Samurai movies.

See if you can think of some potential confounding factors for the following study. The emergency department carried out a study comparing special intraosseous needles with generic bone marrow needles to determine which needle was easier to use for an intraosseous (IO) procedure. Thirty Honolulu paramedics were asked to insert the special IO needle into a turkey bone, then to insert the generic bone marrow needle into a turkey bone. The two IO insertion times were recorded and the paramedics were asked to determine which needle they preferred. The following are the results. Mean insertion time was 21 seconds for the special IO needle, and 15 seconds for the generic bone marrow needle (p=0.02). 20% of the paramedics favored the special IO needle, while 80% favored the generic bone marrow needle (p=0.001). Identify some potential confounding variables.

1) These paramedics had previous experience and training with the intraosseous procedure. Which needle were they trained with and which needle have they used more in their experience? If they had previous significant experience with the generic bone marrow needle, then we would expect that they would have an easier time with this needle. Thus, previous intraosseous experience is potentially a confounding variable. This confounding variable could be corrected by matching those with previous generic bone marrow needle experience with others who have more special IO needle experience. Another way of eliminating this confounding variable is to use other study subjects, such as students, who have never performed an intraosseous insertion before.

2) Note that the study described above always started with the special IO needle, followed with the generic bone marrow needle. This gives the special IO needle a disadvantage because it gives the paramedics an opportunity to practice with a different needle prior to using the generic bone marrow needle, but no practice run is allowed prior to the special IO needle. If the turkey bone model is not similar to an actual intraosseous attempt, it may take some getting used to. The turkey femur is cylindrical, while an infant's tibia is flat. Thus, any practice on the unfamiliar cylindrical turkey femur may help to reduce the time of IO needle insertion. Thus, the sequence of which needle is used first is a confounding variable. This confounding variable could be eliminated by alternating or randomizing which needle is tried first.
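
Randomizing which needle is tried first, as suggested above, could be sketched as follows. This is only one possible scheme (a balanced design in which half of the subjects start with each needle); the function name and the fixed seed are assumptions made for illustration.

```python
import random


def assign_needle_order(num_subjects, seed=None):
    """Return a list of (first_needle, second_needle) tuples, one per subject.

    Half of the subjects start with each needle, and which subjects get
    which order is decided by a random shuffle.
    """
    if num_subjects % 2:
        raise ValueError("use an even number of subjects for a balanced design")
    half = num_subjects // 2
    orders = [("special IO", "generic bone marrow")] * half
    orders += [("generic bone marrow", "special IO")] * half
    rng = random.Random(seed)  # the seed only makes the example reproducible
    rng.shuffle(orders)
    return orders


schedule = assign_needle_order(30, seed=1)
for paramedic, (first, second) in enumerate(schedule[:5], start=1):
    print(f"Paramedic {paramedic}: {first} needle first, then {second} needle")
```

With this kind of assignment, any practice effect from the first insertion is spread evenly across both needles rather than always favoring the generic needle.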

How can we determine if the association between two variables is a mere association or if one causes the other? We often take this for granted. For example:
. . . . . 1. Measles virus exposure is associated with measles.
. . . . . 2. Asbestos exposure is associated with mesothelioma.
. . . . . 3. Stress is associated with shingles (zoster).
. . . . . 4. Cigarette smoking is associated with lung cancer.

Which of the above are mere associations and which are cause and effect? What objective criteria are you using to come to a decision?

Koch's Postulates are useful in determining cause and effect, but these postulates are applicable to infectious disease only:
. . . . . 1. The microorganism must be present in every case of the disease.
. . . . . 2. The organism can be cultured.
. . . . . 3. This cultured organism must cause the same disease when inoculated into another host.
. . . . . 4. The organism must be recovered and re-cultured from this new diseased host.

Although not as sound as Koch's postulates, the following criteria have been the standard for determining cause and effect. Epidemiologic cause and effect criteria include:
. . . . . 1. Strength and consistency of association (multiple studies show the same relationship).
. . . . . 2. Dose-response relationship (greater risk factor exposure results in greater risk of disease).
. . . . . 3. Correct temporal association (risk factor occurs first, then disease; not vice-versa).
. . . . . 4. Specificity of cause and effect (risk factor is specific for this disease; i.e., it does not cause other diseases).
. . . . . 5. Plausible reason for a cause and effect relationship to exist.

Are the following cause and effect under the above criteria?
. . . . . 1. Measles virus exposure is associated with measles. Yes.
. . . . . 2. Asbestos exposure is associated with mesothelioma. Very close to yes.
. . . . . 3. Stress is associated with shingles (zoster). No.
. . . . . 4. Cigarette smoking is associated with lung cancer. Probably, but not definite.

Stress does not meet the cause and effect criteria for zoster since the strength and consistency of the association are weak and the level of stress is difficult to measure, so a dose-response relationship is lacking. Stress and zoster are not specifically linked, since stress causes ill conditions other than zoster. Cigarette smoking does not meet cause and effect criteria because it does not meet the specificity criterion. Smoking is associated with GI cancer, lung cancer, esophageal cancer, oral cancer, heart disease, etc. If you have ever wondered why the tobacco industry claims that cigarette smoking is not a proven cause of lung cancer, this is the reason (lack of specificity). However, the reason for this lack of specificity is that cigarette smoke is not a homogeneous substance and the pathogenesis of cancer is complex and multifactorial.

Single case reports should always be viewed with suspicion. Single case reports are only reported if the phenomenon reported is rare or unheard of. Such cases present a distorted view of reality. One could interpret case reports in the exact OPPOSITE way that they are presented. For example, if I wrote a "case report" about a child who got bit by a mosquito and then began to itch, no journal would ever publish this case report since we know that mosquito bites cause localized pruritus. But if I wrote a case report about a child who got bit by a mosquito and while scratching he invented a warp drive rocket engine, such an unexpected "case report" would be of interest to some journals. It has been said that you may choose to believe the exact opposite of a case report. In this case, getting bit by a mosquito does not cause one to invent an advanced means of interplanetary rocket propulsion. Additionally, if the case reported is so rare and it has already occurred, it is not likely to occur again.

Case series, on the other hand, are not subject to the same criticism as the single case report. Case series should be taken more seriously.

Screening tests: These terms are frequently misused.

Sensitivity=TP/(TP+FN)=the fraction of all patients who truly have the disease (true positives plus false negatives) that are caught by the test. A very sensitive test identifies most of those who have the disease. However, there may still be a substantial number of false positives with a highly sensitive test.

Specificity=TN/(TN+FP)=the fraction of all patients who truly do not have the disease (true negatives plus false positives) that test negative. A very specific test correctly identifies most of those who do not have the disease. However, there may still be a substantial number of false negatives with a highly specific test.

Positive predictive value=TP/(TP+FP)=the likelihood of having a disease if the test is positive.

Negative predictive value=TN/(TN+FN)=the likelihood of not having a disease if the test is negative.
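
The four definitions above can be collected into one small helper. This is a minimal sketch; the function name is hypothetical, and the counts in the usage example are invented purely to exercise the formulas (they do not come from any study).

```python
def screening_metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV and NPV from a 2x2 table.

    tp, fp, fn, tn are the counts of true positives, false positives,
    false negatives and true negatives. A value is None when its
    denominator is zero (the quantity is undefined).
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical counts, chosen only to illustrate the calculation.
example = screening_metrics(tp=80, fp=30, fn=20, tn=870)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Note that sensitivity and specificity are computed within the diseased and non-diseased groups respectively, while the predictive values are computed within the test-positive and test-negative groups.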

Using tuberculosis (TB) as an example, primary screening tests should be very sensitive (catch most of the positives). TB skin testing is a useful primary screening test because it is positive in those with TB exposure (most of whom do not have pulmonary TB yet).

Secondary (confirmatory) screening tests should be specific. A negative CXR is likely to indicate the absence of pulmonary TB; thus, a positive skin test with a negative CXR indicates TB exposure, but not pulmonary TB.

The negative predictive value is always high in rare conditions, regardless of how good or bad a test is. For example, is a serum beta-HCG a good test to rule out TB? Of course not. But did you know that its negative predictive value can exceed 99% in ruling out TB? In a study sample, only one in 200 patients had a positive TB skin test. All of these patients had negative beta-HCGs. Thus, a negative beta-HCG has a negative predictive value of 199/200=99.5%. Wow!
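
The arithmetic behind this example can be laid out as a 2x2 table. The sketch below uses only the counts given in the example (200 patients, one positive TB skin test, every beta-HCG negative); the variable names are illustrative.

```python
# Counts from the example: 200 patients, 1 with a positive TB skin test,
# and every beta-HCG result negative.
true_pos, false_pos = 0, 0     # no beta-HCG was positive
false_neg, true_neg = 1, 199   # the one skin-test-positive patient was missed

npv = true_neg / (true_neg + false_neg)
sensitivity = true_pos / (true_pos + false_neg)

print(f"Negative predictive value: {npv:.1%}")
print(f"Sensitivity: {sensitivity:.0%}")
```

The same table that yields a 99.5% negative predictive value also yields a sensitivity of zero, which is why a single impressive number says little about a test.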

Consider the following statement by Dr. Superstar: My clinical evaluation is 95% accurate in diagnosing appendicitis. What does this mean?

Does this mean that Dr. Superstar correctly identifies 95% of those with appendicitis (i.e., sensitivity=95%)? This may sound impressive at first glance, but I could roll a pair of dice and tell you that if I roll any number less than 13, the patient has appendicitis. Using this method, I will identify 100% of those with appendicitis. Of course there will be many false positives, but no false negatives. Rolling dice identifies 100% of those with appendicitis (i.e., sensitivity=100%).

Does Dr. Superstar mean that if he diagnoses appendicitis, then there is a 95% chance that the patient actually has appendicitis (i.e., positive predictive value=95%)? This may sound impressive at first glance, but consider the following: This could mean that Dr. Superstar evaluated 1000 patients. He made a clinical diagnosis of appendicitis in 100 patients, and of these 100 patients, 95 patients had appendicitis and 5 did not (positive predictive value=95%). Of the 900 patients who received a negative clinical evaluation by Dr. Superstar, 450 had appendicitis. Thus, Dr. Superstar could have a 95% positive predictive value, but this does not necessarily indicate that he is a good clinical diagnostician for appendicitis if he misses 450 cases of appendicitis for every 95 that he diagnoses.
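
Working through the counts in this scenario shows how far apart the measures can be. The sketch below uses only the numbers given above; the variable names are illustrative.

```python
# Counts from the Dr. Superstar scenario (1000 patients evaluated).
tp, fp = 95, 5      # 100 positive clinical diagnoses
fn, tn = 450, 450   # 900 negative clinical evaluations

ppv = tp / (tp + fp)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
npv = tn / (tn + fn)

print(f"Positive predictive value: {ppv:.1%}")
print(f"Sensitivity: {sensitivity:.1%}")
print(f"Specificity: {specificity:.1%}")
print(f"Negative predictive value: {npv:.1%}")
```

With these counts the positive predictive value is 95%, yet fewer than one in five patients with appendicitis is detected and a negative evaluation is no better than a coin flip.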

Does Dr. Superstar mean that if he concludes that the patient does NOT have appendicitis, then there is a 95% chance that this is correct (i.e., negative predictive value=95%)? This may sound impressive at first glance, but remember the general statement made earlier about negative predictive value: The negative predictive value for any rare condition is always high regardless of how good or poor the test is. If Dr. Superstar evaluated 1000 patients and only 40 patients actually had appendicitis (4%), then I could use the dice test to predict which patients do not have appendicitis. Rolling any number less than 13 indicates the absence of appendicitis. This test will identify true negatives in 960 out of 1000 instances (i.e., negative predictive value=96%). Again, an obviously useless test can often appear better than a seemingly useful test. You must be careful in accepting some of these numbers.
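
Both dice tests can be checked with the same arithmetic. The sketch below assumes, for illustration, that the "always positive" dice test from the earlier sensitivity discussion is applied to the same 1000 patients (40 with appendicitis) used here; the variable names are hypothetical.

```python
TOTAL_PATIENTS = 1000
WITH_APPENDICITIS = 40
WITHOUT_APPENDICITIS = TOTAL_PATIENTS - WITH_APPENDICITIS

# Dice test A: every roll (always less than 13) is read as "appendicitis".
a_tp, a_fp, a_fn, a_tn = WITH_APPENDICITIS, WITHOUT_APPENDICITIS, 0, 0
a_sensitivity = a_tp / (a_tp + a_fn)
a_ppv = a_tp / (a_tp + a_fp)

# Dice test B: every roll is read as "no appendicitis".
b_tp, b_fp, b_fn, b_tn = 0, 0, WITH_APPENDICITIS, WITHOUT_APPENDICITIS
b_npv = b_tn / (b_tn + b_fn)
b_sensitivity = b_tp / (b_tp + b_fn)

print(f"Test A: sensitivity {a_sensitivity:.0%}, PPV {a_ppv:.0%}")
print(f"Test B: NPV {b_npv:.0%}, sensitivity {b_sensitivity:.0%}")
```

Each useless test scores very well on exactly one measure and very poorly on another, which is the reason for wanting to see all four values reported together.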

There is a fallacy in the high percentages such as 95%. We tend to think that 95% and above for any kind of test is good, because all through our lives, we were taught that 95% was an "A grade". We needed these grades to get into college and medical school. The reality of these numbers is that 95% can be good, but it can also be very poor. THINKING is required to sort this out. Whenever sensitivity and specificity values are calculated, the author should ideally calculate all four values (sensitivity, specificity, PPV, NPV). If the author only publishes one or two of these, it is very likely that the unpublished values are very poor and do not support the author's conclusion. Such an omission should be viewed with extreme skepticism.

Special rate calculations are very common in epidemiology. These terms are frequently used incorrectly. Incidence is the number of new cases that occur. An incidence rate is the incidence divided by some type of standardization factor such as a one year period (the annual incidence rate) or a clinical occurrence such as the total number of births as in the infant mortality rate (the number of infant deaths divided by the total number of live births). Prevalence is the number of cases that exist at a specific moment in time. A prevalence rate is the prevalence divided by some type of standardization factor (which cannot be a time period because by definition, prevalence refers to a single point in time and not a period of time) such as a population base. Because of these differences, incidence is generally used to describe acute conditions, while prevalence is used to describe chronic conditions. The incidence of nursemaid's elbow might be thousands of cases per year. But the prevalence (how many have the condition now) of nursemaid's elbow at this very instant might be zero (or perhaps 1 or 2), because it is very likely that only 0, 1 or 2 children have this at this moment. The prevalence of childhood diabetes in a community might be 300 cases at this moment. If the number of new cases of childhood diabetes is about 35 per year, then we could say that the incidence of new onset diabetes (the initial onset is the acute event) is 35 per year. Thus, incidence underestimates the magnitude of the problem for chronic diseases since incidence only measures new cases.
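
The childhood diabetes figures can be turned into rates with a short calculation. In the sketch below, the 300 existing cases and 35 new cases per year come from the example; the community size of 100,000 children is a hypothetical number introduced only so that rates can be computed. The last line relies on a rough steady-state approximation (prevalence is about incidence times average duration), which is an added assumption and is not stated in this chapter.

```python
prevalent_cases = 300        # existing cases at this moment (from the example)
new_cases_per_year = 35      # new cases per year (from the example)
child_population = 100_000   # hypothetical community size, not from the text

prevalence_rate = prevalent_cases / child_population
annual_incidence_rate = new_cases_per_year / child_population

# Rough steady-state approximation (an assumption):
# prevalence is about incidence multiplied by average duration.
implied_duration_years = prevalent_cases / new_cases_per_year

print(f"Prevalence rate: {prevalence_rate * 100_000:.0f} per 100,000 children")
print(f"Annual incidence rate: {annual_incidence_rate * 100_000:.0f} per 100,000 children per year")
print(f"Implied average duration: {implied_duration_years:.1f} years")
```

The prevalent count is several times the annual incidence because each case persists for years, which is why incidence alone understates the burden of a chronic disease.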

Mortality rates are commonly cited to describe survival and the overall health of a community. Similarly, injury rates and other outcome rates can be used to describe the health of a community. Remember that the mortality rate for everyone is 100%. That is, all of us will eventually die. Therefore, a mortality rate of 10% is impossible unless a time period is attached to it. For example, if the 10-year mortality rate for leukemia is 11%, then 10 years after the diagnosis of leukemia, there is an 89% chance of survival. All mortality rates should have a time period attached to them or it should be understood that the time period is short. The mortality rate for bacterial meningitis is 10%. Does this mean that if I have bacterial meningitis, I have a 90% chance of living forever? No, it implies that 10% of children with bacterial meningitis die shortly after the diagnosis. It should be more accurately called a 1 year or 6 month mortality rate.

Mortality rates can be age adjusted which also permits the rate to be less than 100% because it is a calculated mortality rate corrected by the age distribution of the community's population. This calculation is beyond the scope of this chapter. Some examples of this will be described later. Infant mortality rates are frequently used to assess the health of a community or country. The implication is that a healthy community or country should have a low likelihood of an infant dying.

Consider the following CHALLENGING data interpretation examples:

Coronary artery imaging studies in patients with angina often reveal the presence of coronary artery disease. Following coronary artery bypass grafting (CABG), patients' angina often resolves. What can be said about these observations in determining the utility of CABG? While these findings suggest that CABG reduces angina pain by improving coronary blood flow, a control group of "sham thoracotomy" patients (their chest was cut open, but no coronary grafts were performed) noted a similar degree of angina pain improvement following the "sham thoracotomy" procedure, suggesting that the thoracotomy itself had at least some role in reducing angina pain and that this reduction in pain was not necessarily due to the coronary grafts. CABG recommendations were modified to recommend this procedure only in those with severe coronary disease.

Is phototherapy useful in the management of neonatal hyperbilirubinemia? Is bilirubin toxicity the cause of kernicterus? This is a complicated question, but in a study of ABO incompatibility patients with pre-set exchange transfusion criteria, babies were randomized to phototherapy or no phototherapy to see how many babies in each group reached the pre-set exchange transfusion criteria. There was no significant difference, suggesting that phototherapy does not prevent the need for exchange transfusion in ABO incompatibility hyperbilirubinemia. Whether bilirubin is the cause of kernicterus is controversial and unlikely in my opinion, since very high bilirubins are only associated with kernicterus if the hyperbilirubinemia is due to Rh incompatibility or G6PD deficiency. Other causes of high bilirubins are not associated with kernicterus. Bilirubin may be a marker of the actual cause of kernicterus, rather than the cause of kernicterus itself.

In a study of 100 joggers compared with 100 age matched controls, the mean HDL of the joggers was significantly higher than that of the controls. What can be said about these data? It might be tempting to conclude that jogging improves HDL levels. However, joggers and non-joggers differ from each other in more ways than just jogging. It is likely that joggers have different diets than non-joggers. Multiple confounding factors exist such as diet, smoking, other exercise, work related stress, obesity, hypertension, diabetes, etc. All these factors are related to jogging and potentially to HDL. These confounding variables must be matched between joggers and non-jogger controls.

Design a study to investigate whether IV magnesium prevents the need for intubation in severe status asthmaticus. One proposal: when a severe asthmatic in near respiratory failure comes into the ED, randomize the patient to magnesium or no magnesium and determine which group does better. This is not possible because we would need to get consent from such patients to participate in such a study, and it's not possible to get consent from such severely ill patients. All such patients get treated with multiple meds to prevent respiratory failure. Nothing should be held back. Any study claiming to have ethically randomized severe asthmatics must not have been enrolling severe asthmatics; it must have enrolled moderately severe asthmatics who were stable enough for a consent procedure.

In a meta-analysis of ultrasound in the diagnosis of appendicitis, a scan of the literature identified 32 studies. 27 of these studies concluded that ultrasound is highly accurate in the diagnosis of appendicitis while 5 studies concluded that ultrasound is not accurate. The meta-analysis concludes that ultrasound is accurate. Comments? Several points here: 1) The more studies you see in the literature on a topic, the more controversial it must be. If the answer were clear-cut, no further publications on the topic would be necessary. However, in controversial subject areas, multiple publications are often present in the literature, attempting to clarify the controversy. Thus, the correct conclusion should be that the accuracy of ultrasound in appendicitis is controversial. 2) There is publication bias towards studies with "positive" results. Thus, although 27 studies conclude positively that ultrasound is accurate and 5 studies conclude negatively, the fact that the published positives outnumber the negatives does not necessarily justify the positive conclusion. One negative study should have the weight of multiple positive studies since there is known publication bias favoring the publication of "positive" studies. 3) A study of ultrasound done by the University of XXX's Department of Pediatric Abdominal Ultrasonography Professors (i.e., superspecialists) should not be equated to the practice of general radiologists in community hospitals. A study conducted by superspecialists does not necessarily mean that a group of generalists can match the same results.

Consider the following statement: Trauma is the second leading cause of death. Gee, I didn't know that trauma was such a common cause of death. Trauma is the second leading cause of death only if all other causes of death are lumped together as "non-trauma". There are ways to make any disease seem to be very important. For example, bicycle injuries might be the first or second leading cause of "accidental, non-motor vehicle related death in children 5 years of age". Additional phrases qualifying the death rate can be used to make any particular condition seem very important.

The U.S. has a higher infant mortality rate than Country-X (a very poorly developed country). How can this be? The U.S. defines a live birth as any birth with an Apgar score greater than zero. If a 21-week fetus (not compatible with survival) is passed with a heart beat of 10 beats per minute for 15 seconds until it stops, this is considered a live birth and an infant death in the U.S. Other countries do not count this as a birth. Thus, the U.S. infant mortality rate is not comparable to the way that infant mortality rates are measured in other countries. Very poor countries have poor health surveillance methods and probably don't know about most births and most infant deaths in their country, making their published infant mortality rates very inaccurate. Additionally, a very ill infant with complex surgical problems is transferred from another country to the U.S. where specialized surgical care is available. Despite heroic surgical efforts the infant dies. This infant death is counted in U.S. statistics and not in the infant mortality rate of the country of the infant's birth. There are many things which distort the infant mortality rate making it an inaccurate proxy for the health status of the U.S.

A study is done to determine the most common causes of pneumonia in young adults presenting to a college student health clinic. Blood cultures are drawn from the first 500 patients diagnosed with pneumonia. Of these 500 blood cultures, 15 grow pneumococcus and 4 grow staph epidermidis. The study concludes that only 3% (15 out of 500) of cases of pneumonia are due to pneumococcus. Comment on this conclusion. A blood culture is not a gold standard for determining the cause of the pneumonia. Many patients with pneumococcal pneumonia have negative blood cultures. Staph epidermidis is often a contaminant and is very rarely a cause of pneumonia. Other causes of pneumonia such as mycoplasma and viruses will not be recovered from a blood culture.

A study was done to determine if a single IM dose of ceftriaxone (Rocephin) was sufficient to cure otitis media. This study randomized children with OM to receive 10 days of amoxicillin or a single IM dose of ceftriaxone. After 10 days, roughly 67% of children were found to be cured in both groups. The study concludes that single dose ceftriaxone is effective in treating otitis media. What are the problems with this study? In a study comparing amoxicillin with placebo in the treatment of otitis media, cure rates were roughly the same in both groups. Thus, ceftriaxone is similar to placebo in efficacy or similar to amoxicillin in efficacy. If similar to placebo, then one would conclude that ceftriaxone doesn't work, but if similar to amoxicillin, one would conclude that it does work. Unfortunately, we can conclude neither because the definition of OM is poor (no gold standard). It is likely that most cases of OM are minor and don't require treatment, so if the study deliberately enrolled patients with minor OM, then the spontaneous cure rate would make it appear that both groups got better at the same rates. Drug companies might deliberately set up a study to have a favorable conclusion. Notice that they chose to compare ceftriaxone to amoxicillin rather than to placebo because if the cure rates were shown to be similar to amoxicillin, the conclusion would be that ceftriaxone is effective, but if the cure rates were similar to placebo, then the conclusion would be that ceftriaxone is ineffective. When you do a study consisting of 980 cases of minor OM and 20 cases of severe OM, any treatment benefits will be diluted by the 980 minor cases that don't benefit from treatment. The problems here are the lack of a gold standard to define the disease condition being studied and the heterogeneity (minor cases mixed with severe cases) of that condition.
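
The dilution effect in the 980 minor and 20 severe case example can be sketched numerically. All of the cure probabilities below are invented for illustration only (they are not from any study): minor cases are assumed to resolve 68% of the time with or without treatment, while severe cases are assumed to resolve 20% of the time untreated and 70% of the time treated.

```python
def overall_cure_rate(groups):
    """groups: list of (number_of_patients, cure_probability) pairs."""
    total_patients = sum(n for n, _ in groups)
    expected_cures = sum(n * p for n, p in groups)
    return expected_cures / total_patients


# Case mix from the text: 980 minor cases and 20 severe cases of OM.
MINOR, SEVERE = 980, 20

# ASSUMPTIONS (hypothetical values, chosen only to illustrate dilution):
minor_cure = 0.68             # minor OM resolves on its own, treated or not
severe_cure_untreated = 0.20  # severe OM without effective treatment
severe_cure_treated = 0.70    # severe OM with effective treatment

treated = overall_cure_rate([(MINOR, minor_cure), (SEVERE, severe_cure_treated)])
untreated = overall_cure_rate([(MINOR, minor_cure), (SEVERE, severe_cure_untreated)])

print(f"Overall cure rate, treated:   {treated:.1%}")
print(f"Overall cure rate, untreated: {untreated:.1%}")
print(f"Apparent benefit in the mixed group: {treated - untreated:.1%}")
print(f"Benefit in severe cases alone: {severe_cure_treated - severe_cure_untreated:.1%}")
```

Under these assumptions, a treatment that helps half of the severe cases changes the overall cure rate by only about one percentage point, a difference that a study of this design would be unlikely to detect.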

Questions
1. At the Acme emergency department, the hospitalization rate for all ED patients is 6%. The emergency physicians at Acme have developed a test to predict the need for hospitalization. When this test is positive, there is a 93% chance that the patient will NOT need hospitalization. Is this a useful test?

2. In a meta-analysis of midazolam (Versed) sedation in children undergoing procedures, a scan of the literature identified 10 studies. 7 of these studies concluded that midazolam was highly efficacious in accomplishing sedation without significant adverse effects. Three studies concluded otherwise. The meta-analysis concludes that midazolam is an effective agent for pediatric sedation. Comments?

3. True/False: Sweden and Norway have lower mortality rates than the U.S.

4. Poor country PP has a border with a wealthy country WW. The age adjusted mortality rate for WW is higher than for PP, suggesting that PP is a healthier country than WW. This is obviously not the case as one can observe by traveling through both countries. How can this discrepancy be explained?

5. You read a textbook of medicine that cites the incidence and prevalence of diabetes mellitus. Which number (incidence or prevalence) is more useful to describe the epidemiology of diabetes? When is one preferred over the other? How accurate are these numbers? Where do these numbers come from?

6. Define sensitivity, specificity, positive predictive value and negative predictive value. Which of these is frequently >90% even if the test is a poor one?

7. Is it possible to have a test that has a nearly 100% sensitivity, specificity, positive predictive value and negative predictive value? If so, give an example of such a test in clinical medicine.



Answers to questions

1. No. Since the Acme emergency department has a hospitalization rate of 6%, we know that 94% are not hospitalized. By just stating that all patients do not require hospitalization, I have a 94% chance of predicting this correctly. Therefore, no test at all is better than the 93% predictive value of the Acme physicians' test. Although 93% sounds like a good number, it is actually a poor number in this case.

2. The more studies you see in the literature on a topic, the more controversial it must be. If the answer were clear-cut, no further publications on the topic would be necessary. However, in controversial subject areas, multiple publications are often present in the literature, attempting to clarify the controversy. Thus, the correct conclusion should be that the efficacy of midazolam for pediatric sedation is controversial.

3. False. No matter what country you live in, we all eventually die. The mortality rate in all countries is 100%.

4. It could be that since PP is so poor, they don't have an organized health department which keeps accurate health statistics. When the World Health Organization asks PP to submit their age adjusted mortality rate, the health minister in PP just writes down any number and sends it in. This random number just happens to be lower than the accurately determined age adjusted mortality rate submitted by WW. Another explanation is that since PP has such a poor health care system, any patient who is very ill is illegally smuggled over the border into WW where the patient shows up in an emergency room. The ethical staff in WW hospitals take care of these very ill patients who frequently die. These death statistics are registered in the health statistics of WW. Thus, many deaths that should have been attributed to PP actually show up in the age adjusted mortality rate of WW instead. Such a phenomenon could make it appear that many people in PP never die.

5. Prevalence is better for diabetes. Chronic diseases are best described with prevalence while acute diseases are best described with incidence. These numbers may not be very accurate. They may come from disease condition registries or from health department statistics. These systems require that hospitals and/or physicians send in report cards diagnosing the patient's condition so that the statistic can be kept. However, such reports are frequently not made even in "reportable" diseases which the law requires to be reported. Diabetes is not a reportable illness. Another source may be health insurance claims information which contain diagnostic codes.

6. In the following, TP=true positives, FP=false positives, TN=true negatives and FN=false negatives. Sensitivity=TP/(TP+FN)=the fraction of all patients with the disease who are caught by the test. A very sensitive test identifies most of the patients who have the disease. However, there may still be a substantial number of false positives in a highly sensitive test. Specificity=TN/(TN+FP)=the fraction of all patients without the disease who are correctly identified as negative by the test. A very specific test correctly identifies most of the patients who do not have the disease. However, there may still be a substantial number of false negatives in a highly specific test. PPV=TP/(TP+FP)=the likelihood of having a disease if the test is positive. NPV=TN/(TN+FN)=the likelihood of not having a disease if the test is negative. The NPV frequently has a deceptively high value, such as >90%, if the disease condition is infrequent.

7. It is possible. Most of these tests are gold standards since they are nearly perfect and they actually define the disease entity. But generally if a study publishes only two out of these four values, it is likely that the authors are publishing the two best values and have suppressed the other two values which do not appear as good. A test which has nearly perfect sensitivity, specificity, PPV and NPV is the pregnancy test. Lumbar puncture for meningitis is also quite good. Radiographic images for certain types of fractures which are obvious (e.g., forearm fractures) are also quite good.
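
The formulas in answer 6 and the reasoning in answer 1 can be checked with a short sketch. The 2x2 counts below are hypothetical: 1000 patients of whom 6% have the condition (the same proportion as the Acme hospitalization rate), tested with a worthless test that is positive in half of each group, like a coin flip. The function name diagnostic_metrics is only illustrative.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the four test characteristics from 2x2 table counts."""
    return {
        "sensitivity": tp / (tp + fn),  # diseased patients caught by the test
        "specificity": tn / (tn + fp),  # non-diseased patients correctly negative
        "ppv": tp / (tp + fp),          # chance of disease if the test is positive
        "npv": tn / (tn + fn),          # chance of no disease if the test is negative
    }


# HYPOTHETICAL data: 60 patients with the condition, 940 without, and a
# coin-flip test that is positive in exactly half of each group.
coin_flip = diagnostic_metrics(tp=30, fp=470, fn=30, tn=470)
for name, value in coin_flip.items():
    print(f"{name}: {value:.0%}")

# Answer 1: predicting "no hospitalization" for everyone is right 94% of the
# time, which beats the 93% predictive value quoted for the Acme test.
hospitalization_rate = 0.06
baseline_accuracy = 1 - hospitalization_rate
acme_predictive_value = 0.93
print(f"Baseline: {baseline_accuracy:.0%}, Acme test: {acme_predictive_value:.0%}")
```

Even this useless test has an NPV of 94% because the condition is infrequent, which is why a high predictive value must always be compared with what could be predicted with no test at all.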
