Department of Pediatrics, University of Hawaii John A. Burns School of Medicine

November 2002

A 15 month old boy is brought to the clinic by his mother because of fever. His temperature has been up to 40 degrees C (104 degrees F). He has a slight cough, but no vomiting or diarrhea. His urine output is normal without urinary complaints. His mother has given him acetaminophen, which results in some improvement. His past medical history is unremarkable. He is generally healthy.

Exam: VS T 39.8 (103.6), P120, R 35, BP 90/60, oxygen saturation 99% in room air. Height and weight are at the 50th percentile. His examination is normal except for mild nasal congestion. His tympanic membranes are normal.

You believe that he most likely has a viral infection, but you are told that occult bacteremia is a possibility. Since this is a clinically occult phenomenon, you decide to do a study to determine if a CBC is useful in predicting whether occult bacteremia is present. You put together an institutional review board (IRB) study proposal which is approved. 400 febrile patients are enrolled in the study. They all have CBCs and blood cultures drawn. Out of the 400 patients, 20 have positive blood cultures. The mean white blood count (WBC) in these 20 bacteremic patients is 16.8 (standard deviation=7.3). The mean WBC in the other 380 patients is 13.9 (standard deviation=6.2). p=0.04 which means that the probability that this difference is due to chance is only 4% (0.04). Thus, there is a 96% chance that this difference is real, which means that the WBC in bacteremic patients is significantly higher than in non-bacteremic patients. Believing that your study results are astounding, you decide to present this information to the hospital chief of staff summarized as follows:

All of a sudden, the results do not appear to be as astounding as initially thought. Most of the patients with a WBC >20.0 still have negative blood cultures. Additionally, 9 out of the 20 bacteremic patients had WBC counts less than 15.0.

There are two basic types of data: 1) continuous variables which can take on any value within a reasonable range (e.g., age, weight, blood pressure, peak flow, oxygen saturation, cholesterol, etc.), and 2) discrete (or categorical) variables (e.g., sex, ethnicity, socioeconomic status, medical insurance, etc.) which can only take on values of discrete categories. Data types are important because they determine which type of statistical test(s) to run.

Notice that most continuous variables can be converted into discrete variables by grouping them in ranges. For example, age groups can be formed: 1) 0-1 yr, 2) 2-5 yrs, 3) 6-10 yrs, 4) 11 yrs and above. Cholesterol values can be categorized into high cholesterol versus low cholesterol. Discrete variables cannot usually be converted into continuous variables. However, some discrete variables have rank order while others do not. For example, medical insurance has a particular rank order: 1) no insurance, 2) Medicaid insurance, 3) private insurance. Military rank is another discrete variable that has rank order. Socioeconomic status could have rank as well: 1) unemployed, 2) blue collar, 3) white collar. Discrete variables such as race, hair color, political party, etc., have no inherent rank order.

Statistics are basically either descriptive or inferential. Descriptive statistics are summary numbers used to describe a set of data. If 1000 data measurements are obtained, it would be impractical to list all 1000 measurements in your publication. It would be more efficient to present a few summary numbers which describe the 1000 data measurements. These are descriptive statistics. Descriptive statistics for continuous variables include: mean, standard deviation, range, mode, median, etc. The mean, mode and median describe the central tendency of the group of observations. The range, standard deviation and confidence interval describe the spread of the observation measurements. For example, for a set of 1000 cholesterol measurements, the mean is 100, the range is 40 to 310, and the standard deviation is 45. Descriptive statistics for discrete variables include rates and frequencies (numerator/denominator). For example, 30% of the group has black hair.
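As a small illustration of these summary numbers, the sketch below uses Python's standard statistics module. The measurements and hair colors are hypothetical values invented for this example; they are not data from the text.

```python
import statistics

# Hypothetical measurements, invented purely for illustration.
values = [12, 15, 15, 18, 20]

mean = statistics.mean(values)        # central tendency
median = statistics.median(values)    # middle value when sorted
mode = statistics.mode(values)        # most frequent value
sd = statistics.stdev(values)         # sample standard deviation (n - 1 denominator)
value_range = (min(values), max(values))

# A discrete variable is summarized as a rate: numerator / denominator.
# These ten hair colors are also hypothetical.
hair = ["black", "brown", "black", "blond", "brown",
        "black", "red", "brown", "brown", "blond"]
black_rate = hair.count("black") / len(hair)

print(mean, median, mode, round(sd, 2), value_range, black_rate)
```

Five numbers (or one rate) stand in for the whole list, which is the point of descriptive statistics.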

These descriptive statistics can be graphically compared to determine if two sets of observations are different. The means represent the centers of two bell shaped curves. A small standard deviation means that the shape of the bell is very narrow. A large standard deviation means that the shape of the bell is very wide. One standard deviation from the mean estimates the point of inflection (where the curve changes from convex down to convex up) of the bell shaped curve. The mean plus or minus two standard deviations should contain approximately 95% of the observations (or area under the curve). If the two bells have substantial overlap, then the two groups are most likely NOT significantly different. If the two bell shaped curves have almost no overlap, then the two groups are most likely significantly different.
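The "approximately 95%" figure can be checked directly for an ideal bell shaped curve. This is a minimal sketch that assumes the data follow a normal distribution exactly; the function name is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_one_sd = fraction_within(1)   # roughly two thirds of observations
within_two_sd = fraction_within(2)   # roughly 95% of observations
print(round(within_one_sd, 3), round(within_two_sd, 3))
```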

The 95% confidence interval can be calculated to determine the likely range of the true mean. The mean of a sample estimates the true mean. The 95% confidence interval calculates the range of possible values for the mean with 95% confidence (i.e., there is a 95% chance that the true mean lies within the 95% confidence interval). A wide range or interval means that there is great uncertainty about what the true mean is (large variance), while a narrow 95% confidence interval means that there is great certainty about what the true mean is. The 95% confidence interval is similar to graphing two distributions because if the 95% confidence intervals of two groups exclude each other, then the two groups are significantly different. For example, group A has a mean of 25 and group B has a mean of 30. The 95%CI for A is 23 to 27, while the 95%CI for B is 28 to 32. The two groups are significantly different because the two 95%CIs exclude each other. But if the 95%CIs were 20 to 30 and 27 to 33, then they overlap, so the two groups are probably not significantly different.
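One common way to compute a 95% confidence interval for a mean is the large-sample formula: mean plus or minus about 1.96 standard errors, where the standard error is the standard deviation divided by the square root of the sample size. The sketch below assumes that formula (it is not spelled out in the text) and applies it to the cholesterol figures quoted earlier (mean 100, standard deviation 45, 1000 measurements). The function names are made up for this example.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval_95(mean, sd, n):
    """Large-sample 95% CI for a mean: mean +/- z * sd / sqrt(n)."""
    z = NormalDist().inv_cdf(0.975)        # about 1.96
    margin = z * sd / sqrt(n)
    return mean - margin, mean + margin

def intervals_overlap(a, b):
    """True if intervals a = (low, high) and b = (low, high) share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

low, high = confidence_interval_95(mean=100, sd=45, n=1000)
print(round(low, 2), round(high, 2))

# The two pairs of intervals described in the text.
first_pair_overlaps = intervals_overlap((23, 27), (28, 32))
second_pair_overlaps = intervals_overlap((20, 30), (27, 33))
print(first_pair_overlaps, second_pair_overlaps)
```

With 1000 measurements the interval is narrow (roughly 97 to 103) even though the individual values are widely scattered. For small samples, a t-based multiplier larger than 1.96 would be more appropriate.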

Inferential statistics compare two or more groups of observations to determine if the groups are significantly different (or related) or NOT different, in a more mathematically precise way. Nowadays, these tests are all done by computer software. It is not really useful to know the formulas for calculating the test results. The concepts of which test to use and how to interpret the results are more important. The selection of a statistical test seems perplexing, but in its basic form, it is rather simple. Since there are only two types of data (continuous and categorical), comparing variables can only take on a limited number of combinations. A basic guide is as follows:

Comparing a continuous variable between two groups: T-test.

Comparing a continuous variable between more than two groups: Analysis of variance.

Comparing a discrete variable between two or more groups: Chi-square.

Determining the relationship between one continuous variable and one or more continuous variables: Regression (linear regression for two variables, multiple regression for more than two variables).
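The guide above amounts to a small lookup on the two data types involved. The sketch below restates it as a function; the function and argument names are hypothetical, and it covers only the combinations listed in the guide.

```python
def choose_test(outcome, comparison, groups=2):
    """Return the basic test suggested by the guide above.

    outcome and comparison are each "continuous" or "categorical";
    groups is the number of categories being compared.
    """
    if outcome == "continuous" and comparison == "categorical":
        return "t-test" if groups == 2 else "analysis of variance"
    if outcome == "categorical" and comparison == "categorical":
        return "chi-square"
    if outcome == "continuous" and comparison == "continuous":
        return "regression"
    raise ValueError("combination not covered by the basic guide")

# WBC (continuous) compared between bacteremic and non-bacteremic groups.
wbc_test = choose_test("continuous", "categorical", groups=2)
print(wbc_test)
```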

Although we often use inferential statistics to determine if two groups of observations are different, statisticians utilize a non-intuitive concept called the null hypothesis, which hypothesizes that the two groups are the same. If we are trying to determine if something is different (which is the usual case), you can think of the null hypothesis as the opposite of what we are trying to show. For the example of the study of WBCs and bacteremia, the null hypothesis is that the WBCs in bacteremic and non-bacteremic febrile children are the same.

The commonly cited p value is the probability of obtaining a difference at least as large as the one demonstrated if chance alone were operating (i.e., if the null hypothesis were true). It is often loosely described as the probability that the difference is due to chance. Statisticians have selected p=0.05 (or 5%) as an arbitrary cut off value. If p<0.05, then the difference is said to be statistically significant because the probability of chance alone producing such a difference is less than 5%. If this probability is greater than 5%, then this probability is too high for the difference to be statistically significant.

If p<0.05, the statistical terminology is that we reject the null hypothesis. Recall that the null hypothesis was that the WBC counts in bacteremic and non-bacteremic febrile children are the same. So by rejecting the null hypothesis, we are concluding that the WBCs in bacteremic and non-bacteremic children are different (i.e., significantly different such that we can reject the null hypothesis).
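The case example's p value can be approximated from its summary statistics alone. The sketch below assumes the equal-variance (pooled) form of the T-test, which is a guess about what the study software did, and it uses the normal curve in place of the t distribution, which is a close approximation with 398 degrees of freedom. The function name is made up for this example.

```python
from math import sqrt
from statistics import NormalDist

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t statistic (equal-variance form) from summary statistics."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error

# Bacteremic: mean 16.8, SD 7.3, n 20. Non-bacteremic: mean 13.9, SD 6.2, n 380.
t = pooled_t(16.8, 7.3, 20, 13.9, 6.2, 380)

# Normal approximation to the t distribution (398 degrees of freedom).
p_two_sided = 2 * (1 - NormalDist().cdf(abs(t)))

reject_null = p_two_sided < 0.05
print(round(t, 2), round(p_two_sided, 3), reject_null)
```

Under these assumptions the t statistic is about 2.02 and the p value is a little over 0.04, so the null hypothesis is rejected. A T-test that does not assume equal variances could give a different answer for the same numbers.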

The null hypothesis is non-intuitive (seemingly backward thinking) to most non-statisticians. It might be easier to think of the p value as the significance level. If p is <0.05 (or 5%), then this difference is said to be significant. If p is >0.05, then this probability is not small enough, so the result is said to be non-significant.

These statistical tests are best understood by example. A study is undertaken to determine which alien species is smarter: Jupitrons or Zoobies. IQ tests are performed on 1000 Jupitrons and 800 Zoobies. The mean IQ for the Jupitrons is 110 (standard deviation 30), and the mean IQ for the Zoobies is 120 (standard deviation 75). A T-test is done which determines the p value to be non-significant. Although the Zoobies have a higher mean IQ, their standard deviation is large, which means that there are a lot of smart Zoobies and a lot of not-so-smart Zoobies. The standard deviation for the Jupitrons is smaller, so the spread of Jupitron IQs is narrower. Since p is larger than 0.05, we cannot reject the null hypothesis, and we conclude that the IQs of Jupitrons and Zoobies are not significantly different. If p were 0.02 instead, then we would reject the null hypothesis and conclude that the IQ levels of Jupitrons and Zoobies are significantly different.

Another alien group, the Dimbos, are added to the comparison. 1000 Dimbos are studied and their mean IQ is 68 (standard deviation 22). The three groups, Jupitrons, Zoobies and Dimbos are compared using analysis of variance (ANOVA): p=0.01, which means that at least one of these groups is different from the others. In this case, it is quite obvious that the Dimbos are less intelligent than the Jupitrons and Zoobies, but in some other instances, it may not be that obvious. If 10 different groups are tested and p is significant, this could mean that the lowest group is different from the highest group, but other groups may be different from the others as well.

Compare the T-test to ANOVA. The only difference is that the T-test tests two different groups and ANOVA tests three or more groups. What if we did an ANOVA test, but only used two groups? It turns out that this is mathematically identical to the T-test. So whether you select a T-test or ANOVA for the comparison of two groups, the statistical calculation and the p value will be the same.
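The equivalence can be checked numerically: for two groups, the ANOVA F statistic equals the square of the equal-variance t statistic. The sketch below uses the WBC summary statistics from the case example and computes both from scratch. It assumes the equal-variance form of the T-test, which is the form that matches ANOVA.

```python
from math import sqrt

# Summary statistics (mean, standard deviation, n) from the case example.
groups = [(16.8, 7.3, 20), (13.9, 6.2, 380)]

total_n = sum(n for _, _, n in groups)
grand_mean = sum(mean * n for mean, _, n in groups) / total_n

# ANOVA: variation between group means versus variation within groups.
ss_between = sum(n * (mean - grand_mean) ** 2 for mean, _, n in groups)
ss_within = sum((n - 1) * sd**2 for _, sd, n in groups)
ms_between = ss_between / (len(groups) - 1)
ms_within = ss_within / (total_n - len(groups))
f_statistic = ms_between / ms_within

# Equal-variance t statistic for the same two groups.
(m1, _, n1), (m2, _, n2) = groups
t_statistic = (m1 - m2) / sqrt(ms_within * (1 / n1 + 1 / n2))

print(round(f_statistic, 3), round(t_statistic**2, 3))
```

Both printed values agree, which is why the two tests give the same p value for a two-group comparison.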

Jupitrons have hearts too, so a study is done to compare heart attack (acute myocardial infarction) rates in Jupitrons and Humans. Out of the 1500 Jupitrons residing on the planet colony Vlazer, 15 have sustained myocardial infarcts (MI) (1%). This compares to the 3000 Humans residing on Vlazer, of whom 90 have sustained MIs (3%). A chi-square test is done to determine if the 3% MI rate in Humans is significantly higher than the 1% MI rate in Jupitrons. These results form a 2 by 2 table which looks like this:

Humans: 90 MI, 2910 no MI (row total 3000)
Jupitrons: 15 MI, 1485 no MI (row total 1500)
Column totals: 105 MI, 4395 no MI (grand total 4500)

The clinical question is: Do MIs occur more frequently in Humans? The 2 by 2 table above can be related to an "expected values" table. If the species (Human versus Jupitron) is unrelated to MI (in other words, there is no relationship between MI and species, or the MI rates are the same in both species), the numbers in each of the cells of the 2 by 2 table should distribute in proportion to the row and column totals. The expected value in each cell should be the row total multiplied by the column total, divided by the grand total. The expected values table should look like this:

Humans: 70 MI, 2930 no MI (row total 3000)
Jupitrons: 35 MI, 1465 no MI (row total 1500)
Column totals: 105 MI, 4395 no MI (grand total 4500)

The expected value for the Human MI cell is calculated by multiplying the row total (3000) by the column total (105), all divided by the grand total (4500).

3000 X 105 / 4500 = 70

The actual value in this cell is 90, so it deviates from the expected value by 20. The difference between the actual value and the expected value in each cell is squared and divided by that cell's expected value, and these terms are added together to form the chi-square value. The larger the chi-square value, the smaller the p value. The 2 by 2 table is a reasonably simple calculation by hand, but nowadays, all of these calculations are done by computer. The p value for this particular set of data is <1% (p<0.01), which is significant, so we conclude that the MI rate in Humans is significantly different from that of Jupitrons (i.e., we reject the null hypothesis that the MI rates in Humans and Jupitrons are the same).
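The hand calculation can be sketched in a few lines using the counts given in the text. The sketch uses the plain (uncorrected) Pearson chi-square; statistical software may apply a continuity correction to 2 by 2 tables, which would give a slightly smaller value. The variable names are made up for this example.

```python
from math import erfc, sqrt

# Observed counts from the text: rows are species, columns are (MI, no MI).
observed = {"Humans": (90, 2910), "Jupitrons": (15, 1485)}

row_totals = {species: sum(counts) for species, counts in observed.items()}
col_totals = [sum(counts[i] for counts in observed.values()) for i in range(2)]
grand_total = sum(row_totals.values())

# Expected count = row total * column total / grand total.
expected = {
    species: tuple(row_totals[species] * col_totals[i] / grand_total for i in range(2))
    for species in observed
}

# Chi-square = sum over cells of (observed - expected)^2 / expected.
chi_square = sum(
    (observed[s][i] - expected[s][i]) ** 2 / expected[s][i]
    for s in observed
    for i in range(2)
)

# For a 2 by 2 table (1 degree of freedom) the p value has a closed form.
p_value = erfc(sqrt(chi_square / 2))

print(expected["Humans"][0], round(chi_square, 2), p_value < 0.01)
```

The expected Human MI count comes out to 70, the chi-square value is about 17.6, and the p value is well below 0.01, in agreement with the conclusion in the text.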

A similar methodology can be used if there are more than two groups and more than two possible outcomes. For example, comparing MI rates in Humans, Jupitrons and Zoobies would result in a 2 by 3 table. Comparing hair color in Humans, Jupitrons and Zoobies would result in a 4 by 3 table assuming that there are 4 possible hair color types.

Note that so far, we have compared a continuous variable by a categorical variable (IQ by alien species using a T-test or ANOVA), then a categorical variable by a categorical variable (MI by alien species using the Chi-square method). The only other possible combination is to compare a continuous variable by a continuous variable. The method used here is regression. In the selection of a statistical test, there are only three possibilities: 1) continuous by categorical, 2) categorical by categorical, and 3) continuous by continuous. The selection of a statistical test is not that hard after all.

Regression analysis determines the degree of correlation that one continuous variable has with another. An example would be age and weight. Of course these have some degree of correlation, so such a study would show statistically significant correlation. Regression analysis generates a correlation coefficient (called r). The correlation coefficient (r) can range from -1 to 1. If r is positive, this means that as one variable goes up, the other variable goes up. Age and weight would be an example of this. If r is negative, this means that as one variable goes up, the other variable goes down. Birth weight and hospital length of stay is an example of this because low birth weight tends to result in longer hospital lengths of stay. An r value of 1 or -1 implies perfect positive or perfect negative correlation, respectively. An r value of 0 indicates that there is no correlation between the two variables tested. Regression analysis also calculates a p value. The p value is the probability of obtaining an r value at least this far from 0 by chance alone if there were truly no correlation.

For example, if the r value is 0.1 and p=0.01, then there is significant correlation, because p is <0.05, which means that the r value is significantly different from zero. If the r value is 0.5 and p=0.4, then there is no significant correlation even though the r value is larger, because p is too high, which means that an r value this large would arise by chance about 40% of the time if the true correlation were zero. A large r value with a large p value is often seen with regression analysis with only a few observations (an inadequate sample size).
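The dependence on sample size can be made concrete. A standard way to test a correlation coefficient is to convert r to a t statistic, t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom. The sketch below assumes that test and integrates the t distribution numerically. The sample sizes of 660 and 5 are hypothetical, chosen only so that the p values land near the ones quoted in the text.

```python
from math import exp, lgamma, log, log1p, pi, sqrt

def t_density(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    log_constant = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_constant - (df + 1) / 2 * log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p value: area in both tails beyond |t| (Simpson's rule)."""
    upper = abs(t)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = total * h / 3          # area between 0 and |t|
    return 1 - 2 * central_area

def p_for_correlation(r, n):
    """p value for testing whether a correlation coefficient differs from zero."""
    t = r * sqrt((n - 2) / (1 - r * r))
    return two_sided_p(t, df=n - 2)

# Sample sizes below are hypothetical, chosen to mirror the text's two cases.
weak_but_significant = p_for_correlation(r=0.1, n=660)
strong_but_not_significant = p_for_correlation(r=0.5, n=5)
print(round(weak_but_significant, 3), round(strong_but_not_significant, 2))
```

A weak correlation in several hundred observations reaches p of about 0.01, while a much stronger correlation in only five observations has p of about 0.4.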

If the regression analysis involves only two variables, this is called linear regression. If the regression analysis involves more than two variables, then this is called multiple regression, in which case, one variable must be considered a dependent variable and the other variables must be independent variables, such that:

D = Q + aX + bY + cZ

In the above equation, D is the dependent variable, Q is a constant, X, Y and Z are independent variables, and a, b and c are factors which determine the effect of X, Y and Z on D. An example of this is a study which attempts to determine the environmental factors that result in wheezing. D=the number of children wheezing on a given day, X=the number of viral infections in the community on that day, Y=the bad weather index as measured by that day's barometric pressure, and Z=the amount of air pollution (dust, pollutants and volcanic dust present) that day. All three of these factors affect the amount of wheezing in the community. Terms a, b, and c can be thought of as slope terms for the model, but they are not the correlation coefficients. Separate correlation coefficients and p values would be determined for each independent variable X, Y and Z to determine the degree of correlation (the r value) and whether the correlation is significant (p value) for each of X, Y and Z.

All the statistical tests that have been described so far, have a few assumptions. A basic assumption is that the data is distributed in a "normal distribution" (resembling a bell shaped curve). This assumption is usually not true. However, the central limit theorem (no need to describe this here) usually allows us to use these tests if the distribution is "somewhat normal" and there are enough observations (data points). If a normal distribution is clearly not present, then, we must use "non-parametric" tests (e.g., Wilcoxon rank-sum test, Mann-Whitney U test, etc.). So the selection of a statistical test is not quite as simple as what was described earlier.

There are two types of statistical errors, known as type 1 and type 2. The type 1 error is the probability of incorrectly concluding that a true difference exists. The probability of a type 1 error is known, measured by the p value. In the case example, the probability of incorrectly concluding that the WBC counts in bacteremic versus non-bacteremic patients are different is 4% (p=0.04).

The type 2 error is the probability of incorrectly concluding that the two groups are the same (i.e., no difference exists). Unfortunately, there is no foolproof way to measure this type of error accurately. If one concludes that no difference exists because p>0.05, then there are two possible realities. One reality is that the two groups are the same and the conclusion that no difference exists is correct. The other possible reality is that the two groups are different, but because of an inadequate sample size, the study was unable to show that p<0.05. For example, suppose we undertook a study to determine if males were taller than females, took a random sample of 6 adults (3 male and 3 female), and found a mean male height of 173 cm and a mean female height of 163 cm. In this case p=0.15. We conclude that there is no significant difference in male and female heights. A type 2 error has occurred here. We know that there should be a difference. Yet the p value has not achieved statistical significance (p is not less than 0.05). This is due to an inadequate sample size. With only three data points in each group, it is intuitively obvious that more subjects in each group would be necessary.

Whenever an inferential statistical test concludes that no significant difference exists, it is customary to perform a "power calculation", which estimates the probability that the study would have detected a true difference of a given size had one existed. A large sample size has greater statistical power, adding to the strength of the conclusion that no significant difference exists. How many subjects or observations does one need to avoid type 2 errors? This gets into the discussion of sample size determination. This is a rather complex subject, but suffice it to say that it requires several assumptions.

a) How do we know if a true difference exists? If one truly does NOT exist, it would take an infinite number of data observations to achieve statistical significance. Since we don't really know if a true difference exists (that's why we're doing a study), how can we really determine the sample size?

b) What is the true variance (std. deviation, etc.) or scatter of the real life data? If the scatter is wide, we need a large sample size. If the scatter is narrow, then we can get by with a smaller sample size. But we often do not know what the actual variance is. Then how can we determine a sample size before the study?

So to estimate a sample size before a study is done, we must guess how large the difference is, if one exists. We must also guess at the spread (variance) of the data. If we make these two assumptions, we can estimate the sample size.
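One widely used approximation for comparing two means shows how the two guesses enter the calculation: n per group = 2 * (z_alpha + z_power)^2 * SD^2 / difference^2. The sketch below assumes that formula, a two-sided cutoff of 0.05, and 80% power; none of these choices come from the text. It applies the formula to the height example, taking the 10 cm difference from the text and a standard deviation of 7 cm as a purely illustrative guess.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(difference, sd, alpha=0.05, power=0.80):
    """Approximate n per group for comparing two means (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    n = 2 * (z_alpha + z_power) ** 2 * sd**2 / difference**2
    return ceil(n)

# Guess 1: the true difference is about 10 cm (the difference seen in the text).
# Guess 2: the standard deviation of adult height is about 7 cm (an assumption).
n_needed = sample_size_per_group(difference=10, sd=7)

# Halving the assumed difference roughly quadruples the required sample size.
n_for_smaller_difference = sample_size_per_group(difference=5, sd=7)
print(n_needed, n_for_smaller_difference)
```

Under these guesses about 8 subjects per group would be needed, compared with the 3 per group in the example. Because the formula relies on the normal approximation, it tends to slightly underestimate the requirement when the answer is this small.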

Just because something is statistically significant, does not necessarily mean that this is clinically important. In the WBC/bacteremia example in the case, bacteremic patients were shown to have significantly higher WBC counts compared to non-bacteremic patients. Look at the actual numbers again:

As we know, in any single case, the WBC is not very predictive because there is too much overlap. Note that MOST of the patients with WBC >20.0 have negative blood cultures. Also, nearly half of the bacteremic patients had WBCs lower than 15.0. Statistical significance does not always indicate clinical importance.

Most studies perform multiple statistical tests. Using p<0.05 as a cutoff value for statistical significance means that each "statistically significant" result has a 5% chance of being due to chance alone (i.e., wrongly concluding that a difference actually exists). The more tests that are run, the more likely it is that one will, by chance, wrongly find a "statistically significant" result. When multiple tests are performed, this phenomenon should be acknowledged. Prior to running the statistical tests, it may be preferable to set statistical cutoff values at something less than 5% (e.g., 1%). Correcting for the phenomenon of multiple tests has been supported by some editorials in the literature. Just realize that the problem exists and perhaps acknowledge it, form a crude means to correct it, or get a statistical expert to find an acceptable way of correcting it.
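The size of the problem can be estimated with simple arithmetic. The sketch below assumes the tests are independent of each other, which is often not strictly true in a real study. The correction shown, dividing the cutoff by the number of tests, is commonly called the Bonferroni correction; it is one crude option and is not named in the text.

```python
def chance_of_false_positive(tests, alpha=0.05):
    """Probability of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** tests

def bonferroni_cutoff(tests, alpha=0.05):
    """Per-test cutoff that keeps the overall false positive rate near alpha."""
    return alpha / tests

one_test = chance_of_false_positive(1)
twenty_tests = chance_of_false_positive(20)
cutoff_for_five_tests = bonferroni_cutoff(5)
print(round(one_test, 2), round(twenty_tests, 2), round(cutoff_for_five_tests, 3))
```

With 20 independent tests, the chance of at least one spurious "significant" result is about 64%. With 5 tests, the corrected cutoff is 1%, the example value mentioned above.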

Optional paragraph (feel free to skip this entire paragraph, because the concept discussed in this paragraph is somewhat difficult to grasp): Note that the conclusion in the case example is that the WBCs in the two groups are different. "Different" means greater than or less than. Although we know that the mean WBC in bacteremic patients is higher than the mean WBC in non-bacteremic patients, the probability of 4% (p=0.04 in the case example) means that the probability that the difference is due to chance is 4%. If we knew ahead of time that, if a difference exists, the WBC in bacteremic patients would be higher and not lower, then we are actually only interested in checking one side of the "they are different" relationship (i.e., that the WBC count is higher in bacteremic patients). We are NOT interested in investigating the possibility that the WBC count is lower in bacteremic patients. This is a rather subtle difference and this concept is difficult to understand. This refers to the concept of the single sided test (also known as one tailed) versus the two-sided test (also known as two tailed). Computer generated p values are generally two-sided probabilities by default, since the assumption is that we are performing a two tailed test. The two tailed probability is for the conclusion that the WBC is different in bacteremic patients. The single sided probability is for the conclusion that the WBC is greater in bacteremic patients. So the probability that the different WBC in bacteremic patients is due to chance is 4% (p=0.04). The probability that the higher WBC in bacteremic patients is due to chance is 2% (half of the two tailed probability). This concept can be very important if your p value is, for example, 8% (p=0.08), which is not considered statistically significant. However, if you appropriately use a single sided probability, then p=0.04, which is statistically significant. The major issue here is that the null hypothesis must be stated properly prior to determining the probabilities. This is somewhat complex and beyond the scope of this chapter.

Questions

1. You have interviewed 50 children who have been hospitalized for bicycle related head injuries and found that 14 of them were wearing a bicycle helmet at the time of the accident. In a control group (children without injuries riding their bicycle on a community bicycle path), you observe the first 100 children and note that 92 of them are wearing bicycle helmets. What descriptive statistics should be described here? What inferential statistical test should be done?

2. If the result of the inferential statistical test for the example above is p=0.001, what conclusion can be drawn?

3. What would the null hypothesis be for the example above?

4. Indicate whether the following are categorical variables or continuous variables.

. . . . . a. Type of health insurance.

. . . . . b. Cholesterol.

. . . . . c. Oxygen saturation.

. . . . . d. Respiratory rate.

. . . . . e. Subdural hematoma.

. . . . . f. Lumbar puncture result.

. . . . . g. Cervical spine fracture.

5. You are doing a study on oxygen saturation values in asthmatics presenting to an emergency room. You find that asthmatics who are eventually discharged home had a mean oxygen saturation of 95.6% at initial presentation, but the asthmatics who require hospitalizations presented with a mean oxygen saturation of 94.5%. What are the descriptive statistics that should be presented? What inferential statistical test should be used here?

6. In the example above, the p value is found to be 0.001. This is considered highly significant since the p value is so small. Comment on whether this highly significant result is clinically important.

7. Is the oxygen saturation measurement distributed in a normal fashion? In other words, if you plotted a value of oxygen saturation for 10,000 patients, would the shape of the distribution be bell shaped? Explain why or why not.

8. Without doing a statistical test, indicate whether you think the following examples show groups that are significantly different or not and justify your answer:

. . . . . a. Mean IQs in two groups are 90 and 120. The standard deviation is 45 for both groups.

. . . . . b. Mean weights in two groups are 45 and 55 kilograms. The standard deviation is 3 for the first group and 2 for the second group.

. . . . . c. Mean oxygen saturations in two groups are 94% and 97%. The standard deviation is 2% for both groups.

References

1. Glaser AN. High-Yield Biostatistics, 2nd edition. 2001, Philadelphia: Lippincott Williams & Wilkins.

2. Hildebrand DK, Ott L. Statistical Thinking for Managers, 2nd edition. 1987, Boston: Duxbury Press.

Answers to questions

1. Descriptive statistics are the rates of bicycle helmet use in the injured group and in the control group. The proper inferential statistical test to use is a chi-square test.

2. The rate of bicycle helmet use in the injured group is significantly different from that in the control group. It might be tempting to say that bicycle helmets prevent significant head injuries from this study, but such a study is not good enough to conclude this.

3. Bicycle helmet use rates in the two groups are the same.

4a. Categorical.

4b. Continuous.

4c. Continuous.

4d. Continuous.

4e. Categorical. The patient either has it or they don't.

4f. This could be both depending on what we mean by this. This would be a continuous variable if we are referring to the CSF WBC count, the RBC count, the CSF glucose, or the CSF protein. This would be a categorical variable if we are considering the CSF to be normal or abnormal, or if we are considering the gram stain result (organisms versus no organisms).

4g. Categorical. The patient either has it or they don't.

5. The basic descriptive statistics are the mean oxygen saturations in each group. Other commonly cited descriptive statistics are the standard deviations and the ranges for each group, which would describe the spread of the data. The inferential statistical test would be a T-test or ANOVA.

6. This difference is statistically significant, but it is not very clinically important because the difference between 95.6% and 94.5% is only about 1%. Continuous pulse oximetry readings will frequently fluctuate by 2 to 4 percentage points on the same patient without any clinical changes occurring.

7. The oxygen saturation (like most biomedical measurements) is not normally distributed. Most biomedical measurements have a theoretical limit on their values. Oxygen saturation values cannot exceed 100%. Thus, if one creates a distribution of oxygen saturation measurements, it will show a few points below 80%, a few more points between 80% and 90%, a fair number of points between 90% and 95%, a large number of points between 95% and 100%, and no points above 100%. This is not bell shaped. Other examples of theoretical limits are: glucose values cannot go below zero, respiratory rates will not go below 10, etc.

8a. These groups are not significantly different. The mean plus or minus two standard deviations should contain approximately 95% of the area under the bell shaped curve. Thus, the shapes of these curves are wide with substantial overlap. It is not likely that these groups will be shown to be significantly different.

8b. These standard deviations are small, so the bell shaped curves are very narrow and they do not overlap each other. Thus, it is likely that these groups will be shown to be significantly different from each other.

8c. This one is not easily determined. A T-test would have to be run to calculate the p value. The two means are fairly close to each other, but the standard deviation is also small.