Lessons from an Old Bathroom Scale on Measurement Reliability

If you use an old bathroom scale to weigh yourself and step on and off it a few times, you’ll probably get different readings, even within seconds. These measurements of your weight (more correctly, your mass, but I’ll stay with weight) vary randomly. But they shouldn’t.

Unfortunately, old bathroom scales produce weights that are not very reliable or consistent. The weights (or scores) they display vary because they contain error variance. Step on and off a few times and sometimes your weight is a bit up, other times a bit down. Error variance is random: you can't predict which way it will go each time you step on the scale.

Simply stated, the old bathroom scale is showing weights or scores that are not reliable. Reliability refers to the consistency of scores.

Similarly, many dissertations and theses use questionnaires or tests that measure knowledge, attitudes, perceptions, or some construct. The scores they produce should be reliable. If you are using a measurement instrument in your dissertation, you will need to describe the reliability of the scores of your measurement instrument, among its other properties.

So, how do you measure the reliability of the scores produced by a measurement instrument?

As an example, say that we use the old bathroom scale to weigh many individuals. The weights of these individuals vary, as they should, because some people are bigger than others. But there is also error variance in the scores or weights produced by the scale.

If 20% of the variability we observe in the weights of these individuals is random error due to the inconsistencies of our old scale, then 80% of the variability is free from such error. Thus, the reliability coefficient of the scores produced by our scale would equal .80. Similarly, if 30% of the observed variability in the scores is random error variance, the reliability coefficient would equal .70.
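To see this in numbers, here is a minimal Python simulation of the idea; all the specifics (mean weight, spread, size of the scale's error) are invented for illustration and are not from any real scale:

```python
# A minimal simulation of the bathroom-scale example; all numbers
# (mean weight, spread, error size) are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)

n = 100_000
true_sd, error_sd = 12.0, 6.0                 # true-score and error spread (kg)
true_weight = rng.normal(80, true_sd, n)      # people's actual weights

# Two readings of the same people, each with independent random error
reading_1 = true_weight + rng.normal(0, error_sd, n)
reading_2 = true_weight + rng.normal(0, error_sd, n)

# Reliability = share of observed variance that is NOT random error:
# 144 / (144 + 36) = .80, i.e. 20% error variance, as in the text
reliability = true_sd**2 / (true_sd**2 + error_sd**2)

# Test-retest reliability: correlation between the two sets of readings
test_retest_r = np.corrcoef(reading_1, reading_2)[0, 1]

print(f"theoretical reliability: {reliability:.2f}")    # 0.80
print(f"observed test-retest r:  {test_retest_r:.2f}")  # ~0.80
```

The correlation between two independent readings of the same people recovers the proportion of variance that is free of random error, which is why a test-retest correlation can serve as a reliability coefficient.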

So, why is it important to have reliable scores?

Consider, for example, a situation in which we need to assess how well a new diet pill works. We measure the weights of a random sample of overweight individuals, have them take the daily diet pill for a month, and then re-measure their weights with the same scale we used before they started taking the pill.

If we use our old unreliable scale for the pre- and post-measurements, some of the variability in the weights will be random error. In the extreme case, the scores may contain so much random error that we cannot evaluate the changes in the individuals' weights from before to after taking the pill. We wouldn't know whether the diet pill was working!
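Here is a hedged Python sketch of that scenario; the sample size, the true 2 kg loss, and the error sizes are all made up:

```python
# The diet-pill scenario as a simulation; every number here is invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n = 30
true_before = rng.normal(95, 10, n)   # true weights before the pill (kg)
true_after = true_before - 2.0        # everyone truly loses 2 kg

for label, error_sd in [("precise scale", 0.5), ("old noisy scale", 8.0)]:
    before = true_before + rng.normal(0, error_sd, n)
    after = true_after + rng.normal(0, error_sd, n)
    t, p = stats.ttest_rel(before, after)  # paired t-test on the change
    print(f"{label:<16} mean change {np.mean(after - before):+.2f} kg, p = {p:.4f}")
```

With the precise scale, the real 2 kg loss is unmistakable; with the noisy old scale, the very same change will usually fail to reach significance.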

This is the first reason why it is important to have reliable scores – poor score reliability reduces the potential for evaluating change. Any instrument that produces unreliable scores measures inconsistently, so changes in inconsistent scores cannot imply real change. Not cool.

The second reason is that the error variance in the scores of an unreliable instrument obscures what is being measured. This means that the validity of the scores, the degree to which the instrument measures what it is supposed to measure, is compromised. So, measurement instruments that produce scores with poor reliability will also have compromised validity. Also not cool.

The third, fourth, and fifth reasons concern statistical tests. Recall that power is the probability of detecting a difference that truly exists. Random error, or noise, in unreliable scores makes detection harder, because differences must be larger to reach significance. So, unreliability reduces the power of statistical tests and, in turn, shrinks the associated effect sizes. It also attenuates the observed correlations between our scores and other measures, so that our instrument is a poorer predictor of outcomes than it should be. Not cool at all.
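The drop in observed correlations has a classic closed form, Spearman's correction for attenuation; the post itself does not mention it, and the numbers below are assumed purely for illustration:

```python
# Spearman's attenuation formula: observed r = true r * sqrt(rel_x * rel_y).
# The true correlation and the reliabilities below are assumed values.
true_r = 0.50                  # true correlation between construct and outcome
rel_x, rel_y = 0.80, 0.70      # score reliabilities of the two measures

observed_r = true_r * (rel_x * rel_y) ** 0.5
print(f"observed correlation: {observed_r:.2f}")  # ~0.37, down from 0.50
```

Even with respectable reliabilities of .80 and .70, a true correlation of .50 shows up as roughly .37 in the data.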

In short, it is very important that the measurement instruments you use in your dissertation measure reliably (more so than the old bathroom scale, which you can simply replace). I hope I have convinced you of this.

What other types of measurement reliability are there?

Up to now, we have discussed test-retest reliability. There are, however, many other forms of score reliability, all involving consistency. These include parallel-forms reliability, assessed by the consistency of scores on different versions of the same test; inter-rater reliability, assessed by the consistency of scores that different raters give the same individual; and internal consistency reliability, assessed by coefficient alpha, better known as Cronbach's alpha. Students almost always use Cronbach's alpha to report the reliability of the scores of their measurement scales.
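For the curious, here is a minimal sketch of how Cronbach's alpha is computed from a respondents-by-items data matrix; the data are simulated, and in practice you would plug in your own questionnaire responses:

```python
# Cronbach's alpha from a respondents-by-items matrix; data are simulated.
import numpy as np

rng = np.random.default_rng(1)
latent = rng.normal(0, 1, (200, 1))           # one underlying construct
items = latent + rng.normal(0, 1, (200, 5))   # 5 items, each with noise

k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)         # variance of each item
total_var = items.sum(axis=1).var(ddof=1)     # variance of the summed score

# alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")       # ~0.83 for these data
```

Statistical packages compute the same quantity for you; the sketch just shows that alpha is simply a ratio of variances, nothing more exotic.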

Alpha is, however, one of the most frequently misunderstood and misinterpreted statistics in theses and dissertations involving measurement. But that is a topic for a future blog post.
