
Winsorizing: How to Complicate a Simple Idea

Last year I came across this term and had to look it up. I understood the idea as soon as I read the description, but it made me think about the propensity of technologists to invent terminology. Of course, one side effect of creating a specialized language is that it excludes those who do not speak it.

For example, the legal profession excels at taking terms and evolving them to have novel or bewitching meanings.

I learned this term when reading the information in a financial trading case. Then, I figured out why it was called “winsorizing.” It was nothing mysterious: the person who first described it in literature germane to that field was surnamed Winsor.

I was recently reminded of this as I read the trial transcript in preparation for my testimony. The attorney called attention to this term and explained its meaning. I realized that an essential part of what I do when demystifying technology is to come up with intuitive explanations for what we do and why.

So, winsorizing is the process of limiting the influence of “outliers” in the data. Strictly speaking, it replaces the most extreme values with the nearest value that is kept rather than deleting them (deleting them outright is usually called trimming), but the motivation is the same. For example, the attorney spoke to a jury of a dozen people. The example I would have used would have been to explain that if I had a group composed of the jurors plus Elon Musk, the average net worth of the group would be around the net worth of Elon Musk divided by 13. One data point dominates that group. Thus, the idea of taming “outliers” is to keep a few points from having a disproportionate effect on the group. If we don’t do that, we might conclude that jurors are rich, which is not likely the case. Handling such samples this way usually leads to better analysis, but the reason may not be evident until you explain it. Tying it to something the reader (for a report) or audience (jury) can understand is powerful and helpful in making a random term “make sense.”
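For readers who prefer code, here is a minimal sketch of the jury example. The net worth figures are made up for illustration (including the single extreme value), and the `winsorize` helper and its percentile cutoffs are my own assumptions rather than anything from the case. It clamps extreme values to the nearest retained value instead of deleting them.

```python
from statistics import mean


def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clamp values outside the given percentile ranks to the nearest kept value.

    Hypothetical helper for illustration; the cutoffs are arbitrary assumptions.
    """
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[int(lower_pct * (n - 1))]
    high = ordered[int(upper_pct * (n - 1))]
    # Values are replaced, not removed, so the group keeps its size.
    return [min(max(v, low), high) for v in values]


# Made-up net worths in dollars for 12 jurors (assumption, not real data).
jurors = [50_000, 80_000, 120_000, 150_000, 200_000, 250_000,
          300_000, 350_000, 400_000, 500_000, 600_000, 750_000]

# One extreme data point, also a made-up figure.
group = jurors + [200_000_000_000]

raw_mean = mean(group)                      # dominated by the single outlier
winsorized_mean = mean(winsorize(group))    # outlier clamped to 750,000

print(f"raw mean:        {raw_mean:,.0f}")
print(f"winsorized mean: {winsorized_mean:,.0f}")
```

With these invented numbers, the raw average is in the billions, while the winsorized average lands in the same range as the jurors themselves.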

The Elon of Pineapples

Another example I like to use also occurred to me while reading the trial transcript. Covid is still very much an active threat, and they are testing people to ensure they do not have Covid. There is a presumption that a positive test result means you have Covid, but that is often not the case. This is a general problem with tests.

My usual way of explaining this is to start with a highly accurate test, say 99.5% accurate. For 1000 people, that test will return about five positive results even if none of those people have the condition.

In this case, the “positive predictive value” of the test is 0%. Whether the condition doesn’t exist or simply doesn’t occur in the population, the probability that someone with a positive test result has the condition is zero: every positive is a false positive.

That’s not the real world in which we live, though. The condition does exist but is rare. So, if the prevalence of the condition for which we are testing is 0.5%, then 5 people in 1000 actually have it, and the test should catch them. Our 99.5% accurate test will also flag almost 5 people among the 995 who shouldn’t be positive, and thus we expect to see about 10 people with a positive test result. Now the “positive predictive value” is approximately 50% (half true positives, half false positives).
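The arithmetic can be written as a short function. This is a sketch under one simplifying assumption of mine: the single “accuracy” figure is used both as the chance of catching a real case (sensitivity) and as the chance of clearing a healthy person (specificity). Real tests usually report those two numbers separately. The function name is hypothetical.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Fraction of positive results that are true positives."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    total_positives = true_positives + false_positives
    if total_positives == 0:
        return 0.0  # no positives at all, so nothing to predict
    return true_positives / total_positives


# Assumption: "99.5% accurate" means sensitivity = specificity = 0.995.
ppv_absent = positive_predictive_value(0.0, 0.995, 0.995)    # nobody has the condition
ppv_rare = positive_predictive_value(0.005, 0.995, 0.995)    # 5 in 1000 have it

print(f"condition absent: {ppv_absent:.0%}")
print(f"0.5% prevalence:  {ppv_rare:.0%}")
```

When the condition is absent the result is 0%, and at 0.5% prevalence it comes out at roughly one half, matching the two cases described above.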

Why does this matter? The “rapid antigen” test for Covid is between 45% and 97% accurate. It becomes more challenging to evaluate some of this because as the test becomes less accurate, we have to start worrying about false-negative results. For the moment, let’s ignore that because my point is narrower. At 97%, we get about 30 false-positive results in a group of 1000 people. So, how do we protect against that?

Easy. We leave out the people who are most likely to be negative and test only the rest. This increases the prevalence of the condition in the group being tested. In my earlier example, the prevalence was 0.5%. If the prevalence had been 5%, so 50 people in 1000 were positive, and the same 99.5% test gave us about 5 false positives (4.75, since a false positive can only happen among the 950 negative people), then our positive predictive value is much better: about 91%.

Thus, usually, we’re told to only test “if you have symptoms.” This eliminates the people who are unlikely to have Covid, and thus for the group that does test, the prevalence is higher, and the “positive predictive value” of the test is better. In court, the prevalence will be relatively low, and thus a positive result without symptoms is likely to be a false positive. With 97% accuracy, there are about 30 false positives per 1000. If the prevalence is 1%, there are about 10 true positives. The “positive predictive value” is about 25% (10 true positives out of 40 total positives).
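To see how much the prevalence matters, here is a small sketch that counts expected true and false positives per 1000 people for a 97% test. Only the 1% prevalence comes from the example above; the 5% and 20% rows are illustrative values I picked to stand in for a group that has symptoms. As before, it assumes the one accuracy figure applies to both sick and healthy people, and the helper name is hypothetical.

```python
def expected_positives(population, prevalence, accuracy):
    """Expected (true positives, false positives) for a group.

    Assumption: `accuracy` serves as both sensitivity and specificity.
    """
    sick = population * prevalence
    healthy = population - sick
    true_positives = sick * accuracy
    false_positives = healthy * (1 - accuracy)
    return true_positives, false_positives


ppv_by_prevalence = {}
for prevalence in (0.01, 0.05, 0.20):  # 5% and 20% are illustrative only
    tp, fp = expected_positives(1000, prevalence, 0.97)
    ppv_by_prevalence[prevalence] = tp / (tp + fp)
    print(f"prevalence {prevalence:>4.0%}: "
          f"{tp:5.1f} true, {fp:5.1f} false, PPV {ppv_by_prevalence[prevalence]:.0%}")
```

The 1% row reproduces the roughly one-in-four figure, and the positive predictive value climbs quickly as the tested group is narrowed to people more likely to be sick.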

Explaining things systematically takes time and patience, but it is advantageous because it helps people better understand what we’re doing and why we’re doing it.

Now, I hope my Covid test is not a false positive so I can provide my testimony to the court successfully.
