Last year I came across the term “winsorizing” and had to look it up. I understood the idea as soon as I read the description, but it made me think about the propensity of technologists to invent terminology. Of course, one benefit of creating a specialized language is that it excludes those who do not understand it.
For example, the legal profession excels at taking terms and evolving them to have novel or bewitching meanings.
I learned this term while reading the materials in a financial trading case. Then I found out why it is called “winsorizing.” It was nothing mysterious: the person who first described it in the literature of that field was surnamed Winsor.
I was recently reminded of this as I read the trial transcript in preparation for my testimony. The attorney called attention to this term and explained its meaning. I realized that an essential part of what I do when demystifying technology is to try to come up with intuitive explanations for what we do and why.
So, winsorizing is a way of limiting the influence of “outliers” in the data. (Strictly speaking, winsorizing replaces extreme values with a less extreme cutoff value, while discarding them outright is called trimming; the motivation is the same.) In the trial, the attorney was speaking to a jury of a dozen people. The example I would have used is this: if I had a group composed of the jurors plus Elon Musk, the average net worth of the group would be roughly the net worth of Elon Musk divided by 13. One data point dominates the group. Thus, the idea of handling “outliers” is to keep points that have a disproportionate effect from distorting the picture of the group. If we don’t do that, we might conclude that jurors are rich, which is not likely the case. Dealing with such samples usually leads to better analysis, but the reason may not be evident until you explain it. Tying it to something the reader (of a report) or the audience (a jury) can understand is a powerful way to make an unfamiliar term “make sense.”
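To make the jury example concrete, here is a minimal Python sketch. The net worth figures are invented for illustration, and the 10th/90th percentile cutoffs and the simple index-based percentile rule are assumptions of this sketch, not anything taken from the case. Note that it clamps the extreme values to the cutoffs rather than deleting them, which is what distinguishes winsorizing from simply dropping outliers.

```python
from statistics import mean


def winsorize(values, lower_pct=0.1, upper_pct=0.9):
    """Clamp values outside the chosen percentiles to the values found there.

    The percentile rule (truncating index into the sorted list) is a
    simplification chosen for this sketch.
    """
    ordered = sorted(values)
    n = len(ordered)
    low_cutoff = ordered[int(lower_pct * (n - 1))]
    high_cutoff = ordered[int(upper_pct * (n - 1))]
    return [min(max(v, low_cutoff), high_cutoff) for v in values]


# Hypothetical net worths in dollars: twelve jurors plus one extreme outlier.
jurors = [40_000, 55_000, 60_000, 75_000, 80_000, 90_000,
          100_000, 120_000, 150_000, 200_000, 250_000, 300_000]
group = jurors + [200_000_000_000]

raw_mean = mean(group)                     # dominated by the single outlier
winsorized_mean = mean(winsorize(group))   # close to a typical juror

print(f"raw mean:        ${raw_mean:,.0f}")
print(f"winsorized mean: ${winsorized_mean:,.0f}")
```

With these made-up numbers the raw mean is in the billions, while the winsorized mean lands in the same range as the jurors themselves.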
Another example I like to use also occurred to me while reading the trial transcript. Covid is still very much an active threat, and people are being tested to ensure they do not have it. There is a presumption that a positive test result means you have Covid, but that is often not the case. This is a general problem with tests.
My usual way of explaining this is to start with a highly accurate test: 99.5%. So, for 1000 people, that test will return about five positive results even if none of those people have the condition.
In this case, the “positive predictive value” of the test is 0%. That is, when nobody in the tested population has the condition (whether because it doesn’t exist or because it doesn’t occur in that population), every positive test result is a false positive.
That’s not the real world in which we live, though. The condition does exist but is rare. So, if the prevalence of the condition for which we are testing is 0.5%, then 5 people in 1000 actually have it. Our 99.5% accurate test will also flag almost 5 of the 995 people who shouldn’t test positive, and thus we expect to see about 10 people with a positive test result. Now the “positive predictive value” is approximately 50% (half true positives, half false positives).
Why does this matter? The “rapid antigen” test for Covid is between 45% and 97% accurate. This becomes more challenging to evaluate because, as the test becomes less accurate, we also have to start worrying about false-negative results. For the moment, let’s ignore that because my point is more focused. At 97%, we get about 30 false-positive results in a group of 1000 people. So, how do we protect against that?
Easy. We omit people who are more likely to be negative. This increases the prevalence of the condition among those tested. In my earlier example, the prevalence was 0.5%. If the prevalence had been 5%, so that 50 people in 1000 had the condition, and we had about 5 false positives (4.75, since a false positive can only happen in the group of 950 negative people), then our positive predictive value would be much better: about 91%.
Thus, usually, we’re told to test only “if you have symptoms.” This eliminates the people who are unlikely to have Covid, so for the group that does test, the prevalence is higher and the “positive predictive value” of the test is better. In court, the prevalence will be relatively low, and thus a positive result without symptoms is likely to be a false positive. With 97% accuracy, there are about 30 false positives per 1000. If the prevalence is 1%, there are 10 true positives. The “positive predictive value” is 25% (10 true positives out of 40 total positives).
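The arithmetic in the last few paragraphs can be sketched in a few lines of Python. This is a minimal illustration that assumes “accuracy” means the test is equally good at both jobs, that is, its sensitivity (catching true cases) and its specificity (clearing people without the condition) are the same number; real tests usually differ on the two. The function name and scenario labels are my own.

```python
def positive_predictive_value(prevalence, accuracy):
    """Fraction of positive results that are true positives.

    Simplifying assumption: sensitivity == specificity == accuracy.
    """
    true_positives = prevalence * accuracy
    false_positives = (1 - prevalence) * (1 - accuracy)
    total_positives = true_positives + false_positives
    if total_positives == 0:
        return 0.0  # nobody tests positive at all
    return true_positives / total_positives


# (label, prevalence, accuracy) for the cases discussed in the text.
scenarios = [
    ("nobody has the condition, 99.5% test", 0.0, 0.995),
    ("0.5% prevalence, 99.5% test", 0.005, 0.995),
    ("5% prevalence, 99.5% test", 0.05, 0.995),
    ("1% prevalence, 97% test", 0.01, 0.97),
]

for label, prevalence, accuracy in scenarios:
    ppv = positive_predictive_value(prevalence, accuracy)
    print(f"{label}: PPV is about {ppv:.0%}")
```

Under that assumption the four cases come out at roughly 0%, 50%, 91%, and 25%, which is why testing only the people more likely to have the condition makes a positive result so much more meaningful.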
Explaining things systematically takes time and patience, but it is advantageous because it helps people better understand what we’re doing and why we’re doing it.
Now, I hope my Covid test is not a false positive so I can provide my testimony to the court successfully.
Recently one of my clients contacted me and asked what I know about “determining authorship.” I pointed out that I have a published paper on the subject (Plagiarism Reduction). An integral part of doing that work was learning how the tools we use to detect plagiarism actually work. But, of course, my work was about code plagiarism (a common problem in education and in my practice) and not prose plagiarism. The case my client asked about involved determining the authorship of a written work. The facts of the case involve a collaboration between a group of authors that is being unravelled (by lawyers, of course). Thus the question arose whether there is a way to automatically identify which sections were written by which authors.
So I agreed to write about this area to provide some background and point to potential tools. First, I want to touch on the fundamental reason this becomes a legal issue: any work of authorship, whether it is for entertainment, business, course work, or a computer program, comes with a significant right: copyright. While copyright laws vary somewhat from country to country, overall there is pretty broad agreement about these rights and their international interpretation. This is based upon an agreement (the “Berne Convention”) first signed in 1886 and amended several times since. Since most of my work is in the United States, I am most familiar with copyright law from a U.S. perspective. Copyright protects a specific expression. In that regard, it is a narrow intellectual property right. Thus, if we each write a piece of computer software that does the same thing, we would each have a copyright on our own implementation. If part of the code turns out to be the same, that could be because there’s only one way to express something, or it could be because one of us copied it from the other.
How do we tell when two pieces of copyrighted material are, in fact, “the same”? Suppose I write a computer program in FORTRAN. I then use a specialized piece of software called a compiler, which translates the instructions in the programming language into instructions that the processor in the computer understands. The compiled version of my program is a derivative work. The rights of the original author are not eliminated in the derivation process. This is important because software, in particular, is often re-used. Indeed, re-used software is generally better than new software because the latter is more likely to contain errors (“bugs”) that will need to be found and fixed.
Thus, we usually consider the compiled version of the computer program to be “the same.” Similarly, a small change to a literary work does not eliminate the copyright. We’d consider the two versions to be “the same” even though they are not strictly identical. So, when we’re trying to figure out who wrote something, the question becomes how to determine that. A wealth of techniques have been developed over the years to do authorship analysis. These are motivated by historical concerns (e.g., “who wrote this anonymous text?”), legal concerns (“is this the same code?”), and academic integrity questions (“did the student submit work they copied from elsewhere?”).
Commercial services, such as Turnitin and Grammarly, have productized some of these techniques. They can draw upon a plethora of public and private sources, so they can take a written work and map it to other examples they have seen previously. For source code, in the CS 6200 course at Georgia Tech (the one where I implemented the plagiarism reduction intervention that proved effective), we used a well-known tool called MOSS (“Measure of Software Similarity”). MOSS uses an interesting technique: it compares programs by their structure rather than their raw text, which I think of as comparing each program’s abstract syntax tree (AST) against other implementations. Doing this strips away the elements that are not germane to the program, such as naming and formatting, and instead focuses on what the program does, as captured by the signature of the AST. Once a piece of software source code is “big enough,” it becomes possible to say that matches in the ASTs are unlikely to be coincidental. So, small bits of code can be identical, but one cannot conclude much from that. However, when we have 100+ of those small bits of code, and it turns out that 50% of the code has the same AST, then we have strong evidence of shared heritage.
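To illustrate the idea of comparing structure rather than text, here is a small Python sketch using the standard library’s `ast` and `difflib` modules. To be clear, this is not MOSS’s implementation; it is a toy that reduces each program to the sequence of its syntax-tree node types and compares those sequences. The function names and sample programs are invented for this example.

```python
import ast
from difflib import SequenceMatcher


def structure_signature(source):
    """Reduce source code to the sequence of its AST node type names.

    Identifiers, comments, and formatting do not appear in the result,
    so renaming variables leaves the signature unchanged.
    """
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


def structural_similarity(source_a, source_b):
    """Similarity between 0.0 and 1.0 of two programs' structure signatures."""
    matcher = SequenceMatcher(None,
                              structure_signature(source_a),
                              structure_signature(source_b))
    return matcher.ratio()


original = '''
def total(values):
    # add everything up
    result = 0
    for v in values:
        result += v
    return result
'''

# Same logic with every name changed and the comment removed.
renamed = '''
def s(xs):
    acc = 0
    for item in xs:
        acc += item
    return acc
'''

# An unrelated function for contrast.
different = '''
def greet(name):
    print("hello " + name)
'''

print(structural_similarity(original, renamed))
print(structural_similarity(original, different))
```

The renamed copy scores as structurally identical to the original even though almost none of the text matches, while the unrelated function scores lower. On snippets this small the result means little, which is the point made above: the evidence only becomes persuasive once there is enough code for coincidence to be implausible.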
One approach I have seen in more traditional literary works is similar to this AST style comparison, but it is not the only technique. Since I periodically review the literature on this topic, I took this opportunity to highlight some interesting papers that I found in my most recent search. Increasingly, I see statistical machine learning techniques used to facilitate rapid, automated detection.
In “Authorship Identification on Limited Samplings,” the authors refined machine learning (ML) techniques to find those that are “the most efficient method of authorship identification using the least amount of samples.” They do an excellent job of summarizing the ML techniques in use today: Naive Bayes, SVMs, and neural networks. Those go beyond the scope of what I want to write about today, but they are effective and frequently used techniques for finding patterns in large data sets, such as written works.
In “A Framework for Authorial Clustering of Shorter Texts in Latent Semantic Spaces,” the authors look at using unsupervised learning (a technique in machine learning that does not rely upon prior data for “training”) on document clusters to identify similarity for short pieces of text (around a paragraph long). One could use this by splitting a single document into sections and then applying these techniques to the resulting collection. One would expect the works of multiple authors to form “clusters” due to the similarity between different paragraphs attributable to each author. This work is intriguing, though the authors do caution: “[t]horough experimentation with standard metrics indicates that there still remains an ample room for improvement for authorial clustering, especially with shorter texts.” From my reading, I suspect their techniques would be appropriate for a small number of authors. This could then be applied either to the case my client first asked about (joint authors in a large document) or to an educational setting (to identify “patchworking” of a few different documents together into a single document submitted by a student).
Another intriguing approach I found was the idea of using contextual considerations. For example, in “Semantic measure of plagiarism using a hierarchical graph model,” the authors extracted “topic feature terms” and used them to construct an acyclic (“hierarchical”) graph of these terms. They then used graph analytic techniques to identify the similarity of specific sub-graphs. (Admittedly, I have concerns that this approach may not scale well: my recollection, which I have since confirmed, is that sub-graph isomorphism (“these two subgraphs are equivalent”) is known to be computationally difficult.) One intriguing outcome of this is that it allowed the authors to find plagiarism where the words were changed but the meaning was not, much like MOSS does with computer programs by examining the ASTs rather than the plain text.
The technique described there reminded me of Carmine Guida’s work on plagiarism detection at Georgia Tech. His Master’s thesis was “PLAGO: A system for plagiarism detection and intervention in massive courses.” I was peripherally involved in his work while he was doing it, and aspects of it stand out in my mind today. The tool that he constructed (and that I had an opportunity to use) relied on explicit techniques for identifying common authorship: n-grams (splitting text into sequences of n consecutive words for analysis), stop words (used to detect sentence structure), structural matching (using the stop words), and stemming (using the root stem of a word, rather than the full term). However, one of the most intriguing aspects of his work was stylistic evaluation. When the style of writing changes within a text, it often indicates a switch in authorship. For example, in our discussions about this, he described changes in sentence length as one indicator. Word complexity is another. Thus, techniques like these can identify both structural similarity and stylistic shifts suggesting a change in authorship.
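Two of these ideas, word n-grams and simple stylistic measures, are easy to sketch in Python. This is not PLAGO’s code; it is a minimal illustration under my own assumptions: words are lowercase runs of letters and apostrophes, sentences end at a period, exclamation mark, or question mark, and average word length stands in as a crude proxy for word complexity. The function names and sample passages are invented.

```python
import re


def words(text):
    """Lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def word_ngrams(text, n=3):
    """Set of all runs of n consecutive words in the text."""
    tokens = words(text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(text_a, text_b, n=3):
    """Jaccard overlap of the two texts' n-gram sets (0.0 to 1.0)."""
    grams_a, grams_b = word_ngrams(text_a, n), word_ngrams(text_b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def style_profile(text):
    """Average sentence length (in words) and average word length (in letters)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = words(text)
    return {
        "avg_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
        "avg_word_length": sum(map(len, tokens)) / len(tokens) if tokens else 0.0,
    }


# Two invented passages written in deliberately different styles.
terse = "We ran the test. It failed. We fixed the bug. It passed."
ornate = ("Notwithstanding considerable preliminary experimentation, the "
          "investigators ultimately determined that the instrumentation "
          "required comprehensive recalibration.")

print(style_profile(terse))
print(style_profile(ornate))
print(ngram_overlap(terse, ornate))
```

Running `style_profile` over consecutive paragraphs of a document and looking for abrupt jumps in these numbers is one simple way to flag places where the author may have changed, while a high `ngram_overlap` between two passages points to shared text. Either signal is a prompt for a closer look, not a conclusion.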
Oren Halvani’s Ph.D. thesis also seems relevant: “Practice-Oriented Authorship Verification.” Halvani focuses on authorship verification (AV), which is truly core to the original question that my client asked and that prompted me to go back and look at current work in the field. Dr. Halvani’s emphasis on practical use is directly relevant: “these characteristics can be used to assess the extent to which AV methods are suitable for practical use, regardless of their detection accuracy.” His work explores specific mechanisms for verifying authorship, providing both a theoretical and empirical basis for employing these techniques.
This stroll through current work confirmed what I had previously understood: this is an active area of research. Tools such as Turnitin and Grammarly may be sufficient for my client’s objective, but if not, there is a vast body of recent work showing that it is possible to do this. Of course, if a commercial solution is available, using it is likely the best thing to do. I did note that Turnitin had some reasonably significant limits on the amount of analysis one could do in this regard. From my reading of their website, the text being compared was subject to size limits, and the service required commercial accounts (see their page about iThenticate, which would be used for comparing a small set of non-public texts).
In my use of plagiarism detection tools, I find that I still have to confirm the findings of the tools. Sometimes there are “boilerplate” aspects that the tools do not detect and exclude. Sometimes the tools do not consider things that, to me, support claims of similarity (e.g., in code, the comments and debug strings are often clear examples of code re-use). In the end, having a tool alone is no substitute for having an expert use the tools to find suspect areas and then explain why specific regions represent plagiarism.
In this post, I have discussed some of the background related to document plagiarism. However, plagiarism is not the only way others can co-opt someone’s work. In patents, the author provides a detailed description of an invention in exchange for a limited monopoly on using the invention. In trade secrets, the owner of the secrets does not describe them publicly and relies upon their secrecy. If someone else independently discovers the secret, the owner’s protection is lost.
In my IP expert practice, much of my work is related to trade secrets. There are two common scenarios that I see routinely: (1) someone thinks what they know is new, innovative, and secret, or (2) someone thinks what they learned is well-known and not secret. Trade secrets can be powerful yet fragile, and they are common in technology cases. Many technologies are “well known” yet aren’t known to someone who rediscovers them. This is a discussion for a future post!