Categories
Mathematical Technical Concepts Testimony

Winsorizing: How to Complicate a Simple Idea

Last year I came across this term and had to look it up. I understood the idea as soon as I read the description, but it made me think about the propensity of technologists to invent terminology. Of course, one benefit of creating specialized languages is excluding those who do not understand the language.

For example, the legal profession excels at taking terms and evolving them to have novel or bewitching meanings.

I learned this term when reading the information in a financial trading case. Then, I figured out why it was called “winsorizing.” It was nothing mysterious: the person who first described it in literature germane to that field was surnamed Winsor.

I was recently reminded of this as I read the trial transcript in preparation for my testimony. The attorney called attention to this term and explained its meaning. I realized that an essential part of what I do when demystifying technology is to try and come up with intuitive explanations for what we do and why.

So, winsorizing is the process of taming “outliers” in the data (strictly speaking, extreme values are capped at a less extreme value rather than deleted outright, but the intuition is the same). For example, the attorney spoke to a jury of a dozen people. The example I would have used is that if I had a group composed of the jurors plus Elon Musk, the average net worth of the group would be around the net worth of Elon Musk divided by 13. One data point dominates the group. Thus, the idea of handling “outliers” is to limit points that have a disproportionate effect on the group. If we don’t do that, we might conclude that jurors are rich, which is not likely the case. Dealing with such samples usually leads to better analysis, but why may not be evident until you explain it. Tying it to something the reader (for a report) or audience (jury) can understand is powerful and helpful in making a random term “make sense.”
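To make that concrete, here is a minimal Python sketch using invented net-worth figures (mine, purely for illustration). The plain average is dominated by the single extreme value; the winsorized average, which caps the extreme value at the largest “ordinary” value rather than deleting it, reflects the jurors.

```python
# Illustrative only: the net worth figures below are invented.
jurors = [250_000] * 12          # twelve jurors with modest net worth
outlier = 200_000_000_000        # one extreme data point (an "Elon")

group = jurors + [outlier]
plain_mean = sum(group) / len(group)

# Winsorizing caps extreme values at a chosen threshold instead of deleting
# them; here we simply cap anything above the largest juror value.
cap = max(jurors)
winsorized = [min(x, cap) for x in group]
winsorized_mean = sum(winsorized) / len(winsorized)

print(f"plain mean:      {plain_mean:,.0f}")       # dominated by the outlier
print(f"winsorized mean: {winsorized_mean:,.0f}")  # reflects the jurors
```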

The Elon of Pineapples

Another example I like to use also occurred to me while reading the trial transcript. Covid is still very much an active threat, and the court is testing people to ensure they do not have Covid. There is a presumption that a positive test result means you have Covid, but that is often not the case. This is a general problem with tests.

My usual way of explaining this is to start with a highly accurate test: 99.5%. So, for 1000 people, that test will produce about five false-positive results even if none of those people have the condition.

In this case, the “positive predictive value” of the test is 0%: when the condition doesn’t exist (or doesn’t occur in the population being tested), every positive result is a false positive.

That’s not the real world in which we live, though. The condition does exist but is rare. So, if the prevalence of the condition we are testing for is 0.5%, then 5 people in 1000 actually have it. Our 99.5% accurate test will also flag almost 5 of the 995 people who shouldn’t be positive, and thus we expect to see about 10 people with a positive test result. Now the “positive predictive value” is approximately 50% (half true positives, half false positives).

Why does this matter? The “rapid antigen” test for Covid is between 45% and 97% accurate. It becomes more challenging to evaluate some of this because as the test becomes less accurate, we have to start worrying about false-negative results. For the moment, let’s ignore that because my point is more focused than that. At 97%, we get 30 false-positive results in a group of 1000 people. So, how do we protect against that?

Easy. We omit people who are more likely to be negative, which increases the prevalence of the condition in the tested group. In my earlier example, the prevalence was 0.5%. If the prevalence had been 5%, so that 50 people in 1000 were positive, we would have roughly 5 false positives (4.75, strictly, since a false positive can only occur among the 950 people who are truly negative), and our positive predictive value would be much better: about 90%.

Thus, usually, we’re told to only test “if you have symptoms.” This eliminates the people who are unlikely to have Covid, so for the group that does get tested, the prevalence is higher and the “positive predictive value” of the test is better. In Court, the prevalence will be relatively low, and thus a positive result without symptoms is likely to be a false positive. With 97% accuracy, there are 30 false positives per 1000. If the prevalence is 1%, there are 10 true positives. The “positive predictive value” is 25% (10 true positives out of 40 total positives.)
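For anyone who wants to check the arithmetic, here is a small Python sketch of the simplified model I am using above. It assumes the test catches every true case (perfect sensitivity) and treats “accuracy” as the rate at which truly negative people test negative; the function name and the example numbers are mine, purely for illustration.

```python
def positive_predictive_value(prevalence, accuracy, population=1000):
    """Fraction of positive results that are true positives, assuming the
    test catches every true case and "accuracy" is the specificity."""
    true_positives = prevalence * population
    false_positives = (1 - accuracy) * (1 - prevalence) * population
    return true_positives / (true_positives + false_positives)

# The scenarios discussed above:
print(positive_predictive_value(0.005, 0.995))  # ~0.50: 99.5% test, 0.5% prevalence
print(positive_predictive_value(0.05, 0.995))   # ~0.91: same test, 5% prevalence
print(positive_predictive_value(0.01, 0.97))    # ~0.25: 97% test, 1% prevalence
```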

Explaining things systematically takes time and patience, but it is advantageous because it helps people better understand what we’re doing and why we’re doing it.

Now, I hope my Covid test is not a false positive so I can provide my testimony to the court successfully.

Categories
Uncategorized

Blockchain: Fad or Innovative Technology?

Recently, I wrote about how blockchains work. However, I did not really delve into what they are good for. Maybe the answer is (like war) absolutely nothing!

As is so often the case in the real world, the truth is somewhere in between. Blockchain solves a specific, real problem. However, it has also become a shiny new buzzword and is used to promote some very dubious offerings. There is so much fraud and crime surrounding blockchain-based work that stories about hacks are commonplace, such as cryptojacking – where a website installs malicious software into your web browser to “earn money” by solving blockchain problems.

The technology has expanded rapidly, with numerous uses of blockchain, both benign and malign in appearance. Some actually make a lot of sense to me.

For example, an old friend reached out to talk about his latest venture, indaod.io. This is an intriguing example of a class of uses that I have seen for blockchains that make sense. In this case, it provides an interesting technology for improving the timeshare business. Buying and selling timeshares is difficult for many reasons, several of which can be addressed using a blockchain technology solution. For example, three reasons listed in this online article, 9 Reasons Why Timeshares Are a Bad Investment, can be solved by a solution like my friend’s: it makes ownership more transparent, which can be used to ensure the person selling the timeshare truly owns it; it simplifies temporary rentals; and it encourages a better resale process. Of course, this does not fix all of the issues, but it is an excellent example of a good use.

Another case is one I suggested to add value to a friend’s work on “livestock facial recognition.” Such a system could be combined with a blockchain representing ownership of the given animal, providing provenance (the chain of ownership), ease of transferability, and better tools for preventing theft. Again, this is not something we can do yet, but the technology is far enough along to make sense, and it solves a real problem.

Other uses of blockchain technology are more challenging to evaluate. For example, the Ethereum blockchain technology model is widely used because it provides abilities beyond the basic blockchain idea. A crucial part of that is the idea that it can contain a contract. While businesses routinely use contracts today, such agreements are written in natural language, which can be ambiguous. An Ethereum “smart contract” is written in a language that has a specific definition of its behavior. It enables someone writing a contract to formally validate that the agreement does what is expected. That is surprisingly hard – after all, we have lots of programmers writing many programs, yet we routinely find they struggle to “get them right.”

One specific example of a smart contract that I keep running into is the “non-fungible token.” The term “fungible” might be familiar to you, or perhaps it is slightly vague. Essentially, it captures the idea that something can be replaced with an “equivalent” object. In cooking, many things are fungible: you can substitute margarine for butter, for example. The results aren’t necessarily identical. Some things are not substitutable. For example, a unique cultural artifact, such as the Mona Lisa, has no substitute. Thus, a “non-fungible token” is a “token” (an entry on the blockchain) that represents verifiable ownership of something unique. This is the opposite of cryptocurrency, which is fungible: few of us worry about the specific currency we have – if it is a £20 note, it is likely just as good as any other combination of currency adding up to the same amount. Of course, sometimes specific currency units become valuable for some reason, such as when they are misprinted in a way that makes them unique and interesting.

Personally, I have mixed feelings about NFTs. The schemes I suggested earlier with cows and timeshares make sense. There are blockchain-based land title registries, which I think are a great use of distributed ledger technology. The challenge with the generic term “NFT” is that you need to understand what the NFT represents to determine whether it has value. For example, I suggested NFTs that could represent ownership of a real thing, but many of the NFTs being marketed represent a reference to a virtual object. If the object is itself part of the NFT, say a digital image, and the ownership of the image is transferred, it might have value. Then again, that signed first edition of The Shining has value as well, but it does not give you the right to do anything beyond owning that one copy. In other words, the right to create copies or derivatives need not be what was sold as part of the NFT. Thus, if you buy an NFT, you might find yourself asking if you bought a usable template for making bags of poo, or just a bag of poo itself, or a URL that points to a picture that someone made of a bag of poo that anyone else can use or access. The value of that is something you can judge on your own.

From my perspective, the interesting aspect of all this is trying to break things down and explain the process: what is a blockchain, what is a smart contract, what is a Turing Complete language like Solidity, how do these get used, etc. While potentially complicated, I have found most people can understand the basics. From that basic model, it’s then possible to explore some specific issues, whether it is for crypto-currency, smart contracts, NFTs, or any other uses that people keep finding for blockchain.

I expect that as the interest in NFTs continues to expand, I’ll have more opportunities to put my skills to good use, explaining how these technologies work and applying that to the legal cases that continue to arise around them.

Categories
Block chain

Blockchain as Understood by an Expert

I have been working in distributed systems for decades. The fundamental problem when we have multiple active sources of information is finding a way to achieve consensus – agreement about what has happened. The simplest way to reach consensus is to have one decision-maker. If there is only a single source of “truth” as to what happened, anyone who wants to know simply queries that source.

However, in the real world, disparate decision-makers often control individual resources. The challenge is to ensure that all decision-makers involved in a given event agree on the outcome. In computer science, databases were the first systems that had to face this problem.

A single database can employ techniques that ensure the consistency of events. What this means is that even if multiple pieces of information within the database need to be modified to carry out a particular operation, it is possible to ensure that, even in the face of failures, the information within the database is consistent and thus definitive.

A typical model I have used when teaching this basic concept is a bank machine that dispenses cash. If you walk up to that machine, insert your card, enter your PIN, and ask to withdraw $20, multiple distinct steps must all happen, or none of them happen:

  • You need to authenticate yourself (card + pin)
  • Your account needs to be debited $20
  • You need to be given $20 in currency

If anything fails in any of these steps, nothing should change: your card is returned to you, your account is not debited, and you don’t receive your cash. Such an “in balance” system is said to be consistent.

Let’s suppose that you use the ATM of a different bank than where your money is stored. Now we have distinct actors:

  • You, with your card and pin
  • The bank machine you are using
  • The bank that owns the bank machine you are using
  • The network that coordinates between the bank that owns the bank machine you are using and your bank
  • Your bank, notably your account with that bank.

Everything needs to work correctly, but now you have distinct actors. Each bank trusts the network and has presumably been vetted, so the banks and the network are all trusted. So, when the bank machine you are using verifies that you have the card and know the PIN, however that is done, it is enough for the network and your bank to trust that you are who you say you are. Then the steps to dispense your funds are the same. You don’t get any cash if anything goes wrong, and your account isn’t debited.

I chose a bank as the example because banks routinely use ledgers – a list of transactions that move funds between accounts – or into your hand. Electronic ledgers are a bit different from paper ledgers in that the latter are more difficult to change after the fact, since changes often leave marks. Indeed, the best practice is not to change an incorrect entry but rather to add another transaction to the ledger to correct the previous error. So, for example, we might void a transaction by posting the inverse transaction to the ledger.

How can we know when an electronic ledger has been modified? First, we could record it in something difficult to change after the fact, such as write-once media. Another approach we can use is to break our ledger up into sets of transactions. Logically, you can think of this as being like a page within a ledger. For a computer, we can then compute a “checksum” over the values within that ledger. I won’t bore you with the details, but it is possible to calculate such checksums to make it very difficult to change the records within the set and still end up with the same checksum. So, one way to protect an electronic ledger is to compute an additional value, called the “hash” or “checksum,” that depends upon all of the ledger entries within a given set. If we publish the checksums in some fashion, we now have a way to know that the ledger has not been modified after the point the checksum has been published.
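To make the “checksum over a page of the ledger” idea concrete, here is a minimal Python sketch using SHA-256, one common cryptographic hash (the transactions are invented for illustration). Change any entry, even slightly, and the checksum bears no resemblance to the original.

```python
import hashlib
import json

# A "page" of ledger entries (illustrative transactions, invented here).
page = [
    {"from": "alice", "to": "bob",   "amount": 20},
    {"from": "bob",   "to": "carol", "amount": 5},
]

def checksum(entries):
    """Serialize the page deterministically and hash it."""
    data = json.dumps(entries, sort_keys=True).encode("utf-8")
    return hashlib.sha256(data).hexdigest()

print(checksum(page))
page[0]["amount"] = 21
print(checksum(page))   # no resemblance to the previous value
```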

A blockchain adds one more bit of information to the ledger entries: it also incorporates the checksum of the prior set of ledger entries. In other words, if we think of our ledger as being a series of pages, the first entry on each new page happens to be the checksum of the previous page. Then we compute a checksum for the new page with all the transactions. This “chains together” these sets of transactions. Now, to change the value of an older ledger page requires changing every page after it. So we actually only need to publish the most recent checksum to verify the entire chain.

This is what creates a “blockchain.” A “block” consists of:

  • The checksum of the previous block
  • A set of transactions;
  • Any other data we want in the block;

From this, we can compute the checksum of the current block. The key to “preserving” this “blockchain” is publishing those checksums. That is (more or less) how blockchains like Bitcoin and Ethereum function. They have some additional steps, but they work by publishing the ledger pages with their checksums – the blocks that make up the chain. When enough “nodes” (computers) in the network accept a new block, it becomes “confirmed” and challenging to change. Since it is easy to compute those checksums, the blocks are easy to confirm. Changing an existing block on this chain does not work because nodes do not permit changing history. Anyone with the blockchain can confirm it. The other nodes will ignore someone that attempts to change it since the changes won’t match the published information.
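And here is a minimal sketch of the chaining idea itself, again in Python with invented transactions: each block embeds the previous block’s checksum, so publishing only the most recent checksum is enough to detect any attempt to rewrite an earlier page.

```python
import hashlib
import json

def block_checksum(block):
    """Checksum over the previous block's checksum plus this block's transactions."""
    data = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(data).hexdigest()

def make_chain(pages):
    chain, prev = [], "0" * 64                # conventional "genesis" value
    for txs in pages:
        block = {"prev": prev, "txs": txs}    # each block embeds the prior checksum
        prev = block_checksum(block)          # this is the value we would "publish"
        chain.append(block)
    return chain, prev                        # prev is now the latest checksum

def verify(chain, published_latest):
    prev = "0" * 64
    for block in chain:
        if block["prev"] != prev:             # link to the previous page is broken
            return False
        prev = block_checksum(block)
    return prev == published_latest           # matches the published value?

pages = [[{"from": "a", "to": "b", "amount": 1}],
         [{"from": "b", "to": "c", "amount": 2}]]
chain, latest = make_chain(pages)
print(verify(chain, latest))                  # True
chain[0]["txs"][0]["amount"] = 100            # try to rewrite history
print(verify(chain, latest))                  # False
```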

Thus, the real benefit of using a blockchain is that it provides a way to reach consensus and then confirm that consensus that is resilient in the face of bad actors. The simple implementations of blockchain generally require at least a majority of the participants to collude in order to rewrite the blockchain. On top of that, the cost of re-computing the blockchain, which is required to “change the past,” goes up as the blockchain proceeds.

There is a fair bit of hype around blockchain; some of it is deserved. In future posts, I will discuss some of those uses in more detail, with an eye towards how I consider them as an expert.

Categories
Copyrights

Effective Mechanisms for Authorship Determination

Recently one of my clients contacted me and asked about what I know concerning “determining authorship.” I pointed out that I have a published paper on the subject (Plagiarism Reduction). An integral part of doing that work was learning how the tools we use to detect plagiarism actually work. But, of course, my work was about code plagiarism (a common problem in education and in my practice) and not prose plagiarism. The case my client asked about was an attempt to determine the authorship of a written work. The facts of the case involve a collaboration between a group of authors that is being unravelled (by lawyers, of course). Thus the question arose whether there is a way to automatically identify which sections were written by which authors.

Thus, I agreed to write about this area to provide some background and point to potential tools. First, I want to touch on the fundamental reason this becomes a legal issue: any work of authorship, whether it is for entertainment, business, course work, or a computer program, comes with a significant right: Copyright. While the laws about copyright vary somewhat from country to country, overall, there is pretty broad agreement about these rights and their international interpretation. This is based upon an agreement (the “Berne Convention”) first signed in 1886, though amended a number of times since. Since most of my work is in the United States, I am most familiar with Copyright law from a U.S. perspective. Copyright protects a specific expression. In that regard, it is a narrow intellectual property right. Thus, if we each write a piece of computer software that does the same thing, we would each have a copyright on our own implementation. If part of the code turns out to be the same, that could be because there’s only one way to express something, or it could be because one of us copied it from the other.

How do we tell when “identical” copyright material is, in fact, the same? Suppose I write a computer program in FORTRAN. I then use a special piece of software called a compiler that translates the instructions in the programming language into instructions that the processor in the computer understands. The compiled version of my program is a derivative work. The rights of the original author are not eliminated in the derivation process. This is important because software, in particular, is often re-used. Indeed, re-used software is generally better than new software because the latter is more likely to contain errors (“bugs”) that will need to be found and fixed.

Thus, we usually consider the compiled version of the computer program to be “the same.” Similarly, a small change to a literary work does not eliminate the copyright; we’d consider the two versions to be “the same” even though they are not strictly identical. So, when we’re trying to figure out who wrote something, the question becomes how to determine that. A wealth of techniques has been developed over the years to do authorship analysis. These are motivated by historical concerns (e.g., “who wrote this anonymous text?”), legal concerns (“is this the same code?”), and authorship questions (“did the student submit work they copied from elsewhere?”).

Commercial services, such as Turnitin and Grammarly, have productized some of the techniques that have been developed over the years and can draw upon a plethora of public and private sources, so they can take a written work and map it to other examples they have seen previously. For instance, in the CS 6200 course at Georgia Tech, which is the one where I implemented the plagiarism reduction intervention that was effective, we used a well-known tool called MOSS (“Measure of Software Similarity”). MOSS uses an interesting technique: it compares the program’s abstract syntax tree against other implementations. Doing this strips away the elements that are not germane to the program and instead focuses on what the program does as captured by the signature of the abstract syntax tree (AST). Once a piece of software source code is “big enough,” it becomes possible to say that matches in the ASTs are unlikely to be coincidental. So, small bits of code can be identical, but one cannot conclude much. However, when we have 100+ of those small bits of code, and it turns out that 50% of the code has the same AST, then we have strong evidence of shared heritage.
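As a toy illustration of why structure-based comparison catches copying that simple text comparison misses (this is my own sketch of the idea, not MOSS itself), here is a short Python example using the standard ast module: two functions that differ in every identifier and in formatting still produce identical structure once the names are stripped away.

```python
import ast

# Two functions that differ only in naming and formatting.
src_a = """
def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

src_b = """
def sum_items(items):
    acc = 0
    for item in items:
        acc += item
    return acc
"""

def structure(source):
    """Dump the AST with identifiers stripped, leaving only the structure."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        for field in ("id", "name", "arg"):
            if hasattr(node, field):
                setattr(node, field, "_")
    return ast.dump(tree, include_attributes=False)

print(structure(src_a) == structure(src_b))   # True: same structure, different names
```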

One approach I have seen in more traditional literary works is similar to this AST style comparison, but it is not the only technique. Since I periodically review the literature on this topic, I took this opportunity to highlight some interesting papers that I found in my most recent search. Increasingly, I see statistical machine learning techniques used to facilitate rapid, automated detection.

In “Authorship Identification on Limited Samplings,” the authors refined machine learning (ML) techniques to find those that are “the most efficient method of authorship identification using the least amount of samples.” They do an excellent job of summarizing the ML techniques in use today: Naive Bayes, SVMs, and neural networks. Those go beyond the scope of what I want to write about today, but they are effective and frequently used techniques for finding patterns in large data sets, such as written works.

In “A Framework for Authorial Clustering of Shorter Texts in Latent Semantic Spaces,” the authors look at using unsupervised learning (a technique in machine learning that does not rely upon prior data for “training”) on document clusters to identify similarity for short pieces of text (around a paragraph long). This technique could help take a single document and split it into sections and then apply these techniques to the collection. One would expect the works of multiple authors to form “clusters” due to the similarity between different paragraphs attributable to each author. This work is intriguing though the authors do caution: “[t]horough experimentation with standard metrics indicates that there still remains an ample room for improvement for authorial clustering, especially with shorter texts.” From my reading, I suspect their techniques would be appropriate for a small number of authors. This could then be applied either to the case my client first asked about (joint authors in a large document) or an educational setting (to identify “patchworking” of a few different documents together into a single document submitted by a student.)

Another intriguing approach I found was the idea of using contextual considerations. For example, in “Semantic measure of plagiarism using a hierarchical graph model,” the authors extracted “topic feature terms” and used them to construct an acyclic graph (“hierarchical”) of these terms. They then used graph analytic techniques to identify the similarity of specific sub-graphs [admittedly, I have concerns that this approach may not scale well; my recollection (since confirmed) is that sub-graph isomorphism (“these two subgraphs are equivalent”) is known to be computationally hard]. One intriguing outcome of this is that it allowed the authors to find plagiarism where the words were changed, but the meaning was not – much like MOSS does with computer programs by examining the ASTs rather than the plain text.

This described technique reminded me of Carmine Guida’s work at Georgia Tech on plagiarism detection. His Master’s Thesis was “PLAGO: A system for plagiarism detection and intervention in massive courses.” I was peripherally involved in his work while he was doing it, and aspects of it stand out in my mind today. The tool that he constructed (and I had an opportunity to use) employed explicit techniques for identifying common authorship: n-grams (a technique for splitting text into sets of n words for analysis), stop words (used to detect sentence structure), structural matching (using the stop words), and stemming (using the root stem of a word, rather than the full term). However, one of the most intriguing aspects of his work was stylistic evaluation. When the style of writing changes within a text, it often indicates a switch in authorship. In our discussions, for example, he talked about changes in sentence length being one indicator; word complexity is another. Thus, techniques like this can identify both structural similarity and stylistic shifts suggesting a change in authorship.
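To give a flavor of what such features look like in practice, here is a crude Python sketch (my own illustration, far simpler than anything PLAGO does): it computes average sentence length and word 3-gram overlap, two of the kinds of signals mentioned above.

```python
import re
from collections import Counter

def style_features(text):
    """Crude stylistic fingerprint: average sentence length plus word 3-grams.
    Real systems use far richer feature sets."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    avg_sentence_len = len(words) / max(len(sentences), 1)
    trigrams = Counter(zip(words, words[1:], words[2:]))
    return avg_sentence_len, trigrams

def trigram_overlap(a, b):
    """Fraction of text a's 3-grams that also appear in text b."""
    _, ta = style_features(a)
    _, tb = style_features(b)
    shared = sum((ta & tb).values())
    return shared / max(sum(ta.values()), 1)

# Any two passages of interest would do here; these are invented examples.
para_1 = "The cat sat on the mat. The cat sat on the rug."
para_2 = "The cat sat on the mat, which the dog then claimed."
print(style_features(para_1)[0], style_features(para_2)[0])
print(trigram_overlap(para_1, para_2))
```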

Oren Halvani’s Ph.D. thesis also seems relevant: “Practice-Oriented Authorship Verification.” Halvani focuses on authorship verification, which is truly core to the original question my client asked and which prompted me to go back and look at current work in the field. Dr. Halvani’s work is directly relevant: “these characteristics can be used to assess the extent to which AV methods are suitable for practical use, regardless of their detection accuracy.” His work explores specific mechanisms for verifying authorship, providing both a theoretical and empirical basis for employing these techniques.

This stroll through current work confirmed what I already knew: this is an active area of research. Tools such as Turnitin and Grammarly may be sufficient for my client’s objective, but if not, there is a vast body of recent work that shows it is possible to do this. Of course, if there is a commercial solution available that can be used, it is likely the best thing to do. I did note that Turnitin has some reasonably significant limits on the amount of analysis one can do in this regard. From my reading of their website, the texts being compared are subject to size limits and require commercial accounts (see their page about iThenticate, which would be used for comparing a small set of non-public texts).

In my use of plagiarism detection tools, I find that I still have to confirm the findings of the tools. Sometimes there are “boilerplate” aspects that the tools do not detect and exclude. Sometimes the tools do not consider things that, to me, support claims of similarity (e.g., in code, the comments and debug strings are often clear examples of code re-use). In the end, having a tool alone is not sufficient; an expert still needs to use the tools to find suspect areas and then explain why specific regions represent plagiarism.

In this post, I have discussed some of the background related to document plagiarism. However, copyright is not the only way such information is protected. In patents, the inventor provides a detailed description of an invention in exchange for a limited monopoly on using the invention. In trade secrets, the owner of the secrets does not describe them publicly and relies upon their secrecy. If someone else independently discovers the secret, the protection is lost.

In my IP expert practice, much of my work is related to trade secrets. There are two common scenarios that I see routinely: (1) someone thinks what they know is new, innovative, and a secret, or (2) someone thinks what they learned is well known and not a secret. Trade secrets can be powerful yet fragile, and they are common in technology cases. Many technologies are “well known” yet aren’t known to someone who rediscovers them. This is a discussion for a future post!

Categories
Uncategorized

Improving Patent Family Value

As an inventor, one of the things I did not appreciate was how to maximize the value of a patent family. I suspect that one reason for this is that the attorney with whom I did much of my work focused on drafting the patent and nursing it through the prosecution process (note: “prosecution” in this use means “getting it through the patent process,” not “enforcing it”).

Since that time, I have worked with litigators and patent brokers. Litigators taught me that patent owners can use one trick to “keep the patent prosecution alive,” which means that the patent owner continues to submit new claims against the original specification. From a litigation perspective, the patent owner can file new claims using the original specification and, if successful, have a patent that can then be enforced against potential infringers. Brokers taught me that a patent is worth much more if a potential buyer can still file new claims on the original specification, because that makes the patent family far more valuable in potential litigation.

Multiple patents against the same specification share a common priority date and a common expiration date. Usually, multiple patents against the same specification are considered a “family” of patents.

One good example of this is a well-known patent owned by Leland Stanford Jr. University (most people just call it “Stanford”). This is US Patent 6,285,999. It is a seminal patent because it provides the original description (“teaching” in patent parlance) of ranking web pages based upon how many other web pages reference them. The algorithm is commonly used in my area of computer science (“systems”) and is referred to as PageRank. In addition, PageRank is well known enough that it has its own Wikipedia page.

On January 10, 1997, the original specification was filed as provisional application US3520597P. Thus, this is the “priority date” of the subsequent patent applications because they are all based upon the same common specification.

If you review the history of this patent, the first actual application was filed on January 9, 1998, the last day the provisional application was valid (that period of validity was one year; as far as I know, it still is.) The patent (6,285,999) was granted on September 4, 2001. The “Notice of Allowance” from the patent office was issued on April 23, 2001. The patent issue fee was paid on July 11, 2001. The second application was filed on July 2, 2001.

Because the second application was filed before the patent was issued, it “continued” the application process against the original specification. This process was repeated ten additional times. Thus, 12 different applications were filed against the same specification. The most recent application was awarded a patent on May 13, 2014 (8,725,726).

If there is no active continuation application on file with the USPTO when a patent issues, that specification is complete. It then becomes part of the “prior art,” and no further patent claims can be filed against that original specification.

Bottom line? If you want to maximize the profit potential of your patents as an inventor, it is good to keep an application open, as it allows you (or a subsequent owner of the patent) to file additional applications focused on specific claims that can then be used to protect your invention.

I realize some people may not be familiar with PageRank. However, this algorithm is the basis of the technology that launched Google. Larry Page, the inventor, was a graduate student at Stanford at the time. Thus, this is likely one of the most valuable patents ever granted.

Categories
Uncategorized

Meta-Data

Much of my work relates to meta-data, that is, “data about data.” For example, the name, size, and creation date of a given file are all forms of meta-data. One of the areas of computer technology I have been working in for decades is storage, particularly the part of storage that converts physical storage (local or remote) into logical storage.

Usually, we call the software that converts physical storage into logical storage a file system. One significant benefit of using file systems is that they provide a (mostly) uniform model for accessing “unstructured data” (files).

Traditionally, we organize files into directories. Directories, in turn, can be contained within other directories. This is then presented to users as a hierarchical information tree, starting with a “root” and then descending, with each directory containing more directories and other files.

I have already mentioned a few classes of information maintained by file systems: name, size, creation date. Many file systems also provide additional information (meta-data) about files, including:

  • Who can access this file?
  • When was the file last modified (note that this is distinct from when it was created)?
  • When was the file last accessed (often without being modified)?
  • Can the file be written (the “read-only” bit is quite common)?
  • Is the file encrypted?
  • Is the file compressed?
  • Is the file stored locally?
  • Are there special tags (“extended attributes”) applied to the file?

Not all file systems support all these different meta-data elements. For example, some file systems have limitations, such as timestamps that are only accurate to the nearest few seconds, and it is typical to update the “last access” time at most once an hour (or even less often). This is because there is a cost associated with changing that information that can have a measurable impact on the file system’s performance.
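Here is a minimal Python sketch of reading this kind of file system meta-data; it inspects its own script file, so the path is just a convenient stand-in for any file of interest.

```python
import os
import stat
import time

path = __file__                 # any file path would do; the script inspects itself
info = os.stat(path)

print("size:         ", info.st_size, "bytes")
print("last modified:", time.ctime(info.st_mtime))
print("last accessed:", time.ctime(info.st_atime))
print("last changed: ", time.ctime(info.st_ctime))   # meta-data change time on Unix
print("read-only:    ", not (info.st_mode & stat.S_IWUSR))
```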

File systems are not the only place where we find meta-data. For example, when you take a photograph with your camera or your phone, it usually stores the image in a standard format such as JPEG and embeds meta-data within the image file. For images, this is known as the Exchangeable Image File Format (EXIF). The information here, which has changed over time and may not necessarily be recorded (it depends upon the device taking the photo, for example), includes timestamps, camera settings, possibly a thumbnail, copyright information, and geo-location data.
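As an illustration, here is a short Python sketch that dumps EXIF tags. It assumes the Pillow library is installed, and “photo.jpg” is a placeholder for any JPEG produced by a camera or phone.

```python
# Assumes: pip install Pillow, and a camera-produced JPEG named photo.jpg
from PIL import Image, ExifTags

img = Image.open("photo.jpg")
exif = img.getexif()

for tag_id, value in exif.items():
    name = ExifTags.TAGS.get(tag_id, tag_id)   # translate numeric tag ids to names
    print(f"{name}: {value}")
```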

Analyzing and understanding meta-data can be directly helpful when it comes to looking at image files. Ironically, when the meta-data for an image is consistent, you can’t tell if it has been tampered with. Yet, when the meta-data for an image is inconsistent, you can reasonably conclude that the image has been modified in some way.

For example, in a case that came up for me a couple of years back, I was asked to review another expert’s report. That expert stated they had one copy of the file extracted from a hard disk drive and another copy extracted from a compact flash device. The meta-data varied between the two files.

The version of the image on the hard disk showed:

  • File system modification was November 10, 2005, 20:25:04
  • EXIF creation was November 10, 2005, 20:25:04
  • EXIF CreatorTool was Photoshop Adobe Elements 3.0
  • EXIF Model was Canon EOS 20D

The version of the image on the compact flash (CF) device showed timestamps of:

  • File system modification was November 10, 2005, 20:25:04
  • File system creation was November 10, 2005, 20:25:04

The expert report did not indicate what the EXIF data of the original file showed. However, what was clear is that the image had been loaded into Adobe Elements 3.0 (which, interestingly enough, was distributed with the Canon EOS 20D). While I did not have a Canon EOS 20D to verify (if it had been my report, I would have suggested doing so) and thus could not confirm that it didn’t write “Photoshop Adobe Elements 3.0” into the EXIF meta-data, I did not think that was likely (and the other expert stated it did not).

So, I was able to conclude that “the meta-data on the image is consistent with it being modified.” Why?

  • The name of the application was written into the image. Thus, at a minimum, the image’s meta-data was modified, even if the actual contents were not modified (remember, I didn’t have the original images; I was just looking at meta-data).
  • The timestamps were identical between the CF copy and the hard drive copy. When an application modifies a file, it usually writes the changes to a new copy and then renames the new copy over the old one. In that case, the timestamps would normally not be set back to the original timestamps. But, of course, the application might do that. So, again, if I had been writing the expert report, I’d have tested to make sure Elements 3.0 worked as I expected it would. Since the original expert stated it did, I was able to concur with that expert’s analysis.
  • If an application overwrites the existing file, the creation timestamp and the modification timestamp will differ.

EXIF meta-data can be modified – I use Photoshop to look at and modify meta-data sometimes (e.g., to add copyright or strip out geo-location information before I post the photo). Still, the file system wouldn’t modify it.

File system meta-data can be modified – an application can invoke operating system calls and change those timestamps, but doing so in a way that remains consistent with everything else on the system is harder than one might imagine.

I decided to check what information Photoshop shows me now. It uses the newer (and more general/extensible) XMP meta-data format:

XMP Meta-data from a PNG file that I created

And here are the file system timestamps for that file:

Native timestamp information from the system where the data is stored

Notice that the access timestamp has been updated (because I read it with Photoshop), but the modify and change times have not been updated. Since this was a Linux system, I had to dig a bit more to extract the creation timestamp (the Ext4 file system stores the creation timestamp, but most utilities use an older interface that does not make it available).

Extracting the creation timestamp on my Linux system

As you can see, the other timestamps also match, and the original creation time (“crtime” versus “change time,” which is shown as “ctime”) is the same as the modified time.

Thus, I know that the application created and wrote the file in quick succession – notice that the creation time and modified time differ only slightly (the difference is in the nanoseconds, so it is too small to show up in a display that is accurate to the nearest second). The creation time is slightly earlier than the modified time, and the change time is a second later. This is precisely what I’d expect to see (a minimal sketch reproducing this sequence follows the list below):

  • The application creates a new file with a temporary name. This sets the creation timestamp of the file.
  • The application writes data to the new file. This sets the modified timestamp of the file.
  • The application renames the temporarily named file to the final name. This is a change to the file meta-data, which updates the change time. Since the file contents did not change, the modified timestamp doesn’t change. The access timestamp is today, as I opened the file to look at its meta-data.
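Here is a minimal Python sketch that reproduces the create/write/rename sequence described above and then prints the resulting timestamps. The file names are mine, purely for illustration; on a Linux system the rename updates the change time while leaving the modified time alone.

```python
import os
import time

final_name = "report.dat"              # illustrative names, not from any case
temp_name = final_name + ".tmp"

# 1. Create a new file under a temporary name (sets the creation timestamp).
# 2. Write the data (sets the modified timestamp).
with open(temp_name, "w") as f:
    f.write("new contents\n")

time.sleep(1)

# 3. Rename the temporary file to the final name: a meta-data change that
#    updates the change time (ctime) but not the modified time (mtime).
os.replace(temp_name, final_name)

info = os.stat(final_name)
print("modified:", time.ctime(info.st_mtime))
print("changed: ", time.ctime(info.st_ctime))   # about a second later
```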

Meta-data tells a story; it isn’t necessarily inviolable, but modifying it in a way that remains consistent with “how things work” is more complicated than one might imagine. As our computer systems have become more sophisticated, our mechanisms for verifying meta-data have similarly improved. For example, it used to be that the “state of the art” in signing a document was to sign it physically. If you were paranoid, you might initial each page, which made it more challenging to modify. Today, you can digitally sign a PDF document; that signature covers the document’s content and includes a timestamp along with a unique signature associated with the signing person. At present, faking such a digital signature is out of reach, and modifying the actual document is impractical. That’s the power of combining meta-data with digital signatures.

Categories
Claims Litigation Patents

US7,069,546

Generic framework for embedded software development

Each morning I receive an e-mail detailing the patent legal cases that have been filed from the fine folks at RPX Corporation, as part of their RPX Insight service. I generally don’t have time to review all the cases, but I often pick several complaints to read. On the morning I write this (June 28, 2019), I noticed a lawsuit against Microsoft. Since I am involved in the technology space, cases involving technology companies pique my interest.

The more I read of the basic description of the case – the use of modular software development practices in embedded systems – the more it piqued my curiosity. After all, I did quite a lot of work with Windows NT starting before it was released in 1993, and one of its key features was its use in embedded systems. Since I had intended to describe how I do patent analysis, I thought this would be a good one with which to start, as I have some expertise in this area, having co-authored Windows NT Device Driver Development, a book that is still commonly available and still relevant today.

The place to start is at the actual original patent: this is the definitive document on what this patent covers. It includes the specification, which is text and art that explain the problem as well as the general intent of the techniques of the patent itself. In my experience, having been involved in about a dozen patent applications and patent prosecutions, the specification is generally drafted with the input of both the inventor(s) and the patent attorney. This isn’t a requirement – someone familiar with the structure and drafting of patent applications as well as the technology could certainly do both. In my experience those skill sets seldom show up in a single person.

The heart of the patent, however, is the claims. These describe what is claimed by the inventors. The specification can explain existing technologies or alternative solutions, but those are background, not the invention. Thus, the claims set forth the specifics of the invention claimed by the patent.

Patents consist of independent claims and dependent claims. The independent claims stand on their own, while dependent claims rely upon an independent claim or another dependent claim. For example, dependent claims might specify particular details of a specific embodiment (implementation or realization) of the patent relative to some prior claim of the patent.

So I start by looking at the claims. If you have not read a patent before, you might find the format of it peculiar; that’s because it is shaped by many years of legal precedent, custom, and so forth. As the law regarding patents evolves, so does the language used in patents, so over time you will notice a shift in how patents are drafted.

This patent has two independent claims and 43 dependent claims. Claim 1 is an independent claim:

A method for producing embedded software, comprising:
providing one or more generic application handler programs, each such program comprising computer program code for performing generic application functions common to multiple types of hardware modules used in a communication system;
generating specific application handler code to associate the generic application functions with specific functions of a device driver for at least one of the types of the hardware modules,
wherein generating the specific application handler code comprises defining a specific element in the specific application handler code to be handled by one of the generic application functions for the at least one of the types of the hardware modules, and registering one of the specific functions of the device driver for use in handing the defined specific element; and
compiling the generic application handler programs together with the specific application handler code to produce machine readable code to be executed by an embedded processor in the at least one of the types of the hardware modules.

Ideally, it is good if someone familiar with the field of the patent can read the claims and understand the invention. In reality, it is common to find that the patent uses language which is not immediately obvious. The general rule then is that if the specification defines the terms, then that is the meaning we should give that term. If the specification does not define the term, then we should give it the ordinary meaning that someone who knows the field would ascribe to it (“one of ordinary skill in the art”).

One of the challenges of patent litigation can actually be figuring out what the patent means. This often involves trying to understand the terminology in its historical context. So, if you’re trying to prove you understand what some term that “everybody knew” actually meant, you will find yourself looking through old reference materials: books, dictionaries, devices, you name it. An easy mistake to make is to assume the current meaning is, in fact, the meaning at the time the patent was filed. Many terms do retain their meaning, but some terms shift over time as the field of the invention evolves.

So what does this mean? The term “embedded software” to me means “software that operates on a device and is inherently involved in the functionality of that device”. We use embedded software all the time – it is in numerous things, including televisions, refrigerators, cars, wearable devices, etc. Thus, this is not a particularly limiting term to use at this point. Operating systems software is typically one of the essential parts of embedded systems.

So to my first read, “A method for producing embedded software” seems like a broad sweeping statement. If you have an embedded device, you need to be able to produce the embedded software, assuming there is a computational device of some sort involved (e.g., a central processing unit).

Let’s move on then: “providing one or more generic application handler programs”. Here is where the terminology starts to narrow down what this invention is about. I’m not quite sure yet what a “generic application handler program” is (remember, I haven’t read the specification yet – I’m trying to understand the patent from the claims). My mental model at this point is that we have some programs that provide a range of functions. The term generic suggests to me that there is some mechanism for providing specialization – the idea of using a “plug-in” model. This is surprisingly common in systems software; in fact it is a common technique in software in general. The first time you sit down to write the software to talk to the very first printer, you really don’t know what that software should look like. By the time you are writing the same software for your 5th or 10th printer, you’ll have noticed that there is quite a lot of commonality about how they operate, though there are typically some sections that are specific to the device. Thus, you can extract “common functionality” and separate it from “device specific functionality”. To me, this would mesh with the term generic.

I must admit, I’m still a bit vague on the significance of “generic application handler programs” because it seems broad. But let’s continue further, perhaps things will become clearer from context: “each such program comprising computer program code for performing generic application functions common to multiple types of hardware modules used in a communication system”. Well, programs in my area do correspond to computer program code for performing… something. This is one of those points where I start thinking about what one means by “computer program code”. I suspect it won’t matter, but this is one of the areas that can quickly become fuzzy. I’ll save that for a more theoretical blog post at some point.

Anything that consists of a series of instructions that can be carried out by a computer likely falls into the category of “computer program code”; having said that, I suspect in this case they mean “binary code that corresponds to the instruction set of the computer”. I will run with that for now. That leads to “… for performing generic application functions…” This use of generic still leaves me feeling this is quite broad. The next bit helps narrow it a bit: “… to multiple types of hardware modules…” This is why I was interested. Hardware! Now we are talking about an area in which I’m steeped in experience, having written my first Ethernet driver back in 1987. If you are not familiar with Ethernet (IEEE 802.3), that is one of the earliest standardized computer communications mechanisms. And then the last bit “… used in a communication system;” I’m feeling warm and fuzzy now, because “communication system” to me normally means “network” and, as I observed earlier, that’s an area with which I was familiar well before 2001. In fact, one of the earliest examples of noticing commonality of hardware devices was networking. Rather than ask every network device vendor to write a separate device driver for their device, we could construct a common code module that implemented the things that were common across devices (hence making it generic) and leave the much smaller bits of interacting directly with the hardware, which often varies according to the specific hardware, to a developer familiar with that hardware. For programs using the network, this is also beneficial because it means we don’t need to rewrite those programs to use different network hardware. One early example of this was the Network Device Interface Specification (NDIS) developed by Microsoft and 3COM and included in various Windows versions, including Windows NT.

At least the picture is getting clearer at this point. Let’s see where we go from here: “… generating specific application handler code to associate the generic application functions with specific functions of a device driver for at least one of the types of the hardware modules”. I think this sounds like the invention wants to generate the code so that the developer can “fill in the blanks”. This would ease the burden on the developer building the hardware specific code – much like giving them a template with comments that say “add code here to get your device to do X”. This sounds like meta-programming to me. The idea of doing this is certainly not new, but perhaps it was novel in 2001. I’d need to do more research to find out.

Fortunately, a model of what this patent means is forming for me.

Next we have “wherein generating the specific application handler code comprises defining a specific element in the specific application handler code to be handled by one of the generic application functions for the at least one of the types of the hardware modules”. Remember when I was talking about the language of patents? This is a great example of it. The goal here, as I understand it, is to try and be inclusive with the language. If we start with a communications device, we usually start simply: we have a send operation and a receive operation. But if I have a communications device that only wants to receive, perhaps because it is only monitoring network traffic, I wouldn’t need a send operation at all. Rather than list all the possible operations and the various permutations, the intent here is to capture the idea that “we have a list of ‘generic functions’ that might be implemented, but don’t all need to be implemented”.

Another example? You don’t need any support for writing to a CD-ROM. It’s a waste of time. If you look at the Windows CDFS file systems code for example, you will note that it has routines for handling reading from a CD-ROM, but not writing to a CD-ROM. Makes sense. So the intent of this language is to capture the broad range of possibilities that occur in the device space. While that CDFS code is fairly recent, it is not remarkably different than the version that Microsoft publicly distributed in 1994 (the first time I saw it).

One benefit of breaking this wall of text up into smaller bite-sized chunks is that we have an opportunity to breathe because the sentence continues: “wherein generating the specific application handler code comprises defining a specific element in the specific application handler code to be handled by one of the generic application functions for the at least one of the types of the hardware modules”. I admit, I had to read (and re-read) this several times to try and make sense of it. So I’ll try to translate it back: we have some code that we need to talk to the hardware. We have to have some way to map the relevant parts of that hardware specific code to the generic operation being performed. In other words, to return to my hypothetical network device, I have to have some way to map from the generic send operation to the hardware specific send operation bits – after all, I can’t call the hardware specific receive code, since it can’t send a message. That makes sense to me.

We’re making progress! Next fragment: “and registering one of the specific functions of the device driver for use in handing the defined specific element”. I’m quite familiar with the concept of registering specific functions – this is a common model for layering abstraction in systems. For example, I’ve done a fair bit of file systems work over the years and it is quite common to find that file systems “insert” themselves into the system by registering the functions they provide to perform particular operations. This sort of decomposition of functionality into “common” (or generic) and “specialized” has been around for quite a long time; certainly longer than I’ve been building systems software.
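To illustrate the generic/specific split and the registration idea in the abstract (this is my own toy sketch, not code from the patent or from NDIS), here is a short Python example: a generic handler exposes the common operations, and device-specific functions are registered with it.

```python
class GenericNetworkHandler:
    """Generic handler: common bookkeeping lives here; hardware-specific
    functions are registered by the device driver."""

    def __init__(self):
        self._ops = {}

    def register(self, operation, func):
        self._ops[operation] = func

    def send(self, payload):
        if "send" not in self._ops:
            raise NotImplementedError("this device did not register 'send'")
        # Generic work (logging, statistics, retries) would happen here.
        return self._ops["send"](payload)

# Device-specific code supplies only the hardware-facing pieces.
def acme_card_send(payload):
    return f"ACME card transmitted {len(payload)} bytes"

handler = GenericNetworkHandler()
handler.register("send", acme_card_send)
print(handler.send(b"hello"))

# A receive-only (monitoring) device would simply never register "send".
```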

Now the last fragment: “and compiling the generic application handler programs together with the specific application handler code to produce machine readable code to be executed by an embedded processor in the at least one of the types of the hardware modules.” We have some very specific bits at this point. Note that it says compiling. That would immediately eliminate (at least for this claim) the possibility that this code is not compiled – so this wouldn’t apply if the code were interpreted. I could nit-pick the point about “together”: would this exclude the case in which you just linked the code together (e.g., you used a pre-existing library)? I’ll defer considering that further for now. I don’t really have much problem with “machine readable code”. The limitation that it is “to be executed by an embedded processor…” is similarly not really very limiting. Essentially most, if not all, of the CPUs we used in 2001 (or today) can be used as embedded processors. In fact, we normally use the same device drivers for desktop computers, which are not embedded systems, and stand-alone devices, which are embedded systems. Thus, again, not really very limiting. The last bit just says it has to have something to do with at least one piece of hardware. That makes sense to me – why bother compiling code for hardware, if that hardware isn’t present in the system?

I’ll likely re-read this again, but breaking it down in this fashion helps me better understand the general scope – of the first claim.

Actually, the second claim is much easier, since it just builds upon the first claim:

A method according to claim 1, wherein providing the generic application handler programs comprises providing an application program interface (API) to enable a system management program in the communication system to invoke the generic application functions.

So, this claim just specifies one of the generic interfaces is some sort of management operation. To allow a program to access this, we need an API. APIs are certainly not new, nor are system management programs. This claim is much easier to understand. In my experience, dependent claims typically provide these sorts of small, focused instances of the broader claim upon which they are based.

At this point I’ll break. There’s more analysis that can be done here, but I’ve made a good start. I am a bit surprised at the breadth of this patent. Such breadth is a two-edged sword for the patent holder when it comes to enforcement: the broader the claims, the more likely they are to find many people potentially infringing upon the patent, but also the more difficult it is to defend the patent against prior work that might anticipate it. Narrow patents, in my experience, are more difficult to enforce, but when you find an instance where someone is practicing the patent, the narrow patent is more difficult for them to challenge successfully.

Lucio Development LLC v Microsoft Corp, Case 6:19-cv-283.

Categories
Uncategorized

The Journey Begins

Thanks for joining me!

Good company in a journey makes the way seem shorter. — Izaak Walton


This blog, unlike those I have done before, is focused on my consulting work in the litigation support domain. Since I am working with technology, I thought I would start looking at interesting patents, which I find through the patent dispute process, and discuss them in the context of how I would approach them as an expert.

I am the primary inventor on 11 US patents in the technology space. I was personally involved in their prosecution. I have been involved in several patent disputes in the past and while I have yet to testify at trial, I have been through the other stages of the process.

In addition to inventing those patents, I also owned them for a while, as they were assigned to me after leaving my last company. I ultimately went through the patent disposition process as well, working with a broker to sell them. Each aspect of my involvement in the patent process has taught me quite a bit about how it works.

In the coming weeks and months I’ll be sharing different aspects of the patent process from my own unique perspective. I expect to discuss:

  • An Expert’s perspective on patent litigation. I get a daily report of new patent cases filed in the United States from the folks at Rational Patent (RPX). In all fairness, I normally don’t have time to go through all the complaints filed on a given day, so I pick those that look of interest to me.
  • My perspective on patent prosecution. What distinguishes a good patent from a bad patent from my perspective as an expert as well as someone who has gone through more than a dozen patent prosecutions.
  • My experiences in monetizing my patents. For small inventors, this can be one of the most challenging aspects of the patent process. Indeed, it is only by going through it that I’ve learned quite a bit about how the process works.

As an expert, one of my goals is to help demystify technology as much as possible. Arthur C. Clarke said: Any sufficiently advanced technology is indistinguishable from magic. My goal is to demystify the technology, so it is no longer magic. Hence, my tag line, a portmanteau to honor Clarke’s memory: Any sufficiently advanced magic is indistinguishable from technology.