We have talked about meta-data before in the context of photographs (see the previous discussion of meta-data here), but today I turn my attention to a broader discussion of what it is and why it matters.
Meta-data is literally “data about data.” Since I have decades of experience with file systems, I tend to think of it in the terms we use in file systems: the name, the timestamps, and the file’s attributes are its meta-data. This is information that a file system (for example) maintains about the file and provides as needed to applications. Those applications, in turn, use that meta-data to decide how to present the files to you.
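To make that concrete, here is a minimal sketch of how an application might ask the file system for this meta-data. It uses only Python's standard library; the function name `describe_file` and the temporary file are my own illustrative choices, not anything a particular file system prescribes.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Return the basic file system meta-data for one file as a dict."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        "accessed": datetime.fromtimestamp(info.st_atime, tz=timezone.utc),
        "mode": stat.filemode(info.st_mode),  # permission string such as "-rw-------"
        "is_directory": stat.S_ISDIR(info.st_mode),
    }


if __name__ == "__main__":
    # Create a throwaway file so the example does not depend on any real path.
    with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as handle:
        handle.write(b"hello")
    try:
        for key, value in describe_file(handle.name).items():
            print(f"{key}: {value}")
    finally:
        os.remove(handle.name)
```

Note that none of these values come from the file's contents; they are all maintained by the file system alongside the data.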
File system meta-data is relatively minimal, yet it has proven to be more or less sufficient for more than a half-century. Some file systems augment their meta-data by storing “extra” information with a file; typically, these are referred to as “extended attributes.” From what I can tell, this idea of having arbitrary meta-data maintained by a file system stems from work done in the 1980s by Jeff Mogul. Even today, support for extended attributes remains unusual. For instance, Microsoft’s main file system (NTFS) supports them but does not expose them to typical Windows applications.
Thus, what often happens is that meta-data becomes part of the data itself. This is a fundamental entwining of data and meta-data that we rely upon extensively today. For example, applications might recognize that a particular piece of data corresponds to a uniform resource identifier and, in turn, display that to users as a link. Some applications allow embedding a link within the text. For example, WordPress, the software I use to maintain this blog, permits me to embed the link. Photographs contain meta-data about the photo within the file, and other applications similarly do this. Some of my work relates to figuring out the structure and format of meta-data maintained by application programs for forensic purposes (e.g., to figure out how and why specific data and meta-data exist in a particular format).
One of my continual frustrations is that many people dismiss the value of meta-data. For example, today, I received an e-mail that was obviously phishing. The first thing I did was look at the meta-data that accompanied the e-mail. I captured that meta-data and removed the spoofed e-mail address as it looked like it might belong to a real person. It is long, so if you want to see it look here. In all fairness, the e-mail itself was a poor-quality phishing attempt. It was interesting because it was sent from an internal e-mail address. The headers, however, provided some handy information. None of that information was present in the body of the e-mail itself.
Thus, meta-data often looks like gibberish but is often rich with helpful information and insight. For example, this e-mail appears to have been sent using the Messaging Application Programming Interface (“mapi”):
Received: from BN8PR07MB6897.namprd07.prod.outlook.com ([fe80::71fb:7c18:45a:4c7]) by BN8PR07MB6897.namprd07.prod.outlook.com ([fe80::71fb:7c18:45a:4c7%8]) with mapi id 15.20.5438.024; Thu, 21 Jul 2022 13:15:59 +0000
Someone connected to a Microsoft e-mail server with a reasonably standard e-mail client over MAPI and sent the message at 13:15:59 UTC. The message was then sent to a different server via the SMTP electronic mail protocol:
Received: from BN8PR07MB6897.namprd07.prod.outlook.com (2603:10b6:408:7b::20) by BN7PR07MB5089.namprd07.prod.outlook.com (2603:10b6:408:31::22) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.5438.20; Thu, 21 Jul 2022 13:15:59 +0000
I found it mildly interesting that these servers use IPv6 addresses (2603:10b6:408:7b::20, for example) rather than IPv4 addresses (which have the aaa.bbb.ccc.ddd style text format). While not germane to the phishing e-mail, it tells me something about how Microsoft’s internal networks are now configured.
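If you want to check this sort of thing without reading the addresses by eye, Python's standard `ipaddress` module will do it. This is only a sketch: the helper name `classify` is mine, and I strip the `%8` zone suffix by hand because older Python versions do not accept it.

```python
import ipaddress


def classify(text):
    """Report the IP version and a few properties of one address string."""
    # Drop a zone/scope suffix such as "%8" before parsing.
    address = ipaddress.ip_address(text.split("%")[0])
    return {
        "version": address.version,
        "link_local": address.is_link_local,
        "loopback": address.is_loopback,
        "expanded": address.exploded,
    }


if __name__ == "__main__":
    # Addresses copied from the Received headers quoted above.
    for candidate in ("fe80::71fb:7c18:45a:4c7%8", "2603:10b6:408:7b::20"):
        print(candidate, classify(candidate))
```

The `fe80::` addresses in the first header are link-local, meaning they are only meaningful on the local network segment, while the `2603:` addresses in the second header are ordinary routable IPv6 addresses.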
Finally, that e-mail was sent to a different server – presumably the one where my e-mail box is hosted:
Received: from BN7PR07MB5089.namprd07.prod.outlook.com (::1) by
CO2PR07MB2565.namprd07.prod.outlook.com with HTTPS; Thu, 21 Jul 2022 13:16:08 +0000
Authentication-Results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=gatech.edu;
It looks as though these two services are co-located, though: the ::1 address is IPv6 for “localhost.” The alarming bit about this e-mail was that it appeared to be from someone within the organization. It provided a link intended to harvest the recipient’s e-mail address and password for their Microsoft Apps for Business 365 account. The meta-data tells us a story (only a small amount of which I unraveled) that is invaluable.
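The walk through the headers above can be automated. The following is a rough sketch using Python's standard `email` package; the function name `trace_hops` and the regular expression are my own, and the pattern is only tuned to the three headers quoted here, so treat it as a starting point and not a general parser. I have assumed the headers appear newest-first, as they normally do in a raw message, and I indented the continuation line so that it parses as a folded header.

```python
import re
from email import message_from_string
from email.utils import parsedate_to_datetime

RAW_HEADERS = """\
Received: from BN7PR07MB5089.namprd07.prod.outlook.com (::1) by
 CO2PR07MB2565.namprd07.prod.outlook.com with HTTPS; Thu, 21 Jul 2022 13:16:08 +0000
Received: from BN8PR07MB6897.namprd07.prod.outlook.com (2603:10b6:408:7b::20) by BN7PR07MB5089.namprd07.prod.outlook.com (2603:10b6:408:31::22) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.5438.20; Thu, 21 Jul 2022 13:15:59 +0000
Received: from BN8PR07MB6897.namprd07.prod.outlook.com ([fe80::71fb:7c18:45a:4c7]) by BN8PR07MB6897.namprd07.prod.outlook.com ([fe80::71fb:7c18:45a:4c7%8]) with mapi id 15.20.5438.024; Thu, 21 Jul 2022 13:15:59 +0000

"""

# Matches "from <host> ... by <host> ... with <protocol> [id <id>]".
HOP_PATTERN = re.compile(
    r"from\s+(\S+).*?\bby\s+(\S+).*?\bwith\s+(.+?)(?:\s+id\s+\S+)?$"
)


def trace_hops(raw_headers):
    """Return the delivery hops in chronological order (oldest first)."""
    message = message_from_string(raw_headers)
    hops = []
    for value in message.get_all("Received", []):
        flat = " ".join(value.split())  # undo header folding
        route, _, date_text = flat.rpartition(";")
        match = HOP_PATTERN.search(route.strip())
        if match is None:
            continue  # skip headers this simple pattern does not understand
        sender, receiver, protocol = match.groups()
        hops.append({
            "from": sender,
            "by": receiver,
            "protocol": protocol,
            "time": parsedate_to_datetime(date_text.strip()),
        })
    # Each server prepends its header, so reverse to get oldest first.
    return list(reversed(hops))


if __name__ == "__main__":
    for hop in trace_hops(RAW_HEADERS):
        print(hop["time"].isoformat(), hop["protocol"], "->", hop["by"])
```

Laid out this way, the path is easy to read: one hop over MAPI, one over SMTP with TLS 1.2, and a final hand-off over HTTPS nine seconds after the message was first submitted.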
The challenge I often face when working with electronic discovery is that this meta-data is “scraped away” by a system that was designed for physical documents and later adapted to handle electronic records. Nothing is worse than seeing a scanned image with a word in a color I associate with a link to another, related document that I cannot follow. This is quite common when working with technology companies as data from Confluence (a popular tool for documenting technical resources) often involves linking to other information, including task tracking information, source code, and other documents within the Confluence system. These documents can then be exported as PDFs. The legal discovery process usually then converts these PDFs into images. The images are subsequently loaded into a document discovery database (such as Relativity), where the images are converted back into text via optical character recognition (OCR). Any meta-data, such as the links behind the pretty blue words I want to click and follow, is lost. That makes trying to understand context, provenance, and relationships vastly more complicated and dramatically drives up the cost of discovery for the parties.
In addition to being “inconvenient,” removing meta-data can also be done for less savory reasons: it can be used to obfuscate the fact that a document has been modified.
When I write an expert report, I normally provide a digitally signed version of that report in portable document (PDF) format. Unlike a physical signature, which (sort of) protects the signed page, a digital signature is on the content of some or all of the file, thus ensuring that none of the signed information has been modified. To demonstrate this, I decided to sign my curriculum vitae.
If I were to take that PDF and then convert it into an image, the digital signature would be lost. Small changes could then be made, and such a discrepancy would be challenging to detect. With the original PDF, by contrast, looking at the document in a capable PDF viewer will confirm whether the signature over the document is valid.
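To illustrate why even a small change is detectable while the signature survives, here is a deliberately simplified sketch. It is not how a PDF signature is actually implemented: a real one wraps a digest like this in a public-key signature tied to a certificate. I am using a bare SHA-256 digest from Python's standard library as a stand-in, and the sample text and the name `fingerprint` are invented for the example.

```python
import hashlib


def fingerprint(document_bytes):
    """Return a SHA-256 digest of the document content as hex text."""
    return hashlib.sha256(document_bytes).hexdigest()


# Hypothetical document content, invented for this illustration.
original = b"Expert report: the file was last modified on 21 Jul 2022."
recorded = fingerprint(original)  # what a signature would protect

# A one-character edit that a reader could easily miss on a printout.
altered = original.replace(b"21 Jul", b"22 Jul")

if __name__ == "__main__":
    print("original matches:", fingerprint(original) == recorded)
    print("altered matches: ", fingerprint(altered) == recorded)
```

The unmodified content reproduces the recorded digest and the altered content does not. Once the document has been flattened into an image, there is no recorded digest left to compare against, which is the problem described above.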
I could have made a visible “digital signature” box, but it would not have made the document more secure – the visual nature of it is to provide information to human users. The programs do not need that to verify the signature.
Thus, meta-data is essential and often provides considerable insight that is stripped away in the legal discovery process. When possible, I advise clients to obtain original PDF documents precisely because this meta-data is very useful. Of course, other electronic documents, such as photographs, also carry useful meta-data inside the same files that hold the data needed to reproduce the picture. If all you are given is a printed copy of the photograph, that EXIF meta-data has been discarded. While this is not evidence of tampering, it does make tampering harder to rule out, and thus the value of such evidence is diminished.
Hopefully, the legal process will come to appreciate the benefits of ensuring that both the data and the meta-data are available to experts, so that we can provide better insight and greater surety about our work. In the meantime, I will always strive to do the best job possible within the constraints I am given because the role of a good expert is to provide truthful insight.