The Challenge
To keep my tech skills current, one useful source of challenges is the way the US government releases floods of data in inconvenient forms, as it did in the Epstein case: 23,000 one-page photos of evidence, then thousands of documents in PDF form. It then turned out that the redactions on last August's court records were bodged, so I wrote a bit of Python to fetch the text hidden beneath the painted black rectangles, and voilà: I can list just the redacted words, or the full text with the redacted content restored.
Note that PDF redactions are usually done more thoroughly: the text itself is removed, leaving only blacked-out boxes where it sat. In the August 2025 release of 2022 court case files, FOIA data and interview transcripts, someone simply painted black rectangles on top of the text, leaving the text underneath intact.
I OCRd the 23,000 one-page evidence photos from the August drop using Gemini AI; the last 444 files, which Gemini refused to convert on copyright grounds (pictures of newspaper columns and book pages), I processed with another library. While my unredact code was running, it also converted the PDFs to text format, redactions found or not.
Some of the PDF documents contained voluminous collections of photos, so I enhanced my Python code to extract every picture file stored within a PDF. The user specifies a directory name and can, if desired, tell the code to recurse through the directory structure beneath it.
The end result is that all the written words are in text files on my Mac. While I've no interest in the content (I'll leave that to the journalists), a search for "Prince" on my Mac scrolls text for ten minutes.
My work covered all the data released to the public by the US Dept of Justice up to the end of December 2025. A further 360GB of files followed in January 2026, which was problematic for two reasons: (1) the space required to store and process the data far exceeds the storage I currently possess; (2) the redactions are, I'm told, executed poorly, and the files may contain images that are illegal to store and process in the UK. So I've sat this one out.
The next stage, for another high-volume data-ingest exercise, is to use Retrieval-Augmented Generation (RAG) techniques to put a large language model in front of the corpus. This would allow end users to query the complete data asset and summarise its content on demand.
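The core of RAG is the retrieval half: score the document chunks against a query and hand the best few to the model as context. The toy sketch below (not the author's plan) uses bag-of-words cosine similarity as a stand-in for the embedding search a real pipeline would use; all names are illustrative.

```python
# Toy retrieval step for a RAG pipeline: rank text chunks by
# bag-of-words cosine similarity to a query. A real system would
# swap _vec/_cosine for vector embeddings and a vector database.
import math
import re
from collections import Counter

def _vec(text):
    """Crude tokeniser: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def top_chunks(query, chunks, k=3):
    """Return the k chunks most similar to the query."""
    q = _vec(query)
    return sorted(chunks, key=lambda c: _cosine(q, _vec(c)), reverse=True)[:k]

# The generation half would then prepend the winners to the prompt,
# e.g. llm(f"Context:\n{...}\n\nQuestion: {query}") with your model of choice.
```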
Some consumers of the files suggested unredacting the poorly redacted Epstein court documents by manual copy and paste. They'd quickly be frustrated by the scale; my Python code, on the other hand, processed everything in the time it took me to drink a mug of coffee. Look ma, no hands:
Results
I ended up with a complete hierarchy of text documents mirroring the structure of the released court files. These could be searched using Unix/Linux grep (pattern-matching) commands.
I also provided a "build a bear" Python application that took a list of files in which a specific text string occurred, and consolidated only those text files into a data source that could be fed into NotebookLM. This allowed a user to query events and timelines, and to summarise the activities of any specific individual within the files, using plain-English prompts.
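The consolidation step reduces to a filtered concatenation. A minimal sketch of the idea, with hypothetical function and path names rather than the author's actual application:

```python
# Sketch: merge every .txt file under `root` that mentions `term`
# into a single document suitable for uploading to NotebookLM.
# Names and output path are illustrative.
from pathlib import Path

def consolidate(root, term, out_path="notebooklm_source.txt"):
    """Concatenate all matching text files; return the list of matches."""
    matches = []
    for path in sorted(Path(root).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if term.lower() in text.lower():  # case-insensitive, like grep -i
            matches.append((path, text))
    with open(out_path, "w", encoding="utf-8") as out:
        for path, text in matches:
            # A header per file preserves provenance inside the merged source.
            out.write(f"===== {path} =====\n{text}\n")
    return [p for p, _ in matches]
```

Keeping the original file path above each merged section means NotebookLM's answers can still be traced back to a specific source document.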