Text and Images from PDFs

Text and images are often buried in huge PDF files, and it is often useful to separate them out.

The second tranche of Epstein files in December 2025 moved the goalposts again. Many of the images submitted in evidence were distributed within huge multi-page documents. The challenge was to separate out the text, whether it was provided as text or photographed. Where images occurred, I needed to separate those out as well and to be able to ask my AI software "What can you see in this picture?", with the answer also returned as text, at least for the parts that were visible and not covered by black redaction marks. The end result was that the whole data collection became searchable for my client journalists.

Some users found that documents submitted from court records in 2022 had been really badly redacted: some had black rectangles painted over the text, but the text underneath had otherwise been left in place. Users found they could select a paragraph of text and paste it into another document, where all the obscured text reappeared. Hence a lot of chatter about "just having to copy and paste" to see all the redacted files; however, the volume of documents was such that this would take years to process by hand.

I wrote a Python program that could either list all the words positioned under black rectangles or output the full text without redactions.
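The idea can be illustrated with a minimal sketch. This is not my production program, just one way the approach could work: it assumes each word comes with a bounding box, and it treats a word as hidden when at least half of that box lies under a black filled rectangle. The names overlap_fraction, split_words and load_page, and the 0.5 threshold, are made up for this illustration. The load_page helper assumes the PyMuPDF library; redactions drawn as annotations or as images would need separate handling.

```python
from typing import Iterable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]        # (x0, y0, x1, y1) in page coordinates
Word = Tuple[float, float, float, float, str]  # word box plus the word itself


def overlap_fraction(word: Box, rect: Box) -> float:
    """Return the fraction of the word's box area that lies inside rect."""
    wx0, wy0, wx1, wy1 = word
    rx0, ry0, rx1, ry1 = rect
    width = min(wx1, rx1) - max(wx0, rx0)
    height = min(wy1, ry1) - max(wy0, ry0)
    if width <= 0 or height <= 0:
        return 0.0  # the boxes do not intersect
    area = (wx1 - wx0) * (wy1 - wy0)
    return (width * height) / area if area > 0 else 0.0


def split_words(
    words: Iterable[Word],
    black_rects: Sequence[Box],
    threshold: float = 0.5,  # illustrative value, not a tuned one
) -> Tuple[List[str], List[str]]:
    """Split words into (hidden, visible) according to black-rectangle coverage."""
    hidden: List[str] = []
    visible: List[str] = []
    for x0, y0, x1, y1, text in words:
        box = (x0, y0, x1, y1)
        covered = any(overlap_fraction(box, r) >= threshold for r in black_rects)
        (hidden if covered else visible).append(text)
    return hidden, visible


def load_page(page) -> Tuple[List[Word], List[Box]]:
    """Collect words and near-black filled shapes from a PyMuPDF page object.

    Assumption: the rectangles were drawn as vector shapes with a dark fill.
    """
    words = [tuple(w[:5]) for w in page.get_text("words")]
    rects = [
        tuple(d["rect"])
        for d in page.get_drawings()
        if d.get("fill") is not None and max(d["fill"]) < 0.1
    ]
    return words, rects


if __name__ == "__main__":
    # Hand-made example: the rectangle covers the second word only.
    demo_words = [
        (10, 10, 50, 20, "John"),
        (60, 10, 100, 20, "Smith"),
        (10, 30, 50, 40, "met"),
    ]
    demo_rects = [(55, 8, 105, 22)]
    hidden, visible = split_words(demo_words, demo_rects)
    print("Under rectangles:", hidden)   # ['Smith']
    print("Visible text:", visible)      # ['John', 'met']
```

With PyMuPDF installed, the same function could be run page by page over a whole folder of documents: open each file, call load_page on every page, and pass the results to split_words. Joining the hidden and visible words in reading order gives the full text without redactions.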

Fortunately, all files submitted in and after December 2025 were redacted properly. Proper redaction, which removes the underlying text rather than just covering it, is a standard feature of Adobe Acrobat Pro and other PDF editing software.

If you need to extract text, images or both from volumes of PDF documents, I have all the tools to help you. I can also add metadata that tells you, in plain English, what each image shows.
