In December 2023, the New York Times filed suit against OpenAI, claiming that ChatGPT was built from “uncompensated use” of the Times’ “intellectual property.” The filing showcased hundreds of examples in which answers from the chatbot were nearly identical to articles published by the Times. More disturbing still, several of the answers attributed to the Times had never actually been published by the paper, and some even contained false information.
It may be years before courts rule on the legality of data-collection methods used by OpenAI (and, by extension, most or all commercial Large Language Models (LLMs)). In the meantime, publishers are expressing concern about how and when their content is ingested into LLMs, and urgently seeking technical solutions to prevent their intellectual property from being used to train LLMs and regurgitated in unpredictable ways by AI tools.
Why publishers of high-value PDFs are concerned
Some digital documents are already sold under contracts that expressly forbid indexing, summarization, and use in training LLMs. In nearly all cases these documents are distributed as PDF files, because PDFs are digital containers for information that can be controlled at the document level.
For example, restrictions on use in AI are now included in the sale agreements for PDFs of technical standards, for several reasons:
The text of standards documents is precise, and any condensation or rephrasing might change the meaning of important parts, which for certain types of standards (aeronautics, fuels, heavy machinery, etc.) could lead to catastrophic outcomes.
Even the most sophisticated LLMs regularly introduce errors, known as “hallucinations,” into their output, and there is currently no reliable way to distinguish answers derived from the original text from those made up by the model. Some systems attempt to address this concern by including references to the source text, but links only mitigate the risk of inaccuracy if the user actually compares the answer with the citation.
Data ingested into an LLM by one user may become part of the training set for that LLM and thus be made available to other users, in violation of copyright. Most commercial LLM offerings now include some kind of segmentation to prevent unsanctioned use, but the effectiveness of these controls remains untested.
How LLMs extract content from PDFs for training
The mechanism by which LLMs ingest PDF content is, at a high level, no different from the one used by search engines: text is extracted from the PDF and then processed to organize and structure it for indexing or, in the case of LLMs, “tokenization.”
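As a rough illustration of that pipeline, the sketch below uses the open-source pypdf library to pull the text out of a PDF and OpenAI’s tiktoken library to turn it into tokens; the filename and the choice of encoding are placeholders, not a description of any particular vendor’s ingestion system.

```python
# Minimal sketch of the extract-then-tokenize pipeline described above.
# Assumes the pypdf and tiktoken packages; "document.pdf" is a placeholder.
from pypdf import PdfReader
import tiktoken

# Step 1: pull the raw text out of each page of the PDF.
reader = PdfReader("document.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Step 2: convert the text into tokens, as an LLM ingestion pipeline would.
# cl100k_base is one of the encodings published with tiktoken.
encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode(text)

print(f"Extracted {len(text)} characters, {len(tokens)} tokens")
```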
Indexing and tokenization differ significantly in how they process the extracted text, but because both are built on the same initial PDF text extraction, they expose the same risks and complications, including:
Some PDFs have no text to extract, because they are images of pages. In this case the text must be generated using Optical Character Recognition (OCR), which is generally accurate but almost never produces perfect results (a minimal OCR example is sketched after this list).
PDFs are typically unstructured data and some elements – especially tables, charts and special formatting – are difficult to parse, even for the most sophisticated tools.
PDFs often contain complex punctuation and other typographical constructs that may change the meaning of text if not properly understood, especially if the text is in an unusual language.
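To make the OCR case above concrete, the sketch below uses the pdf2image and pytesseract packages (which in turn require the Poppler and Tesseract tools to be installed) to generate text from a scanned, image-only PDF; the filename and resolution are placeholders.

```python
# Minimal sketch of the OCR fallback for image-only PDFs.
# Assumes pdf2image and pytesseract, plus the Poppler and Tesseract
# system binaries; "scanned.pdf" is a placeholder filename.
from pdf2image import convert_from_path
import pytesseract

# Render each page to an image, then run character recognition on it.
pages = convert_from_path("scanned.pdf", dpi=300)
ocr_text = "\n".join(pytesseract.image_to_string(page) for page in pages)

# OCR output is generally accurate but rarely perfect: stray characters,
# merged columns, and misread symbols are common.
print(ocr_text[:500])
```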
The text of encrypted PDFs, however, cannot be extracted. So a PDF that has been encrypted – either with a password or via a Security Handler like FileOpen – can be neither indexed nor ingested into an LLM.
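As an illustration, the sketch below shows how a generic extraction tool such as pypdf behaves when it encounters an encrypted PDF and the password (or security handler) is not available; the filename is a placeholder.

```python
# Minimal sketch: attempting extraction from an encrypted PDF with pypdf.
# "protected.pdf" is a placeholder for a password- or handler-protected file.
from pypdf import PdfReader
from pypdf.errors import PyPdfError

reader = PdfReader("protected.pdf")
if reader.is_encrypted:
    try:
        # Without the password (or the document's security handler), the
        # page content streams cannot be decoded into text.
        print(reader.pages[0].extract_text())
    except PyPdfError as err:
        print(f"Extraction blocked: {err}")
```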