PDF: Past, present, and future
PDF is one of the most ubiquitous formats in use today.
However much of a technophobe someone might be, they’ll understand that “I’ll send you a PDF” means they’re getting an electronic document they can view and print out.
But it wasn’t always thus…
DOS Days
Think back to the late 1980s, and the PC explosion was in full swing. These strange beige boxes were appearing on desks across businesses. Where once having a computer on your desk marked you out as a technical wizard, they were swiftly becoming ubiquitous office tools.
The driving force behind the proliferation of PCs within offices wasn’t the Internet (that was still to come), it was office automation: word processors, spreadsheets, databases, and “desktop publishing”.
And offices then, even more than now, ran on paper. Whatever you did with your computer, you probably wanted to print it out; whether it was a letter, sales figures, a database-generated report, or a flier for your business, it’d probably end up on paper.

But connecting a printer to a PC was a tricky business. These days, if you plug a printer into your computer, Windows (or macOS, or whatever) will install a printer driver for it, and every application can then print to it. Back in the good old days of DOS, there was no such centralised support; you needed a printer driver for every different application, so WordPerfect would need a different piece of software to Word. And different drivers had different capabilities. It was a minefield.
Various companies stepped forward to try to solve this problem, and the ones we remember today are Hewlett-Packard and Adobe. Both attempted to define a standard language that could be used to drive a range of different printers with different abilities.
HP came up with PCL, which is still in use today. It had the advantage of being simple enough to run on relatively cheap hardware, at the cost of output files being largely tailored for specific printers.
Adobe’s offering was PostScript: a more complex and more powerful system, in which a single output file could be rendered across a wider range of devices, and would look good on all of them. This meant you could print it on your office printer as a proof, and then run off more copies at a professional print shop from the same file. The cost was that it required more powerful hardware to run on; a PostScript printer would generally cost noticeably more than the equivalent PCL one.

Both these formats are still in use today, and are supported by the Ghostscript range of software. Indeed, Ghostscript was the first non-Adobe implementation of PostScript, and remains the most widely installed PostScript interpreter in the world.
To Adobe’s infinite credit, they decided to throw the specification for PostScript open to the world, enabling people to write both their own drivers and their own implementations.
The future arrives!
Fast forward a few years, and the next change to the landscape was the Internet. A wealth of information at everyone’s fingertips. The chance for people to download all the information they could dream of (at 33kbps, until your sister picked up the phone and killed your dial-up connection).
But what format was that information in? On the early Internet, text files abounded. Great for raw information, but lacking a certain graphical flair. And while HTML based web pages were amazing at the time, they didn’t match up to the creative abilities of people’s new home PCs.
Corel Draw, Illustrator, Photoshop, and other such packages had brought high-quality graphics to people’s desktops. Great, but every package was incompatible with the others. If you’ve drawn your latest creation in Corel Draw, and you want to send it to your friend who only has Photoshop, what can you do? How can your friend view the files you’ve made in WordPerfect when he only has Word?
Again, Adobe stepped forward with a solution: PDF, the Portable Document Format. The core design brief was that whatever machine you viewed a PDF on, it should look the same. It had to be capable of representing pretty much any graphical effect or layout that any package could create.
Based (at least initially) on the same graphical model as PostScript, it removed some of the more expensive-to-implement aspects of PostScript, and packaged the data to make the file easy to navigate (rather than only being able to walk pages in order).
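That navigability comes from PDF’s cross-reference table: every object in the file is listed with its byte offset, so a reader can jump straight to page 200 without parsing the 199 pages before it. As an illustrative (not byte-accurate) sketch, a minimal PDF looks something like this; in a real file, the `...` entries must be exact byte offsets:

```
%PDF-1.4
1 0 obj                          % the document catalog
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj                          % the page tree
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj                          % a single empty page
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>
endobj
xref                             % byte offset of every object
0 4
0000000000 65535 f 
0000000009 00000 n 
...
trailer
<< /Size 4 /Root 1 0 R >>
startxref
...                              % byte offset of the xref table
%%EOF
```

A consumer reads the trailer at the end of the file, follows `startxref` to the xref table, and from there can seek directly to any object it needs.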
Released in 1993, Adobe again threw the specification for PDF open to the public; anyone could create documents, anyone could consume them. It very quickly became the “go to” format of choice.
Accordingly, Ghostscript was updated to display PDFs, and its release in early 1995 made it the first non-Adobe PDF viewer. It cleverly leveraged its existing PostScript engine to implement PDF support.
A decade of incremental updates
While the spec was published for free, Adobe kept control over it, and continued to drip feed new features to it.
1994 brought passwords, encryption, and enhanced color representations.
Then in 1996, interactive abilities were added; instead of being purely static, PDFs sprouted checkboxes, fill-in fields, and the like, enabling them to represent live forms that could be filled in and resubmitted back over the web.
In 2000, digital signatures were introduced, allowing some degree of confidence that PDFs hadn’t been tampered with. New graphical features were added to allow better representation of color across devices.
In 2001, PDF was amended with its largest graphical change to date; the ability to represent transparency. Other new features included JBIG2, enhanced encryption and the start of a drive towards accessibility.
Updates in 2003 (adding JPEG2000 and enhanced compression) and 2004 (OpenType fonts and further encryption options) were the last real changes to the graphical abilities of PDF.
While other updates were pushed out for the rest of the decade they didn’t drastically affect the core abilities of PDF. The format was mature.
Ghostscript’s handling of PDF similarly matured over this time; it could now both create and consume PDFs of all flavours. Not only that, but Artifex produced the first versions of MuPDF, written from the ground up to render PDFs quickly and in high quality, with the smallest possible footprint.
Subset specs
While the core technology of PDF was largely stable at this point, parallel development continued on a series of subset specifications, such as PDF/X (“PDF for eXchange”) and PDF/A (“PDF for Archiving”).
These take the form of restricted subsets of PDF; any PDF/X or PDF/A file is a valid PDF file, but uses only a specified fraction of the capabilities.
The idea is that by restricting the huge amount of latitude you have within a PDF file, you can further ensure consistent processing of the data within.

PDF/A in particular is a key technology that enables librarians and other archivists to be as sure as possible that any information stored as a PDF/A file will remain consistently and perfectly readable for the foreseeable future; whatever advances in technology may come, the PDF/A spec is sufficiently fixed that readers will always be able to accurately render files.
An actual open standard
Adobe took its final step in its curation of PDF in 2008 by ceding control of the standard to ISO. The Adobe specification version 1.7 was republished (with inferior typesetting!) as ISO 32000-1.
Since 2011 work on the specification has continued under the aegis of the PDF Association, an open collection of people from across the PDF industry; both consumers and producers. Adobe continues to take an active role in developments, but the open nature of the association means that work can be driven by other interested parties too.
The most visible fruits of this work appeared in 2017, when the PDF Association and ISO produced ISO 32000-2, the specification for PDF 2.0.
PDF 2.0 had the effect of consolidating the standard; grey areas were clarified, and rules were laid down for how both official and unofficial extensions would be handled in future.
Both Ghostscript and MuPDF now support PDF 2.0. Indeed, Ghostscript has retired its old PostScript-based PDF engine in favour of a new interpreter written in C, with the benefit of more than 30 years of experience with the format. The same trusted graphics engine underlies the new, faster, more reliable, easier-to-maintain front end.
Beyond 2.0
So what next for PDF? What will happen in the next 20 years?
Well, PDF is more than 30 years old at this point. The last major update to the spec was 8 years ago, and that was largely a cleanup operation. So, PDF must be done, right? All the problems solved.
Would that it were so!
We’re in the digital age, freed from the tyranny of an office full of paper, dealing with documents electronically. And yet, the key technology we’re using for this is a format designed to exactly mimic paper.
PDF was conceived as a way of representing the look of a page, not as an encoding of the logical content that makes up that page. The ability to look correct everywhere was instrumental in its success and widespread adoption, but it has also become its greatest weakness.

If you send someone a document, they’ll inevitably want to change it; maybe to update some figures, or change a date. Or maybe to do a wholesale rewrite of the text. PDF does not easily accommodate this kind of change.
If you are sent a paper copy of a letter, and there is an error in it, what can you do? Well, you can retype it, or you can try to apply whiteout and type over it. It’s ironic that in our modern paperless office PDF broadly offers you the same options!
Similarly, PDF is not the friendliest format from which to get data out in a reusable form. Every year, millions of PDF documents are produced containing tables of information - from government reports, to bank statements, to scientific papers. A treasure trove of information - if only we could get it out in a reliably machine-readable form.
Even something as conceptually simple as an invoice is a non-trivial problem for PDFs. Various countries (notably Germany and France) are leaning heavily into e-invoicing, by making PDFs that not only look like traditional invoices, but have an easily machine-readable version embedded too. But how do you know the electronic version and the human-readable version actually contain the same information?
The current computing wave is that of AI; LLMs hungry to devour vast quantities of information that they can regurgitate in new, exciting, profit-generating ways. Most of the time, the information that those AIs are desperate to ingest is in PDF format (both for primary training and at runtime via RAG). Improving the extractability of data from PDFs is an ongoing problem.
Artifex are actively at work in this area, with many customers using MuPDF (and the enhanced Python bindings for it, PyMuPDF and PyMuPDF4LLM) to enable such workflows.
Not only is the number of PDFs being created increasing, PDFs themselves are getting bigger. Larger and larger datasets are being output as PDF, and PDFs are being produced with other files embedded within them.
Another area of work is therefore to reduce the size of PDFs by improving the compression methods used. Artifex are actively working with the PDF Association on forthcoming versions of PDF that will bring state-of-the-art compression methods into the standard. Smaller files with no loss of quality, saving on data transmission times and storage costs.
So, while PDF has solved many problems in its three decades of life, its very success has led to the creation of many more.
Hopefully, Artifex will be here for the next 30 years to continue to solve these - and make more!