Why use PDF to XML API?

Extract PDF to XML
With our PDF to XML API, you can convert PDF to XML format with information about text values, tables, fonts, images, and object positions.

Support for damaged and scanned text

The PDF.co engine automatically handles damaged text and recognizes text in scanned images. Built-in OCR (Optical Character Recognition) supports PDF files with mixed content and multiple languages.


Web API Supports Multiple Languages

The PDF to XML converter API can be called from programming languages such as PHP, JavaScript, .NET and ASP.NET, C#, Java, Visual Basic, and many others. Find source code samples in our API documentation.

Business Automation Platforms Integrations

If you are not a developer, you can also easily automate your PDF operations via popular business automation platforms: Zapier, Make, Airtable, Salesforce, Google Apps Script, and 300+ more.


PDF to XML Converter API – Sample & Demo

Take a look at the Sample PDF File for this demo.

Screenshot of Sample PDF

The code snippets below are in different programming languages. They can convert the Sample PDF File above into XML.

The final result will look like this.

Screenshot of Output XML

Before we proceed with the code, let us first check the /v1/pdf/convert/to/xml parameters and their uses.

Endpoint to Convert PDF to XML Format

URL: https://api.pdf.co/v1/pdf/convert/to/xml
Method: POST
Parameter | Description
url | required. Link to the source file.
lang | optional. english by default. Sets the OCR (image-to-text extraction) language used when a scanned PDF is detected or the input is a PNG or JPG image. Other supported values: eng, spa, deu, fra, jpn, chi_sim, chi_tra, kor. You can also specify two languages to be used on the same page, for example: eng+deu, jpn+kor or other combinations.
inline | optional. Must be one of: true to return data inline or false to return a link to an output file (default).
unwrap | optional. Unwraps lines into a single line within table cells when lineGrouping is enabled. Must be one of: true or false.
pages | optional. Comma-separated list of page indices (or ranges) to process. IMPORTANT: the very first page is 0 (zero). To set a range use the dash (-), for example: 0, 2-5, 7-.
rect | optional. Defines coordinates for extraction, e.g. 51.8, 114.8, 235.5, 204.0. Must be a string.
encrypt | optional. Enables encryption for the output file. Must be one of: true or false.
async | optional. Runs processing asynchronously and returns a jobId to use with job/check. Must be one of: true or false.
name | optional. Output file name.
profiles | optional. Sets custom configuration. Must be a string. See profiles examples here.
lineGrouping | optional. Line grouping within table cells. Set to 1 to enable the grouping. Must be a string.
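To make the parameters listed above more concrete, here is a minimal Python sketch. The helper names expand_pages and build_payload are hypothetical and not part of the PDF.co API. The sketch assumes that an open-ended range such as 7- runs to the last page, and that true/false values are sent as JSON booleans.

```python
import json
from typing import List, Optional

# Parameter names taken from the table above.
ALLOWED_OPTIONS = {
    "lang", "inline", "unwrap", "pages", "rect", "encrypt",
    "async", "name", "profiles", "lineGrouping",
}


def expand_pages(spec: str, page_count: int) -> List[int]:
    """Expand a pages value such as "0, 2-5, 7-" into zero-based page indices."""
    indices: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            # Assumption: an open-ended range ("7-") runs to the last page.
            end = int(end_text) if end_text.strip() else page_count - 1
            indices.extend(range(start, end + 1))
        else:
            indices.append(int(part))
    return indices


def build_payload(url: str, options: Optional[dict] = None) -> str:
    """Serialize the required url plus optional parameters as a JSON body."""
    options = options or {}
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    return json.dumps({"url": url, **options})
```

For a 10-page document, expand_pages("0, 2-5, 7-", 10) selects pages 0, 2, 3, 4, 5, 7, 8, and 9, which correspond to the 1st, 3rd to 6th, and 8th to 10th pages as a reader would count them.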

Now we are ready to write some code.

cURL Code Snippet

curl --location --request POST 'https://api.pdf.co/v1/pdf/convert/to/xml' \
--header 'x-api-key: YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data-raw '{
"url": "https://pdf-temp-files.s3.amazonaws.com/49e77ae7473e47d1a32eac28ffd0c161/sample.pdf"
}'

This sample code and other cURL source code samples are available here.
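For comparison, the same request can be sketched in Python using only the standard library. This is an illustrative sketch, not the official PDF.co sample: the function names build_request and convert_pdf_to_xml are hypothetical, and the code assumes the API replies with a JSON body.

```python
import json
import urllib.request

API_URL = "https://api.pdf.co/v1/pdf/convert/to/xml"


def build_request(api_key: str, source_url: str) -> urllib.request.Request:
    """Build the same POST request as the cURL command, without sending it."""
    body = json.dumps({"url": source_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
    )


def convert_pdf_to_xml(api_key: str, source_url: str, timeout: float = 60.0) -> dict:
    """Send the request and decode the reply (assumed here to be JSON)."""
    request = build_request(api_key, source_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling convert_pdf_to_xml("YOUR_API_KEY", "https://pdf-temp-files.s3.amazonaws.com/49e77ae7473e47d1a32eac28ffd0c161/sample.pdf") would send the request. Keeping build_request separate makes it possible to inspect the headers and body before anything goes over the network.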

So, now you’ve learned how to convert PDF to XML format.

Now let’s see this program in action.

Output XML using cURL

The PDF to XML sample code in JavaScript is available here.

The PDF to XML sample code in PHP is available here.

The sample code for PDF to XML in Python is here.

The PDF to XML sample code in Java is available here.

The PDF to XML sample code in C# is available here.


NOTE: Use the PDF.co Document Classifier to identify the source of a document. You can easily create and maintain classification rules with the desktop-based Classifier Testing Tool (see the details here).