How to Convert PDF to XML from an Uploaded File in Python with Web API

Learn to Write Code That Converts an Uploaded PDF File to XML in Python: A Simple How-To Tutorial

This documentation is written to help you apply all the necessary features on your side. Web API was designed to support PDF to XML conversion in Python. It is a flexible Web API that includes a full set of functions, from e-signature requests to data extraction, OCR, image recognition, and PDF splitting. It can also generate barcodes and read barcodes from images, scans, and PDFs.

The sample below shows how to quickly add PDF to XML conversion to your Python application with the help of Web API. You can copy and paste this Python code into your project, or get the source code from our GitHub repository. Once done, just run the script. Because application development goes through several stages, test the code with your own data and in your production environment even if it already works. A free trial version of Web API is available on our website, and it includes other code samples to help you with your Python application.

On-demand (REST Web API) version:
 Web API (on-demand version)

On-premise offline SDK for Windows:
 60 Day Free Trial (on-premise)

import os
import requests # pip install requests

# The authentication key (API Key).
# Get your own by registering at
API_KEY = "******************************************"

# Base URL for Web API requests
BASE_URL = "<your Web API base URL>"

# Source PDF file
SourceFile = ".\\sample.pdf"
# Comma-separated list of page indices (or ranges) to process. Leave empty for all pages. Example: '0,2-5,7-'.
Pages = ""
# PDF document password. Leave empty for unprotected documents.
Password = ""
# Destination XML file name
DestinationFile = ".\\result.xml"

def main(args = None):
    uploadedFileUrl = uploadFile(SourceFile)
    if (uploadedFileUrl != None):
        convertPdfToXml(uploadedFileUrl, DestinationFile)

def convertPdfToXml(uploadedFileUrl, destinationFile):
    """Converts PDF To XML using Web API"""

    # Prepare requests params as JSON
    # See documentation:
    parameters = {}
    parameters["name"] = os.path.basename(destinationFile)
    parameters["password"] = Password
    parameters["pages"] = Pages
    parameters["url"] = uploadedFileUrl

    # Prepare URL for 'PDF To XML' API request
    url = "{}/pdf/convert/to/xml".format(BASE_URL)

    # Execute request and get response as JSON
    response = requests.post(url, data=parameters, headers={ "x-api-key": API_KEY })
    if (response.status_code == 200):
        json = response.json()

        if json["error"] == False:
            #  Get URL of result file
            resultFileUrl = json["url"]            
            # Download result file
            r = requests.get(resultFileUrl, stream=True)
            if (r.status_code == 200):
                with open(destinationFile, 'wb') as file:
                    for chunk in r:
                        file.write(chunk)
                print(f"Result file saved as \"{destinationFile}\" file.")
            else:
                print(f"Request error: {r.status_code} {r.reason}")
        else:
            # Show service reported error
            print(f"Service error: {json}")
    else:
        print(f"Request error: {response.status_code} {response.reason}")

def uploadFile(fileName):
    """Uploads file to the cloud"""

    # Prepare URL for 'Get Presigned URL' API request
    url = "{}/file/upload/get-presigned-url?contenttype=application/octet-stream&name={}".format(
        BASE_URL, os.path.basename(fileName))
    # Execute request and get response as JSON
    response = requests.get(url, headers={ "x-api-key": API_KEY })
    if (response.status_code == 200):
        json = response.json()
        if json["error"] == False:
            # URL to use for file upload
            uploadUrl = json["presignedUrl"]
            # URL for future reference
            uploadedFileUrl = json["url"]

            # 2. UPLOAD FILE TO CLOUD.
            with open(fileName, 'rb') as file:
                requests.put(uploadUrl, data=file, headers={ "x-api-key": API_KEY, "content-type": "application/octet-stream" })

            return uploadedFileUrl
        else:
            # Show service reported error
            print(f"Service error: {json}")
    else:
        print(f"Request error: {response.status_code} {response.reason}")

    return None

if __name__ == '__main__':
    main()


Now that we've reviewed the source code, let's analyze it a bit.

First, we gather all the information needed for the PDF to XML conversion request. The API_KEY variable holds the API key, which is passed in the request header for authentication. We also specify the source PDF file (SourceFile), the pages to convert to XML (Pages), the password for protected documents (Password), and the location where the output XML will be stored (DestinationFile).
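The Pages value uses a compact syntax such as '0,2-5,7-'. As a rough illustration of how such a string can be read, here is a minimal sketch that expands it into individual page indices before sending the request. The helper name expand_pages is hypothetical and not part of Web API, and the sketch assumes that indices are zero-based and that an open-ended range such as '7-' runs to the last page.

```python
def expand_pages(spec: str, page_count: int) -> list[int]:
    """Expand a spec such as '0,2-5,7-' into sorted page indices.

    Assumptions (not confirmed by the service documentation):
    indices are zero-based, and '7-' means "page 7 to the end".
    """
    if not spec.strip():
        # An empty spec means "all pages".
        return list(range(page_count))

    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            # An open-ended range runs to the last page.
            end = int(end_text) if end_text.strip() else page_count - 1
        else:
            start = end = int(part)
        if start < 0 or end >= page_count or start > end:
            raise ValueError(f"Invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

Running this kind of check locally can catch a mistyped range before an API call is spent on it.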

The program is divided into two functions, uploadFile and convertPdfToXml. As the names suggest, uploadFile uploads the PDF to the cloud and returns its public URL, and convertPdfToXml uses that URL to perform the XML conversion.

Uploading the input PDF file to the cloud is a simple, straightforward process. First, we request a pre-signed URL by calling the endpoint /file/upload/get-presigned-url and passing the input file name in the request. The response contains the pre-signed URL and the public URL for the uploaded file. We then use the pre-signed URL to upload the actual file with a PUT request. Please note that files uploaded to the cloud are temporary and only available for a few hours.
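The sample builds the pre-signed URL request by inserting the file name directly into the query string, which can break for names that contain spaces or characters such as '&'. The following sketch shows one way to encode the query with the standard library. The function name presigned_url_request and the example base URL are hypothetical; only the endpoint path and the contenttype and name parameters come from the sample above.

```python
import os
from urllib.parse import quote, urlencode


def presigned_url_request(base_url: str, file_name: str) -> str:
    """Build the 'Get Presigned URL' request URL with an encoded query."""
    query = urlencode(
        {
            "contenttype": "application/octet-stream",
            "name": os.path.basename(file_name),
        },
        safe="/",          # keep the slash in the content type readable
        quote_via=quote,   # encode spaces as %20 rather than '+'
    )
    return f"{base_url}/file/upload/get-presigned-url?{query}"


# Hypothetical base URL, used only to show the resulting shape.
print(presigned_url_request("https://api.example.com/v1", "docs/my report.pdf"))
```

The returned string can be passed to requests.get in place of the manually formatted URL in uploadFile.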

The API endpoint /pdf/convert/to/xml performs the PDF to XML conversion. For this request, we prepare the parameters name, password, pages, and url, and pass the API key in the x-api-key request header. When the request completes successfully, the url field of the JSON response holds the link to the converted XML, which the sample then downloads and saves to the destination file.
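The sample checks the HTTP status and the error field inline. As a sketch of an alternative, the same checks can be collected in one small function that either returns the result URL or raises an exception. The names ConversionError and result_url_from_response are hypothetical; the error and url fields are the ones the sample reads from the response.

```python
class ConversionError(Exception):
    """Raised when the conversion request did not produce a result URL."""


def result_url_from_response(status_code: int, payload: dict) -> str:
    """Return the result file URL, or raise ConversionError on failure.

    status_code is the HTTP status of the response and payload is the
    decoded JSON body (for example, response.json() from requests).
    """
    if status_code != 200:
        raise ConversionError(f"HTTP error {status_code}")
    if payload.get("error"):
        # The full payload is included because the exact layout of the
        # error details is not described in this article.
        raise ConversionError(f"Service reported an error: {payload}")
    return payload["url"]
```

With requests, a call might look like result_url_from_response(response.status_code, response.json()), which keeps the download code free of nested if statements.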

The generated XML contains the text extracted from the PDF as well as other useful properties, such as font information and the coordinates of the extracted text.
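To get a feel for what result.xml contains, the file can be inspected with Python's standard xml.etree.ElementTree module. The sketch below walks every element and collects those that carry text, together with their attributes. The element and attribute names in SAMPLE_XML are invented for illustration only; the real output schema is not shown in this article and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML, invented for illustration. The real schema may differ.
SAMPLE_XML = (
    '<document><page index="0">'
    '<text x="72.0" y="90.5" fontName="Arial">Invoice</text>'
    "</page></document>"
)


def list_text_elements(xml_text: str) -> list:
    """Return (tag, attributes, text) for every element that has text."""
    root = ET.fromstring(xml_text)
    rows = []
    for element in root.iter():
        text = (element.text or "").strip()
        if text:
            rows.append((element.tag, dict(element.attrib), text))
    return rows


for tag, attributes, text in list_text_elements(SAMPLE_XML):
    print(tag, attributes, text)
```

Because the function does not depend on specific tag names, it can be pointed at the real output first to discover which elements and attributes the service actually produces.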

The PDF to XML endpoint can be configured to fit your requirements. The following are some of the additional parameters it accepts. Please refer to the documentation for more information.

rect: Defines the coordinates of the region to extract data from.
lang: Sets the OCR language to use when extracting data from scanned PDF, PNG, and JPG documents.
async: Runs processing asynchronously. When this parameter is enabled, the request returns a JobId, and the output can be retrieved once that job completes.
inline: If this parameter is enabled, the response contains the XML data directly instead of the URL of that data.
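As a sketch of how these optional parameters could be added to the request, the helper below extends the parameter dictionary used in convertPdfToXml. The function name build_parameters is hypothetical, and the value formats are assumptions: the example language code "eng" and the lowercase "true"/"false" encoding for flags should be checked against the documentation. Note that async is a reserved word in Python, so it cannot be written as a normal keyword argument and is passed through dictionary unpacking instead.

```python
def build_parameters(url, name, pages="", password="", **options):
    """Combine the required request fields with optional ones.

    Options set to None are skipped. Booleans are sent as lowercase
    strings, which is an assumption about what the service expects.
    """
    parameters = {
        "url": url,
        "name": name,
        "pages": pages,
        "password": password,
    }
    for key, value in options.items():
        if value is None:
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"
        parameters[key] = value
    return parameters


# 'async' is a Python keyword, so it is supplied via ** unpacking.
params = build_parameters(
    "https://example.com/uploaded.pdf",  # hypothetical uploaded file URL
    "result.xml",
    lang="eng",
    **{"async": True},
)
print(params)
```

The resulting dictionary can be passed as the data argument of requests.post, in the same way as the parameters dictionary in the sample.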

Please try running this sample on your development machine to get more out of this article. Thank you for reading!


