How to Convert PDF to XML from an Uploaded File in Python with PDF.co Web API

Learn to Write Python Code That Converts an Uploaded PDF File to XML: A Simple How-To Tutorial

This documentation is written to help you apply all the necessary features on your side. PDF.co Web API supports PDF to XML conversion from Python. It is a flexible Web API that includes a full set of functions, from e-signature requests to data extraction, OCR, image recognition, and PDF splitting. It can also generate barcodes and read barcodes from images, scans, and PDFs.

The sample below shows how to quickly add PDF to XML conversion to your Python application with the help of PDF.co Web API. You can copy and paste this Python sample code into your project. The source code is also available in our GitHub repository. Once done, just run the script. Because software development involves several stages, please check the code against your own data and production environment even if the functionality works.

A free trial version of PDF.co Web API is available on our website. The trial version also includes other code samples to help you with your Python application.

ConvertPdfToXMLFromUploadedFile.py

import os
import requests # pip install requests

# The authentication key (API Key).
# Get your own by registering at https://app.pdf.co/documentation/api
API_KEY = "******************************************"

# Base URL for PDF.co Web API requests
BASE_URL = "https://api.pdf.co/v1"

# Source PDF file
SourceFile = ".\\sample.pdf"
# Comma-separated list of page indices (or ranges) to process. Leave empty for all pages. Example: '0,2-5,7-'.
Pages = ""
# PDF document password. Leave empty for unprotected documents.
Password = ""
# Destination XML file name
DestinationFile = ".\\result.xml"


def main(args = None):
    uploadedFileUrl = uploadFile(SourceFile)
    if uploadedFileUrl is not None:
        convertPdfToXml(uploadedFileUrl, DestinationFile)


def convertPdfToXml(uploadedFileUrl, destinationFile):
    """Converts PDF To XML using PDF.co Web API"""

    # Prepare request parameters (requests sends them form-encoded via data=)
    # See documentation: https://apidocs.pdf.co
    parameters = {}
    parameters["name"] = os.path.basename(destinationFile)
    parameters["password"] = Password
    parameters["pages"] = Pages
    parameters["url"] = uploadedFileUrl

    # Prepare URL for 'PDF To XML' API request
    url = "{}/pdf/convert/to/xml".format(BASE_URL)

    # Execute request and get response as JSON
    response = requests.post(url, data=parameters, headers={ "x-api-key": API_KEY })
    if (response.status_code == 200):
        json = response.json()

        if json["error"] == False:
            #  Get URL of result file
            resultFileUrl = json["url"]            
            # Download result file
            r = requests.get(resultFileUrl, stream=True)
            if (r.status_code == 200):
                with open(destinationFile, 'wb') as file:
                    for chunk in r:
                        file.write(chunk)
                print(f"Result file saved as \"{destinationFile}\" file.")
            else:
                # Report the status of the download request, not the conversion request
                print(f"Download error: {r.status_code} {r.reason}")
        else:
            # Show service reported error
            print(json["message"])
    else:
        print(f"Request error: {response.status_code} {response.reason}")


def uploadFile(fileName):
    """Uploads file to the cloud"""
    
    # 1. RETRIEVE PRESIGNED URL TO UPLOAD FILE.

    # Prepare URL for 'Get Presigned URL' API request
    url = "{}/file/upload/get-presigned-url?contenttype=application/octet-stream&name={}".format(
        BASE_URL, os.path.basename(fileName))
    
    # Execute request and get response as JSON
    response = requests.get(url, headers={ "x-api-key": API_KEY })
    if (response.status_code == 200):
        json = response.json()
        
        if json["error"] == False:
            # URL to use for file upload
            uploadUrl = json["presignedUrl"]
            # URL for future reference
            uploadedFileUrl = json["url"]

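            # 2. UPLOAD FILE TO CLOUD.
            with open(fileName, 'rb') as file:
                uploadResponse = requests.put(uploadUrl, data=file, headers={ "x-api-key": API_KEY, "content-type": "application/octet-stream" })

            # Stop here if the upload itself failed
            if not uploadResponse.ok:
                print(f"Upload error: {uploadResponse.status_code} {uploadResponse.reason}")
                return None

            return uploadedFileUrl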
        else:
            # Show service reported error
            print(json["message"])    
    else:
        print(f"Request error: {response.status_code} {response.reason}")

    return None


if __name__ == '__main__':
    main()

 

Now that we have reviewed the source code, let us analyze it a bit.

First, we gather all the information needed for the PDF.co endpoint request that performs the PDF to XML conversion. The API_KEY variable holds the PDF.co API key, which is passed in the request header for authentication. We also specify the source PDF file (SourceFile), the pages (Pages) whose data will be converted to XML, the password for protected documents (Password), and the destination (DestinationFile) where the output XML will be stored.
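
To make the Pages format from the code comment ('0,2-5,7-') more concrete, here is a minimal sketch that expands such a string into explicit page indices. The helper expand_pages is hypothetical and is not part of PDF.co. It assumes zero-based indices and that an open-ended range such as '7-' runs to the last page; how the service itself interprets the value is defined by the PDF.co documentation.

```python
def expand_pages(pages, page_count):
    """Expand a Pages string such as '0,2-5,7-' into a list of page indices.

    Assumptions (not confirmed by the API): indices are zero-based and an
    open-ended range like '7-' runs to the last page of the document.
    """
    if not pages.strip():
        # An empty string means "all pages"
        return list(range(page_count))

    indices = []
    for part in pages.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            end = int(end_text) if end_text else page_count - 1
            indices.extend(range(start, end + 1))
        else:
            indices.append(int(part))

    # Ignore indices that fall outside the document
    return [i for i in indices if 0 <= i < page_count]


print(expand_pages("0,2-5,7-", 10))  # [0, 2, 3, 4, 5, 7, 8, 9]
```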

The program is logically divided into two functions, uploadFile and convertPdfToXml. As the names suggest, uploadFile uploads the PDF to the PDF.co cloud and returns the public URL of the uploaded file, and convertPdfToXml uses that URL to perform the XML conversion.

Uploading the input PDF file to the PDF.co cloud is a simple, straightforward process. First, we request a pre-signed URL from PDF.co by calling the endpoint /file/upload/get-presigned-url and passing the input file name in the request. The response contains the pre-signed URL and the public URL of the uploaded file. We then use the pre-signed URL to upload the actual file with a PUT request. Please note that files uploaded to the PDF.co cloud are temporary and only available for a few hours.
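
One detail worth handling in the first step is URL encoding: a file name that contains spaces or characters such as & should be encoded before it goes into the query string. The following is a minimal sketch of that idea using only the standard library. The helper name presigned_url_request is hypothetical; the endpoint path and the contenttype and name query parameters are the ones used in the sample.

```python
import os
from urllib.parse import urlencode


def presigned_url_request(base_url, file_name):
    """Build the 'Get Presigned URL' request URL with an encoded query string."""
    query = urlencode({
        "contenttype": "application/octet-stream",
        # Only the file name is sent, never the local directory
        "name": os.path.basename(file_name),
    })
    return f"{base_url}/file/upload/get-presigned-url?{query}"


print(presigned_url_request("https://api.pdf.co/v1", "docs/my report.pdf"))
```

The resulting URL can be passed to requests.get together with the x-api-key header, as uploadFile does.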

The PDF.co API endpoint /pdf/convert/to/xml performs the PDF to XML conversion. For this request, we prepare the request parameters (the uploaded file URL, output file name, pages, and password) and pass the PDF.co API key in the x-api-key request header. When the request completes successfully, the url field of the JSON response contains a link to the generated XML file, which we then download.
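
The response handling can also be separated from the network call, which makes it easier to test. Below is a minimal sketch that works on the decoded JSON response. The names ConversionError and result_url are hypothetical; the error, message, and url fields are the ones the sample reads.

```python
class ConversionError(Exception):
    """Raised when the service reports an error in its JSON response."""


def result_url(payload):
    """Return the result file URL from a decoded response, or raise on error."""
    if payload.get("error"):
        raise ConversionError(payload.get("message", "Unknown error"))
    return payload["url"]


# Example with a hand-written payload (not a real API response)
print(result_url({"error": False, "url": "https://example.com/result.xml"}))
```

In convertPdfToXml, this could be called as result_url(response.json()) once the status code has been checked.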

The generated XML contains the PDF's data as well as other useful properties such as font information and the coordinates of the extracted text.
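
A quick way to get familiar with the downloaded file is to count which elements it contains before writing any extraction logic. The sketch below uses Python's xml.etree.ElementTree and makes no assumption about the real schema. The SAMPLE document is an invented stand-in, not actual PDF.co output, so its tag and attribute names are purely illustrative.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_xml(xml_text):
    """Count how often each element tag occurs in an XML document."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


# Invented stand-in document: NOT the actual PDF.co output schema
SAMPLE = """<document>
  <page index="0">
    <text x="10" y="20" fontName="Arial">Hello</text>
    <text x="10" y="40" fontName="Arial">World</text>
  </page>
</document>"""

for tag, count in summarize_xml(SAMPLE).most_common():
    print(tag, count)
```

For a real result, read result.xml into a string and pass it to summarize_xml.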

The PDF to XML endpoint can be configured to suit our requirements. The following are some of the additional parameters we can provide; please refer to the documentation for more information.

rect: Defines coordinates for data extraction.
lang: Sets the OCR language to be used when extracting data from scanned PDF, PNG, and JPG documents.
async: Runs processing asynchronously. When this parameter is enabled, the request returns a JobId, and the output can be retrieved once that job completes.
inline: If this parameter is enabled, the response contains the XML data directly instead of the URL of that data.
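
As a rough sketch, the optional parameters can be merged into the request parameters with a small helper. Note that async is a reserved word in Python, so it has to be supplied as a dictionary key rather than as a keyword argument. The helper names with_options and wait_for_job are hypothetical, and the example values (such as "eng") are placeholders whose exact format should be checked against the documentation. The polling loop deliberately takes a caller-supplied is_finished function, because the job status call itself is not covered in this article.

```python
import time

OPTIONAL_KEYS = {"rect", "lang", "async", "inline"}


def with_options(parameters, options):
    """Return a copy of parameters extended with the optional settings."""
    unknown = set(options) - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"Unsupported options: {sorted(unknown)}")
    merged = dict(parameters)
    merged.update(options)
    return merged


def wait_for_job(is_finished, job_id, interval=1.0, timeout=60.0):
    """Poll until is_finished(job_id) returns True or the timeout elapses.

    is_finished is supplied by the caller and wraps whatever job status
    request the documentation describes; it is not defined here.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_finished(job_id):
            return True
        time.sleep(interval)
    return False


# Placeholder values for illustration only
parameters = with_options(
    {"url": "https://example.com/sample.pdf", "name": "result.xml"},
    {"lang": "eng", "async": True},
)
print(parameters)
```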

Please try running this sample on your development machine to get more out of this article. Thank you for reading!
