How to Convert PDF to XML from an Uploaded File in Python with PDF.co Web API

Learn how to write code that converts PDF to XML from an uploaded file using the PDF to XML API in Python: a simple how-to tutorial

This documentation is written to help you apply the necessary features on your side. PDF.co Web API supports PDF to XML conversion from Python. It is a flexible Web API that includes a full set of functions, from e-signature requests to data extraction, OCR, image recognition, and PDF splitting. It can also generate barcodes and read barcodes from images, scans, and PDFs.

The sample below shows how to quickly add PDF to XML conversion to a Python application with the help of PDF.co Web API. You can copy and paste this Python sample code into your project, or get the source code from our GitHub repository. Once done, just run the script. Even if the functionality works, please check it with your own data and in your production environment.

A free trial version of PDF.co Web API is available on our website. The trial version also includes other code samples to help you with your Python application.


ConvertPdfToXMLFromUploadedFile.py

import os
import requests # pip install requests

# The authentication key (API Key).
# Get your own by registering at https://app.pdf.co/documentation/api
API_KEY = "******************************************"

# Base URL for PDF.co Web API requests
BASE_URL = "https://api.pdf.co/v1"

# Source PDF file
SourceFile = ".\\sample.pdf"
# Comma-separated list of page indices (or ranges) to process. Leave empty for all pages. Example: '0,2-5,7-'.
Pages = ""
# PDF document password. Leave empty for unprotected documents.
Password = ""
# Destination XML file name
DestinationFile = ".\\result.xml"


def main(args = None):
    uploadedFileUrl = uploadFile(SourceFile)
    if uploadedFileUrl is not None:
        convertPdfToXml(uploadedFileUrl, DestinationFile)


def convertPdfToXml(uploadedFileUrl, destinationFile):
    """Converts PDF To XML using PDF.co Web API"""

    # Prepare request parameters (sent as form fields via the `data` argument)
    # See documentation: https://apidocs.pdf.co
    parameters = {}
    parameters["name"] = os.path.basename(destinationFile)
    parameters["password"] = Password
    parameters["pages"] = Pages
    parameters["url"] = uploadedFileUrl

    # Prepare URL for 'PDF To XML' API request
    url = "{}/pdf/convert/to/xml".format(BASE_URL)

    # Execute request and get response as JSON
    response = requests.post(url, data=parameters, headers={ "x-api-key": API_KEY })
    if (response.status_code == 200):
        json = response.json()

        if json["error"] == False:
            #  Get URL of result file
            resultFileUrl = json["url"]            
            # Download result file
            r = requests.get(resultFileUrl, stream=True)
            if (r.status_code == 200):
                with open(destinationFile, 'wb') as file:
                    for chunk in r:
                        file.write(chunk)
                print(f"Result file saved as \"{destinationFile}\" file.")
            else:
                print(f"Request error: {r.status_code} {r.reason}")
        else:
            # Show service reported error
            print(json["message"])
    else:
        print(f"Request error: {response.status_code} {response.reason}")


def uploadFile(fileName):
    """Uploads file to the cloud"""
    
    # 1. RETRIEVE PRESIGNED URL TO UPLOAD FILE.

    # Prepare URL for 'Get Presigned URL' API request
    url = "{}/file/upload/get-presigned-url?contenttype=application/octet-stream&name={}".format(
        BASE_URL, os.path.basename(fileName))
    
    # Execute request and get response as JSON
    response = requests.get(url, headers={ "x-api-key": API_KEY })
    if (response.status_code == 200):
        json = response.json()
        
        if json["error"] == False:
            # URL to use for file upload
            uploadUrl = json["presignedUrl"]
            # URL for future reference
            uploadedFileUrl = json["url"]

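            # 2. UPLOAD FILE TO CLOUD.
            with open(fileName, 'rb') as file:
                uploadResponse = requests.put(uploadUrl, data=file, headers={ "x-api-key": API_KEY, "content-type": "application/octet-stream" })

            # Only return the file URL if the upload itself succeeded
            if uploadResponse.ok:
                return uploadedFileUrl
            print(f"Upload error: {uploadResponse.status_code} {uploadResponse.reason}")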
        else:
            # Show service reported error
            print(json["message"])    
    else:
        print(f"Request error: {response.status_code} {response.reason}")

    return None


if __name__ == '__main__':
    main()

Now that we've reviewed the source code, let's analyze it a bit.

First, we gather all the information needed for the PDF.co endpoint request that performs the PDF to XML conversion. The API_KEY variable holds the PDF.co API key, which is passed in the x-api-key request header for authentication. We also specify the source PDF file (SourceFile), the page indices (Pages) whose data will be converted to XML, the document password (Password), and the destination (DestinationFile) where the output XML will be stored.
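The sample hard-codes the key, which is fine for a quick test. One possible alternative is to read it from an environment variable. The following is only a sketch: the variable name PDFCO_API_KEY and the helper names are our own assumptions, not something PDF.co defines.

```python
import os

# Hypothetical environment variable name; PDF.co does not prescribe one.
API_KEY_ENV_VAR = "PDFCO_API_KEY"


def load_api_key(env=os.environ):
    """Return the API key from the environment, or raise if it is missing."""
    key = env.get(API_KEY_ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"Set the {API_KEY_ENV_VAR} environment variable first.")
    return key


def auth_headers(api_key):
    """Build the x-api-key header that the sample passes with each API request."""
    return {"x-api-key": api_key}
```

With this in place, the sample's requests could pass headers=auth_headers(load_api_key()) instead of referring to a hard-coded API_KEY.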

The program is logically divided into two functions, uploadFile and convertPdfToXml. As the names suggest, uploadFile uploads the PDF to the PDF.co cloud and returns its public URL, and convertPdfToXml uses that public URL to perform the XML conversion.

Uploading the input PDF file to the PDF.co cloud is a simple, straightforward process. First, we request a pre-signed URL from PDF.co by calling the endpoint /file/upload/get-presigned-url and passing the input file name in the request. The response contains the pre-signed URL and the public URL of the uploaded file. We then use the pre-signed URL to upload the actual file with a PUT request. Please note that files uploaded to the PDF.co cloud are temporary and only available for a few hours.
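The sample puts the file name into the query string as-is, which may cause trouble for names containing spaces or special characters. A minimal sketch of building the same request URL with an encoded query string is shown here; the helper name presigned_url_request is hypothetical, and only the Python standard library is used.

```python
import os
from urllib.parse import quote, urlencode

BASE_URL = "https://api.pdf.co/v1"


def presigned_url_request(file_name, base_url=BASE_URL):
    """Return the 'get-presigned-url' request URL with an encoded query string."""
    query = urlencode(
        {
            "contenttype": "application/octet-stream",
            "name": os.path.basename(file_name),
        },
        quote_via=quote,  # encode spaces as %20 rather than '+'
    )
    return f"{base_url}/file/upload/get-presigned-url?{query}"
```

The returned string can be passed to requests.get together with the x-api-key header, exactly as the sample does with its own url variable.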

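The PDF.co API endpoint /pdf/convert/to/xml performs the PDF to XML conversion. For this request, we prepare the request parameters (name, password, pages, and url), which the sample sends as form fields through the data argument of requests.post, and pass the PDF.co API key in the x-api-key request header. When the request completes successfully, the url field of the JSON response holds a link to the converted XML, which the sample then downloads.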
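The response handling described above can also be isolated in a small helper. This is a sketch based only on the fields the sample reads (error, message, and url); PdfCoError and extract_result_url are hypothetical names introduced for illustration.

```python
class PdfCoError(Exception):
    """Raised when the service reports an error in its JSON response."""


def extract_result_url(payload):
    """Return the result file URL, or raise PdfCoError with the service's message."""
    if payload.get("error"):
        raise PdfCoError(payload.get("message", "Unknown error"))
    return payload["url"]
```

In the sample, this could be called as extract_result_url(response.json()) after checking that the status code is 200. Raising an exception instead of printing lets the calling code decide how to report the failure.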

The generated XML contains the PDF data as well as other useful properties, such as font information and the coordinates of the extracted text.
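To get a first impression of what the result file contains, we can count the element names in it without assuming anything about the schema. This is only a sketch: summarize_xml is a hypothetical helper, and it deliberately makes no claim about which elements PDF.co actually emits.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_xml(xml_data):
    """Count how often each element tag occurs (root included), schema-agnostic."""
    root = ET.fromstring(xml_data)
    return Counter(element.tag for element in root.iter())
```

For a saved result, the file can be read as bytes and passed in, for example summarize_xml(open("result.xml", "rb").read()).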

The PDF to XML endpoint can be configured to suit your requirements. The following are some of the additional parameters you can provide; please refer to the documentation for more information.

rect: Defines the coordinates of the region to extract data from.
lang: Sets the OCR language to use for scanned PDF, PNG, and JPG documents when extracting data from them.
async: Runs processing asynchronously. When this parameter is enabled, the call returns a JobId, and the output can be retrieved once that job completes.
inline: When this parameter is enabled, the response contains the XML data directly instead of a URL pointing to it.
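One way to add these optional parameters to the request used in the sample is sketched here. The helper build_parameters is hypothetical, and sending booleans as lowercase text is an assumption on our part; please check the documentation for the exact value formats each parameter expects.

```python
def build_parameters(uploaded_file_url, name, pages="", password="", options=None):
    """Merge the sample's base parameters with optional ones such as inline or async."""
    parameters = {
        "name": name,
        "password": password,
        "pages": pages,
        "url": uploaded_file_url,
    }
    for key, value in (options or {}).items():
        if value is None:
            continue  # skip options the caller left unset
        if isinstance(value, bool):
            # Assumption: booleans are sent as lowercase text in form fields.
            value = "true" if value else "false"
        parameters[key] = value
    return parameters
```

The options are passed as a dictionary because async is a reserved word in Python and cannot be used as a keyword argument, for example build_parameters(uploadedFileUrl, "result.xml", options={"inline": True}).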

Please try running this sample on your development machine to get more out of this article. Thank you for reading!
