Google Invoice Parser to Read PDF Invoices and Orders with Google Script and PDF.co

Imagine a scenario where we have thousands of PDFs containing invoices of different companies in different formats. Now if we have to prepare reports around these files, we’ll need to know fields such as invoice number, invoice amount, invoice date, and so on. If we try to extract this information manually from x number of invoices, it’ll be a very time-consuming process and error-prone for sure.

Luckily there are few services out there that can help us in these kinds of situations. In this article, we’ll try to extract invoice information from PDF using PDF.co API endpoint named document parser. We’ll create sample solutions in Google Spreadsheet and App Scripts. After we’re finished with this project, we can parse any invoice PDF of any format and we’ll get invoice information right into Spreadsheet.

Review Final Spreadsheet

Below is the spreadsheet we’ve designed and the final output that we’re going to achieve.

Spreadsheet

We have the following highlighted controls here.

  • PDF.co: This is a custom-generated menu using Google App Script. Upon clicking the submenu “Read Invoice Data”, it’ll process the input PDF URL and write all results.
  • PDF.co API Key: It’s the PDF.co API Key, which is required to authenticate PDF.co requests. You will get this API key upon signing up with PDF.co.
  • Invoice Input PDF: URL of input PDF Invoice.

Output

In the below image, we are fetching basic invoice fields from standard invoice PDF URL.

Fetch Basic Invoice Fields Postman

PDF.co API endpoint for document parser is pretty generic and it tries to identify invoice fields using AI. Hence if we pass input PDF documents in different formats it will have all fields and their position within the document will be different.

PDF.co does a fantastic job in extracting and identifying data from divergent invoice formats. For example, take a look at the following image. Here, we are using a new invoice sample and running the same program to identify invoice fields.

Google Script

I believe you’ll be excited to jump right into source code. Before doing that, let’s review the PDF.co endpoint which does the heavy lifting of parsing documents.

PDF.co Endpoint For Document Parser

PDF.co Document parser is an AI-powered API endpoint that can automatically parse PDF, PNG, JPG documents and extract fields, tables, or values. It has an inbuilt template for parsing invoices which we observe in this article. However, we can create our own template based on our requirements and use it to extract values.

API Endpoint

https://api.pdf.co/v1/pdf/documentparser

Sample Input Request

{
    "url": "https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/document-parser/sample-invoice.pdf",
    "outputFormat": "JSON",
    "templateId": "1",
    "async": false,
    "encrypt": "false",
    "inline": "true",
    "password": "",
    "profiles": "",
    "storeResult": false
}

Sample Output

{
    "body": {
        "objects": [
            {
                "name": "companyName",
                "objectType": "field",
                "value": "ACME Inc.",
                "rectangle": [
                    0.0,
                    0.0,
                    0.0,
                    0.0
                ]
            }
...

Output PDF

Here is the sample output PDF.

Output PDF

Note: Go here for our Postman collection of PDF.co APIs.

You can follow the links below to know more about template generation or Document Parser API itself.

Now we’re ready to dive into the Google Apps source code.

Source Code

/* Initial Declarations */
let Cell_PDFcoApiKey, PDFCoApiKey, Cell_InputPDFUrl, InputPDFUrl, Cell_CompanyName, Cell_BillTo, Cell_InvoiceNo, Cell_InvoiceDate, Cell_DueDate, Cell_SubTotal, Cell_Tax, Cell_TotalAmount, Cell_Error;

/* 
 * Handle PDF.co Menu on Google Sheet Open Event 
 */
function onOpen() {
  var spreadsheet = SpreadsheetApp.getActive();
  var menuItems = [
    {name: 'Read Invoice Data', functionName: 'ParseInvoiceData'} 
  ];
  spreadsheet.addMenu('PDF.co', menuItems);
}

/**
 * Parse Invoice Data and put it into a spreadsheet
 */
function ParseInvoiceData() {
  // Initial Cell reference and clean
  AssignInitialCellRefAndClean();

  if(!ValidateInputValues()){
    return;
  }

  // Prepare Payload
  var data = {
    "url": InputPDFUrl,
    "outputFormat": "JSON",
    "templateId": "1",
    "async": false,
    "encrypt": "false",
    "inline": "true",
  };

  // Prepare Request Options
  var options = {
    'method' : 'post',
    'contentType': 'application/json',
    'headers': {
      "x-api-key": PDFCoApiKey
    },
    // Convert the JavaScript object to a JSON string.
    'payload' : JSON.stringify(data)
  };
  
  // Get Response
  // https://developers.google.com/apps-script/reference/url-fetch
  var pdfCoResponse = UrlFetchApp.fetch('https://api.pdf.co/v1/pdf/documentparser', options);

  var pdfCoRespContent = pdfCoResponse.getContentText();
  var pdfCoRespJson = JSON.parse(pdfCoRespContent);

  // Display Result
  if(!pdfCoRespJson.error){
    Cell_CompanyName.setValue(ExtractFieldFromResponse(pdfCoRespJson, "companyName"));
    Cell_BillTo.setValue(ExtractFieldFromResponse(pdfCoRespJson, "companyName2"));
    Cell_InvoiceNo.setValue(ExtractFieldFromResponse(pdfCoRespJson, "invoiceId"));
    Cell_InvoiceDate.setValue(ExtractFieldFromResponse(pdfCoRespJson, "dateIssued"));
    Cell_DueDate.setValue(ExtractFieldFromResponse(pdfCoRespJson, "dateDue"));
    Cell_SubTotal.setValue(ExtractFieldFromResponse(pdfCoRespJson, "subTotal"));
    Cell_Tax.setValue(ExtractFieldFromResponse(pdfCoRespJson, "tax"));
    Cell_TotalAmount.setValue(ExtractFieldFromResponse(pdfCoRespJson, "total"));
  }
  else{
    Cell_Error.setValue(pdfCoRespJson.message);
  }
}

/**
 * Validate Input Values before request
 */
function ValidateInputValues(){
  if(InputPDFUrl === ""){
    Cell_Error.setValue("Please Provice Input Invoice PDF URL");
    return false;
  }

  if(PDFCoApiKey === ""){
    Cell_Error.setValue("Please Provice PDF.co API Key");
    return false;
  }

  Cell_Error.setValue("");
  return true;
}

/**
 * Extract field value from JSON response
 */
function ExtractFieldFromResponse(respJson, fieldName){
  let retVal = "";

  if(respJson && respJson.body && respJson.body.objects && respJson.body.objects.length > 0){
    const objectField = respJson.body.objects.filter(x => x.name === fieldName && x.objectType === "field");
    if(objectField && objectField.length > 0){
      retVal = objectField[0].value;
    }
  }

  return retVal;
}

/**
 * Assign Initial References
 */
function AssignInitialCellRefAndClean(){
  var spreadsheet = SpreadsheetApp.getActive();

  // Assign Cell References
  Cell_PDFcoApiKey = spreadsheet.getRange("B1");
  PDFCoApiKey = Cell_PDFcoApiKey.getValue();

  Cell_InputPDFUrl = spreadsheet.getRange("B2");
  InputPDFUrl = Cell_InputPDFUrl.getValue();

  Cell_CompanyName = spreadsheet.getRange("A5");
  Cell_BillTo = spreadsheet.getRange("B5");
  Cell_InvoiceNo = spreadsheet.getRange("C5");
  Cell_InvoiceDate = spreadsheet.getRange("D5");
  Cell_DueDate = spreadsheet.getRange("E5");
  Cell_SubTotal = spreadsheet.getRange("F5");
  Cell_Tax = spreadsheet.getRange("G5");
  Cell_TotalAmount = spreadsheet.getRange("H5");
  Cell_Error = spreadsheet.getRange("A3:H3");


  Cell_CompanyName.setValue("");
  Cell_BillTo.setValue("");
  Cell_InvoiceNo.setValue("");
  Cell_InvoiceDate.setValue("");
  Cell_DueDate.setValue("");
  Cell_SubTotal.setValue("");
  Cell_Tax.setValue("");
  Cell_TotalAmount.setValue("");
  Cell_Error.setValue("");

}

I believe source code is very easy to understand, let’s review it briefly.

Review Source Code

This sample demonstrates how we can use PDF.co API endpoint for extracting PDF right inside our Google Spreadsheet using AppScript.

We took some predefined cells and assumed that they would contain a specific value. For example, we are using cell “B1” to get PDF.co API value. We used a similar approach to get input values and display output. I am sure we can achieve this better in different ways, but it works for our example.

With that said, we are declaring variables initially for these input/output cells.

/* Initial Declarations */
let Cell_PDFcoApiKey, PDFCoApiKey, Cell_InputPDFUrl, InputPDFUrl, Cell_CompanyName, Cell_BillTo, Cell_InvoiceNo, Cell_InvoiceDate, Cell_DueDate, Cell_SubTotal, Cell_Tax, Cell_TotalAmount, Cell_Error;

These variables are allocated in the method named “AssignInitialCellRefAndClean”. It also clears the value of output cells.

When the spreadsheet is loaded/opened by the user an event “onOpen“ is executed. Hence whatever code is written inside this method is getting called initially. We’re building a menu named “PDF.co” and it’s submenu “Read Invoice Data” then.

var menuItems = [
    {name: 'Read Invoice Data', functionName: 'ParseInvoiceData'} 
  ];

Please note that this menu is referring to a function named “ParseInvoiceData”. Hence whenever a sub-menu named “Read Invoice Data” is clicked, it’ll execute the function “ParseInvoiceData”. Nice wiring!

Function “ParseInvoiceData” is the heart of this program! We’re preparing the PDF.co execution payload and handling it’s response here. Also, API request specification is done here. Note that we’re passing the PDF.co API key in the request header. This is very important, as the API key is responsible for the authentication of your request.

  // Prepare Request Options
  var options = {
    'method' : 'post',
    'contentType': 'application/json',
    'headers': {
      "x-api-key": PDFCoApiKey
    },
    // Convert the JavaScript object to a JSON string.
    'payload' : JSON.stringify(data)
  };

Once we get a response, we parse it using the function “ExtractFieldFromResponse” and fill output cells with it.

With that, we have a job done here! Great!

Summary

In this article, we learned how to leverage PDF.co to simply input a PDF URL into a spreadsheet cell, click a menu, and get all the information extracted and nicely placed at their respective places.

Useful links from this article:

Video Guide