Automate Text Extraction from Image or Scanned PDF Receipts using PDF.co and Zapier

Sep 2, 2024·6 Minutes Read

In this tutorial, we’ll demonstrate how to automate data extraction from scanned PDF receipts using PDF.co and Zapier. Below is an example of the PDF receipt we’ll be using in this demo.

Sample PDF Receipt
Sample PDF Receipt

Step 1: Google Drive New File in Folder

Use the Google Drive "New File in Folder" trigger to obtain the specific scanned PDF receipt.

Google Drive New File in Folder

Step 2: Setup Google Drive Configuration

  • Select "My Google Drive" as the drive for this automation.
  • Next, choose the specific folder within your Google Drive where the PDF file is stored.
Setup Google Drive Configuration

Step 3: Test Trigger

Test the trigger to ensure it correctly obtained the specific file in your Google Drive folder.

Test Trigger

Step 4: Add PDF.co App

Next, let's add the PDF.co app and utilize the Document Parser feature to extract data from the scanned PDF receipt.

Add PDF.co App

Step 5: Setup PDF.co Configuration

  • In the Source File URL field, select the File property to retrieve files, allowing PDF.co to access the necessary file for processing.
  • For the Template Id, enter the Id associated with the receipt’s template. You can find a guide on how we created the template for this document below.
Setup PDF.co Configuration

Step 6: Test PDF.co Action

Now, let’s move on to testing the action to extract data from the scanned PDF receipt.

Test PDF.co Action

Step 7: Add Google Sheets

Now, let's add the Google Sheets module and select the Create Multiple Spreadsheet Row action to save all the extracted information to Google Sheets.

Add Google Sheets

Step 8: Configure Create Multiple Spreadsheet Row

  • Select "My Google Drive" as the designated drive.
  • Specify the name of the Spreadsheet and Worksheet where you want to save the extracted information.
  • Then, map all the extracted data to the appropriate columns in the spreadsheet.
Configure Create Multiple Spreadsheet Row
 Configure Create Multiple Spreadsheet Row 2

Step 9: Test File Upload

Test the Create Spreadsheet Row action by sending a request to Google Sheets to ensure the data is correctly added to the spreadsheet.

Test File Upload

Step 10: Open Google Sheets Spreadsheet

After a successful test, open the Google Sheets spreadsheet to review the extracted information that has been added.

Open Google Sheets Spreadsheet
Extracted information added to the Google Sheet

Step 11: Template Creation Guide

In this step, we will teach you how to create the template for this specific scanned PDF receipt.

Step 1: Open PDF.co Account

First, go to your PDF.co account and click on the Document Parser Manager. On the top right, click on the Create New Template to open the Online Template Editor. Here’s a direct link: https://app.pdf.co/document-parser/templates/new

Step 2: Template Editor

Next, click on the Load Test PDF or Image button to open the scanned PDF receipt. You can either copy and paste the sample template in the Edit Template to run the template right away or start from scratch.

To start from scratch, click on the +Add Object button and select the Add FIELD based on TEXT SEARCH. This is the object that we will use to parse all the non-table text such as Company Name, Receipt #, etc.

Template Editor

Step 3: Parse Company Name

To get the Company Name, we can use the $$funcFindCompany special function. This will find the first company name that it encounters in the document. You can add it in the Expression field. Make sure to check the Regex box every time you use the special functions, macros, and regular expressions.

Parse Company Name

Step 4: Parse Receipt Number

To extract the Receipt #, enter RECEIPT{{Spaces}}#{{Spaces}}(?<value>{{AnythingGreedy}}) in the Expression field and check the "Regex" box.

Parse Receipt Number

Step 5: Parse Bill To Value

To extract the "Bill To" value, use {{LineStart}}{{Spaces}}(?<value>{{SentenceWithSingleSpaces}}){{Spaces}}{{SentenceWithSingleSpaces}}{{Spaces}}RECEIPT DATE in the Expression field.

Parse Bill To Value

Step 6: Parse Ship To Value

To extract the "Ship To" value, use {{LineStart}}{{Spaces}}{{SentenceWithSingleSpaces}}{{Spaces}}(?<value>{{SentenceWithSingleSpaces}}){{Spaces}}RECEIPT DATE in the Expression field.

Parse Ship To Value

Step 7: Extract Table Items

To extract the table items, we'll use the ADD TABLE field based on the TEXT SEARCH object. Enter the below Expression to capture all the items, then click the "Run Template" button to view the results.

{
  "start": {
    "expression": "QTY{{Spaces}}DESCRIPTION",
    "regex": true
  },
  "end": {
    "expression": "Subtotal{{Spaces}}{{Number}}",
    "regex": true
  },
  "row": {
    "expression": "{{LineStart}}{{Spaces}}(?{{Digits}}){{Spaces}}(?{{SentenceWithSingleSpaces}}){{Spaces}}(?{{Number}}){{Spaces}}(?{{Number}})",
    "regex": true
  },
  "columns": [
    {
      "name": "qty",
      "dataType": "integer"
    },
    {
      "name": "description",
      "dataType": "string"
    },
    {
      "name": "unitPrice",
      "dataType": "decimal"
    },
    {
      "name": "amount",
      "dataType": "decimal"
    }
  ]
}
Extract Table Items

After completing the extraction process, save the template by clicking on the "Save Template and Return" button. Then, retrieve the Template Id to use in your automation workflow.

In this tutorial, you learned how to automate the text extraction in a scanned PDF receipt using PDF.co and Zapier. You also learned how to use the different Document Parser objects, special function and macros to extract specific text and table items.

Related Tutorials

See Related Tutorials