Automate Text Extraction from Image or Scanned PDF Receipts using PDF.co and Zapier
In this tutorial, we will show you how to automate text extraction from scanned PDF receipts using PDF.co and Zapier, which is useful when you receive hundreds of receipts a day.
Step 1: Sample Scanned Receipt
We will extract everything in this scanned PDF receipt except for the addresses. To follow along, you can get the sample files here.
![Sample Scanned PDF Receipt](/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2F2niroq9z%2Fproduction%2Ffe7eee6b3281d54d2aff835cddc6d81ad2442e7d-669x673.png&w=1920&q=75)
Step 2: Set Up PDF.co
We stored our scanned PDF receipt in a Google Drive folder, and we assume that you have already set up the Zapier Trigger step.
We will jump straight to the Action step. Choose PDF.co as the App and Document Parser as the Action Event.
![Use PDF.co Document Parser To Extract Text From Receipt](/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2F2niroq9z%2Fproduction%2Fee4b29f646280c393ca6b7ebb60eb04c1b1568d4-891x443.png&w=1920&q=75)
Step 3: Configure Document Parser
Let’s set up the Document Parser.
- In the Input field, select the scanned PDF receipt link.
- In the Template Id field, enter the ID of the receipt’s template. Step 6 explains how we made the template for this document.
![Configure Document Parser](/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2F2niroq9z%2Fproduction%2F964d4fb9f5b2e53ba91b81982bf22bc1382bd95e-889x463.png&w=1920&q=75)
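For readers who want to see what these two fields amount to outside Zapier, here is a minimal Python sketch that builds, but does not send, a comparable request. The endpoint URL, the `x-api-key` header, and the `url` and `templateId` field names are assumptions that should be checked against the PDF.co API documentation. The helper `build_parser_request` and all values passed to it are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint; confirm it in the PDF.co API documentation.
PDFCO_PARSER_URL = "https://api.pdf.co/v1/pdf/documentparser"


def build_parser_request(api_key, file_url, template_id):
    """Build (but do not send) a Document Parser request."""
    payload = {
        "url": file_url,            # corresponds to the Input field
        "templateId": template_id,  # corresponds to the Template Id field
    }
    return urllib.request.Request(
        PDFCO_PARSER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values for illustration only.
request = build_parser_request(
    "YOUR_API_KEY", "https://example.com/receipt.pdf", "123"
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it. In the Zapier flow, the PDF.co action takes care of this step for you.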
Step 4: Test Document Parser
Let’s send our configuration to PDF.co to make sure that we set it up correctly.
![Test Document Parser Configuration](/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2F2niroq9z%2Fproduction%2F58056ce0db49e7aec75f37adf7507186f89ff61f-895x469.png&w=1920&q=75)
Step 5: Parsed Scanned PDF
Great! PDF.co processed our request successfully and returned the parsed text from the scanned PDF receipt.
![Parsed Text From Scanned PDF Receipt](/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2F2niroq9z%2Fproduction%2Fce9f39ba40aee1d0244f45a19df48d859009101f-889x563.png&w=1920&q=75)
Step 6: Template Creation Guide
In this step, we will teach you how to create the template for this specific scanned PDF receipt.
First, go to your PDF.co account and click on the Document Parser. On the top right, click on the New Template link to open the Online Template Editor. Here’s a direct link: https://app.pdf.co/document-parser/templates/new
Next, click on the Load Test PDF or Image button to open the scanned PDF receipt. You can either copy and paste the sample template in the Edit Template to run the template right away or start from scratch.
![Online Template Editor](/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2F2niroq9z%2Fproduction%2F99017e09b94b642efc7ef44d29df0e8020f81133-567x377.png&w=1200&q=75)
Then, click on the +Add Object button and select the Add FIELD based on TEXT SEARCH. This is the object that we will use to parse all the non-table text such as Company Name, Receipt #, etc.
To get the Company Name, we can use the `$$funcFindCompany` special function. It returns the first company name that it encounters in the document. Add it to the Expression field. Make sure to check the Regex box every time you use special functions, macros, or regular expressions.
![Parse Company Name](/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2F2niroq9z%2Fproduction%2F448f591b6a5d387ca781221c2f1e9cb1e40f63cd-995x455.png&w=2048&q=75)
To get the Receipt #, add `RECEIPT{{Spaces}}#{{Spaces}}(?<value>{{AnythingGreedy}})` in the Expression field and check the Regex box.
![Parse Receipt Number](/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2F2niroq9z%2Fproduction%2Ffc351396d4f79e4b9f070726cde8f71647221a25-1173x499.png&w=3840&q=75)
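To test the idea behind the Receipt # expression outside the template editor, a rough Python equivalent can help. In this sketch, `\s*` and `.*` are assumed stand-ins for the `{{Spaces}}` and `{{AnythingGreedy}}` macros, not their official expansions, and the sample text is invented. Note that Python writes named groups as `(?P<value>...)` rather than `(?<value>...)`.

```python
import re

# Assumed approximations of the PDF.co macros, for illustration only.
SPACES = r"\s*"
ANYTHING_GREEDY = r".*"

RECEIPT_NO = re.compile(
    rf"RECEIPT{SPACES}#{SPACES}(?P<value>{ANYTHING_GREEDY})"
)


def find_receipt_number(text):
    """Return the text that follows 'RECEIPT #' on the same line, or None."""
    match = RECEIPT_NO.search(text)
    return match.group("value").strip() if match else None


# Invented sample text; a real receipt may be laid out differently.
sample = "RECEIPT #   US-001\nRECEIPT DATE   11/02/2019"
print(find_receipt_number(sample))  # US-001
```

Because `.` does not match a newline by default, the greedy capture stops at the end of the line that contains the label.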
Getting the Bill To name and the Ship To name is a bit more involved, because both names sit on the same line, ahead of the RECEIPT DATE label. Each expression matches both names but captures only one of them. Use `{{LineStart}}{{Spaces}}(?<value>{{SentenceWithSingleSpaces}}){{Spaces}}{{SentenceWithSingleSpaces}}{{Spaces}}RECEIPT DATE` to get the Bill To name and `{{LineStart}}{{Spaces}}{{SentenceWithSingleSpaces}}{{Spaces}}(?<value>{{SentenceWithSingleSpaces}}){{Spaces}}RECEIPT DATE` to get the Ship To name.
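The two-name pattern can be sketched in Python as follows. The `SENTENCE` pattern is an assumed approximation of `{{SentenceWithSingleSpaces}}` (words separated by exactly one space), and the sample text is invented. The sketch captures both names in one pass instead of using two expressions.

```python
import re

# Assumed stand-in for {{SentenceWithSingleSpaces}}: words joined by one space.
SENTENCE = r"\S+(?: \S+)*"

NAMES = re.compile(
    rf"^[ \t]*(?P<bill_to>{SENTENCE})[ \t]+(?P<ship_to>{SENTENCE})[ \t]+RECEIPT DATE",
    re.MULTILINE,
)


def find_names(text):
    """Return (bill_to, ship_to) from the line that ends in RECEIPT DATE."""
    match = NAMES.search(text)
    return (match.group("bill_to"), match.group("ship_to")) if match else None


# Invented sample text; wide gaps separate the two columns.
sample = (
    "BILL TO            SHIP TO\n"
    "  John Smith       Jane Doe         RECEIPT DATE   11/02/2019\n"
)
print(find_names(sample))  # ('John Smith', 'Jane Doe')
```

The approach relies on the gap between the two names being wider than a single space. If the names were separated by only one space, they would merge into a single sentence and the pattern would no longer tell them apart.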
To get the table items, we will use the ADD TABLE field based on TEXT SEARCH object. Add the following to the Expression field to get all the items and click the Run Template button to see the result.
{
  "start": {
    "expression": "QTY{{Spaces}}DESCRIPTION",
    "regex": true
  },
  "end": {
    "expression": "Subtotal{{Spaces}}{{Number}}",
    "regex": true
  },
  "row": {
    "expression": "{{LineStart}}{{Spaces}}(?<qty>{{Digits}}){{Spaces}}(?<description>{{SentenceWithSingleSpaces}}){{Spaces}}(?<unitPrice>{{Number}}){{Spaces}}(?<amount>{{Number}})",
    "regex": true
  },
  "columns": [
    {
      "name": "qty",
      "dataType": "integer"
    },
    {
      "name": "description",
      "dataType": "string"
    },
    {
      "name": "unitPrice",
      "dataType": "decimal"
    },
    {
      "name": "amount",
      "dataType": "decimal"
    }
  ]
}
![Parse Table Items](/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2F2niroq9z%2Fproduction%2F121bcf7c2badca3d18f59473daf155fdc7ffd2e5-1251x525.png&w=3840&q=75)
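The start, end, and row logic of the table object can be approximated in plain Python to show how the three expressions work together. This is a sketch under stated assumptions, not PDF.co's implementation: the `SENTENCE` and `NUMBER` patterns are guesses at the macros, the sample lines are invented, and the sketch names its regex groups after the four columns.

```python
import re
from decimal import Decimal

# Assumed approximations of the PDF.co macros, for illustration only.
SENTENCE = r"\S+(?: \S+)*"       # words separated by single spaces
NUMBER = r"\d[\d,]*(?:\.\d+)?"   # e.g. 100.00 or 1,250.50

START = re.compile(r"QTY\s*DESCRIPTION")
END = re.compile(rf"Subtotal\s*{NUMBER}")
ROW = re.compile(
    rf"^[ \t]*(?P<qty>\d+)[ \t]+(?P<description>{SENTENCE})[ \t]+"
    rf"(?P<unitPrice>{NUMBER})[ \t]+(?P<amount>{NUMBER})[ \t]*$"
)


def parse_items(text):
    """Collect rows found between the header line and the Subtotal line."""
    rows, inside = [], False
    for line in text.splitlines():
        if not inside:
            inside = bool(START.search(line))  # header line opens the table
            continue
        if END.search(line):                   # Subtotal line closes it
            break
        match = ROW.match(line)
        if match:
            rows.append({
                "qty": int(match.group("qty")),
                "description": match.group("description"),
                "unitPrice": Decimal(match.group("unitPrice").replace(",", "")),
                "amount": Decimal(match.group("amount").replace(",", "")),
            })
    return rows


# Invented sample text; a real receipt may be laid out differently.
sample = """\
QTY   DESCRIPTION                   UNIT PRICE   AMOUNT
1     Front and rear brake cables   100.00       100.00
2     New set of pedal arms         15.00        30.00
3     Labor 3hrs                    5.00         15.00
Subtotal   145.00
"""
for item in parse_items(sample):
    print(item)
```

The type conversions mirror the `dataType` values in the template: `integer` becomes `int`, and `decimal` becomes `Decimal`, which avoids floating-point rounding on currency amounts.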
You can save the template and get the Template ID by clicking on the Save Template and Return button.
In this tutorial, you learned how to automate text extraction from scanned PDF receipts using PDF.co and Zapier. You also learned how to use the different Document Parser objects, special functions, and macros to extract specific text and table items.