Extract Table with Text from PDF (Node.js) in JavaScript using PDF.co Web API
In this tutorial, we will show you how to extract tables from a PDF with JavaScript using the PDF.co Web API.
Introduction
This tutorial will teach you how to extract a table from a PDF with NodeJS.
Below is the image of the source PDF invoice and the extracted table with text output in JSON format.
Steps
The following steps explain how to set up your environment, walk through the code, and show how to run the program.
Source Code and Template
To begin extracting tables from a PDF, open Visual Studio Code (or your favorite editor) and save the following files.
Note: You can also download a zip bundle from the page with the source code and template.
Install Requests Module
Because the JavaScript code relies on uploading files to PDF.co internal storage, we need to install a Node module to handle the requests.
- Navigate to where you downloaded the files, or to the folder where you want to use the project, and install the required node modules.
- To install the requests module, type npm install requests in your command line interface (CLI).
You will notice that a new node_modules folder has appeared. Don't worry, this is expected!
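If you want to confirm that the install step worked before running the main script, a small check like the one below can help. It is only a sketch: the helper name isModuleInstalled is made up for this example, and the module name is the one given in the install step above.

```javascript
// Hypothetical helper (not part of the downloaded sample code).
// Returns true when Node can resolve the module from this script's location.
function isModuleInstalled(name) {
  try {
    require.resolve(name);
    return true;
  } catch (err) {
    return false;
  }
}

// Module name taken from the install step in this tutorial.
const moduleName = "requests";

if (isModuleInstalled(moduleName)) {
  console.log(`${moduleName} is installed`);
} else {
  console.log(`${moduleName} was not found; run "npm install ${moduleName}" in this folder`);
}
```

Save it next to the sample files and run it with node from the project folder, so that Node looks in the same node_modules folder as the main script.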
Insert API Key
On line 12 in the JavaScript file, insert your API key inside the double quotes. You can get the API key in your PDF.co Dashboard.
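Pasting the key into the file is fine for a quick test. If you would rather keep it out of the source code, one common alternative is to read it from an environment variable. The sketch below assumes a variable called PDFCO_API_KEY and a helper called loadApiKey; both names are made up for this example and do not appear in the sample code.

```javascript
// Hypothetical helper: the sample script keeps the key in a string instead.
// PDFCO_API_KEY is an assumed variable name, not something PDF.co requires.
function loadApiKey(env = process.env) {
  const key = (env.PDFCO_API_KEY || "").trim();
  if (!key) {
    throw new Error("Set PDFCO_API_KEY to the API key from your PDF.co Dashboard");
  }
  return key;
}

// Usage (uncomment once the variable is set in your shell):
// const API_KEY = loadApiKey();
```

Failing early with a clear message makes a missing key easier to spot than a rejected request later on.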
Source and Destination File
On line 15, add your source PDF file, then on line 19 type your desired output filename. Aside from JSON output, you can also extract tables with text in CSV and XML formats.
Note: On Mac & Linux systems, filenames do not require the leading ./ prefix.
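To avoid thinking about the ./ prefix at all, you can build absolute paths with Node's built-in path module. The sketch below also picks the file extension from the output format mentioned above (JSON, CSV or XML). The helper name destinationFile and the base name result are assumptions for illustration only.

```javascript
const path = require("path");

// Output formats mentioned in this tutorial.
const SUPPORTED_FORMATS = ["json", "csv", "xml"];

// Hypothetical helper: builds an absolute output path such as /project/result.csv.
function destinationFile(baseName, format, baseDir = process.cwd()) {
  const fmt = String(format).toLowerCase();
  if (!SUPPORTED_FORMATS.includes(fmt)) {
    throw new Error(`Unsupported output format: ${format}`);
  }
  return path.resolve(baseDir, `${baseName}.${fmt}`);
}

console.log(destinationFile("result", "json"));
```

Because path.resolve returns an absolute path, the same code behaves the same way on Windows, Mac and Linux.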
Add Template
On line 96, check the path to the template file. The Document Parser supports both JSON and YAML (.yml) template formats. For more details about the Document Parser, check out this page.
Note: On Mac & Linux systems, the leading ./ before the filename is not required.
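A wrong template path is an easy mistake to make, so it can be worth checking the file before sending anything to the API. Here is a minimal sketch that detects the template format from the file extension and reads the file as text. The function names templateFormat and loadTemplate are made up for this example; how the sample script actually loads the template may differ.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: maps a template file extension to its format.
function templateFormat(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  if (ext === ".json") return "json";
  if (ext === ".yml" || ext === ".yaml") return "yaml";
  throw new Error(`Unsupported template extension: ${ext || "(none)"}`);
}

// Hypothetical helper: reads the template and fails early if it is unusable.
function loadTemplate(filePath) {
  const format = templateFormat(filePath);
  const text = fs.readFileSync(filePath, "utf8"); // throws if the path is wrong
  if (format === "json") {
    JSON.parse(text); // throws on malformed JSON; the parsed value is not used here
  }
  return { format, text };
}
```

YAML templates are only read as text here, since Node has no built-in YAML parser.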
Run JavaScript Program
To run the program, simply type node app.js in the command line interface (CLI).
At this point you should see the resulting output file (result.json) with the table data extracted!
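To take a quick look at what was written, you can load the output file in a second small script. The sketch below makes no assumption about the structure of the extracted data: it only reports whether the file exists and what the top level of the JSON looks like. The function names are made up for this example.

```javascript
const fs = require("fs");

// Describes parsed JSON without assuming any particular schema.
function describeJson(value) {
  if (Array.isArray(value)) {
    return `array with ${value.length} item(s)`;
  }
  if (value !== null && typeof value === "object") {
    return `object with keys: ${Object.keys(value).join(", ")}`;
  }
  return `${typeof value} value`;
}

// Hypothetical helper: summarizes the output file produced by app.js.
function inspectResultFile(filePath) {
  if (!fs.existsSync(filePath)) {
    return `${filePath} not found; check the console output of app.js for errors`;
  }
  const parsed = JSON.parse(fs.readFileSync(filePath, "utf8"));
  return describeJson(parsed);
}

console.log(inspectResultFile("result.json"));
```

If you chose CSV or XML output instead, open the file in a text editor, since this sketch only handles JSON.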
Use Cases - Extract Specific Data from Invoices
Here are a few use cases for the extraction of specific data from invoices:
Accounts Payable Automation
By extracting data such as invoice numbers, dates, vendor names, and amounts from scanned invoices, companies can streamline the process of paying bills, reducing errors and saving time.
Expense Tracking
Extracting data from receipts and invoices for business expenses helps individuals keep track of their spending and ensure that they accurately record expenses for tax and accounting purposes.
Compliance
Many industries have strict regulations around financial reporting and record-keeping. By extracting data from scanned invoices, companies can ensure that they are complying with these regulations and avoid fines and penalties.
Business Intelligence
By extracting data from scanned invoices and other financial documents, companies can gain insights into their business performance. For example, they may be able to identify trends in spending, compare the profitability of different products or services, or assess the financial health of their business.
Video Guide
Here’s a short demo guide showing how to extract tables from a PDF with JavaScript using the PDF.co Web API with NodeJS.