Extract Table with Text from PDF (Node.js) in JavaScript using PDF.co Web API

In this tutorial, we will show you how to extract tables from a PDF with JavaScript using the PDF.co Web API.

Introduction

This tutorial will teach you how to extract a table from a PDF with Node.js.

Below is the image of the source PDF invoice and the extracted table with text output in JSON format.

Steps

The following steps explain how to set up your environment, walk through the code, and show how to run the program.

Step 1: Source Code and Template

To begin extracting tables from a PDF, open Visual Studio Code (or your favorite editor) and save the following files.

Note: You can also download a zip bundle from the page with the source code and template.

Step 2: Install Requests Module

Because the JavaScript code uploads files to PDF.co's internal storage, we need to install a Node module to handle the HTTP requests.

  • Navigate to where you downloaded the files or where you want to use the project and install the required node modules.
  • To install the request module, type npm install request in your command line interface (CLI).

You will notice that a new node_modules folder has appeared. Don't worry, this is expected!
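
To see why an HTTP client is needed, here is a minimal sketch of the first upload request. It assumes the API exposes a presigned-URL endpoint at /v1/file/upload/get-presigned-url and authenticates with an x-api-key header. Both details, and the function name presignedUrlRequest, are assumptions for illustration, so check them against the downloaded source code and the PDF.co documentation. The sketch only builds the request; it does not send it.

```javascript
const path = require("path");

// Hypothetical helper: describes the request that asks PDF.co for an upload URL.
// The endpoint path and the "x-api-key" header are assumptions to verify.
function presignedUrlRequest(apiKey, localFile) {
  // Only the file name (not the local directory) is sent to the API.
  const query = new URLSearchParams({ name: path.basename(localFile) });
  return {
    endpoint: `https://api.pdf.co/v1/file/upload/get-presigned-url?${query}`,
    options: { method: "GET", headers: { "x-api-key": apiKey } },
  };
}
```

On Node.js 18 or later, the returned endpoint and options could be passed to the built-in fetch function; the sample code may use a different HTTP client.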

Step 3: Insert API Key

On line 12 of the JavaScript file, insert your API key between the double quotes. You can get the API key from your PDF.co Dashboard.
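
If you would rather not store the key in the source file, one option is to read it from an environment variable. The sketch below is not part of the downloaded sample: the variable name PDFCO_API_KEY and the helper resolveApiKey are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: reads the API key from an environment variable.
// PDFCO_API_KEY is an assumed name, not something PDF.co defines.
function resolveApiKey(env = process.env) {
  const key = (env.PDFCO_API_KEY || "").trim();
  if (!key) {
    throw new Error("No API key found. Set PDFCO_API_KEY or paste the key into the script.");
  }
  return key;
}
```

With this in place, the key assignment in the script could become a call to resolveApiKey(), and the program would be started with the variable set in your shell.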

Step 4: Source and Destination File

On line 15, add your source PDF file, then on line 19 type your desired output filename. Aside from JSON output, you can also extract tables with text in CSV and XML formats.

Note: On Mac & Linux systems, the leading ./ before the filenames is not required.
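
As a small illustration of pairing the source file with an output filename, the sketch below derives the destination name from the chosen format. The helper destinationFor and the extension table are hypothetical. They cover only the three formats named above, and the exact format values the API expects may differ.

```javascript
const path = require("path");

// Output formats mentioned in this tutorial, mapped to file extensions.
const EXTENSIONS = { JSON: ".json", CSV: ".csv", XML: ".xml" };

// Hypothetical helper: builds an output filename next to the source PDF.
function destinationFor(sourceFile, format = "JSON") {
  const ext = EXTENSIONS[String(format).toUpperCase()];
  if (!ext) {
    throw new Error(`Unsupported output format: ${format}`);
  }
  const base = path.basename(sourceFile, path.extname(sourceFile));
  return path.join(path.dirname(sourceFile), base + ext);
}

console.log(destinationFor("sample-invoice.pdf", "csv")); // sample-invoice.csv
```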

Step 5: Add Template

On line 96, check that the path to the template file is correct. The Document Parser supports both JSON and YAML template formats. For more details about the Document Parser, check out this page.

Note: On Mac & Linux systems the leading ./ before the filename is not required.
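
The sketch below shows one way the template could be loaded and attached to a parse request. It is an assumption-laden outline, not the sample's actual code: the endpoint /v1/pdf/documentparser, the x-api-key header, and the body fields url, template, and outputFormat reflect our reading of the PDF.co API and should be verified, and the helper names loadTemplate and buildParserRequest are hypothetical.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: reads a template file and runs a basic sanity check.
function loadTemplate(templatePath) {
  const ext = path.extname(templatePath).toLowerCase();
  if (![".json", ".yml", ".yaml"].includes(ext)) {
    throw new Error(`Unexpected template extension: ${ext || "(none)"}`);
  }
  const text = fs.readFileSync(templatePath, "utf8");
  if (ext === ".json") {
    JSON.parse(text); // throws a SyntaxError if the JSON template is malformed
  }
  return text;
}

// Hypothetical helper: assembles (but does not send) the parse request.
// Endpoint, header name, and body field names are assumptions to verify.
function buildParserRequest({ apiKey, fileUrl, templateText, outputFormat = "JSON" }) {
  return {
    endpoint: "https://api.pdf.co/v1/pdf/documentparser",
    options: {
      method: "POST",
      headers: { "x-api-key": apiKey, "Content-Type": "application/json" },
      body: JSON.stringify({ url: fileUrl, template: templateText, outputFormat }),
    },
  };
}
```

Checking the template locally before uploading anything gives a clearer error message than a failed API call when the path on line 96 is wrong.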

Step 6: Run JavaScript Program

To run the program, simply type node app.js in the command line interface (CLI).

At this point, you should see the resulting output file (result.json) with the extracted table data!
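
Once result.json exists, you will usually want the table rows as plain objects. The sketch below assumes the output contains an objects array in which table entries have objectType set to "table" and a rows array whose cells wrap their text in a value property. That shape is an assumption, so compare it with your own result.json first. The helper name extractTableRows is hypothetical.

```javascript
// Hypothetical helper: flattens every table in the parsed result into plain rows.
// Assumed shape: { objects: [ { objectType: "table", rows: [ { column1: { value } } ] } ] }
function extractTableRows(result) {
  const objects = Array.isArray(result?.objects) ? result.objects : [];
  return objects
    .filter((obj) => obj.objectType === "table" && Array.isArray(obj.rows))
    .flatMap((table) =>
      table.rows.map((row) =>
        Object.fromEntries(
          Object.entries(row).map(([column, cell]) => [
            column,
            // Unwrap { value: ... } cells; leave anything else untouched.
            cell && typeof cell === "object" && "value" in cell ? cell.value : cell,
          ])
        )
      )
    );
}
```

To use it, read the file with fs.readFileSync("result.json", "utf8"), parse the text with JSON.parse, and pass the parsed object to extractTableRows.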

Use Cases - Extract Specific Data from Invoices

Here are a few use cases for the extraction of specific data from invoices:

Accounts Payable Automation

By extracting data such as invoice numbers, dates, vendor names, and amounts from scanned invoices, companies can streamline the process of paying bills, reducing errors and saving time.

Expense Tracking

Extracting data from receipts and invoices for business expenses helps individuals keep track of their spending and ensures that expenses are recorded accurately for tax and accounting purposes.

Compliance

Many industries have strict regulations around financial reporting and record-keeping. By extracting data from scanned invoices, companies can ensure that they are complying with these regulations and avoid fines and penalties.

Business Intelligence

By extracting data from scanned invoices and other financial documents, companies can gain insights into their business performance. For example, they may be able to identify trends in spending, compare the profitability of different products or services, or assess the financial health of their business.

Video Guide

Here’s a short demo showing how to extract tables from a PDF with JavaScript (Node.js) using the PDF.co Web API.