If you need to extract data from many PDF documents that come from different sources, a good first step is to sort them by vendor. We’ve created a PDF Classifier tool that is available in both the cloud and on-premise versions of our PDF Extractor API.

  1. How it Works
  2. Create and Test Classification Rules
  3. Test Classification Rules on Folders With PDF Documents
  4. Test Classification Rules on Scanned Documents
  5. Copy Rules as JSON
  6. How to Use PDF Classifier in PDF.co

How it Works

PDF.co Document processing workflow

  • Create rules as CSV (comma-separated values), where every row has the following columns: classname, OR or AND logic (OR is used by default), keyword1 or phrase1, keyword2 or phrase2, …
  • Test these rules on your sample PDF files
  • Generate a JSON request for use with PDF.co, or just save the rules as CSV and pass the link along with all requests
  • Use the pdf/classifier endpoint in PDF.co (cloud) or API Server (on-prem)
  • The pdf/classifier endpoint returns the detected class for the input PDF, JPG, PNG, or TIFF document
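
To make the rule format concrete, here is a minimal Python sketch of how rows in that CSV layout could be parsed and matched against text that has already been extracted from a document. It is an illustration only, not the PDF.co implementation: the helper names (parse_rules, classify), the sample rules, the case-insensitive substring matching, and returning the first matching class are all assumptions. It also assumes that a row without an explicit AND/OR column falls back to OR, and it does not handle regular expressions.

```python
import csv
import io
from typing import List, Optional, Tuple

Rule = Tuple[str, str, List[str]]  # (classname, logic, keywords)


def parse_rules(rules_csv: str) -> List[Rule]:
    """Parse rows of: classname, optional AND/OR, keyword1, keyword2, ..."""
    rules = []
    for row in csv.reader(io.StringIO(rules_csv), skipinitialspace=True):
        cells = [cell.strip() for cell in row if cell.strip()]
        if len(cells) < 2:
            continue  # skip blank or incomplete rows
        name, rest = cells[0], cells[1:]
        logic = "OR"  # the article says OR is used by default
        if rest[0].upper() in ("AND", "OR"):
            # Assumption: a second column of AND/OR is the logic, not a keyword.
            logic, rest = rest[0].upper(), rest[1:]
        if rest:
            rules.append((name, logic, rest))
    return rules


def classify(text: str, rules: List[Rule]) -> Optional[str]:
    """Return the first class whose keywords match the text, else None."""
    haystack = text.lower()
    for name, logic, keywords in rules:
        hits = [keyword.lower() in haystack for keyword in keywords]
        if all(hits) if logic == "AND" else any(hits):
            return name
    return None


# Hypothetical sample rules, one class per row.
RULES = """\
Amazon,AND,Amazon Web Services,Invoice
DigitalOcean,DigitalOcean,droplet
"""

rules = parse_rules(RULES)
print(classify("Amazon Web Services Invoice #123", rules))
print(classify("Your droplet usage for May", rules))
```

With AND, every keyword in the row has to appear in the text. With OR, a single hit is enough, which is why the second sample rule matches on "droplet" alone.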

To make it easy to test, maintain, and update your classification rules, we’ve created a classification rules testing tool that is available as part of the PDF Multitool desktop app (download page is here).

Here are short demos of what you can do with this tool. It works fast and needs no Internet connection because no files are uploaded.

Create and Test Classification Rules

Use the spreadsheet-like interface to define new classes with rules. Rules can use plain text or regular expressions, and you can quickly test them to see how they work on your PDF documents.

Create And Test Classification Rules

Test Classification Rules on Folders With PDF Documents

As the ultimate goal is to sort PDF files in a batch, you can test classification rules on folders with PDF files to see which class is detected for each file.
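
Once every file has a detected class, the sorting step itself is straightforward. Below is a small Python sketch of it, assuming you already have a mapping from file name to class name (for example, collected from the classifier results). The function name sort_into_class_folders, the folder layout, and the sample mapping are hypothetical and only meant as a starting point.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List


def sort_into_class_folders(detected: Dict[str, str], source: Path, target: Path) -> List[Path]:
    """Move each file from source into target/<class name>/ and return the new paths."""
    moved = []
    for filename, class_name in sorted(detected.items()):
        destination = target / class_name
        destination.mkdir(parents=True, exist_ok=True)
        new_path = shutil.move(str(source / filename), str(destination / filename))
        moved.append(Path(new_path))
    return moved


# Demo in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp) / "inbox"
    inbox.mkdir()
    for name in ("invoice-001.pdf", "invoice-002.pdf"):
        (inbox / name).write_bytes(b"%PDF-1.4 placeholder")

    # Hypothetical classification results: file name -> detected class.
    detected = {"invoice-001.pdf": "Amazon", "invoice-002.pdf": "DigitalOcean"}

    for path in sort_into_class_folders(detected, inbox, Path(tmp) / "sorted"):
        print(path.relative_to(tmp).as_posix())
```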

Test Classification Rules On Folder With PDF Files

Test Classification Rules on Scanned Documents

You can also test classification rules on scanned documents.

Test Classification Rules On Selected JPG, PDF Files And Folder

Copy Rules as JSON

You can save classification rules to a CSV file, or simply copy a ready-to-use JSON request. The request works with both PDF.co (cloud) and API Server (the on-prem version of PDF.co).
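
If you keep the rules as CSV and want to build the request body yourself, a rough Python sketch could look like the one below. The rulescsv parameter name is the one shown in the PDF.co Request Tester. The url parameter for the input document, the newline separator between rule rows, the sample rules, and the helper name build_classifier_request are assumptions, so compare the result with the JSON that the tool copies for you.

```python
import json
from typing import List


def build_classifier_request(document_url: str, rule_rows: List[str]) -> str:
    """Return a JSON request body that embeds the rules as one CSV string."""
    payload = {
        "url": document_url,  # assumed parameter name for the input document
        "rulescsv": "\n".join(rule_rows),  # assumption: one rule per line
    }
    return json.dumps(payload, indent=2)


# Hypothetical sample rules and document link.
rule_rows = [
    "Amazon,AND,Amazon Web Services,Invoice",
    "DigitalOcean,DigitalOcean,droplet",
]
print(build_classifier_request("https://example.com/sample-invoice.pdf", rule_rows))
```

Note that json.dumps escapes the line breaks between rules as \n, so the whole rule set travels as a single string value.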

Export Classification Rules In JSON Format

How to Use PDF Classifier in PDF.co

In this tutorial, we will demonstrate how to use the PDF Classifier in PDF.co. To follow along, you can download the file here. We will use both PDF Multitool and PDF.co to showcase this functionality. If you haven’t yet, you can download the PDF Multitool here.

Step 1 – First, download and open the file in the PDF Multitool Classifier Test Tool.

Open File In Classifier Testing Tool

Step 2 – Click the Test Rules button.

Click Test Rules Button

Step 3 – Click Copy Rules for PDF.co or API Server.

Copy Rules For PDF.co

Step 4 – Paste the JSON in your favorite text editor and copy the rulesCSV value.

Copy The rulesCSV Value

Step 5 – Open the PDF.co Request Tester. You need to be logged in to your PDF.co account. Here’s the direct link: https://app.pdf.co/request-tester.

Step 6 – In the PDF.co API endpoint field, select /pdf/classifier.

Select PDF Classifier Endpoint

Step 7 – Remove the rulescsv parameter’s default value and paste the rulesCSV value that we copied from the Classifier Test Tool.

Replace rulescsv Value

Step 8 – Click the Run Request button to see the output.

Run Request

Step 9 – Click on the output file link to preview.

PDF.co PDF Classifier Output

Step 10 – Here’s the PDF.co PDF Classifier in action.

PDF.co PDF Classifier Demo
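
The same request can also be sent from your own code instead of the Request Tester. The Python sketch below only uses the standard library and takes the JSON body you copied from the Classifier Test Tool. The cloud endpoint URL, the x-api-key header, and the helper names (make_classifier_request, send) are assumptions here, and an on-prem API Server will have its own base URL, so please check them against your account and the API documentation. The response is returned as parsed JSON without assuming which fields it contains.

```python
import json
import urllib.request

# Assumed cloud endpoint; replace with your API Server address for on-prem.
API_URL = "https://api.pdf.co/v1/pdf/classifier"


def make_classifier_request(api_key: str, json_body: str) -> urllib.request.Request:
    """Wrap a ready-made JSON body in a POST request for the classifier endpoint."""
    json.loads(json_body)  # fail early if the pasted body is not valid JSON
    return urllib.request.Request(
        API_URL,
        data=json_body.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,  # assumed authentication header
        },
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical body; in practice, paste the JSON copied from the tool.
body = json.dumps({
    "url": "https://example.com/sample-invoice.pdf",
    "rulescsv": "Amazon,AND,Amazon Web Services,Invoice",
})
request = make_classifier_request("YOUR_API_KEY", body)
print(request.get_method(), request.full_url)

# Uncomment once a real API key and document URL are in place:
# print(send(request))
```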

Get the PDF classification testing tool now from this page (you need to download the PDF Multitool app).