Extract Table Data from PDF Using Python and PDF.co Web API

Jan 21, 2025·3 Minutes Read

In this tutorial, we will demonstrate how to extract table data from a PDF file using Python with the PDF.co Web API. We will utilize the /v1/pdf/documentparser endpoint, which outputs data in JSON format. We will work with a sample multipage PDF to read the table data.

Sample Multipage PDF

Step 1: Install the requests Library

To begin, install the requests module, which the script uses to make HTTP requests to the PDF.co API. In your command line or terminal, run the following command: python -m pip install requests
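To confirm the installation worked before moving on, a quick check like the one below can help. It is a minimal sketch that uses only the standard library; the helper name is_installed is made up for this illustration.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("requests"):
        print("requests is available")
    else:
        print("requests is missing; run: python -m pip install requests")
```

If the script reports that requests is missing, make sure the pip command was run with the same Python interpreter you use to run your scripts.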

Step 2: Source Code Samples

Next, copy the Python sample code from this link. Then, paste the code into your editor (e.g., Visual Studio Code, PyCharm, or any editor of your choice).

Step 3: Set Up Python Code Configuration

Now, let's set up the Python code with your specific configurations:

  • API Key: Insert your API Key in the designated area within the code. You can find your API Key on your PDF.co Dashboard.
  • Source File: Specify the name of the PDF file from which you want to extract table data.
  • Output File Name: Enter the name for the output JSON file. You can also choose other output formats like XML or CSV.
  • Template File: Provide the name of your template file for extracting table data. To create a template, use the Document Parser Template Editor. Refer to the tutorial on creating a new template for guidance.
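As a rough illustration of how these settings fit together, the sketch below collects them into constants and builds a request for the /v1/pdf/documentparser endpoint. It is not the official sample code: the file names are placeholders, and the base URL, the x-api-key header, and the parameter names (url, template, outputFormat, async, name) are assumptions that should be checked against the PDF.co API documentation. It also assumes the PDF is already reachable by URL; uploading a local file is not shown.

```python
# Placeholder configuration; replace each value with your own.
API_KEY = "YOUR_API_KEY"              # from your PDF.co Dashboard
BASE_URL = "https://api.pdf.co/v1"    # assumed base URL of the Web API
OUTPUT_FILE = "result.json"           # hypothetical output file name
TEMPLATE_FILE = "template.json"       # hypothetical template file name


def build_parser_request(file_url, template_text, api_key=API_KEY,
                         output_format="JSON", run_async=True):
    """Build the endpoint, headers, and payload without sending anything."""
    endpoint = f"{BASE_URL}/pdf/documentparser"
    headers = {"x-api-key": api_key}          # assumed header name
    payload = {                               # assumed parameter names
        "url": file_url,
        "template": template_text,
        "outputFormat": output_format,
        "async": run_async,
        "name": OUTPUT_FILE,
    }
    return endpoint, headers, payload


def send_parser_request(file_url, template_text):
    """Send the request; needs the requests library installed in Step 1."""
    import requests  # imported here so the builder above works without it

    endpoint, headers, payload = build_parser_request(file_url, template_text)
    response = requests.post(endpoint, headers=headers, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()
```

Keeping the request builder separate from the network call makes it easy to inspect the payload before spending any API credits.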

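For this demonstration, we will use Asynchronous mode for conversion. In this mode, the API processes the document in the background as a job, and the program checks the job status until the result is ready. This avoids holding a single request open while a large document is processed.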
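The waiting part of an asynchronous workflow can be sketched as a small polling loop. The example below is a generic illustration rather than the PDF.co sample code: check_status is a hypothetical callable that you supply (in a real script it would query the job status for the job started by your request), and the status string "working" is an assumption about what the API returns while a job is still running.

```python
import time
from typing import Callable


def wait_for_job(check_status: Callable[[], str], interval: float = 3.0,
                 max_checks: int = 100, sleep=time.sleep) -> str:
    """Call check_status until it stops returning "working".

    Returns the final status string, or raises TimeoutError if the job
    is still running after max_checks attempts.
    """
    for _ in range(max_checks):
        status = check_status()
        if status != "working":   # assumed "still running" status value
            return status
        sleep(interval)           # pause before asking again
    raise TimeoutError(f"job still running after {max_checks} checks")
```

Passing the sleep function as a parameter keeps the loop easy to test, and the max_checks limit prevents the script from waiting forever if a job never finishes.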

Step 4: Save the Python Program

After configuring the code settings, save the Python program in your chosen directory.

Step 5: Execute the Program

Once you have saved the program, run the Python script to see the results. If all configurations are correct, the program will extract the table data from your PDF document. After execution, open the folder that contains your script to locate the generated JSON file.

Step 6: View JSON Result

Finally, open the output JSON file in your preferred JSON viewer. You will see the extracted table data formatted in JSON.
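You can also read the result in Python instead of a viewer. The sketch below assumes a particular output layout: a top-level "objects" list in which table entries have "objectType" set to "table" and a "rows" list of column-to-cell mappings, each cell holding a "value". That layout, the SAMPLE data, and the file name result.json are assumptions for illustration, so compare them with your actual output before relying on this code.

```python
import json

# Hypothetical sample that mimics the assumed output layout.
SAMPLE = {
    "objects": [
        {"name": "companyName", "objectType": "field", "value": "ACME"},
        {"name": "table", "objectType": "table", "rows": [
            {"column1": {"value": "Item 1"}, "column2": {"value": "10.00"}},
            {"column1": {"value": "Item 2"}, "column2": {"value": "25.50"}},
        ]},
    ],
}


def load_result(path="result.json"):
    """Load the JSON file written by the extraction script."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def table_rows(parsed):
    """Map each table name to its rows, each row a list of cell values."""
    tables = {}
    for obj in parsed.get("objects", []):
        if obj.get("objectType") != "table":
            continue  # skip non-table entries such as single fields
        tables[obj.get("name")] = [
            [cell.get("value") for cell in row.values()]
            for row in obj.get("rows", [])
        ]
    return tables


if __name__ == "__main__":
    for name, rows in table_rows(SAMPLE).items():
        print(name)
        for row in rows:
            print("  ", row)
```

To process your own output, call table_rows(load_result()) with the path to your generated file.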

JSON Result

In this tutorial, you learned how to extract table data from a PDF in Python using the PDF.co Web API.
