Extract Table Data from PDF Using Python and PDF.co Web API
In this tutorial, we will demonstrate how to extract table data from a PDF file using Python with the PDF.co Web API. We will utilize the /v1/pdf/documentparser endpoint, which outputs data in JSON format. We will work with a sample multipage PDF to read the table data.
[Image: Sample Multipage PDF]
Step 1: Install the requests Library
To begin, install the requests module, which we will use to make HTTP requests to the PDF.co API. In your command line or terminal, type the following command and hit Enter:
python -m pip install requests
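To confirm that the installation worked before moving on, a quick check like the one below can help. It is only a minimal sketch: the helper name is_installed is made up for this illustration, and it uses the standard library to ask whether the current interpreter can find a module.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the current Python interpreter can locate the module."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Prints True once "python -m pip install requests" has succeeded
    # for the same interpreter that runs this script.
    print("requests installed:", is_installed("requests"))
```

If this prints False, the package was probably installed for a different Python interpreter than the one running the script.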
Step 2: Source Code Samples
Next, copy the Python sample code from this link. Then, paste the code into your editor (e.g., Visual Studio Code, PyCharm, or any editor of your choice).
Step 3: Set Up the Python Code Configuration
Now, let's set up the Python code with your specific configurations:
- API Key: Insert your API Key in the designated area within the code. You can find your API Key on your PDF.co Dashboard.
- Source File: Specify the name of the PDF file from which you want to extract table data.
- Output File Name: Enter the name for the output JSON file. You can also choose other output formats like XML or CSV.
- Template File: Provide the name of your template file for extracting table data. To create a template, use the Document Parser Template Editor. Refer to the tutorial on creating a new template for guidance.
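The settings above could be gathered in one place roughly as follows. This is a sketch under stated assumptions, not the official sample code: the file names are placeholders, the helper build_parser_request is hypothetical, and the x-api-key header and the payload field names (url, template, outputFormat, async) should be checked against the PDF.co API reference. Only the /v1/pdf/documentparser endpoint comes from this tutorial.

```python
# Placeholder configuration values; replace them with your own.
API_KEY = "YOUR_PDFCO_API_KEY"      # from your PDF.co Dashboard
BASE_URL = "https://api.pdf.co/v1"
SOURCE_FILE = "sample.pdf"          # hypothetical input file name
OUTPUT_FILE = "result.json"         # hypothetical output file name
TEMPLATE_FILE = "template.json"     # hypothetical template file name


def build_parser_request(file_url, template_text,
                         output_format="JSON", run_async=True):
    """Return (url, headers, payload) for a document parser call.

    file_url is assumed to be a URL the API can read the PDF from.
    The header and field names below are assumptions to verify
    against the PDF.co documentation.
    """
    url = f"{BASE_URL}/pdf/documentparser"
    headers = {"x-api-key": API_KEY}
    payload = {
        "url": file_url,
        "template": template_text,
        "outputFormat": output_format,
        "async": run_async,
    }
    return url, headers, payload


if __name__ == "__main__":
    url, headers, payload = build_parser_request(
        "https://example.com/sample.pdf", "{}")
    print(url)
    print(sorted(payload))
```

With the requests library installed, the three returned values could then be passed to requests.post(url, headers=headers, json=payload). Keeping the request construction separate from the network call makes the configuration easy to inspect before anything is sent.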
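For this demonstration, we will use asynchronous mode for the conversion. In this mode, the API processes the document in the background while the program checks periodically until the result is ready, which helps avoid request timeouts on large or multipage documents.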
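The waiting part of asynchronous mode can be pictured as a simple polling loop. The sketch below is illustrative only: wait_for_job and its check_status argument are hypothetical names, and the status string "working" is an assumption about what the job status call reports while a job is still running. In a real script, check_status would call the PDF.co job status endpoint and return the status it reports.

```python
import time
from typing import Callable


def wait_for_job(check_status: Callable[[], str],
                 interval: float = 3.0,
                 max_checks: int = 100) -> str:
    """Poll check_status until the job leaves the 'working' state.

    check_status is any zero-argument function returning a status
    string. The value 'working' is an assumed in-progress status.
    """
    for _ in range(max_checks):
        status = check_status()
        if status != "working":
            # Finished, whether successfully or not; let the caller decide.
            return status
        time.sleep(interval)  # pause between checks to avoid hammering the API
    raise TimeoutError("Job did not finish within the allowed number of checks")


if __name__ == "__main__":
    # Simulated job that reports 'working' twice and then 'success'.
    fake_statuses = iter(["working", "working", "success"])
    print(wait_for_job(lambda: next(fake_statuses), interval=0))
```

Passing the status check in as a function keeps the loop independent of the HTTP details, so it can be tried out with a simulated job as shown.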
[Image: Setup Python Code Configuration]
Step 4: Save the Python Program
After configuring the code settings, save the Python program in your chosen directory.
[Image: Save the Python Program]
Step 5: Execute the Program
Once you have saved the program, run the Python script to see the results. If all configurations are correct, the program will extract the table data from your PDF document. After execution, look in the folder containing your Python script to locate the generated JSON file.
[Image: Execute the Program]
Step 6: View JSON Result
Finally, open the output JSON file in your preferred JSON viewer. You will see the extracted table data formatted in JSON.
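Besides viewing the file, the table rows can also be read back in Python. The sketch below assumes a particular result layout: a top-level "objects" list in which table entries have "objectType" set to "table" and a "rows" list whose cells are objects with a "value" key. That layout and the sample data are assumptions for illustration, so compare them with your actual output file and adjust the keys as needed. The helper names load_result and extract_tables are hypothetical.

```python
import json


def load_result(path):
    """Load the JSON output file produced by the script."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def extract_tables(result):
    """Map each table name to a list of rows (column name -> cell text).

    Assumes the layout described above; objects that are not tables
    are skipped.
    """
    tables = {}
    for obj in result.get("objects", []):
        if obj.get("objectType") != "table":
            continue
        rows = []
        for row in obj.get("rows", []):
            # Unwrap {"value": ...} cells; keep plain values unchanged.
            rows.append({
                col: cell.get("value") if isinstance(cell, dict) else cell
                for col, cell in row.items()
            })
        tables[obj.get("name", "")] = rows
    return tables


# Made-up sample in the assumed layout, not real API output.
sample_result = {
    "objects": [
        {"name": "invoiceNo", "objectType": "field", "value": "123"},
        {
            "name": "items",
            "objectType": "table",
            "rows": [
                {"description": {"value": "Widget"}, "qty": {"value": "2"}},
                {"description": {"value": "Gadget"}, "qty": {"value": "5"}},
            ],
        },
    ],
}

if __name__ == "__main__":
    for name, rows in extract_tables(sample_result).items():
        print(name, rows)
```

For a real run, replace sample_result with load_result("result.json"), using whatever output file name you configured.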
[Image: JSON Result]
In this tutorial, you learned how to extract table data from a PDF in Python using the PDF.co Web API.