Extract an Article from a Newspaper PDF using Python and PDF.co Web API

Jan 9, 2025·6 Minutes Read

In this guide, we'll show how to extract an article from a newspaper PDF using Python and the PDF.co Web API. Specifically, we'll call the /v1/pdf/convert/to/text endpoint on a sample newspaper PDF to pull out the article's content as text.

Step 1: Install the requests Library

Before you begin, make sure the requests library is installed in your Python environment. The script uses it to send HTTP requests to the PDF.co Web API. To install it, open your terminal and run the following command:

python -m pip install requests
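To confirm that the library is visible to the interpreter you plan to use, a quick check along these lines may help. It is a minimal sketch that relies only on the standard library; the helper name is_installed is made up for this example.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    # find_spec returns None when the module cannot be found on the import path.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("requests"):
        print("requests is available")
    else:
        print("requests is missing; run: python -m pip install requests")
```

Run it with the same python executable you used for pip, so that both commands refer to the same environment.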

Step 2: Access the Source Code

Prepare the Python script that will handle the PDF-to-text conversion: copy the sample code and paste it into your preferred Python code editor (e.g., Visual Studio Code or PyCharm).

Step 3: Configure the Python Code

Update the script with your specific settings:

API Key:

Obtain your API key from your PDF.co dashboard and insert it into the designated spot in the script.

Source PDF:

Enter the name of the PDF file you want to extract the article from.

Output Text File:

Specify the name for the output file where the extracted text will be saved.

Asynchronous Mode:

Asynchronous mode is mainly useful for large PDFs. Instead of holding a single request open until the conversion finishes, the API starts the job in the background and the script checks back until the result is ready.
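The four settings above can be grouped in one place and turned into a request, roughly as sketched here. This is an illustration under assumptions, not the tutorial's sample code: the base URL, the x-api-key header, and the url, name, and async parameter names reflect our reading of the public PDF.co documentation and should be checked against the current API reference. The Settings class, build_convert_request, and uploaded_file_url are hypothetical names.

```python
from dataclasses import dataclass

# Assumed base URL of the PDF.co Web API; verify against the API reference.
BASE_URL = "https://api.pdf.co/v1"


@dataclass
class Settings:
    api_key: str            # from your PDF.co dashboard
    source_file: str        # name of the newspaper PDF
    destination_file: str   # name of the output text file
    async_mode: bool = True


def build_convert_request(settings: Settings, uploaded_file_url: str):
    """Build the endpoint, headers, and JSON payload without sending anything."""
    endpoint = f"{BASE_URL}/pdf/convert/to/text"
    # Assumption: the API key is passed in an "x-api-key" header.
    headers = {"x-api-key": settings.api_key}
    # Assumption: "url", "name", and "async" are the parameter names.
    payload = {
        "url": uploaded_file_url,
        "name": settings.destination_file,
        "async": settings.async_mode,
    }
    return endpoint, headers, payload
```

With requests installed, the result could be sent with requests.post(endpoint, headers=headers, json=payload). How the local source file becomes a URL the API can read is left out of this sketch.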

Step 4: Save the Python Program

Once you’ve configured the script, save it in your preferred directory.

Step 5: Run the Program

Execute the Python script. If everything is set up correctly, the script will start the extraction and write the extracted text to the output file you specified.
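When asynchronous mode is enabled, the script has to wait for the background job before it can save anything. The loop below sketches that waiting step in isolation. The status strings ("working" while the job runs, anything else when it stops) are an assumption about the job-status response, and wait_for_job and check_status are hypothetical names; check_status stands in for whatever function asks the API about the job.

```python
import time
from typing import Callable


def wait_for_job(
    check_status: Callable[[], str],
    interval_seconds: float = 3.0,
    max_checks: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll check_status until it stops reporting "working" or checks run out."""
    for _ in range(max_checks):
        status = check_status()
        # Assumption: "working" means the job is still running.
        if status != "working":
            return status
        sleep(interval_seconds)
    # The job never left the "working" state within the allowed checks.
    return "timeout"
```

Passing the status check in as a function keeps the loop independent of any particular HTTP call, and capping the number of checks prevents the script from waiting forever on a stuck job.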

Step 6: View the Extracted Data in TEXT Format

After the script finishes, go to the directory where the output file was saved (by default, the script's own directory). Open the output text file in a text editor to view the extracted article as plain text.
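If you would rather check the result from Python than from a text editor, a small helper such as this one can print the first few non-empty lines. It is a sketch: preview_text is a made-up name, and the file name in the usage note is a placeholder for whatever you set as the output file.

```python
from pathlib import Path


def preview_text(path, max_lines: int = 10):
    """Return up to max_lines non-empty lines from a text file."""
    # errors="replace" avoids a crash if the file has unexpected bytes.
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    lines = [line for line in text.splitlines() if line.strip()]
    return lines[:max_lines]
```

For example, print("\n".join(preview_text("article.txt"))) would show the opening lines of the extracted article, assuming the output file is named article.txt.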

In this tutorial, you learned how to install the requests library and how to use the PDF.co Web API's /v1/pdf/convert/to/text endpoint from Python to extract an article from a newspaper PDF as text.
