How to Extract Hyperlinks in PDF with Python using Web API

Python has a large set of libraries for many kinds of tasks. To extract data from a PDF, we will call a Web API from Python.

In this article, we are going to extract hyperlinks from a PDF in Python using the Web API:

  1. Install Requests Module
  2. Python Sample Code
  3. API Key
  4. Source File and Destination
  5. Custom Profiles
  6. Run Program
  7. Output
  8. PDF to JSON Demo

We have a sample PDF here and will extract its hyperlinks using Python.

Sample PDF with Hyperlinks

Step 1: Install Requests Module

  • First, install the requests module. Type python -m pip install requests in your command line.
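
Before moving on, it can help to confirm that the module is visible to the interpreter you will run the script with. The following is a minimal sketch that uses only the standard library. The helper name is_installed is hypothetical and is not part of the sample code.

```python
import importlib.util


def is_installed(module_name):
    """Return True if this interpreter can find the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("requests"):
        print("requests is available")
    else:
        print("requests is missing; run: python -m pip install requests")
```

If the check reports the module as missing even after installing it, the install probably went to a different Python interpreter than the one running the script.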

Step 2: Python Sample Code

  • Next, let’s add the Python sample code in the Visual Studio Code editor. You can also use your favorite Python editor. Kindly click this link for the source code.


Step 3: API Key

  • Then, add the API Key. You can get the API Key in your dashboard.

API Key
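
Instead of pasting the key directly into the script, one option is to read it from an environment variable so it does not end up in shared code. This is only a sketch: the variable name PDF_API_KEY and the helper load_api_key are assumptions made for illustration, not names used by the sample code or the service.

```python
import os


def load_api_key(env=None, name="PDF_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing or empty.

    `name` is a hypothetical variable name chosen for this example.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable to your API key")
    return key
```

With the variable set in your shell, API_KEY = load_api_key() could stand in for the hard-coded value in the sample code.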

Step 4: Source File and Destination

  • In line 12, input the source PDF file name.
  • In line 18, type in your desired JSON output file name.

Source File and Destination
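
The two file names can also be handled with pathlib, which makes it easy to derive the JSON name from the PDF name and to fail early when the source file is missing. The helper names below are hypothetical and do not appear in the sample code.

```python
from pathlib import Path


def default_destination(source_file):
    """Return the source path with its extension replaced by .json."""
    return Path(source_file).with_suffix(".json")


def check_source(source_file):
    """Raise early if the source is not an existing .pdf file."""
    path = Path(source_file)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"Source file not found: {path}")
    return path
```

For example, default_destination("sample.pdf") gives a path named sample.json in the same folder.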

Step 5: Custom Profiles

  • In line 56, we will set the advanced conversion profile { "OutputStructure": "OnlyLinks", "OutputTransformation": "$..text" }. It extracts all links in a PDF.

Advanced Conversion Profile
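
The value "$..text" is a JSONPath expression: ".." means recursive descent, so it selects every member named text at any depth. The sketch below builds the profile as a JSON string and mimics that selection in plain Python to show the idea. The nested sample data is invented for illustration and does not reflect the service's real response layout, and collect_text is a hypothetical helper, not the service's implementation.

```python
import json

# Profile values taken from the step above.
PROFILE = {"OutputStructure": "OnlyLinks", "OutputTransformation": "$..text"}
profile_string = json.dumps(PROFILE)


def collect_text(node):
    """Collect every value stored under a "text" key, at any nesting depth."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "text":
                found.append(value)
            found.extend(collect_text(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_text(item))
    return found


# Made-up nested structure, only to demonstrate the recursive selection.
sample = {
    "document": {
        "page": [
            {"link": {"text": "https://example.com/a"}},
            {"link": [{"text": "https://example.com/b"}]},
        ]
    }
}

if __name__ == "__main__":
    print(profile_string)
    print(collect_text(sample))
```

The point of the transformation is that the nesting is flattened away, leaving only the text values.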

Step 6: Run Program

  • Run the program. Once it completes successfully, check your program folder to view the output.

Run Program

Step 7: Output

  • Here are the extracted links in JSON format.

Extracted Multiple Links
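
To use the extracted links in later code, the output file can be loaded back with the json module. This sketch assumes the file holds a flat JSON array of strings. The exact shape of the output is not shown in the text here, so adjust the parsing if your file looks different. The function names are hypothetical.

```python
import json


def parse_links(json_text):
    """Return the string entries from a JSON array, dropping duplicates in order."""
    data = json.loads(json_text)
    if not isinstance(data, list):
        data = [data]  # tolerate a single value instead of an array
    links = []
    for item in data:
        if isinstance(item, str) and item not in links:
            links.append(item)
    return links


def load_links(path):
    """Read the JSON output file at `path` and return its links."""
    with open(path, encoding="utf-8") as handle:
        return parse_links(handle.read())
```

Calling load_links with the output file name chosen in Step 4 would then return the links as a Python list.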

Step 8: PDF to JSON Demo

  • Here’s a quick demo of the PDF to JSON advanced conversion.
PDF to JSON in Action

In this article, you learned how to extract hyperlinks from a PDF in Python. You also learned how to use the Web API to extract multiple links from a PDF.