How to Extract Information from PDFs using PDF.co Web API in JavaScript
This tutorial and its sample source code explain how to extract information from a PDF using the PDF Info functionality of the PDF.co Web API. It walks through the complete extraction process in JavaScript. With the PDF Info Reader API, users can extract the information they need from a PDF document and use it however they want.
Features of PDF Info Reader API Endpoint
The PDF.co Web API provides tools to extract information from a provided PDF document. The PDF Info Reader API endpoint gathers detailed information about a PDF document, including its properties and the security permissions it uses.
Moreover, the API reports information about PDF form fields. A PDF form can include checkboxes, list boxes, text fields, radio buttons, and combo boxes. The users can use this page for one-time checking of such information. Below is a detailed demo explaining these features.
The PDF.co Web API also protects its users' data by transmitting documents and data files over encrypted connections. Users can learn more about the PDF.co API security here.
Endpoint Parameters
The PDF Info Reader API accepts the following parameters.
- url: It is a required parameter that contains the URL of the input PDF document from which the users want to extract information. The source file link can come from various platforms supported by the API, such as Dropbox, Google Drive, or the built-in file storage of PDF.co Web API. The users can encrypt or decrypt any input or output data file using the user-controlled data encryption functionality.
Note: If the users get error messages like “Too many Requests” or “Access Denied” while providing the input URL, they can add the cache: prefix to the URL to enable the built-in caching functionality.
- httpusername: It is an optional parameter that takes the HTTP auth user name if one is required to access the source URL.
- httppassword: It is an optional parameter that takes the HTTP auth password if one is required to access the source URL.
- async: It is an optional boolean parameter (true or false) used for asynchronous processing. When it is set to true, the API returns a job ID that the user can use to check the background job's status. The possible states of the background job are failed, working, success, and aborted.
- profiles: It is an optional string parameter. The users can use it to set additional custom configuration and extra options for processing the file.
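As a rough illustration of how these parameters fit together, the sketch below assembles a request payload from the parameter names listed above. The helper name buildInfoPayload and the example values are assumptions made for illustration; they are not part of the PDF.co API.

```javascript
// Hypothetical helper: builds the JSON payload for the PDF Info endpoint
// from the parameters described above. Only `url` is required.
function buildInfoPayload(options) {
    if (!options || typeof options.url !== "string" || options.url === "") {
        throw new Error("The url parameter is required");
    }
    var payload = { url: options.url };
    // Copy the optional string parameters only when the caller supplied them
    ["httpusername", "httppassword", "profiles"].forEach(function (name) {
        if (typeof options[name] === "string") {
            payload[name] = options[name];
        }
    });
    // async is a boolean, so it is handled separately
    if (typeof options.async === "boolean") {
        payload.async = options.async;
    }
    return payload;
}

// Example usage with placeholder values
var examplePayload = buildInfoPayload({
    url: "https://example.com/sample.pdf",
    async: true
});
console.log(JSON.stringify(examplePayload));
// prints {"url":"https://example.com/sample.pdf","async":true}
```

Leaving out parameters that were not supplied keeps the payload as small as the one in the sample code, which sends only the url.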
How to Extract PDF Information Using the API in JavaScript
The following JavaScript source code shows how to extract information from a sample PDF document using the PDF.co Info Reader API. The users can upload their own documents or provide a link to them in the url parameter. The returned information includes text fields, page count, permissions, checkboxes, and more.
The code uses a sample PDF form as an example. For the API to work, the users must provide the PDF file URL in the request along with an API key, which they can generate after logging in to the PDF.co platform. The PDF.co API then returns the output for the provided file URL, from which the user can pick out the required information, such as the author, page count, password protection, permissions, bookmarks, and details about the PDF content.
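Picking individual values out of the returned output could look roughly like the sketch below. The exact shape of the response is not shown in this tutorial, so the code assumes the details arrive in an object named info and that fields such as Author and PageCount exist; both are assumptions to verify against the real output. The helper name pickInfoFields is also hypothetical.

```javascript
// Hypothetical helper: picks selected fields out of a parsed API response.
// ASSUMPTION: document details sit under `response.info`; adjust this if the
// real response is shaped differently.
function pickInfoFields(response, fieldNames) {
    var info = (response && response.info) || {};
    var picked = {};
    fieldNames.forEach(function (name) {
        // Missing fields are reported as null rather than left undefined
        picked[name] = Object.prototype.hasOwnProperty.call(info, name) ? info[name] : null;
    });
    return picked;
}

// Made-up response object used only to show the call; this is not real API output
var exampleResponse = {
    error: false,
    info: { Author: "Jane Doe", PageCount: 3 }
};
console.log(pickInfoFields(exampleResponse, ["Author", "PageCount", "Title"]));
```

Returning null for absent fields makes it easy to tell a missing property apart from an empty one.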
Sample Code Snippet for PDF Extraction
The following example code extracts information from a PDF form using the PDF.co Web API:
var https = require("https");

// Your PDF.co API key
const API_KEY = "***********";

// Direct URL of source PDF file.
const SourceFileUrl = "https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/pdf-info/sample.pdf";

// Prepare request to `PDF Info` API endpoint
var queryPath = `/v1/pdf/info`;

// JSON payload for api request
var jsonPayload = JSON.stringify({
    url: SourceFileUrl
});

var reqOptions = {
    host: "api.pdf.co",
    method: "POST",
    path: queryPath,
    headers: {
        "x-api-key": API_KEY,
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(jsonPayload, 'utf8')
    }
};

// Send request
var postRequest = https.request(reqOptions, (response) => {
    // The body may arrive in several chunks, so collect them first
    var body = "";
    response.setEncoding("utf8");
    response.on("data", (chunk) => {
        body += chunk;
    });
    response.on("end", () => {
        // Parse JSON response once the whole body has arrived
        var data;
        try {
            data = JSON.parse(body);
        } catch (e) {
            console.log("Could not parse response: " + e.message);
            return;
        }
        if (data.error == false) {
            console.log(data);
        }
        else {
            // Service reported error
            console.log(data.message);
        }
    });
}).on("error", (e) => {
    // Request error
    console.log(e);
});

// Write request data
postRequest.write(jsonPayload);
postRequest.end();
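The sample hard-codes the API key. One possible alternative, sketched here, reads the key from an environment variable so that it stays out of the source file. The variable name PDFCO_API_KEY and both helper names are assumptions made for illustration; the host, path, and header names are taken from the sample above.

```javascript
// Hypothetical helper: reads the API key from an environment-like object.
// ASSUMPTION: the key is stored in a variable named PDFCO_API_KEY.
function getApiKey(env) {
    var key = env.PDFCO_API_KEY;
    if (!key) {
        throw new Error("Set the PDFCO_API_KEY environment variable first");
    }
    return key;
}

// Builds the same request options as the sample above for a given key and payload
function buildRequestOptions(apiKey, jsonPayload) {
    return {
        host: "api.pdf.co",
        method: "POST",
        path: "/v1/pdf/info",
        headers: {
            "x-api-key": apiKey,
            "Content-Type": "application/json",
            "Content-Length": Buffer.byteLength(jsonPayload, "utf8")
        }
    };
}

// Example usage: only builds the options, no request is sent here
if (process.env.PDFCO_API_KEY) {
    var exampleBody = JSON.stringify({ url: "https://example.com/sample.pdf" });
    console.log(buildRequestOptions(getApiKey(process.env), exampleBody).path);
}
```

The resulting options object can be passed to https.request in place of reqOptions.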
Source File
Below is the screenshot of the source file used for this example.
Output of Extracted PDF
Below is the screenshot of the output provided after the execution of the above sample code.
Quick Guide to Extract Information From PDF Forms Using PDF.co Web API
The following guide explains how the sample code extracts information using the Info Reader API endpoint:
- The users must import the necessary module to send API requests. In Node.js, the “require” function loads a module. In this scenario, the only required module is “https”, a built-in Node.js module that makes HTTPS calls.
- After logging in to the PDF.co website, the users can obtain their API key. The API does not accept unauthenticated requests, so the users have to send this API key in the “x-api-key” header for authentication.
- Since the users might run the code several times, the best approach is to store the API key in a variable. This way, the users only have to change the variable's value in one place, without touching the rest of the code.
- The next step is to declare and initialize the source file and query path variables for the document to extract critical information from it. The query path contains the information of the relevant API endpoint the users have to utilize in the scenario.
- The users have to provide the relevant information in the JSON payload. Here the code provides only the source URL; the optional parameters described earlier can be added in the same way.
- The users can use the “async” parameter if the file is large. The request then runs as a background job, and the users can check the job's status using the returned job ID.
- The users must declare a variable to provide API options such as API endpoint URL, method, and headers. The code contains “reqOptions” for initializing such information in this example.
- The next and final step is to send the POST request to the API endpoint and wait for its response. The code parses the JSON response and checks its “error” field: if it is false, the request succeeded and the code prints the output; otherwise, it prints the error message returned by the service. Status 200 represents a successful request. If the request is unsuccessful, the users can correct the input and execute the code again.
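To make the last step more concrete, here is a minimal sketch of how a response might be checked once the full body has been received. It relies only on the error and message fields that the sample code already reads. The helper name interpretResponse is an assumption, and the input strings below are made up for illustration.

```javascript
// Hypothetical helper: turns an HTTP status code and raw body into a result object.
function interpretResponse(statusCode, rawBody) {
    var data;
    try {
        data = JSON.parse(rawBody);
    } catch (e) {
        return { ok: false, message: "Response was not valid JSON" };
    }
    if (data === null || typeof data !== "object") {
        return { ok: false, message: "Response was not a JSON object" };
    }
    // The sample code treats `error == false` as success
    if (statusCode === 200 && data.error === false) {
        return { ok: true, data: data };
    }
    return {
        ok: false,
        message: data.message || "Request failed with status " + statusCode
    };
}

// Made-up inputs, used only to show both branches
console.log(interpretResponse(200, '{"error":false}').ok); // prints true
console.log(interpretResponse(401, '{"error":true,"message":"Unauthorized"}').message); // prints Unauthorized
```

Keeping this check in a separate function means it can be exercised without sending a real request.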