[KShare] Using Katalon Studio to read PDF files directly on a webpage

Hi Community members, :wave:

Do you need to extract text from a PDF file during a WebUI test, such as in the example below? :point_down:

In this topic, we will show you how to set up and use Katalon Studio to verify information inside a PDF file hosted on a website.

Because PDF files do not contain HTML, they have no web objects/elements that the Katalon Studio Object Spy and Recorder can use. To work around this limitation, you can add code that parses the contents of the PDF file directly from the website during a test case.

1. Adding the Test Case code

The first step to set up the code for parsing PDF files is to copy the following into your test case:


import org.openqa.selenium.WebDriver as WebDriver
import org.openqa.selenium.remote.LocalFileDetector as LocalFileDetector
import org.openqa.selenium.support.events.EventFiringWebDriver as EventFiringWebDriver
import com.kms.katalon.core.configuration.RunConfiguration as RunConfiguration
import com.kms.katalon.core.webui.driver.DriverFactory as DriverFactory
import com.kms.katalon.selenium.driver.CRemoteWebDriver as CRemoteWebDriver
import com.kms.katalon.core.webui.driver.WebUIDriverType as WebUIDriverType
import com.kms.katalon.core.windows.keyword.WindowsBuiltinKeywords as Windows
import static com.kms.katalon.core.testobject.ObjectRepository.findWindowsObject

Function/Method Code:

// Identify the driver
WebDriver driver = DriverFactory.getWebDriver()

// PDF Keyword call; `url` must hold the address of the PDF file
// (for example, a test case variable)
def pdf = CustomKeywords.'com.pdf.reader.ReadPdfFromBrowser.PdfReaderUtil'(url, driver)

// Split the text extracted from the PDF file into lines
def lines = pdf.split('\\r?\\n')

// Print each individual line. At this point you can modify the code
// within the loop to look for a specific piece of text or collect the data
for (String line : lines) {
	println(line)
}
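
The comment in the loop suggests looking for a specific piece of text. As a rough sketch of that idea, the plain Java helper below splits the extracted text with the same regular expression and collects every line containing an expected string. The class name `PdfLineSearch`, the method names, and the sample text are all made up for illustration; in a real test the text would come from the Custom Keyword call.

```java
import java.util.ArrayList;
import java.util.List;

public class PdfLineSearch {

    // Split extracted PDF text into lines, using the same regex as the test case.
    public static String[] toLines(String pdfText) {
        return pdfText.split("\\r?\\n");
    }

    // Return every line that contains the expected text (case-sensitive), trimmed.
    public static List<String> findLines(String pdfText, String expected) {
        List<String> matches = new ArrayList<String>();
        for (String line : toLines(pdfText)) {
            if (line.contains(expected)) {
                matches.add(line.trim());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for the text returned by the keyword.
        String sample = "Invoice 1001\r\nCustomer: Jane Doe\nTotal: 42.00";
        System.out.println(findLines(sample, "Total")); // prints [Total: 42.00]
    }
}
```

In a test case, an empty result from `findLines` could be treated as a failed verification, for example by marking the step as failed when the expected text is missing.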

2. Adding the Custom Keyword code

After adding the previous code to your test case, you will need to implement a Custom Keyword that contains the code for parsing the PDF file. Custom Keywords must be placed inside a package within Katalon Studio. The screenshots below show how to create a package and a Custom Keyword:

In this example, it is in a package we have created called “com.pdf.reader”:

package com.pdf.reader

import static com.kms.katalon.core.checkpoint.CheckpointFactory.findCheckpoint
import static com.kms.katalon.core.testcase.TestCaseFactory.findTestCase
import static com.kms.katalon.core.testdata.TestDataFactory.findTestData
import static com.kms.katalon.core.testobject.ObjectRepository.findTestObject

import com.kms.katalon.core.annotation.Keyword
import com.kms.katalon.core.checkpoint.Checkpoint
import com.kms.katalon.core.cucumber.keyword.CucumberBuiltinKeywords as CucumberKW
import com.kms.katalon.core.mobile.keyword.MobileBuiltInKeywords as Mobile
import com.kms.katalon.core.model.FailureHandling
import com.kms.katalon.core.testcase.TestCase
import com.kms.katalon.core.testdata.TestData
import com.kms.katalon.core.testobject.TestObject
import com.kms.katalon.core.webservice.keyword.WSBuiltInKeywords as WS
import com.kms.katalon.core.webui.keyword.WebUiBuiltInKeywords as WebUI

import internal.GlobalVariable

import java.io.BufferedInputStream;
import java.io.File;
import java.io.RandomAccessFile;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.PDFTextStripperByArea;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class ReadPdfFromBrowser {

	PDDocument pdDoc;

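	@Keyword
	public String PdfReaderUtil(String html, WebDriver driver){

		String pdfFileInText = "";

		// Open a stream to the PDF file at the given address
		URL url = new URL(html);
		BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());

		// PDDocument.load(InputStream) is part of the PDFBox 2.x API
		pdDoc = PDDocument.load(fileToParse);

		// Only attempt text extraction on unencrypted documents
		if (!pdDoc.isEncrypted()) {

			PDFTextStripperByArea stripper = new PDFTextStripperByArea();

			PDFTextStripper tStripper = new PDFTextStripper();

			pdfFileInText = tStripper.getText(pdDoc);
		}

		// Release the document before returning the extracted text
		pdDoc.close();
		return pdfFileInText;
	}
}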
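
If the address passed to the keyword returns something other than a raw PDF (for example, an HTML page that embeds a viewer), the parser will fail. One possible guard, sketched below in plain Java, peeks at the start of the stream for the `%PDF-` marker that PDF files normally begin with, then resets the stream so it can still be parsed from the beginning. The class name `PdfHeaderCheck` and the method `looksLikePdf` are hypothetical and not part of Katalon Studio or PDFBox.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PdfHeaderCheck {

    // Peek at the first five bytes and compare them with "%PDF-".
    // The stream is reset afterwards, so a parser can still read it from the start.
    public static boolean looksLikePdf(BufferedInputStream in) throws IOException {
        byte[] expected = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        byte[] actual = new byte[expected.length];
        in.mark(expected.length);
        int read = 0;
        while (read < actual.length) {
            int n = in.read(actual, read, actual.length - read);
            if (n < 0) {
                break; // stream ended before five bytes were available
            }
            read += n;
        }
        in.reset();
        return read == expected.length && Arrays.equals(actual, expected);
    }

    public static void main(String[] args) throws IOException {
        // Made-up byte content standing in for a downloaded file.
        byte[] fake = "%PDF-1.7 sample".getBytes(StandardCharsets.US_ASCII);
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(fake));
        System.out.println(looksLikePdf(in)); // prints true
    }
}
```

In the keyword, a check like this could run on `fileToParse` before the document is loaded, returning an empty string or raising an error when the content is not a PDF.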

3. Adding Apache PDFBox

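Apache PDFBox provides the PDF classes used by the Custom Keyword (such as PDDocument and PDFTextStripper), so it must be available within Katalon Studio. Note that PDDocument.load belongs to the PDFBox 2.x API. It can be downloaded here. After downloading, you need to add the .jar as an external library within Katalon Studio: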

:warning: Please note that this guide works best for PDF files that primarily contain text. The more styling and images a file includes, the more likely the parser is to run into an error.

:white_check_mark: After this is all implemented, the test case should be able to open the PDF file, parse the information within it, and print each individual line of text from the file.

4. Other helpful resources

The two links below are for Custom Keywords for Katalon Studio that allow for further PDF management, including comparing, extracting, and saving parts of a PDF file.



Thank you to the Katalon Product Support team (@support.squad) for yet another helpful topic for our forum members and Enterprise users. This topic was prepared by a new face in the Product Support team:

Jordan Bartley (@jordan.bartley) - Product Support Specialist at Katalon
Jordan is a Product Support Specialist on Katalon’s US Support Team. He worked in Quality Assurance, Automated Testing, and Specialist Support roles for several years before joining Katalon Product Support. Through his experience, he shows an innate desire to assist clients, take on challenging problems, and work cross-functionally with team members to create new solutions.