[KShare] How to read PDF files when working with "blob:https://" web pages

Hi Community members, :wave:

If ever you encounter a file from a website stored in the blob format, you can now download the PDF documents and verify its content using this guide. Read along below :point_down:

:information_source: What is a Blob?

A Blob URL does not refer to data the exists on the server, it refers to data that your browser currently has in memory, for the current page . It will not be available on other pages, it will not be available in other browsers, and it will not be available from other computers.

:point_right: To learn more about blob, please refer to this page: Blob - Web APIs | MDN (mozilla.org)

The solution

Creating a set of scripts that help to open the custom browser to access a Blob type website (Blob websites are special URL links such that the URL shows blob:https://, and are used to store files temporarily in the browser’s memory instead of the file being hosted on the website’s server) and download the PDF.

After doing so, the script locates the last downloaded file in your local machine (in this case the downloaded PDF). After retrieving the PDF file it is then parsed for the content it has stored. For example, we can perform actions with the pdf file for storing IDs from the PDF and even using it for any checking purposes.

Breakdown of the solution

We will be dealing with few scripts and its working explained in this breakdown:

Firstly, we will be creating a custom keyword that includes a class with a function that helps to open a custom browser (in this scenario it’s a Blob website). This helps us to open a customized browser rather than a regular browser window.

  • The custom browser ,in this case, is used because we want to have the option to download a file automatically on clicking the link of the same which is not applicable in a regular browser.
  • To start with the download process, We would create a script where we navigate to the webpage and click the download option/link present there. After that we assign a specific location to the downloaded file as this helps in easily locating the file and performing any operations.

  • There may be a question arising as” Couldn’t we access the files directly from the Blob?” To answer that question we are unable to do that and so we must first download the files first to access them. To further elaborate, We cannot open the said file as it is not a file stored in our system but rather stored in the browser’s temporary storage. In addition to that, the link pertaining to the files contained are not a normal URL so it can’t interact with the web drivers.
  • Lastly, to perform any operation after downloading the document, we are required to navigate to the file location and open it. After that , to read a file we could use the Java File functions after converting the PDF file to text based document.

  • In order to read a PDF file, we also need Apache PDFBox as an external library. To read more about Apache PDFBox, how to add it as an external library, and how to implement the Custom Keyword “PdfReaderUtil”, please refer to the article below:
  • For example , if we want to print a serial number or any value from the file we parse it to the location of the said file in the text format and we could store the values present after locating the said information.For additional information regarding the working of the project, please refer here.
1 Like

Thank you the Product Support team (@support.squad) and Jordan Bartley (@jordan.bartley) for this article!

Jordan profile pic
Jordan Bartley (@jordan.bartley) - Product Support Specialist
Jordan has worked in Quality Assurance, Automated Testing, and Specialist Support roles for several years before joining Katalon. Through his experience, he shows an innate desire to assist clients, take on challenging problems, and work cross functionally with team members to create new solutions.
1 Like