Taking screenshots from PDF file with Apache PDFBox

plaidshirtakos · January 5, 2020, 10:27pm

I use PDFBox to generate images from every page of a PDF file. The file is fetched from a URL.

public String Pdf2Image(String html, WebDriver driver){
    URL url=new URL(GlobalVariable.url);
    HttpURLConnection connection=(HttpURLConnection)url.openConnection();
    InputStream is=connection.getInputStream();

    PDDocument document = PDDocument.load(is);
    PDFRenderer pdfRenderer = new PDFRenderer(document);
    for (int page = 0; page < document.getNumberOfPages(); ++page) {
        BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
        ImageIOUtil.writeImage(bim, GlobalVariable.fileAzon + "-" + (page+1) + ".png", 300);
    }
    document.close();
}

It doesn’t work for PDF files whose text content can’t be copied and for which printing is disabled. I get the following error message during execution; I couldn’t get the InputStream.

org.codehaus.groovy.runtime.InvokerInvocationException: java.io.IOException: Error: End-of-File, expected line
    at com.pdf.reader.ReadPdfFromBrowser.invokeMethod(ReadPdfFromBrowser.groovy)
    at com.kms.katalon.core.main.CustomKeywordDelegatingMetaClass.invokeStaticMethod(CustomKeywordDelegatingMetaClass.java:50)
    at print.run(print:10)
    at com.kms.katalon.core.main.ScriptEngine.run(ScriptEngine.java:194)
    at com.kms.katalon.core.main.ScriptEngine.runScriptAsRawText(ScriptEngine.java:119)
    at com.kms.katalon.core.main.TestCaseExecutor.runScript(TestCaseExecutor.java:337)
    at com.kms.katalon.core.main.TestCaseExecutor.doExecute(TestCaseExecutor.java:328)
    at com.kms.katalon.core.main.TestCaseExecutor.processExecutionPhase(TestCaseExecutor.java:307)
    at com.kms.katalon.core.main.TestCaseExecutor.accessMainPhase(TestCaseExecutor.java:299)
    at com.kms.katalon.core.main.TestCaseExecutor.execute(TestCaseExecutor.java:233)
    at com.kms.katalon.core.main.TestCaseMain.runTestCase(TestCaseMain.java:114)
    at com.kms.katalon.core.main.TestCaseMain.runTestCase(TestCaseMain.java:105)
    at com.kms.katalon.core.main.TestCaseMain$runTestCase$0.call(Unknown Source)
    at TempTestCase1578260748461.run(TempTestCase1578260748461.groovy:23)
Caused by: java.io.IOException: Error: End-of-File, expected line
    at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1124)
    at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2603)
    at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2574)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1222)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1122)
    at org.apache.pdfbox.pdmodel.PDDocument$load.call(Unknown Source)
    at com.pdf.reader.ReadPdfFromBrowser.Pdf2Image(ReadPdfFromBrowser.groovy:72)
    ... 14 more
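
The `parsePDFHeader` frame in the trace suggests that the parser hit the end of the stream while still looking for the PDF header. One way to narrow this down could be to buffer the response and look at its first bytes before handing it to PDFBox. The following is only a sketch using the JDK; the class and method names (`PdfHeaderCheck`, `readAll`, `looksLikePdf`) are hypothetical, and the strict "must start with `%PDF-`" rule is an assumption, since some parsers tolerate a few leading bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox or Katalon.
final class PdfHeaderCheck {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Reads the whole stream into memory so it can be inspected before parsing. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns true if the data starts with the "%PDF-" marker (strict check, an assumption). */
    static boolean looksLikePdf(byte[] data) {
        if (data.length < PDF_MAGIC.length) {
            return false; // empty or too short to be a PDF
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

If the check passes, the same byte array can be given to `PDDocument.load(byte[])` in PDFBox 2.x. If it fails, printing the length and the first few bytes should show whether the server sent nothing or sent something other than a PDF.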

plaidshirtakos · January 6, 2020, 8:46am

UPDATE: Here is a sample PDF that I made to reproduce the original PDF document: https://gofile.io/?c=WYPqpZ

Timo_Kuisma1 · January 6, 2020, 8:55am

hello,

this one works. I wrote it in IntelliJ because I can’t open Katalon on my new laptop (license issue)

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.rendering.ImageType;

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

import javax.imageio.ImageIO;

public class ScreenshotFromPdf {

    public static void Pdf2Image(String html, WebDriver driver) throws IOException, InterruptedException {

        Thread.sleep(5000);
        URL url=new URL(html);
        HttpURLConnection connection=(HttpURLConnection)url.openConnection();
        InputStream is=connection.getInputStream();

        PDDocument document = PDDocument.load(is);
        PDFRenderer pdfRenderer = new PDFRenderer(document);

        for (int page = 0; page < document.getNumberOfPages(); ++page) {
            BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
            File outputFile = new File(System.getProperty("user.dir") + "/src/main/DataFiles/" + page + "image.jpg");
            ImageIO.write(bim, "jpg", outputFile);
        }
        document.close();
        driver.close();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // do something here...
        System.setProperty("webdriver.chrome.driver","C:\\Users\\xxxx\\IdeaProjects\\chromedriver.exe");
        WebDriver driver = new ChromeDriver();
        driver.get("http://www.vandevenbv.nl/dynamics/modules/SFIL0200/view.php?fil_Id=5515");
        String url = driver.getCurrentUrl();
        ScreenshotFromPdf.Pdf2Image(url, driver);
    }
}

plaidshirtakos · January 6, 2020, 11:03am

It is not working. Please try it with this file: http://aplaidshirt.epizy.com/samplePDF.pdf

I got this error message:

Exception in thread "main" java.io.IOException: Error: End-of-File, expected line
	at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1124)
	at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2595)
	at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2574)
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1222)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1122)
	at ScreenshotFromPdf.Pdf2Image(ScreenshotFromPdf.java:24)
	at ScreenshotFromPdf.main(ScreenshotFromPdf.java:43)

Timo_Kuisma1 · January 6, 2020, 11:43am

hello you,

this is a totally different scenario. You have to:
1. download the pdf file to your disk
2. open the file in a browser
3. take a screenshot

i am now implementing code for that
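
The first step (getting the PDF onto disk) could be sketched with the JDK alone, assuming the server actually returns the file to a plain HTTP request. `PdfSaver` and `saveTo` are hypothetical names, and treating a zero-byte response as an error is an assumption made for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration only.
final class PdfSaver {

    /**
     * Copies the stream to the target file, replacing any existing file.
     * Returns the number of bytes written; an empty stream is reported as an error.
     */
    static long saveTo(InputStream in, Path target) throws IOException {
        long written = Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        if (written == 0) {
            Files.deleteIfExists(target); // do not leave an empty file behind
            throw new IOException("Response body was empty, nothing saved to " + target);
        }
        return written;
    }
}
```

The stream would typically come from `new URL(address).openStream()`; the saved file can then be opened from disk in the usual way.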

plaidshirtakos · January 6, 2020, 12:15pm

I see, but the download function is disabled, so I can’t save that file.

Timo_Kuisma1 · January 6, 2020, 12:29pm

hello you,

this works if the pdf file already exists on disk

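import java.awt.Desktop;
import java.awt.image.BufferedImage;
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.*;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.PDFTextStripperByArea;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.CapabilityType;
import org.openqa.selenium.remote.DesiredCapabilities;

import javax.imageio.ImageIO;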

public class ScreenshotFromPdf {


    public static void Pdf2Image2(String html, WebDriver driver) throws IOException, InterruptedException {

        Thread.sleep(5000);

        BufferedInputStream in = new BufferedInputStream(new URL(html).openStream());

        PDDocument document = PDDocument.load(in);
        PDFRenderer pdfRenderer = new PDFRenderer(document);

        for (int page = 0; page < document.getNumberOfPages(); ++page) {
            BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
            File outputFile = new File(System.getProperty("user.dir") + "/src/main/DataFiles/" + page + "image.jpg");
            ImageIO.write(bim, "jpg", outputFile);
        }
        document.close();
        driver.close();
    }


    public static WebDriver setChromeOptions(String downloadPath){

        ChromeOptions options = new ChromeOptions();
        //String downloadPath = folder;
        //String downloadsPath = System.getProperty("user.home") + "/Downloads";
        System.out.println ("downloadpath "+downloadPath);

        Map<String, Object> chromePrefs = new HashMap<String, Object>();
        chromePrefs.put("profile.default_content_settings.popups", 0);
        chromePrefs.put("download.default_directory", downloadPath);
        chromePrefs.put("download.prompt_for_download", false);
        chromePrefs.put("plugins.plugins_disabled", "Chrome PDF Viewer");
        //options.addArguments("--headless");
        options.addArguments("--window-size=1920,1080");
        options.addArguments("--test-type");
        options.addArguments("--disable-gpu");
        options.addArguments("--no-sandbox");
        options.addArguments("--disable-dev-shm-usage");
        options.addArguments("--disable-software-rasterizer");
        options.addArguments("--disable-popup-blocking");
        options.addArguments("--disable-extensions");
        options.setExperimentalOption("prefs", chromePrefs);
        DesiredCapabilities cap = DesiredCapabilities.chrome();
        cap.setCapability(ChromeOptions.CAPABILITY, options);
        cap.setCapability(CapabilityType.ACCEPT_SSL_CERTS, true);

        System.setProperty("webdriver.chrome.driver","C:\\Users\\xxxxx\\IdeaProjects\\chromedriver.exe");
        WebDriver driver = new ChromeDriver(cap);
        return driver;
    }

    public static String getFileFromFolder(String downloadPath){

        File path = new File(downloadPath);
        File[] files = path.listFiles(new FilenameFilter() {
            @Override
            public boolean accept(File dir, String name) {
                // Automatically weeds out directories
                return name.toLowerCase().endsWith(".pdf");
            }
        });

        String fileName = "";
        for (File file : files) {
            System.out.println(file.getName());
            fileName = file.getName();
        }

        return fileName;
    }


    // Cross-platform way to open a PDF file with the default desktop application
    public static void openInBrowser(String pdfFilePath) {

        try {

            File pdfFile = new File(pdfFilePath);
            if (pdfFile.exists()) {

                if (Desktop.isDesktopSupported()) {
                    Desktop.getDesktop().open(pdfFile);
                } else {
                    System.out.println("Awt Desktop is not supported!");
                }

            } else {
                System.out.println("File does not exist!");
            }

            System.out.println("Done");

        } catch (Exception ex) {
            ex.printStackTrace();
        }

    }

    public static void main(String[] args) throws IOException, InterruptedException {

        WebDriver driver = setChromeOptions(System.getProperty("user.dir") + "\\src\\main\\DataFiles");

        //you have to find out how to click the download button
        //download the pdf file and save it to disk
        //driver.get("https://gofile.io/?c=WYPqpZ");
        //driver.findElement(By.id("fileInfoDownload")).click();
        //Thread.sleep(3000);

        //existing pdf file on disk
        String pdfDir = System.getProperty("user.dir") + "\\src\\main\\DataFiles";
        String pdfFile = getFileFromFolder(pdfDir);
        openInBrowser(pdfDir+"/"+pdfFile); //open file in edge, don't know why, set chrome as default browser
        //this url should be taken from chrome
        String pdfUrl = "file:///C:/Users/timok/IdeaProjects/JideaProjects/src/main/DataFiles/samplePDF.pdf";
        //open downloaded pdf file in browser
        driver.get(pdfUrl);
        String url2 = driver.getCurrentUrl();
        ScreenshotFromPdf.Pdf2Image2(url2, driver);

    }
}

Timo_Kuisma1 · January 6, 2020, 12:37pm

hi,

actually this works too. You need to click the page first so that the pop-up disappears;
after that the download button can be clicked
//download the pdf file and save it to disk
driver.get("https://gofile.io/?c=WYPqpZ");
driver.findElement(By.id("fileInfoDownload")).click();
Thread.sleep(3000);

Timo_Kuisma1 · January 6, 2020, 12:49pm

hi,

fixed

    //save pdf to disk
    driver.get("https://gofile.io/?c=WYPqpZ");
    Thread.sleep(5000);
    driver.findElement(By.xpath("/html/body/div[2]/div/div[10]/button[2]")).click();
    Thread.sleep(1000);
    driver.findElement(By.id("fileInfoDownload")).click();
    Thread.sleep(3000);

this method is no longer needed, so comment it out
//openInBrowser(pdfDir+"/"+pdfFile);
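
The fixed `Thread.sleep` calls after the download click could be replaced by polling the download folder. The sketch below assumes that Chrome keeps a `.crdownload` file in the folder while a download is still in progress; `DownloadWaiter`, `waitForPdf` and `findFinishedPdf` are hypothetical names, and the 250 ms polling interval is arbitrary.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.Optional;

// Hypothetical helper for illustration only.
final class DownloadWaiter {

    /** Polls the folder until a finished PDF shows up or the timeout expires. */
    static Optional<Path> waitForPdf(Path dir, Duration timeout)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            Optional<Path> found = findFinishedPdf(dir);
            if (found.isPresent() || System.nanoTime() - deadline >= 0) {
                return found;
            }
            Thread.sleep(250);
        }
    }

    /** Returns a PDF from the folder, or empty while any ".crdownload" file is present. */
    static Optional<Path> findFinishedPdf(Path dir) throws IOException {
        Path pdf = null;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString().toLowerCase();
                if (name.endsWith(".crdownload")) {
                    return Optional.empty(); // a download is still in progress
                }
                if (name.endsWith(".pdf") && Files.isRegularFile(entry)) {
                    pdf = entry;
                }
            }
        }
        return Optional.ofNullable(pdf);
    }
}
```

A call such as `DownloadWaiter.waitForPdf(Paths.get(pdfDir), Duration.ofSeconds(30))` would then return as soon as the file is complete instead of always waiting the full sleep time.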

plaidshirtakos · January 6, 2020, 12:57pm

I see, but on the site where the file is originally located, saving is disabled. I only tried to reproduce the situation with this demo file: http://aplaidshirt.epizy.com/samplePDF.pdf So please don’t use the save function in your solution, because it will not work. Please use this PDF directly, without the gofile file hosting and without the file save option: http://aplaidshirt.epizy.com/samplePDF.pdf

Timo_Kuisma1 · January 6, 2020, 12:59pm

hi,

what do you mean by saving is disabled?
I am using this site https://gofile.io/?c=WYPqpZ
and it works fine

Timo_Kuisma1 · January 6, 2020, 1:02pm

hi,

ok, with this url http://aplaidshirt.epizy.com/samplePDF.pdf use the first script I posted here

anon46315158 · January 6, 2020, 1:02pm

@plaidshirtakos I can open and save (download) the pdf file from the link you provided.
so … what exactly is disabled?

plaidshirtakos · January 6, 2020, 1:15pm

@anon46315158 : This isn’t the original file; the original file can’t be downloaded. I just reproduced a PDF with the same details.

plaidshirtakos · January 6, 2020, 1:16pm

I tried both, but got the same error. I am also attaching evidence.

Timo_Kuisma1 · January 6, 2020, 1:20pm

hi,

debug where the issue is

anon46315158 · January 6, 2020, 1:20pm

got it, so in fact the issue is with the original URL you are trying to access, not with the pdf itself.
can you share that URL? (i think not, just asking … but worth a try)

plaidshirtakos · January 6, 2020, 1:49pm

I see it: the InputStream is empty, but I don’t know why. The same code works with your PDF, but not with mine.
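
To find out why the stream is empty, it may help to look at what the server actually answers before PDFBox is involved. The sketch below uses only `HttpURLConnection` and reports the status code, the content type and the body size; `ResponseProbe` and `probe` are hypothetical names, and the output format is just an example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical diagnostic helper for illustration only.
final class ResponseProbe {

    /** Requests the address and summarizes status code, content type and body size. */
    static String probe(String address) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(address).openConnection();
        try {
            int status = connection.getResponseCode();
            String type = connection.getContentType();
            // For error responses the body is only available through getErrorStream().
            InputStream body = status >= 400
                    ? connection.getErrorStream()
                    : connection.getInputStream();
            long bytes = 0;
            if (body != null) {
                try (InputStream in = body) {
                    byte[] buffer = new byte[8192];
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        bytes += n;
                    }
                }
            }
            return "status=" + status + " type=" + type + " bytes=" + bytes;
        } finally {
            connection.disconnect();
        }
    }
}
```

A redirect status, a `text/html` content type or `bytes=0` would each point to a different cause than a problem inside the PDF parser.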

Timo_Kuisma1 · January 6, 2020, 1:56pm

hi,

issue is here
http://aplaidshirt.epizy.com/samplePDF.pdf?i=1

the real url has a query parameter too: ?i=1

need to investigate it

problem is here
PDDocument document = PDDocument.load(is);
it’s a parser issue, maybe the ?i= part needs to be encoded, will try
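
Whether encoding would help can be checked without any network call by looking at how the JDK splits the address and what encoding would do to the `?i=1` part. This is only a sketch; `QueryInspection`, `describe` and `formEncode` are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

// Hypothetical helper for illustration only.
final class QueryInspection {

    /** Returns the path and query of an address as "path | query". */
    static String describe(String address) {
        URI uri = URI.create(address);
        return uri.getPath() + " | " + uri.getQuery();
    }

    /** Percent-encodes a value the way HTML form data is encoded. */
    static String formEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }
}
```

`describe("http://aplaidshirt.epizy.com/samplePDF.pdf?i=1")` gives `/samplePDF.pdf | i=1`, so the address is already well formed. `formEncode("?i=1")` gives `%3Fi%3D1`, which would no longer act as a query parameter if it were appended to the address, so encoding is unlikely to be the fix.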

plaidshirtakos · January 6, 2020, 2:47pm

I tried with this ending too, but it has the same effect: I get the error message. The PDF file is opened in both cases.
