Taking screenshots from PDF file with Apache PDFBox

I use PDFBox to generate an image from every page of a PDF file. The file is given as a URL.

public String Pdf2Image(String html, WebDriver driver){
    URL url=new URL(GlobalVariable.url);
    HttpURLConnection connection=(HttpURLConnection)url.openConnection();
    InputStream is=connection.getInputStream();

    PDDocument document = PDDocument.load(is);
    PDFRenderer pdfRenderer = new PDFRenderer(document);
    for (int page = 0; page < document.getNumberOfPages(); ++page) {
        BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
        ImageIOUtil.writeImage(bim, GlobalVariable.fileAzon + "-" + (page+1) + ".png", 300);
    }
    document.close();
}

It doesn’t work for PDF files whose text content can’t be copied and for which printing is disabled. I get the following error message during execution. I couldn’t get the InputStream.

org.codehaus.groovy.runtime.InvokerInvocationException: java.io.IOException: Error: End-of-File, expected line
    at com.pdf.reader.ReadPdfFromBrowser.invokeMethod(ReadPdfFromBrowser.groovy)
    at com.kms.katalon.core.main.CustomKeywordDelegatingMetaClass.invokeStaticMethod(CustomKeywordDelegatingMetaClass.java:50)
    at print.run(print:10)
    at com.kms.katalon.core.main.ScriptEngine.run(ScriptEngine.java:194)
    at com.kms.katalon.core.main.ScriptEngine.runScriptAsRawText(ScriptEngine.java:119)
    at com.kms.katalon.core.main.TestCaseExecutor.runScript(TestCaseExecutor.java:337)
    at com.kms.katalon.core.main.TestCaseExecutor.doExecute(TestCaseExecutor.java:328)
    at com.kms.katalon.core.main.TestCaseExecutor.processExecutionPhase(TestCaseExecutor.java:307)
    at com.kms.katalon.core.main.TestCaseExecutor.accessMainPhase(TestCaseExecutor.java:299)
    at com.kms.katalon.core.main.TestCaseExecutor.execute(TestCaseExecutor.java:233)
    at com.kms.katalon.core.main.TestCaseMain.runTestCase(TestCaseMain.java:114)
    at com.kms.katalon.core.main.TestCaseMain.runTestCase(TestCaseMain.java:105)
    at com.kms.katalon.core.main.TestCaseMain$runTestCase$0.call(Unknown Source)
    at TempTestCase1578260748461.run(TempTestCase1578260748461.groovy:23)
Caused by: java.io.IOException: Error: End-of-File, expected line
    at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1124)
    at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2603)
    at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2574)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1222)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1122)
    at org.apache.pdfbox.pdmodel.PDDocument$load.call(Unknown Source)
    at com.pdf.reader.ReadPdfFromBrowser.Pdf2Image(ReadPdfFromBrowser.groovy:72)
    ... 14 more

UPDATE: Here is a sample PDF with which I tried to reproduce the original PDF document: https://gofile.io/?c=WYPqpZ

Hello,

This one is working. I did it in IntelliJ because I am not able to open Katalon on my new laptop (license issue).

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.rendering.ImageType;

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

import javax.imageio.ImageIO;

public class ScreenshotFromPdf {

    public static void Pdf2Image(String html, WebDriver driver) throws IOException, InterruptedException {

        Thread.sleep(5000);
        URL url=new URL(html);
        HttpURLConnection connection=(HttpURLConnection)url.openConnection();
        InputStream is=connection.getInputStream();

        PDDocument document = PDDocument.load(is);
        PDFRenderer pdfRenderer = new PDFRenderer(document);

        for (int page = 0; page < document.getNumberOfPages(); ++page) {
            BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
            File outputFile = new File(System.getProperty("user.dir") + "/src/main/DataFiles/" + page + "image.jpg");
            ImageIO.write(bim, "jpg", outputFile);
        }
        document.close();
        driver.close();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // do something here...
        System.setProperty("webdriver.chrome.driver","C:\\Users\\xxxx\\IdeaProjects\\chromedriver.exe");
        WebDriver driver = new ChromeDriver();
        driver.get("http://www.vandevenbv.nl/dynamics/modules/SFIL0200/view.php?fil_Id=5515");
        String url = driver.getCurrentUrl();
        ScreenshotFromPdf.Pdf2Image(url, driver);
    }
}

It is not working. Please try it with this file: http://aplaidshirt.epizy.com/samplePDF.pdf

I got this error message:

Exception in thread "main" java.io.IOException: Error: End-of-File, expected line
	at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1124)
	at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2595)
	at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2574)
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1222)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1122)
	at ScreenshotFromPdf.Pdf2Image(ScreenshotFromPdf.java:24)
	at ScreenshotFromPdf.main(ScreenshotFromPdf.java:43)

Hello,

This is a totally different scenario. You have to:
1. download the PDF file to your disk
2. open the file in a browser
3. take a screenshot

I am now implementing code for that.
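
For the first step (getting the PDF onto disk), here is a minimal sketch using only the JDK, without going through the browser's download button. The class name PdfToDiskSketch and the method saveToTempFile are made up for this example, and it assumes the server actually returns the PDF bytes to a plain Java connection, which may not hold for every site.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of PDFBox or Katalon.
final class PdfToDiskSketch {

    // Copies everything from the stream into a new temporary .pdf file
    // and returns the path of that file.
    static Path saveToTempFile(InputStream in) {
        try {
            Path target = Files.createTempFile("downloaded-", ".pdf");
            // createTempFile already created an empty file, so overwrite it.
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java PdfToDiskSketch <pdf-url>");
            return;
        }
        try (InputStream in = new URL(args[0]).openStream()) {
            Path saved = saveToTempFile(in);
            System.out.println("Saved " + Files.size(saved) + " bytes to " + saved);
        }
    }
}
```

If the saved file has a plausible size, it could then be opened with PDDocument.load(saved.toFile()) and rendered as in the earlier script. If it is 0 bytes, the problem is in the download, not in PDFBox.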

I see, but the download function is disabled; I can’t save that file.

hello you,

This works if we have an existing PDF file on disk:

import java.awt.Desktop;
import java.awt.image.BufferedImage;
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.CapabilityType;
import org.openqa.selenium.remote.DesiredCapabilities;

import javax.imageio.ImageIO;

public class ScreenshotFromPdf {


    public static void Pdf2Image2(String html, WebDriver driver) throws IOException, InterruptedException {

        Thread.sleep(5000);

        BufferedInputStream in = new BufferedInputStream(new URL(html).openStream());

        PDDocument document = PDDocument.load(in);
        PDFRenderer pdfRenderer = new PDFRenderer(document);

        for (int page = 0; page < document.getNumberOfPages(); ++page) {
            BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
            File outputFile = new File(System.getProperty("user.dir") + "/src/main/DataFiles/" + page + "image.jpg");
            ImageIO.write(bim, "jpg", outputFile);
        }
        document.close();
        driver.close();
    }


    public static WebDriver setChromeOptions(String downloadPath){

        ChromeOptions options = new ChromeOptions();
        //String downloadPath = folder;
        //String downloadsPath = System.getProperty("user.home") + "/Downloads";
        System.out.println ("downloadpath "+downloadPath);

        Map<String, Object> chromePrefs = new HashMap<String, Object>();
        chromePrefs.put("profile.default_content_settings.popups", 0);
        chromePrefs.put("download.default_directory", downloadPath);
        chromePrefs.put("download.prompt_for_download", false);
        chromePrefs.put("plugins.plugins_disabled", "Chrome PDF Viewer");
        //options.addArguments("--headless");
        options.addArguments("--window-size=1920,1080");
        options.addArguments("--test-type");
        options.addArguments("--disable-gpu");
        options.addArguments("--no-sandbox");
        options.addArguments("--disable-dev-shm-usage");
        options.addArguments("--disable-software-rasterizer");
        options.addArguments("--disable-popup-blocking");
        options.addArguments("--disable-extensions");
        options.setExperimentalOption("prefs", chromePrefs);
        DesiredCapabilities cap = DesiredCapabilities.chrome();
        cap.setCapability(ChromeOptions.CAPABILITY, options);
        cap.setCapability(CapabilityType.ACCEPT_SSL_CERTS, true);

        System.setProperty("webdriver.chrome.driver","C:\\Users\\xxxxx\\IdeaProjects\\chromedriver.exe");
        WebDriver driver = new ChromeDriver(cap);
        return driver;
    }

    public static String getFileFromFolder(String downloadPath){

        File path = new File(downloadPath);
        File[] files = path.listFiles(new FilenameFilter() {
            @Override
            public boolean accept(File dir, String name) {
                // Automatically weeds out directories
                return name.toLowerCase().endsWith(".pdf");
            }
        });

        String fileName = "";
        for (File file : files) {
            System.out.println(file.getName());
            fileName = file.getName();
        }

        return fileName;
    }


    //Cross platform solution to view a PDF file
    public static void openInBrowser(String pdfFilePath) {

        try {

            File pdfFile = new File(pdfFilePath);
            if (pdfFile.exists()) {

                if (Desktop.isDesktopSupported()) {
                    Desktop.getDesktop().open(pdfFile);
                } else {
                    System.out.println("Awt Desktop is not supported!");
                }

            } else {
                System.out.println("File is not exists!");
            }

            System.out.println("Done");

        } catch (Exception ex) {
            ex.printStackTrace();
        }

    }

    public static void main(String[] args) throws IOException, InterruptedException {

        WebDriver driver = setChromeOptions(System.getProperty("user.dir") + "\\src\\main\\DataFiles");

	//you have to find out how to click download button
        //save and download pdf file from disk
        //driver.get("https://gofile.io/?c=WYPqpZ");
        //driver.findElement(By.id("fileInfoDownload")).click();
        //Thread.sleep(3000);

	//existing pdf file in disk
        String pdfDir = System.getProperty("user.dir") + "\\src\\main\\DataFiles";
        String pdfFile = getFileFromFolder(pdfDir);
        openInBrowser(pdfDir+"/"+pdfFile); //open file in edge, don't know why, set chrome as default browser
        //this url should be get from chrome
        String pdfUrl = "file:///C:/Users/timok/IdeaProjects/JideaProjects/src/main/DataFiles/samplePDF.pdf";
        //open downloaded pdf file in browser
        driver.get("file:///C:/Users/timok/IdeaProjects/JideaProjects/src/main/DataFiles/samplePDF.pdf");
        String url2 = driver.getCurrentUrl();
        ScreenshotFromPdf.Pdf2Image2(url2, driver);

    }
}

Hi,

Actually this works too. You need to click on the page first so that the pop-up disappears; after that you can click the download button.
//download the pdf file and save it to disk
driver.get("https://gofile.io/?c=WYPqpZ");
driver.findElement(By.id("fileInfoDownload")).click();
Thread.sleep(3000);

hi,

fixed

    //save pdf to disk
    driver.get("https://gofile.io/?c=WYPqpZ");
    Thread.sleep(5000);
    driver.findElement(By.xpath("/html/body/div[2]/div/div[10]/button[2]")).click();
    Thread.sleep(1000);
    driver.findElement(By.id("fileInfoDownload")).click();
    Thread.sleep(3000);

This method is no longer needed, so comment it out:
//openInBrowser(pdfDir+"/"+pdfFile);

I see, but on the site where the file is originally located, saving is disabled. I only tried to reproduce the situation with this demo file: http://aplaidshirt.epizy.com/samplePDF.pdf. So please don’t use the save function in your solution, because it will not work. Please use this PDF directly, without the gofile file hosting and without the file save option.

hi,

What do you mean by saving being disabled?
I am using this site https://gofile.io/?c=WYPqpZ
and it works fine.

hi,

OK, with this URL http://aplaidshirt.epizy.com/samplePDF.pdf use my first script that was posted here.

@plaidshirtakos I can open and save (download) the pdf file from the link you provided.
so … what exactly is disabled?

@bionel: This isn’t the original file; the original file can’t be downloaded. I just reproduced a PDF with the same details.

I tried both, but got the same error. I am also attaching evidence.

hi,

Debug it to find where the issue is.
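
One way to narrow it down: the stack trace shows the exception being thrown while PDFBox parses the file header, which would be consistent with a response body that is empty or is not a PDF at all (for example an HTML page). The sketch below prints the HTTP status, the content type, and whether the first bytes look like a PDF before anything is handed to PDDocument.load. The class PdfResponseProbe and its methods are made up for this example, and checking only the very first bytes for "%PDF-" is a simple heuristic.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical diagnostic helper, not part of PDFBox.
final class PdfResponseProbe {

    private static final byte[] MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True when the given bytes begin with the "%PDF-" marker.
    static boolean looksLikePdf(byte[] head) {
        if (head.length < MARKER.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, MARKER.length), MARKER);
    }

    // Reads up to max bytes from the stream (fewer if the stream ends early).
    static byte[] readHead(InputStream in, int max) throws IOException {
        byte[] buffer = new byte[max];
        int total = 0;
        while (total < max) {
            int n = in.read(buffer, total, max - total);
            if (n < 0) {
                break; // end of stream
            }
            total += n;
        }
        return Arrays.copyOf(buffer, total);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java PdfResponseProbe <pdf-url>");
            return;
        }
        HttpURLConnection connection = (HttpURLConnection) new URL(args[0]).openConnection();
        int status = connection.getResponseCode();
        System.out.println("HTTP status:  " + status);
        System.out.println("Content-Type: " + connection.getContentType());

        // For error responses the body is on the error stream, which may be null.
        InputStream body = status < 400 ? connection.getInputStream() : connection.getErrorStream();
        byte[] head = new byte[0];
        if (body != null) {
            try (InputStream in = body) {
                head = readHead(in, 64);
            }
        }
        System.out.println("Bytes read:   " + head.length);
        System.out.println("Looks like a PDF: " + looksLikePdf(head));
        System.out.println("Start of body: " + new String(head, StandardCharsets.ISO_8859_1));
        connection.disconnect();
    }
}
```

If "Bytes read" is 0 or the start of the body is HTML, the server is not sending the PDF to this connection, and no PDFBox setting will change that.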

Got it, so in fact the issue is with the original URL you are trying to access, not with the PDF itself.
Can you share that URL? (I think not, just asking, but it is worth a try.)

I see it: the InputStream is empty, but I don’t know why. The same code works with your PDF, but not with mine.
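
One difference that might matter: the scripts in this thread open the page with WebDriver but then fetch the bytes through a separate HttpURLConnection, and that connection does not carry over any cookies from the browser session. Whether this particular server needs such cookies is only an assumption, nothing in this thread confirms it. If it does, a sketch like the following could forward them. CookieForwardingSketch and its methods are made up for this example, and the cookie names in main are placeholders. In Selenium the real values could be collected from driver.manage().getCookies() using getName() and getValue().

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper; assumes the server expects the browser's cookies.
final class CookieForwardingSketch {

    // Builds the value of a Cookie request header: "name1=value1; name2=value2".
    static String cookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    // Opens a connection that sends the given cookies along with the request.
    static HttpURLConnection openWithCookies(String address, Map<String, String> cookies)
            throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(address).openConnection();
        if (!cookies.isEmpty()) {
            connection.setRequestProperty("Cookie", cookieHeader(cookies));
        }
        return connection;
    }

    public static void main(String[] args) {
        // Placeholder cookies, only to show the header format.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("session", "abc123");
        cookies.put("lang", "en");
        System.out.println("Cookie: " + cookieHeader(cookies));
    }
}
```

The stream from openWithCookies(url, cookies).getInputStream() could then be passed to PDDocument.load in place of the plain connection's stream.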

hi,

The issue is here:
http://aplaidshirt.epizy.com/samplePDF.pdf?i=1

The real URL also has a query parameter, ?i=1.

This needs investigating.

The problem is here:
PDDocument document = PDDocument.load(is);
It’s a parser issue; maybe the ?i= part needs to be encoded. I will try.
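
On the encoding idea: the "?i=1" ending is an ordinary query string made of characters that are legal in a URL as they are, so percent-encoding should not be needed. A small sketch with java.net.URI shows how the address splits into path and query. The class UrlQuerySketch is made up for this example, and it only shows that the URL parses cleanly; it does not explain what the server sends back.

```java
import java.net.URI;

// Hypothetical helper for inspecting the parts of a URL.
final class UrlQuerySketch {

    // Returns the path part, e.g. "/samplePDF.pdf".
    static String pathOf(String address) {
        return URI.create(address).getPath();
    }

    // Returns the query part without the leading '?', or null if there is none.
    static String queryOf(String address) {
        return URI.create(address).getQuery();
    }

    public static void main(String[] args) {
        String address = "http://aplaidshirt.epizy.com/samplePDF.pdf?i=1";
        System.out.println("path:  " + pathOf(address));
        System.out.println("query: " + queryOf(address));
    }
}
```

Since the URL is already well formed, the empty stream is more likely caused by how the server answers a non-browser request than by the way the address is written.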

I tried with this ending too, but it has the same effect: I got the error message. The PDF file is opened in both cases.