Need help getting all text from a PDF - Firefox

Hi all,
I’m testing a questionnaire and want to verify that all the test data I’m throwing at it is actually saving. The final product of the questionnaire is a long PDF, and I want to open it and check that it’s got everything (well, at least that all the strings saved somewhere).

I’ve got all the evaluation logic running but am really having trouble consistently grabbing all the text from the PDF. In Chrome, Ctrl+A followed by Ctrl+C has been sufficient to get everything onto the clipboard, but Firefox has been giving me endless problems. Things I’ve attempted:

WebUI.getText(pageObj) - Only grabs the first bit of the PDF, since Firefox won’t load everything until you scroll through the document
select all + copy - same issue as above

Attempts to scroll through the document so that the above methods will work:
WebUI.scrollToPosition() - never seemed to work, possibly because the PDF was opening as a new browser window
WebUI.switchToWindowIndex() - had to guess and check on the index, and even after switching, while getText would grab from the top of the PDF, scrollToPosition wouldn’t do anything
WebUI.sendKeys(pageObj, Keys.chord(Keys.PAGE_DOWN)) - worked, but only if I manually clicked on the PDF
WebUI.click(pageObj) - no help at all
WebUI.clickOffset(pageObj, offX, offY) - thought maybe clicking the only object on the page wasn’t working because its position was somehow wonky, but I haven’t found any offset that works
bot.mouseMove(300, 300)
WebUI.delay(1)
bot.mousePress(InputEvent.BUTTON1_DOWN_MASK)
bot.mouseRelease(InputEvent.BUTTON1_DOWN_MASK)
This robot worked! I was so happy when I found this, even though it was really annoying to have running while I was doing other things and I had to leave some screen real estate for the tests. The robot clicked on the PDF, sendKeys made it scroll and load the whole thing, then getText could grab it and go.

However, I’ve now found out that the robot doesn’t work while my screen is locked. Since these tests are meant to run ~20 times overnight, that solution is no good.

Can anyone think of other tactics to try? I just can’t figure out how to get the PDF to reliably accept key commands, or how to get all of its contents without that.

Firefox (as far as I know) uses PDF.js (mozilla/pdf.js on GitHub). If you’re having legitimate issues testing PDF documents with it, you might want to raise an issue on GitHub.

Chrome (again, as far as I know) uses a compiled PDF engine built into the browser (PDFium) rather than a JavaScript viewer, which clearly has advantages since the PDF is rendered almost natively.

That said, you could try applying a plugin to Firefox to hopefully give you better control:

However, with that plugin, you’re not really using Firefox itself. But then again, you’re not meant to be finding flaws in pdf.js either. Your call.


If I’m correctly remembering the earlier steps in building this stuff, when I was starting a test by opening a static example PDF from its address and reading it, it wasn’t having all of these issues. Most of the headaches have arisen after opening the real PDF from its button on a webpage; something about hopping to that new window doesn’t seem to be catching on quite right. So I’m hesitant to say whether or not these problems are coming from the PDF viewer itself.

Definitely going to give this a try, since my Chrome results have been a lot smoother. Thanks a bunch for the idea, will post results later!
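
On the window-hopping point: rather than guessing the index for WebUI.switchToWindowIndex(), one option might be to record the set of window handles before clicking the PDF button and compare it with the set afterwards. The sketch below is only that comparison step in plain Java (which Groovy scripts can also call). NewWindowFinder and findNewHandle are hypothetical names made up for illustration, not part of Katalon or Selenium, and the handle strings in main are stand-in values.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper (not part of Katalon or Selenium): works out which
// window handle appeared after an action such as clicking the PDF button.
final class NewWindowFinder {

    // Returns a handle that is present in `after` but not in `before`, if any.
    static Optional<String> findNewHandle(Set<String> before, Set<String> after) {
        Set<String> added = new LinkedHashSet<>(after);
        added.removeAll(before);
        // If several windows opened, the first one in iteration order is returned.
        return added.stream().findFirst();
    }

    public static void main(String[] args) {
        // Stand-in values; in a real test these sets would come from
        // driver.getWindowHandles() before and after the click.
        Set<String> before = Collections.singleton("main-window");
        Set<String> after = new LinkedHashSet<>(before);
        after.add("pdf-window");
        System.out.println(findNewHandle(before, after).orElse("no new window"));
    }
}
```

In a real test the two sets would presumably come from Selenium's driver.getWindowHandles(), and the result would be passed to driver.switchTo().window(handle). The new window may take a moment to appear, so the "after" snapshot might need to be retried until a new handle shows up. This only addresses finding the right window; it does not by itself make Firefox load the rest of the PDF.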
