Visual Inspection of Excel and PDF

I have published a GitHub project:


Motivation

A few years ago, I worked for a media company. I was asked to develop a tool software, which was a sort of web scraping tool. The target page was this; it is still available to public at MAR 2022. The page looks as follows:

target

You don’t need to understand this page in detail except the following points.

  1. The page contains several links <a href="…​">EXCEL</a> tags to the URLs of Excel files. It contains links to PDF files <a href="…​">PDF</a> as well.
  2. The content of Excel & PDF files are updated by the publisher irregularly. The frequency of updates are not determined. By tracking the record, I see once a month, or once per 2 months.
  3. The URL string of Excel/PDF files are fixed. The file names won’t change at updates.
  4. The publisher does not provide any push-style notification (like RSS) for this page. The publisher expects that those who are interested in the information of this page should continue watching the page as often as possible.
  5. The company I worked for had a serious interest in the Excel files. Some employees were asked to visit this page everyday. They were asked, when found updates, to trigger an action to process the Excel files.
  6. The employees hated this job. They wanted some system to automate this bullshit job.

Since then I worked long to solve this problem. Finally I have got a solution. Let me describe it.

Problem to solve

My solution should:

  1. visit the web page regularly and automatically, for example once a day
  2. compare the content of Excel files visually:
  • download the current version of .xlsx files, compare the content with some version to find if any updates there are.
  • report the result of comparison in a human-readable format so that the staff can check what occured.
  • store the previous versions of .xlsx file somewhere so that the software can compare the current one against.
  1. If any updates found, take some action. For example, transfer the .xlsx files into some organizational file server and send some notification to those who are concerned: send messages to MQ and E-mail etc.

The 1st problem (running a process regularly and automatically) will be implemented using Linux cron, Windows Task Scheduler, and some Continuous Integration servers like Jenkins.

The 3rd problem (taking some actions when your script find it necessary) will be solved by straight-forward custom programming. This would deserve another planning and efforts.

Here I will focus on the 2nd problem. How can my software compare 2 Excel files? How can it detect that the current Excel is updated since the last? How can it present the difference between the current and previous Excel files? My Visual Inspection in Katalon Studio showed that my product is capable of comparing pairs of PNG images and pairs of text files regardless its format — HTML, XML, JSON, CSV, CSS, JS, etc.

However, .xlsx and .pdf are binary files. Any binary files require some application programs that recognizes its internal file format. My text differ module in the “materialstore” library can not deal with those binary files.

Solution

In this project I will show you a example of comparing 2 Excel files — chronologically previous one and current one.

For this project, I developed a group of classes categorized as "Mapper" :

The current "Mapper"s includes the following:

  1. Excel2CSVMapper
  2. PDF2ImageMapper
  3. PDF2HTMLMapper
  4. RSSAmznPress2ExcelMapper

"Mapper"s will be driven by the framework of

Description

Target Application Under Test

I won’t use this page as the testbed for demonstration, because

  1. this page is not updated frequent enough. It will be updated only once per a month at most. I want a URL as testbed which is updated more frequently; I want once per 1 or 2 days.
  2. this page is owned by a governmental organization of JP. I do not like bothering them.

Instead I would use the following URL as the testbed for my demonstration:

This URL provides a RSS feed in XML format, is updated more frequently (several times on Wed, Thu, Fri). And I believe that the publisher would not stop me accessing it using my automated software.

Demonstration

How to activate the demo

open the Test Suite "Test Suite/Patrol/TS_AmznPress" and run it.

TS

Sequence diagram

The following diagram illustrates the process sequence of Test Case/main/AmznPress/Main_Chronos

sequence

How the data is transformed, explained step-by-step

The sole input is the RSS document published by Amazon.com. The contents will change day by day. A snapshot looks like this:

<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0" xml:base="https://press.aboutamazon.com/">
  <channel>
    <title>Amazon.com, Inc. - Press Room News Releases</title>
    <link>https://press.aboutamazon.com/</link>
    <description>Amazon.com, Inc. - Press Room News Releases</description>
    <language>en</language>
    ...
    <item>
      <title>Amazon Announces Partnerships with Universities and Colleges in Texas to Fully Fund Tuition for Local Hourly Employees</title>
      <link>https://press.aboutamazon.com/news-releases/news-release-details/amazon-announces-partnerships-universities-and-colleges-texas</link>
      <description>Amazon employees in the U.S. will benefit from new Career Choice partnerships with more than 140 Universities and Colleges including several colleges and universities in Texas as well as national non-profit online providers Southern New Hampshire University , Colorado State University – Global,</description>
      <pubDate>Thu, 03 Mar 2022 12:45:00 -0500</pubDate>
      <dc:creator>Amazon.com, Inc. - Press Room News Releases</dc:creator>
      <guid isPermaLink="false">31586</guid>
    </item>
    ...

This RSS XML document will be internally converted into an Excel xlsx file, like this

Example Excel

And then the Excel file will be converted into a CSV text file, like this

publishedDate,uri,title,link,description,author
Sat Mar 05 10:00:00 JST 2022,31591,Amazon travaille en collaboration avec des ONG et ses employés pour offrir un soutien immédiat au peuple ukrainien,https://press.aboutamazon.com/news-releases/news-release-details/amazon-travaille-en-collaboration-avec-des-ong-et-ses-employes,"Comme beaucoup d'entre vous à travers le monde, nous observons ce qui se passe en Ukraine avec horreur, inquiétude et cœur lourds. Bien que nous n’ayons pas d'activité commerciale directe en Ukraine, plusieurs de nos employés et partenaires sont originaires de ce pays ou entretiennent un lien","Amazon.com, Inc. - Press Room News Releases"
Fri Mar 04 02:45:00 JST 2022,31586,Amazon Announces Partnerships with Universities and Colleges in Texas to Fully Fund Tuition for Local Hourly Employees,https://press.aboutamazon.com/news-releases/news-release-details/amazon-announces-partnerships-universities-and-colleges-texas,"Amazon employees in the U.S. will benefit from new Career Choice partnerships with more than 140 Universities and Colleges including several colleges and universities in Texas as well as national non-profit online providers Southern New Hampshire University , Colorado State University – Global,","Amazon.com, Inc. - Press Room News Releases"
...

Report

The Test Suites/Patrol/TS_AmznPress will eventually generate a report in HTML format. In this report you can see the result of visual comparison of 2 CSV files, which is equal to the comparison of 2 Excel files.

report

How the test script is written

Main script

The AmznPress/Main_Chronos script is the entry point of overall processing.

It drives sub modules, which includes broadly 4 stages of processing.

  1. Materialize stage
  2. Map stage
  3. Reduce stage
  4. Report stage

The script as “Materialize state” will get access to the target URL, download the web resources (e.g, RSS XML file), save it into the “store” directory on disk. The files stored in the “store” is called “Material”.

The script as “Map stage” will read a Material from the store, and write back a Material into the store. It will carry out a series of format conversion. E.g, RSS XML to Excel, Excel to CSV, PDF to PNG image, PDF to HTML.

It is likely the case that a single script processes both of “Materialize stage” and “Map stage” sequentially. Also it is likely the case that a single script processes multiple source URLs and iterate over them.

The “Materialize stage” and “Map stage” is heavily dependent on each use cases. You are supposed to design these stages and implement them for yourself.

The “Reduce stage” will construct 2 sets of “MaterialList” = list of files to compare. And it will compare the pairs of Materials to generate the diff information.

The “Report stage” will compile a report of comparison result.

The “Reduce stage” and “Report stage” of the sample are highly modularized, and possibly you would just reuse the sample script. It should work for you.

Sub modules

External dependencies

This project uses various open-source products, such as “Apache Commons IO”, “java-diff-utils”, “jsoup”, “Apache POI”. For the detail about the dependencies, please check the build.gradle .

Applicability to your cases

The “materialstore” and “materialstore-mapper” library are general purposed Groovy library. You should be able to apply them to any URL to solve your problems.

Only the “Materialize stage” depends on the Selenium-java API in order to get access to Web resources through web browser. The “Reduce stage” and “Report stage” have no dependency on Selenium.

I used Katalon Studio for build a demonstration. But these 2 libraries do not depend on Katalon Studio API at all. You can build a Patrol using Gradle/Maven/Ant with the “materialstore” + “materialstore-mapper”.

Conclusion

The employees of the company I worked a few years ago required a software tool that could automate visiting this URL, check if the Excel files are updated or not, and if any changes are found, fire some data processing action. I would name this type of tool as Patrol . Now I am finally ready to build a Patrol for them. However, I am too late, I quit the job already. I regret that I can not help them.

Appendix

2 Likes