Working with Page Titles and Content with Special Characters

Hi,

I’ve been searching through the Web Testing forums to find the best practice for dealing with web content where Katalon appears to do one of the following:

- Does not recognize a character.
- Considers it a special character.
- Treats it as a reserved character (where you need some sort of escape character).

That being said, in my example below I am reading through a CSV file and then comparing the Page Title against the https://duckduckgo.com/ website.

When Katalon reads my CSV file, it does not appear to recognize the dash-like character in the title.

This is an example of my CSV file:

Test_Title,Test_URL,Test_Page_Content
“DuckDuckGo — Privacy, simplified.”,https://duckduckgo.com/,Tired of being tracked online? We can help.

I copied the content directly via Google Chrome Developer Tools (Copy element).

This is an example of the failed compare message when the character is not recognized:

Test Cases/1. Test Setup/1. Smoke Test - URL, Page Title and Content check from a file FAILED.
Reason:
Assertion failed: 

assert WebUI.getWindowTitle() == Test_Title
             |                |  |
             |                |  DuckDuckGo � Privacy, simplified.
             |                false
             DuckDuckGo — Privacy, simplified.

Any recommendations or documentation I should read to handle situations like this are welcome.

Thank you.

Perhaps it’s a bug - but I’m not sure what encoding Katalon expects/supports for text files.

Do you happen to know what encoding your csv is using? utf-8? ANSI? Are you able to change it, or are you not in control of its production?
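If you are not sure which encoding the CSV uses, one rough way to check is to decode the raw bytes strictly as UTF-8 and see whether that fails. The sketch below is only an illustration: the class name Utf8Check and the in-memory sample bytes are made up for this post, and in practice you would pass in the bytes of your own file (for example from Files.readAllBytes). Note that a pure ASCII file also passes this check, so a "true" result only says the bytes are not invalid as UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Katalon.
final class Utf8Check {

    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of inserting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String title = "DuckDuckGo \u2014 Privacy, simplified.";

        // The same text saved two different ways.
        byte[] savedAsUtf8 = title.getBytes(StandardCharsets.UTF_8);
        byte[] savedAsAnsi = title.getBytes(Charset.forName("windows-1252"));

        System.out.println("UTF-8 bytes valid as UTF-8: " + isValidUtf8(savedAsUtf8));
        System.out.println("ANSI bytes valid as UTF-8:  " + isValidUtf8(savedAsAnsi));
    }
}
```

If the check fails for your file, re-saving it as UTF-8 in your editor would be the first thing to try.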

There are several hyphen-like characters. Please learn the following:

@Randy.Ramkissoon

You showed the following string contained in your CSV file:

DuckDuckGo — Privacy, simplified.

In this string, there is a U+2014 — EM DASH.

On the other hand, the title of DuckDuckGo site contains U+002D - HYPHEN-MINUS.

These 2 characters are different.
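When two strings look the same on screen but an equality check fails, it can help to find the first position where they differ and print the code units there. This is a minimal sketch with a made-up class name (StringDiff) and two sample strings, one with U+2014 and one with U+002D. It is not taken from Katalon.

```java
// Hypothetical helper for illustration only.
final class StringDiff {

    /** Returns the index of the first differing UTF-16 code unit, or -1 if the strings are equal. */
    static int firstDifference(String a, String b) {
        int common = Math.min(a.length(), b.length());
        for (int i = 0; i < common; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        // Same prefix: equal strings give -1, otherwise the shorter length is the first difference.
        return a.length() == b.length() ? -1 : common;
    }

    public static void main(String[] args) {
        String withEmDash = "DuckDuckGo \u2014 Privacy, simplified.";
        String withHyphen = "DuckDuckGo - Privacy, simplified.";

        int i = firstDifference(withEmDash, withHyphen);
        System.out.printf("first difference at index %d: U+%04X vs U+%04X%n",
                i, (int) withEmDash.charAt(i), (int) withHyphen.charAt(i));
    }
}
```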

Why does your CSV file contain U+2014 — [EM DASH] rather than U+002D - [HYPHEN-MINUS]?
I do not know. Most probably you typed it manually, though you may not be aware of it.

How did I find this?

I will tell you.

I made a custom Keyword: my.StringUtils:

package my

public class StringUtils {

	/**
	 * Convert the input string while escaping all non-ASCII characters,
	 * i.e. every UTF-16 code unit whose value is 128 or larger.
	 *
	 * E.g.,
	 * String s = "Hello\u2010world"
	 * println(s)                                    // --> "Hello‐world"
	 * println(StringUtils.escapeNonAsciiChars(s))   // --> "Hello\u2010world"
	 */
	static String escapeNonAsciiChars(String str) {
		StringBuilder sb = new StringBuilder()
		for (int i = 0; i < str.length(); i++) {
			// Work per UTF-16 code unit, not per code point, so that a
			// character outside the BMP is escaped as its surrogate pair
			// instead of being emitted twice with a 5-digit value.
			char c = str.charAt(i)
			if (c < 128) {
				sb.append(c)
			} else {
				sb.append(String.format("\\u%04X", (int) c))
			}
		}
		return sb.toString()
	}
}

I made a Test Case TC1:

def Test_Title = 'DuckDuckGo — Privacy, simplified.'
def escapedTitle = my.StringUtils.escapeNonAsciiChars(Test_Title)
println escapedTitle

When I ran the test case, I got the following output in the console.

2021-02-11 09:52:05.134 INFO  c.k.katalon.core.main.TestCaseExecutor   - START Test Cases/checkDuckDuckGoPageTitle
2021-02-11 09:52:06.901 DEBUG testcase.checkDuckDuckGoPageTitle        - 1: Test_Title = "DuckDuckGo — Privacy, simplified."
2021-02-11 09:52:06.910 DEBUG testcase.checkDuckDuckGoPageTitle        - 2: escapedTitle = StringUtils.escapeNonAsciiChars(Test_Title)
2021-02-11 09:52:06.961 DEBUG testcase.checkDuckDuckGoPageTitle        - 3: println(escapedTitle)
DuckDuckGo \u2014 Privacy, simplified.
2021-02-11 09:52:06.980 INFO  c.k.katalon.core.main.TestCaseExecutor   - END Test Cases/checkDuckDuckGoPageTitle

I am sure you have \u2014 (EM Dash) in your CSV file.
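As a variation on the escaping keyword above, the JDK can also report the official Unicode name of each character, which avoids looking the code point up by hand. The following is only a sketch: the class name CodePointInspector is invented here, while Character.getName is a standard JDK method.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class CodePointInspector {

    /** Lists every non-ASCII code point in the text as "U+XXXX NAME". */
    static List<String> describeNonAscii(String text) {
        List<String> found = new ArrayList<>();
        text.codePoints()
                .filter(cp -> cp > 127) // keep only non-ASCII code points
                .forEach(cp -> found.add(String.format("U+%04X %s", cp, Character.getName(cp))));
        return found;
    }

    public static void main(String[] args) {
        String title = "DuckDuckGo \u2014 Privacy, simplified.";
        for (String line : describeNonAscii(title)) {
            System.out.println(line);
        }
    }
}
```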

Hi Russ. Sorry I did not reply. For some reason I did not receive any alert e-mails from Katalon. I will attach my sample CSV at the very latest post. Thanks for sharing your observations.

Did you ever notice the message at the top of the forum page?

I did see a visual alert after I signed in to the Katalon forum. Just no e-mail alert.

list-of-webpages.zip (613 Bytes)

I’ve attached a copy of my CSV for your reference.

@Russ_Thomas

Just so you two know, I am inherently Super Efficient :) … I used copy and paste:

(1) From Chrome Developer Tools, to get the page title.

(2) Then from the error message in Katalon.

So … not as smart as Kaz … but Highly Efficient … LOL

Is the solution that I should encode my CSV file to a certain format? I assume you want to review my CSV first.

Thank you for sharing the zip file.
In the CSV file I did find a strange character.

I do not see why you have the strange character. You are the one who should know; nobody else can.

How to fix this? You can edit the CSV file manually with your favourite text editor. Do you need any other method?

I am sure Katalon is not at fault.

Between (1) and (2), you must have had some GUI tool open and pasted in the string you copied at (1). I guess the tool converted a UNICODE U+002D - [HYPHEN-MINUS] into something else when you pasted the string.

Some sophisticated GUI applications for non-programmers interfere and silently convert some characters into others. UNICODE U+002D - [HYPHEN-MINUS] in particular can be troublesome, because some applications treat it in a special manner.

Let me show you an example. The Markdown dialect of Discourse, on which this Katalon Forum is hosted, renders consecutive UNICODE U+002D - [HYPHEN-MINUS] characters in a markup document:

HYPHENS IN --- BETWEEN

into an EM-DASH in presentation view

HYPHENS IN — BETWEEN

Which tool did you use? I guess you used the famous Microsoft Excel, but I am not sure. If you remember which GUI tool you used, why not try to reproduce your mischievous CSV?

Hi Kaz,

After the last issue with Katalon and Excel I switched to using CSV files. Since I have to use Windows at work, I’m using Notepad++ as the text editor. I copied the element from Chrome Developer to the CSV using the below command.

Do you have a recommended text editor if I am in Windows?

I was wrong.

In the last screenshot you provided, I found an EM DASH-like character ー in the <title> element.

I use my PC with LANG=ja_JP, not LANG=en_US.

When I opened the https://duckduckgo.com/ page, the <title> text contained an EM DASH-like character as well.

<title>DuckDuckGo — プライバシー保護をシンプルに。</title>

Which language do you use on your PC? Is it something other than LANG=en_US?

If you open the URL on a different PC with LANG=en_US, then the <title> text may be different depending on the language setting.

For Reference this is what Notepad++ is showing when I open the CSV

Looks like my employer has the default US English chosen in Chrome:

@kazurayam @Russ_Thomas

Hi Guys,

I ended up copying the character from the CSV into this web character identification tool and was able to confirm it is an EM DASH:

https://www.babelstone.co.uk/Unicode/whatisit.html

But look at how the Katalon Data Viewer is not recognizing the character:

I am not sure what “Katalon Data Viewer” is. It seems I have never used it.

But I guess the “Katalon Data Viewer” is not careful enough about character encoding (it has a bug). Possibly it is reading the character stream as ISO-8859-1 (Latin-1) rather than as UTF-8.
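To make the idea of an encoding mismatch concrete, here is a small sketch that writes the title with one charset and reads it back with another. It does not reproduce what the Data Viewer does internally, which is not known from this thread. The class name EncodingMismatchDemo and the two charset pairings are assumptions chosen for illustration. A UTF-8 file read as Latin-1 turns the EM DASH into three characters, while an ANSI (windows-1252) file read as UTF-8 turns it into a single U+FFFD replacement character, which resembles the � shown in the assertion message at the top of this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
final class EncodingMismatchDemo {

    /** Encodes the text with one charset and decodes the resulting bytes with another. */
    static String roundTrip(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs); // malformed input becomes U+FFFD
    }

    public static void main(String[] args) {
        String title = "DuckDuckGo \u2014 Privacy, simplified.";
        Charset ansi = Charset.forName("windows-1252");

        // Case 1: file saved as UTF-8, reader assumes ISO-8859-1.
        String utf8AsLatin1 = roundTrip(title, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        // Case 2: file saved as ANSI, reader assumes UTF-8.
        String ansiAsUtf8 = roundTrip(title, ansi, StandardCharsets.UTF_8);

        int dashIndex = title.indexOf('\u2014');
        System.out.printf("original length %d, case 1 length %d, case 2 length %d%n",
                title.length(), utf8AsLatin1.length(), ansiAsUtf8.length());
        System.out.printf("case 1 starts the dash with U+%04X%n", (int) utf8AsLatin1.charAt(dashIndex));
        System.out.printf("case 2 replaces the dash with U+%04X%n", (int) ansiAsUtf8.charAt(dashIndex));
    }
}
```

Which of the two cases applies here (if either) would have to be confirmed against the actual bytes of the attached CSV.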

I think the Katalon Team should be notified of this bug.

@ThanhTo
@duyluong
@devalex88

I thought so all along…

@Russ_Thomas

Could you move this to the “Bug” category?