Wednesday, May 12, 2010

Uncovering myths about Globalization testing: An approach to generating localized test data

Myth 23: Randomly picking test data specific to the localized language you are testing is probably the right strategy

While performing internationalization testing, a tester largely deals with testing the product in languages he does not know. Imagine a tester from India testing a product on a Spanish operating system. It is largely possible to do internationalization testing without knowing the language, but there are certain traps related to this ignorance of the underlying language that a tester should be aware of. One such trap is the choice of test data.

Very often, not knowing the language prompts the tester to use just a "few" input characters of the language under test while ignoring the rest. This is a dangerous situation because it leaves holes in the test coverage. The example below illustrates the point.

E.g. suppose a tester is testing the Spanish language. He is most likely to include Spanish-specific input characters such as á,é,í,ó,ú,ñ,ü, and the more investigative testers may use the same list in capitals, or in combination with English characters, numbers etc. But the basic question that most testers hesitate to ask is: does my test data cover all the possible input characters for the language under test?
One reason this question is not asked may be incomplete knowledge of the language under test. Some may challenge the question itself: is it really necessary to test all the possible input characters that a language offers? This is a fair question. When testing English, say, if a tester uses "a" as test data or uses "z" instead, wouldn't the result be the same every time?
Maybe yes.
But when we generate test data from localized characters, there are some more crucial points to consider. This is what I intend to cover in the upcoming sections, with an idea of how the test data for "foreign" languages can be better classified and used while testing.


How do I know which input characters should be tested for a particular language?
I have generally found this website quite good and informative. It has separate web pages for different languages, and the tester can easily see which input characters define each language. More on this in the upcoming examples.

Is there a reasonable way to ensure that the localized test data gets appropriately covered?
This website not only tells you which characters are used in a particular language but also suggests appropriate ways of classifying them.

In my experience, dividing the input characters of a particular language into appropriate classes is always a better approach. Let me try to explain using an example.

Consider the Spanish language, for example.

Webpage listing all the character inputs:
http://tlt.its.psu.edu/suggestions/international/bylanguage/spanish.htm

E.g. the section below lists the input character classes for Spanish in a broader sense. When coming up with test data for a particular field, say Password, one or more characters from each of these classes can be used as test data, depending on the rules and the length accepted.

Spanish language input character classes:
Capitals:
Á,É,Í,Ó,Ú,Ñ,Ü (commas are used only as separators)

Lower case:
á,é,í,ó,ú,ñ,ü

Punctuation:
¿,¡,º,ª,«,»,€

Special Spanish representations:
HTML entity codes (the codes that allow browsers and screen readers to process data as the appropriate language), e.g. for the character á, the HTML entity code is &aacute;
The reason to include these codes as a separate class is that if such a code is typed as test data into an input field, the application might interpret it as a single letter. This possibility is especially real for web-based applications.
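
As a rough illustration of why entity codes deserve their own class, the Python sketch below builds the entity form of each Spanish character and shows what a decoding step does to it. The helper name `entity_forms` is made up for this post; `html.unescape` and `codepoint2name` come from the Python standard library. Whether a given application decodes entities at all is exactly what the test is meant to find out.

```python
import html
from html.entities import codepoint2name

# Spanish-specific characters taken from the classes listed in this post.
SPANISH_CHARS = "ÁÉÍÓÚÑÜáéíóúñü¿¡ºª«»€"


def entity_forms(chars):
    """Map each character to its named HTML entity, e.g. 'á' -> '&aacute;'.

    Hypothetical helper for illustration only.
    """
    return {ch: "&" + codepoint2name[ord(ch)] + ";" for ch in chars}


if __name__ == "__main__":
    for ch, entity in entity_forms(SPANISH_CHARS).items():
        decoded = html.unescape(entity)
        # The typed entity is several characters long, but an application
        # that decodes it ends up storing a single letter.
        print(ch, entity, len(entity), len(decoded))
```

A field that accepts 20 characters may therefore behave differently depending on whether the limit is applied before or after decoding, which is worth one test value of its own.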

English Lower case characters:
a-z

English Upper case characters:
A-Z

Numeric representations:
1,2,3,4....

Special characters (EN):
~`!@#$%^&*()_+-={}[]|\:;"'<,>.?/

Any Known problematic Spanish characters (not included above):

The reason to include English input characters in Spanish test data is that users generally use English alongside Spanish characters, and the two languages share many characters.
Continuing our example, say the password is accepted at a length of 20 characters or less. Some of the test data can be as follows (the aim is for each value to contain one or more characters from each identified class).

Less than 20 characters
Áñº«gz24@

=20 characters
Üó€»sH224&ÜÍ¿¡

Greater than 20 characters
Ó᪫»tyKL7845`!@º

Far Greater than 20 characters
(If there are no limits on the UI)

0 characters
(blank)

The important point is that each test value should include at least one character from each of the input classes identified for Spanish above. That is the key: this approach helps ensure that every class of data is exercised and no class of input characters is left to chance.
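
Test values of this kind can also be generated rather than typed by hand. The sketch below is one possible way to do it, not a tool I am describing from the product: the function name `make_test_value`, the class names and the digits 0-9 are assumptions made for illustration. It picks one character from every class first, fills the rest of the requested length from all classes combined, and shuffles the result.

```python
import random
import string
import unicodedata

# Input classes for Spanish as listed in this post (names are illustrative).
SPANISH_CLASSES = {
    "spanish_upper": "ÁÉÍÓÚÑÜ",
    "spanish_lower": "áéíóúñü",
    "spanish_punctuation": "¿¡ºª«»€",
    "english_lower": string.ascii_lowercase,
    "english_upper": string.ascii_uppercase,
    "digits": string.digits,  # assumption: 0-9
    "english_special": "~`!@#$%^&*()_+-={}[]|\\:;\"'<,>.?/",
}


def make_test_value(length, classes=SPANISH_CLASSES, rng=None):
    """Build a string of `length` characters that touches every class."""
    if rng is None:
        rng = random.Random()
    pools = [unicodedata.normalize("NFC", p) for p in classes.values()]
    if length < len(pools):
        raise ValueError("length is too short to cover every class")
    # One guaranteed character per class ...
    picked = [rng.choice(pool) for pool in pools]
    # ... then pad with characters drawn from all classes combined.
    everything = "".join(pools)
    picked += [rng.choice(everything) for _ in range(length - len(picked))]
    rng.shuffle(picked)
    return "".join(picked)


if __name__ == "__main__":
    # Below, at and just above the 20-character limit from the example.
    for length in (10, 20, 21):
        print(length, make_test_value(length))
```

Passing a seeded `random.Random` makes a failing value reproducible, which helps when a defect has to be reported against a specific input.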

Some related questions that I will be working to address in upcoming posts:
- The strategy of dividing the input characters into classes seems good for languages that have a limited set of characters, like European languages. Can this approach be used for Asian languages such as Japanese, Chinese, Korean etc., which deal with multiple writing scripts as well as thousands of characters?
- In the larger context, what could be the test strategy to test the Unicode feature? Unicode is one of the key globalization features built into the product and should be tested comprehensively from the point of view of globalization testing.

For a complete list of Globalization testing myths uncovered in the past, please visit here.
