National Language Support in HTML

Supporting languages other than English is not a straightforward conversion or extension for an application designed exclusively for English. There are various things to consider in every tier (client, middle and database). I won’t go into all of those details here, only the ones related to HTML.

I was trying to make an application display and accept Korean text. No, I don’t know Korean. I just have a bunch of Korean text to play with, which I blindly copy and paste. Initially I had issues getting the data into the database. Once that was resolved, I had issues with display, navigation and form submission. In this post, I am going to talk only about the display, navigation and form submission parts.

In the end, everything in a computer is bits and bytes, 0s and 1s. If only someone had anticipated the need for computers to support lots of languages, there would have been one and only one list of all the characters of all the languages in the world. Since that didn’t happen, we ended up with various character sets. These character sets can also be encoded in different ways. That is, while a character set provides the listing (say, A = 1, B = 2, C = 3 and so on), an encoding provides a means of representing these numbers as bytes (fixed-width encoding, variable-width encoding and so on). Unicode is a modern character set that covers most of the world’s languages, so applications that need to support multiple languages should use Unicode. One of the encodings for the Unicode character set is UTF-8. So, using UTF-8 consistently should solve most non-English related problems.
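
To make the difference between a character set and an encoding concrete, here is a minimal Java sketch. The class name EncodingDemo and its hexBytes helper are made up for illustration. It takes one Korean character (Unicode code point U+D55C) and shows that two encodings turn the same number into different bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class EncodingDemo {

    // Returns the bytes of `text` in the given encoding as space-separated hex.
    static String hexBytes(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Written as an escape so the encoding of this source file does not matter.
        String han = "\uD55C";
        System.out.println("UTF-8:    " + hexBytes(han, StandardCharsets.UTF_8));    // ED 95 9C
        System.out.println("UTF-16BE: " + hexBytes(han, StandardCharsets.UTF_16BE)); // D5 5C
    }
}
```

The character set entry is the same in both cases. Only the byte representation changes, which is why both ends of a connection have to agree on the encoding.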

With respect to HTML, the following are the things to consider:

a. Display
b. URLs
c. Form Submission

What needs to be done for each of them is explained below.

Display: The browser needs a hint about the character encoding used for the content. This can be sent to the browser as an HTTP response header (Content-Type). In JSP specifically, it’s possible to specify this using the contentType attribute of the page directive.


<%@page ... contentType="text/html; charset=UTF-8"%>

This tells the browser what character set and encoding to use. Of course, in spite of setting this, the browser might still display ??? (in Firefox) or || (in IE7) if the appropriate language fonts are not available. So, one should ensure that the browser can also display the language your application is trying to show. A simple HTML page with a few words in that language makes a good standalone test of whether the client side is capable. Once that works, you can triage the problem in your application.

URLs: When creating hyperlinks that contain parameters whose values are non-English, those values need to be explicitly encoded. For example, in Java, this can be done using


URLEncoder.encode(value,"UTF-8");

Note that it’s important to use the version of encode that takes the encoding explicitly. Otherwise, the platform’s default encoding will be used, which may cause problems (the single-argument version is deprecated for this reason).
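
As a small sketch of building such a link in plain Java (the LinkBuilder class, the search path and the q parameter name are invented for illustration):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not taken from the application discussed here.
final class LinkBuilder {

    // Builds a relative link whose query value is percent-encoded as UTF-8.
    static String searchLink(String query) {
        try {
            return "search?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two Korean characters, written as escapes to stay independent of the source file encoding.
        System.out.println(searchLink("\uD55C\uAE00")); // search?q=%ED%95%9C%EA%B8%80
    }
}
```

Each Korean character becomes three percent-encoded bytes because UTF-8 uses three bytes for it. URLEncoder also turns a space into +, which is the form-style convention for query strings.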

To check that this is working properly in Firefox, for example, hover over the link: the URL is displayed in the status bar, and the encoded word, which would otherwise have been all % signs and hexadecimal digits, looks like the actual word. I believe the parameter values are decoded when the URL is displayed in the status bar.

Now, having ensured that the URLs are encoded appropriately, we have only solved half the problem. The server still may not know which encoding to use when decoding the parameters. In the case of Tomcat, this can be specified using the URIEncoding attribute of the Connector element.


<Connector ... URIEncoding="UTF-8"/>

Now, each parameter received from the URI is decoded by the server as UTF-8, and request.getParameter will return the decoded values.
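
To see why the server-side setting matters, the following sketch decodes the same percent-encoded value twice in plain Java. It assumes a server that falls back to ISO-8859-1 when nothing is configured, which I believe was the default in older Tomcat versions. DecodeDemo is a made-up name and this is not Tomcat’s own code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class: mimics the decoding step a server performs on a query value.
final class DecodeDemo {

    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "%ED%95%9C%EA%B8%80"; // two Korean characters, percent-encoded as UTF-8

        String good = decode(raw, "UTF-8");      // matches the encoding used by the sender
        String bad = decode(raw, "ISO-8859-1");  // assumed fallback when nothing is configured

        System.out.println(good.length()); // 2: the original two characters
        System.out.println(bad.length());  // 6: each byte became a separate, wrong character
    }
}
```

The bytes arrive intact in both cases. The garbage appears only because the receiver interprets them with a different encoding than the sender used.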

Form Submission: So far, display and URL parameters are taken care of. The last thing is the data actually submitted by the end user. When the contentType is set for display purposes as indicated above, I believe the browser automatically uses the same encoding for data sent through a form submission. Otherwise, it’s possible to specify the encoding explicitly using the accept-charset attribute of the form element.

Assuming everything is done right in the browser, the server is still not aware of it. And the URIEncoding setting for Tomcat described above does not apply to form POSTs, because the data sent by a form POST travels in the request body, not in the URI. As a result, for Tomcat, one has to set the request encoding explicitly, before the first parameter is read:


request.setCharacterEncoding("UTF-8");
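
Conceptually, the call above tells the container which charset to use when it decodes the percent-encoded form body. A rough sketch of that decoding step in plain Java follows. FormBodyDemo and parse are hypothetical names, the sample field names are made up, and this is not Tomcat’s actual parsing code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class: splits a form body into parameters using a chosen charset.
final class FormBodyDemo {

    static Map<String, String> parse(String body, String charset) {
        Map<String, String> params = new LinkedHashMap<>();
        try {
            for (String pair : body.split("&")) {
                int eq = pair.indexOf('=');
                String name = eq < 0 ? pair : pair.substring(0, eq);
                String value = eq < 0 ? "" : pair.substring(eq + 1);
                // Names and values are both percent-encoded, so both are decoded.
                params.put(URLDecoder.decode(name, charset), URLDecoder.decode(value, charset));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        return params;
    }

    public static void main(String[] args) {
        // What a browser might send for a form with two fields (assumed sample data).
        String body = "name=%ED%95%9C%EA%B8%80&city=Seoul";
        Map<String, String> params = parse(body, "UTF-8");
        System.out.println(params.get("name").length()); // 2
        System.out.println(params.get("city"));          // Seoul
    }
}
```

In a real servlet the container does this work for you. The only thing the application supplies is the charset, and it has to do so before the parameters are parsed.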

With the three things above, it should be possible to display, navigate and accept input of non-English characters. I am more familiar with Java and Tomcat, so the examples above are specific to these. But hopefully, one can easily extend this knowledge to their own choice of language and web server.

All of this took me about an hour of experimenting and searching the web to figure out. Hopefully, people who read this article won’t have to spend as much.
