Double-Byte Font-Based Rendering for Indic Localization on FOSS Platforms

By Prof. Jitendra Shah

25 March 2005

Background:

A large number of software packages, particularly graphics-based ones, have standard facilities for using TTF fonts, including Unicode-based TTF fonts. However, these packages do not have the rendering facility required for complex text such as Indian-language scripts. The CJK languages, which use Unicode-based fonts, use only TTF fonts. Localization for CJK has been achieved thanks to the ease of using Unicode TTF fonts, whereas localization for Indian scripts has been tardy because facilities to render Unicode OpenType fonts are not available.
TTF, OpenType, Unicode, ISCII and non-Unicode Indian fonts:
Nature of standard: While TTF and OpenType are corporate standards for fonts, Unicode is a free standard, aligned with ISO 10646, which is governed by ISO in Geneva.
TTF: TTF (TrueType) is a font standard originally developed by Apple and later adopted by Microsoft, in which there is a one-to-one correspondence between encoding and glyph. The glyphs are represented in vector form (splines etc.) and hence are scalable.
OpenType: This is a Unicode-based font standard created jointly by Adobe and Microsoft, with a multibyte representation of text. It is a superset of TTF. Text represented in Unicode is rendered using rules incorporated in the font. These rules take the form of substitution (replace a given sequence of characters by a given glyph) or positioning (if a given character comes after another given character, place it in a given position).
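The two kinds of rule can be illustrated with a toy model in Python. This is only a sketch of the idea, not the actual OpenType table format: the glyph names, the rule tables `SUBSTITUTIONS` and `POSITIONS`, and the offsets are all invented for illustration.

```python
# Toy model of font-driven shaping. All glyph names and offsets are hypothetical.

# Substitution rules: a sequence of glyphs is replaced by a single glyph.
SUBSTITUTIONS = {
    ("ka", "virama", "ssa"): "k_ssa",   # a conjunct drawn as one glyph
    ("ta", "virama", "ra"): "t_ra",
}

# Positioning rules: (previous glyph, this glyph) -> (dx, dy) shift for this glyph.
POSITIONS = {
    ("ka", "u_sign"): (0, -120),
    ("k_ssa", "u_sign"): (40, -120),
}


def substitute(glyphs):
    """Apply substitution rules left to right, preferring the longest match."""
    longest = max(len(key) for key in SUBSTITUTIONS)
    out = []
    i = 0
    while i < len(glyphs):
        for n in range(longest, 1, -1):
            key = tuple(glyphs[i:i + n])
            if key in SUBSTITUTIONS:
                out.append(SUBSTITUTIONS[key])
                i += n
                break
        else:
            # No rule starts here: keep the glyph unchanged.
            out.append(glyphs[i])
            i += 1
    return out


def position(glyphs):
    """Attach an offset to each glyph, depending on the glyph before it."""
    placed = []
    for idx, glyph in enumerate(glyphs):
        prev = glyphs[idx - 1] if idx > 0 else None
        placed.append((glyph, POSITIONS.get((prev, glyph), (0, 0))))
    return placed


def shape(glyphs):
    """Substitution first, then positioning on the substituted glyphs."""
    return position(substitute(glyphs))


if __name__ == "__main__":
    print(shape(["ka", "virama", "ssa", "u_sign"]))
```

Note that the positioning rule is looked up after substitution, so the offset of the vowel sign depends on the conjunct glyph rather than on the original characters. A plain TTF font has no such rule tables, which is why it can only map one code to one glyph.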
Indian non-Unicode fonts: These have been a great facility in the past, but for that very reason some people now consider them a scourge of localization. These legacy systems were a non-standard (i.e. vendor-specific) way of representing text, using vendor-specific application software, and only on proprietary platforms.
Unicode and ISCII: These are character representations and are not concerned with display, and hence not with rendering or fonts (except in some rare circumstances, which have been a debatable issue within the Unicode developer community). Given the limitations of ASCII, another standard, ISCII, was proposed for Indian texts, independent of the font. The idea was that the representation should be independent of the font, and that applications could render the text after transforming it into the encoding of the appropriate font. However, except for government data in some cases, most data is stored in non-ISCII, non-standard formats, which encourages vendor lock-in. Because of the vested interests created by such lock-in, there has been major (and unfortunately quite effective) resistance to changing over to a standard like Unicode.
Fonts in graphics applications:
Display interaction: Most graphics software, when localized, will require a small amount of interaction in Indian script and will need to display a lot of stored data in Indian scripts. The interaction includes input in dialog boxes, which eventually becomes part of a database from which the graphics display draws its character information.
Localization: Localization involves translating the stored information into the local language and storing it as character information. All of the stored character information can be converted or translated in this way.
Rendering: The translated text will eventually need to be displayed, either on a graphics screen (as pixels) or as text in menus or on the console, or will be required for printing.
Conversion: The character information may be converted on the fly into the encoding required for display or for rendering for printing.
Indic localization of software:
a. Approaches in FOSS text rendering: On the FOSS (i.e. GNU/Linux) platform, Unicode compliance has become standard. However, rendering facilities are restricted by the inadequacies of system-level rendering technologies. In the FOSS domain, leaving aside console-level text, the X graphics libraries determine the rendering. The INDIX approach by NCST (now C-DAC) was a valiant effort to provide a generic solution to complex text rendering by separating character representation from rendering. However, it was based on modifying the X library, in the hope that the worldwide X community would accept the modification. That did not happen, and INDIX is now stuck on a particular version of X that is obsolete. Meanwhile, the X consortium has adopted a modular approach to language-specific rendering, and X library development and maintenance have split into two branches, both of which have adopted the Pango library approach for Indian scripts. Several other approaches, independent of the X library, are also in vogue: OpenOffice.org uses its own approach, and Mozilla uses yet another (with a Pango patch).
b. The Chinese path is our path for text rendering: A number of software packages have only text content and need a lot of interaction in text. Often they resort to special (application-specific) machinery for the GUI (which is system based) and for content (which may be system based or application specific) in local languages. These approaches may override or bypass the system-based rendering. The CJK path is to use Unicode for representation, which increases the range of characters, and to use the system-level machinery for rendering.
c. It is not known whether anyone has used that approach in India. However, this is exactly what the ISCII-ISFOC combination does. I am suggesting that the same concept be emulated in Unicode.
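The contrast between the two cases can be made concrete with a minimal Python sketch. It is an illustration only, not part of any existing tool, and the two sample strings are chosen arbitrarily: one CJK ideograph and one Devanagari conjunct.

```python
import unicodedata


def code_points(text):
    """Return (code point, Unicode character name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


cjk = "\u6F22"                    # one ideograph: one code point, one glyph
conjunct = "\u0915\u094D\u0937"   # KA + VIRAMA + SSA, normally drawn as one conjunct glyph

if __name__ == "__main__":
    for label, sample in (("CJK", cjk), ("Devanagari", conjunct)):
        print(label, len(sample), "code point(s)")
        for cp, name in code_points(sample):
            print("   ", cp, name)
```

A font with a one-to-one mapping from code to glyph can draw the first sample directly. The second sample needs either rules in the font or a single precomposed glyph, and the latter is what the double-byte font proposed here would supply.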
Implementation:
The following steps will be necessary:

1. Identifying applications that need this approach
2. Identifying fonts that can be developed in dual form, i.e. OpenType and double-byte TTF
3. Identifying the number of character combinations to be represented in doublebyte font
4. Developing scripts to create the characters and glyphs for doublebyte font
5. Testing whether speed is affected by introducing such fonts. If it is adversely affected (which yours truly does not anticipate), devising methods for fast glyph lookup for display that can be incorporated in applications
6. Creating locales that use these fonts and adding them to CLDR (there are already such locales for non-Indic scripts, e.g. French, with .UTF8 and .ISO8859-1)
7. Writing converters (a trivial exercise, programming-wise) from Unicode OpenType to Unicode double-byte
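Steps 3, 4 and 7 can be sketched in a few lines of Python. This is a minimal sketch under several assumptions that are not part of the proposal itself: it covers only Devanagari, it enumerates only two-consonant conjuncts (consonant + virama + consonant), and it assigns each conjunct a slot in the Unicode Private Use Area starting at U+E000. The function names `build_table`, `to_doublebyte` and `to_unicode` are hypothetical. A real font would include only the combinations actually in use, and the slot assignment would have to be agreed upon.

```python
VIRAMA = "\u094D"
# Devanagari consonants KA (U+0915) through HA (U+0939).
CONSONANTS = [chr(cp) for cp in range(0x0915, 0x093A)]


def build_table(first_slot=0xE000):
    """Map each consonant+virama+consonant sequence to one private-use code point.

    The starting slot U+E000 is an assumption made for this sketch.
    """
    table = {}
    slot = first_slot
    for c1 in CONSONANTS:
        for c2 in CONSONANTS:
            table[c1 + VIRAMA + c2] = chr(slot)
            slot += 1
    return table


def to_doublebyte(text, table):
    """Replace every sequence found in the table by its single code point."""
    longest = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        for n in range(longest, 1, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            # Not the start of a known sequence: copy the character as it is.
            out.append(text[i])
            i += 1
    return "".join(out)


def to_unicode(text, table):
    """Inverse conversion, so that stored text can always remain standard Unicode."""
    reverse = {slot: seq for seq, slot in table.items()}
    return "".join(reverse.get(ch, ch) for ch in text)


if __name__ == "__main__":
    table = build_table()
    stored = "\u0915\u094D\u0937\u093E"        # KA + VIRAMA + SSA + vowel sign AA
    display = to_doublebyte(stored, table)
    print(len(table), "conjunct slots")
    print([f"U+{ord(ch):04X}" for ch in display])
    assert to_unicode(display, table) == stored
```

The conversion is applied only on the way to the display, and the reverse mapping recovers the original text exactly, so storage stays in standard Unicode. The dictionary lookup also gives a first rough indication for step 5, since each conversion step is a constant-time lookup.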

What will not change:
It may be noted that:

  • Storage of text information remains in Unicode.
  • Translations and tools for translation remain unchanged.
  • Applications already internationalised for CJK can be used straight away.

Pros:
Immediate availability of many applications in Indian languages, without waiting for a technological breakthrough, will open the door to immediate localization of a lot of educational applications.
Cons:
There is a dual mode of character representation, even if temporary and only for display. However, applications are replete with such duplications.
Why others may not be thinking on this line:
Lest it be construed as megalomaniac, let me clarify that this is not a very big idea. As said earlier, this is what is used by CJK, and it is along the lines of ISCII-ISFOC. If it is true that no one else has used this approach, it is interesting to speculate why. My guesses are as follows. Firstly, proprietary platforms, even if they use this approach, may not let us know. Secondly, many institutions and vendors have developed a vested interest in lock-in fonts and hence would not be thinking of an open approach. Thirdly, localization teams working on FOSS platforms have been so engrossed in text matters that graphics applications are not yet their priority.
What localization can be undertaken:

  • GRASS: a GIS application
  • Blender: a solid modelling package
  • GIMP: an image manipulation tool
  • NetLogo: a modelling and simulation package for deeper learning of complex problems, in domains ranging from art to science, technology and the social sciences
  • and many others