Report on Dealing with Syriac in Unitex and Java
Thi Ai Vi Tran
This document reports the results of our tests of Syriac support in Java, in the context of using Syriac with the Unitex application.
2. Unitex application
Unitex is a corpus processing system based on automata-oriented technology. The application makes it possible to handle electronic resources such as electronic dictionaries and grammars and to apply them.
The main functions are:
· Building, checking and applying electronic dictionaries
· Pattern matching with regular expressions and recursive transition networks
· Applying lexicon-grammar tables
· Handling ambiguity via the text automaton
Unitex has two main parts:
· User interfaces written in Java
· Functions written in C
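Because the interface is written in Java and the processing functions are separate C programs, the interface presumably has to start those programs as external processes. The following is a minimal sketch of that pattern using the standard ProcessBuilder class. It is not the actual Unitex code: the class ExternalToolRunner and its methods are hypothetical names, and the executable and its arguments are supplied by the caller because the real command lines are not given in this report.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: launches an external executable from a Java program.
class ExternalToolRunner {

    // Builds the command line: the executable first, then its arguments in order.
    static List<String> buildCommand(String executable, String... arguments) {
        List<String> command = new ArrayList<>();
        command.add(executable);
        command.addAll(Arrays.asList(arguments));
        return command;
    }

    // Starts the process, forwards its output to this console,
    // waits for it to finish and returns its exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.inheritIO();
        Process process = builder.start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: ExternalToolRunner <executable> [arguments...]");
            return;
        }
        List<String> command =
                buildCommand(args[0], Arrays.copyOfRange(args, 1, args.length));
        System.out.println("exit code: " + run(command));
    }
}
```

Passing the command as a list, rather than as one concatenated string, avoids quoting problems when a file path contains spaces.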
2.1 User interface
2.1.1 Manipulating a text
Editing a text file
2.1.2 Displaying and editing graphs
2.2 Functions written in C
These functions are provided as executable files and include Transcode, Construct FST-Text, Convert FST-Text to Text, Compile to GRF, Sort Dictionary, …
The following screenshots illustrate these functions.
TransCode ‘c:\temp\Abc\French\Corpus\80jours.txt’ from French to English
For instance, the steps to construct an FST-Text:
3. The problem when displaying Syriac characters in Java
Unitex cannot display and edit Syriac characters. The following problems occur in Unitex and in Java:
· The characters in a word do not link together
When Syriac characters are displayed with Swing library controls, the characters themselves appear, but the characters in a word do not link together. We tried many fonts and the result was always the same. The fonts are clearly not the cause, since the same fonts work well in Microsoft Office Word.
Here are two lines of Syriac characters. In the first one the characters do not link; the second one is correct.
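When the letters do not link, it helps to rule out the text data itself before blaming the rendering. The sketch below, which is only an illustration and not part of Unitex, checks that a string really consists of code points from the Unicode Syriac block (U+0700 to U+074F) and that the standard java.text.Bidi class treats it as right-to-left. The class name SyriacTextCheck and its method names are hypothetical.

```java
import java.text.Bidi;

// Hypothetical diagnostic helper for Syriac strings.
class SyriacTextCheck {

    // True if the text contains at least one non-whitespace code point
    // and every non-whitespace code point lies in the Syriac block.
    static boolean isAllSyriac(String text) {
        boolean sawCharacter = false;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            if (Character.isWhitespace(codePoint)) {
                continue;
            }
            if (Character.UnicodeBlock.of(codePoint) != Character.UnicodeBlock.SYRIAC) {
                return false;
            }
            sawCharacter = true;
        }
        return sawCharacter;
    }

    // True if the bidirectional algorithm resolves the whole text as right-to-left.
    static boolean isRightToLeft(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return bidi.isRightToLeft();
    }

    public static void main(String[] args) {
        String sample = "\u0712\u072A\u071D\u072B\u071D\u072C";
        System.out.println("all Syriac: " + isAllSyriac(sample));
        System.out.println("right-to-left: " + isRightToLeft(sample));
    }
}
```

If both checks succeed and the letters are still drawn unlinked, the data is sound and the problem lies in how the text is shaped and drawn.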
None of the AWT controls can display Syriac characters except the TextField control. The other controls do not recognize the fonts: changing the Syriac font has no effect. Here is the list of fonts we tested:
East Syriac Adiabene
East Syriac Ctesiphon
Serto Jerusalem Outline
Java only accepts the ‘Estrangelo Edessa’ and ‘Estrangelo Talada’ fonts.
Here is a picture of a Java application dealing with Syriac:
JTextField, JTextArea, JList are Swing controls
TextField, TextArea, List are AWT controls
Setting fonts for a Java control
import java.awt.Font;
import java.awt.TextField;
import java.util.Locale;

// AWT text field that uses one of the Syriac fonts accepted by Java
TextField text = new TextField();
Locale m_Locale = new Locale("ar", "SY");
Font m_Font = new Font("Estrangelo Edessa", Font.PLAIN, 30);
text.setFont(m_Font);
text.setText("Syriac characters \n \u0712\u072A\u071D\u072B\u071D\u072C \u0710\u071D\u072C\u0718\u0717\u071D \u0717\u0718\u0710 \u0721\u0720\u072C\u0710");
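The string passed to setText writes the Syriac letters as \uXXXX escapes, which keeps the source file in plain ASCII but makes the text hard to read. As an illustration only, the sketch below converts text to such escapes and lists the Unicode name of each character using the standard Character.getName method. The class SyriacEscapes and its methods are hypothetical helpers, not part of Unitex or of the Java library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for writing and inspecting escaped Syriac text.
class SyriacEscapes {

    // Replaces every non-ASCII char by a backslash-u escape with four hex digits.
    static String toEscapes(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    // Returns one entry per code point: its U+ value followed by its Unicode name.
    static List<String> describe(String text) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            names.add(String.format("U+%04X %s", codePoint, Character.getName(codePoint)));
        }
        return names;
    }

    public static void main(String[] args) {
        String word = "\u0717\u0718\u0710";
        System.out.println(toEscapes(word));
        for (String name : describe(word)) {
            System.out.println(name);
        }
    }
}
```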
4. Syriac characters work well in J#
· Pros:
All the controls in J# can display, edit and find Syriac characters correctly. To convert Unitex from Java to J#, the following steps are needed:
· Rewrite the user interface of Unitex
· Reuse all the functions in C (reuse all of the .exe files)
· Cons:
Although Syriac works well in J#, we cannot convert all the functions of Unitex from Java to J#.
The problems are the following:
· The controls of the Java libraries are not compatible with the controls of the J# libraries
· Some libraries used by Unitex are not supported by J# (e.g. Frame)
· Although J# is inspired by Java, the syntax differs between Java and J# (e.g. the length of a String is obtained with length() in Java and with get_Length() in J#)
Substantial work is needed to convert Unitex from Java to J#, especially for the FSGraph functions that display text graphs.
5. Existing J# application for displaying and editing Syriac
So far, the J# application for displaying Syriac includes the following functions:
· Open a txt file
· Save a txt file
· Save As a txt file
· Select All
Opening a txt file
Displaying a txt file and other functions
Finding a character or a string
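The open, save and find functions do not depend on the user interface toolkit. As an illustration only, written in Java rather than J#, the sketch below reads and writes a text file with an explicit character encoding and finds every occurrence of a string. The class TextFileOps and its methods are hypothetical, and UTF-8 is used purely as an assumption for the example, since this report does not state which encoding the application uses.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper covering open, save and find on a text file.
class TextFileOps {

    // Reads the whole file and decodes it with the given encoding.
    static String open(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    // Encodes the text with the given encoding and writes it to the file.
    static void save(Path file, String text, Charset charset) throws IOException {
        Files.write(file, text.getBytes(charset));
    }

    // Returns the start index of every occurrence of needle in text.
    static List<Integer> findAll(String text, String needle) {
        List<Integer> hits = new ArrayList<>();
        if (needle.isEmpty()) {
            return hits;
        }
        for (int i = text.indexOf(needle); i >= 0; i = text.indexOf(needle, i + 1)) {
            hits.add(i);
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("syriac", ".txt");
        String text = "\u0717\u0718\u0710 \u0721\u0720\u072C\u0710";
        save(file, text, StandardCharsets.UTF_8);
        String loaded = open(file, StandardCharsets.UTF_8);
        System.out.println("round trip ok: " + loaded.equals(text));
        System.out.println("positions of U+0710: " + findAll(loaded, "\u0710"));
        Files.delete(file);
    }
}
```

Naming the encoding explicitly matters here: relying on the platform default encoding is a common way for non-Latin text to be corrupted when a file is saved on one machine and opened on another.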
Syriac is clearly not supported correctly in Java. We tested different libraries from Sun and IBM, such as AWT and Swing, and Java components such as TextField, TextLayout and TextArea; in none of them is the support correct. From what we have tested, Java and Unitex still have problems fully supporting other languages such as Arabic, Chinese, Japanese, … The problem seems to lie in the Java Unicode interpreter, the font manager and the Unicode encoding itself.
Our recommendation is to use J#, especially with Visual Studio 2005, since J# font support uses the same libraries as MS Office. Unfortunately, J# is not 100% compatible with Java, so converting a Java application to J# is not trivial.