Report on Dealing with Syriac in Unitex and Java

 

 

Thi Ai Vi Tran

tran@isys.ucl.ac.be

 

1.            Introduction

 

This document reports the results of our tests of Syriac support in Java, in the context of using it with the Unitex application.

 

2.            Unitex application

 

Unitex is a corpus processing system based on automata-oriented technology. The application makes it possible to handle electronic resources, such as electronic dictionaries and grammars, and to apply them to texts.

 

The main functions are:

·         Building, checking and applying electronic dictionaries

·         Pattern matching with regular expressions and recursive transition networks

·         Applying lexicon-grammar tables

·         Handling ambiguity via the text automaton

 

Unitex has two main parts:

    ·         User interfaces written in Java

    ·         Functions written in C

 

2.1        User interface

 

2.1.1         Manipulating a text

 

                 

 

                  Editing a text  file

 

 

2.1.2        Displaying and editing graphs

 

 

 

2.2        Functions written in C

         These functions are provided as executable files and include Transcode, Construct FST-Text, Convert FST-Text to Text, Compile to GRF, Sort Dictionary …
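Because the Java interface delegates these functions to separate executables, a call from the interface can be pictured roughly as in the sketch below. It is written under assumptions only: the class `ExternalTool`, the directory `App`, the program name `SomeTool.exe` and its argument are hypothetical placeholders and do not reflect the actual Unitex program names or options.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExternalTool {
    /** Builds the command line: the executable path followed by its arguments. */
    static List<String> buildCommand(File toolDir, String toolName, String... args) {
        List<String> command = new ArrayList<>();
        command.add(new File(toolDir, toolName).getPath());
        command.addAll(Arrays.asList(args));
        return command;
    }

    /** Starts the command, forwards its output to this process and returns its exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.inheritIO();
        return builder.start().waitFor();
    }
}
```

With these helpers, the interface would call something like `ExternalTool.run(ExternalTool.buildCommand(new File("App"), "SomeTool.exe", "input.txt"))` and check the returned exit code; all three values are placeholders.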

 

The following screenshots illustrate these functions.

 

TransCode ‘c:\temp\Abc\French\Corpus\80jours.txt’ from French to English

 

For instance, the steps to construct an FST-Text:

 

 

 

 

 

 

 

3.            Problems when displaying Syriac characters in Java

 

Unitex cannot display or edit Syriac characters correctly. The following problems occur in Unitex and in Java:

 

·         The characters in a word don’t link together

 

When Syriac characters are displayed with Swing library controls, the individual characters appear, but the characters within a word are not linked (joined) to one another. We tried many fonts and the result was always the same. The fonts themselves are clearly not the cause, since the same fonts work well in Microsoft Office Word.
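To see how Java itself classifies the characters involved, a small check such as the one below can help. It is a minimal sketch and not part of Unitex: the class name `SyriacInspector` and its methods are illustrative. It only reports whether the text lies in the Unicode Syriac block and whether it is right-to-left; it does not perform or verify the joining of letters, which is the job of the text layout engine and the font.

```java
import java.text.Bidi;

class SyriacInspector {
    /** True if at least one character of the text lies in the Unicode Syriac block. */
    static boolean containsSyriac(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (Character.UnicodeBlock.of(text.charAt(i)) == Character.UnicodeBlock.SYRIAC) {
                return true;
            }
        }
        return false;
    }

    /** True if the text contains right-to-left characters and therefore needs bidi analysis. */
    static boolean needsBidi(String text) {
        char[] chars = text.toCharArray();
        return Bidi.requiresBidi(chars, 0, chars.length);
    }

    public static void main(String[] args) {
        // First word of the sample text used later in this report
        String word = "\u0712\u072A\u071D\u072B\u071D\u072C";
        System.out.println("Contains Syriac: " + containsSyriac(word));
        System.out.println("Needs bidi:      " + needsBidi(word));
    }
}
```

A string for which both checks return true needs right-to-left ordering and contextual letter forms, so it is a good candidate for visual comparison between a Java control and a word processor.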

 

Here are two lines of Syriac characters. In the first one the characters are not linked; the second one is correct.

 

 

 

 

None of the AWT controls can display Syriac characters, except the TextField control. The other controls do not recognize the fonts: changing the Syriac font has no effect. Here is the list of fonts we tested.

 

Estrangelo Antioch

Estrangelo Edessa

Estrangelo Midyat

Estrangelo Nisibin

Estrangelo QenNeshrin

Estrangelo Talada

Estrangelo TurAbdin

East Syriac Adiabene

East Syriac Ctesiphon

Serto Batnan                           

Serto Jerusalem

Serto Jerusalem Outline

Serto Kharput

Serto Malankara

Serto Mardin

Serto Qezhayya

Serto Urhoy

 

 

 

Java accepts only the ‘Estrangelo Edessa’ and ‘Estrangelo Talada’ fonts.
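Which installed fonts Java considers able to display Syriac can also be checked programmatically. The sketch below is an illustration under assumptions: the class name `SyriacFontCheck` and the sample string are ours, while `GraphicsEnvironment.getAvailableFontFamilyNames()` and `Font.canDisplayUpTo(String)` are standard Java calls. It only tests glyph coverage, so a font that passes may still show letters that are not linked.

```java
import java.awt.Font;
import java.awt.GraphicsEnvironment;
import java.util.ArrayList;
import java.util.List;

class SyriacFontCheck {
    /** A short Syriac sample: the first word of the test string used in this report. */
    static final String SAMPLE = "\u0712\u072A\u071D\u072B\u071D\u072C";

    /** True if the font reports a glyph for every character of the text. */
    static boolean covers(Font font, String text) {
        return font.canDisplayUpTo(text) == -1;
    }

    /** Returns the installed font families that report full coverage of the text. */
    static List<String> familiesCovering(String text) {
        List<String> result = new ArrayList<>();
        String[] families = GraphicsEnvironment.getLocalGraphicsEnvironment()
                .getAvailableFontFamilyNames();
        for (String family : families) {
            if (covers(new Font(family, Font.PLAIN, 30), text)) {
                result.add(family);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints one family name per line; the list depends on the fonts installed locally
        for (String family : familiesCovering(SAMPLE)) {
            System.out.println(family);
        }
    }
}
```

The result depends entirely on the machine where it runs, so it is best compared directly with the list of tested fonts above.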

Here is a screenshot of a Java application dealing with Syriac:

 

 

 

JTextField, JTextArea, JList are Swing controls

TextField, TextArea, List are AWT controls

 

 

Setting the font of a Java control

 

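import java.awt.Font;
import java.awt.TextField;
import java.util.Locale;

// TextField is the only AWT control that displayed Syriac in our tests
TextField text = new TextField();

Locale m_Locale = new Locale("ar", "SY");

// One of the two Syriac fonts that Java accepted
Font m_Font = new Font("Estrangelo Edessa", Font.PLAIN, 30);

text.setLocale(m_Locale);

text.setFont(m_Font);

text.setText("Syriac characters \n \u0712\u072A\u071D\u072B\u071D\u072C \u0710\u071D\u072C\u0718\u0717\u071D \u0717\u0718\u0710 \u0721\u0720\u072C\u0710");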
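The Syriac text in the call to setText is written with \u escapes so that the source file stays plain ASCII. Such escapes can be produced from real text with a small helper like the one sketched below; the class name `UnicodeEscaper` is an assumption made for illustration and is not part of Unitex or of the Java library.

```java
class UnicodeEscaper {
    /**
     * Returns the text with every non-ASCII UTF-16 code unit replaced by a
     * backslash-u escape of four uppercase hexadecimal digits.
     */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c); // ASCII is copied unchanged
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Escapes the text given on the command line, one argument per line
        for (String arg : args) {
            System.out.println(escape(arg));
        }
    }
}
```

The output can be pasted directly into a Java string literal such as the one passed to setText.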

 

 

4.            Syriac characters work well in J#

 

Pros:

All the controls in J# can display, edit and search Syriac characters correctly. To convert Unitex from Java to J#, the following steps are required:

 

·               Rewrite the user interface of Unitex

·               Reuse all the functions written in C (that is, reuse all the .exe files)

 

Cons:

Although Syriac works well in J#, we cannot convert all the Java functions of Unitex to J#.

 

The problems are the following:

·               The controls of the Java libraries are not compatible with the controls of the J# libraries

·               Some libraries used by Unitex are not supported by J# (e.g. Frame)

·               Although J# is inspired by Java, the syntax differs between the two (e.g. the function that returns the length of a String is length() in Java and get_Length() in J#)

 

Note:

Converting Unitex from Java to J# would require substantial work, especially for the FSGraph functions that display text graphs.

 

 

5.            J# application for displaying and editing Syriac developed so far

 

So far, the J# application for displaying Syriac includes the following functions:

 

·         Open a txt file

·         Save a txt file

·         Save As a txt file

·         Select All

·         Copy

·         Paste

·         Find

·         Undo

 

Opening a txt file

 

 

 

Displaying a txt file and other  functions

 

 

Finding a character or a string

 

 

 

 

 

 

6.            Conclusion

 

In our tests, Syriac did not work correctly in Java. We tested libraries from Sun and IBM, such as AWT and Swing, and Java classes such as TextField, TextLayout and TextArea; none of them supported Syriac correctly. From what we have tested, Java and Unitex also still have problems fully supporting other languages such as Arabic, Chinese, Japanese, … The problem seems to lie in the Java Unicode interpreter, the font manager and the Unicode encoding itself.

 

Our recommendation is to use J#, especially with Visual Studio 2005, since font support in J# uses the same libraries as MS Office. Unfortunately, J# is not 100% compatible with Java, so converting a Java application to J# is not trivial.