Title: A Letter on Java

Author: Silas S. Brown

Date: Tue, 03 November 1998 13:15:28 +00:00

Body:

Dear Francis,

I find George Wendle's dislike of the Java Date class interesting. According to the 1.1.6 API specification, Date is not deprecated, as he had heard; rather, many of its methods are, because they did not work well with internationalisation. Now, a Date is used to store a time in milliseconds, and a Calendar (which can be extended to support Chinese calendars etc. as well as the Gregorian one) is used to view a Date. The relationship is quite clearly explained in the specification. The deprecated Date methods do handle the year as an integer minus 1900, so this does leave scope for Y2K problems in poor implementations (although there is nothing in the specification that limits the year to two digits), but this is only in the deprecated methods.
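As a minimal sketch of that relationship (the class and method names are made up for illustration), the following stores an instant in a Date, views it through a GregorianCalendar, and compares the result with what the deprecated getYear() reports:

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical demo class; only Date, Calendar, GregorianCalendar and TimeZone are library types.
final class DateCalendarSketch {
    /** Views the instant through a Gregorian calendar in UTC and returns the full year. */
    static int yearOf(Date instant) {
        Calendar view = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        view.setTime(instant);
        return view.get(Calendar.YEAR);
    }

    /** Returns what the deprecated Date.getYear() reports: the year minus 1900, in the local zone. */
    @SuppressWarnings("deprecation")
    static int deprecatedYearOf(Date instant) {
        return instant.getYear();
    }

    public static void main(String[] args) {
        Date now = new Date(); // a Date is just a count of milliseconds
        System.out.println("milliseconds since 1970: " + now.getTime());
        System.out.println("year via Calendar:       " + yearOf(now));
        System.out.println("deprecated getYear():    " + deprecatedYearOf(now));
    }
}
```

A program that prints the deprecated value after a literal "19" is the kind of poor implementation that would get the year 2000 wrong; the Calendar view has no such trap.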

One thing that I don't like about Java is its handling of international characters. The intention is admirable, but why are all those "byte to character converters" only briefly mentioned in the specification and hidden away in the sun.io package without support for anybody who wants to customise them? For example, I could quite easily write the classes to add HZ encoding of Chinese (which is common in email and Usenet but not supported by Java), but, if I did so, then I would be relying on undocumented stuff and reverse engineering, and all the risks that this entails. Honestly, I can't even find a definitive list of the converters available, but then, having decompiled some of them (albeit with a not-so-good decompiler that usually outputs opcodes or gives up), I'm hardly surprised that they don't want to advertise them too much. Let me give you five examples:

1. Any encoding with a state (e.g. a byte sequence that switches into and out of the encoding, like JIS or one of the EBCDICs) will throw an InternalError in some circumstances, most notably when called with one single character. The reason is that an internal array is dimensioned to (maximum bytes per character * number of characters), and, although the "getMaxBytesPerChar" methods do return numbers that are big enough to include the "switch into the encoding" sequences, they are not big enough to also include the "switch out of the encoding" ones, and you can guess the rest. Didn't they test this stuff?
2. The "JIS auto detect" converters. These try to detect one of three common types of Japanese encoding (JIS, Shift-JIS and EUC). Unfortunately, they only look at the very start of a stream and then stick by their judgement. In my application, the start of the stream happened to be an HTTP header, and, because this was all 7-bit characters, the class always chose JIS by process of elimination. Not to worry - in Japanese you can get away with throwing all your data at each converter in turn and trying again if you get an exception. It is not that difficult to write a converter that copes with changing encoding systems on the fly (I did it in a few lines of C++), but this can go wrong. I would at least expect the state not to be finally set until Japanese characters have actually been processed, though. But again, if I modified the library then my code could be version or even implementation dependent. Most of the reason why I bothered to use Java anyway (rather than write my own in C++ and have done with it) is so that improvements in Java automatically make my program support more encodings. I suspect that the existing converter was written by somebody who had nothing to do with Japan and was just hacking out code to complete a library (if that sounds familiar to anyone).
3. The UTF-8 and related converters (UTF-8 is sometimes used for Chinese). When I saw these, I found to my horror that they can sometimes generate more than one Java 'char' for what is actually one character, and programmers are assured throughout the specification that one 'char' contains one character. This compromises the whole Java internationalisation philosophy, potentially bringing back all the old problems of "this code assumes that one char equals one character and it won't port to Chinese" and so on. Further, they use the user-defined area of Unicode, which really messes things up for people who are thinking of using that area themselves. Granted, if you want to represent millions of characters in a language that only uses two-byte Unicode, you have to do something awkward, but they could at least have documented it.
4. The Korean KSC5601 converter contains a private member variable called outputSize, which should be set to 0 in several places (most notably in the reset() method) but is not. As a result, undefined characters can sometimes be written to the output, leading to buffer overruns and internal errors. (To see the effect, try looping through all the Unicode characters while catching the conversion exceptions and it should go wrong at about hexadecimal 1100, which I think is Korean.)
5. At one point the ISO-2022 superclass constructs a String with the platform's default encoding, assuming that this encoding will not change the byte sequence. It also handles unrecognised escape sequences poorly, and neither it nor its subclasses properly distinguish between the various planes of the CNS11643 Chinese encoding; this would show up as wrong characters.
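The UTF-8 point above is easy to reproduce. The sketch below assumes a newer Java runtime than the 1.1.6 discussed in this letter, since it relies on java.nio.charset.StandardCharsets and String.codePointCount, which are not in the 1.1 API; the class and method names are invented for illustration. It decodes four UTF-8 bytes that encode a single character and counts the result both ways:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; requires a runtime that provides StandardCharsets.
final class CharCountSketch {
    /** Decodes UTF-8 bytes and returns how many Java 'char' units the String holds. */
    static int charUnits(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).length();
    }

    /** Decodes UTF-8 bytes and returns how many Unicode characters (code points) they represent. */
    static int characters(byte[] utf8) {
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        return decoded.codePointCount(0, decoded.length());
    }

    public static void main(String[] args) {
        // U+20000, a Chinese ideograph outside the 16-bit range, as four UTF-8 bytes.
        byte[] oneCharacter = {(byte) 0xF0, (byte) 0xA0, (byte) 0x80, (byte) 0x80};
        System.out.println("char units: " + charUnits(oneCharacter));   // prints 2
        System.out.println("characters: " + characters(oneCharacter));  // prints 1
    }
}
```

Any code that indexes such a String by 'char' and treats each index as one character will split this ideograph in half.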

The other thing about the Java character converters is the omissions. There are some very common encodings that are not supported, and some very obscure ones that are. It seems to me that this is because some encodings could be supported without writing any more code, and it's easy to just add a different character-mapping table. The whole thing seems as though somebody was trying to impress the management by supporting as many encodings as they could, without regard to which ones, like an email program that supports hundreds of binary formats but not uuencode. Some common encodings are supported, but others that I would expect to find in such a large library are missing.
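For readers on a newer runtime that includes the java.nio.charset package (it is not part of the 1.1 API discussed here), the installed encodings can at least be listed and queried by name. This is only a sketch: the class and method names are invented, and "HZ-GB-2312" is used as an assumed registry name for the HZ encoding, with no claim about whether any given runtime supports it.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; requires a runtime that provides java.nio.charset.
final class EncodingListSketch {
    /** Returns the canonical names of every encoding this runtime reports, in sorted order. */
    static List<String> installedEncodings() {
        return new ArrayList<>(Charset.availableCharsets().keySet());
    }

    /** Reports whether the runtime recognises the given encoding name or alias. */
    static boolean isInstalled(String name) {
        return Charset.isSupported(name);
    }

    public static void main(String[] args) {
        List<String> names = installedEncodings();
        System.out.println(names.size() + " encodings installed");
        // Assumed name for HZ; the answer depends entirely on the runtime in use.
        String hz = "HZ-GB-2312";
        System.out.println(hz + " installed: " + isInstalled(hz));
    }
}
```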

So, I'm glad that the ISO meeting is in Tokyo; as George says it may discourage American participation, but I hope that a decent Asian programmer seriously sorts them out. They need it. By the time you read this, it should all have already happened, and it will be interesting to see if any changes are made.

Another thing I don't like about Java is its half-finished Web implementation. If you're going to make getting Web pages part of the API, you might as well add support for pages that require authentication - at present this requires user interaction, so you can't write a proxy or CGI gateway with it. If you're going to put picture display (e.g. GIF display) in the AWT, you might as well put GIF writing in there as well, or at least if you're going to have an AWT class for an off-screen buffer then you might as well give it methods to read the points. Also you might as well make it so that you can have an off-screen buffer without having to instantiate anything else, so you can do graphical operations without having to display anything at all (this would make it very easy to write a CGI program that returns a GIF of a given Unicode character in a given font, for example).
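The kind of off-screen drawing wished for here can be sketched as follows, assuming a newer runtime that provides java.awt.image.BufferedImage and javax.imageio.ImageIO with a GIF writer (none of which is in the 1.1 API this letter is about). The class and method names are invented, and a plain square is drawn rather than a Unicode character so that the example does not depend on which fonts are installed.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical demo class; requires a runtime with BufferedImage and ImageIO.
final class OffscreenSketch {
    /** Draws a black square on a white background into a buffer that is never displayed. */
    static BufferedImage drawSquare(int size) {
        BufferedImage buffer = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = buffer.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, size, size);
            g.setColor(Color.BLACK);
            g.fillRect(size / 4, size / 4, size / 2, size / 2);
        } finally {
            g.dispose();
        }
        return buffer;
    }

    /** Encodes the buffer as GIF bytes in memory, e.g. for a CGI response body. */
    static byte[] toGif(BufferedImage image) {
        ImageIO.setUseCache(false); // keep everything in memory, no temporary files
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            if (!ImageIO.write(image, "gif", out)) {
                throw new IllegalStateException("this runtime has no GIF writer");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.setProperty("java.awt.headless", "true"); // no display needed
        BufferedImage image = drawSquare(16);
        // Reading a point back: the low 24 bits are the red, green and blue values.
        System.out.println("centre pixel: " + Integer.toHexString(image.getRGB(8, 8) & 0xFFFFFF));
        System.out.println("GIF size in bytes: " + toGif(image).length);
    }
}
```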

I was very excited about the Java JIT, what with its claims that it can go faster than C++ because it can do more processor-specific optimisations and so on, but the implementation on Sun's website was a bit of a joke - it added several seconds to program execution while it compiled, which may be all right in some applications but not every time you get a CGI query! Can't the JIT save the data structures it generates for use next time the program is run? The alternative would be to have a non-portable LRWP (long-running web process), which could take up quite a chunk of resources if done in Java. In my case, I did most of the work in C++ and spawned Java only for the encoding conversion, and eventually I thought "this is silly" and wrote my own library in C++. And before anyone asks, yes I did pinch a few mapping tables, but I could have got them from elsewhere had I been online at the time, and anyway it's for personal use (if running a Web server counts as personal use).

Oh, one final thing: How to make Java throw an access violation or a core dump (at least, the version I've got): Create a process with Runtime.exec(), make sure that that process finishes, and try to access its input and output streams. And there's nothing in the documentation about this, not even a boolean hasItFinishedYet() method.
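One defensive ordering is to read the child's output before waiting for it to exit, and to use exitValue() as a makeshift "has it finished yet" test, since it throws IllegalThreadStateException while the child is still running. The sketch below has not been checked against the crash described above. It swaps Runtime.exec() for the newer ProcessBuilder so that the child's two output streams can be merged, which needs a later runtime than 1.1, and its class and method names are invented.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class; ProcessBuilder stands in for Runtime.exec() here.
final class ProcessSketch {
    /** Makeshift hasItFinishedYet(): exitValue() throws while the child is still running. */
    static boolean hasFinished(Process child) {
        try {
            child.exitValue();
            return true;
        } catch (IllegalThreadStateException stillRunning) {
            return false;
        }
    }

    /** Starts the command, drains its merged output first, and only then waits for it to exit. */
    static String runAndCapture(String... command) {
        try {
            Process child = new ProcessBuilder(command).redirectErrorStream(true).start();
            child.getOutputStream().close(); // the child is given no input
            ByteArrayOutputStream captured = new ByteArrayOutputStream();
            try (InputStream out = child.getInputStream()) {
                byte[] chunk = new byte[4096];
                int n;
                while ((n = out.read(chunk)) != -1) {
                    captured.write(chunk, 0, n);
                }
            }
            child.waitFor();
            if (!hasFinished(child)) {
                throw new IllegalStateException("child still running after waitFor()");
            }
            return captured.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Runs the same Java launcher that is running this program, as a portable child process.
        String launcher = System.getProperty("java.home") + "/bin/java";
        System.out.print(runAndCapture(launcher, "-version"));
    }
}
```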

On a related subject, Suradet Jitprapaikulsarn's letter made me think. Does she really want a book in English? I still have a vivid image of how a certain immigrant must have felt when I dumped a huge library copy of Deitel & Deitel on her, and Suradet's pupils might find it even more difficult. Are translations available for the books reviewed in C Vu? Should C Vu mention such facts along with authors and ISBN numbers? Are there books in other languages, but not in English, that some ACCU members could review? If so, should the reviews be written in English? (I'd say yes they should, if English is used as a common language throughout ACCU, because that way someone can recommend a book on the review alone even if they don't speak its language.) English is not as universal as you might think, and if we want to be really international then we have a long way to go.

Regards

Silas S. Brown
