Journal Articles

CVu Journal Vol 12, #3 - May 2000 + Programming Topics
Browse in : All > Journals > CVu > 123 (22)
All > Topics > Programming (877)
Any of these categories - All of these categories

Note: when you create a new publication type, the articles module will automatically use the templates user-display-[publicationtype].xt and user-summary-[publicationtype].xt. If those templates do not exist when you try to preview or display a new article, you'll get this warning :-) Please place your own templates in themes/yourtheme/modules/articles . The templates will get the extension .xt there.

Title: A Tale of Old Java

Author: Administrator

Date: 03 May 2000 13:15:36 +01:00 or Wed, 03 May 2000 13:15:36 +01:00

Summary: 

Body: 

I'd venture to suggest that for all but the very youngest, C and/or C++ has featured in the past of many Java programmers. Even if you are primarily a C/C++ programmer, you may at some point come in to contact with Java. If so, you may be in for one or two nasty surprises, one of which I hope you'll be able to avoid after reading about the following real-life example.

This is a snippet of Java I was recently trying to fix (apart from that array declaration in the first line, it could be C++, couldn't it?):

  byte[] cbuf = new byte[destBuf.length];
  int k, i;
  for(i = k = 0; i < destIndex; i++, k++) {
    cbuf[k] = destBuf[i];
// skip the run-length codes 
// between 0x80 and 0xC0
    if(cbuf[k] >= 0x80 && cbuf[k] < 0xC0) {
      cbuf[++k] = destBuf[++i];
      }
  ...
  ... etc.

There is a problem here, which is at least partly due to confusion over the difference between value and representation. See if you can spot what the problem is. While you ponder, I'll tell you a little more about the application it was taken from.

Like many of my colleagues, I have a Palm Pilot. This is a handy little device; one of the many useful things it can do is store and display documents. If you enjoy reading a little fiction just before you go to sleep, and your latest bedtime material is a download from Project Gutenberg [Gutenberg], you're far more likely to persuade your spouse to share the bed with a PDA than a laptop! Now unused RAM isn't necessarily abundant on Pilots, so there is a de facto standard document file format which includes compression [DOC]. A freeware Java tool [Brisk] is available to prepare such documents from plain text files - the problematic snippet above comes from that tool's compression routine, hence the reference in the comment to run-length codes.

Spotted it yet? If not, here's a little digression which may point you in the right direction. What did I mean earlier by the difference between value and representation? It's very simple, and it's at the heart of our programming activities, so naturally we tend to forget about it! When we see something like 0x80, we happily interpret 0x80 as meaning 8016 , which is 12810. We also tend to interpret 0x80 as the bit pattern 10000000, and it is then all too easy to assume that this represents 100000002 - they do, after all, look remarkably similar! Well, a bit pattern means precisely what I want it to: nothing more, and nothing less. It all depends upon the precise definition of the relationship between bit pattern and the value it represents, but much of the time it is convenient to blur the distinction between value and representation. Of course, there are occasions when it is crucial that we should avoid this blurring. Later on, I will try and make it clear when I want 0x80 to mean the numeric value 8016 , and when I want it to mean the bit pattern 10000000.

Now it would also be nice if the languages we use helped us to make this distinction when we need to. Well I don't know about you, but to me, "byte" means a concrete unit of storage, whereas "short" and "long" represent more abstract concepts (after all, the first is a noun and the latter two are adjectives we use as nouns). When I see the types short and long in Java, I think "short integer" and "long integer". When I see byte, I don't think "integer stored in a byte", rather I tend towards thinking "8-bit unit of storage", and then I slide down the slippery slope to thinking bit pattern. I wonder if anyone else has this "problem"? Why wasn't the type called a tiny? At least we have an unambiguous nomenclature in the field of communications, where we use "octet" to refer to an 8-bit segment of a stream. This octet may at some stage end up in a byte of storage, but what it represents is neither specified nor implied by its name alone.

Enough digression, and enough clues. Time to put you out of your misery if you're not there yet: Java only has signed numbers (it only interprets binary representations of numeric values as signed numbers), so a variable of type byte can only take values from -128 to +127. This means that cbuf[k] >= 0x80 is never true! (And coincidentally, cbuf[k] < 0xC0 is always true, although it will never be evaluated.)

Let us look at this more closely, because I might have convinced you that it isn't immediately obvious what 0x80 means in this context. You might expect 0x80 to be implicitly cast to a byte for the purposes of the comparison, and the effect of that casting might be to do precisely nothing (other than take the least significant byte...) You might then expect the comparison to be the logical equivalent of bytevalue >= -128 which you might reasonably expect to be always true.

OK - I'll stop trying to confuse you and tell you what is actually happening. There are two aspects of Java which we need to understand, the first of which is the nature of literals. In Java source code, 0x80 is (for the purposes of the Java Language Specification [Gosling-]) a HexIntegerLiteral, which is one of the types of IntegerLiteral. An IntegerLiteral is of type int, unless it has the suffix l or L, in which case it is of type long. So 0x80 is effectively of type int, with value 128. (The HexIntegerLiteral with value -128 is 0xFFFFFF80.)

The second aspect of Java which is relevant here is the application of binary numeric promotion. The full details are too lengthy to reproduce in their entirety [Promotion]; all we need to know is that in this case, both operands of the >= comparison operator are promoted to type int by "widening conversion." Widening conversion preserves value, in this case by sign-extension of the twos-complement representation [Widening]. This means that when the application runs, the comparison operator "sees" an integer value in the range -128 to 127 on the left hand side, and an integer value of 128 on the right. It's now obvious why cbuf[k] >= 0x80 is never true.

(Suppose that this was C++, and we'd used the type unsigned char instead of byte, just to make life a little easier. Are you sure you know how the analogous rules would apply to the evaluation of this expression? What about if we'd used signed char, or even just char?)

So, how do we fix the code so that it does what we want it to? Well, we want to trigger the execution of the body of the if block if the representation of cbuf[k] is in the range 0x80 to 0xC0, which means that the byte value is in the range -128 to -64. (After what I've said about distinguishing value and representation, you might take offence at my referring to a representation as having a range! I suppose I ought to characterise the representation as a mapping, where the byte values 0 to 12710 map to the bit pattern representations 00000000 to 01111111, and the values -12810 to -1 map to the bit pattern representations 10000000 to 11111111, the bit patterns in each half of the mapping incrementing as though they were binary numbers. I hope that makes things a bit more palatable.) How can we cleanly change the code, bearing this in mind? One possible fix is to change the comparison line to

if(cbuf[k] < (byte)0xC0) {

I wonder what you think about this? I can't decide whether it's evil or cute, and here's why. That cast on the right of the comparison might trick you into thinking that byte values are being compared; after all, the left hand side is a byte, isn't it? Don't forget binary numeric promotion. Oh, and I didn't mention narrowing conversion, did I? [Narrowing] That comparison operator is ultimately going to see two integer operands; the left hand side undergoes widening conversion as before, so that it is notionally an int in the range -128 to 127, but what happens on the right hand side? Recall that 0xC0 is a HexIntegerLiteral, and so is notionally the 32-bit 0x000000C0. When it is cast to byte, it undergoes narrowing conversion, which does not necessarily preserve value. In a narrowing conversion, the relevant number of most significant bits are dropped so that the result fits into the new representation. In our case we end up with the byte value with representation 8-bit 0xC0, which we now know is a value of -64. Finally, it undergoes widening conversion to the int required for the comparison operator, and it ends up with the 32-bit representation 0xFFFFFFC0, but still value -64.

This is evil because "hidden" conversions are occurring, but cute because the effect is the same as if two bytes were being compared. Another "simple" fix is

if(cbuf[k] < 0xFFFFFFC0) {

or, equivalently,

if(cbuf[k] < -64) {

I don't like either of these, because they begin to obscure what we are trying to achieve. How about adding a bit of redundancy

if(cbuf[k] >= (byte)0x80 
          && cbuf[k] < (byte)0xC0) {

on the grounds that it looks like the preceding comment, and that the first comparison may be optimized out anyway? Again, I don't like this. What would you do? (No, don't recode it in C or your favourite language - I want engineering Java solutions!)

The end of this tale is that just before I sent my discovery to the code's author, I discovered (too late) that I was looking at old Java (well old code, at least). Somehow I had gotten hold of a rather dated version - the latest version came with a list of bug fixes which included the bug I'd been looking at. (As an aside, you will know that bugs in compression and decompression can be what I call "loud". The tiniest error can have a huge effect, but that's a story for another time!) I was quite keen to see this particular fix, and I think you'll find it interesting:

destBuf = new byte[(buf.length*3)/2];
int k, i, destTemp;
for(i = k = 0; i < destIndex; i++, k++){
  destBuf[k] = destBuf[i];
  destTemp = destBuf[k] & 0xff;
  // skip the run-length codes
  if(destTemp >= 0x80 && destTemp < 0xc0){
    destBuf[++k] = destBuf[++i];
  }
...
... etc.

I will come clean. When I first saw this, though I knew it worked, I could not quite see how. We have an extra line, destTemp = destBuf[k] & 0xff, which you might think performs a redundant mask of the byte destBuf[k], and then performs a widening conversion to an int before assigning a value in the range -128 to 127 to destTemp. Hmmm - that would put us almost back where we started. It helps to know that binary numeric promotion also applies to integer bitwise operators. So destBuf[k] & 0xff actually has the effect of promoting the byte destBuf[k] by sign extension to 32 bits, and then zeroing the top 24 bits. That mask isn't redundant after all - it's vital to zero those sign extension bits. So destTemp, an int, has a value in the range 0 to 255 when we come to that familiar if statement. This time, we really do need that extra destTemp >= 0x80 comparison.

I'm not going to comment on what I think is the best fix; I think you should, though! A final foray into "interesting" code. What do you think Java will make of the following code fragment?

  byte b = -128;
  int i = -2147483648;
  b = (byte)-b;
  i = -i;

Remember that a byte variable is unable to hold the value 128, and an int is unable to hold 2147483648. There are a number of options here. Perhaps the code won't compile. (It certainly won't if you take away that (byte) cast. Why?) Perhaps there'll be an overflow at runtime, and an ArithmeticException will be thrown. Perhaps the code will run quite happily, and i and b will end up with values you might not expect. Since there are plenty of free JDKs around, why don't you spend a few moments investigating this? If after you do this you come to the conclusion that Java's behaviour is odd, then I should remind you that at least this is defined behaviour. What value should this fragment of C++ leave in i?

  char c = 255;
  int i = c;

I think that's enough for now. Although there are further interesting aspects to this topic, I hope I've aroused your interest enough for you to go and investigate them yourselves. When I first realised that I'd spent some valuable time fixing code that had already been fixed, I must admit I felt I'd been wasting my time. But I decided to share the experience because I recognised that here were some important lessons. I hope you agree.

References

[Gutenberg] Project Gutenberg Official Home Site: http://promo.net/pg/ (Project Gutenberg is generating a "plain" text archive of as many books as it can, copyright issues permitting. By the time you read this, more than 2,500 texts should be available. From the Official Home Site, you should be able to find your nearest mirror site.)

[DOC] The DOC Format: http://pyrite.linuxbox.com/ (Unless a particular item is difficult to find on a particular website, I try to give the URL for the home page only, as so many sites seem to be re-arranged at regular and too-frequent intervals, invalidating specific URLs. On this home page, you should be able to find a direct link to "The DOC Format".)

[Brisk] Brisk Software Home Page: http://www.qni.com/~brisk/ (Look for MakeDocJ in Pilot Software.)

[Gosling-] James Gosling, Bill Joy & Guy Steele: The Java Language Specification Addison-Wesley. Reading, Mass. 1996. - Section 3.10.1 Integer Literals. (This book is written in a quite readable way, especially for a reference work, but for me it has a major irritation. Perhaps to try and "lighten" the text (a worthy aim), or perhaps for less worthy reasons, the authors have thought fit to include numerous quotations. A good quotation is always apposite, and often carries a succinct message. Practically none of the quotations in this book have either of these qualities - they seem to have been largely selected on the basis of containing a particular keyword, the meaning or message of the quotation being irrelevant. This unfortunately seems to be a growing tendency amongst authors, and to my mind is the opposite of wit. However, there is humour (of a kind) in this book. Look up "prime" in the index.)

[Promotion] Ibid. - Section 5.6.2 Binary Numeric Promotion

[Widening] Ibid. - Section 5.1.2 Widening Primitive Conversions

[Narrowing] Ibid. - Section 5.1.3 Narrowing Primitive Conversions

Notes: 

More fields may be available via dynamicdata ..