UTF-8

UTF-8 is an efficient encoding of Unicode character strings. It recognizes that the majority of text-based communication is in ASCII, and it optimizes the encoding of those characters accordingly.

For a good introduction to UTF-8, see RFC 2044.

UTF strings are formed as follows:
 
bit          7   6   5   4   3   2   1   0
byte 1       Message Length MSB
byte 2       Message Length LSB
bytes 3...   Encoded Character Data

Note that although other fields in the protocol specification use little-endian (LSB, MSB) byte ordering, the Message Length field of a UTF string is in network order, i.e. big-endian (MSB, LSB).
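The byte ordering of the Message Length field can be illustrated with a small Java sketch. The class and method names (UtfLengthPrefix, encodeLength, decodeLength) are hypothetical and chosen only for this illustration; they are not part of the protocol specification.

```java
public class UtfLengthPrefix {

    /** Splits a 16-bit length into two bytes in network order: MSB first, then LSB. */
    static byte[] encodeLength(int length) {
        if (length < 0 || length > 0xFFFF) {
            throw new IllegalArgumentException("length must fit in 16 bits: " + length);
        }
        return new byte[] { (byte) (length >>> 8), (byte) (length & 0xFF) };
    }

    /** Rebuilds the length from its two bytes, masking each to avoid sign extension. */
    static int decodeLength(byte msb, byte lsb) {
        return ((msb & 0xFF) << 8) | (lsb & 0xFF);
    }

    public static void main(String[] args) {
        byte[] prefix = encodeLength(4);
        System.out.printf("%02X %02X%n", prefix[0], prefix[1]); // prints "00 04"
        System.out.println(decodeLength(prefix[0], prefix[1])); // prints "4"
    }
}
```

The masking with 0xFF matters in Java because byte is a signed type; without it, a length byte of 0x80 or above would be sign-extended and corrupt the result.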

Note also that the Message Length is the number of bytes in the encoded string, not the number of characters. For ASCII strings the two are the same, because each ASCII code from 0x01 to 0x7F is encoded as a single byte with the following format:
 
bit          7   6   5   4   3   2   1   0
             0   ASCII code of character (bits 6 to 0)
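The difference between the byte count and the character count can be checked with a short Java sketch. The class name UtfByteCount and the sample strings are assumptions made for illustration. Standard UTF-8 from String.getBytes() is used here as a stand-in; it agrees with the format described above for the characters shown, though it may differ for U+0000 and for characters outside the Basic Multilingual Plane.

```java
import java.nio.charset.StandardCharsets;

public class UtfByteCount {

    /** Number of bytes the string occupies once encoded, i.e. the Message Length value. */
    static int encodedLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "OTWP";         // pure ASCII: one byte per character
        String accented = "caf\u00E9"; // U+00E9 needs two bytes in UTF-8

        System.out.println(ascii.length() + " chars, " + encodedLength(ascii) + " bytes");
        // prints "4 chars, 4 bytes"
        System.out.println(accented.length() + " chars, " + encodedLength(accented) + " bytes");
        // prints "4 chars, 5 bytes"
    }
}
```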

Thus the ASCII text string OTWP would be encoded in UTF-8 as:
 
 
bit                                   7   6   5   4   3   2   1   0
byte 1   Message Length MSB (0x00)    0   0   0   0   0   0   0   0
byte 2   Message Length LSB (0x04)    0   0   0   0   0   1   0   0
byte 3   'O' (0x4F)                   0   1   0   0   1   1   1   1
byte 4   'T' (0x54)                   0   1   0   1   0   1   0   0
byte 5   'W' (0x57)                   0   1   0   1   0   1   1   1
byte 6   'P' (0x50)                   0   1   0   1   0   0   0   0
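The OTWP example can be reproduced by hand with a minimal Java sketch that prepends the two length bytes to the encoded characters. The names UtfStringEncoder, encode and toHex are hypothetical, and standard UTF-8 is assumed for the character data, which is sufficient for ASCII input such as this.

```java
import java.nio.charset.StandardCharsets;

public class UtfStringEncoder {

    /** Builds a UTF string: 2-byte big-endian length followed by the encoded characters. */
    static byte[] encode(String s) {
        byte[] data = s.getBytes(StandardCharsets.UTF_8);
        if (data.length > 0xFFFF) {
            throw new IllegalArgumentException("encoded string exceeds 65535 bytes");
        }
        byte[] out = new byte[2 + data.length];
        out[0] = (byte) (data.length >>> 8);   // Message Length MSB
        out[1] = (byte) (data.length & 0xFF);  // Message Length LSB
        System.arraycopy(data, 0, out, 2, data.length);
        return out;
    }

    /** Formats bytes as space-separated two-digit hex values. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encode("OTWP"))); // prints "00 04 4F 54 57 50"
    }
}
```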

In Java, the writeUTF() and readUTF() methods of the data streams (DataOutputStream and DataInputStream) write and read data in this format.
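As a rough illustration, the following sketch round-trips the string OTWP through DataOutputStream.writeUTF() and DataInputStream.readUTF() using in-memory streams. The class name WriteUtfDemo and its helper methods are assumptions made for this example; only the java.io classes and their methods come from the standard library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WriteUtfDemo {

    /** Encodes a string with writeUTF() and returns the raw bytes. */
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Decodes bytes produced by writeUtf() back into a string with readUTF(). */
    static String readUtf(byte[] encoded) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] encoded = writeUtf("OTWP");
        StringBuilder hex = new StringBuilder();
        for (byte b : encoded) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // prints "00 04 4F 54 57 50"
        System.out.println(readUtf(encoded));      // prints "OTWP"
    }
}
```

The output matches the byte layout shown for OTWP above: two length bytes in network order followed by one byte per ASCII character.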
 


Discussion

ASC on UTF
If I was being non-international flavoured, a one byte length followed by ASCII characters would probably suffice (for strings up to 255 characters), but I guess we should be thinking of the "NLS" aspect of this and going for Unicode, in which case UTF-8 seems perfectly reasonable, and only wastes one length byte for ASCII strings.
AN comment
Even in non-English speaking countries, strings are still specified in English, and so in fact ASCII with either a 0x00 terminator, or a preceding length byte would be perfectly alright. However, UTF does not impose a significant overhead, and in light of IBM's NLS requirements it would be appropriate to use the UTF encoding.


Last Modified: 1-Jul-99