For a good introduction to UTF-8, see RFC 2044.
UTF strings are formed as follows:
bit        | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
byte 1     | Message Length MSB            |
byte 2     | Message Length LSB            |
bytes 3... | Encoded Character Data        |
Note that although other fields in the protocol specification use little-endian (LSB, MSB) byte ordering, the Message Length field of a UTF string is in network order, i.e. big-endian (MSB, LSB).
Note also that the Message Length is the number of bytes of encoded
string characters, rather than just the number of characters. For ASCII
strings, however, these are the same, since for ASCII codes 0x01 to 0x7F,
the encoded characters have the following format:
bit | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
    | 0 | x | x | x | x | x | x | x |
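The difference between byte count and character count can be sketched in Java as follows. This is a minimal illustration, not part of the protocol specification: the class name UtfStringEncoder and its encode method are hypothetical, and the sketch assumes standard UTF-8 as produced by StandardCharsets.UTF_8. The 65535-byte limit follows from the length field being two bytes wide.

```java
import java.nio.charset.StandardCharsets;

final class UtfStringEncoder {
    // Hypothetical helper: a 2-byte big-endian Message Length
    // followed by the UTF-8 encoded character data.
    static byte[] encode(String text) {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        if (data.length > 0xFFFF) {
            throw new IllegalArgumentException("encoded string exceeds 65535 bytes");
        }
        byte[] out = new byte[2 + data.length];
        out[0] = (byte) (data.length >>> 8);   // Message Length MSB
        out[1] = (byte) (data.length & 0xFF);  // Message Length LSB
        System.arraycopy(data, 0, out, 2, data.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] ascii = encode("abc");        // 3 characters, 3 encoded bytes
        byte[] accented = encode("\u00e9");  // 1 character (e-acute), 2 encoded bytes
        System.out.println(ascii.length - 2);     // prints 3
        System.out.println(accented.length - 2);  // prints 2
    }
}
```

The length is taken from the encoded byte array rather than from String.length(), which is what keeps the prefix correct for non-ASCII text.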
Thus the ASCII text string OTWP would be encoded in UTF-8 as:
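bit    | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
byte 1 | Message Length MSB (0x00)     |
       | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
byte 2 | Message Length LSB (0x04)     |
       | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
byte 3 | 'O' (0x4F)                    |
       | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |
byte 4 | 'T' (0x54)                    |
       | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
byte 5 | 'W' (0x57)                    |
       | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
byte 6 | 'P' (0x50)                    |
       | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |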
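Reading such a string back reverses the process: combine the two length bytes in big-endian order, then decode that many bytes as UTF-8. The sketch below is an illustration only. The class name UtfStringDecoder and its decode method are hypothetical, and it assumes the whole string is already held in a byte array and uses standard UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class UtfStringDecoder {
    // Hypothetical helper: reads the 2-byte big-endian Message Length,
    // then decodes that many bytes of character data.
    static String decode(byte[] buf) {
        if (buf.length < 2) {
            throw new IllegalArgumentException("missing length prefix");
        }
        // Mask with 0xFF because Java bytes are signed.
        int length = ((buf[0] & 0xFF) << 8) | (buf[1] & 0xFF);
        if (buf.length < 2 + length) {
            throw new IllegalArgumentException("truncated string data");
        }
        return new String(buf, 2, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] wire = {0x00, 0x04, 0x4F, 0x54, 0x57, 0x50};
        System.out.println(decode(wire)); // prints OTWP
    }
}
```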
In Java, the writeUTF() and readUTF() methods of data streams
(DataOutputStream and DataInputStream) write and read data in this format.
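A short round trip through these methods might look like the sketch below. The class name DataStreamUtfDemo and its write and read helpers are hypothetical wrappers; only DataOutputStream.writeUTF and DataInputStream.readUTF are standard library calls. Note that the Java documentation describes their encoding as "modified UTF-8", which differs from standard UTF-8 for the character U+0000 and for characters outside the Basic Multilingual Plane, so the two agree for ASCII codes 0x01 to 0x7F but should not be assumed identical for every string.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class DataStreamUtfDemo {
    // Writes one length-prefixed string and returns the raw bytes.
    static byte[] write(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads one length-prefixed string from the raw bytes.
    static String read(byte[] wire) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = write("OTWP");
        StringBuilder hex = new StringBuilder();
        for (byte b : wire) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println(hex.toString().trim()); // prints 00 04 4F 54 57 50
        System.out.println(read(wire));            // prints OTWP
    }
}
```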