Wednesday, January 23, 2013

Encodings - what are they?

At the very lowest level a string is represented as a bunch of bytes.

In some programming languages that's about as sophisticated as it gets.

In most of the very early computers everything was "English", which eventually got standardized as ASCII. (There is another standard, EBCDIC, but it's less common on PCs.) In those early days of digital computing, each byte represented one and only one character (only 7 of the 8 bits were used), so a "string" was made up of bytes with values from 0-127. In ASCII, the value 65 indicates an "A", for example.
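In a Real Studio app you can see this mapping directly with the Asc and Chr functions, a small illustration of the ASCII values described above:

Dim code As Integer = Asc("A") // code is 65, the ASCII value for "A"
Dim letter As String = Chr(65) // letter is "A"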

But not everyone reads and writes English, so eventually computer vendors came up with "extended versions of ASCII" by using the 8th bit and adding whole new sets of characters (values 128 to 255). These are referred to in several ways: code pages, extended ASCII and a few others.

But since there are thousands of languages and only 256 possible values for each byte, you had to know which specific "extended ASCII" set was being used to determine what the extra characters were. This was the "encoding" for what have become known as single-byte languages. And there are many of them.

Eventually, it became apparent that even all 256 possibilities could not represent some languages, such as Chinese, so new schemes were devised.

Over time these schemes evolved into a standard called Unicode. There are now several Unicode encodings (UTF-8, UTF-16, UCS-2, UTF-32/UCS-4, etc.) that can represent every "character" in every written language. However, these encodings no longer use just a single byte per character. UTF-8, for instance, may use several successive bytes to represent one character.
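You can see the difference between characters and bytes in Real Studio by comparing Len, which counts characters, with LenB, which counts the underlying bytes. A small sketch, assuming the string literal is UTF-8 (as Real Studio literals are):

Dim s As String = "é" // one character, but two bytes in UTF-8
Dim chars As Integer = Len(s) // 1 character
Dim byteCount As Integer = LenB(s) // 2 bytes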

All of this is essentially a long way of saying that you have to know the encoding of a bunch of bytes in order to get the correct string value from it. If you have a string of data that is encoded in UTF-8, but try to use it as if it were in ASCII encoding, you may find that some characters will not be what you expect.

So what does this mean for Real Studio developers? If you are creating and dealing with strings entirely from within your Real Studio apps, you have it easy. Everything is treated as UTF-8, which is a versatile and popular encoding. Perhaps one of the best things about UTF-8 is that you never have to worry about the byte order of the data (or endianness) across platforms. UTF-8 is always the same on every platform.

But problems can occur for you when you start to accept data from outside sources that should be treated as text, such as from databases, files, serial ports or the Internet.

There are two methods that are particularly helpful when dealing with text from outside sources: DefineEncoding and ConvertEncoding.

You use the DefineEncoding method to specifically state the encoding of the incoming text. This is how you tell your app to "interpret the following bunch of bytes as if it is this encoding". Of course, this means that you actually have to know the encoding of the incoming data so you can tell your app what it is. This code specifies that data in a MemoryBlock (perhaps one that was read in using a BinaryStream) contains text using the UTF-16 encoding:

Dim myString As String
myString = DefineEncoding(MyMemoryBlock.StringValue(0, 8), Encodings.UTF16)

When you already have text in one encoding, but need to work with it in another encoding you use the ConvertEncoding method. For example, you may have a database that only works with UTF-16, so you want to convert text entered by the user to UTF-16 before you write it to the database:

Dim name As String = NameField.Text
name = ConvertEncoding(name, Encodings.UTF16)

The Encodings module has dozens of common encodings predefined for you.

So when you get data from outside of your application, make sure that you know what encoding it is, or that you have defined it, before you begin using it!

1 comment:

Norman Palardy said...

Joe added some more clarity to this on the NUG with:

To go into a bit of background about strings, first... A string in Real Studio is a collection of bytes and an encoding that tells the framework how to interpret the bytes. Unlike some other languages and frameworks, it's not defined as being a series of Unicode codepoints (or something else that is guaranteed to always represent textual content and be valid).

A string with a nil encoding is basically a bag of bytes, much like a MemoryBlock. These are returned from things like BinaryStream.Read or strings in structures.

A string with an encoding is saying "these bytes represent text via this encoding". The framework trusts that you're not lying to it and that the bytes are actually valid in the given encoding.

When you pass the framework a string that is meant to be used as text, like a button's caption or a string to draw via DrawString, it expects you to have a valid encoding. If you give it a string with a nil encoding or a string with bytes that are invalid in the encoding you specified, we try to recover the best that we can. When encountering a situation that we can't really make heads or tails of, we insert the Unicode replacement character (the diamond).

Going forward to DefineEncoding and ConvertEncoding, which were also mentioned in the thread... DefineEncoding creates a new string from an existing string's bytes but tags them with the encoding you specified. It's simply reinterpreting the data. ConvertEncoding, on the other hand, takes a string that already has a valid encoding and alters its bytes to be encoded in the encoding object you pass.
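The difference is easiest to see side by side. A rough sketch, where rawBytes is a hypothetical string holding UTF-16 data that was read in with no encoding set:

// rawBytes: UTF-16 data with a nil encoding (hypothetical variable)
Dim asText As String = DefineEncoding(rawBytes, Encodings.UTF16) // same bytes, now tagged as UTF-16
Dim asUTF8 As String = ConvertEncoding(asText, Encodings.UTF8) // new bytes, same text, re-encoded as UTF-8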

When you read from a file, structure, socket, or some other sort of byte stream, you should be sure to call DefineEncoding on the result if you mean to use it as text later on. Some of the accessors like BinaryStream.Read already take an encoding parameter to let you skip the DefineEncoding step.
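For example, BinaryStream.Read can tag the bytes as it reads them, which saves the separate DefineEncoding call. A sketch, assuming a UTF-8 text file (the file name is just a placeholder):

Dim stream As BinaryStream = BinaryStream.Open(GetFolderItem("data.txt"))
Dim text As String = stream.Read(stream.Length, Encodings.UTF8) // bytes come back already tagged as UTF-8
stream.Close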

One other thing to be aware of is that string operations that are performed with a string with a nil encoding always end up with a string that also has a nil encoding. For example, concatenating a UTF-8 string with a string that has a nil encoding results in a string with a nil encoding.
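You can check for this situation with the Encoding function, which returns Nil when a string has no encoding. A sketch, where MyBinaryStream is a hypothetical open BinaryStream:

Dim tagged As String = "hello" // string literals are UTF-8
Dim untagged As String = MyBinaryStream.Read(4) // no encoding parameter, so nil encoding
Dim combined As String = tagged + untagged
If Encoding(combined) = Nil Then
  // the concatenation lost the encoding information
End If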