Unicode

Introduction

Unicode is the term used for extended character sets which allows displaying text in many languages (including those with a lot of different characters like Japanese, Korean etc.). It solves the ASCII problem of having a lot of table dedicated for each language by having a unified table where each character has its own place. If an application needs to be used by such countries, it's highly recommended to use the unicode mode when starting the development, as it will reduce the number of bugs due to the conversion from ascii to unicode.

To simplify, unicode can be seen as a big ascii table, which doesn't have 256 characters but up to 65536. Thus, each character in unicode will need 2 bytes in memory (this is important when dealing with pointers for example). The special 'character' (.c) PureBasic type automatically switches from 1 byte to 2 bytes when unicode is enabled.

Here are some links to have a better understanding about unicode (must have reading):

General unicode information
Unicode and BOM

To check if the program is compiled in "Unicode mode", you can use the compiler directives with #PB_Compiler_Unicode constant. To compile a program in "Unicode mode" the related option can be set in the Compiler options.

Unicode and Windows

On Windows, PureBasic internally uses the UCS2 encoding which is the format used by the Windows unicode API, so no conversions are needed at runtime when calling an OS function. When dealing with an API function, PureBasic will automatically use the unicode one if available (for example MessageBox_() will map to MessageBoxW() in unicode mode and MessageBoxA() in Ascii mode). All the API structures and constants supported by PureBasic will also automatically switch to their unicode version. That means same API code can be compiled either in unicode or ascii without any change.

Unicode is only natively supported on Windows NT and later (Windows 2000/XP/Vista): a unicode program won't work on Windows 95/98/Me. There is solution by using the 'unicows' wrapper DLL but it is not yet supported by PureBasic. If the Windows 9x support is needed, the best is to provide two version of the executable: one compiled in ascii, and another in unicode. As it's only a switch to specify, it should be quite fast.

UTF-8

UTF-8 is another unicode encoding, which is byte based. Unlike UCS2 which always takes 2 bytes per characters, UTF-8 uses a variable length encoding for each character (up to 4 bytes can represent one character). The biggest advantage of UTF-8 is the fact it doesn't includes null characters in its coding, so it can be edited like a regular text file. Also the ASCII characters from 1 to 127 are always preserved, so the text is kind of readable as only special characters will be encoded. One drawback is its variable length, so it all string operations will be slower due to needed pre-processing to locate a character is in the text.

PureBasic uses UTF-8 by default when writing string to files in unicode mode (File and Preference libraries), so all texts are fully cross-platform.

The PureBasic compiler also handles both Ascii and UTF-8 files (the UTF-8 files need to have the correct BOM header to be handled correctly). Both can be mixed in a single program without problem: an ascii file can include an UTF-8 file and vice-versa. When developing a unicode program, it's recommended to set the IDE in UTF-8 mode, so all the source files will be unicode ready. As the UTF-8 format doesn't hurt as well when developing ascii only programs, it is not needed to change this setting back.