Tech Archive: Unicode in Windows

UNICODE

• The real problem with localization has always been handling different character sets.

• Some languages and writing systems have so many symbols that a single byte, which can represent at most 256 different symbols, is not enough.

• Unicode offers a simple and consistent way of representing strings.

• In Windows, a Unicode string is a sequence of 16-bit (2-byte) values.

• Because each value is 16 bits wide, more than 65,000 characters are available, which covers most characters of the written languages of the world. Characters outside this range are encoded as a pair of 16-bit values (a surrogate pair).
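To make the 16-bit representation concrete, here is a minimal sketch in C of how one Unicode code point maps to UTF-16, the encoding Windows uses for Unicode strings. The function name utf16_encode is an illustration only and is not part of any Windows or C library API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a library function): encodes one Unicode code
   point as UTF-16. Writes 1 or 2 code units to out and returns the count. */
size_t utf16_encode(uint32_t code_point, uint16_t out[2])
{
    /* Valid code points end at U+10FFFF and exclude the surrogate range. */
    assert(code_point <= 0x10FFFF);
    assert(code_point < 0xD800 || code_point > 0xDFFF);

    if (code_point < 0x10000) {
        out[0] = (uint16_t)code_point;                   /* one 16-bit value */
        return 1;
    }
    code_point -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (code_point >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (code_point & 0x3FF));  /* low surrogate */
    return 2;
}
```

For example, U+0041 (the letter A) is stored as the single value 0x0041, while U+1F600 needs the two values 0xD83D and 0xDE00.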

Advantages of Unicode

• It enables easy data exchange between languages.

• It allows you to distribute a single binary .exe or DLL file that supports all languages.

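• It improves the efficiency of your application, because Windows NT-based systems use Unicode internally and must convert ANSI strings on every call to an ANSI function.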

Writing Unicode Source Code

• It is possible to write a single source code file that can be compiled with or without Unicode. You only need to define two macros, UNICODE and _UNICODE, and recompile to make the change.

• To take advantage of Unicode character strings, some data types have been defined. The standard C header file String.h has been modified to define a data type named wchar_t, which is the data type of a Unicode character:

typedef unsigned short wchar_t;

• The standard C run-time string functions, such as strcpy, strchr and strcat, operate on ANSI strings only, so an equivalent set of Unicode functions was added. Their names begin with wcs (wide character string), such as wcscpy, wcschr and wcscat.
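The following sketch shows the three wcs functions named above in use. The helpers join_path and after_first_separator are hypothetical names chosen for this illustration; only wcscpy, wcscat, wcschr and wcslen come from the C run-time library. The Microsoft compiler may warn that wcscpy and wcscat are unsafe; this sketch checks the buffer size itself instead.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: builds L"<dir>\<file>" in dst.
   cap is the capacity of dst in wide characters, not in bytes. */
wchar_t *join_path(wchar_t *dst, size_t cap,
                   const wchar_t *dir, const wchar_t *file)
{
    /* Room for both parts, the separator, and the terminating zero. */
    assert(wcslen(dir) + wcslen(file) + 2 <= cap);

    wcscpy(dst, dir);       /* wide counterpart of strcpy */
    wcscat(dst, L"\\");     /* wide counterpart of strcat */
    wcscat(dst, file);
    return dst;
}

/* Hypothetical helper: returns the text after the first backslash,
   or NULL if the path contains no backslash. */
const wchar_t *after_first_separator(const wchar_t *path)
{
    const wchar_t *sep = wcschr(path, L'\\');   /* wide counterpart of strchr */
    return sep ? sep + 1 : NULL;
}
```

Note that every literal carries the L prefix and every size is counted in characters rather than bytes.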

• To set up dual compatibility (ANSI and Unicode), include the TChar.h file instead of String.h.

• TChar.h exists for the sole purpose of helping us create ANSI/Unicode generic source code files. It consists of a set of macros that you should use in your source code instead of making direct calls to either the str or wcs functions.

• If you define _UNICODE when you compile your source code, the macros refer to the wcs set of functions; otherwise they refer to the str set of functions.

• TChar.h includes some additional macros.

• To define an array of string characters that is ANSI/Unicode generic, use the following TCHAR data type:

If _UNICODE is defined, TCHAR is declared as (for Unicode):

typedef wchar_t TCHAR;

If _UNICODE is not defined, TCHAR is declared as (for ANSI):

typedef char TCHAR;

For example, we can allocate a string of characters as follows:

TCHAR szString[100];

• By default, the Microsoft C++ compiler compiles all strings as though they were ANSI strings, not Unicode strings. To make the compiler treat a literal string as a Unicode string, put an uppercase L before it.

Create pointers to strings:

TCHAR *szError = L"Error";

• This line compiles only when _UNICODE is defined, so we need another macro that selectively adds the uppercase L before a literal string. This is the job of the _TEXT macro, which is also defined in the TChar.h file.

If _UNICODE is defined, _TEXT is defined as:

#define _TEXT(x) L ## x

If _UNICODE is not defined, _TEXT is defined as:

#define _TEXT(x) x

So, to create pointers to strings, rewrite the line above as:

TCHAR *szError = _TEXT("Error");

_TEXT can also be used for literal characters:

if (szError[0] == _TEXT('U')) { ... }
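Putting TCHAR, _TEXT and the generic string macros together, a source file that compiles as either ANSI or Unicode could look like the sketch below. The helper format_error is a hypothetical name. On Windows the sketch uses the real TChar.h; the #else branch is an assumption added only so that the sketch also compiles elsewhere, and it merely mimics the mappings described above.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <tchar.h>      /* real TCHAR, _TEXT, _tcslen, _tcscpy, _tcscat */
#else
/* Stand-ins (assumption): mimic the TChar.h mappings on other platforms. */
#include <string.h>
#include <wchar.h>
#ifdef _UNICODE
typedef wchar_t TCHAR;
#define _TEXT(x) L ## x
#define _tcslen  wcslen
#define _tcscpy  wcscpy
#define _tcscat  wcscat
#else
typedef char TCHAR;
#define _TEXT(x) x
#define _tcslen  strlen
#define _tcscpy  strcpy
#define _tcscat  strcat
#endif
#endif

/* Hypothetical helper: writes "Error: <detail>" into dst.
   cch is the capacity of dst in characters (TCHARs), not in bytes. */
TCHAR *format_error(TCHAR *dst, size_t cch, const TCHAR *detail)
{
    const TCHAR *prefix = _TEXT("Error: ");

    /* Both parts plus the terminating zero must fit. */
    assert(_tcslen(prefix) + _tcslen(detail) + 1 <= cch);

    _tcscpy(dst, prefix);   /* strcpy or wcscpy, chosen at compile time */
    _tcscat(dst, detail);   /* strcat or wcscat, chosen at compile time */
    return dst;
}
```

A caller would write format_error(szString, 100, _TEXT("disk full")) and get the same behavior whether or not _UNICODE is defined; only the character type changes.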

• The _UNICODE macro is used by the C run-time header files.

• The UNICODE macro is used by the Windows header files.

• Usually we need to define both macros when compiling a source code module.

• Windows defines its own Unicode data types: WCHAR is a Unicode character, PWSTR is a pointer to a Unicode string, and PCWSTR is a pointer to a constant Unicode string.

• Windows functions that take string parameters come in two versions, a Unicode version ending in W and an ANSI version ending in A. Header files such as WinUser.h map the generic name to one of them:

#ifdef UNICODE
#define CreateWindowEx CreateWindowExW
#else
#define CreateWindowEx CreateWindowExA
#endif //UNICODE
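The same pattern can be reproduced in your own code. The sketch below uses hypothetical functions CountCharsA and CountCharsW (they are not Windows functions) to show how a generic name resolves to one of two implementations at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical functions named in the Windows style: the A version takes
   ANSI characters and the W version takes wide characters. */
size_t CountCharsA(const char *psz)
{
    assert(psz != NULL);
    return strlen(psz);
}

size_t CountCharsW(const wchar_t *psz)
{
    assert(psz != NULL);
    return wcslen(psz);
}

/* The generic name is only a macro; it expands to one of the two
   functions depending on whether UNICODE is defined. */
#ifdef UNICODE
#define CountChars CountCharsW
#else
#define CountChars CountCharsA
#endif
```

Because CountChars is a macro rather than a function, code that calls it must pass a string of the matching type.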

• The ShlWApi.h file declares Windows string functions such as StrCat, StrChr and StrCmp, each with both Unicode and ANSI versions, such as StrCatW and StrCatA.

• To determine whether a text file contains ANSI or Unicode characters, use the Windows function IsTextUnicode:

BOOL IsTextUnicode(CONST VOID *pvBuffer, int cb, PINT pResult);

pvBuffer: the address of the buffer that we want to test. It is a void pointer because we don't know whether the buffer contains ANSI or Unicode characters.

cb: the number of bytes that pvBuffer points to. Again, because you don't know what is in the buffer, cb is a count of bytes rather than a count of characters.

pResult: the address of an integer that you must initialize before calling IsTextUnicode. Initialize this integer to indicate which tests you want IsTextUnicode to perform. You can also pass NULL for this parameter, in which case IsTextUnicode performs every test it can.

IsTextUnicode returns TRUE if the buffer appears to contain Unicode text; otherwise it returns FALSE.
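A call to IsTextUnicode might be wrapped as in the sketch below. The wrapper name looks_like_unicode is hypothetical. The Windows branch calls the real function (link with Advapi32.lib); the #else branch is an assumption added so that the sketch compiles on other platforms, and it checks only for a UTF-16 byte-order mark, which is far less thorough than the real function.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical wrapper: returns nonzero if the buffer looks like
   Unicode (UTF-16) text. cb is a count of bytes, not characters. */
int looks_like_unicode(const void *pvBuffer, int cb)
{
    assert(pvBuffer != NULL && cb >= 0);
#ifdef _WIN32
    /* On input: the tests to run. On output: the tests that passed. */
    int tests = IS_TEXT_UNICODE_UNICODE_MASK;
    return IsTextUnicode(pvBuffer, cb, &tests) != 0;
#else
    /* Stand-in (assumption): byte-order-mark check only. */
    const unsigned char *p = (const unsigned char *)pvBuffer;
    return cb >= 2 && p[0] == 0xFF && p[1] == 0xFE;
#endif
}
```

Because the real function relies on statistical tests, treat its answer as a good guess rather than a guarantee, especially for very short buffers.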

• Translating strings between Unicode and ANSI

The Windows function MultiByteToWideChar converts a multibyte character string to a wide-character string.

The Windows function WideCharToMultiByte converts a wide-character string to its multibyte string equivalent.
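The two conversions can be sketched as a pair of small helpers. The names to_wide and to_multibyte are hypothetical. On Windows they call MultiByteToWideChar and WideCharToMultiByte with the system ANSI code page (CP_ACP); the #else branch is an assumption that substitutes the C run-time functions mbstowcs and wcstombs so that the sketch compiles on other platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: converts a zero-terminated multibyte string to wide
   characters. cch is the capacity of dst in wide characters.
   Returns 1 on success and 0 on failure. */
int to_wide(const char *src, wchar_t *dst, size_t cch)
{
    assert(src != NULL && dst != NULL && cch > 0);
#ifdef _WIN32
    /* -1 means src is zero-terminated; the terminator is converted too. */
    return MultiByteToWideChar(CP_ACP, 0, src, -1, dst, (int)cch) > 0;
#else
    /* Stand-in (assumption): C run-time counterpart, current locale. */
    size_t n = mbstowcs(dst, src, cch);
    return n != (size_t)-1 && n < cch;  /* n == cch: no room for the zero */
#endif
}

/* Hypothetical helper: the reverse conversion. cb is the capacity of dst
   in bytes. Returns 1 on success and 0 on failure. */
int to_multibyte(const wchar_t *src, char *dst, size_t cb)
{
    assert(src != NULL && dst != NULL && cb > 0);
#ifdef _WIN32
    return WideCharToMultiByte(CP_ACP, 0, src, -1, dst, (int)cb,
                               NULL, NULL) > 0;
#else
    size_t n = wcstombs(dst, src, cb);
    return n != (size_t)-1 && n < cb;
#endif
}
```

Note the units: the wide buffer is sized in characters and the multibyte buffer in bytes, which matches the parameters of the two Windows functions.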
