Wide characters (wchar_t)

2024-10-12

Wide characters

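A wide character is an abstract computing term (it prescribes no implementation details) for a character data type that is wider than the traditional 8-bit character. It is not the same thing as Unicode.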

Unicode

ISO/IEC 10646:2003 / Unicode 4.0 states:

"The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers."

"ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension."

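The zero-extension requirement quoted above can be checked at run time. The following is a minimal sketch, assuming an ASCII-based execution character set and the default "C" locale; `maps_by_zero_extension` and `check_portable_characters` are hypothetical helper names, and the sample string covers only a few portable characters.

```cpp
#include <cassert>
#include <cwchar>

// Returns true when the single-byte character `c` converts to a wide
// character with the same numeric value, i.e. the wide form is simply
// the narrow form zero-extended.
bool maps_by_zero_extension(char c) {
    const unsigned char byte = static_cast<unsigned char>(c);
    const std::wint_t wide = std::btowc(byte);  // WEOF if not a valid single-byte character
    if (wide == WEOF) {
        return false;
    }
    return wide == static_cast<std::wint_t>(byte);
}

// Checks a small sample of characters from the portable execution set.
void check_portable_characters() {
    const char* portable = "ABCxyz019 +-*/";
    for (const char* p = portable; *p != '\0'; ++p) {
        assert(maps_by_zero_extension(*p));
    }
}
```
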
Operating systems

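In the Windows API and the Visual Studio compiler, wchar_t is 16 bits wide. Wide strings there are UTF-16 little-endian, so a character outside the Basic Multilingual Plane needs two wchar_t units. Because a single wchar_t cannot hold every character the system can represent, this breaks the ANSI/ISO C standard.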

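On Unix-like systems, wchar_t is typically 32 bits wide, and a single wchar_t can represent any UTF-32 code point (stored in the platform's native byte order).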
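
To illustrate why a 16-bit unit cannot hold every character while a 32-bit unit can, here is a small sketch of the standard UTF-16 surrogate-pair encoding. The function name `encode_utf16` is a hypothetical helper, not part of any library, and the sketch does not reject lone surrogate values as input.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point as UTF-16 code units.
// Code points below U+10000 fit in one 16-bit unit; the rest need a
// surrogate pair, which is why a 16-bit wchar_t cannot hold them alone.
std::u16string encode_utf16(char32_t code_point) {
    assert(code_point <= 0x10FFFF && "outside the Unicode range");
    if (code_point < 0x10000) {
        return std::u16string(1, static_cast<char16_t>(code_point));
    }
    const char32_t offset = code_point - 0x10000;  // 20 significant bits
    const char16_t high = static_cast<char16_t>(0xD800 + (offset >> 10));
    const char16_t low = static_cast<char16_t>(0xDC00 + (offset & 0x3FF));
    return std::u16string{high, low};
}
```

For example, `encode_utf16(U'A')` yields one unit, while `encode_utf16(0x1F600)` yields the two units 0xD83D and 0xDE00. As a char32_t (or a 32-bit wchar_t) the same code point is a single value.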

Programming languages

C/C++

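wchar_t is a data type in ANSI/ISO C, and some other programming languages also use it to represent wide characters. In the ANSI C library, the headers <wchar.h> and <wctype.h> provide the functions for handling wide characters.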

The C90 language standard originally defined the type wchar_t as:

"an integral type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales" (ISO 9899:1990 §4.1.5)

The C and C++ standards published in 2011 (C11 and C++11) introduced the fixed-size character types char16_t and char32_t. The details of wchar_t remain implementation-defined.
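
The difference between the fixed-size types and wchar_t can be inspected directly. This is a minimal sketch; `bits_in`, `wchar_holds_any_code_point`, and `check_character_widths` are hypothetical helper names introduced only for illustration.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Width in bits of a character type on the current implementation.
template <typename CharT>
constexpr std::size_t bits_in() {
    return sizeof(CharT) * CHAR_BIT;
}

// True when one wchar_t can hold any Unicode code point
// (U+10FFFF needs 21 bits). The answer depends on the compiler.
constexpr bool wchar_holds_any_code_point() {
    return bits_in<wchar_t>() >= 21;
}

// The lower bounds below hold on every conforming C++11 implementation;
// the exact width of wchar_t is left to the compiler.
void check_character_widths() {
    assert(bits_in<char16_t>() >= 16);
    assert(bits_in<char32_t>() >= 32);
    assert(bits_in<wchar_t>() >= CHAR_BIT);
}
```

Code that must store arbitrary Unicode text portably can test `wchar_holds_any_code_point()` in a `static_assert`, or simply use char32_t instead.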

Python

Python uses wchar_t as the basis of its character type Py_UNICODE, provided that the system's wchar_t is "compatible with the chosen Python Unicode build variant".

Converting between wide and narrow characters

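Any character set that is not a wide character set, whether a single-byte character set (SBCS) or a (variable-length) multibyte character set (MBCS), is called a narrow character set.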

One use of a wide character set is as the intermediate representation when converting between any two narrow character sets.

There are several ways to convert between wide and narrow character sets.

Using the Windows API

For example:

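// Convert the MBCS string in lpszA to a wide string.
// First call: no output buffer and size 0, so the function returns the
// required length in wide characters, including the terminating null.
int nLen = MultiByteToWideChar(CP_ACP, 0, lpszA, -1, NULL, 0);
LPWSTR lpszW = new WCHAR[nLen];
// Second call: convert into the allocated buffer.
MultiByteToWideChar(CP_ACP, 0, lpszA, -1, lpszW, nLen);
// Use it to call OLE here.
pI->SomeFunctionThatNeedsUnicode(lpszW);
// Free the string.
delete[] lpszW;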

The C standard library functions mbstowcs() and wcstombs()

Both are declared in stdlib.h. The caller must allocate the destination buffer in advance.
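
A minimal sketch of how the two functions might be wrapped, with the destination buffer sized up front. The wrapper names `to_wide` and `to_narrow` are hypothetical. The conversion follows the current LC_CTYPE locale, so a real program would usually call setlocale() first; with the default "C" locale only the basic character set is guaranteed to convert.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Narrow (multibyte) to wide. A multibyte string never produces more wide
// characters than it has bytes, so strlen + 1 units are always enough.
std::wstring to_wide(const char* narrow) {
    assert(narrow != nullptr);
    const std::size_t capacity = std::strlen(narrow) + 1;
    std::vector<wchar_t> buffer(capacity);
    const std::size_t written = std::mbstowcs(buffer.data(), narrow, capacity);
    if (written == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("invalid multibyte sequence");
    }
    return std::wstring(buffer.data(), written);
}

// Wide to narrow (multibyte). Each wide character needs at most
// MB_CUR_MAX bytes in the current locale, plus one byte for the null.
std::string to_narrow(const std::wstring& wide) {
    const std::size_t capacity = wide.size() * MB_CUR_MAX + 1;
    std::vector<char> buffer(capacity);
    const std::size_t written = std::wcstombs(buffer.data(), wide.c_str(), capacity);
    if (written == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("wide character not representable in this locale");
    }
    return std::string(buffer.data(), written);
}
```

Both wrappers stop at the first null character, because the underlying C functions work on null-terminated strings.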
