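Problem

On a Simplified Chinese Windows XP system (multilingual edition), opening a file whose name contains Chinese characters with std::fstream in VC2005 fails: the stream reports that the file cannot be found.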

std::ifstream file("\xd6\xd0.txt"); // "中.txt" encoded in GBK
if (!file)
{
    std::cerr << "Cannot open file!"; // Oops!
}

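Cause

In VC2005, the file-opening code behind std::fstream treats the char const* file name as a multibyte string, converts it to a wide string with mbstowcs first, and then forwards it to the Unicode version of the API, which performs the actual open. See fiopen.cpp: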

_MRTIMP2_NCEEPURE FILE *__CLRCALL_PURE_OR_CDECL _Fiopen(const char *filename,
    ios_base::openmode mode, int prot)
    {    // open wide-named file with byte name
    wchar_t wc_name[FILENAME_MAX];

    if (mbstowcs_s(NULL, wc_name, FILENAME_MAX, filename, FILENAME_MAX - 1) != 0)
        return (0);
    return _Fiopen(wc_name, mode, prot);
    }

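The catch is that mbstowcs has to know the source encoding in order to convert multibyte text into wide Unicode text correctly. Unfortunately, that encoding does not appear in the function's parameter list; it is an implicit dependency on the global locale. Worse, the global locale does not default to the current system language. It starts out as the "C" locale, which is useless here. The GBK-encoded file name is therefore converted incorrectly under the "C" locale, the resulting wide name matches no file, and the open fails.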
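The hidden dependency can be seen in a few lines of portable C++. The sketch below is only an illustration: `widen_with_global_locale` is a hypothetical helper, not something from the CRT or from the original code. It calls std::mbstowcs without passing any encoding, so its result is decided entirely by whichever LC_CTYPE locale happens to be active at the time of the call.

```cpp
#include <clocale>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: converts `bytes` using whatever LC_CTYPE locale is
// active right now. Nothing in the signature says which encoding is assumed.
// Returns false when mbstowcs rejects the input.
bool widen_with_global_locale(char const* bytes, std::wstring& out)
{
    // A multibyte string never yields more wide characters than it has
    // bytes, so strlen(bytes) + 1 elements are enough (including the null).
    std::vector<wchar_t> buffer(std::strlen(bytes) + 1, L'\0');

    std::size_t const written =
        std::mbstowcs(&buffer[0], bytes, buffer.size());

    if (written == static_cast<std::size_t>(-1))
    {
        return false; // invalid sequence for the current locale
    }

    out.assign(&buffer[0], written);
    return true;
}
```

Calling it with "\xd6\xd0.txt" once under the "C" locale and once after setlocale(LC_CTYPE, "") on a Simplified Chinese system should give different results. What the "C" locale does with bytes above 0x7F varies between runtimes, so no particular output is assumed here.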

Solution

Once the cause is known, the fix is simple. Before calling mbstowcs, directly or indirectly, use setlocale to switch the global locale to the system default locale:

setlocale(LC_ALL, "");

If GBK-encoded names have to be handled on a non-Chinese system, the Chinese locale must be named explicitly:

setlocale(LC_ALL, "chs"); // "chs" is VC's name for the Simplified Chinese locale
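A named locale is not guaranteed to exist on every machine or runtime, and setlocale signals this only through its return value. A minimal sketch of checking it follows; `try_set_ctype_locale` is a hypothetical name introduced for illustration. It touches LC_CTYPE only, which is the category that governs multibyte conversion.

```cpp
#include <clocale>

// Hypothetical helper: switches LC_CTYPE to `name` and reports whether the
// runtime accepted it.
bool try_set_ctype_locale(char const* name)
{
    // setlocale returns NULL, and leaves the locale unchanged, when it
    // cannot honor the request.
    return std::setlocale(LC_CTYPE, name) != NULL;
}
```

A caller could, for example, fall back to the wide-character file name when try_set_ctype_locale("chs") returns false.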

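Another option is to use the wide-character version of the file-opening API directly. You do the encoding conversion yourself beforehand, so the system's language settings no longer matter. In VS2005, fstream has an extension that accepts a wide-character file name: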

std::ifstream file(L"\u4E2D.txt"); // "中.txt" as a wide (UCS-2 / UTF-16) literal

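Further thoughts

An API that hides its dependencies is a bad API: the surrounding environment can change what the API does through unwritten rules. That makes the API harder to reuse and to test, and surprises at run time become more likely.

Take it one step further. If the current locale was set deliberately by some other piece of code, then carelessly changing the global locale to fix the Chinese file name bug may well break the code that relies on the original locale. With an API designed this way, the best we can do to avoid fixing one thing and breaking another is to save the current locale before changing it and restore it afterwards. RAII is a natural way to package that logic: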

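#include <clocale>
#include <string>

// Switches LC_CTYPE to loc_name for the lifetime of the object and restores
// the previous setting on destruction.
class scoped_locale
{
public:
    explicit scoped_locale(std::string const& loc_name)
        : _changed(false)
    {
        try
        {
            // setlocale returns the locale in effect *after* the call, so the
            // current one has to be queried (NULL argument) before changing it.
            char const* current = setlocale(LC_CTYPE, NULL);

            if (NULL != current)
            {
                // Copy the name: a later setlocale call may overwrite the buffer.
                _old_locale = current;
                _changed = (NULL != setlocale(LC_CTYPE, loc_name.c_str()));
            }
        }
        catch (...)
        {
            // The old name could not be saved; leave the locale untouched.
        }
    }

    ~scoped_locale()
    {
        if (_changed)
        {
            setlocale(LC_CTYPE, _old_locale.c_str());
        }
    }

private:
    scoped_locale(scoped_locale const&);            // non-copyable
    scoped_locale& operator=(scoped_locale const&);

    std::string _old_locale;
    bool _changed;
};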

The original code then becomes:

{
    scoped_locale change_locale_to("");
    std::ifstream file("\xd6\xd0.txt"); // "中.txt" encoded in GBK
    if (!file)
    {
        std::cerr << "Cannot open file!"; // Oops!
    }
}

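In a multithreaded program you also need to find out whether the locale is global to the whole process or to each thread. If it is process-wide, threads can silently interfere with one another through it. Seen from this angle, the design of mbstowcs in the C/C++ standard library is flawed, and it shows by counterexample why the idea of Dependency Injection matters.

The Win32 API has a similar function, MultiByteToWideChar(), which also converts multibyte text to wide characters. Its design differs in one important way: it does not consult a global code page but requires the caller to pass the code page explicitly as the first parameter, which avoids the problem mbstowcs has. We can wrap it once more so that it converts a std::string directly: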

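#include <windows.h>
#include <stdexcept>
#include <string>
#include <vector>

std::wstring native_to_utf16(std::string const& native_string)
{
    UINT const codepage = CP_ACP; // system default ANSI code page, passed explicitly

    // First call: ask for the required size in wchar_t, terminating null included.
    int const sizeNeeded = MultiByteToWideChar(
        codepage, 0, native_string.c_str(), -1, NULL, 0);

    if (sizeNeeded <= 0)
    {
        throw std::runtime_error("cannot convert native string to UTF-16");
    }

    std::vector<wchar_t> buffer(sizeNeeded, L'\0');

    // Second call: perform the conversion into the buffer.
    if (0 == MultiByteToWideChar(codepage, 0,
            native_string.c_str(), -1,
            &buffer[0], sizeNeeded))
    {
        throw std::runtime_error("cannot convert native string to UTF-16");
    }

    // Drop the terminating null that was counted in sizeNeeded.
    return std::wstring(buffer.begin(), buffer.end() - 1);
}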
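The same idea of passing the dependency explicitly can also be sketched in standard C++, without the Win32 API. The function below is an illustration under assumptions, not part of the original solution: `widen_with` is a hypothetical name, and it relies on the std::codecvt facet of a std::locale object supplied by the caller. The global locale is neither read nor modified.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: converts `bytes` to a wide string using the codecvt
// facet of the locale passed in by the caller.
std::wstring widen_with(std::locale const& loc, std::string const& bytes)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;

    if (bytes.empty())
    {
        return std::wstring();
    }

    facet_type const& facet = std::use_facet<facet_type>(loc);

    std::mbstate_t state = std::mbstate_t();
    // The output never needs more wide characters than there are input bytes.
    std::vector<wchar_t> buffer(bytes.size());
    char const* from_next = 0;
    wchar_t* to_next = 0;

    std::codecvt_base::result const result = facet.in(state,
        bytes.data(), bytes.data() + bytes.size(), from_next,
        &buffer[0], &buffer[0] + buffer.size(), to_next);

    if (result != std::codecvt_base::ok)
    {
        throw std::runtime_error("cannot convert string with the given locale");
    }

    // to_next points one past the last wide character that was written.
    return std::wstring(&buffer[0], to_next);
}
```

A caller that wants the system's encoding would pass std::locale(""), while a test can pass std::locale::classic(). Either way the choice is visible at the call site, which is exactly what mbstowcs lacks.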


The problem with setlocale and mbstowcs
