Can I read Chinese characters with ReadList correctly?
Following the comments above, I think I've found the answer: as m_goldberg and librik said, ReadList doesn't support character encodings, and that may be one of the reasons it's fast.
However, that doesn't mean we can't make use of ReadList. In fact, following the advice from mfvonh, I found that Import internally uses ReadList to read a.txt first, and then converts the result to the right encoding with ToCharacterCode and FromCharacterCode, after a lot of checks that I don't fully understand and that seem redundant. So why not skip those checks?
Export["a.txt", "这乱码问题该怎么解决呢\n***\n1234\n这样解决呀"];
FromCharacterCode[ToCharacterCode[ReadList["a.txt", Record](*,"ISOLatin1"*)],
"UTF8"] // AbsoluteTiming
Import["a.txt"]; // AbsoluteTiming
{0.0010000, {"这乱码问题该怎么解决呢", "***", "1234", "这样解决呀"}}
{0.0440000, Null}
I'm not sure whether this will fail in more complicated cases.
I think the most straightforward method is to read the file as raw bytes, then interpret those bytes as UTF-8 text. This is what it would look like:
FromCharacterCode[BinaryReadList["a.txt"], "UTF8"]
It should be slightly faster than the other suggestions, since it avoids unnecessary conversions. Be aware that this returns the whole file as a single string, so you need to split it into lines yourself, e.g. with StringSplit, if desired.
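For completeness, here is a minimal sketch of the byte-based approach with the line splitting added. It assumes a.txt is UTF-8 encoded; the variable names and the {"\r\n", "\n"} delimiter list are my own additions, chosen so that both Windows and Unix line endings are handled.

```wolfram
(* Read the file as raw bytes; "Byte" is the default type for BinaryReadList. *)
bytes = BinaryReadList["a.txt"];

(* Decode the bytes as UTF-8 (assumption: the file really is UTF-8). *)
text = FromCharacterCode[bytes, "UTF8"];

(* Split into lines; the delimiter list is an assumption covering CRLF and LF. *)
lines = StringSplit[text, {"\r\n", "\n"}]
```

Wrapping the last expression in AbsoluteTiming, as in the comparison above, should show whether the extra StringSplit step matters for your files.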