Convert GB2312 to UTF-8
I might be oversimplifying here, but if it's just this one plain text file, you could try the following:
- Replace all & by &amp;, all < by &lt; and all > by &gt; (to be on the safe side)
- Prepend the following to the text file:
<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312" /></head><body><pre>
- Open the file in your favorite browser
- Select and copy all text
- Paste it in Notepad and save as UTF-8.
You'd be done with this before you could have written any code to do the conversion or downloaded any programs that would do the conversion for you.
Of course, I'm not a hundred percent sure this'll work, and your browser would need the correct fonts and everything, but considering you're working with these kinds of files I'm assuming you already have those.
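The escaping and prepending steps above can also be scripted. Here is a minimal sketch in POSIX shell; wrap_gb2312_html is a hypothetical helper name, and the header line is the one given above. Running sed under LC_ALL=C is an assumption that keeps it working on raw bytes, which is safe here because &, < and > never occur inside a multi-byte GB2312 character.

```shell
#!/bin/sh
# wrap_gb2312_html is a hypothetical helper name, not an existing tool.
# Usage: wrap_gb2312_html input.txt output.html
wrap_gb2312_html() {
    {
        # Header that tells the browser to decode the page as GB2312.
        printf '%s\n' '<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312" /></head><body><pre>'
        # LC_ALL=C makes sed treat the input as raw bytes.
        # & is escaped first so the entities added afterwards are not escaped again.
        LC_ALL=C sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' "$1"
    } > "$2"
}
```

After that, open the resulting .html file in the browser and continue with the copy and paste steps.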
You can try this online service that uses the open-source iconv utility.
You can also install Charco, a command-line version of it on your machine.
For GB2312, you can use CP936 as the encoding.
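As a rough sketch of the command-line route, assuming an iconv build that recognizes the CP936 name (glibc and GNU libiconv both do), the conversion can be wrapped in a small shell function. The name gb2utf8 is made up for illustration; only iconv and its -f and -t options are real.

```shell
#!/bin/sh
# gb2utf8 is a hypothetical helper name; iconv and its -f/-t options are real.
# Usage: gb2utf8 input.txt output.txt
gb2utf8() {
    # -f names the source encoding, -t the target encoding.
    # CP936 (Microsoft's GBK) extends GB2312, so it also decodes plain GB2312 files.
    iconv -f CP936 -t UTF-8 "$1" > "$2"
}
```

iconv exits with a non-zero status if it hits a byte sequence that is not valid in the source encoding, which is a useful hint that the file was not GB2312 to begin with.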
If you are a .NET developer, you can write a small tool that does just that.
I've struggled with this as well and found that it was actually simple to solve from a programmatic point of view.
All you need is something like this (I tested it and it works):
In C#
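    using System.IO;
    using System.Text;

    class Program {
        static void Main(string[] args) {
            string infile = args[0];
            string outfile = args[1];
            // Code page 936 is Microsoft's GBK, which covers GB2312.
            // On .NET Core / .NET 5+, first call
            // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
            using (StreamReader sr = new StreamReader(infile, Encoding.GetEncoding(936)))
            using (StreamWriter sw = new StreamWriter(outfile, false, Encoding.UTF8)) {
                // The using blocks close both files, so no explicit Close() is needed.
                sw.Write(sr.ReadToEnd());
            }
        }
    }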
In VB.Net
    Imports System.IO
    Imports System.Text

    Class Program
        Public Shared Sub Main(ByVal args() As String)
            Dim infile As String = args(0)
            Dim outfile As String = args(1)
            ' Code page 936 (GBK) covers GB2312.
            ' The Using blocks close both files when they end.
            Using sr As New StreamReader(infile, Encoding.GetEncoding(936))
                Using sw As New StreamWriter(outfile, False, Encoding.UTF8)
                    sw.Write(sr.ReadToEnd())
                End Using
            End Using
        End Sub
    End Class