QString in Persian

I was far too curious to wait for a reply, so I experimented a bit on my own:

I copied the text سلام (in English: "Hello") and pasted it into Notepad++ (which used UTF-8 encoding in my case). Then I switched to View as Hex and got:

snapshot of Notepad++ - hex dump of "سلام"

The ASCII dump on the right side looks somewhat similar to what the OP unexpectedly got. This led me to believe that the bytes in readData are encoded in UTF-8. Hence, I took the displayed hex values and made a small sample program:

testQPersian.cc:

#include <QtWidgets>

int main(int argc, char **argv)
{
  QByteArray readData = "\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85";
  QString textLatin1 = QString::fromLatin1(readData);
  QString textUtf8 = QString::fromUtf8(readData);
  QApplication app(argc, argv);
  QWidget qWin;
  QGridLayout qGrid;
  qGrid.addWidget(new QLabel("Latin-1:"), 0, 0);
  qGrid.addWidget(new QLabel(textLatin1), 0, 1);
  qGrid.addWidget(new QLabel("UTF-8:"), 1, 0);
  qGrid.addWidget(new QLabel(textUtf8), 1, 1);
  qWin.setLayout(&qGrid);
  qWin.show();
  return app.exec();
}

testQPersian.pro:

SOURCES = testQPersian.cc

QT += widgets

Compiled and tested in cygwin on Windows 10:

$ qmake-qt5 testQPersian.pro

$ make

$ ./testQPersian

snapshot of testQPersian

Again, the Latin-1 output looks somewhat similar to what the OP got and to what Notepad++ displayed.

The UTF-8 output shows the expected text (unsurprisingly, since I provided a proper UTF-8 encoding as input).
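
The difference between the two interpretations can also be reproduced without Qt. The following is only a minimal sketch: decodeLatin1 and decodeUtf8 are hypothetical helper names (not Qt API), and the UTF-8 decoder assumes well-formed input and does no validation. It shows that the same 8 bytes become 8 characters when read as Latin-1 but 4 characters when read as UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Latin-1: every byte is one character whose code point equals the byte value.
std::vector<char32_t> decodeLatin1(const std::string &bytes)
{
  std::vector<char32_t> out;
  for (unsigned char b : bytes) out.push_back(b);
  return out;
}

// Minimal UTF-8 decoder for well-formed input (illustration only, no validation).
std::vector<char32_t> decodeUtf8(const std::string &bytes)
{
  std::vector<char32_t> out;
  for (std::size_t i = 0; i < bytes.size();) {
    const unsigned char lead = static_cast<unsigned char>(bytes[i]);
    // Number of continuation bytes, derived from the high bits of the lead byte.
    const int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
    assert(i + extra < bytes.size() && "truncated UTF-8 sequence");
    // Payload bits of the lead byte: 7, 5, 4 or 3 bits depending on the length.
    char32_t cp = extra == 0 ? lead : lead & (0x3F >> extra);
    for (int k = 1; k <= extra; ++k) {
      // Each continuation byte contributes its lower 6 bits.
      cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
    }
    out.push_back(cp);
    i += extra + 1;
  }
  return out;
}
```

Called with the bytes d8 b3 d9 84 d8 a7 d9 85, decodeLatin1 returns one code point per byte, while decodeUtf8 combines each byte pair into a single code point from the Arabic block.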

It may be a bit confusing that the ASCII/Latin-1 output varies. There exist multiple single-byte character encodings that share ASCII in the lower half (0 ... 127) but assign different meanings to the bytes in the upper half (128 ... 255). (Have a look at ISO/IEC 8859 to see what I mean. These were introduced as localizations before Unicode became popular as the general solution to the localization problem.)

The Persian characters all have Unicode code points beyond 127. (Unicode shares ASCII for the first 128 code points as well.) Such code points are encoded in UTF-8 as sequences of multiple bytes, where each byte has the MSB (the most significant bit, bit 7) set. Hence, if these bytes are (accidentally) interpreted with any ISO 8859 encoding, the upper half becomes relevant. Thus, depending on the ISO 8859 encoding currently in use, this may produce different glyphs.
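
To make the multi-byte encoding concrete, here is a small sketch of a UTF-8 encoder in plain C++. The names appendUtf8 and encodeUtf8 are made up for this illustration; in a Qt program, QString::toUtf8() does this job. The code points U+0633, U+0644, U+0627 and U+0645 are the four letters of سلام.

```cpp
#include <cassert>
#include <string>

// Append the UTF-8 encoding of one Unicode code point to out.
void appendUtf8(std::string &out, char32_t cp)
{
  // Surrogates and values above U+10FFFF are not valid code points.
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
  if (cp < 0x80) {            // 1 byte:  0xxxxxxx (plain ASCII, MSB clear)
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Encode a whole sequence of code points.
std::string encodeUtf8(const std::u32string &text)
{
  std::string out;
  for (char32_t cp : text) appendUtf8(out, cp);
  return out;
}
```

Encoding U"\u0633\u0644\u0627\u0645" this way yields exactly the bytes d8 b3 d9 84 d8 a7 d9 85 from the hex dump, and every one of them has bit 7 set.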


Some continuation:

OP sent the following snapshot:

Snapshot (provided by OP)

So, it seems instead of

d8 b3 d9 84 d8 a7 d9 85

he got

00 08 d8 b3 d9 84 d8 a7 d9 85

A possible interpretation:

The server first sends a 16 bit length 00 08 (interpreted as a big-endian 16 bit integer: 8), followed by 8 bytes encoded in UTF-8 (which look exactly like the ones I got while experimenting above). (AFAIK, it's not unusual to use big-endian for binary network protocols to prevent endianness issues if sender and receiver natively have different endianness.) Further reading e.g. here: htons(3) - Linux man page

On the i386 the host byte order is Least Significant Byte first, whereas the network byte order, as used on the Internet, is Most Significant Byte first.
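
A byte order issue can be avoided entirely by assembling the value from individual bytes with shifts instead of reinterpreting memory. The following sketch uses the hypothetical helper names toBigEndian16 and fromBigEndian16; it gives the same result on little-endian and big-endian hosts because it never depends on the in-memory layout of the integer.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Serialize a 16 bit value most significant byte first (network byte order).
std::array<std::uint8_t, 2> toBigEndian16(std::uint16_t value)
{
  return {{ static_cast<std::uint8_t>(value >> 8),
            static_cast<std::uint8_t>(value & 0xFF) }};
}

// Rebuild the value from two bytes; bytes[0] is the most significant byte.
std::uint16_t fromBigEndian16(const std::uint8_t *bytes)
{
  assert(bytes != nullptr);
  return static_cast<std::uint16_t>((bytes[0] << 8) | bytes[1]);
}
```

For the length 8, toBigEndian16 produces the two bytes 00 08, which matches the prefix in the OP's snapshot.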


The OP states that the server uses Java's DataOutput.writeUTF for this protocol, which is documented as follows:

Writes two bytes of length information to the output stream, followed by the modified UTF-8 representation of every character in the string s. If s is null, a NullPointerException is thrown. Each character in the string s is converted to a group of one, two, or three bytes, depending on the value of the character.

So, the decoding could look like this:

// Sample frame as received: 2 length bytes followed by 8 bytes of UTF-8.
QByteArray readData("\x00\x08\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85", 10);
//QByteArray readData = socket->readAll();
// Combine the first two bytes into a big-endian 16 bit length.
unsigned length
  = ((uint8_t)readData[0] <<  8) + (uint8_t)readData[1];
// Skip the 2 length bytes and decode the following length bytes as UTF-8.
QString text = QString::fromUtf8(readData.data() + 2, length);

  1. The first two bytes are extracted from readData and combined into the length (decoding a big-endian 16 bit integer).

  2. The rest of readData is converted to a QString, passing the previously extracted length. Thereby, the first 2 length bytes of readData are skipped.
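
The two steps above trust that readData really contains the complete frame. A socket may deliver fewer bytes than announced, so a more defensive variant could check the sizes first. This is only a sketch in plain C++ under that assumption; extractFrame is a hypothetical helper name, and it works on std::string so that it compiles without Qt. Note also that writeUTF produces modified UTF-8, which differs from standard UTF-8 for U+0000 and for characters outside the Basic Multilingual Plane; for Persian text both encodings give the same bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Extracts the payload of one writeUTF-style frame: a big-endian 16 bit
// byte count followed by that many payload bytes.
// Returns false if data is too short to hold the complete frame.
bool extractFrame(const std::string &data, std::string &payload)
{
  if (data.size() < 2) return false;  // length prefix is incomplete
  const std::size_t length
    = (static_cast<std::uint8_t>(data[0]) << 8)
    | static_cast<std::uint8_t>(data[1]);
  if (data.size() - 2 < length) return false;  // payload is incomplete
  payload.assign(data, 2, length);  // skip the 2 length bytes
  assert(payload.size() == length);
  return true;
}
```

In the Qt program, the extracted payload could then be handed to QString::fromUtf8(payload.data(), payload.size()); when extractFrame returns false, the caller would wait for more data before trying again.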