Is UTF-8 an encoding or a character set?
UTF-8 is an encoding, and that is the term used in the RFC that defines it, which is quoted below.
I often see the terms "encoding" and "charset" used interchangeably.
Prior to Unicode, if you wanted to use an alphabet† like Cyrillic or Greek, you needed to use an encoding that could only encode characters in that alphabet. Thus, the terms encoding and charset were often conflated, but they mean different things.
Now, though, Unicode is usually the only character set you need to worry about, since it contains characters for most written languages you'll have to deal with (except Klingon).
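For instance, here is a minimal Python sketch of that pre-Unicode situation (the specific legacy codec names are just illustrative choices; ISO-8859-5 is a Cyrillic-only charset/encoding):

```python
russian = "Привет"                      # Cyrillic text
print(russian.encode("iso-8859-5"))     # fine: every character is in the Cyrillic charset
try:
    "Γειά".encode("iso-8859-5")         # Greek letters simply aren't in that character set
except UnicodeEncodeError as err:
    print(err)
print(russian.encode("utf-8"))          # UTF-8 can encode both alphabets (and far more)
```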
† Alphabet: a kind of *character set* whose characters correspond directly to sounds in a spoken language.

A character set is a mapping from code units (integers) to characters, symbols, glyphs, or other marks in a written language. Unicode is a character set that maps 21-bit integers to Unicode code points. The Unicode Consortium's glossary describes it thus:
Unicode
- The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium: http://www.unicode.org.
- A label applied to software internationalization and localization standards developed and maintained by the Unicode Consortium.
An encoding is a mapping from strings to strings. UTF-8 is an encoding that maps between strings of code points (21-bit integers) and strings of bytes (8-bit integers). The Unicode Consortium calls it a "character encoding scheme", and it is defined in RFC 3629:
The originally proposed encodings of the UCS, however, were not compatible with many current applications and protocols, and this has led to the development of UTF-8
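In Python terms, the two mappings look roughly like this (a sketch using only built-ins; the example character is arbitrary):

```python
# Character set: integers (code points) <-> characters.
print(ord("é"), hex(ord("é")))       # 233 0xe9 -- the integer Unicode assigns to 'é'
print(chr(0xE9))                     # 'é'      -- and back from integer to character

# Encoding: code points <-> bytes. UTF-8 is one such mapping (RFC 3629).
print("é".encode("utf-8"))           # b'\xc3\xa9' -- two bytes for one code point
print(b"\xc3\xa9".decode("utf-8"))   # 'é'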
The Unicode Standard calls it an encoding form or an encoding scheme. Unicode has a single set of characters (known as the Unicode character set, or Universal Character Set), and all the UTF encoding forms and encoding schemes can encode all the characters in that set.
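A quick Python sketch of that last point, using a string that includes a character outside the Basic Multilingual Plane (encoding names spelled as Python's codecs expect):

```python
s = "A€𐍈"                                  # U+0041, U+20AC, U+10348
for enc in ("utf-8", "utf-16", "utf-32"):
    data = s.encode(enc)
    assert data.decode(enc) == s           # every UTF form round-trips the same string
    print(enc, len(data), data.hex())      # but each produces a different byte sequence
```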
As happens with many other terms, programmers seem to have a tendency to just misappropriate terms here and there, and this is just one more instance of this.
UTF-8 is an encoding, in the sense that it encodes a sequence of abstract integers – the unicode codepoints which indicate abstract characters – into a sequence of bytes. (Through unicode spectacles, you could say that a 'character set' such as ISO-8859-1 is also a table-driven 'encoding', in the sense that it encodes a small number of codepoints as bytes, but this is verging towards an abuse of terminology, and probably isn't very helpful.)
The sequence of integers is (in some fundamental sense) the 'unicode string', but in order to save these on a disk or send them over a network, you need to encode them as a sequence of bytes. UTF-8 is one way of doing that; UTF-16 is another: one unicode string will be represented as two different streams of bytes if it's encoded in the two different ways.
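A small Python illustration of exactly that: one string, two encodings, two different byte streams:

```python
s = "héllo"
utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16")                # Python prepends a byte-order mark here
print(utf8.hex())                         # 68c3a96c6c6f
print(utf16.hex())                        # fffe6800e9006c006c006f00
assert utf8 != utf16                      # different streams of bytes ...
assert utf8.decode("utf-8") == utf16.decode("utf-16") == s   # ... same unicode string
```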
There are multiple fine answers here, but just yesterday I spent some time trying to boil this issue down to some minimal size, so this provides a happy opportunity to reuse that text:
Joel Spolsky's article on The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is quite good, I think. It's (surely) been mentioned here before, but it bears repeating. I think it's not completely minimal, though.
On the couple of occasions when I've had to explain 'unicode' to a colleague, it's been the notion of the abstract Unicode codepoints that's turned out key to the illumination. The structure of my successful explanations has been something like this:
The Unicode consortium has (with much agonising and negotiation) managed to give a number to a large fraction of the characters in use. These numbers are (jargon) called 'codepoints'.
'The Letter A' has a codepoint, and this is independent of fonts. Thus 'A' and 'a' have different codepoints, but roman, bold, italic, serif, sans serif (et very much cetera) are not distinguished. Japanese kanji have codepoints, and even Tengwar and Klingon characters have (unofficial, Private Use Area) codepoints (this gets attention).
A 'unicode string' is (conceptually) a sequence of codepoints. This is a sequence of mathematical integers. It does not make sense to ask whether these are bytes, 2-byte or 4-byte words; the sequence has nothing to do with computers.
If, however, you want to send that sequence of integers to someone, or save it on a computer disk, you have to do something to encode it. You could also write down the sequence of numbers on a piece of paper, but let's specialise to computers at this point. If you want to store or send this on a computer, you have to transform these integers into a sequence of bytes. There are multiple procedures for doing that, and each of these procedures is named an 'encoding'. One of these 'encodings' is UTF-8.
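In Python, that step looks something like this (the string is arbitrary; the codepoint list is the same no matter which encoding you later pick):

```python
s = "Åland"
print([hex(ord(c)) for c in s])    # ['0xc5', '0x6c', '0x61', '0x6e', '0x64'] -- the code points
print(s.encode("utf-8"))           # b'\xc3\x85land'                 -- one way to store them as bytes
print(s.encode("utf-16-le"))       # b'\xc5\x00l\x00a\x00n\x00d\x00' -- another way, other bytes
```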
When you 'read a Unicode file', you are starting with a sequence of bytes, on disk, and conceptually ending up with a sequence of integers. If the 'unicode file' is indicated, somehow, to be encoded in UTF-8, then you have to decode that sequence of bytes to get the sequence of integers, using the algorithm defined in RFC 3629. All of the subsequent operations on the 'unicode string' are defined in terms of the sequence of codepoints, and the fact that it started off, on disk, as 'UTF-8' is forgotten.
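A sketch of that read path in Python (the file name is hypothetical):

```python
with open("example.txt", "w", encoding="utf-8") as f:
    f.write("naïve café")                  # code points, encoded to UTF-8 bytes on write

with open("example.txt", "rb") as f:
    raw = f.read()                         # the sequence of bytes on disk
text = raw.decode("utf-8")                 # RFC 3629: bytes -> sequence of code points
print(raw)                                 # b'na\xc3\xafve caf\xc3\xa9'
print([hex(ord(c)) for c in text])         # the 'unicode string'; UTF-8 is now forgotten
```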
UTF-8 is an encoding. Encodings are, however, often called character sets, and many protocols therefore use the parameter name `charset` for a parameter that specifies a character encoding. As such, `charset` is just an identifier.
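For example (a hand-rolled sketch; real code would use a proper header parser):

```python
header = "text/html; charset=utf-8"              # an illustrative Content-Type value
encoding = header.split("charset=")[1].strip()   # "utf-8" -- it names an encoding
body = "<p>naïve</p>".encode(encoding)           # the bytes sent over the wire
print(body.decode(encoding))                     # decoded back using the same 'charset'
```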