NSAttributedString initWithHTML incorrect character encoding?
The previous answer here works, but mostly by accident.
Making an NSData with NSUnicodeStringEncoding will tend to work, because that constant is an alias for NSUTF16StringEncoding, and UTF-16 is pretty easy for the system to identify. It is easier to identify than UTF-8, which apparently was being identified as some other superset of ASCII (it looks like NSWindowsCP1252StringEncoding in your case, probably because it's one of the few ASCII-based encodings with mappings for 0x8_ and 0x9_).
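To make the misdetection concrete, here is a small sketch of what happens when the UTF-8 bytes of a curly quote are read back as Windows-1252. It uses only Foundation and does not touch the HTML importer at all; the variable names are made up for illustration, and it assumes the importer's guess really was CP1252, as suggested above.

```swift
import Foundation

// U+201C LEFT DOUBLE QUOTATION MARK, the opening curly quote from the question.
let original = "\u{201C}"

// Its UTF-8 encoding is three bytes: E2 80 9C.
let utf8Bytes = Data(original.utf8)
print(utf8Bytes.map { String($0, radix: 16, uppercase: true) })

// Decoding those bytes as Windows-1252 turns each byte into its own character,
// because CP1252 assigns printable characters to 0x80 and 0x9C.
let misread = String(data: utf8Bytes, encoding: .windowsCP1252)

// Decoding with the encoding that was actually used round-trips cleanly.
let correct = String(data: utf8Bytes, encoding: .utf8)

print(misread ?? "undecodable")  // three characters of mojibake
print(correct ?? "undecodable")  // the original quote
```

The three-character garbage in the first result is the same kind of output you see when the HTML importer guesses the wrong encoding for UTF-8 data.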
That answer is mistaken in quoting the documentation for NSCharacterEncodingDocumentAttribute, because "attributes" are what you get out of -initWithHTML. That's why it's NSDictionary ** and not just NSDictionary *. You can pass in a pointer to an NSDictionary *, and you'll get out keys like TopMargin/BottomMargin/LeftMargin/RightMargin, PaperSize, DocumentType, UTI, etc. Any values you try to pass in through the "attributes" dictionary are ignored.
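The in/out split described above can be sketched in Swift roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes an Apple platform where AppKit or UIKit is available, it uses NSAttributedString(data:options:documentAttributes:) rather than -initWithHTML, and it assumes it runs on the main thread, which the HTML importer requires. The sample HTML and variable names are placeholders.

```swift
import Foundation
#if canImport(AppKit)
import AppKit   // macOS
#elseif canImport(UIKit)
import UIKit    // iOS, tvOS
#endif

let html = "<p>\u{201C}Hello\u{201D} World</p>"
let data = Data(html.utf8)

#if canImport(AppKit) || canImport(UIKit)
// Inputs go in through `options`.
let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
    .documentType: NSAttributedString.DocumentType.html,
    .characterEncoding: String.Encoding.utf8.rawValue
]

// Outputs come back through `documentAttributes`, which is why it is an
// out-parameter (NSDictionary ** in Objective-C) rather than a plain dictionary.
var documentAttributes: NSDictionary?

if let parsed = try? NSAttributedString(data: data,
                                        options: options,
                                        documentAttributes: &documentAttributes) {
    print(parsed.string)
    // Inspect whatever the importer reported; the exact keys vary by platform.
    if let reported = documentAttributes {
        print(reported)
    }
}
#endif
```

Anything placed in documentAttributes before the call is not read by the importer; only the options dictionary influences parsing.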
You need to use "options" for passing values in, and the relevant option key is NSTextEncodingNameDocumentOption, which has no documented default value. It's passing the bytes to WebKit for parsing, so if you don't specify an encoding, presumably you're getting WebKit's encoding-guessing heuristics.
To guarantee the encoding types match between your NSData and NSAttributedString, what you should do is something like:
NSString *html = @"“Hello” World";
NSData *htmlData = [html dataUsingEncoding:NSUTF8StringEncoding];
NSMutableAttributedString *as =
    [[NSMutableAttributedString alloc] initWithHTML:htmlData
                                            options:@{NSTextEncodingNameDocumentOption: @"UTF-8"}
                                 documentAttributes:nil];
The Swift version of the accepted answer is:
let htmlString: String = "Hello world contains html<br>"
let data: Data = Data(htmlString.utf8)
let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
    .documentType: NSAttributedString.DocumentType.html,
    .characterEncoding: String.Encoding.utf8.rawValue
]
let attributedString = try? NSAttributedString(data: data,
                                               options: options,
                                               documentAttributes: nil)
Use [html dataUsingEncoding:NSUnicodeStringEncoding] when creating the NSData, and set the matching encoding option when you parse the HTML into an attributed string.
The documentation for NSCharacterEncodingDocumentAttribute is slightly confusing:
NSNumber, containing an int specifying the NSStringEncoding for the file; for reading and writing plain text files and writing HTML; default for plain text is the default encoding; default for HTML is UTF-8.
So, your code should be:
NSString *html = @"“Hello” World";
NSData *htmlData = [html dataUsingEncoding:NSUTF8StringEncoding];
NSDictionary *options = @{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
                          NSCharacterEncodingDocumentAttribute: @(NSUTF8StringEncoding)};
NSMutableAttributedString *as =
    [[NSMutableAttributedString alloc] initWithHTML:htmlData
                                            options:options
                                 documentAttributes:nil];