This link has been bookmarked by 173 people. It was first bookmarked on 02 Mar 2006, by Jeff Tucker.
-
15 Oct 09
Rémy Sanlaville: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
-
16 Sep 09
Richard Marquez: Haven’t mastered the basics of Unicode and character sets? Please don’t write another line of code until you’ve read this article.
-
29 Apr 09
Chris Lasher: Joel Spolsky on a very useful topic that becomes pertinent to many programmers at some point in their careers. Dated 2003, but still pertinent today.
-
13 Apr 09
-
era e: A bit C-centric but nevertheless a classic
20060619-0123 article character development encoding software unicode
-
06 Apr 09
drew Reece: Joel Spolsky on character encoding for programmers.
-
20 Mar 09
-
Postel's Law about being "conservative in what you emit and liberal in what you accept" is quite frankly not a good engineering principle
-
-
18 Mar 09
Peter Jacobson: great article outlining just what the title says...
-
04 Feb 09
cmin: All that stuff about "plain text = ascii = characters are 8 bits" is not only wrong, it's hopelessly wrong, and if you're still programming that way, you're not much better than a medical doctor who doesn't believe in germs.
Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.
-
But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work.
-
-
Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct.
-
Not every Unicode string in the wild has a byte order mark at the beginning.
-
UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes.
-
The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits)
-
There's something called UTF-7
-
those unicode code points can be encoded in any old-school encoding scheme, too!
-
Internet Explorer actually does something quite interesting: it tries to guess
-
-
-
29 Jul 08
Dan Howard: Good article on understanding character-encodings (UTF-8, Latin-1, Unicode, and this sort of thing).
-
07 Jul 08
fullness Time: If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.
There Ain't No Such Thing As Plain Text.
If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
Please do not write another line of code until you finish reading this article.
-
I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified.
-
-
The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII
-
This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare
-
Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255.
-
Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided.
-
Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits.
-
But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.
-
In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.
-
In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.
-
Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal.
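The mapping from letters to code points described in this highlight can be inspected directly. The following is a minimal Python sketch, not anything from the article itself; the helper name `code_points` is made up for illustration, and it relies only on the built-ins `ord` and `chr`.

```python
# Hypothetical helper (not from the article): show each character's code point
# in the U+XXXX notation the highlight describes.
def code_points(text):
    """Return the U+XXXX form of every character in text."""
    return [f"U+{ord(ch):04X}" for ch in text]

# "Hello" is five code points, independent of how they are later stored.
print(code_points("Hello"))

# chr() goes the other way: from the number U+0639 back to the abstract letter.
ain = chr(0x0639)
print(code_points(ain))
```

Nothing here says how the characters are laid out in memory or on disk. That only comes into play once an encoding is chosen.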
-
There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.
-
couldn't bear the idea of doubling the amount of storage it took for strings, and anyway, there were already all these doggone documents out there using various ANSI and DBCS character sets
-
most people decided to ignore Unicode for several years and in the meantime things got worse.
-
English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops.
-
And in fact now that you're thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box. Which did you get? -> �
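The "question mark" behaviour in this passage can be reproduced with Python's standard codecs. This is only a sketch under assumptions: `cp737` and `cp1255` are used here as stand-ins for the "OEM Greek" and "Hebrew ANSI" encodings the passage mentions, and the `errors="replace"` policy is one choice among several that Python offers.

```python
hello = "Hello"  # U+0048 U+0065 U+006C U+006C U+006F

# The same five code points produce the same five bytes in all of these
# encodings, because they agree on the ASCII range.
for encoding in ("ascii", "cp737", "cp1255", "utf-8"):
    print(encoding, hello.encode(encoding))

ain = "\u0639"  # a letter that ASCII has no slot for

# Strict encoding refuses; errors="replace" substitutes a question mark.
try:
    ain.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)

replaced = ain.encode("ascii", errors="replace")
print(replaced)

# Going the other way, undecodable bytes become U+FFFD, the replacement
# character that many systems draw as a box or a question mark in a diamond.
decoded = b"\xff".decode("utf-8", errors="replace")
print(decoded == "\ufffd")
```

Once a character has been replaced with `?`, the original is gone for good, so the strict mode is usually the safer default while debugging.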
-
There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks.
-
The Single Most Important Fact About Encodings
If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.
There Ain't No Such Thing As Plain Text.
If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
-
It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy... how can you read the HTML file until you know what encoding it's in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it's going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.
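The "read far enough to find the declaration, then start over" idea can be sketched roughly as follows. This is an illustrative assumption, not how any particular browser works: `sniff_meta_charset`, its regular expression, and the 1024-byte scan limit are all invented for this example, and real HTML parsers follow a far more detailed algorithm.

```python
import re

# Hypothetical pattern: look for charset=... inside a <meta> tag. It works on
# raw bytes because the encoding is not known yet, and relies on the tag itself
# being plain ASCII.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def sniff_meta_charset(raw, scan_limit=1024):
    """Return the charset named in a <meta> tag near the top of raw, or None."""
    match = META_CHARSET.search(raw[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None

page = (
    '<html>\n<head>\n'
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
    '<title>caf\u00e9</title>\n'
).encode("utf-8")

charset = sniff_meta_charset(page)
# "Start over": decode the whole page again using the declared encoding.
text = page.decode(charset) if charset else None
print(charset, text is not None and "caf\u00e9" in text)
```

The scan limit reflects the point made in the highlight: the declaration only helps if it appears before any non-ASCII characters.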
-
-
-
09 Apr 08
-
Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.
-
early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
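The byte order mark check described in this highlight could look something like the sketch below. The function names are hypothetical, the fallback to little-endian when no mark is present is an arbitrary assumption, and the UTF-32 byte order marks (which also begin with FF FE in the little-endian case) are deliberately ignored to keep the example short.

```python
import codecs

def detect_utf16_byte_order(data):
    """Return 'utf-16-be' or 'utf-16-le' based on a leading BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    return None

def decode_utf16(data, default="utf-16-le"):
    """Decode UTF-16 bytes, honouring a BOM if present (the default is a guess)."""
    order = detect_utf16_byte_order(data)
    if order is None:
        return data.decode(default)   # no BOM: someone has to tell us the order
    return data[2:].decode(order)     # skip the two BOM bytes

big = codecs.BOM_UTF16_BE + "Hello".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "Hello".encode("utf-16-le")
print(detect_utf16_byte_order(big), detect_utf16_byte_order(little))
print(decode_utf16(big), decode_utf16(little))
```

In practice Python's own `"utf-16"` codec already consumes a leading byte order mark when decoding, so the manual check is mainly useful for seeing what the convention amounts to.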
-
-
UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.
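The variable-length property is easy to check in Python. One caveat on the quoted text: the "up to 6 bytes" figure reflects the original UTF-8 design, whereas the current definition (RFC 3629) stops at four bytes because Unicode code points end at U+10FFFF. The sample characters below are arbitrary picks for illustration.

```python
# One sample character from each UTF-8 length class (arbitrary choices).
samples = {
    "A": "\u0041",            # ASCII range, below 128
    "e-acute": "\u00e9",      # Latin-1 range
    "ain": "\u0639",          # Arabic letter
    "euro": "\u20ac",         # elsewhere in the Basic Multilingual Plane
    "emoji": "\U0001F600",    # beyond U+FFFF
}

# Count how many bytes each character needs once encoded as UTF-8.
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(lengths)

# English text is byte-for-byte identical in ASCII and UTF-8.
same = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(same)
```

The last check is the "neat side effect" mentioned in the article: software that only understands ASCII keeps working on UTF-8 input as long as the text stays below code point 128.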
-
This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops.
-
So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.
There are actually a bunch of other ways of encoding Unicode. There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you it can still squeeze through unscathed. There's UCS-4
-
It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.
There Ain't No Such Thing As Plain Text.
-
There are over a hundred encodings and above code point 127, all bets are off.
-
For an email message, you are expected to have a string in the header of the form
Content-Type: text/plain; charset="UTF-8"
For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself -- not in the HTML itself, but as one of the response headers that are sent before the HTML page.
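Reading the declared charset out of such a header might be sketched as below, using Python's standard `email` package. The sample message body and the fallback to `"ascii"` are assumptions made for the example; the article does not prescribe any particular parsing approach.

```python
from email import message_from_string

# A tiny made-up message carrying the header form quoted above.
raw_message = (
    'Content-Type: text/plain; charset="UTF-8"\n'
    "\n"
    "Hello\n"
)

msg = message_from_string(raw_message)

# get_content_charset() returns the charset parameter in lower case,
# or None when the header does not name one.
charset = msg.get_content_charset()
print(msg.get_content_type(), charset)

# Hypothetical fallback: what to assume when no charset is declared is a
# policy decision, and "ascii" here is only a placeholder.
effective = charset or "ascii"
print(effective)
```

An HTTP response carries the same `Content-Type` syntax in its headers, so the same parameter lookup applies there in principle.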
-
The web server itself wouldn't really know what encoding each file was written in, so it couldn't send the Content-Type header.
It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy...
-
What do web browsers do if they don't find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear
-
Postel's Law about being "conservative in what you emit and liberal in what you accept" is quite frankly not a good engineering principle
-
we decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type
-
When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years.
-
-
-
29 Mar 08
-
Ankit Chaturvedi: good reading
-
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
-
-
28 Aug 06
Adrian Bengtson: What a software developer must know, at the absolute minimum, about character encoding.
character encoding teckenkodning code utf-8 iso-latin teckentabeller Webbutveckling Programmering
-
Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be?
Did you ever get an email from your friends in Bulgaria with the subject line "???? ?????? ??? ????"?
I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they "couldn't do anything about it." Like many programmers, he just wished it would all blow over somehow.
-
-