The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel on Software

This link has been bookmarked by 1527 people and liked by 1 person. It was first bookmarked on 02 Mar 2006, by Jeff Tucker.

18 Jun 17

kevino
empty
28 Nov 16

There Ain't No Such Thing As Plain Text.

unicode encoding programming introduction text
14 Nov 16

tot0ro
- until one day, they write something that doesn't exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it's Korean and displays it thusly
13 Nov 16

bpenfold
- Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages.
- So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided.
- The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few "multilingual" code pages that could do Esperanto and Galician on the same computer! Wow! But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.
- This was usually solved by the messy system called DBCS, the "double byte character set" in which some letters were stored in one byte and others took two.
- system on the planet and some make-believe ones like Klingon, too. Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct.
- So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
- So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.
- There are actually a bunch of other ways of encoding Unicode. There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you it can still squeeze through unscathed. There's UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn't be so bold as to waste that much memory.
- If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box. Which did you get? -> �
- It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.
- How do we preserve this information about what encoding a string uses? Well, there are standard ways to do this. For an email message, you are expected to have a string in the header of the form
  
  Content-Type: text/plain; charset="UTF-8"
- It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy... how can you read the HTML file until you know what encoding it's in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:
  
  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
- What do web browsers do if they don't find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used.
11 more annotations...
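
The code page highlights above can be checked directly. This is a minimal sketch, assuming Python's bundled `cp862` and `cp737` codecs match the DOS code pages named in the quote; the variable names are only illustrative.

```python
# One byte above 127 means different letters under different DOS code pages.
high_byte = bytes([0x80])   # first value past the ASCII range
low_bytes = b"Hello"        # every value here is below 128

hebrew = high_byte.decode("cp862")  # code page 862, used by Hebrew DOS
greek = high_byte.decode("cp737")   # code page 737, used by Greek DOS

print(repr(hebrew), repr(greek))

# Below 128 both code pages agree with ASCII and with each other.
same_below_128 = low_bytes.decode("cp862") == low_bytes.decode("cp737") == "Hello"
print(same_below_128)
```

The bytes never change; only the table used to read them does, which is why Hebrew and Greek could not share a screen without Unicode.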
12 Oct 16

John van Beers
software development string unicode ansi encoding programming
20 Sep 16

Konstantinos
13 Sep 16

dajare
history +

unicode
05 Sep 16

Alejandro Galindo
unicode encoding programming utf8 development i18n reference software
30 Aug 16

zorak1103
Security unicode programming encoding utf8 development i18n software
15 Jul 16

programming tutorial
27 Jun 16

Alexandre Enkerli
diversity
16 Jun 16

Thierry Henrio
encoding utf8
07 Jun 16

Giedrius Kudelis
25 May 16

nevadabill
Unicode, character sets, encoding

unicode font getting started with unicode 'encoding unicode encoding programming utf8 development
15 Apr 16

kenng2014
unicode tutorial
30 Mar 16

Lance England
unicode
- In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story
- In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes
- It does not make sense to have a string without knowing what encoding it uses
1 more annotation...
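
A short sketch of how many bytes UTF-8 spends per code point. The sample characters are my own choice, not from the article. Note that the "up to 6 bytes" in the quote describes the original UTF-8 design; since RFC 3629, UTF-8 stops at U+10FFFF and never needs more than 4 bytes.

```python
# Sample characters chosen for illustration only.
samples = ["A", "é", "ع", "€", "😀"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Show the code point, the byte count, and the raw bytes in hex.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

Only the first sample, which sits below 128, fits in a single byte.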
27 Mar 16

zzanghwi
- Please do not write another line of code until you finish reading this article.
- everything was very simple.
- we had a code for them called ASCII which was able to represent every character using a number between 32 and 127.
- This could conveniently be stored in 7 bits.
- They were used for control characters, like 7 which made your computer beep
- they had their own ideas of what should go where in the space from 128 to 255.
- OEM character set
- ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII,
- That's where encodings come in.
7 more annotations...
22 Mar 16

empenguin
- This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes
- had a whole bit to spare
16 Mar 16

j t
Haven’t mastered the basics of Unicode and character sets? Please don’t write another line of code until you’ve read this article.

UTF-8 UTF-16 base64encode
02 Mar 16

s_m_roberts
19 Feb 16

Dan Novak
work
17 Feb 16

mtartar06
unicode encoding
- The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter "A" was 65, etc. This could conveniently be stored in 7 bits.
- In Unicode, a letter maps to something called a code point which is still just a theoretical concept.
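
The ASCII numbers in the first highlight are easy to confirm. A small sketch, with illustrative variable names; strictly speaking 127 is the DEL control code, so the printable run is 32 through 126.

```python
space_code = ord(" ")      # the quote says Space was 32
letter_a_code = ord("A")   # and the letter "A" was 65
bell_code = ord("\a")      # control code 7, the one that made computers beep

# Printable ASCII: 32 through 126 (127 is DEL, a control code).
printable = [chr(n) for n in range(32, 127)]

# Every one of these codes is below 2**7, so 7 bits are enough.
fits_in_7_bits = all(ord(c) < 2**7 for c in printable)

print(space_code, letter_a_code, bell_code, len(printable), fits_in_7_bits)
```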
18 Jan 16

Rob
utf8 encoding
29 Dec 15

jpache
acentos info desarrollo basico
30 Oct 15

Farhan Faruque
programming development unicode
25 Oct 15

tanyaleto
23 Oct 15

plasticcones
BartonLee (30 Jun) Dev Code character encoding_unicode
17 Oct 15

Haven’t mastered the basics of Unicode and character sets? Please don’t write another line of code until you’ve read this article.

unicode encoding programming
14 Oct 15

peking88
unicode favorite dev
08 Oct 15

intelliarm
- but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages.
- This was usually solved by the messy system called DBCS, the "double byte character set" in which some letters were stored in one byte and others took two.
- Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too.
- code point
- FE FF at the beginning of every Unicode string;
- I've told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits)
- how can you read the HTML file until you know what encoding it's in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127,
- But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it's going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.
- Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages,
7 more annotations...
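
The byte order mark described in these highlights can be seen with Python's standard `codecs` module. This is a sketch, not a recommendation to hand-build BOMs; the sample text is arbitrary.

```python
import codecs

text = "Hello"

# Prefix the two-byte encodings with the matching byte order mark.
big_endian = codecs.BOM_UTF16_BE + text.encode("utf-16-be")     # starts FE FF
little_endian = codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # starts FF FE

print(big_endian.hex())
print(little_endian.hex())

# The generic "utf-16" codec reads the mark, picks the byte order, and drops it.
decoded_be = big_endian.decode("utf-16")
decoded_le = little_endian.decode("utf-16")
print(decoded_be == decoded_le == text)
```

Without the mark, a reader has to be told the byte order some other way, which is the point of the "not every Unicode string in the wild has a byte order mark" warning.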
24 Sep 15

hafley66
unicode encoding programming reference
04 Sep 15

makeller63
"Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be?

Did you ever get an email from your friends in Bulgaria with the subject line "???? ?????? ??? ????"?"

hiddencharacters utf-8 blackbird-training
02 Sep 15

dimitrilw
26 Aug 15

dubeux
Haven’t mastered the basics of Unicode and character sets? Please don’t write another line of code until you’ve read this article.
19 Aug 15

avanrienen
encoding utf8 unicode
18 Jul 15

sophonslidsuk
- Content-Type
- MIME email message
- PHP has almost complete ignorance of character encoding issues,
- blithely using 8 bits for characters
2 more annotations...
07 Jul 15

development
11 Jun 15

buchmiller
unicode programming encoding reference utf8 delicious
07 Jun 15

Adam Bro
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets http://t.co/Fv3zpVWG3E #utf8

— Adam Brodziak (@AdamBrodziak) June 7, 2015

IFTTT Twitter
06 Jun 15

agobbi
- UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).
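
A quick way to see the null-terminator property, sketched with an arbitrary sample string: UTF-8 only emits a zero byte for the code point U+0000 itself, while the two-byte encodings produce zero bytes for ordinary English letters.

```python
text = "Héllo"  # arbitrary sample with one accented letter

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Membership test on bytes: is there any byte with value 0?
utf8_has_zero = 0 in utf8_bytes
utf16_has_zero = 0 in utf16_bytes

print(utf8_bytes.hex(), utf8_has_zero)
print(utf16_bytes.hex(), utf16_has_zero)
```

Old C-style code that stops at the first zero byte would read the UTF-8 form to the end but cut the UTF-16 form off after the first letter.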
01 Jun 15

linekin
imported-links system:unfiled
11 May 15

Miguel Castro
unicode
08 Apr 15

Sumi Dero
unicode encoding utf8 i18n
Graham Jones
convert unicode string python
03 Apr 15

ken scott
- Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639.
- This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal.
- We haven't yet said anything about how to store this in memory
- That's where encodings come in.
- Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.
- This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong.
- Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet.
- (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).
- So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.
- UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
- The Single Most Important Fact About Encodings
  
  If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.
- There Ain't No Such Thing As Plain Text.
- Almost every stupid "my website looks like gibberish" or "she can't read my emails when I use accents" problem comes down to one naive programmer who didn't understand the simple fact that if you don't tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.
- how can you read the HTML file until you know what encoding it's in?
- For the latest version of CityDesk, the web site management software published by my company, we decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type
- When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years.
14 more annotations...
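
The "Hello" example from these highlights, reproduced as a small sketch. Variable names are illustrative; the expected values are the ones quoted in the highlights themselves.

```python
text = "Hello"

# Code points: the abstract "magic numbers", written U+XXXX in hexadecimal.
code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)

# Encodings: how those numbers become bytes in memory or on disk.
utf8_hex = " ".join(f"{b:02X}" for b in text.encode("utf-8"))
ascii_hex = " ".join(f"{b:02X}" for b in text.encode("ascii"))

print(code_points)
print(utf8_hex)
print(utf8_hex == ascii_hex)

# The Arabic letter Ain from the quote, going from number to letter and back.
ain = chr(0x0639)
print(ain, f"U+{ord(ain):04X}")
```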
30 Mar 15

comdown
utf16 unicode encoding programming utf8
- programmers
24 Mar 15

Brooke Smith
apir development java character_encoding utf utf-8 utf-16 characterentities overview
06 Mar 15

dromedary
unicode programming computers
04 Mar 15

zacbraddy
unicode programming knowledge
02 Mar 15

hungrypipo
unicode programming encoding utf8
roozyfx
C++ UTF8
28 Feb 15

jcarlosadm
1_Computer_science 2_Article unicode
25 Feb 15

Mikel Madina
"Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be?

Did you ever get an email from your friends in Bulgaria with the subject line "???? ?????? ??? ????"?

I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they "couldn't do anything about it." Like many programmers, he just wished it would all blow over somehow."

ejc-ddj-course master-ddj encoding unicode
09 Feb 15

Eko Gunawan
unicode encoding programming utf8 development software reference
31 Jan 15

Dave Dennis
programming
28 Jan 15

Ann Malysheva
programming unicode
23 Jan 15

hai_ahi
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- mysterious world of character sets, encodings, Unicode, all that stuff.
- But it won't. When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues,
- blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.
- So I have an announcement to make: if you are a programmer working in 2003 and you don't know the basics of characters, character sets, encodings, and Unicode, and I catch you, I'm going to punish you by making you peel onions for 6 months in a submarine. I swear I will.
- IT'S NOT THAT HARD.
4 more annotations...
19 Jan 15

Scott Bower
unicode programming utf8 encoding development
14 Jan 15

fawadmalik
02 Jan 15

Tom Kleen
Unicode CSCI425 UTF-8 encoding
09 Dec 14

harryi3t
07 Dec 14

Syll Dubh
unicode encoding programming reference software
05 Dec 14

rflopezm
28 Nov 14

rocketshooter
- ASCII
- unaccented English letters
- represent every character using a number between 32 and 127
- stored in 7 bits
- Codes below 32 were called unprintable
- control characters
- lots of people
- can use the codes 128-255 for our own purposes
- The IBM-PC had something that came to be known as the OEM character set
- ANSI standard
- everybody agreed on what to do below 128, which was pretty much the same as ASCII
- different ways to handle the characters from 128 and on up, depending on where you lived
- Asian alphabets have thousands of letters
- usually solved by the messy system called DBCS, the "double byte character set"
- some letters were stored in one byte and others took two
- Unicode
- effort to create a single character set that included every reasonable writing system on the planet
- In Unicode, a letter maps to something called a code point
- Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639
- called a code point
- U+ means "Unicode"
- numbers are hexadecimal. U+0639
- no real limit on the number of letters that Unicode can define
- haven't yet said anything about how to store this in memory or represent it in an email message
- where encodings come in
- early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at
- forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string
- "Look at all those zeros!" they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF
- UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes
- In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes
- English text looks exactly the same in UTF-8 as it did in ASCII
- UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings)
- traditional store-it-in-two-byte methods are called UCS-2
- UTF-16 (because it has 16 bits)
- have to figure out if it's high-endian UCS-2 or low-endian UCS-2
- bunch of other ways of encoding Unicode.
- UTF-7,
- like UTF-8 but guarantees that the high bit will always be zero
- UCS-4, which stores each code point in 4 bytes
- For an email message, you are expected to have a string in the header of the form
  
  Content-Type: text/plain; charset="UTF-8"
- web server would return a similar Content-Type http header along with the web page itself
- not in the HTML itself, but as one of the response headers that are sent before the HTML page
- Of course this drove purists crazy... how can you read the HTML file until you know what encoding it's in?
- almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:
  
  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
- meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it's going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.
43 more annotations...
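
Reading the charset out of a Content-Type header like the one highlighted above can be sketched with Python's standard `email` package. The message text here is a made-up sample, not taken from the article.

```python
from email import message_from_string

# Hypothetical minimal message: one header, a blank line, then the body.
raw = 'Content-Type: text/plain; charset="UTF-8"\n\nHello'

msg = message_from_string(raw)

content_type = msg.get_content_type()   # the MIME type without parameters
charset = msg.get_content_charset()     # the charset parameter, lowercased

# Decode the body bytes using the encoding the header declared.
body = msg.get_payload(decode=True).decode(charset)

print(content_type, charset, body)
```

The same header name and parameter appear in HTTP responses and in the meta tag, so the declared charset is always the first thing to look for before decoding.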
26 Nov 14

whubynq
24 Nov 14

Philippe Combot
Unicode
14 Nov 14

Shivam Sharma
- There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.
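
A sketch of what "beyond 65,536" means in practice. The sample code point U+1F600 is my choice; in UTF-16 any code point above U+FFFF is stored as a pair of 16-bit surrogates, so it takes four bytes, not two.

```python
code_point = 0x1F600            # sample code point above 65,535
ch = chr(code_point)

utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark

# Compute the surrogate pair by hand to show where the four bytes come from.
offset = code_point - 0x10000
high = 0xD800 + (offset >> 10)    # top 10 bits
low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
by_hand = high.to_bytes(2, "big") + low.to_bytes(2, "big")

print(len(utf16), utf16.hex(), by_hand == utf16)
```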
10 Nov 14

chrischmo
unicode
08 Nov 14

programming software unicode development delicious
05 Nov 14

Guilherme Pedrosa
Haven’t mastered the basics of Unicode and character sets? Please don’t write another line of code until you’ve read this article.

unicode misc
22 Oct 14

Michael Tim
08 Oct 14

Stephanie Teltz
- "plain text = ascii = characters are 8 bits" is not only wrong
- good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127.
- Most computers in those days were using 8-bit bytes
- Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes."
- ANSI standard.
- everybody agreed on what to do below 128, which was pretty much the same as ASCII,
- lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages.
- They were the same below 128 but different from 128 up, where all the funny letters resided.
- messy system called DBCS, the "double byte character set" in which some letters were stored in one byte and others took two.
- Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows' AnsiNext and AnsiPrev which knew how to deal with the whole mess.
- Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too.
- In Unicode, a letter maps to something called a code point
- magic number by the Unicode consortium which is written like this: U+0639
- called a code point
- There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.
- Hello
  
  which, in Unicode, corresponds to these five code points:
  
  U+0048 U+0065 U+006C U+006C U+006F.
- We haven't yet said anything about how to store this in memory or represent it in an email message.
- Encodings
- encodings
- hey, let's just store those numbers in two bytes each
- high-endian or low-endian mode
- two ways to store Unicode
- bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark
- Not every Unicode string in the wild has a byte order mark at the beginning.
- "Look at all those zeros!" they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn't have minded guzzling twice the number of bytes. But those Californian wimps couldn't bear the idea of doubling the amount of storage it took for strings
- UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes.
- every code point from 0-127 is stored in a single byte.
- English text looks exactly the same in UTF-8 as it did in ASCII
- Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F
- is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet.
- three ways of encoding Unicode
- UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII
- if you don't tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends.
31 more annotations...
26 Sep 14

Cesar Vega
22 Sep 14

kapeidien
uicode
18 Sep 14

skookiesprite
- Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255. The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters... horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing character
- Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages
07 Sep 14

Jerry Horton
unicode
20 Aug 14

Gianluca Ciccarelli
- we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter "A" was 65, etc. This could conveniently be stored in 7 bits
- Codes below 32 were called unprintable
- as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes
- In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages
- In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.
- Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point
- There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes
- people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte
- Not every Unicode string in the wild has a byte order mark at the beginning.
- UTF-8 was another system for storing your string of Unicode code points
- In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.
9 more annotations...
03 Aug 14

shingo nakayama
encoding
28 Jul 14

acazsouza
unicode programming encoding joelonsoftware
22 Jul 14

Martin Johnson
unicode encoding programming utf8 development
14 Jul 14

unicode encoding
07 Jul 14

jhave2
" they had their own ideas of what should go where in the space from 128 to 255. "

unicode encoding programming utf8
- In Unicode, a letter maps to something called a code point which is still just a theoretical concept.
- Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041.
- early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode
- Not every Unicode string in the wild has a byte order mark at the beginning.
- UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.
- Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you'll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).
- It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.
  
  There Ain't No Such Thing As Plain Text.
5 more annotations...
03 Jul 14

digerateur
unicode encoding utf8
- you're not much better than a medical doctor who doesn't believe in germs
- ASCII which was able to represent every character using a number between 32 and 127.
- lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255.
- These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided
- But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.
- Asian alphabets have thousands of letters
- It was easy to move forward in a string, but dang near impossible to move backwards.
- Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory:
  
  A -> 0100 0001
  
  In Unicode, a letter maps to something called a code point which is still just a theoretical concept.
- Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal
- So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE
- UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings
- ISO-8859-1, aka Latin-1 (also useful for any Western European language)
- Almost every stupid "my website looks like gibberish" or "she can't read my emails when I use accents" problem comes down to one naive programmer who didn't understand the simple fact that if you don't tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends
- how can you read the HTML file until you know what encoding it's in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters
- But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it's going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified
- Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used
- "conservative in what you emit and liberal in what you accept" is quite frankly not a good engineering principle
15 more annotations...
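
What "you simply cannot display it correctly" looks like, as a sketch with an arbitrary sample word: the same bytes read with the wrong encoding turn into gibberish, and a target encoding with no equivalent character falls back to a question mark.

```python
original = "café"  # arbitrary sample with one accented letter

utf8_bytes = original.encode("utf-8")

# Right bytes, wrong assumption: reading UTF-8 as Latin-1 garbles the accent.
misread = utf8_bytes.decode("latin-1")

# Right bytes, right assumption: the text round-trips.
correct = utf8_bytes.decode("utf-8")

# ASCII has no code for the accented letter, so it is replaced with "?".
lossy = original.encode("ascii", errors="replace")

print(misread, correct, lossy)
```

Nothing in the bytes themselves says which reading is right, which is why the encoding has to travel with the string.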
29 Jun 14

Rajanand I
"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets" http://t.co/rDaieVX9H4

tweet
28 Jun 14

ionuto
Develop C++ Unicode
27 Jun 14

Eva Asensio
data-journalism-mooc
25 Jun 14

Tim Beck
unicode encoding
24 Jun 14

lucaskaim
"to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta "

unicode
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About
- (No Excuses!)
- Unicode and Character Sets
1 more annotation...
12 Jun 14

Bernd Oswald
ddj data unicode encoding
11 Jun 14

Mark Owen
data journalism
09 Jun 14

ambdxtrch
29 May 14

jgsogo
unicode encoding software i18n utf8
19 May 14

psxcode
unicode
15 May 14

Jochen Fromm
Joel on Software

unicode encoding programming utf8 i18n
12 May 14

Der Robert
05 May 14

Alex Jensen
charset character set unicode encode
04 May 14

contempt contempt
unicode utf8
02 May 14

Martin Homik
encoding unicode
29 Apr 14

Sorawit Wanitwarodom
unicode charactersets
25 Apr 14

Michael Alt
unicode encoding


Public Sticky Notes

Jochen Burkhard on 2010-09-08

Hhm, lazy me...
That is some stuff we really need to take care of!

Page Comments

billso on 2006-07-25

Joel on Software
yc c on 2007-06-03

Nice story of encoding
Yann Esposito on 2008-10-23

At last, I've understood the difference between Unicode and Encoding.
