This link has been bookmarked by 4 people. It was first bookmarked on 17 Apr 2007, by Moe Mauch.
-
17 Apr 09
-
26 Jun 08
-
17 Apr 07
-
import codecs
import locale

# Map each byte-order mark (BOM) to the codec used to decode the data.
# codecs.BOM_UTF16 equals the BE or LE constant for the native byte order,
# so that key appears twice and the later entry wins.
encoding_matrix = {
    codecs.BOM_UTF8: 'utf_8',
    codecs.BOM_UTF16: 'utf_16',
    codecs.BOM_UTF16_BE: 'utf_16_be',
    codecs.BOM_UTF16_LE: 'utf_16_le',
}

def guess_encoding(data):
    """
    Given a byte string, guess the encoding.

    First it tries for a UTF-8/UTF-16 BOM.

    Next it tries 'ascii' and 'UTF-8', then any encodings gathered from
    locale information, and finally 'ISO8859-1' and 'cp1252'.

    The calling program *must* first call
        locale.setlocale(locale.LC_ALL, '')

    If successful it returns
        (decoded_unicode, successful_encoding)
    If unsuccessful it raises a ``UnicodeError``.
    """
    for bom, enc in encoding_matrix.items():
        if data.startswith(bom):
            return data.decode(enc), enc
    encodings = ['ascii', 'UTF-8']
    successful_encoding = None
    # Collect candidate encodings from the locale; not every platform
    # provides each of these calls.
    try:
        encodings.append(locale.nl_langinfo(locale.CODESET))
    except AttributeError:
        pass
    try:
        encodings.append(locale.getlocale()[1])
    except (AttributeError, IndexError):
        pass
    try:
        encodings.append(locale.getdefaultlocale()[1])
    except (AttributeError, IndexError):
        pass
    # latin-1 accepts every byte value, so it succeeds whenever it is reached
    encodings.append('ISO8859-1')
    encodings.append('cp1252')
    for enc in encodings:
        if not enc:
            continue
        try:
            decoded = data.decode(enc)
            successful_encoding = enc
            break
        except (UnicodeError, LookupError):
            pass
    if successful_encoding is None:
        raise UnicodeError('Unable to decode input data. Tried the'
                           ' following encodings: %s.' % ', '.join(
                               [repr(enc) for enc in encodings if enc]))
    else:
        if successful_encoding == 'ascii':
            # our default ascii encoding
            successful_encoding = 'ISO8859-1'
        return (decoded, successful_encoding)
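The two-stage idea in the snippet (check for a BOM first, then walk a list of fallback encodings) can be sketched more compactly for Python 3. This is only an illustrative sketch, not the bookmarked author's code: the names decode_with_guess, BOM_CODECS and FALLBACK_CODECS are hypothetical, the locale lookups are left out, and it assumes you want the BOM removed from the decoded text, which is why it uses the 'utf_8_sig' and 'utf_16' codecs.

```python
import codecs

# Hypothetical table (not from the bookmarked snippet): each BOM is paired
# with a codec that consumes the BOM while decoding.
BOM_CODECS = [
    (codecs.BOM_UTF8, "utf_8_sig"),
    (codecs.BOM_UTF16_BE, "utf_16"),
    (codecs.BOM_UTF16_LE, "utf_16"),
]

# Fallbacks, tried in order. latin_1 maps every byte value to a character,
# so it acts as a catch-all and must stay last.
FALLBACK_CODECS = ["ascii", "utf_8", "latin_1"]


def decode_with_guess(data):
    """Return (text, codec_name) for a bytes object."""
    # Stage 1: a BOM identifies the encoding unambiguously.
    for bom, codec in BOM_CODECS:
        if data.startswith(bom):
            return data.decode(codec), codec
    # Stage 2: the first codec that decodes without error wins.
    for codec in FALLBACK_CODECS:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # Not reachable while latin_1 is in the fallback list.
    raise UnicodeError("unable to decode input data")
```

For example, decode_with_guess(b"caf\xc3\xa9") returns ("café", "utf_8"), while decode_with_guess(b"caf\xe9") falls through to ("café", "latin_1"). Because the last fallback never fails, a result of "latin_1" means "nothing stricter matched" rather than a positive identification.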
-
28 Feb 07