-
Notifications
You must be signed in to change notification settings - Fork 5
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
add how to detect the charset of the page
- Loading branch information
Showing
1 changed file
with
12 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
在写爬虫程序(web页面)的时候,经常需要去解析页面的内容,而在解析前就必须要知道该页面是用何种字符集来编码的,这样才能有效的避免乱码的问题。 | ||
|
||
那么如何才能得知目的页面的编码呢?让我们来看看来自“[W3C](http://www.w3.org/TR/html4/charset.html)”的官方解释: | ||
|
||
1. An HTTP "charset" parameter in a "Content-Type" field | ||
2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset" | ||
3. The charset attribute set on an element that designates an external resource | ||
|
||
上述描述中表现了检测页面编码的优先级,也就是说首先会看http头信息中的“Content-Type”字段、如果没有,就去看Meta信息,还没有的话,对于一些外链(如css、JavaScript)就会看这种元素专门的charset字段。如果检查完上述三种方式之后还是无法确定呢?那就采用默认的*ISO-8859-1*字符集来解析。 | ||
|
||
顺便提一句,既然是W3C的标准,那就说明标准浏览器都是这么工作的哦! | ||
|