How to Parse HTML with BeautifulSoup in Python

This article explains in detail how to parse HTML with BeautifulSoup in Python. I found it quite practical and am sharing it as a reference; I hope you get something out of it.


Summary

Beautiful Soup is a Python library for extracting data from HTML or XML files. It parses HTML or XML data into Python objects so that the data can be processed conveniently in Python code.

Environment

CentOS 7.5
Python 2.7
BeautifulSoup4

Beautiful Soup usage

The core job of Beautiful Soup is to search and edit the tags of an HTML document.

Basic concepts: object types

Beautiful Soup converts a complex HTML document into a tree in which every node is a Python object. These objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.

Object type	Description
BeautifulSoup	The document as a whole
Tag	An HTML tag
NavigableString	The text contained in a tag
Comment	A special kind of NavigableString, used when the text inside a tag is an HTML comment
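The four object types can be seen side by side in a small sketch. The markup below is a hypothetical sample chosen to contain one node of each kind, and `html.parser` is used only so that the example needs nothing beyond bs4 itself.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical sample markup: one tag, one text node, one comment
markup = "<p>Hello<!-- a note --></p>"
soup = BeautifulSoup(markup, "html.parser")

tag = soup.p                # Tag: the <p> element
text = tag.contents[0]      # NavigableString: the text "Hello"
comment = tag.contents[1]   # Comment: the text inside <!-- ... -->

for obj in (soup, tag, text, comment):
    print(type(obj).__name__)

# Comment is a subclass of NavigableString
print(isinstance(comment, NavigableString))
```

Because a Comment is also a NavigableString, code that collects text may want to check for Comment explicitly before treating a node as visible text.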

Installation and import

# Beautiful Soup
pip install beautifulsoup4

# Parsers
pip install lxml
pip install html5lib

# Initialization
from bs4 import BeautifulSoup

# Option 1: parse an open file directly
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'lxml')

# Option 2: parse a string
resp = "data"
soup = BeautifulSoup(resp, 'lxml')

# soup is a BeautifulSoup object
print(type(soup))

Searching and filtering tags

Basic methods

There are two basic search methods, find_all() and find(). find_all() returns a list of every tag that matches the filter, while find() returns only the first match (or None if nothing matches).

import re

soup = BeautifulSoup(resp, 'lxml')

# Return the first Tag named "a"
soup.find("a")

# Return a list of all matching tags
soup.find_all("a")

## Shorthand for find_all
soup("a")

# Find all tags whose names start with "b"
for tag in soup.find_all(re.compile("^b")):
  print(tag.name)

# Find all tags whose names are in the list
soup.find_all(["a", "p"])

# Find tags named p whose class attribute is "title"
soup.find_all("p", "title")

# Find tags whose id attribute is "link2"
soup.find_all(id="link2")

# Find tags that have an id attribute
soup.find_all(id=True)

# Filter on several attributes at once
soup.find_all(href=re.compile("elsie"), id='link1')

# Attributes whose names are not valid keyword arguments (such as data-*) go in attrs
soup.find_all(attrs={"data-foo": "value"})

# Find the first string that contains "sisters"
soup.find(string=re.compile("sisters"))

# Limit the number of results
soup.find_all("a", limit=2)

# Custom match function
def has_class_but_no_id(tag):
  return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)

# Custom match function applied to a single attribute
def not_lacie(href):
    return href and not re.compile("lacie").search(href)
soup.find_all(href=not_lacie)

# find_all() called on a tag searches all of its descendants; pass recursive=False to search only its direct children
soup.find_all("title", recursive=False)
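To make the return values of these filters concrete, here is a minimal runnable sketch. The markup is a hypothetical sample, and `html.parser` is used instead of lxml only to avoid an extra dependency.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
markup = (
    '<p class="title"><b>Story</b></p>'
    '<p class="story">'
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(markup, "html.parser")

first_link = soup.find("a")                      # first match only
all_links = soup.find_all("a")                   # list of every match
missing = soup.find("table")                     # no match: None
b_names = [t.name for t in soup.find_all(re.compile("^b"))]
titles = soup.find_all("p", "title")             # <p> with class "title"
by_id = soup.find_all(id="link2")
limited = soup.find_all("a", limit=1)
top_level = soup.find_all("p", recursive=False)  # direct children of soup
nested = soup.find_all("a", recursive=False)     # <a> is nested, so empty

print(first_link["id"], len(all_links), missing, b_names, len(top_level), nested)
```

Note that find() returns None when nothing matches, so chaining attribute access on its result is worth guarding.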

Extended methods

find_parents()	All ancestor nodes
find_parent()	The first ancestor node
find_next_siblings()	All following sibling nodes
find_next_sibling()	The first following sibling node
find_previous_siblings()	All preceding sibling nodes
find_previous_sibling()	The first preceding sibling node
find_all_next()	All following elements
find_next()	The first following element
find_all_previous()	All preceding elements
find_previous()	The first preceding element
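A short sketch of a few of these methods, starting from the middle paragraph of a hypothetical three-paragraph sample (the markup and ids are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three sibling paragraphs inside one div
markup = (
    '<div id="box">'
    '<p id="first">one</p><p id="second">two</p><p id="third">three</p>'
    '</div>'
)
soup = BeautifulSoup(markup, "html.parser")
second = soup.find(id="second")

parent = second.find_parent("div")             # nearest matching ancestor
ancestors = [t.name for t in second.find_parents()]
next_one = second.find_next_sibling("p")       # first later sibling
prev_all = second.find_previous_siblings("p")  # all earlier siblings
after = second.find_all_next("p")              # later elements in document order
before = second.find_previous("p")             # nearest earlier element

print(parent["id"], ancestors, next_one["id"], before["id"])
```

These methods accept the same filters as find_all() and find(); they differ only in which part of the tree they walk.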

CSS selectors

Beautiful Soup supports most CSS selectors (http://www.w3.org/TR/CSS2/selector.html). Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.

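html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')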

# All a tags
soup.select("a")

# Descendants, level by level
soup.select("body a")
soup.select("html head title")

# Direct children of a tag
soup.select("head > title")
soup.select("p > #link1")

# All later siblings of the matched tag
soup.select("#link1 ~ .sister")

# The sibling immediately after the matched tag
soup.select("#link1 + .sister")

# By class name
soup.select(".sister")
soup.select("[class~=sister]")

# By ID
soup.select("#link1")
soup.select("a#link1")

# By any of several IDs
soup.select("#link1,#link2")

# By the presence of an attribute
soup.select('a[href]')

# By attribute value (prefix, suffix, substring)
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')

# Return at most one match (still a list)
soup.select(".sister", limit=1)

# Return the first match only (a Tag, or None)
soup.select_one(".sister")
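The following sketch runs a handful of these selectors and shows what they return. The markup is a trimmed, hypothetical version of the "three sisters" sample, and `ids` is a helper introduced here only to keep the results readable.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical version of the sample document
markup = """
<html><head><title>Story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(markup, "html.parser")


def ids(tags):
    """Helper for this example: collect the id of each matched tag."""
    return [t["id"] for t in tags]


descendants = ids(soup.select("body a"))         # any depth below <body>
later = ids(soup.select("#link1 ~ .sister"))     # all later siblings
adjacent = ids(soup.select("#link1 + .sister"))  # the next sibling only
suffix = ids(soup.select('a[href$="tillie"]'))   # href ends with "tillie"
first = soup.select_one(".sister")               # a single Tag, not a list
title = soup.select_one("head > title").get_text()

print(descendants, later, adjacent, suffix, first["id"], title)
```

select() always returns a list, even with limit=1, whereas select_one() returns the Tag itself or None.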

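Tag object methods

Tag attributes

# The tags in this markup were lost from the original; the <p> tags follow from the code below, the class value is an assumption
soup = BeautifulSoup('<p class="boldest">Extremely bold</p><p class="boldest">Extremely bold2</p>', 'lxml')
# Get all p tag objects
tags = soup.find_all("p")
# Get the first p tag object
tag = soup.p
# The type of the tag object
type(tag)
# Tag name
tag.name
# Tag attributes (a dict)
tag.attrs
# Value of the class attribute
tag['class']
# The text contained in the tag, as a NavigableString
tag.string

# Iterate over all strings inside the tag
for string in tag.strings:
  print(repr(string))

# Iterate over all strings inside the tag, stripping whitespace and skipping blank ones
for string in tag.stripped_strings:
  print(repr(string))

# Get all the text in the tag, including the text of its descendants, as one Unicode string
tag.get_text()
## Joined with "|"
tag.get_text("|")
## Joined with "|", with whitespace stripped and empty strings dropped
tag.get_text("|", strip=True)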
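The difference between .string, .strings, .stripped_strings and get_text() is easiest to see on a tag with mixed content. A minimal sketch, using hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: text mixed with a nested tag, padded with spaces
markup = '<p class="intro">  Hello, <b>bold</b> world  </p>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.p

css_classes = tag["class"]            # class is multi-valued, so this is a list
single = tag.string                   # None: the tag has more than one child
inner = tag.b.string                  # 'bold': exactly one string child
raw = list(tag.strings)               # every string, whitespace kept
clean = list(tag.stripped_strings)    # stripped, blank strings skipped
joined = tag.get_text("|", strip=True)

print(css_classes, single, inner, raw, clean, joined)
```

.string is only useful when a tag has a single string child; for anything else, get_text() or the string iterators are the safer choice.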

Getting child nodes

tag.contents # list of the direct children
tag.children # iterator over the direct children
for child in tag.children:
  print(child)

tag.descendants # recursive iterator over all descendants
for child in tag.descendants:
  print(child)
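A small sketch of how .contents, .children and .descendants differ, using a hypothetical nested list:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a list whose second item contains a nested tag
markup = "<ul><li>one</li><li><b>two</b></li></ul>"
soup = BeautifulSoup(markup, "html.parser")
tag = soup.ul

direct = [child.name for child in tag.children]         # one level down only
everything = list(tag.descendants)                      # tags and strings, depth-first
tag_names = [d.name for d in tag.descendants if d.name]  # strings have name None

print(len(tag.contents), direct, len(everything), tag_names)
```

.descendants yields strings as well as tags, which is why it returns more items here than there are tags.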

Getting parent nodes

tag.parent # the direct parent tag
tag.parents # iterator over all ancestors of the element

for parent in tag.parents:
  if parent is None:
    print(parent)
  else:
    print(parent.name)
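As a concrete illustration, the sketch below walks up from a nested link in a hypothetical document. The last ancestor is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a link nested four levels deep
markup = "<html><body><div><p><a>link</a></p></div></body></html>"
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

direct_parent = tag.parent.name
ancestor_names = [parent.name for parent in tag.parents]

print(direct_parent)
print(ancestor_names)
print(soup.parent)  # the BeautifulSoup object has no parent
```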

Getting sibling nodes

# The next sibling element
tag.next_sibling

# All sibling elements after the current tag
tag.next_siblings
for sibling in tag.next_siblings:
  print(repr(sibling))

# The previous sibling element
tag.previous_sibling

# All sibling elements before the current tag
tag.previous_siblings
for sibling in tag.previous_siblings:
  print(repr(sibling))
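One detail that often surprises: a sibling can be a string rather than a tag. A minimal sketch with hypothetical markup, where a comma and space sit between two tags:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: text sits between the first two tags
markup = "<p><a>one</a>, <b>two</b><i>three</i></p>"
soup = BeautifulSoup(markup, "html.parser")
a = soup.a

next_node = a.next_sibling            # the string ", ", not the <b> tag
next_tag = a.find_next_sibling()      # skips strings and returns <b>
following = [s.name for s in a.next_siblings]            # None marks a string
preceding = [s.name for s in soup.i.previous_siblings]   # nearest first

print(repr(next_node), next_tag.name, following, preceding, a.previous_sibling)
```

When only tags are wanted, find_next_sibling() and find_previous_sibling() are usually more convenient than the raw attributes.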

Traversing elements

Beautiful Soup treats every tag (and every string) as an "element", ordered from top to bottom as it appears in the HTML. The attributes below step through these elements one at a time.

# The element right after the current tag
tag.next_element

# All elements after the current tag
for element in tag.next_elements:
  print(repr(element))

# The element right before the current tag
tag.previous_element
# All elements before the current tag
for element in tag.previous_elements:
  print(repr(element))
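The sketch below contrasts next_element with next_sibling on a hypothetical fragment: next_element follows parse order, so it steps into the tag's own children first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two paragraphs, the first with a nested tag
markup = "<div><p><b>one</b></p><p>two</p></div>"
soup = BeautifulSoup(markup, "html.parser")
first_p = soup.p
second_p = first_p.next_sibling          # the sibling at the same level

inside = first_p.next_element            # <b>: the first thing parsed after <p>
# Tags are shown by name, strings by their text
order = [e.name or str(e) for e in first_p.next_elements]
before_second = second_p.previous_element  # the string "one", not the first <p>

print(inside.name, order, repr(before_second))
```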

Modifying tag attributes

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b

tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1

tag.string = "New link text."
print(tag)
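A runnable sketch of the same kind of edit, with the resulting markup checked at each step. It also shows deleting an attribute with del, which the text above does not cover; the id value is given as a string here.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag.name = "blockquote"        # rename the tag
tag["class"] = "verybold"      # overwrite an existing attribute
tag["id"] = "1"                # add a new attribute
tag.string = "New link text."  # replace the tag's content
after_edit = str(tag)

del tag["id"]                  # remove an attribute again
after_delete = str(tag)

print(after_edit)
print(after_delete)
print(tag.get("id"))           # .get() returns None for a missing attribute
```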

Modifying tag content (NavigableString)

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
tag.string = "New link text."

Adding tag content (NavigableString)

from bs4 import NavigableString

soup = BeautifulSoup("<a>Foo</a>", 'lxml')
tag = soup.a
tag.append("Bar")
tag.contents

# Or

new_string = NavigableString("Bar")
tag.append(new_string)
print(tag)

Adding a comment (Comment)

A comment is a special NavigableString object, so it can also be added with the append() method.

from bs4 import Comment
soup = BeautifulSoup("<a>Foo</a>", 'lxml')
tag = soup.a
new_comment = soup.new_string("Nice to see you.", Comment)
tag.append(new_comment)
print(tag)
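Appending a plain string and then a comment can be sketched as follows, with the resulting markup shown. `html.parser` is used here only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<a>Foo</a>", "html.parser")
tag = soup.a

tag.append("Bar")                  # a plain str becomes a NavigableString
texts = [str(c) for c in tag.contents]

new_comment = soup.new_string("Nice to see you.", Comment)
tag.append(new_comment)            # the comment is appended like any string
result = str(tag)

print(texts)
print(result)
```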

Adding a tag (Tag)

There are two ways to add a tag: inside a given tag (the append method), or at a specified position (the insert, insert_before, and insert_after methods).

append method

soup = BeautifulSoup("<b></b>", 'lxml')
tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "Link text."
tag.append(new_tag)
print(tag)

* The insert method inserts an object (Tag or NavigableString) at a given position in the current tag's list of children.

html = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(html, 'lxml')
tag = soup.a
tag.contents
tag.insert(1, "but did not endorse ")
tag.contents

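insert_before() and insert_after() add an element as a sibling immediately before or after the current tag.

# The tags in this markup were lost from the original; the <b> wrapper is an assumption based on the code below
html = '<b>I linked to <i>example.com</i></b>'
soup = BeautifulSoup(html, 'lxml')
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.insert_before(tag)
soup.b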
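The three insertion methods can be compared on one small document. This is a sketch with hypothetical inserted text; the point is where each piece ends up.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# insert(): position 1 in the child list, between the text and <i>
tag.insert(1, "but did not endorse ")
after_insert = str(tag)

# insert_before(): a new sibling directly in front of <i>
new_tag = soup.new_tag("b")
new_tag.string = "really "
tag.i.insert_before(new_tag)

# insert_after(): a new sibling directly behind <i>
tag.i.insert_after(" today")
final = str(tag)

print(after_insert)
print(final)
```

insert() works relative to the parent's child list, while insert_before() and insert_after() work relative to the element they are called on.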

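* wrap() and unwrap() wrap a given element in a tag or strip a tag away. wrap() returns the new wrapper; unwrap() returns the tag that was removed.

# Wrap
soup = BeautifulSoup("<p>I wish I was bold.</p>", 'lxml')
soup.p.string.wrap(soup.new_tag("b"))
# Output: <b>I wish I was bold.</b>

soup.p.wrap(soup.new_tag("div"))
# Output: <div><p><b>I wish I was bold.</b></p></div>

# Unwrap
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'lxml')
a_tag = soup.a

a_tag.i.unwrap()
a_tag
# Output: <a href="http://example.com/">I linked to example.com</a>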

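Deleting tags

# The tags in this markup were lost from the original; this sample is an assumption based on the code below
html = '<b>I linked to <i>example.com</i></b>'

# Remove all children of the current tag (the tag itself stays)
soup = BeautifulSoup(html, 'lxml')
soup.b.clear()

# Remove the current tag and all of its children from the soup, and return the removed tag
soup = BeautifulSoup(html, 'lxml')
b_tag = soup.b.extract()
b_tag
soup

# Remove the current tag and all of its children from the soup and destroy them; nothing is returned
soup = BeautifulSoup(html, 'lxml')
soup.b.decompose()

# Replace the current tag with the given element
soup = BeautifulSoup(html, 'lxml')
tag = soup.i
new_tag = soup.new_tag("p")
new_tag.string = "Don't"
tag.replace_with(new_tag)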
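The four removal methods differ mainly in what stays in the tree and what comes back. The sketch below re-parses a hypothetical sample before each call so the effects do not interfere with one another.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

# clear(): the tag stays, its children go
soup = BeautifulSoup(markup, "html.parser")
soup.a.clear()
cleared = str(soup)

# extract(): the tag leaves the tree and is returned for reuse
soup = BeautifulSoup(markup, "html.parser")
i_tag = soup.i.extract()
extracted_doc = str(soup)

# decompose(): the tag leaves the tree and is destroyed
soup = BeautifulSoup(markup, "html.parser")
returned = soup.i.decompose()
decomposed_doc = str(soup)

# replace_with(): swap the tag for another element; the old tag is returned
soup = BeautifulSoup(markup, "html.parser")
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
old = soup.i.replace_with(new_tag)
replaced_doc = str(soup)

print(cleared, extracted_doc, decomposed_doc, replaced_doc, sep="\n")
```

extract() is the one to use when the removed tag is needed afterwards; decompose() is for content that should simply disappear.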

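Other methods

Output

# Pretty-printed output
tag.prettify()
tag.prettify("latin-1")

After parsing, Beautiful Soup holds the whole document as Unicode, and HTML entities are converted to the Unicode characters they stand for. When the document is converted back to a string, those characters are written out as UTF-8 rather than as the original HTML entities, so the entities are not restored.
When converting to Unicode, Beautiful Soup can also convert Microsoft "smart quotes" into the corresponding HTML or XML entities.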
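The entity behaviour described above can be sketched as follows. The sample sentence is hypothetical; `formatter="html"` is an output option of Beautiful Soup that asks for named entities where one exists.

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing three HTML entities
markup = "<p>&ldquo;Dammit!&rdquo; he said &amp; left.</p>"
soup = BeautifulSoup(markup, "html.parser")

# After parsing, entities are plain Unicode characters
text = soup.p.string

# Default output: only &, < and > are escaped again
default_out = str(soup)

# formatter="html": characters with a named entity are written as entities
html_out = soup.decode(formatter="html")

print(text)
print(default_out)
print(html_out)
```

The default output is valid HTML but not byte-for-byte identical to the input, which matters when diffing a document against its source.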

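Document encoding

After parsing, the document is held as Unicode. Beautiful Soup uses its "automatic encoding detection" sub-library to identify the encoding of the input and convert it to Unicode.

# The detected encoding (None when the input was already a Unicode string)
soup = BeautifulSoup(html, 'lxml')
soup.original_encoding

# You can also specify the encoding of the input bytes manually
soup = BeautifulSoup(html, 'lxml', from_encoding="iso-8859-8")
soup.original_encoding

# To make automatic detection more efficient, you can also rule out some encodings in advance
soup = BeautifulSoup(markup, 'lxml', exclude_encodings=["ISO-8859-7"])

When Beautiful Soup outputs a document, the default output encoding is UTF-8, whatever the encoding of the input document was.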
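A minimal sketch of the round trip: bytes in a legacy encoding go in, Unicode is held internally, and UTF-8 comes out by default. The Hebrew sample word and the variable names are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a Hebrew word encoded as ISO-8859-8 bytes
text = "\u05e9\u05dc\u05d5\u05dd"
raw = ("<h1>" + text + "</h1>").encode("iso-8859-8")

# from_encoding only applies to byte input
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-8")
decoded = soup.h1.string              # a Unicode str

utf8_bytes = soup.h1.encode()         # output defaults to UTF-8
# Characters the target encoding cannot represent become numeric references
latin_bytes = soup.h1.encode("latin-1")

print(soup.original_encoding)
print(decoded)
print(utf8_bytes)
print(latin_bytes)
```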

Document parsers

Beautiful Soup currently supports the "lxml", "html5lib", and "html.parser" parsers.

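# The markup and outputs were stripped from the original; "<a></p>" is an assumed sample of invalid HTML
# No parser named: Beautiful Soup picks the best one installed (lxml in this environment)
soup=BeautifulSoup("<a></p>")
soup
# Output: <html><body><a></a></body></html>
soup=BeautifulSoup("<a></p>", "lxml")
soup
# Output: <html><body><a></a></body></html>
soup=BeautifulSoup("<a></p>", "html5lib")
soup
# Output: <html><head></head><body><a><p></p></a></body></html>
soup=BeautifulSoup("<a></p>", "html.parser")
soup
# Output: <a></a>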


That concludes this article on how to parse HTML with BeautifulSoup in Python. I hope it serves as a useful reference.
