This article covers how to use BeautifulSoup4 for Python web scraping. Many people are not very familiar with it, so the key points are summarized below.
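Web scraping: the BeautifulSoup4 parser
BeautifulSoup makes parsing HTML fairly simple and has a very friendly API. It supports CSS selectors and the HTML parser in the Python standard library, as well as lxml's HTML and XML parsers.
Compared with regular expressions, it is simpler to use.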
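To make the comparison with regular expressions concrete, here is a minimal sketch that extracts the same items both ways. The snippet and variable names are made up for this illustration, and the standard-library "html.parser" is used so that lxml is not required.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
snippet = '<ul><li class="item">one</li><li class="item">two</li></ul>'

# Regular expression: the pattern has to spell out the exact markup.
regex_items = re.findall(r'<li class="item">(.*?)</li>', snippet)

# BeautifulSoup: a CSS selector describes what we want instead.
soup = BeautifulSoup(snippet, "html.parser")
css_items = [li.get_text() for li in soup.select("li.item")]

print(regex_items)
print(css_items)
```

Both approaches return the same list here, but the regular expression stops matching as soon as the attribute order or spacing in the markup changes, while the selector keeps working.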
Example:
First, import the bs4 library.
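#!/usr/bin/python3
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# Pretty-print the content of the soup object
print(soup.prettify())

Output:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>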
The four kinds of objects
BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds:
(1) Tag
(2) NavigableString
(3) BeautifulSoup
(4) Comment
1.Tag
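Put simply, a Tag is one of the tags in the HTML, for example:
<title>The Dormouse's story</title>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
HTML tags such as title, head, a and p above, together with the content they contain, are Tags. Let's use BeautifulSoup to get some Tags: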
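#!/usr/bin/python3
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# Print the title tag
print(soup.title)
# Print the head tag
print(soup.head)
# Print the a tag
print(soup.a)
# Print the p tag
print(soup.p)
# Print the type of soup.p
print(type(soup.p))

Output:
<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<class 'bs4.element.Tag'>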
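Writing soup followed by a tag name is an easy way to get that tag and its content; the objects returned are of type bs4.element.Tag. Note, however, that this only finds the first matching tag in the whole document. Getting every matching tag requires a search method such as find_all.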
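As a short sketch of the difference between the first match and all matches, the example below compares soup.a with find_all("a"). The snippet is made up for this illustration, and "html.parser" is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
snippet = "<p><a>Lacie</a> and <a>Tillie</a></p>"
soup = BeautifulSoup(snippet, "html.parser")

# Attribute-style access returns only the first matching tag.
first_link = soup.a

# find_all returns every matching tag in document order.
all_links = soup.find_all("a")
link_texts = [a.get_text() for a in all_links]

print(first_link.get_text())
print(link_texts)
```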
A Tag has two important attributes: name and attrs.
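#!/usr/bin/python3
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# The soup object is special: its name is [document]
print(soup.name)
# For any other tag, name is the name of the tag itself
print(soup.head.name)
# Print all attributes of the p tag; the result is a dictionary
print(soup.p.attrs)
# Print the class attribute of the p tag
print(soup.p['class'])
# get() with the attribute name returns the same value when the attribute exists
print(soup.p.get('class'))
print(soup.p)
# Modify an attribute
soup.p['class'] = "newClass"
print(soup.p)
# Delete an attribute
del soup.p['class']
print(soup.p)

Output:
[document]
head
{'class': ['title'], 'name': 'dromouse'}
['title']
['title']
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
<p name="dromouse"><b>The Dormouse's story</b></p>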
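Subscript access and get() only behave the same when the attribute exists. The sketch below, using a made-up one-line snippet and "html.parser", shows what happens for a missing attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup('<p class="title" name="dromouse">text</p>', "html.parser")
p = soup.p

# get() returns None (or a supplied default) when the attribute is absent.
missing = p.get("id")
fallback = p.get("id", "n/a")

# Subscript access raises KeyError for an absent attribute.
try:
    p["id"]
    raised = False
except KeyError:
    raised = True

# has_attr() checks for an attribute without reading it.
has_name = p.has_attr("name")

print(missing, fallback, raised, has_name)
```

When scraping pages whose markup is not fully under your control, get() is usually the safer choice because a missing attribute does not stop the program.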
2.NavigableString
Now that we can get a tag, how do we get the text inside it? Simply use .string, for example:
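#!/usr/bin/python3
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# Print the text of the p tag
print(soup.p.string)
# Print the type of soup.p.string
print(type(soup.p.string))

Output:
The Dormouse's story
<class 'bs4.element.NavigableString'>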
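One caveat is worth a small sketch: .string only yields text when a tag has a single child (or a single chain of nested children). For a tag that mixes text and child tags it is None, and get_text() is the usual alternative. The snippet is made up for this illustration and uses "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# <p> has two children (a string and a <b> tag), so .string is None.
mixed_string = soup.p.string

# <b> has exactly one string child, so .string returns it.
bold_string = soup.b.string

# get_text() joins all the text found inside the tag.
full_text = soup.p.get_text()

print(mixed_string, bold_string, full_text)
```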
3.BeautifulSoup
The BeautifulSoup object represents the content of the document as a whole. Most of the time it can be treated as a Tag object, albeit a special one, and we can get its type, name and attributes.
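#!/usr/bin/python3
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# Type of the soup object itself
print(type(soup))
# Name
print(soup.name)
# Attributes
print(soup.attrs)

Output:
<class 'bs4.BeautifulSoup'>
[document]
{}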
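The claim that the soup object can be treated as a Tag can be checked directly. The following is a minimal sketch with a made-up snippet and "html.parser".

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup("<p>hi</p>", "html.parser")

# BeautifulSoup is a subclass of Tag, so Tag methods work on it.
is_tag = isinstance(soup, Tag)
found = soup.find("p")

print(is_tag, soup.name, found.name)
```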
4.Comment
A Comment object is a special type of NavigableString. When it is printed, the output does not include the comment delimiters.
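#!/usr/bin/python3
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# The first a tag, whose only content is an HTML comment
print(soup.a)
print(soup.a.string)
print(type(soup.a.string))

Output:
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>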
The content of the a tag is actually a comment, but when we output it with .string the comment delimiters have already been removed, so it looks like ordinary text.
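Because a comment printed through .string is indistinguishable from ordinary text, it can help to check the type before using the value. The sketch below is one way to do that; the snippet and variable names are made up for this illustration, and "html.parser" is used so that lxml is not required.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup, used only for this illustration.
snippet = "<p><a><!-- Elsie --></a><a>Lacie</a></p>"
soup = BeautifulSoup(snippet, "html.parser")

visible = []
comments = []
for a in soup.find_all("a"):
    s = a.string
    if s is None:
        continue
    # Comment is a subclass of NavigableString, so test for it explicitly.
    if isinstance(s, Comment):
        comments.append(s.strip())
    else:
        visible.append(str(s))

print(visible)
print(comments)
```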