How to Use BeautifulSoup4 in Python Web Scraping


This article looks at how to use BeautifulSoup4 in Python web scraping. Many readers may not be familiar with it, so the main points are summarized below.

Web Scraping: The BeautifulSoup4 Parser

BeautifulSoup makes parsing HTML fairly simple. Its API is user-friendly, and it supports CSS selectors, the HTML parser in the Python standard library, and the lxml parsers, including lxml's XML parser.

Compared with regular expressions, it is simpler to use.
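As a small illustration of the CSS selector support mentioned above, here is a minimal sketch. It assumes a trimmed-down HTML fragment and uses the built-in "html.parser" so that no extra package is required; the selectors and variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# A small fragment, assumed here for illustration only
html = """
<p class="story">
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""

# "html.parser" ships with Python; "lxml" needs the third-party lxml package
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of matching tags
names = [a.get_text() for a in soup.select("a.sister")]
print(names)

# select_one() returns the first match, or None if nothing matches
link3 = soup.select_one("#link3")
print(link3["href"])
```

The same lookups written with regular expressions would need to account for attribute order and quoting, which is the kind of work the parser takes care of.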

Example:

First, import the bs4 library:

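#!/usr/bin/python3
# -*- coding:utf-8 -*-

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# Pretty-print the contents of the soup object
print(soup.prettify())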

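Output

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>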

The Four Kinds of Objects

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds:

(1) Tag

(2) NavigableString

(3) BeautifulSoup

(4) Comment
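The four kinds listed above can be seen side by side in a minimal sketch. It assumes a tiny made-up fragment and the built-in "html.parser" rather than lxml; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# A tiny fragment, assumed for illustration: one tag, one string, one comment
soup = BeautifulSoup("<p><b>bold</b><!-- note --></p>", "html.parser")

# The object returned by the constructor represents the whole document
soup_kind = type(soup).__name__

# Accessing a tag by name gives a Tag
tag_kind = type(soup.p).__name__

# Text inside a tag is a NavigableString
string_kind = type(soup.b.string).__name__

# The second child of <p> is the comment node
comment_kind = type(soup.p.contents[1]).__name__

print(soup_kind, tag_kind, string_kind, comment_kind)
```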

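1. Tag

Put simply, a Tag is one of the tags in the HTML, for example:

<title>The Dormouse's story</title>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

HTML tags such as title, head, a, and p, together with the content they enclose, are Tags. Let's use BeautifulSoup to get some Tags: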

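#!/usr/bin/python3
# -*- coding:utf-8 -*-

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# Print the title tag
print(soup.title)

# Print the head tag
print(soup.head)

# Print the (first) a tag
print(soup.a)

# Print the (first) p tag
print(soup.p)

# Print the type of soup.p
print(type(soup.p))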

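Output

<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<class 'bs4.element.Tag'>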

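We can easily get these tags by writing soup followed by the tag name; the resulting objects are of type bs4.element.Tag. Note, however, that this finds only the first matching tag in the whole document. Querying every matching tag requires a separate method, which is introduced later.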
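To make the "first match only" behavior concrete, here is a minimal sketch that contrasts attribute-style access with find_all(). It assumes a shortened fragment and the built-in "html.parser"; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# A shortened fragment, assumed for illustration
html = """
<p class="story">
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# soup.a behaves like soup.find("a"): it returns only the first match
first = soup.a

# find_all() returns every match, in document order
all_ids = [a["id"] for a in soup.find_all("a")]

# A tag name that does not occur in the document gives None
no_table = soup.table

print(first["id"], all_ids, no_table)
```

Because a missing tag yields None rather than an exception, it is worth checking the result before chaining further attribute access onto it.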

A Tag has two important attributes: name and attrs.

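#!/usr/bin/python3
# -*- coding:utf-8 -*-

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# The soup object is a special case: its name is [document]
print(soup.name)

# For any other tag, name is the name of the tag itself
print(soup.head.name)

# Print all attributes of the p tag; attrs is a dict
print(soup.p.attrs)

# Print the class attribute of the p tag
print(soup.p['class'])
# get() also works: pass the attribute name; for an attribute
# that exists, the result is the same as the lookup above
print(soup.p.get('class'))

print(soup.p)

# Modify an attribute
soup.p['class'] = "newClass"
print(soup.p)

# Delete an attribute
del soup.p['class']
print(soup.p)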

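Output

[document]
head
{'class': ['title'], 'name': 'dromouse'}
['title']
['title']
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
<p name="dromouse"><b>The Dormouse's story</b></p>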
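The two ways of reading an attribute differ when the attribute is absent. The following minimal sketch shows that difference, assuming a single-tag fragment and the built-in "html.parser"; the "no-id" default is just a placeholder chosen for this example.

```python
from bs4 import BeautifulSoup

# Single-tag fragment, assumed for illustration
html = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

existing = p.get("class")        # same value as p["class"]
missing = p.get("id")            # None, no exception
fallback = p.get("id", "no-id")  # caller-supplied default

try:
    p["id"]                      # subscript access on a missing attribute
    raised = False
except KeyError:
    raised = True

print(existing, missing, fallback, raised)
```

Note that class comes back as a list because an HTML element can carry several classes, while name comes back as a plain string.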

2. NavigableString

Now that we have the tag itself, how do we get the text inside it? It is simple: use .string, for example:

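#!/usr/bin/python3
# -*- coding:utf-8 -*-

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# Print the text inside the p tag
print(soup.p.string)

# Print the type of soup.p.string
print(type(soup.p.string))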

Output

The Dormouse's story
<class 'bs4.element.NavigableString'>
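One caveat about .string: it only yields a value when the tag has a single child to descend into. The sketch below, which assumes a shortened fragment and the built-in "html.parser", shows what happens with mixed content and how get_text() and stripped_strings can be used instead.

```python
from bs4 import BeautifulSoup

# Shortened fragment with mixed text and tags, assumed for illustration
html = '<p class="story">Once upon a time <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# A tag with exactly one string child: .string returns that string
single = soup.a.string

# A tag with several children: .string is None
mixed = soup.p.string

# get_text() concatenates all text beneath the tag
full_text = soup.p.get_text()

# stripped_strings yields each piece of text with whitespace trimmed
pieces = list(soup.p.stripped_strings)

print(single, mixed, full_text, pieces)
```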

3. BeautifulSoup

A BeautifulSoup object represents the contents of an entire document. Most of the time it can be treated as a Tag object, since it is a special kind of Tag. We can get its type, name, and attributes in turn.

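#!/usr/bin/python3
# -*- coding:utf-8 -*-

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# Type of the soup object itself
print(type(soup))

# Name
print(soup.name)

# Attributes (the document object has none)
print(soup.attrs)

Output

<class 'bs4.BeautifulSoup'>
[document]
{}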
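The statement that a BeautifulSoup object can be treated as a Tag can be checked directly. This is a minimal sketch, assuming a short document and the built-in "html.parser"; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Short document, assumed for illustration
html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# BeautifulSoup is a subclass of Tag, so Tag methods such as find() work on it
is_tag = isinstance(soup, Tag)
title_text = soup.find("title").string

# Unlike ordinary tags, the document object sits at the root and has no parent
root_parent = soup.parent
html_parent_name = soup.html.parent.name

print(is_tag, title_text, root_parent, html_parent_name)
```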

4. Comment

A Comment object is a special type of NavigableString. When printed, its content does not include the comment markers.

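#!/usr/bin/python3
# -*- coding:utf-8 -*-

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# Print the first a tag
print(soup.a)

# Print the content of the a tag
print(soup.a.string)

# Print the type of soup.a.string
print(type(soup.a.string))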

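Output

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>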

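The content of the a tag is actually a comment, but when we print it with .string, the comment markers have already been removed, so the printed text alone does not reveal that it came from a comment.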
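Since the printed text looks the same as ordinary text, a type check is one way to tell comments apart before using the value. The sketch below assumes a shortened fragment and the built-in "html.parser"; the list names are only for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Shortened fragment: one link holds a comment, the other holds plain text
html = '<p><a id="link1"><!-- Elsie --></a> <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

visible = []   # text a reader of the page would actually see
comments = []  # text that came from HTML comments

for a in soup.find_all("a"):
    text = a.string
    # Comment is a subclass of NavigableString, so check for it explicitly
    if isinstance(text, Comment):
        comments.append(str(text))
    else:
        visible.append(str(text))

print(visible, comments)
```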

