
A simple introduction to requests, XPath, and lxml in Python

Installation

Install requests directly with pip. Anaconda already ships with this library.

pip install requests

Basic usage

The requests documentation

1. Fetching a URL:

import requests
url='http://www.baidu.com'
res = requests.get(url)
res.text
res.status_code

<!DOCTYPE html>
<!--STATUS OK-->
<html>
<head>
<meta http-equiv=content-type content=text/html;charset=utf-8>
<meta http-equiv=X-UA-Compatible content=IE=Edge>
<meta content=always name=referrer>
<link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css>
<title>ç™¾åº¦ä¸€ä¸‹ï¼Œä½ å°±çŸ¥é“</title>
</head>
<body link=#0000cc>
<div id=wrapper>
<div id=head>
<div class=head_wrapper>
<div class=s_form>
<div class=s_form_wrapper>
<div id=lg>
<img hidefocus=true src=//www.baidu.com/img/bd_logo1.jpg width=270 height=129>
</div>
<form id=form name=f action=//www.baidu.com/s class=fm>
<input type=hidden name=bdorz_come value=1>
<input type=hidden name=ie value=utf-8>
<input type=hidden name=f value=8>
<input type=hidden name=rsv_bp value=1>
<input type=hidden name=rsv_idx value=1>
<input type=hidden name=tn value=baidu>
<span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span>
<span class="bg s_btn_wr"><input type=submit id=su value=ç™¾åº¦ä¸€ä¸‹ class="bg s_btn"></span>
</form>
</div>
</div>
<div id=u1>
<a href=http://news.baidu.com name=tj_trnews class=mnav>æ–°é—»</a>
<a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a>
<a href=http://map.baidu.com name=tj_trmap class=mnav>åœ°å›¾</a>
<a href=http://v.baidu.com name=tj_trvideo class=mnav>è§†é¢‘</a>
<a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å§</a>
<noscript>
<a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mnu&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç™»å½•</a> </noscript>
<script>
document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mnu&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">ç™»å½•</a>');</script>
<a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ›´å¤šäº§å“</a>
</div>
</div>
</div>
<div id=ftCon>
<div id=ftConw>
<p id=lh>
<a href=http://home.baidu.com>å³äºŽç™¾åº¦</a>
<a href=http://ir.baidu.com>About Baidu</a>
</p>
<p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>ä½¿ç”¨ç™¾åº¦å‰å¿è¯»</a>&nbsp;
<a href=http://jianyi.baidu.com/ class=cp-feedback>æ„è§åé¦ˆ</a>&nbsp;äº¬ICPè¯030173å·&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

200

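The text is garbled because the response body was decoded with the wrong character encoding. Setting res.encoding='utf-8' fixes it. The 200 is the status code; in general, any 2xx status code means the request went through without problems.

1xx: informational; the web server has received the request and is still processing it
2xx: success; for example, 200 means OK (the request completed) and 204 means success with no response body
3xx: redirection
4xx: client error, such as the well-known 404 Not Found
5xx: server error, such as 500 Internal Server Error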
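
The status classes above can be turned into a small lookup. The following is only a sketch using the standard library's http.HTTPStatus; the helper name describe_status is made up for this example and is not part of requests.

```python
from http import HTTPStatus

# Status classes, keyed by the first digit of the code
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    category = STATUS_CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not registered in http.HTTPStatus
        phrase = "unregistered code"
    return f"{code} {phrase} ({category})"


for code in (200, 204, 404, 500):
    print(describe_status(code))
```

With a real response you would pass res.status_code to the helper.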

res.encoding='utf-8'
print(res.text)

<!DOCTYPE html>
<!--STATUS OK-->
<html>
<head>
<meta http-equiv=content-type content=text/html;charset=utf-8>
<meta http-equiv=X-UA-Compatible content=IE=Edge>
<meta content=always name=referrer>
<link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css>
<title>百度一下，你就知道</title>
</head>
<body link=#0000cc>
<div id=wrapper>
<div id=head>
<div class=head_wrapper>
<div class=s_form>
<div class=s_form_wrapper>
<div id=lg>
<img hidefocus=true src=//www.baidu.com/img/bd_logo1.jpg width=270 height=129>
</div>
<form id=form name=f action=//www.baidu.com/s class=fm>
<input type=hidden name=bdorz_come value=1>
<input type=hidden name=ie value=utf-8>
<input type=hidden name=f value=8>
<input type=hidden name=rsv_bp value=1>
<input type=hidden name=rsv_idx value=1>
<input type=hidden name=tn value=baidu>
<span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span>
<span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span>
</form>
</div>
</div>
<div id=u1>
<a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a>
<a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a>
<a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a>
<a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a>
<a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a>
<noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mnu&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript>
<script>
document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mnu&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
</script>
<a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a>
</div>
</div>
</div>
<div id=ftCon>
<div id=ftConw>
<p id=lh>
<a href=http://home.baidu.com>关于百度</a>
<a href=http://ir.baidu.com>About Baidu</a>
</p>
<p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp;
<a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif>
</p>
</div>
</div>
</div>
</body>
</html>
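
Why the first output was garbled can be reproduced without any network access. The most likely cause is that the response headers did not declare a charset, in which case requests falls back to ISO-8859-1 for text content, so UTF-8 bytes are decoded one byte per character. The sketch below imitates that with plain str and bytes; the variable names are chosen for this example only.

```python
# The page title as the server sends it: UTF-8 encoded bytes
title = "百度一下，你就知道"
raw = title.encode("utf-8")

# Decoding those bytes as ISO-8859-1 maps each byte to one character,
# which is what produces the garbled text
garbled = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 gives the correct text,
# which is the effect of setting res.encoding = 'utf-8'
fixed = raw.decode("utf-8")

print(len(title), len(raw), len(garbled))
print(fixed)
```

Each of these Chinese characters takes three bytes in UTF-8, so the garbled string is three times as long as the real title.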

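Key points

(1) A GET request returns a Response object; read its text attribute to see the body as a string.
(2) To change the encoding, assign to response.encoding, for example response.encoding = 'utf-8' or 'gbk'. encoding is an ordinary attribute, so you can also read its current value:

>>> res.encoding
'utf-8'

(3) Whether the response is text or binary, the content attribute returns the body as a bytes object: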

import requests
url='http://www.baidu.com'
res = requests.get(url)
print(res.content)
print("----------")
print(res.text)
print("----------")
print(type(res))

b'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.jpg width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>\xe6\x96\xb0\xe9\x97\xbb</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>\xe5\x9c\xb0\xe5\x9b\xbe</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>\xe8\xa7\x86\xe9\xa2\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>\xe8\xb4\xb4\xe5\x90\xa7</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mnu&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>\xe7\x99\xbb\xe5\xbd\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mnu&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">\xe7\x99\xbb\xe5\xbd\x95</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">\xe6\x9b\xb4\xe5\xa4\x9a\xe4\xba\xa7\xe5\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>\xe5\x85\xb3\xe4\xba\x8e\xe7\x99\xbe\xe5\xba\xa6</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>\xe4\xbd\xbf\xe7\x94\xa8\xe7\x99\xbe\xe5\xba\xa6\xe5\x89\x8d\xe5\xbf\x85\xe8\xaf\xbb</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>\xe6\x84\x8f\xe8\xa7\x81\xe5\x8f\x8d\xe9\xa6\x88</a>&nbsp;\xe4\xba\xacICP\xe8\xaf\x81030173\xe5\x8f\xb7&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
----------
<!DOCTYPE html>
<!--STATUS OK-->
<html>
<head>
<meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge>
<meta content=always name=referrer>
<link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css>
<title>ç™¾åº¦ä¸€ä¸‹ï¼Œä½ å°±çŸ¥é“</title>
</head> <body link=#0000cc> <div id=wrapper>
<div id=head>
<div class=head_wrapper> <div class=s_form>
<div class=s_form_wrapper>
<div id=lg>
<img hidefocus=true src=//www.baidu.com/img/bd_logo1.jpg width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1>
<input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8>
<input type=hidden name=rsv_bp value=1>
<input type=hidden name=rsv_idx value=1>
<input type=hidden name=tn value=baidu>
<span class="bg s_ipt_wr">
<input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr">
<input type=submit id=su value=ç™¾åº¦ä¸€ä¸‹ class="bg s_btn">
</span>
</form>
</div>
</div>
<div id=u1>
<a href=http://news.baidu.com name=tj_trnews class=mnav>æ–°é—»</a>
<a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a>
<a href=http://map.baidu.com name=tj_trmap class=mnav>åœ°å›¾</a>
<a href=http://v.baidu.com name=tj_trvideo class=mnav>è§†é¢‘</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å§</a>
<noscript>
<a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mnu&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç™»å½•</a>
</noscript>
<script>
document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mnu&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">ç™»å½•</a>');
</script>
<a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ›´å¤šäº§å“</a>
</div>
</div>
</div>
<div id=ftCon>
<div id=ftConw>
<p id=lh>
<a href=http://home.baidu.com>å³äºŽç™¾åº¦</a>
<a href=http://ir.baidu.com>About Baidu</a>
</p>
<p id=cp>&copy;2017&nbsp;Baidu&nbsp;
<a href=http://www.baidu.com/duty/>ä½¿ç”¨ç™¾åº¦å‰å¿è¯»</a>&nbsp;
<a href=http://jianyi.baidu.com/ class=cp-feedback>æ„è§åé¦ˆ</a>
&nbsp;äº¬ICPè¯030173å·&nbsp; <img src=//www.baidu.com/img/gs.gif>
</p>
</div>
</div>
</div>
</body>
</html>
----------
<class 'requests.models.Response'>

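(4) Use the status_code attribute to check the status code returned for the request.

2. Fetching a URL with parameters

(1) Send HTTP request headers by passing the headers parameter, for example headers={'User-Agent':'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit'}. This makes it less likely that the site quickly flags you as a malicious crawler and bans your IP.
(2) Cookies

# Read a specific cookie from a response
res.cookies['cookie_name']
# Send cookies by passing them as a dict (the values here are placeholders)
cs = {'token': 'password', 'status': 'status'}
res = requests.get(url, cookies=cs)

(3) Setting a timeout

res = requests.get(url, timeout=3)  # give up if the server has not responded within 3 seconds

Note: the get method alone is usually enough to scrape relatively simple sites.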
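
To check what the headers and cookies arguments actually put into a request without contacting any server, one option is to build the request with requests.Request and inspect the prepared result. This is only a sketch: the User-Agent string is the one quoted above, and the cookie value is a made-up placeholder.

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit"
}
cookies = {"token": "abc123"}  # hypothetical cookie value, for illustration only

# Build the request without sending it; prepare() produces what would be sent
prepared = requests.Request(
    "GET", "http://www.baidu.com", headers=headers, cookies=cookies
).prepare()

print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])  # the dict is sent as a single Cookie header
```

Passing the same headers and cookies dicts to requests.get sends the equivalent request.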

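4. Common requests methods and main parameters

Method	Description
requests.request()	Builds a request; the methods below are all built on it
requests.get()	The main method for fetching an HTML page; corresponds to HTTP GET
requests.head()	Fetches a page's header information; corresponds to HTTP HEAD
requests.post()	Submits a POST request to a page; corresponds to HTTP POST
requests.put()	Submits a PUT request to a page; corresponds to HTTP PUT
requests.patch()	Submits a partial-modification request to a page; corresponds to HTTP PATCH
requests.delete()	Submits a delete request to a page; corresponds to HTTP DELETE

Parameters of requests.get():
Signature: requests.get(url, params=None, **kwargs). The common parameters introduced earlier are enough for most purposes.

# url: the URL to request
# params: extra parameters for the URL's query string; optional; passed as a dict or bytes
# **kwargs: parameters that control the request; optional
## headers, timeout, cookies, data, json, proxies, allow_redirects, stream, verify, cert, files, auth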
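
As a small illustration of the params argument, the sketch below prepares a request without sending it and prints the final URL. The /s path and the wd and ie field names are taken from the search form in the Baidu page shown earlier; the search term itself is just an example.

```python
import requests

# Query-string parameters as a dict; requests encodes and appends them to the URL
params = {"wd": "python", "ie": "utf-8"}

prepared = requests.Request("GET", "http://www.baidu.com/s", params=params).prepare()

print(prepared.url)
```

Calling requests.get("http://www.baidu.com/s", params=params) requests the same URL.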

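5. Attributes of the requests.Response object

Attribute	Description
res.status_code	The status code returned for the HTTP request; 200 means success, 404 means the page was not found
res.text	The response body as a string, that is, the content of the page at the URL
res.encoding	The encoding of the response body as guessed from the HTTP headers; set it yourself to fix garbled text
res.content	The response body in binary (bytes) form
res.apparent_encoding	The encoding inferred from the content itself; useful as a fallback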
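
The attributes above combine naturally into a small fetch helper. The following is a sketch under a few assumptions: the function name fetch_text is made up, and falling back to apparent_encoding when no charset was declared is one possible policy, not something requests does for you.

```python
import requests


def fetch_text(url, timeout=3):
    """Return the decoded page text, or None if the request fails."""
    try:
        res = requests.get(url, timeout=timeout)
        res.raise_for_status()  # raises HTTPError for 4xx and 5xx responses
    except requests.exceptions.RequestException as exc:
        # Covers timeouts, connection errors, invalid URLs, and HTTP errors
        print(f"request failed: {exc}")
        return None

    # With no charset in the headers, requests assumes ISO-8859-1 for text;
    # in that case use the encoding detected from the content instead
    if res.encoding is None or res.encoding.lower() == "iso-8859-1":
        res.encoding = res.apparent_encoding
    return res.text


# An invalid URL is rejected before any network traffic happens
print(fetch_text("not-a-url"))
```

For a reachable page such as http://www.baidu.com the helper returns the page source as a str.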

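Introduction to XPath

Overview

XPath is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document. Because the hierarchical structure of HTML closely matches that of XML, XPath can also be used to locate HTML elements.

Locating methods

1. Absolute path

As the name suggests, the XPath expression is written out level by level from the outermost HTML node down to the target element. Paths generated by browser plugins are usually absolute.
For example: /html/body/div[1]/div[2]/div[5]/div[1]/div[1]/form/span[2]/input

2. Relative path

A relative path locates an element by some of its distinguishing features. If the features are chosen well, the path stays stable across versions of the page, which makes this the first choice for automated testing.
For example: //div[@class='e']/a/p/span/text() where @ introduces an attribute and the trailing text() extracts the text between the tags

3. Index

For example: /html/body/div[1]/div[2]/div[5]/div[1]/div[1]/form/span[last()-1]/input selects the second-to-last span under form

4. Attributes

For example: //*[@id="kw" and @name='wd'] matches elements whose id attribute is kw and whose name attribute is wd

5. Other methods

There are other ways to locate elements, but they are less commonly used and are not covered here.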
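
The four locating methods can be tried side by side on a small document. The HTML below is invented for this sketch (it only borrows the kw and wd names and the div class e from the expressions above), and it uses lxml, which is introduced in the next section.

```python
from lxml import etree

# A made-up page fragment, used only to exercise the expressions
html_text = """
<html><body>
  <div class="e"><a href="/job/1"><p><span>Data Engineer</span></p></a></div>
  <form>
    <span><input id="first" name="a"></span>
    <span><input id="kw" name="wd"></span>
    <span><input id="su" name="btn"></span>
  </form>
</body></html>
"""
root = etree.HTML(html_text)

# 1. Absolute path: every level from the root is spelled out
absolute = root.xpath("/html/body/form/span[2]/input/@id")

# 2. Relative path: start anywhere (//) and match on a distinguishing feature
relative = root.xpath("//div[@class='e']/a/p/span/text()")

# 3. Index: last()-1 is the second-to-last span under form
by_index = root.xpath("//form/span[last()-1]/input/@id")

# 4. Attributes: both conditions must hold
by_attr = root.xpath("//*[@id='kw' and @name='wd']")

print(absolute, relative, by_index, [el.tag for el in by_attr])
```

The absolute path breaks as soon as an element is added above the form, while the relative and attribute-based expressions keep working, which is why they are usually preferred.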

Introduction to lxml

Import the etree module from lxml

from lxml import etree

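Basic usage

(1) etree.HTML converts an HTML string (bytes or str) into an Element object. Element objects have an xpath method, which returns a list of results.

html = etree.HTML(text)
ret_list = html.xpath("XPath expression string")

(2) The three kinds of list the xpath method can return

An empty list: the XPath expression did not match any element
A list of strings: the XPath expression matched text content or the value of an attribute
A list of Element objects: the XPath expression matched tags; each Element in the list can run further xpath queries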
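
The three cases can be seen on a short, made-up list of links. This is a minimal sketch; the markup and variable names are invented for the example.

```python
from lxml import etree

root = etree.HTML(
    "<ul><li><a href='/a'>first</a></li><li><a href='/b'>second</a></li></ul>"
)

nothing = root.xpath("//table")      # nothing matches: empty list
texts = root.xpath("//li/a/text()")  # text content: list of strings
hrefs = root.xpath("//li/a/@href")   # attribute values: list of strings
items = root.xpath("//li")           # tags: list of Element objects

# An Element can run its own query; "./" makes the path relative to it
first_link = items[0].xpath("./a/@href")

print(nothing, texts, hrefs, len(items), first_link)
```

Because xpath always returns a list, check that it is non-empty before indexing into it.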

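Notes:

(1) lxml.etree.HTML(html_str) automatically completes missing tags.

(2) lxml.etree.tostring converts an Element object back into HTML (as bytes by default).

(3) Because of this automatic completion, the parsed tree can differ from the raw page source, so a crawler that extracts data with lxml should write its XPath expressions against the output of lxml.etree.tostring.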
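
The effect of the automatic completion is easiest to see by serializing a deliberately broken fragment. The fragment below is made up for this sketch; encoding="unicode" is passed so that tostring returns a str instead of bytes.

```python
from lxml import etree

# No <html> or <body>, unclosed <li> tags, and a missing </div>
broken = "<div><ul><li class='item'>first<li class='item'>second</ul>"

root = etree.HTML(broken)
serialized = etree.tostring(root, encoding="unicode")
print(serialized)

# The absolute path must include the html and body tags that lxml added
items = root.xpath("/html/body/div/ul/li/text()")
print(items)
```

The printed markup, not the original string, is the structure your XPath expressions run against.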

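Example: scraping the first page of big-data job listings from 51job [requests + xpath]

Analysis: open the home page, search for 大数据 (big data) with the location set to Lanzhou, and inspect the page with the F12 developer tools. Scraping just the job titles and company names is enough.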


import requests
from lxml import etree

# Search results for 大数据 (big data) in Lanzhou, page 1
url = "https://search.51job.com/list/270200,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE,2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare="
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
res = requests.get(url, headers=header)
res.encoding = "gbk"  # decode the page as GBK
# print(res.text)
data = etree.HTML(res.text)  # parse the page into an HTML tree
# Job titles, located with a relative path
job_name = data.xpath("//div[@class='e']/a/p/span/text()")
# Company names, located with an absolute path
cname = data.xpath("/html/body/div[2]/div[3]/div/div[2]/div[4]/div[1]/div/div[2]/a/@title")
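
The two result lists still need to be combined into records. The helper below is only a sketch: pair_jobs is a made-up name, the sample values are invented, and it assumes the extracted titles may carry surrounding whitespace or be blank.

```python
def pair_jobs(job_names, company_names):
    """Strip whitespace, drop blank titles, and pair each title with a company."""
    cleaned = [name.strip() for name in job_names if name.strip()]
    # zip stops at the shorter list, so unmatched entries are dropped
    return list(zip(cleaned, company_names))


# Invented sample data standing in for job_name and cname
sample_titles = ["  Data Engineer\r\n", "   ", "Data Analyst"]
sample_companies = ["Company A", "Company B"]

for title, company in pair_jobs(sample_titles, sample_companies):
    print(title, "|", company)
```

With the real page you would call pair_jobs(job_name, cname). If the two lists have different lengths, the pairing may be misaligned, so it is worth comparing len(job_name) and len(cname) first.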
