Scraping Multiple Pages with a Python Crawler

2024-08-13


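A few days ago the 零组 knowledge base announced that it was shutting down. My first thought was "what a pity", and I wanted to save the material right away, only to find I had forgotten most of what I knew about crawlers. Time for a quick refresher.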

Enough talk: PyCharm, launch! I wasn't sure what to scrape, so let's just pick a random page. URL: http://www.netbian.com/index.htm

First, fetch the HTML page of the target URL.

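Press F12 and copy the request headers from the browser's developer tools; here the User-Agent alone is enough. Then set the encoding according to the charset declared in the page's meta tag (the code uses gbk).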

The code is as follows:

import requests
from lxml import etree


def get_image():
    base_url = "http://www.netbian.com/index.htm"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Mobile Safari/537.36'
    }
    # Fetch the response
    response = requests.get(base_url, headers=headers)
    response_data = response.content.decode('gbk')
    # response_code = response.status_code
    # print(response_code)
    # Save the page to disk
    with open('wall.html', 'w', encoding='gbk') as f:
        f.write(response_data)


get_image()

Open the saved wall.html locally to verify: no problems.
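The 'gbk' above is hard-coded after looking at the page's meta tag by hand. As a rough sketch of how that lookup could be automated, the snippet below scans the start of the raw bytes for a charset declaration. The helper name `detect_charset`, its `default` parameter, and the sample bytes are my own assumptions for illustration and are not part of the original post.

```python
import re


def detect_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in an HTML meta tag, or `default` if none is found."""
    # The meta tag sits in <head>, so scanning the first 2 KB is usually enough.
    head = raw[:2048].decode("ascii", errors="ignore")
    match = re.search(r'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', head, re.IGNORECASE)
    return match.group(1).lower() if match else default


# Hypothetical sample resembling a gbk-encoded page header.
sample = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=gbk" /></head></html>'
print(detect_charset(sample))
```

With this in place, the decode step could be written as `response.content.decode(detect_charset(response.content))` instead of fixing the encoding in the source.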

Without further ado, here is the complete code:

import requests
from lxml import etree


def get_image(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Mobile Safari/537.36'
        }
        # Directory where the images are saved (it must already exist)
        path = "C://Users/Administrator/Desktop/image/"
        # Fetch the response
        response = requests.get(url, headers=headers)
        response_data = response.content.decode('gbk')
        # Check whether we got a response
        # response_code = response.status_code
        # print(response_code)
        # Save the page
        # with open('wall.html', 'w', encoding='gbk') as f:
        #     f.write(response_data)
        # Parse the data
        # 1. Parse the text into an HTML tree
        parse_data = etree.HTML(response_data)
        # 2. Collect the image URLs we need into item_list
        item_list = parse_data.xpath('//div/ul/li/a/img/@src')
        # Loop over the whole list and save each image
        for item in item_list:
            final_data = requests.get(item, headers=headers).content
            # The last 7 characters of the URL serve as the file name
            with open(path + item[-7:], 'wb') as f:
                f.write(final_data)
            # print(item)
    except Exception as e:
        print('error', e)


def get_page():
    # Take the first 10 pages
    urls = ["http://www.netbian.com/index_{}.htm".format(str(i)) for i in range(1, 11)]
    # Print to verify
    # print(urls)
    return urls


if __name__ == '__main__':  # entry point
    for url in get_page():
        # Pass the page URL in explicitly instead of relying on a global variable
        get_image(url)
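One fragile spot in the code above is the file name: it is cut from the last seven characters of the image URL, so two images whose URLs end in the same seven characters would overwrite each other. A possible alternative, sketched here with only the standard library, is to take the last path segment of the URL, and to resolve relative `src` values against the page URL before downloading. The helper names `image_filename` and `absolute_url`, and the example.com URL, are hypothetical and only for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def image_filename(src: str) -> str:
    """Use the last path segment of the image URL as the file name (query string dropped)."""
    return posixpath.basename(urlsplit(src).path)


def absolute_url(page_url: str, src: str) -> str:
    """Resolve a possibly relative src against the page it was found on."""
    return urljoin(page_url, src)


# Hypothetical image URL; real ones come from the xpath result.
print(image_filename("http://img.example.com/file/2020/1013/small0a1b2c.jpg?v=1"))
print(absolute_url("http://www.netbian.com/index_2.htm", "/images/a.jpg"))
```

In the download loop this would replace `item[-7:]` with `image_filename(item)`. An already absolute `src` passes through `absolute_url` unchanged.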

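The whole process can be summarized in a few steps:
1. Get the target URL and fill in the request headers.
2. Fetch the data with urllib or requests.
3. Parse the data with regular expressions, BeautifulSoup, or XPath.
4. Save the data.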
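Step 3 lists regular expressions as one of the parsing options, while the code in this post only uses XPath. For comparison, here is a small sketch of the regex route run against an inline HTML snippet, so it works without any network access. The sample markup, the example.com URLs, and the name `extract_image_urls` are made up for illustration; a regex like this is far less robust than a real HTML parser on messy pages.

```python
import re

# Hypothetical markup shaped like the div/ul/li/a/img structure used in the xpath above.
SAMPLE_HTML = """
<div class="list"><ul>
  <li><a href="/desk/1.htm"><img src="http://img.example.com/small1.jpg" alt="one"></a></li>
  <li><a href="/desk/2.htm"><img src="http://img.example.com/small2.jpg" alt="two"></a></li>
</ul></div>
"""

# Capture the quoted value of the src attribute inside each <img ...> tag.
IMG_SRC = re.compile(r'<img[^>]*?\ssrc=["\']([^"\']+)["\']', re.IGNORECASE)


def extract_image_urls(html: str) -> list:
    """Return every img src found in the HTML text, in document order."""
    return IMG_SRC.findall(html)


for image_url in extract_image_urls(SAMPLE_HTML):
    print(image_url)
```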

P.S. I work in network security and my coding fundamentals are fairly weak, so please point out any mistakes.
