Python: Web Scraping Notes

2023-10-15

1. Learning regular expressions

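1. . matches any character except the newline "\n".
2. * matches the preceding character zero or more times.
3. ? matches the preceding character zero or one time.
4. A ? placed after + or * makes the match non-greedy, meaning it matches as little as possible. For example, *? repeats any number of times, but as few times as possible.
5. .*? matches any number of characters, using the fewest repetitions that still let the whole match succeed. For example, a.*?b matches the shortest string that starts with a and ends with b. Applied to aabab, it matches aab and then ab.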
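The difference between greedy and non-greedy matching is easiest to see side by side. The following is a minimal sketch that reuses the aabab example from the list above; the variable names are only illustrative.

```python
import re

text = "aabab"

# Greedy: .* takes as many characters as it can, so a single long match comes back.
greedy = re.findall(r"a.*b", text)

# Non-greedy: .*? stops at the first b that lets the match succeed.
lazy = re.findall(r"a.*?b", text)

print(greedy)  # ['aabab']
print(lazy)    # ['aab', 'ab']
```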

2. Import the re module and use the re.findall method

import re
a = 'abcdefg'             # source text
b = re.findall('abc', a)  # find every occurrence of abc
print(b)                  # output: ['abc']

3. Common regular expression syntax

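. : matches any character except the newline "\n". re.findall(r'.*', 'https://www.baidu.com') returns ['https://www.baidu.com', '']; the trailing empty string is a zero-length match at the end of the input.

\ : escape character.

\d : matches a digit [0-9]. re.findall(r'\d', 'https://www.baidu.com/12345.jpg') returns ['1', '2', '3', '4', '5'], and re.findall(r'\d+', 'https://www.baidu.com/12345.jpg') returns ['12345'].

\D : matches a non-digit [^\d].

\s : matches a whitespace character (space, tab, newline, and so on).

\S : matches a non-whitespace character.

\w : matches a word character (letter, digit, or underscore). a\wc matches abc.

\W : matches a non-word character [^\w]. a\Wc matches "a c".

{m} : matches the preceding character exactly m times.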

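* : matches the preceding character zero or more times (often combined with ., as in .*).

^ : matches the start of the string (and the start of each line when re.M is set). ^x matches an x at the start.

$ : matches the end of the string (and the end of each line when re.M is set). x$ matches an x at the end.

+ : matches the preceding character one or more times.

{m,n} : matches the preceding character at least m times and at most n times.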
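The character classes, quantifiers, and anchors listed above can be checked quickly with re.findall. This is a small sketch that reuses the Baidu URL from these notes; the second test string "abc a c a_c" is made up for illustration.

```python
import re

url = "https://www.baidu.com/12345.jpg"

digits_each = re.findall(r"\d", url)        # one match per digit
digits_run = re.findall(r"\d+", url)        # one match per run of digits
three_digits = re.findall(r"\d{3}", url)    # exactly three digits per match
two_to_four = re.findall(r"\d{2,4}", url)   # two to four digits, as many as possible

sample = "abc a c a_c"
word_char = re.findall(r"a\wc", sample)     # \w accepts letters, digits, underscore
non_word = re.findall(r"a\Wc", sample)      # \W accepts whatever \w rejects

starts_with_h = re.findall(r"^h", url)      # h only at the very start
ends_with_jpg = re.findall(r"jpg$", url)    # jpg only at the very end

print(digits_each, digits_run, three_digits, two_to_four)
print(word_char, non_word, starts_with_h, ends_with_jpg)
```

Note that \d{3} leaves the trailing 45 unmatched because only two digits remain after 123.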

re.I : case-insensitive matching. For example, if the pattern is abc, passing re.I as the flag makes the pattern also match ABC in the target string.
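A short sketch of the re.I flag follows. The sample string is an assumption chosen so that the effect of the flag is visible.

```python
import re

text = "ABC abc AbC"

# Without the flag only the exact lowercase form matches.
case_sensitive = re.findall(r"abc", text)

# With re.I (short for re.IGNORECASE) every casing matches.
case_insensitive = re.findall(r"abc", text, re.I)

print(case_sensitive)    # ['abc']
print(case_insensitive)  # ['ABC', 'abc', 'AbC']
```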

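Match, then replace: re.sub replaces every match of a pattern. For example, the source data 192_168_1_1 becomes 192.168.1.1 when each _ is replaced with a dot.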
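The replacement can be sketched with re.sub, which is the standard-library function for this. The notes do not show which call was originally used, so treat this as one plausible way to do it.

```python
import re

raw = "192_168_1_1"

# Replace every underscore with a literal dot.
ip = re.sub(r"_", ".", raw)

# The optional count argument limits how many replacements are made.
first_only = re.sub(r"_", ".", raw, count=1)

print(ip)          # 192.168.1.1
print(first_only)  # 192.168_1_1
```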

4. Install the requests library: run pip install requests in cmd

Test requests against Baidu:

r=requests.get('http://www.baidu.com')

r=requests.post('http://www.baidu.com')

r=requests.head('http://www.baidu.com')

r=requests.options('http://www.baidu.com'), and so on for the other HTTP methods

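A request sent by python requests carries the HTTP headers below. Anti-crawler devices block it easily, so the headers are usually overridden through the headers parameter:

GET / HTTP/1.1
Host: www.baidu.com
User-Agent: python-requests/2.24.0    (by default the UA identifies the client as Python)
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive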

r.text : the response body as text
r.status_code : the HTTP status code
r.headers : the response headers
r.cookies : the cookies set by the response

Build the HTTP headers with a headers dict (capture a request in bp, i.e. Burp Suite, and copy its headers directly):

headers={
    'Host': 'www.baidu.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding': 'gzip, deflate',
    'Referer': 'https://www.baidu.com/s?wd=1&pn=10&oq=1&tn=baiduhome_pg&ie=utf-8&usm=3&rsv_idx=2&rsv_pq=d4904744000177a2&rsv_t=c7bcm95P20AzNzt4lhg8OiiP9WsuSUC2vRAxtSyd6PCoh5XZGwPaCRYIfadc9XKmqSG8',
    'Connection': 'close'
}

Then send the HTTP request: requests.get(url=url, headers=headers)
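To see how custom headers replace the defaults without touching the network, the request can be prepared but not sent. This is a sketch, assuming the requests package is installed; the shortened header dict is illustrative, and Session.prepare_request is used here only to inspect the result.

```python
import requests

# Shortened, illustrative header set; copy the real values from a browser capture.
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0",
    "Accept-Language": "zh-CN,zh;q=0.8",
}

session = requests.Session()

# The User-Agent requests would send if nothing were overridden.
default_ua = session.headers["User-Agent"]

# Build the request without sending it, so the final headers can be inspected.
request = requests.Request("GET", "http://www.baidu.com", headers=custom_headers)
prepared = session.prepare_request(request)

print(default_ua)                      # starts with python-requests/
print(prepared.headers["User-Agent"])  # the browser-style value from custom_headers
```

Calling session.send(prepared), or simply requests.get(url, headers=custom_headers), would then send the request with these headers.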

Scraper example:

Scrape video titles from bilibili

#coding=utf-8
import re

import requests

url_start = 'http://search.bilibili.com/all?keyword=python%E5%AE%89%E5%85%A8&from_source=nav_suggest_new&page='

def lession(url):
    # Fetch one page of search results and print every video title on it
    r = requests.get(url)
    #print (r.text)
    # Non-greedy match so that each title is captured separately
    name = re.findall('<a title=(.*?) href', r.text)
    for i in name:
        print(i)

# Walk through result pages 1 to 50
for i in range(1, 51):
    url = url_start + str(i)
    lession(url)
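The title extraction and the URL construction can be tried offline before hitting the site. The sketch below makes several assumptions: SAMPLE_HTML is invented markup (the real bilibili page may be structured differently), extract_titles and build_search_url are hypothetical helper names, and the from_source parameter is dropped for brevity.

```python
import re
from urllib.parse import quote

# Hypothetical markup for illustration; the real page may differ.
SAMPLE_HTML = (
    '<li><a title="Python security basics" href="/video/1">...</a></li>'
    '<li><a title="Writing a crawler" href="/video/2">...</a></li>'
)

def extract_titles(html):
    """Return the title attribute of every <a title="..." href=...> tag."""
    # .*? is non-greedy, so each tag yields its own match even on a single line.
    return re.findall(r'<a title="(.*?)" href', html)

def build_search_url(keyword, page):
    """Percent-encode the keyword and append the page number."""
    base = "http://search.bilibili.com/all"
    return f"{base}?keyword={quote(keyword)}&page={page}"

titles = extract_titles(SAMPLE_HTML)
print(titles)
print(build_search_url("python安全", 1))
```

With a greedy (.*) the two tags in SAMPLE_HTML would collapse into one long match, which is why the non-greedy form is used here. quote produces the same python%E5%AE%89%E5%85%A8 sequence that appears hard-coded in the search URL.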

