Crawler Notes: Resumable Crawling with Scrapy (the focus of this article is the settings)

it · 2023-02-24

Crawler Notes (Resumable Crawling)

Install the Redis database

Install scrapy

Install scrapy_redis

How to install these is left to the reader.

Create the project

    scrapy startproject commit_spider
    cd commit_spider
    scrapy genspider myspider nvd.nist.gov

Here nvd.nist.gov is the root domain of the site to be crawled.

Edit settings.py

    ROBOTSTXT_OBEY = True

    # Proxy pool; a random entry is attached to each request (see middlewares.py below)
    PROXY_LIST = [
        {"ip_port": "http://211.137.52.158:8080"},
        {"ip_port": "http://111.47.154.34:53281"},
        {"ip_port": "http://183.220.145.3:80"},
        {"ip_port": "http://223.100.166.3:36945"},
        {"ip_port": "http://120.194.42.157:38185"},
        {"ip_port": "http://223.82.106.253:3128"},
        {"ip_port": "http://117.141.155.244:53281"},
        {"ip_port": "http://120.198.76.45:41443"},
        {"ip_port": "http://123.136.8.122:3128"},
        {"ip_port": "http://117.141.155.243:53281"},
        {"ip_port": "http://183.196.168.194:9000"},
        {"ip_port": "http://117.141.155.242:53281"},
        {"ip_port": "http://183.195.106.118:8118"},
        {"ip_port": "http://112.14.47.6:52024"},
        {"ip_port": "http://218.204.153.156:8080"},
        {"ip_port": "http://223.71.203.241:55443"},
        {"ip_port": "http://117.141.155.241:53281"},
        {"ip_port": "http://221.180.170.104:8080"},
        {"ip_port": "http://183.247.152.98:53281"},
        {"ip_port": "http://183.196.170.247:9000"},
    ]

    # User-Agent pool; a random entry is set on each request
    UA_LIST = [
        'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)',
        'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)',
        'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR 3.5.30729)',
        'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR 3.5.30729)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser; GTB5; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; InfoPath.1; .NET CLR 3.5.30729; .NET CLR 3.0.30618)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SV1; Acoo Browser; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; Avant Browser)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; GTB5; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; Maxthon; InfoPath.1; .NET CLR 3.5.30729; .NET CLR 3.0.30618)',
        'Mozilla/4.0 (compatible; Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR 3.5.30729); Windows NT 5.1; Trident/4.0)',
        'Mozilla/4.0 (compatible; Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB6; Acoo Browser; .NET CLR 1.1.4322; .NET CLR 2.0.50727); Windows NT 5.1; Trident/4.0; Maxthon; .NET CLR 2.0.50727; .NET CLR 1.1.4322; InfoPath.2)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser; GTB6; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; InfoPath.1; .NET CLR 3.5.30729; .NET CLR 3.0.30618)',
    ]

    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en',
    }

    # scrapy_redis: keep the dupefilter fingerprints and the request queue in
    # Redis so a stopped crawl can resume where it left off
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    SCHEDULER_PERSIST = True
    REDIS_URL = "redis://127.0.0.1:6379"

Edit middlewares.py

Add the imports at the top (note that `random` is needed as well):

    import random
    from commit_spider.settings import PROXY_LIST, UA_LIST

Then add two new classes:

    class RandomProxy(object):
        """Downloader middleware: attach a random proxy to each request."""
        def process_request(self, request, spider):
            proxy = random.choice(PROXY_LIST)
            request.meta['proxy'] = proxy['ip_port']

    class RandomUserAgent(object):
        """Downloader middleware: set a random User-Agent on each request."""
        def process_request(self, request, spider):
            ua = random.choice(UA_LIST)
            request.headers['User-Agent'] = ua
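The rotation logic in these two middlewares can be sanity-checked outside of a running crawl with a stand-in request object. This is only a sketch: the `DummyRequest` class and the shortened pools below are illustrative, not part of the project.

```python
import random

# Shortened stand-ins for the PROXY_LIST / UA_LIST defined in settings.py
PROXY_LIST = [
    {"ip_port": "http://211.137.52.158:8080"},
    {"ip_port": "http://111.47.154.34:53281"},
]
UA_LIST = [
    'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)',
]

class DummyRequest:
    """Minimal stand-in for scrapy.Request: just meta and headers dicts."""
    def __init__(self):
        self.meta = {}
        self.headers = {}

class RandomProxy:
    def process_request(self, request, spider):
        proxy = random.choice(PROXY_LIST)
        request.meta['proxy'] = proxy['ip_port']

class RandomUserAgent:
    def process_request(self, request, spider):
        ua = random.choice(UA_LIST)
        request.headers['User-Agent'] = ua

req = DummyRequest()
RandomProxy().process_request(req, spider=None)
RandomUserAgent().process_request(req, spider=None)
print(req.meta['proxy'])          # one of the ip_port values in PROXY_LIST
print(req.headers['User-Agent'])  # one of the strings in UA_LIST
```

In a real crawl, Scrapy calls `process_request` on every outgoing request, so each request leaves with a freshly chosen proxy and User-Agent.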

Enable the two classes in the downloader-middleware section of settings.py (give them distinct priority values so the ordering is deterministic):

    DOWNLOADER_MIDDLEWARES = {
        'commit_spider.middlewares.RandomProxy': 543,
        'commit_spider.middlewares.RandomUserAgent': 544,
    }

Enable the scrapy_redis pipeline in the item-pipeline section:

    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400,
    }

The settings above cover rotating proxy IPs, random User-Agent switching, and resumable crawling: Redis stores the crawl state (the request queue and the dedup fingerprints), so an interrupted crawl can pick up where it stopped.
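For intuition, the resumability rests on keeping request fingerprints in a Redis set that outlives the crawler process. The sketch below mimics that idea with a plain Python set and SHA-1 hashes; the fingerprint function is deliberately simplified and is not Scrapy's exact algorithm.

```python
import hashlib

seen = set()  # in scrapy_redis this is a Redis set, so it survives restarts

def fingerprint(method, url):
    """Simplified request fingerprint: hash of HTTP method + URL."""
    return hashlib.sha1(f"{method} {url}".encode()).hexdigest()

def request_seen(method, url):
    """Return True if this request was already scheduled, else record it."""
    fp = fingerprint(method, url)
    if fp in seen:
        return True
    seen.add(fp)
    return False

print(request_seen("GET", "https://nvd.nist.gov/vuln/full-listing"))  # False: first time
print(request_seen("GET", "https://nvd.nist.gov/vuln/full-listing"))  # True: duplicate skipped
```

Because the real fingerprint set lives in Redis rather than in process memory, killing the spider and restarting it leaves the "already crawled" record intact, which is what makes the crawl resumable.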

Next, write the spider code in myspider.py.

The spider code itself is just parsing pages with response.xpath or BeautifulSoup; the details are routine and well covered by tutorials online.
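As a minimal illustration of that extract-and-collect pattern, the sketch below parses hypothetical listing markup with the standard library's html.parser. The sample HTML is invented for the example; real NVD pages differ, and in the actual spider you would use `response.xpath(...)` or BeautifulSoup instead.

```python
from html.parser import HTMLParser

# Hypothetical listing markup, standing in for a downloaded page
SAMPLE = """
<table>
  <tr><td><a href="/vuln/detail/CVE-2023-0001">CVE-2023-0001</a></td></tr>
  <tr><td><a href="/vuln/detail/CVE-2023-0002">CVE-2023-0002</a></td></tr>
</table>
"""

class CveLinkParser(HTMLParser):
    """Collect the text of every <a> tag, roughly what //a/text() would select."""
    def __init__(self):
        super().__init__()
        self._in_a = False
        self.cves = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_a = False

    def handle_data(self, data):
        if self._in_a and data.strip():
            self.cves.append(data.strip())

parser = CveLinkParser()
parser.feed(SAMPLE)
print(parser.cves)  # ['CVE-2023-0001', 'CVE-2023-0002']
```

In the real spider the same loop would end with `yield item` (or `yield scrapy.Request(...)` for follow-up pages), letting the RedisPipeline configured above persist the results.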
