Introduction to Scrapy's CrawlSpider Class

2024-07-18

Overview:

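CrawlSpider is a subclass of Spider.

The Spider class is designed to crawl only the pages listed in start_urls; CrawlSpider lets us extract and follow links according to URL rules, which makes it possible to crawl an entire site.

CrawlSpider is the most commonly used Spider class for crawling regular websites.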

Attributes and methods added by CrawlSpider:

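rules: the attribute that holds the crawling rules

parse_start_url(): an overridable method, called for the responses to the URLs in start_urls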

The rules attribute:

The crawling-rules attribute is a tuple (or list) containing one or more Rule objects.

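Each Rule defines one behavior for crawling the site. CrawlSpider reads every Rule in rules and applies it to each response; if several rules match the same link, the first one in the order they are defined is used.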
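To make the rule-processing step concrete, here is a simplified pure-Python model of how a sequence of rules could be applied to the links found on a page. It is only an illustration, not Scrapy's actual implementation: the SimpleRule class, the assign_links function, and the example.com URLs are hypothetical, and plain regular expressions stand in for real link extractors. It mirrors one documented behavior: when several rules match the same link, the first rule in definition order wins.

```python
import re
from dataclasses import dataclass


@dataclass
class SimpleRule:
    """Hypothetical stand-in for scrapy's Rule: a URL pattern plus a callback name."""
    allow: str       # regular expression a link must match
    callback: str    # name of the method that would parse the page


def assign_links(links, rules):
    """Map each link to the callback of the first rule whose pattern matches it."""
    assigned = {}
    for rule in rules:                  # rules are tried in definition order
        for link in links:
            if link in assigned:        # an earlier rule already claimed this link
                continue
            if re.search(rule.allow, link):
                assigned[link] = rule.callback
    return assigned


rules = (
    SimpleRule(allow=r"/book/\d+_\d+\.html", callback="parse_list"),
    SimpleRule(allow=r"/book/", callback="parse_other"),
)
links = [
    "https://example.com/book/1617_2.html",
    "https://example.com/book/about.html",
    "https://example.com/news/1.html",
]
result = assign_links(links, rules)
```

In this model the first link is claimed by the first rule even though the second rule would also match it, and the last link is not followed at all because no rule matches.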

Rule definition and common parameters:

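link_extractor: the link extractor, a LinkExtractor object that defines which links are extracted from each crawled page and therefore which pages the rule applies to.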

Example of crawling multiple pages of a site: https://www.dushu.com/book/1617.html

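    rules = (
        Rule(LinkExtractor(allow=r'/book/1617_\d+\.html'), callback='parse_item', follow=True),
    )

Here allow=r'/book/1617_\d+\.html' is a regular expression that matches the paginated URLs such as /book/1617_2.html, so every page is fetched. \d+ matches a page number of one or more digits, and \. matches a literal dot. Each matched page is passed to parse_item, and follow=True tells the spider to keep extracting links from those pages as well.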
