Bug-Scrapy-Filtered_offsite_request

While writing a spider I ran into a bug. The log message was: Filtered offsite request to 'mydomain'.
I found the answer online and fixed the problem, so I am writing it down here.

Check allowed_domains

First check whether the URLs you are crawling actually match your configured domains. If the site you want to crawl is at https://www.example.com/1.html, then example.com needs to be in the spider's allowed_domains list:

allowed_domains = ['example.com']

Here is how the official Scrapy documentation explains it.

allowed_domains
An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list (or their subdomains) won’t be followed if OffsiteMiddleware is enabled.

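In other words, when offsite filtering is enabled, every URL about to be added to the crawl queue must belong to one of the domains in allowed_domains (or to a subdomain of one). Otherwise the request is dropped.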
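To make the matching rule concrete, here is a minimal sketch that uses only the standard library. It is not Scrapy's actual implementation; the helper name is_allowed is made up for illustration, and it assumes the rule is simply "the host equals an allowed domain or is a subdomain of it", as the documentation quoted above describes.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one.

    Hypothetical helper for illustration only, not part of Scrapy.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


allowed_domains = ["example.com"]

# Exact domain and subdomains pass the check.
print(is_allowed("https://example.com/", allowed_domains))
print(is_allowed("https://www.example.com/1.html", allowed_domains))

# A one-letter typo in the host (or in allowed_domains) fails it,
# which is the situation that produces "Filtered offsite request".
print(is_allowed("https://www.exampel.com/1.html", allowed_domains))
```

Running a quick check like this against your start URLs is a cheap way to catch a mistyped domain before starting a crawl.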

Set dont_filter=True

You can also tell Scrapy to stop filtering a request altogether by passing dont_filter=True when you create the new Request, for example:

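from scrapy import Request

def parse(self, response):
    # dont_filter=True exempts this request from offsite and duplicate filtering
    yield Request(url='https://example.com', dont_filter=True, callback=self.parse2)

def parse2(self, response):
    pass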

Here is how the official Scrapy documentation explains it:

dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.

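This approach is heavy-handed: the request skips filtering entirely, including the duplicate filter. Any URL requested this way is fetched again every time it is encountered, so if pages link back to each other the spider can get stuck in a loop. Avoid this brute-force option where possible and fix allowed_domains instead. The loop risk is one reason Scrapy sets dont_filter to False by default.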
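The loop risk can be illustrated with a toy crawler. This is only a sketch with a hand-written link graph and a plain set as the duplicate filter; the URLs, the LINKS table, and the crawl function are all made up for illustration and do not reflect how Scrapy's scheduler is implemented.

```python
from collections import deque

# Toy link graph: two pages that link to each other (hypothetical URLs).
LINKS = {
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/a"],
}


def crawl(start, dont_filter, max_fetches=10):
    """Return the list of fetched URLs, capped at max_fetches as a safety stop."""
    seen = set()
    queue = deque([start])
    fetched = []
    while queue and len(fetched) < max_fetches:
        url = queue.popleft()
        if not dont_filter:
            if url in seen:
                continue  # duplicate request: dropped by the filter
            seen.add(url)
        fetched.append(url)
        queue.extend(LINKS.get(url, []))
    return fetched


# With the duplicate filter, each page is fetched once and the crawl ends.
print(len(crawl("https://example.com/a", dont_filter=False)))

# Without it, the two pages keep re-queuing each other until the cap is hit.
print(len(crawl("https://example.com/a", dont_filter=True)))
```

Only the max_fetches cap stops the second crawl, and a real spider has no such cap unless you configure one.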

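PS: Be careful. In my case I typed the domain from memory, and the site's real domain differed from what I had set by a single letter, which is what caused the error.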