
Python Web Scraping from Beginner to Master: (36) Implementing Deep Crawling with CrawlSpider (Python涛哥)

Posted: 2020-10-07 02:11:58


Let's look at how to implement deep crawling with CrawlSpider, following list pages down into their detail pages.

The target is the 阳光热线 (Sunshine Hotline) board: we'll scrape each post's title, status, and detail-page content. The list page path:

/political/index/politicsNewest?id=1&type=4&page=

Create the CrawlSpider project

```
scrapy startproject sunPro
cd sunPro
scrapy genspider -t crawl sun
```

Then adjust the configuration file (settings.py) as usual.
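The post doesn't spell these changes out; below is a minimal sketch of the tweaks this kind of tutorial usually applies. The setting names are standard Scrapy settings, but the values are assumptions, not the author's exact file:

```python
# settings.py (sketch; values assumed)
USER_AGENT = 'Mozilla/5.0'   # spoof a browser UA so the site serves normal pages
ROBOTSTXT_OBEY = False       # don't let robots.txt block the crawl
LOG_LEVEL = 'ERROR'          # keep console output readable

ITEM_PIPELINES = {
    'sunPro.pipelines.SunproPipeline': 300,   # enable the pipeline defined below
}
```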

Page parsing

Extracting the pagination links

The site paginates its posts across many pages, so let's first extract the pagination links.

The pattern in the page URLs is easy to spot, so we write a regex for it:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SunSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['']
    start_urls = ['/political/index/politicsNewest?id=1&type=4&page=']

    # extract the pagination links
    link = LinkExtractor(allow=r'id=1&page=\d+')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response)
```
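LinkExtractor's allow pattern is an ordinary regular expression searched against each candidate URL. As a quick standalone sanity check (the sample hrefs here are hypothetical, for illustration only):

```python
import re

PAGE_RE = re.compile(r'id=1&page=\d+')

links = [
    '/political/index/politicsNewest?id=1&page=2',   # hypothetical pagination href
    '/political/index/politicsNewest?id=1&page=3',
    '/political/politics/index?id=123456',           # a detail link: should not match
]
print([url for url in links if PAGE_RE.search(url)])
# ['/political/index/politicsNewest?id=1&page=2', '/political/index/politicsNewest?id=1&page=3']
```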

Our focus here is deep crawling, so for the rest of the example we'll stick to a single page by setting follow=False.
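For reference, the difference the flag makes (a comment-only sketch of the same Rule):

```python
rules = (
    # follow=True : the LinkExtractor is also applied to every page it discovers,
    #               so pagination is followed recursively through the whole board
    # follow=False: links are only extracted from the start page (a single hop)
    Rule(link, callback='parse_item', follow=False),
)
```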

Data parsing

Let's extract each post's title, detail-page URL, and status from the current page:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from sunPro.items import SunproItem


class SunSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['']
    start_urls = ['/political/index/politicsNewest?id=1&type=4&page=']

    # extract the pagination links
    link = LinkExtractor(allow=r'id=1&page=\d+')
    rules = (
        Rule(link, callback='parse_item', follow=False),
    )

    # parse the list page
    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for li in li_list:
            title = li.xpath('./span[3]/a/text()').extract_first()
            detail_url = '' + li.xpath('./span[3]/a/@href').extract_first()  # domain prefix elided in the source
            status = li.xpath('./span[2]/text()').extract_first()

            # store the fields in an item for the pipeline
            # (detail_url stays a local variable: SunproItem declares no such field)
            item = SunproItem()
            item['title'] = title
            item['status'] = status
```

**Sending requests manually**

Now we parse the detail-page data by sending requests manually:

```python
    # parse the list page
    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for li in li_list:
            title = li.xpath('./span[3]/a/text()').extract_first()
            detail_url = '' + li.xpath('./span[3]/a/@href').extract_first()
            status = li.xpath('./span[2]/text()').extract_first()

            # store the fields in an item for the pipeline
            item = SunproItem()
            item['title'] = title
            item['status'] = status

            # manually request the detail page, carrying the item along in meta
            yield scrapy.Request(url=detail_url, callback=self.parse_detail,
                                 meta={'item': item})

    # parse the detail page
    def parse_detail(self, response):
        content = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre/text()').extract_first()
        item = response.meta['item']
        item['content'] = content
        yield item
```
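Passing the item through meta is the classic hand-off. If you're on Scrapy 1.7 or newer, the same idea can be written with cb_kwargs, which delivers the item as a named callback argument. A minimal sketch, not the approach used in this post:

```python
    # inside parse_item: pass the item as a callback keyword argument
    yield scrapy.Request(url=detail_url, callback=self.parse_detail,
                         cb_kwargs={'item': item})

    # the callback then receives it as a regular parameter
    def parse_detail(self, response, item):
        content = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre/text()').extract_first()
        item['content'] = content
        yield item
```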

Run it, and we've captured all the data.

The complete code:

sun.py

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from sunPro.items import SunproItem


class SunSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['']
    start_urls = ['/political/index/politicsNewest?id=1&type=4&page=']

    # extract the pagination links
    link = LinkExtractor(allow=r'id=1&page=\d+')
    rules = (
        Rule(link, callback='parse_item', follow=False),
    )

    # parse the list page
    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for li in li_list:
            title = li.xpath('./span[3]/a/text()').extract_first()
            detail_url = '' + li.xpath('./span[3]/a/@href').extract_first()  # domain prefix elided in the source
            status = li.xpath('./span[2]/text()').extract_first()

            # store the fields in an item for the pipeline
            item = SunproItem()
            item['title'] = title
            item['status'] = status

            # manually request the detail page, carrying the item along in meta
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})

    # parse the detail page
    def parse_detail(self, response):
        content = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre/text()').extract_first()
        item = response.meta['item']
        item['content'] = content
        yield item
```
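Run the spider from inside the project directory:

```
scrapy crawl sun
```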

items.py

```python
import scrapy


class SunproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    status = scrapy.Field()
    content = scrapy.Field()
```
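One gotcha worth remembering: a scrapy.Item only accepts keys that are declared as Field objects, which is why detail_url is kept as a plain variable in the spider rather than stored on the item. For example:

```python
item = SunproItem()
item['title'] = 'some title'    # fine: declared above
item['detail_url'] = '/a/b/c'   # KeyError: SunproItem does not support field: detail_url
```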

pipelines.py

```python
class SunproPipeline:
    def process_item(self, item, spider):
        print(item)
        return item
```
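The pipeline above only prints each item. As an illustration (not part of the original post), persisting to a text file follows the usual open_spider/close_spider hooks; the class and file name here are assumptions:

```python
class SunproFilePipeline:
    """Hypothetical variant that writes items to a local file."""
    fp = None

    def open_spider(self, spider):
        # runs once when the spider starts
        self.fp = open('sun.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write('{}\t{}\n{}\n\n'.format(item['title'], item['status'], item['content']))
        return item   # pass the item on to any later pipeline

    def close_spider(self, spider):
        # runs once when the spider finishes
        self.fp.close()
```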

settings.py

Omitted. By now you should be able to configure it proficiently on your own!

Summary

Deep crawling implemented with CrawlSpider.

The general recipe: CrawlSpider rules for the list/pagination pages, plus manual Spider-style requests for the detail pages.

Follow Python涛哥 to learn more Python!
