
Python Web Scraping in Practice: Scraping Job Listings from Boss直聘


Project: scrape job listings from Boss直聘, store them in a database, and finally present the results with visualizations.


0 Environment Setup

MacBook Air (13-inch, )

CPU: 1.8 GHz Intel Core i5

RAM: 8 GB 1600 MHz DDR3

IDE: Anaconda 3.6 | Jupyter Notebook

Python version: Python 3.6.5 :: Anaconda, Inc.

1 Installing Scrapy

The full procedure is in the reference link; below I only note where my setup diverged from it.

pip install scrapy

Hit an error that gcc could not be invoked. Solution: macOS automatically pops up a prompt offering to install gcc; just click "Install".

The install succeeded, but during it the terminal printed "distributed 1.21.8 requires msgpack, which is not installed."

Solution:

conda install -c anaconda msgpack-python

pip install msgpack

2 Creating the Project

scrapy startproject www_zhipin_com

You can run scrapy -h to see the available commands.

Project file layout:
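For reference, the layout that scrapy startproject www_zhipin_com generates looks roughly like this (the exact files vary slightly between Scrapy versions):

www_zhipin_com/
├── scrapy.cfg              # deploy configuration
└── www_zhipin_com/         # the project's Python module
    ├── __init__.py
    ├── items.py            # item definitions (step 3)
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/            # spider code goes here (step 5)
        └── __init__.py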

The tree command is quite handy; Windows cmd ships with one, but Python has no built-in equivalent. You can look up example code online and write one yourself for fun, as sketched below.
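A minimal sketch of such a tree printer using only the standard library (print_tree is just an illustrative name, not from the original post):

import os

def print_tree(root, indent=""):
    # Recursively print a directory's contents, similar to the `tree` command.
    entries = sorted(os.listdir(root))
    for i, name in enumerate(entries):
        last = (i == len(entries) - 1)
        print(indent + ("└── " if last else "├── ") + name)
        path = os.path.join(root, name)
        if os.path.isdir(path):
            print_tree(path, indent + ("    " if last else "│   "))

print_tree("www_zhipin_com")  # run from the directory that contains the project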

3 Defining the Item to Scrape

Essentially the same as the original source:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# /en/latest/topics/items.html

import scrapy


class WwwZhipinComItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pid = scrapy.Field()
    positionName = scrapy.Field()
    positionLables = scrapy.Field()
    city = scrapy.Field()
    experience = scrapy.Field()
    educational = scrapy.Field()
    salary = scrapy.Field()
    company = scrapy.Field()
    industryField = scrapy.Field()
    financeStage = scrapy.Field()
    companySize = scrapy.Field()
    time = scrapy.Field()
    updated_at = scrapy.Field()

4 Analyzing the Page

The page has since been redesigned, and the publish-time markup has changed slightly. The HTML structure of the listing page is outlined below.
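(The original post showed screenshots here. The outline below is reconstructed from the CSS selectors used in the spider in step 5, so the class names and nesting are approximate rather than copied from the live page.)

div.job-list > ul > li                 # one <li> per job listing
    div.job-primary
        div.info-primary
            h3 > a                     # data-jobid attribute -> pid
                div.job-title          # -> positionName (changed by the redesign)
                span                   # -> salary
            p                          # -> city / experience / educational (three text nodes)
        div.info-company
            div.company-text
                h3 > a                 # -> company
                p                      # -> industryField / financeStage / companySize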

5 Spider Code

I didn't fully understand parts of this step, but I pushed ahead anyway and kept notes on what I didn't get.

5.1 About request headers

For example, I could not find some of the header fields from the original post in my own browser, such as x-devtools-emulate-network-conditions-client-id and postman-token.

I should study request headers properly. For now my approach is to copy the author's headers, replace the values I can find in my own browser, and keep the author's originals (such as the x-devtools one) for the rest.
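For context, here is a minimal sketch of how such a headers dict ends up on an outgoing request in Scrapy (the URL and the abbreviated header values below are placeholders; the full spider in step 5 passes its own headers dict the same way):

import scrapy

headers = {
    'referer': "https://www.zhipin.com/job_detail/?query=python&scity=101010100",
    'user-agent': "Mozilla/5.0 ...",  # abbreviated; copy the real value from your browser
}

# Scrapy attaches the dict to the HTTP request it sends out.
req = scrapy.Request("https://www.zhipin.com/job_detail/?query=python&scity=101010100",
                     headers=headers)
print(req.headers.get('user-agent'))  # -> b'Mozilla/5.0 ...'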

5.2 extract_first() vs extract()

The difference between extract_first() and extract(): .extract() returns every match as a list; .extract_first() returns only the first match, as a string.

Selectors pick data out of the page using CSS expressions (CSS is the more common choice). response.selector.css('title::text') selects the text content of the title element. Because selector.css is used so often, a css shortcut is defined directly on the response, so it can also be written as response.css('title::text').
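A small, self-contained sketch of the difference (the HTML snippet is made up for illustration):

from scrapy.selector import Selector

html = "<ul><li class='tag'>Python</li><li class='tag'>Scrapy</li></ul>"
sel = Selector(text=html)

print(sel.css('li.tag::text').extract())        # ['Python', 'Scrapy'] -- a list of all matches
print(sel.css('li.tag::text').extract_first())  # 'Python' -- the first match as a string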

Run the spider, and an item.json file containing the scraped data is generated in the project directory:

scrapy crawl zhipin -o item.json

After debugging the last error, step 5 finally ran end to end (the screenshot in the original post shows Python positions being scraped from Boss直聘).

The records saved to the JSON file look a bit odd: no Chinese characters show up. Step 6 should take care of that:

{"pid": "23056497", "positionName": "", "salary": "8k-9k", "city": "北京", "experience": "不限", "educational": "本科", "company": "今日头条", "positionLables": [], "time": "发布于07月12日", "updated_at": "-07-17 00:04:05"},

{"pid": "23066797", "positionName": "", "salary": "18k-25k", "city": "北京", "experience": "1-3年", "educational": "本科", "company": "天下秀", "positionLables": [], "time": "发布于07月13日", "updated_at": "-07-17 00:04:05"},

Because of the page redesign, the publish-time (time) part of step 5 needs a small tweak; everything else works as-is. Here is my source code:

# -07-17
# Author limingxuan
# limx@
# blog: /p/a5907362ba72

import scrapy
import time
from www_zhipin_com.items import WwwZhipinComItem


class ZhipinSpider(scrapy.Spider):
    name = 'zhipin'
    allowed_domains = ['www.zhipin.com']
    start_urls = ['https://www.zhipin.com/']
    positionUrl = 'https://www.zhipin.com/job_detail/?query=python&scity=101010100'
    curPage = 1

    # My browser does not show some of the fields from the original author's headers, e.g.:
    #   x-devtools-emulate-network-conditions-client-id
    #   upgrade-insecure-requests
    #   dnt
    #   cache-control
    #   postman-token
    # so I left them out and filled in what my own browser reports;
    # so far it still seems to run fine.
    headers = {
        'accept': "application/json, text/javascript, */*; q=0.01",
        'accept-encoding': "gzip, deflate, br",
        'accept-language': "zh-CN,zh;q=0.9,en;q=0.8",
        'content-type': "application/x-www-form-urlencoded; charset=UTF-8",
        'cookie': "JSESSIONID=; __c=1530137184; sid=sem_pz_bdpc_dasou_title; __g=sem_pz_bdpc_dasou_title; __l=r=https%3A%2F%%2Fgongsi%2F5189f3fadb73e42f1HN40t8~.html&l=%%2Fgongsir%2F5189f3fadb73e42f1HN40t8~.html%3Fka%3Dcompany-jobs&g=%%2F%3Fsid%3Dsem_pz_bdpc_dasou_title; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1531150234,1531231870,1531573701,1531741316; lastCity=101010100; toUrl=https%3A%2F%%2Fjob_detail%2F%3Fquery%3Dpython%26scity%3D101010100; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1531743361; __a=26651524.1530136298.1530136298.1530137184.286.2.285.199",
        'origin': "https://www.zhipin.com",
        'referer': "https://www.zhipin.com/job_detail/?query=python&scity=101010100",
        'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
    }

    def start_requests(self):
        return [self.next_request()]

    def parse(self, response):
        print("request -> " + response.url)
        job_list = response.css('div.job-list > ul > li')
        for job in job_list:
            item = WwwZhipinComItem()
            job_primary = job.css('div.job-primary')
            item['pid'] = job.css(
                'div.info-primary > h3 > a::attr(data-jobid)').extract_first().strip()
            # job-title differs from the original source here because the page was redesigned
            item['positionName'] = job_primary.css(
                'div.info-primary > h3 > a > div.job-title::text').extract_first().strip()
            item['salary'] = job_primary.css(
                'div.info-primary > h3 > a > span::text').extract_first().strip()
            # .extract() returns every match as a list
            # .extract_first() returns the first match as a string
            info_primary = job_primary.css(
                'div.info-primary > p::text').extract()
            item['city'] = info_primary[0].strip()
            item['experience'] = info_primary[1].strip()
            item['educational'] = info_primary[2].strip()
            item['company'] = job_primary.css(
                'div.info-company > div.company-text > h3 > a::text').extract_first().strip()
            company_infos = job_primary.css(
                'div.info-company > div.company-text > p::text').extract()
            if len(company_infos) == 3:
                item['industryField'] = company_infos[0].strip()
                item['financeStage'] = company_infos[1].strip()
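The source listing breaks off at this point in the repost. Based on the fields declared in WwwZhipinComItem and the curPage / next_request references above, the remainder of parse and the pagination helper might look roughly like this; the job-tags and info-publis selectors, the page parameter, and the 10-page cap are my assumptions, not the original author's code:

                # ...continuing the if-block and the for-loop inside parse()...
                item['companySize'] = company_infos[2].strip()
            item['positionLables'] = job.css(
                'div.job-tags > span::text').extract()
            item['time'] = job.css(
                'div.info-publis > p::text').extract_first().strip()
            item['updated_at'] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
            yield item

        # Follow pagination up to an assumed cap of 10 pages.
        if self.curPage < 10:
            self.curPage += 1
            yield self.next_request()

    def next_request(self):
        # Build the request for the current page, re-using the headers above.
        return scrapy.Request(
            self.positionUrl + "&page=%d" % self.curPage,
            headers=self.headers,
            callback=self.parse)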
