Python网络爬虫从入门到实践, Chapter 5 (Part 1)

Posted: 2022-11-26 05:39:18

Reference: the re module documentation at /3/library/re.html

Code:

import re

line = "Fat cats are smarter than dogs, is it right?"
m = re.match(r'(.*) are (.*?) ', line)
print("Match result: ", m)
print("Match span: ", m.span())
print('Entire match:', m.group(0))
print('First group:', m.group(1))
print('Second group:', m.group(2))
print('All groups:', m.groups())

Output:

C:\ProgramData\Anaconda3\python.exe E:/python/work/ana_test/test.py
Match result:  <re.Match object; span=(0, 21), match='Fat cats are smarter '>
Match span:  (0, 21)
Entire match: Fat cats are smarter 
First group: Fat cats
Second group: smarter
All groups: ('Fat cats', 'smarter')
Process finished with exit code 0
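
The pattern above mixes a greedy group (.*) with a lazy group (.*?), which is why the match stops right after "smarter". As a minimal sketch of the difference between the two quantifiers (example string chosen for illustration, not from the book):

import re

text = "aaa bbb ccc"
# Greedy: .* consumes as much as possible, so the group runs up to the last space
print(re.match(r'(.*) ', text).group(1))   # aaa bbb
# Lazy: .*? consumes as little as possible, so the group stops at the first space
print(re.match(r'(.*?) ', text).group(1))  # aaa

Note that re.match() returns None when the pattern does not match, so real code should check the result before calling m.group().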

Greedy matching: quantifiers such as + are greedy by default and consume as many characters as possible.

import re

m_match = re.match('[0-9]+', '12345 is the first number, 23456 is the sencond')
m_search = re.search('[0-9]+', 'The first number is 12345, 23456 is the sencond')
m_findall = re.findall('[0-9]+', '12345 is the first number, 23456 is the sencond')
print(m_match.group())
print(m_search.group())
print(m_findall)

Output:

C:\ProgramData\Anaconda3\python.exe E:/python/work/ana_test/test.py
12345
12345
['12345', '23456']
Process finished with exit code 0
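
As the output shows, re.match() succeeds here only because the string begins with digits, re.search() finds the first occurrence anywhere in the string, and re.findall() returns every non-overlapping occurrence as a list. A minimal sketch of the contrast (example string chosen for illustration):

import re

s = 'The first number is 12345'
print(re.match('[0-9]+', s))            # None: match() anchors at the start of the string
print(re.search('[0-9]+', s).group())   # 12345: search() scans the whole string
print(re.findall('[0-9]+', s))          # ['12345']: all occurrences, as a list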

Code:

import requests
import re

link = "/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/1201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)
html = r.text
# Extract every post title: the lazy (.*?) group captures the text of each <a> inside an <h1 class="post-title">
title_list = re.findall('<h1 class="post-title"><a href=.*?>(.*?)</a></h1>', html)
print(title_list)

Output:

['第四章 &#8211; 4.3 通过selenium 模拟浏览器抓取', '第四章 &#8211; 4.2 解析真实地址抓取', '第四章- 动态网页抓取 (解析真实地址 + selenium)', 'Hello world!']
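
Because the regular expression runs against the raw HTML source, the en dash in the titles comes back as the character reference &#8211; (compare the BeautifulSoup output below, where it is already decoded to –). If needed, the standard library can decode such references; a minimal sketch:

import html

title = '第四章 &#8211; 4.3 通过selenium 模拟浏览器抓取'
print(html.unescape(title))  # 第四章 – 4.3 通过selenium 模拟浏览器抓取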

Code:

import requests
from bs4 import BeautifulSoup

link = "/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/1201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)
soup = BeautifulSoup(r.text, "lxml")
# Grab the first post title, then loop over all of them
first_title = soup.find("h1", class_="post-title").a.text.strip()
print("The title of the first article is:", first_title)
title_list = soup.find_all("h1", class_="post-title")
for i in range(len(title_list)):
    title = title_list[i].a.text.strip()
    print('The title of article %s is: %s' % (i+1, title))

Output:

C:\ProgramData\Anaconda3\python.exe E:/python/work/ana_test/test.py
The title of the first article is: 第四章 – 4.3 通过selenium 模拟浏览器抓取
The title of article 1 is: 第四章 – 4.3 通过selenium 模拟浏览器抓取
The title of article 2 is: 第四章 – 4.2 解析真实地址抓取
The title of article 3 is: 第四章- 动态网页抓取 (解析真实地址 + selenium)
The title of article 4 is: Hello world!
Process finished with exit code 0
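
Indexing with range(len(title_list)) works, but enumerate() expresses the same loop more idiomatically; an equivalent rewrite of the loop above (same output, continuing from that listing):

# Equivalent to the range(len(...)) loop, using enumerate with a 1-based counter
for i, h1 in enumerate(title_list, start=1):
    print('The title of article %s is: %s' % (i, h1.a.text.strip()))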

Code:

import re
from bs4 import BeautifulSoup

html = """
<html>
<body>
<header id="header">
<h1 id="name">唐松Santos</h1>
<div class="sns">
<a href="/feed/" rel="nofollow" target="_blank" title="RSS"><i aria-hidden="true" class="fa fa-rss"></i></a>
<a href="/santostang" rel="nofollow" target="_blank" title="Weibo"><i aria-hidden="true" class="fa fa-weibo"></i></a>
<a href="/in/santostang" rel="nofollow" target="_blank" title="Linkedin"><i aria-hidden="true" class="fa fa-linkedin"></i></a>
<a href="mailto:tangsongsky@" rel="nofollow" target="_blank" title="envelope"><i aria-hidden="true" class="fa fa-envelope"></i></a>
</div>
<div class="nav">
<ul>
<li><a href="/">首页</a></li>
<li><a href="/sample-page/">关于我</a></li>
<li><a href="/python%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab%e4%bb%a3%e7%a0%81/">爬虫书代码</a></li>
<li><a href="/%e5%8a%a0%e6%88%91%e5%be%ae%e4%bf%a1/">加我微信</a></li>
<li><a href="https://santostang.github.io/">EnglishSite</a></li>
</ul>
</div>
</header>
</body>
</html>
"""

soup = BeautifulSoup(html, "lxml")
print(soup.prettify())
print('----------- 1 -----------\n')
print(soup.header.h1)
print('----------- 2 -----------\n')
print(soup.header.div.contents)
print('----------- 3 -----------\n')
print(soup.header.div.contents[1])
print('----------- 4 -----------\n')
for child in soup.header.div.children:
    print(child)
print('----------- 5 -----------\n')
for child in soup.header.div.descendants:
    print(child)
print('----------- 6 -----------\n')
a_tag = soup.header.div.a
a_tag.parent
print('----------- 7 -----------\n')
soup.find_all('div', class_='sns')
print('----------- 8 -----------\n')
for tag in soup.find_all(re.compile("^h")):
    print(tag.name)
print('----------- 9 -----------\n')
soup.select("header h1")
print('----------- 10 -----------\n')
print(soup.select("header > h1"))
print('----------- 11 -----------\n')
print(soup.select("div > a"))
print('----------- 12 -----------\n')
print(soup.select('a[href^="/"]'))
print('----------- 13 -----------\n')

Output:

C:\ProgramData\Anaconda3\python.exe E:/python/work/ana_test/test.py
<html>
 <body>
  <header id="header">
   <h1 id="name">
    唐松Santos
   </h1>
   <div class="sns">
    <a href="/feed/" rel="nofollow" target="_blank" title="RSS">
     <i aria-hidden="true" class="fa fa-rss">
     </i>
    </a>
    <a href="/santostang" rel="nofollow" target="_blank" title="Weibo">
     <i aria-hidden="true" class="fa fa-weibo">
     </i>
    </a>
    <a href="/in/santostang" rel="nofollow" target="_blank" title="Linkedin">
     <i aria-hidden="true" class="fa fa-linkedin">
     </i>
    </a>
    <a href="mailto:tangsongsky@" rel="nofollow" target="_blank" title="envelope">
     <i aria-hidden="true" class="fa fa-envelope">
     </i>
    </a>
   </div>
   <div class="nav">
    <ul>
     <li>
      <a href="/">
       首页
      </a>
     </li>
     <li>
      <a href="/sample-page/">
       关于我
      </a>
     </li>
     <li>
      <a href="/python%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab%e4%bb%a3%e7%a0%81/">
       爬虫书代码
      </a>
     </li>
     <li>
      <a href="/%e5%8a%a0%e6%88%91%e5%be%ae%e4%bf%a1/">
       加我微信
      </a>
     </li>
     <li>
      <a href="https://santostang.github.io/">
       EnglishSite
      </a>
     </li>
    </ul>
   </div>
  </header>
 </body>
</html>
----------- 1 -----------

<h1 id="name">唐松Santos</h1>
----------- 2 -----------

['\n', <a href="/feed/" rel="nofollow" target="_blank" title="RSS"><i aria-hidden="true" class="fa fa-rss"></i></a>, '\n', <a href="/santostang" rel="nofollow" target="_blank" title="Weibo"><i aria-hidden="true" class="fa fa-weibo"></i></a>, '\n', <a href="/in/santostang" rel="nofollow" target="_blank" title="Linkedin"><i aria-hidden="true" class="fa fa-linkedin"></i></a>, '\n', <a href="mailto:tangsongsky@" rel="nofollow" target="_blank" title="envelope"><i aria-hidden="true" class="fa fa-envelope"></i></a>, '\n']
----------- 3 -----------

<a href="/feed/" rel="nofollow" target="_blank" title="RSS"><i aria-hidden="true" class="fa fa-rss"></i></a>
----------- 4 -----------

<a href="/feed/" rel="nofollow" target="_blank" title="RSS"><i aria-hidden="true" class="fa fa-rss"></i></a>
<a href="/santostang" rel="nofollow" target="_blank" title="Weibo"><i aria-hidden="true" class="fa fa-weibo"></i></a>
<a href="/in/santostang" rel="nofollow" target="_blank" title="Linkedin"><i aria-hidden="true" class="fa fa-linkedin"></i></a>
<a href="mailto:tangsongsky@" rel="nofollow" target="_blank" title="envelope"><i aria-hidden="true" class="fa fa-envelope"></i></a>
----------- 5 -----------

<a href="/feed/" rel="nofollow" target="_blank" title="RSS"><i aria-hidden="true" class="fa fa-rss"></i></a>
<i aria-hidden="true" class="fa fa-rss"></i>
<a href="/santostang" rel="nofollow" target="_blank" title="Weibo"><i aria-hidden="true" class="fa fa-weibo"></i></a>
<i aria-hidden="true" class="fa fa-weibo"></i>
<a href="/in/santostang" rel="nofollow" target="_blank" title="Linkedin"><i aria-hidden="true" class="fa fa-linkedin"></i></a>
<i aria-hidden="true" class="fa fa-linkedin"></i>
<a href="mailto:tangsongsky@" rel="nofollow" target="_blank" title="envelope"><i aria-hidden="true" class="fa fa-envelope"></i></a>
<i aria-hidden="true" class="fa fa-envelope"></i>
----------- 6 -----------

----------- 7 -----------

----------- 8 -----------

html
header
h1
----------- 9 -----------

----------- 10 -----------

[<h1 id="name">唐松Santos</h1>]
----------- 11 -----------

[<a href="/feed/" rel="nofollow" target="_blank" title="RSS"><i aria-hidden="true" class="fa fa-rss"></i></a>, <a href="/santostang" rel="nofollow" target="_blank" title="Weibo"><i aria-hidden="true" class="fa fa-weibo"></i></a>, <a href="/in/santostang" rel="nofollow" target="_blank" title="Linkedin"><i aria-hidden="true" class="fa fa-linkedin"></i></a>, <a href="mailto:tangsongsky@" rel="nofollow" target="_blank" title="envelope"><i aria-hidden="true" class="fa fa-envelope"></i></a>]
----------- 12 -----------

[<a href="/feed/" rel="nofollow" target="_blank" title="RSS"><i aria-hidden="true" class="fa fa-rss"></i></a>, <a href="/">首页</a>, <a href="/sample-page/">关于我</a>, <a href="/python%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab%e4%bb%a3%e7%a0%81/">爬虫书代码</a>, <a href="/%e5%8a%a0%e6%88%91%e5%be%ae%e4%bf%a1/">加我微信</a>]
----------- 13 -----------

Process finished with exit code 0
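
The last selector, a[href^="/"], uses the CSS attribute prefix match: it selects every <a> whose href starts with "/". soup.select() also understands the suffix ($=) and substring (*=) forms; a minimal sketch against the same soup object (selector values chosen for illustration):

# Suffix match: links whose href ends with "/feed/"
print(soup.select('a[href$="/feed/"]'))
# Substring match: links whose href contains "santostang"
print(soup.select('a[href*="santostang"]'))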

Code:

import requests
from lxml import etree

link = "/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/1201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)
# Parse the page and select the text of every post-title link via XPath
html = etree.HTML(r.text)
title_list = html.xpath('//h1[@class="post-title"]/a/text()')
print(title_list)
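
The XPath expression selects only the text nodes of the post-title links. To capture the link targets as well, one could select the <a> elements themselves and read their attributes; a minimal sketch continuing from the code above:

# Select the <a> elements instead of their text nodes
for a in html.xpath('//h1[@class="post-title"]/a'):
    # .text is the link text, .get('href') the link target
    print(a.text, a.get('href'))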

(To be continued.)
