
Python Wikipedia Crawler: How to Extract Wikipedia Data with Python

Date: 2023-10-02 23:10:49



I should point out that we won't scrape Wikipedia pages by hand; the wikipedia module has already done the hard work for us. Let's install it:

pip3 install wikipedia

Open an interactive Python shell or an empty file and follow along.

Let's get a summary of what the Python programming language is:

import wikipedia

# print the summary of what python is
print(wikipedia.summary("Python Programming Language"))

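This extracts the summary from the matching Wikipedia page. More specifically, it prints the first few sentences, and we can specify how many sentences to extract: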

In [2]: wikipedia.summary("Python programming languag", sentences=2)
Out[2]: "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace."

Note that I deliberately misspelled the query, yet it still returned an accurate result.
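Not every query resolves as cleanly as the misspelled one above. The following is a minimal sketch of a wrapper around `wikipedia.summary` that handles two exceptions the library defines, `DisambiguationError` and `PageError`. The function name `safe_summary` and its `wiki` parameter are hypothetical and exist only for this illustration; the fallback to the first disambiguation option is one possible policy, not the library's own behavior.

```python
def safe_summary(query, sentences=2, wiki=None):
    """Return a short summary for `query`, or None if no page is found.

    `wiki` is a hypothetical injection point: pass a stand-in object for
    offline testing, or leave it as None to use the real `wikipedia` module.
    """
    if wiki is None:
        import wikipedia as wiki  # imported lazily, only when actually needed

    try:
        return wiki.summary(query, sentences=sentences)
    except wiki.exceptions.DisambiguationError as err:
        # The query matched several pages. This sketch simply falls back
        # to the first candidate listed in `err.options`.
        if not err.options:
            return None
        return wiki.summary(err.options[0], sentences=sentences)
    except wiki.exceptions.PageError:
        # No page matched the query at all.
        return None
```

With the package installed and network access available, `safe_summary("Python programming languag")` would be called the same way as `wikipedia.summary` in the session above.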

Searching Wikipedia for a term:

In [3]: result = wikipedia.search("Neural networks")
In [4]: print(result)
['Neural network', 'Artificial neural network', 'Convolutional neural network', 'Recurrent neural network', 'Rectifier (neural networks)', 'Feedforward neural network', 'Neural circuit', 'Quantum neural network', 'Dropout (neural networks)', 'Types of artificial neural networks']

This returns a list of related page titles. Let's get the whole page for "Neural network", which is result[0]:

# get the page: Neural network
page = wikipedia.page(result[0])
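Taking result[0] simply trusts the search ranking. As a small sketch, a helper could prefer a title that matches the query exactly (ignoring case) and only then fall back to the first hit. The name `pick_title` is hypothetical, and the sample list is a shortened copy of the search output shown earlier.

```python
def pick_title(query, titles):
    """Prefer a title equal to the query (ignoring case), else the first hit."""
    wanted = query.strip().lower()
    for title in titles:
        if title.lower() == wanted:
            return title
    # No exact match: fall back to the top search result, if there is one.
    return titles[0] if titles else None


results = ['Neural network', 'Artificial neural network', 'Convolutional neural network']
print(pick_title("artificial neural network", results))  # Artificial neural network
```

The chosen title could then be passed to `wikipedia.page`; its `auto_suggest=False` argument asks the library to use the title as given instead of a suggested alternative.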

Extract the title:

# get the title of the page
title = page.title

Get all the categories of that Wikipedia page:

# get the categories of the page
categories = page.categories

Extract the text with all HTML tags removed (this is done automatically):

# get the whole wikipedia page text (content)
content = page.content
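The plain text comes back as one long string. Assuming section headings appear in it as lines such as `== History ==` (this is an assumption about the text format, so check it against your own output), a rough sketch for splitting the text into sections might look like this. The function name `split_sections` and the "Summary" key for the leading text are choices made for this illustration, and the sample string is made up.

```python
import re

# Matches heading lines such as "== History ==" or "=== Details ===".
HEADING = re.compile(r"^(={2,})\s*(.+?)\s*\1$")


def split_sections(content):
    """Split plain page text into a {heading: text} dict.

    Text before the first heading is stored under "Summary".
    """
    sections = {"Summary": []}
    current = "Summary"
    for line in content.splitlines():
        match = HEADING.match(line.strip())
        if match:
            # Start collecting lines under the new heading.
            current = match.group(2)
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


# Made-up sample text, used only to show the shape of the result.
sample = "Intro text.\n\n== History ==\nEarly work.\n\n=== Details ===\nMore.\n"
print(split_sections(sample))
```

Sections that share a heading are merged under one key, which keeps the sketch short but loses the nesting of subsections.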

All the links:

# get all the links in the page
links = page.links

The references:

# get the page references
references = page.references

Finally, the summary:

# summary
summary = page.summary

Let's print them all out:

# print info
print("Page content:\n", content, "\n")
print("Page title:", title, "\n")
print("Categories:", categories, "\n")
print("Links:", links, "\n")
print("References:", references, "\n")
print("Summary:", summary, "\n")
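Printing is fine for exploring, but for collecting data it may be handier to gather the same fields into a dictionary and serialize it. The sketch below assumes only that the page object exposes the attributes used in this tutorial. The name `page_to_record` is hypothetical, and `demo_page` is a stand-in with placeholder values so the example runs without network access.

```python
import json
from types import SimpleNamespace


def page_to_record(page):
    """Collect the page attributes used in this tutorial into a plain dict."""
    return {
        "title": page.title,
        "summary": page.summary,
        "categories": list(page.categories),
        "links": list(page.links),
        "references": list(page.references),
        "content": page.content,
    }


# Stand-in object with placeholder values, for illustration only.
# In practice this would be the object returned by wikipedia.page(...).
demo_page = SimpleNamespace(
    title="Neural network",
    summary="(placeholder summary)",
    categories=["(placeholder category)"],
    links=["(placeholder link)"],
    references=["(placeholder reference)"],
    content="(placeholder content)",
)

record = page_to_record(demo_page)
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Passing `ensure_ascii=False` keeps non-ASCII characters readable in the JSON output, which matters for pages in other languages.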

Try it out!

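That's it, we're done. This was a short introduction to extracting information from Wikipedia with Python. It can be helpful if you want to automatically collect data for language models, build a question-answering chatbot, create a wrapper application around it, and more. The possibilities are endless; tell us what you did with it in the comments below!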

If this tutorial was useful, buy me a coffee -> buymeacoff.ee/gajeshnaik