
COVID-19 (2019-nCoV) Epidemic Analysis with Python


祈LHL

Important Notes

Grading weights — analysis write-up : completeness : code quality = 3 : 5 : 2

The analysis write-up means your reasoning for each question during the analysis and your explanation of the results (be concise; do not write for the sake of writing).

P.S. Code you write yourself beats any ghost-written code, regardless of elegance; the only question is whether today you are further along than yesterday. Keep it up!

Because the dataset is large, use head() or tail() when inspecting data, to avoid the program hanging for a long time.

=======================

The data in this project come from DXY (丁香园). The main goal is to analyze historical epidemic data in order to better understand the epidemic and how it is developing, and to provide data support for decision-making in the fight against it.

The dataset used in this chapter can be obtained in the comment section of my Bilibili video.

I. Posing the Questions

The following questions are studied from three perspectives: the whole country, your own province/city, and the situation abroad:

(1) How do the national cumulative confirmed/suspected/cured/death counts change over time?

(2) How do the national daily new confirmed/suspected/cured/death counts change over time?

(3) How do the national daily new imported cases change over time?

(4) What is the situation in your own province/city?

(5) What is the epidemic situation abroad?

(6) Based on your analysis, what do you recommend to individuals and to society for fighting the epidemic?

II. Understanding the Data

Raw dataset: AreaInfo.csv. Import the required packages and read the data:

r_hex = '#dc2624'    # red,       RGB = 220,38,36
dt_hex = '#2b4750'   # dark teal, RGB = 43,71,80
tl_hex = '#45a0a2'   # teal,      RGB = 69,160,162
r1_hex = '#e87a59'   # red,       RGB = 232,122,89
tl1_hex = '#7dcaa9'  # teal,      RGB = 125,202,169
g_hex = '#649E7D'    # green,     RGB = 100,158,125
o_hex = '#dc8018'    # orange,    RGB = 220,128,24
tn_hex = '#C89F91'   # tan,       RGB = 200,159,145
g50_hex = '#6c6d6c'  # grey-50,   RGB = 108,109,108
bg_hex = '#4f6268'   # blue grey, RGB = 79,98,104
g25_hex = '#c7cccf'  # grey-25,   RGB = 199,204,207

import numpy as np
import pandas as pd
import matplotlib, re
import matplotlib.pyplot as plt
from matplotlib.pyplot import MultipleLocator

data = pd.read_csv(r'data/AreaInfo.csv')

Inspect and summarize the data to get a general sense of it.

data.head()
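Besides head(), a quick look at dtypes and missing values helps plan the cleaning step. A minimal sketch (none of these calls is required by the assignment):

data.info()          # column dtypes and non-null counts
data.describe()      # basic statistics of the numeric columns
data.isnull().sum()  # how many values are missing in each column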

III. Data Cleaning

(1) Basic processing

Data cleaning mainly covers: selecting subsets, handling missing data, converting data formats, and handling outliers.
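As a rough illustration of these four steps on this dataset (a sketch only; the name demo and the exact filters are illustrative, and the actual processing below differs in detail):

# 1) subset selection: keep only the domestic rows
demo = data.loc[data['countryName'] == '中国'].copy()
# 2) missing data: drop rows without a city name
demo = demo.dropna(subset=['cityName'])
# 3) format conversion: parse the update timestamp into a calendar date
demo['updateTime'] = pd.to_datetime(demo['updateTime'], errors='coerce').dt.date
# 4) outliers: counts can never be negative, so such rows can be discarded
demo = demo[demo['city_confirmedCount'] >= 0]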

Selecting the domestic data (the final selection is named china)

Select the domestic epidemic data.

For the update time column (updateTime), convert it to a date type, extract year-month-day, and check the result. (Hint: dt.date)

Because the data are updated hourly, there are many duplicate records within a single day; deduplicate and keep only the latest record of each day.

Hint: df.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first', inplace=False)

where df is the DataFrame of domestic data you selected.
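Putting the two hints together, a minimal sketch (df is the domestic DataFrame mentioned above; daily is an illustrative name):

# assuming df is the domestic DataFrame selected above
df['updateTime'] = pd.to_datetime(df['updateTime'], errors='coerce')
df = df.sort_values('updateTime', ascending=False)   # newest records first
df['updateTime'] = df['updateTime'].dt.date           # keep only year-month-day
# one row per province per day; keep='first' keeps the newest record of each day
daily = df.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first', inplace=False)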

Analysis: select the rows whose countryName is 中国 (China) to form CHINA.

CHINA = data.loc[data['countryName'] == '中国']
CHINA.dropna(subset=['cityName'], how='any', inplace=True)
#CHINA


Analysis: extract the list of all Chinese city names.

cities = list(set(CHINA['cityName']))

Analysis: iterate over each city's sub-DataFrame and sort it by updateTime.

for city in cities:
    # note: the sorted result is not assigned back, so CHINA itself is unchanged
    CHINA.loc[CHINA['cityName'] == city].sort_values(by='updateTime')

Analysis: drop rows containing missing values.

CHINA.dropna(subset=['cityName'], inplace=True)
#CHINA.loc[CHINA['cityName'] == '秦皇岛'].tail(20)


Analysis: convert the updateTime column of CHINA to a formatted date.

CHINA.updateTime = pd.to_datetime(CHINA.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
#CHINA.loc[CHINA['cityName'] == '秦皇岛'].tail(15)


CHINA.head()

Analysis: for the per-day deduplication, keep only the first record; the data have already been sorted by time, so the first record of each day is that day's latest one.

Analysis: since merging DataFrames requires concat, initialize a starting china first.

real = CHINA.loc[CHINA['cityName'] == cities[1]]
real.drop_duplicates(subset='updateTime', keep='first', inplace=True)
china = real


Analysis: deduplicate each city's DataFrame day by day in a loop; otherwise, for a given date only one city's record would be kept.

for city in cities[2:]:
    real_data = CHINA.loc[CHINA['cityName'] == city]
    real_data.drop_duplicates(subset='updateTime', keep='first', inplace=True)
    china = pd.concat([real_data, china], sort=False)


Check the data: are there missing values, and are the dtypes correct?

Hint: if you do not know how to handle missing values, you may simply drop them.

Analysis: not every city reports every day. If a day's total only counted the cities that happened to report, cities that still have patients but did not report would be ignored and the figures would be distorted. Every city therefore needs a record for every day, even when it did not report, so part of the data is filled in by interpolation; the details are in the data pivoting and analysis section.
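The idea described above can be sketched for a single province as follows (fill_province is a hypothetical helper, not the code actually used below, which does the same thing province by province with concat):

def fill_province(df, name):
    # Build a complete daily calendar for one province and forward-fill the
    # cumulative counters, so days without a report carry the last known value.
    sub = df[df['provinceName'] == name].sort_values('updateTime')
    full_days = pd.date_range(sub['updateTime'].min(), sub['updateTime'].max())
    sub = (sub.drop_duplicates(subset='updateTime', keep='last')
              .set_index('updateTime')
              .reindex(full_days.date))
    cols = ['province_confirmedCount', 'province_suspectedCount',
            'province_curedCount', 'province_deadCount']
    sub[cols] = sub[cols].ffill()
    return sub

Looping this over every province and concatenating the results gives the kind of completed china frame built in the cells below.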

china.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32812 entries, 96106 to 208267
Data columns (total 19 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   continentName            32812 non-null  object
 1   continentEnglishName     32812 non-null  object
 2   countryName              32812 non-null  object
 3   countryEnglishName       32812 non-null  object
 4   provinceName             32812 non-null  object
 5   provinceEnglishName      32812 non-null  object
 6   province_zipCode         32812 non-null  int64
 7   province_confirmedCount  32812 non-null  int64
 8   province_suspectedCount  32812 non-null  float64
 9   province_curedCount      32812 non-null  int64
 10  province_deadCount       32812 non-null  int64
 11  updateTime               32812 non-null  object
 12  cityName                 32812 non-null  object
 13  cityEnglishName          31968 non-null  object
 14  city_zipCode             32502 non-null  float64
 15  city_confirmedCount      32812 non-null  float64
 16  city_suspectedCount      32812 non-null  float64
 17  city_curedCount          32812 non-null  float64
 18  city_deadCount           32812 non-null  float64
dtypes: float64(6), int64(4), object(9)
memory usage: 5.0+ MB

china.head()

Selecting the data for your own province/city (the final selection is named myhome)

This step can also be done later, when the data are actually needed.

myhome = china.loc[china['provinceName'] == '广东省']
myhome.head()

Selecting the overseas data (the final selection is named world)

This step can also be done later, when the data are actually needed.

world = data.loc[data['countryName'] != '中国']
world.head()

Data pivoting and analysis

Analysis: interpolate china to fill in part of the missing data.

china.head()

Analysis: first build the list of provinces and the list of dates, and initialize a draft DataFrame.

province = list(set(china['provinceName']))   # all provinces
#p_city = list(set(china[china['provinceName'] == province[0]]['cityName']))   # cities of each province
date_0 = []
for dt in china.loc[china['provinceName'] == province[0]]['updateTime']:
    date_0.append(str(dt))
date_0 = list(set(date_0))
date_0.sort()
start = china.loc[china['provinceName'] == province[0]]['updateTime'].min()
end = china.loc[china['provinceName'] == province[0]]['updateTime'].max()
dates = pd.date_range(start=str(start), end=str(end))
aid_frame = pd.DataFrame({'updateTime': dates, 'provinceName': [province[0]]*len(dates)})
aid_frame.updateTime = pd.to_datetime(aid_frame.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
#draft = pd.merge(china.loc[china['provinceName'] == province[1]], aid_frame, on='updateTime', how='outer').sort_values('updateTime')
draft = pd.concat([china.loc[china['provinceName'] == province[0]], aid_frame], join='outer').sort_values('updateTime')
draft.province_confirmedCount.fillna(method="ffill", inplace=True)
draft.province_suspectedCount.fillna(method="ffill", inplace=True)
draft.province_curedCount.fillna(method="ffill", inplace=True)
draft.province_deadCount.fillna(method="ffill", inplace=True)

Analysis: fill in the missing dates by carrying forward the previous day's values. Some provinces stopped reporting from late April onward once they had no new cases, so their data can only be completed up to late April; beyond that, the filled values become less and less reliable.

Analysis: format the dates at the same time.

for p in range(1, len(province)):
    date_d = []
    for dt in china.loc[china['provinceName'] == province[p]]['updateTime']:
        date_d.append(dt)
    date_d = list(set(date_d))
    date_d.sort()
    start = china.loc[china['provinceName'] == province[p]]['updateTime'].min()
    end = china.loc[china['provinceName'] == province[p]]['updateTime'].max()
    dates = pd.date_range(start=start, end=end)
    aid_frame = pd.DataFrame({'updateTime': dates, 'provinceName': [province[p]]*len(dates)})
    aid_frame.updateTime = pd.to_datetime(aid_frame.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
    X = china.loc[china['provinceName'] == province[p]]
    X.reset_index(drop=True)
    Y = aid_frame
    Y.reset_index(drop=True)
    draft_d = pd.concat([X, Y], join='outer').sort_values('updateTime')
    draft = pd.concat([draft, draft_d])

draft.province_confirmedCount.fillna(method="ffill", inplace=True)
draft.province_suspectedCount.fillna(method="ffill", inplace=True)
draft.province_curedCount.fillna(method="ffill", inplace=True)
draft.province_deadCount.fillna(method="ffill", inplace=True)
#draft['updateTime'] = draft['updateTime'].strftime('%Y-%m-%d')
#draft['updateTime'] = pd.to_datetime(draft['updateTime'], format="%Y-%m-%d", errors='coerce').dt.date

china = draft

china.head()

IV. Data Analysis and Visualization

For each question, select the variables needed, build a new DataFrame, and then analyze and visualize it; this keeps the data tidy and the logic clear.

Basic analysis

For the basic analysis, only the numpy, pandas, and matplotlib libraries are allowed.

You may place several axes in one figure or use several separate figures.

Choose the chart type (line, pie, histogram, scatter, etc.) according to the purpose of each analysis; if you run out of ideas, browse the Baidu epidemic map or other epidemic dashboards for inspiration.

(1) How do the national cumulative confirmed/suspected/cured/death counts change over time?

Analysis: to obtain the national cumulative trend over time, first aggregate the daily national cumulative confirmed counts into date_confirmed.

Analysis: to do that, extract each province's latest cumulative confirmed count for each day, sum over the provinces to form a DataFrame, and concatenate the results into date_confirmed in a for loop, as sketched below.
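For comparison, the same per-day national total can be written more compactly with groupby; this is only a sketch, assuming china holds at most one row per province per day after the completion step, and national_confirmed is an illustrative name:

# latest cumulative confirmed count per province per day, then the national sum per day
national_confirmed = (china.groupby(['updateTime', 'provinceName'])['province_confirmedCount']
                           .max()                 # one value per province per day
                           .groupby('updateTime')
                           .sum()                 # national total for each day
                           .rename('China_confirmedCount'))

The loop-based construction below produces the same kind of per-day totals in a DataFrame named date_confirmed.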

date = list(set(china['updateTime']))
date.sort()
date

[datetime.date(2020, 1, 24),
 datetime.date(2020, 1, 25),
 datetime.date(2020, 1, 26),
 ...
 datetime.date(2020, 6, 22),
 datetime.date(2020, 6, 23)]

china = china.set_index('provinceName')
china = china.reset_index()

Analysis: loop over provinces and dates to obtain each province's daily cumulative confirmed count; since the results need to be concatenated, initialize date_confirmed first.

list_p = []
list_d = []
list_e = []
for p in range(0, 32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['province_confirmedCount'])   # each province's cumulative confirmed count on that day
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_confirmed = pd.DataFrame(list_d, index=list_e)
date_confirmed.index.name = "date"
date_confirmed.columns = ["China_confirmedCount"]
date_confirmed

Analysis: loop over the remaining dates and concatenate each day's national total confirmed count into the DataFrame.

l = 0
for i in date[3:]:
    list_p = []
    list_d = []
    list_e = []
    l += 1
    for p in range(0, 32):
        try:
            con_0 = china.loc[china['updateTime'] == date[l]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
            list_p.append(con_0['province_confirmedCount'])   # each province's cumulative confirmed count on that day
        except:
            continue
    #con_0 = china.loc[china['updateTime'] == date[0]].loc[china['provinceName'] == '河北省'].loc[[0]].iloc[0]
    #list_p.append(con_0['province_confirmedCount'])
    list_d.append(sum(list_p))
    list_e.append(str(date[l]))
    confirmed = pd.DataFrame(list_d, index=list_e)
    confirmed.index.name = "date"
    confirmed.columns = ["China_confirmedCount"]
    date_confirmed = pd.concat([date_confirmed, confirmed], sort=False)

date_confirmed

150 rows × 1 columns

Analysis: drop missing and incomplete values.

date_confirmed.dropna(subset=['China_confirmedCount'], inplace=True)
date_confirmed.tail(20)

Analysis: from late April to late May the figures become distorted because too many provinces are missing (some provinces have reported no new cases since late April); from 2020-06-06 onward the series loses its validity completely, so the data from 2020-06-06 onward are dropped.

date_confirmed = date_confirmed.drop(['2020-06-06', '2020-06-07', '2020-06-08', '2020-06-09',
                                      '2020-06-10', '2020-06-11', '2020-06-12', '2020-06-13',
                                      '2020-06-14', '2020-06-15', '2020-06-16', '2020-06-17',
                                      '2020-06-18', '2020-06-19', '2020-06-20', '2020-06-21'])

Analysis: build a concatenation helper function.

def data_frame(self, china, element):
    l = 0
    for i in date[3:]:
        list_p = []
        list_d = []
        list_e = []
        l += 1
        for p in range(0, 32):
            try:
                con_0 = china.loc[china['updateTime'] == date[l]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
                list_p.append(con_0[element])
            except:
                continue
        #con_0 = china.loc[china['updateTime'] == date[0]].loc[china['provinceName'] == '河北省'].loc[[0]].iloc[0]
        #list_p.append(con_0['province_confirmedCount'])
        list_d.append(sum(list_p))
        list_e.append(str(date[l]))
        link = pd.DataFrame(list_d, index=list_e)
        link.index.name = "date"
        link.columns = ["China"]
        self = pd.concat([self, link], sort=False)
    self.dropna(subset=['China'], inplace=True)
    self = self.drop(['2020-06-06', '2020-06-07', '2020-06-08', '2020-06-09', '2020-06-10',
                      '2020-06-11', '2020-06-12', '2020-06-13', '2020-06-14', '2020-06-15',
                      '2020-06-16', '2020-06-17', '2020-06-18', '2020-06-19', '2020-06-20',
                      '2020-06-21'])
    return self

Analysis: initialize the individual variables.

# cumulative cured count: date_cured
list_p = []
list_d = []
list_e = []
for p in range(0, 32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['province_curedCount'])
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_cured = pd.DataFrame(list_d, index=list_e)
date_cured.index.name = "date"
date_cured.columns = ["China"]

# cumulative death count: date_dead
list_p = []
list_d = []
list_e = []
for p in range(0, 32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['province_deadCount'])
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_dead = pd.DataFrame(list_d, index=list_e)
date_dead.index.name = "date"
date_dead.columns = ["China"]

# cumulative confirmed cases: date_confirmed
plt.rcParams['font.sans-serif'] = ['SimHei']   # switch font so Chinese labels render
fig = plt.figure(figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = date_confirmed.index
y = date_confirmed.values
ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-')
ax.set_title('累计确诊患者', fontdict={'color':'black','size':24})
ax.set_xticks(range(0, len(x), 30))


# cumulative cured cases: date_cured
date_cured = data_frame(date_cured, china, 'province_curedCount')
fig = plt.figure(figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = date_cured.index
y = date_cured.values
ax.set_title('累计治愈患者', fontdict={'color':'black','size':24})
ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-')
ax.set_xticks(range(0, len(x), 30))


Analysis: the cumulative suspected count cannot be obtained through this data completion.

# cumulative deaths: date_dead
date_dead = data_frame(date_dead, china, 'province_deadCount')
fig = plt.figure(figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = date_dead.index
y = date_dead.values
ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-')
x_major_locator = MultipleLocator(12)
ax = plt.gca()
ax.set_title('累计死亡患者', fontdict={'color':'black','size':24})
ax.xaxis.set_major_locator(x_major_locator)
ax.set_xticks(range(0, len(x), 30))


Analysis: the epidemic broke out in early January, its growth began to slow at the end of February, and it leveled off by late April. The number of cured patients rose sharply from early February and leveled off by late March. Deaths increased from late January, leveled off by late February, and spiked in late April for statistical (reporting) reasons before leveling off again.

Summary: the confirmed and cured series become distorted between late April and late May because too many provinces are missing (some provinces have reported no new cases since then). The other series were filled in as far as possible, but the closer to the end of the series, the less reliable the data. The death series was filled in almost completely, with essentially no gaps.

(2) How do the national daily new confirmed/suspected/cured/death counts change over time?

Analysis: the daily new confirmed/cured/death counts are computed from china by taking day-over-day differences (diff) for each province, as sketched below.
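The differencing itself is a one-liner once a cumulative series exists; a sketch (daily_new_confirmed and per_province_new are illustrative names):

# national daily new confirmed cases from the cumulative series of question (1)
daily_new_confirmed = date_confirmed['China_confirmedCount'].diff().dropna()

# or per province first, before any national aggregation
per_province_new = (china.sort_values('updateTime')
                         .groupby('provinceName')['province_confirmedCount']
                         .diff())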

Analysis: first initialize the data, then adapt the concatenation function above to this question.

# daily new confirmed: date_new_confirmed
list_p = []
list_d = []
list_e = []
for p in range(0, 32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['province_confirmedCount'])   # each province's cumulative confirmed count on that day
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_new_confirmed = pd.DataFrame(list_d, index=list_e)
date_new_confirmed.index.name = "date"
date_new_confirmed.columns = ["China"]
date_new_confirmed

# daily new cured: date_new_cured
list_p = []
list_d = []
list_e = []
for p in range(0, 32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['province_curedCount'])
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_new_cured = pd.DataFrame(list_d, index=list_e)
date_new_cured.index.name = "date"
date_new_cured.columns = ["China"]

# daily new deaths: date_new_dead
list_p = []
list_d = []
list_e = []
for p in range(0, 32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['province_deadCount'])
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_new_dead = pd.DataFrame(list_d, index=list_e)
date_new_dead.index.name = "date"
date_new_dead.columns = ["China"]

Analysis: build the concatenation helper function.

def data_new_frame(self, china, element):
    l = 0
    for i in date[3:]:
        list_p = []
        list_d = []
        list_e = []
        l += 1
        for p in range(0, 32):
            try:
                con_0 = china.loc[china['updateTime'] == date[l]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
                list_p.append(con_0[element])
            except:
                continue
        #con_0 = china.loc[china['updateTime'] == date[0]].loc[china['provinceName'] == '河北省'].loc[[0]].iloc[0]
        #list_p.append(con_0['province_confirmedCount'])
        list_d.append(sum(list_p))
        list_e.append(str(date[l]))
        link = pd.DataFrame(list_d, index=list_e)
        link.index.name = "date"
        link.columns = ["China"]
        self = pd.concat([self, link], sort=False)
    self.dropna(subset=['China'], inplace=True)
    return self

Analysis: complete the data and remove days whose figures are missing some provinces.

d = data_new_frame(date_new_confirmed, china, 'province_confirmedCount')
for i in range(len(d)):
    dr = []
    for a, b in zip(range(0, len(d)-1), range(1, len(d)-2)):
        if d.iloc[b].iloc[0] < d.iloc[a].iloc[0]:
            dr.append(d.iloc[b].iloc[0])
    d = d[~d['China'].isin(dr)]

Analysis: take the day-over-day difference.

d['China'] = d['China'].diff()

Analysis: drop two dates that are missing some provinces.

d.drop(['2020-06-20', '2020-06-21'], inplace=True)

Analysis: draw a line chart to show the trend over time.

# daily new confirmed cases
fig = plt.figure(figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = d.index
y = d.values
ax.set_title('新增确诊患者', fontdict={'color':'black','size':24})
ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-')
ax.set_xticks(range(0, len(x), 10))


Analysis: use the initialized data to build the date_new_cured DataFrame, then draw a line chart of the trend over time.

cu = data_new_frame(date_new_cured, china, 'province_curedCount')
for i in range(len(cu)):
    dr = []
    for a, b in zip(range(0, len(cu)-1), range(1, len(cu)-2)):
        if cu.iloc[b].iloc[0] < cu.iloc[a].iloc[0]:
            dr.append(cu.iloc[b].iloc[0])
    cu = cu[~cu['China'].isin(dr)]
cu['China'] = cu['China'].diff()
cu.drop(['2020-06-20', '2020-06-21'], inplace=True)

# daily new cured cases
fig = plt.figure(figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = cu.index
y = cu.values
ax.set_title('新增治愈患者', fontdict={'color':'black','size':24})
ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-')
ax.set_xticks(range(0, len(x), 10))


Analysis: use the initialized data to build the date_new_dead DataFrame, then draw a line chart of the trend over time.

de = data_new_frame(date_new_dead, china, 'province_deadCount')
for i in range(len(de)):
    dr = []
    for a, b in zip(range(0, len(de)-1), range(1, len(de)-2)):
        if de.iloc[b].iloc[0] < de.iloc[a].iloc[0]:
            dr.append(de.iloc[b].iloc[0])
    de = de[~de['China'].isin(dr)]
de['China'] = de['China'].diff()
de.drop(['2020-06-21'], inplace=True)

# daily new deaths
fig = plt.figure(figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = de.index
y = de.values
ax.set_title('新增死亡患者', fontdict={'color':'black','size':24})
ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-')
ax.set_xticks(range(0, len(x), 10))


Analysis: daily new cases began to rise in late January, peaked around February 14, and then declined and leveled off.

Analysis: daily new recoveries began to rise in late January, peaked around March 2, and then declined, leveling off from early April.

Analysis: daily new deaths began to rise in late January, peaked in February, grew only slowly from early March, and spiked around April 17 for statistical (reporting) reasons before falling back.

(3) How do the national daily new imported cases change over time?

Analysis: the daily new imported cases are computed from CHINA by subtracting consecutive days.

Analysis: first take the imported-case rows out of CHINA, then complete the time series and take differences. A compact sketch of the idea follows; the expanded cells come after it.
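A minimal sketch (imp and daily_imported are illustrative names; it assumes the 境外输入 rows carry cumulative counts in city_confirmedCount, which is how they are used below):

# cumulative imported cases per province, then day-over-day differences
imp = CHINA[CHINA['cityName'] == '境外输入'].copy()
imp['updateTime'] = pd.to_datetime(imp['updateTime'], errors='coerce').dt.date
imp = imp.sort_values('updateTime').drop_duplicates(['provinceName', 'updateTime'], keep='last')
imp['new_imported'] = imp.groupby('provinceName')['city_confirmedCount'].diff()
daily_imported = imp.groupby('updateTime')['new_imported'].sum()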

imported = CHINA.loc[CHINA['cityName'] == '境外输入']
imported.updateTime = pd.to_datetime(imported.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
imported


607 rows × 19 columns

Analysis: fill in the missing dates for each province.

for i in range(0, len(province)):
    list_j_d = []
    date_b = []
    for dt in imported.loc[imported['provinceName'] == province[i]]['updateTime']:
        date_b.append(str(dt))
    list_j_d = list(set(date_b))
    list_j_d.sort()
    #imported.loc[imported['provinceName'] == province[3]]
    try:
        start = imported.loc[imported['provinceName'] == province[i]]['updateTime'].min()
        end = imported.loc[imported['provinceName'] == province[i]]['updateTime'].max()
        dates_b = pd.date_range(start=str(start), end=str(end))
        aid_frame_b = pd.DataFrame({'updateTime': dates_b, 'provinceName': [province[i]]*len(dates_b)})
        aid_frame_b.updateTime = pd.to_datetime(aid_frame_b.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
        #draft = pd.merge(china.loc[china['provinceName'] == province[1]], aid_frame, on='updateTime', how='outer').sort_values('updateTime')
        draft_b = pd.concat([imported.loc[imported['provinceName'] == province[i]], aid_frame_b], join='outer').sort_values('updateTime')
        draft_b.city_confirmedCount.fillna(method="ffill", inplace=True)
        draft_b.city_suspectedCount.fillna(method="ffill", inplace=True)
        draft_b.city_curedCount.fillna(method="ffill", inplace=True)
        draft_b.city_deadCount.fillna(method="ffill", inplace=True)
        draft_b.loc[draft_b['provinceName'] == province[i]].fillna(0, inplace=True, limit=1)
        draft_b.loc[draft_b['provinceName'] == province[i]].loc[:, 'city_confirmedCount':'city_deadCount'] = \
            draft_b.loc[draft_b['provinceName'] == province[i]].loc[:, 'city_confirmedCount':'city_deadCount'].diff()
        draft_b.dropna(subset=['city_confirmedCount', 'city_suspectedCount', 'city_curedCount', 'city_deadCount'], inplace=True)
        imported = pd.concat([imported, draft_b], join='outer').sort_values('updateTime')
    except:
        continue

imported

2524 rows × 19 columns

Analysis: make a copy() so that a processing mistake does not destroy the original data.

draft_i = imported.copy()

Analysis: initialize with one province's data to check that the method works.

real_s = imported.loc[imported['provinceName'] == province[0]]
real_s.drop_duplicates(subset='updateTime', keep='first', inplace=True)
draft_i = real_s
for p in province:
    real_data = imported.loc[imported['provinceName'] == p]
    real_data.drop_duplicates(subset='updateTime', keep='first', inplace=True)
    #imported = pd.concat([real_data, china], sort=False)
    draft_i = pd.concat([real_data, draft_i], sort=False)


Analysis: with the method confirmed, apply the same processing to the remaining provinces.

imported = draft_i

imported = imported.set_index('provinceName')
imported = imported.reset_index()

Analysis: merge the data of all provinces.

list_p = []
list_d = []
list_e = []
for p in range(0, 32):
    try:
        con_0 = imported.loc[imported['updateTime'] == date[2]].loc[imported['provinceName'] == province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['city_confirmedCount'])   # each province's count on that day
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_new_foreign_confirmed = pd.DataFrame(list_d, index=list_e)
date_new_foreign_confirmed.index.name = "date"
date_new_foreign_confirmed.columns = ["imported_confirmedCount"]
date_new_foreign_confirmed

l = 0
for i in date[3:]:
    list_p = []
    list_d = []
    list_e = []
    l += 1
    for p in range(0, 32):
        try:
            con_0 = imported.loc[imported['updateTime'] == date[l]].loc[imported['provinceName'] == province[p]].iloc[[0]].iloc[0]
            list_p.append(con_0['city_confirmedCount'])   # each province's count on that day
        except:
            continue
    #con_0 = imported.loc[imported['updateTime'] == date[0]].loc[imported['provinceName'] == '河北省'].loc[[0]].iloc[0]
    #list_p.append(con_0['city_confirmedCount'])
    list_d.append(sum(list_p))
    list_e.append(str(date[l]))
    confirmed = pd.DataFrame(list_d, index=list_e)
    confirmed.index.name = "date"
    confirmed.columns = ["imported_confirmedCount"]
    date_new_foreign_confirmed = pd.concat([date_new_foreign_confirmed, confirmed], sort=False)

date_new_foreign_confirmed

150 rows × 1 columns

# daily new imported cases
fig = plt.figure(figsize=(16,4), dpi=100)
ax = fig.add_subplot(1,1,1)
x = date_new_foreign_confirmed.index
y = date_new_foreign_confirmed.values
plot = ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-', label='date_new_foreign_confirmed')
ax.set_xticks(range(0, len(x), 10))
plt.xlabel('日期', fontsize=20)
plt.ylabel('人数', fontsize=20)
plt.title('COVID-19——新增境外输入', fontsize=30)
ax.legend(loc=0, frameon=True)


Summary: imported cases began to surge at the end of March, their growth flattened in early May, and it slowed further from early June.

(4) What is the situation in your own province/city?

Analysis: first take all of Guangdong's timestamps, convert them to strings, and sort them.

m_dates = list(set(myhome['updateTime']))
aid_d = m_dates.copy()
for d in aid_d:
    a = str(d)
    m_dates.remove(d)
    m_dates.append(a)
m_dates.sort()

myhome = myhome.set_index('provinceName')
myhome = myhome.reset_index()

Analysis: iterate over the dates of my province to build the corresponding DataFrames.

# Guangdong cumulative confirmed
list_g = []
for i in range(0, len(m_dates)):
    try:
        con_m = myhome.loc[myhome['updateTime'] == date[i]].loc[myhome['cityName'] == '茂名'].iloc[[0]].iloc[0]
        list_g.append(con_m['province_confirmedCount'])
    except:
        list_g.append(0)
        continue
g_date_confirmed = pd.DataFrame(list_g, index=m_dates)
g_date_confirmed.index.name = "date"
g_date_confirmed.columns = ["g_confirmed"]
g_date_confirmed = g_date_confirmed[~g_date_confirmed['g_confirmed'].isin([0])]

# Guangdong cumulative cured
list_g = []
for i in range(0, len(m_dates)):
    try:
        con_m = myhome.loc[myhome['updateTime'] == date[i]].loc[myhome['cityName'] == '茂名'].iloc[[0]].iloc[0]
        list_g.append(con_m['province_curedCount'])
    except:
        list_g.append(0)
        continue
g_date_cured = pd.DataFrame(list_g, index=m_dates)
g_date_cured.index.name = "date"
g_date_cured.columns = ["g_cured"]
g_date_cured = g_date_cured[~g_date_cured['g_cured'].isin([0])]

# Guangdong cumulative deaths
list_g = []
for i in range(0, len(m_dates)):
    try:
        con_m = myhome.loc[myhome['updateTime'] == date[i]].loc[myhome['cityName'] == '茂名'].iloc[[0]].iloc[0]
        list_g.append(con_m['province_deadCount'])
    except:
        list_g.append(0)
        continue
g_date_dead = pd.DataFrame(list_g, index=m_dates)
g_date_dead.index.name = "date"
g_date_dead.columns = ["g_dead"]
g_date_dead = g_date_dead[~g_date_dead['g_dead'].isin([0])]

Analysis: draw line charts to show how the epidemic evolved over time.

# Guangdong cumulative confirmed vs cumulative cured
plt.rcParams['font.sans-serif'] = ['SimHei']
x = g_date_confirmed.index
y1 = g_date_confirmed.values
y2 = g_date_cured.values
y3 = g_date_dead
#font_manager = font_manager.FontProperties(fname='C:/Windows/Fonts/simsun.ttc', size=18)
plt.figure(figsize=(20,10), dpi=80)
plt.plot(x, y1, color=r_hex, label='confirmed')
plt.plot(x, y2, color=g_hex, label='cured')
x_major_locator = MultipleLocator(12)
ax = plt.gca()
ax.xaxis.set_major_locator(x_major_locator)
plt.title('COVID-19 —— 广东省', fontsize=30)
plt.xlabel('日期', fontsize=20)
plt.ylabel('人数', fontsize=20)
plt.legend(loc=1, bbox_to_anchor=(1.00, 0.90), bbox_transform=ax.transAxes)


# Guangdong cumulative deaths
plt.rcParams['font.sans-serif'] = ['SimHei']
fig = plt.figure(figsize=(16,4), dpi=100)
ax = fig.add_subplot(1,1,1)
x = g_date_dead.index
y = g_date_dead.values
plot = ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-', label='dead')
ax.set_xticks(range(0, len(x), 10))
plt.xlabel('日期', fontsize=20)
plt.ylabel('人数', fontsize=20)
plt.title('COVID-19——广东省', fontsize=30)
ax.legend(loc=0, frameon=True)


Analysis: the data completion for Guangdong worked well, so the series is reliable.

Analysis: the line charts show that infections in Guangdong surged from late January and leveled off by mid-February; in early March the confirmed count rose slightly for a short period because of wider testing and reporting changes. Recoveries in Guangdong surged from early February and leveled off from early June as new infections flattened. Guangdong has had no new deaths since early March.

(5) What is the epidemic situation abroad?

Analysis: drop missing values from the data.

world.dropna(axis=1, how='any', inplace=True)
#world.set_index('updateTime')


Analysis: build the country list country and the date list date_y.

country = list(set(world['provinceName']))
date_y = []
for dt in world.loc[world['provinceName'] == country[0]]['updateTime']:
    date_y.append(str(dt))
date_y = list(set(date_y))   # originally set(date_0), which reused the domestic date list; date_y is what is intended here
date_y.sort()

Analysis: iterate over the country list, process the updateTime column of world, and deduplicate.

for c in country:
    world.loc[world['provinceName'] == c].sort_values(by='updateTime')
world.dropna(subset=['provinceName'], inplace=True)
world.updateTime = pd.to_datetime(world.updateTime, format="%Y-%m-%d", errors='coerce').dt.date


Analysis: pivot the province_confirmedCount of the first 15 countries into world_confirmed and fill in the missing data.

world_confirmed = world.loc[world['provinceName'] == world.head(15)['provinceName'][0]].pivot_table(
    index='updateTime', columns='provinceName', values='province_confirmedCount', aggfunc=np.mean)
for i in world.head(15)['provinceName'][1:]:
    draft_c = world.loc[world['provinceName'] == i].pivot_table(
        index='updateTime', columns='provinceName', values='province_confirmedCount', aggfunc=np.mean)
    world_confirmed = pd.merge(world_confirmed, draft_c, on='updateTime', how='outer', sort=True)
world_confirmed.fillna(0, inplace=True, limit=1)
world_confirmed.fillna(method="ffill", inplace=True)
world_confirmed

144 rows × 15 columns

Analysis: plot how the epidemic in the first 15 countries changes over time.

#plt.rcParams['font.sans-serif'] = ['SimHei']
fig = plt.figure(figsize=(16,10))
plt.plot(world_confirmed)
plt.legend(world_confirmed.columns)
plt.title('前15个国家累计确诊人数', fontsize=20)
plt.xlabel('日期', fontsize=20)
plt.ylabel('人数/百万', fontsize=20);

Analysis: the completion of the overseas data worked fairly well, so the series has a reasonable degree of reliability.

Analysis: confirmed cases abroad began to surge at the end of March; in the four worst-hit countries the epidemic shows no sign of coming under control, so the trend abroad is for confirmed cases to keep surging.

(6) Based on your analysis, what do you recommend to individuals and to society for fighting the epidemic?

The domestic curves show the epidemic flattening from late April; in contrast, the epidemic abroad took off in early April and so far shows no sign of flattening.

The imported-case curve shows that we must guard against imported cases to prevent renewed domestic transmission; we cannot let our guard down.

For individuals: avoid crowded areas, always wear a mask when going out, and disinfect thoroughly after returning home.

For society: in transport hubs and crowded areas, virus testing and disinfection of venues should be made widely available, to cut off transmission routes and protect the results of our epidemic control.

Additional analysis (optional)

For the additional analysis, any libraries may be used, for example seaborn or pyecharts.

Due to the limits of my own ability, this part was not done.
