1500字范文 > 学习笔记3 | 高维数据处理——Xarray

学习笔记3 | 高维数据处理——Xarray

时间：2019-10-24 18:26:17

一、数据结构

1.DataArray

（1）DataArray的创建

import xarray as xrimport numpy as npimport pandas as pddata = np.random.randn(4,3,2) #随机生成三维数组xr.DataArray(data) #简单的dataarraytimes = pd.date_range('2000-01-01',periods=4)lat = [0,1,2]lon = [10,20]da = xr.DataArray(data,coords={'time'=times,'lat'=lat,'lon'=lon},dims=['time','lat','lon'])

（2）DataArray的属性及常用方法

da.dims #获取维度da.coords #获取坐标da.values #获取数值type(da.values)da.attrs #获取属性da.attrs['name']='precitation' #设置属性da.attrs['units']='mm'da.rename('rain')

2.DataSet

可以理解为多个DataArray的集合

（1）DataSet的创建

#1.直接创建temp = 15+8*np.random.randn(4,3,2)precip = 10*np.random.randn(4,3,2)xr.Dataset({'temperature'=(['time','lat','lon'],temp)'precipitation'=(['time','lat','lon'],precip)},coords={'lon'=lon'lat'=lat'time'=times})#2.使用DataArray创建xr.Dataset({'precip':da})xr.Dataset({'precip':da,'temp':da+10})

（2）DataSet的属性和常用方法

ds.data_vars #显示有哪些变量'temp' in ds #判断变量是否在Dataset中ds.coordsds._coord_names #获取坐标名ds._dims #获取维度ds.drop_vars('temp') #剔除变量ds.drop_dims('time') #剔除维度

二、数据的读取

nc数据：再分析数据、气候系统模拟grib数据：数值预报中用的较多

1.读取nc文件

#xarray.open_dataset() engine=netcdf4/cfgribimport xarray as xrds = xr.open_dataset('filename') #engine默认netcdf4ds.longitudeds.longitude.valuesds.timetype(ds.time.values)ds['latitude']ds['t2m']ds['t2m'].values#xarray.open_dataarray #只能打开含有一个变量da = xr.open_dataset('filename',drop_variables=['u10','v10','swh'])da = xr.open_dataarray('filename',drop_variables=['u10','v10','swh'])da = xr.open_dataarray('filename')

2.读取grib文件

xr.open_dataset('fd.grib2',engine='cfgrib') #打开后发现缺少t2m类似其他坐标系统里的变量1xr.open_dataset('fd.grib2',engine='cfgrib'，backend_kwargs={'filter_by_keys':{'typeOfLevel':'heightAboveGround','level':2}})xr.open_dataset('fd.grib2',engine='cfgrib'，backend_kwargs={'filter_by_keys':{'typeOfLevel':'surface','level':0}})

3.读取多个文件并合并

# xarray.open_mfdataset(paths,engine=' ')xr.open_dataset('filename')xr.open_mfdataset('../*.nc') #mfdataset也可以读取一个文件

三、数据的索引

1.通过位置索引

：在指定维度位置上输入相应的数值来索引，只能在dataArray中使用

#1.通过数字查找import xarray as xrxr.open_dataset('filename')t = ds['t2m']t.valuest.values.shapet[0,0,0] #得到的是dataArrayt[:,1,1]#2.标签查找t.loc[:,89.75,70.25]t.loc[:,89.75:80,70.25:71] #维度从大到小排列，经度从小到大排列

2.通过名字索引

：索引时输入相关的维度名称和相应的数值

#1.数字查找ds.isel(longitude=1,time=0) #1是位置，没有查找的坐标全部都取#2.标签查找ds.sel(longitude=70.25,latitude=50,time='-01-01') #70.25是值#3.时空的切片索引#空间索引ds.sel(latitude=slice(80,50),longitude=slice(80,155))ds.sortby('latitude') #按照latitude排序ds.sel(latitude=slice(50,80),longitude=slice(80,155))#时间索引ds.sel(time=slice('-02-03','-05')ds.sel(time='-01-01 06')ds.sel(time=(ds.time.dt.day==15)) #指定15号ds.sel(time=(ds.time.dt.month.isin([1,3,5]))) #提取1，3，5月的数据

四、数据的坐标系统

1.修改坐标

import xarray as xrds = xr.open_dataset('filename')#1.修改坐标名字ds.rename({'longitude':'x','latitude':'y'})#2.修改坐标数值ds.assign_coords(longitude=ds['longitude']+180)#3.时间坐标拆分yearv= = ds.time.dt.yearmonthv = ds.time.dt.monthdayv = ds.time.dt.dayhourv = ds.time.dt.hourds_time = ds.assign_coords(year=yearv,month=monthv,day=dayv,hour=hourv) #只定义进去了但没有和变量关联起来ds1 = ds_time.set_index(time_new=('year','month','day','hour')) #增加坐标到dimension里ds1.sel(time_new=(,1,15,[0,12]))ds1.drop_dims('time').unstack('time_new') #但time是和变量关联的，删除之后变量也没了，所以要修改#修改上面三步ds1 = ds_time.set_index(time=('year','month','day','hour'))ds1.sel(time=(,1,15,[0,12]))time = ds1.unstack('time_new')time.sel(day=15,hour=18)

2.增加/删除坐标

#1.增加坐标t = ds['t2m'].expand_dims(height=1) #但还没有关联到变量里,还没有赋值t.assign_coords(height=[2])#2.删除坐标t.squeeze('height') #只能把维度是1的坐标压缩（应该意思是只有一个值的维度吧）

五、数据统计与计算

1.基础运算

import xarray as xr#1.四则运算ds = xr.open_dataset('....nc')t = ds['t2m']t.valuest-273.15u = ds['u10']v = ds['v10']u ** 2 + v ** 2 #风速的平方#2.常用函数t.max()t.min()t.mean()t.std()import numpy as npnp.max(t) #和上面结果一样np.sqrt(u**2 +v**2)abs(u)

2.降维

遇到缺测值可以自动忽略

#1.时间降维t.mean()t.mean(dim='time') #年平均气温t-t.mean(dim='time') #距平气温#2.空间降维t.mean(dim=['latitude','longitude'])

3.分组与聚合

时间分辨率有：year、month、day、hour、minute、second、microsecond、season、dayofyear、dayofweek、days_in_month

#xarray.DataArray.groupby(group,squeeze= ,restore_coord_dims= )t.groupby('time.month')t.groupby('time.season')tmonth = t.groupby('time.month')tmonth.groupsfor i,data in t_month:print(i,data['time'])tmonth.mean() #时间维度变成month

4.滑窗

：是在简单平均数基础上，通过顺序逐期增减新旧数据求算滑动平均，借以消除偶然变动因素，找出事物发展趋势，并进行预测的方法。

#xarray.DataArray.rolling(dim=None,center=False,min_periods=None)t_time = t.mean(['longitude','latitude'])r = t_time.rolling({'time':5})r.mean() #前五个是nanr = t_time.rolling(time=5,center=True)r.mean() #前2个和最后2个是nanr.max() #其他函数也可以用#空间平滑t_region = t.mean('time')r = t_region.rolling(longitude=5,latitude=5,center=True)r.mean()

5.插值

#Xarray.DataArray.interp(coords= ,method= ,...)#算法有：多维：linear、nearest；一维：linear、nearest、zero、slinear、qadratic、cubic#1.一维插值t_oneday = t.sel(time='-01-01') #有四个时间 0，6，12，18t_oneday.timeimport pandas as pdnew_time = pd.data_range(t_oneday.time.values[0],t_oneday.time.values[-1],freq='3H')t_oneday.interp(time=new_time) #插值成功#2.多维插值t_region = t_t.sel(longitude=slice(100,101),latitude=slice(41,40)) #slice左右都是闭的import numpy as npnlat = np.linspace(41,40,11)nlon = np.linspace(100,101,11)t_region.interp(longitude=nlon,latitude=nlat)

六、高维数据可视化

画图对象：dataarray，使用matplotlib库

一维数据：line（）；二维数据：pcolormesh（）；其他：hist（）

1.时间序列可视化

import xarray as xrda = xr.open_dataset('....nc')t = da['t2m']-273.15t_point = t.isel(latitude=10,longitude=10) #10是位置t_point.plot()t_month = t_point.groupby('time.month').mean()t_month.plot()#多个点的时间序列tp = t.isel(latitude=[10,100,200],longitude=10)tp.plot() #结果是堆叠的histtp.plot(hue='latitude') #得到不同latitude的折线图

2.空间数据可视化

t_season = t.sel(longtitude=slice(90,120),latitude=slice(50,20)).groupby('time.season').mean()t_season.iseal(season = 0).plot()

3.多子图

t_season = t.sel(longtitude=slice(90,120),latitude=slice(50,20)).groupby('time.season').mean()g = t_season.plot(col='season',col_wrap=2) #col_wrap是两行的意思g.axestp = t.isel(latitude=[10,100,200],longitude=[10,100,200])tp.plot(row='latitude',col='longtitude')

七、数据的掩膜

把不要的数据隐藏起来的最简便的方法就是设为nan，计算和可视化过程都不参与

数值掩膜空间位置掩膜

1.数值掩膜

#xarray.where(cond,x,y,keep_attrs=None)import xarray as xrimport numpy as npds = xr.open_dataset('...nc')u = ds['u10']v = ds['v10']ws = np.sqrt(u**2+v**2)ws_region = ws.mean('time')ws_region.plot()xr.where(ws_region >5,ws_region,np.nan).plot()

2.空间位置掩膜

1.shp文件；2.salem掩膜工具

#1.陆地掩膜import geopandas as gpdimport salem land_shp = gpd.read_file('filename')land = ws_region.salem.roi(shape = land_shp) #region of interestland.plot()#2.海洋掩膜ocean_shp = gpd.read_file('filename')ocean = ws_region.salem.roi(shape = ocean_shp)ocean.plot()#3.行政区划掩膜china_shp = gpd.read_file('filename')china = ws_region.salem.roi(shape = china_shp)select = china_shp[china_shp['省'].isin(['广东省','河南省'])]

本内容不代表本网观点和政治立场，如有侵犯你的权益请联系我们处理。

网友评论

网友评论仅供其表达个人看法，并不表明网站立场。