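1. Initialization
(1) Create a simple sequence with pd.Series
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1, 3, 5, np.nan, 6, 8])
>>> s
0    1.0
1    3.0
2    5.0
3    NaN    # note the missing value
4    6.0
5    8.0
dtype: float64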
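A minimal sketch of what the missing value does to the Series: because NaN is a floating-point value, the integers are stored as float64, and reductions such as mean() skip the NaN by default.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s.dtype)         # the integers were upcast to float64 to hold NaN
print(s.isna().sum())  # number of missing entries
print(s.mean())        # NaN is skipped by default (skipna=True)
```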
(2) Create a date sequence with pd.date_range
>>> dates = pd.date_range('0101', periods=6)
>>> dates
DatetimeIndex(['-01-01', '-01-02', '-01-03', '-01-04',
               '-01-05', '-01-06'],
              dtype='datetime64[ns]', freq='D')
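The following is a small sketch of pd.date_range with a fully specified start date. The date "2013-01-01" is an assumption chosen only for illustration; it is not taken from the text above.

```python
import pandas as pd

# Hypothetical start date, used only for illustration.
dates = pd.date_range("2013-01-01", periods=6)

print(dates[0])    # first timestamp in the range
print(dates[-1])   # last timestamp, five days after the first
print(len(dates))  # periods=6 yields six entries at daily frequency
```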
(3) DataFrame structure
>>> # index sets the row labels, columns sets the column names
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
>>> df
               A         B         C         D
-01-01  0.469112 -0.282863 -1.509059 -1.135632
-01-02  1.212112 -0.173215  0.119209 -1.044236
-01-03 -0.861849 -2.104569 -0.494929  1.071804
-01-04  0.721555 -0.706771 -1.039575  0.271860
-01-05 -0.424972  0.567020  0.276232 -1.087401
-01-06 -0.673690  0.113648 -1.478427  0.524988
>>> df2 = pd.DataFrame({'A': 1.,
...                     'B': pd.Timestamp('0102'),
...                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...                     'D': np.array([3] * 4, dtype='int32'),
...                     'E': pd.Categorical(["test", "train", "test", "train"]),
...                     'F': 'foo'})
>>> df2
     A       B    C  D      E    F
0  1.0  -01-02  1.0  3   test  foo
1  1.0  -01-02  1.0  3  train  foo
2  1.0  -01-02  1.0  3   test  foo
3  1.0  -01-02  1.0  3  train  foo
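A DataFrame built from a dict can hold a different dtype in each column, and scalar values are broadcast to every row. The sketch below illustrates this; the timestamp "2013-01-02" is a hypothetical value used only for the example.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "A": 1.0,                                   # scalar, broadcast to all rows
    "B": pd.Timestamp("2013-01-02"),            # hypothetical date for illustration
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",                                 # scalar string, also broadcast
})

# Each column keeps its own dtype.
print(df2.dtypes)
```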
2. Viewing data
(1) First n rows (head) and last n rows (tail)
>>> df.head(2)
               A         B         C         D
-01-01  0.469112 -0.282863 -1.509059 -1.135632
-01-02  1.212112 -0.173215  0.119209 -1.044236
>>> df.tail(3)
               A         B         C         D
-01-04  0.721555 -0.706771 -1.039575  0.271860
-01-05 -0.424972  0.567020  0.276232 -1.087401
-01-06 -0.673690  0.113648 -1.478427  0.524988
(2) Show the row labels (index), the column labels (columns), and the values (values)
>>> df.index
DatetimeIndex(['-01-01', '-01-02', '-01-03', '-01-04',
               '-01-05', '-01-06'],
              dtype='datetime64[ns]', freq='D')
>>> df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
>>> df.values
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])
(3) Quick summary statistics with describe
>>> df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804
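To make the rows of the describe() output easier to read, here is a minimal sketch on a tiny, made-up column whose statistics can be checked by hand. The data are illustrative only.

```python
import pandas as pd

# Made-up data: four evenly spaced values.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})
summary = df.describe()

# describe() returns a DataFrame indexed by statistic name.
print(list(summary.index))
print(summary.loc["mean", "A"])  # arithmetic mean of the column
print(summary.loc["50%", "A"])   # the 50% row is the median
```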
(4) Transpose: df.T
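As a quick sketch of what transposing does, the example below uses a small made-up frame: the row labels become column labels and vice versa, while each value keeps its (row, column) pairing in swapped order.

```python
import numpy as np
import pandas as pd

# Made-up 2x3 frame for illustration.
df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=["r0", "r1"], columns=list("ABC"))
t = df.T

print(t.shape)          # rows and columns are swapped
print(list(t.index))    # former column labels
print(list(t.columns))  # former row labels
```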
(5) Sort by axis labels
Descending: ascending=False
Ascending: ascending=True
By column labels (horizontal axis): df.sort_index(axis=1, ascending=False)
By row labels (vertical axis): df.sort_index(axis=0, ascending=False)
>>> df.sort_index(axis=1, ascending=False)
               D         C         B         A
-01-01 -1.135632 -1.509059 -0.282863  0.469112
-01-02 -1.044236  0.119209 -0.173215  1.212112
-01-03  1.071804 -0.494929 -2.104569 -0.861849
-01-04  0.271860 -1.039575 -0.706771  0.721555
-01-05 -1.087401  0.276232  0.567020 -0.424972
-01-06  0.524988 -1.478427  0.113648 -0.673690
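The difference between the two axes can be sketched on a small made-up frame whose labels start out unordered. Note that sort_index returns a new object and leaves the original untouched.

```python
import pandas as pd

# Made-up frame with unordered row and column labels.
df = pd.DataFrame({"B": [1, 2, 3], "A": [4, 5, 6], "C": [7, 8, 9]},
                  index=[2, 0, 1])

by_cols = df.sort_index(axis=1, ascending=False)  # reorder columns by label
by_rows = df.sort_index(axis=0, ascending=False)  # reorder rows by label

print(list(by_cols.columns))
print(list(by_rows.index))
```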
(6) Sort by values
>>> df.sort_values(by='B')
               A         B         C         D
-01-03 -0.861849 -2.104569 -0.494929  1.071804
-01-04  0.721555 -0.706771 -1.039575  0.271860
-01-01  0.469112 -0.282863 -1.509059 -1.135632
-01-02  1.212112 -0.173215  0.119209 -1.044236
-01-06 -0.673690  0.113648 -1.478427  0.524988
-01-05 -0.424972  0.567020  0.276232 -1.087401
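A minimal sketch of sort_values on made-up data: whole rows are reordered according to the values in the chosen column, and ascending=False reverses the order.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"A": [3, 1, 2], "B": [0.5, -2.0, 0.1]},
                  index=["x", "y", "z"])

asc = df.sort_values(by="B")                    # smallest B first
desc = df.sort_values(by="B", ascending=False)  # largest B first

print(list(asc.index))
print(list(desc.index))
```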
3. Selection (similar to MATLAB)
Select a single column (df.A is equivalent to df['A'])
Select a range of rows (df[0:3])
Select by label (df.loc[dates[0]])
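The three selection forms above can be sketched together as follows. The start date and the integer data are assumptions made so that the results are easy to check; df.iloc[0:3] is shown as an explicit positional alternative to df[0:3].

```python
import numpy as np
import pandas as pd

# Hypothetical dates and made-up integer data for illustration.
dates = pd.date_range("2013-01-01", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

col = df["A"]              # one column, returned as a Series (same as df.A)
first_three = df[0:3]      # rows at positions 0, 1, 2
same_rows = df.iloc[0:3]   # explicit positional form of the same slice
row = df.loc[dates[0]]     # one row selected by its label, as a Series

print(col.tolist())
print(first_three.shape)
print(row.tolist())
```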
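4. Missing data
Missing values are represented by NaN (np.nan)
Drop rows that contain missing data: df.dropna(how='any')
Fill in missing data: df.fillna(value=5)
Check which values are missing: pd.isna(df1)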
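The three operations above can be sketched on a small made-up frame. Here df1 is a hypothetical frame with one missing value in each column; none of the calls modify it in place.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing values.
df1 = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

dropped = df1.dropna(how="any")  # keep only rows with no missing value
filled = df1.fillna(value=5)     # replace every NaN with 5
mask = pd.isna(df1)              # boolean frame, True where a value is missing

print(dropped)
print(filled)
print(mask)
```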
5. Statistics
Compute the mean of each column: df.mean()
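As a small sketch on made-up data, mean() works column by column by default, axis=1 switches to row by row, and missing values are skipped in both cases.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, np.nan, 8.0]})

col_means = df.mean()        # one value per column; NaN is skipped
row_means = df.mean(axis=1)  # one value per row

print(col_means)
print(row_means)
```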
6. Applying functions
>>> df.apply(lambda x: x.max() - x.min())
A    2.073961
B    2.671590
C    1.785291
D    0.000000
F    4.000000
dtype: float64
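To show what the lambda receives, here is a minimal sketch on made-up data. By default apply passes each column to the function as a Series; with axis=1 it passes each row instead.

```python
import pandas as pd

# Made-up data: column B is constant, so its range is zero.
df = pd.DataFrame({"A": [1.0, 4.0, 2.0], "B": [10.0, 10.0, 10.0]})

col_range = df.apply(lambda x: x.max() - x.min())          # per column
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)  # per row

print(col_range)
print(row_range)
```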