Introduction to Pandas and Creating Series and DataFrame Objects


1. What is Pandas?

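Pandas is a powerful toolkit for analyzing structured data. It is built on NumPy, which provides high-performance array and matrix operations. Pandas is used for data mining and data analysis, and it also provides data-cleaning functionality. Its main data structures are Series (one-dimensional data) and DataFrame (two-dimensional data).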

2. Series

A Series is an object similar to a one-dimensional array. It consists of a set of data and an associated set of data labels (that is, the index).

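The syntax for creating a Series object is my_series = pd.Series(data, index=index), where data can be an ndarray, a dict, or a scalar. The following subsections cover each of these ways of creating a Series object.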

2.1 Creating a Series from an ndarray

The simplest Series object can be created from a single set of data. For example:

import numpy as np
import pandas as pd
my_series = pd.Series(np.array([4, -7, 6, -5, 3, 2]))
print(my_series)

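In the output of the code above, the left column is the index and the right column holds the values. If no index is specified when a Series object is created, pandas automatically creates an integer index from 0 to n-1, where n is the length of the data. Once the Series object has been created, its data and index are available through the values and index attributes. For example: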

import numpy as np
import pandas as pd
my_series = pd.Series(np.array([4, -7, 6, -5, 3, 2]))
print(my_series.values)
print(my_series.index)
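
To make the two attributes more concrete, the sketch below checks what kind of objects they are. It assumes a reasonably recent pandas release, where the automatically created index is a RangeIndex; the variable names values and index are only illustrative.

```python
import numpy as np
import pandas as pd

my_series = pd.Series(np.array([4, -7, 6, -5, 3, 2]))

# values holds the data as a NumPy array
values = my_series.values
print(type(values))

# With no explicit index, pandas builds a RangeIndex covering 0 .. n-1
index = my_series.index
print(type(index))
print(list(index))
```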

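In the examples above, no index was specified when the Series object was created, so one was generated automatically. Instead of relying on the generated index, we can also specify one explicitly. For example: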

import numpy as np
import pandas as pd
my_series = pd.Series(np.array([4, -7, 6, -5, 3, 2]), index=["a", "b", "c", "d", "e", "f"])
print(my_series)

In the code above, the list ["a", "b", "c", "d", "e", "f"] specifies the index of the Series object my_series.
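
One practical reason to assign labels is that values can then be looked up by label rather than by position. The following is a minimal sketch of that idea, reusing the same data; the names value_at_c and subset are hypothetical and not part of the original example.

```python
import numpy as np
import pandas as pd

my_series = pd.Series(
    np.array([4, -7, 6, -5, 3, 2]),
    index=["a", "b", "c", "d", "e", "f"],
)

# A single label returns the matching value
value_at_c = my_series["c"]
print(value_at_c)

# A list of labels returns a new Series in the requested order
subset = my_series[["f", "a"]]
print(subset)
```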

2.2 Creating a Series from a dict

A Series object can also be instantiated from a dict. For example:

import pandas as pd
my_dict = {"f": 2, "c": 6, "d": -5, "e": 3, "a": 4, "b": -7}
my_series = pd.Series(my_dict)
print(my_series)
print(my_series.values)
print(my_series.index)

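The output of the code above shows that when a Series object is instantiated from a dict, the dict's keys become the index of the Series object and the dict's values become its values. When the data is a dict and no index is specified, the order of the index depends on the versions in use. With Python >= 3.6 and pandas >= 0.23, the index follows the insertion order of the dict's {key: value} pairs. With Python < 3.6 or pandas < 0.23, the index is the lexically sorted list of the dict's keys.

An index can also be specified explicitly when a Series object is instantiated from a dict. When the specified index matches the dict's keys exactly, the labels and data of the Series object are unchanged; only their order follows the specified index. For example: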

import pandas as pd
my_dict = {"f": 2, "c": 6, "d": -5, "e": 3, "a": 4, "b": -7}
my_series = pd.Series(my_dict, index=["a", "b", "c", "d", "e", "f"])
print(my_series)
print(my_series.values)
print(my_series.index)

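When the specified index and the dict's keys do not match exactly, there are two cases. In the first case, the dict's key set is a proper subset of the specified index, and every index label that has no matching key gets the value NaN (which represents a missing value). For example: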

import pandas as pd
my_dict = {"f": 2, "c": 6, "d": -5, "e": 3, "a": 4, "b": -7}
my_series = pd.Series(my_dict, index=["a", "b", "c", "d", "e", "f", "g"])
print(my_series)
print(my_series.values)
print(my_series.index)

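In the code above, the label g in the specified index has no corresponding key, so its value is NaN. In the second case, the specified index is a proper subset of the dict's key set, and the keys that are not matched do not appear in the Series object. For example: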

import pandas as pd
my_dict = {"f": 2, "c": 6, "d": -5, "e": 3, "a": 4, "b": -7}
my_series = pd.Series(my_dict, index=["a", "b", "c", "d", "f"])
print(my_series)
print(my_series.values)
print(my_series.index)

In the code above, the key e is not matched, so neither e nor its value 3 appears in the resulting Series object.
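
Both cases can occur in the same index. As a rough sketch, passing an index to the constructor behaves here like building the Series first and then calling Series.reindex with the same labels. The labels list and the names from_constructor and from_reindex are assumptions chosen for illustration.

```python
import pandas as pd

my_dict = {"f": 2, "c": 6, "d": -5, "e": 3, "a": 4, "b": -7}
labels = ["a", "b", "g"]  # "g" is not a key; "c", "d", "e", "f" are left out

from_constructor = pd.Series(my_dict, index=labels)
from_reindex = pd.Series(my_dict).reindex(labels)

print(from_constructor)
# Both routes keep only the requested labels and fill "g" with NaN
print(from_constructor.equals(from_reindex))
```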

2.3 Creating a Series from a scalar

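When data is a scalar, an index needs to be specified so that pandas knows how many times to repeat the value; the scalar is repeated to match the length of the index. For example: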

import pandas as pd
my_series = pd.Series(3, index=["a", "b", "c", "d", "e", "f"])
print(my_series)
print(my_series.values)
print(my_series.index)

In the example above, the index has length 6, so 3 is repeated 6 times.

3. DataFrame

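A DataFrame is a tabular data structure in Pandas. It contains an ordered set of columns, and each column can hold a different value type (numeric, string, boolean, and so on). A DataFrame object has both a row index and a column index, and it can be viewed as a dict of Series objects.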
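
The "dict of Series objects" view can be illustrated with a small sketch. It builds a two-row DataFrame with the dict-of-Series constructor described in section 3.1 and then treats it like a dict; the two-row data and the name open_col are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Open": pd.Series([136, 137], index=["2021-07-01", "2021-07-02"]),
    "Close": pd.Series([137, 139], index=["2021-07-01", "2021-07-02"]),
})

# Selecting a column by its label returns a Series, much like a dict lookup
open_col = df["Open"]
print(type(open_col))

# The column shares the DataFrame's row index
print(open_col.index.equals(df.index))

# Iterating over a DataFrame yields its column labels, like dict keys
print(list(df))
```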

3.1 Creating a DataFrame from a dict of Series objects

import pandas as pd
d = {
    "Open": pd.Series([136, 137, 140, 143, 141, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "High": pd.Series([137, 140, 143, 144, 144, 145], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "Low": pd.Series([135, 137, 140, 142, 140, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "Close": pd.Series([137, 139, 142, 144, 143, 145], index = ['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
}
df = pd.DataFrame(d)
print(df)
print(df.index)
print(df.columns)
print(df.values)
print(type(df.values))

In the code above, d is a dict whose keys are Open, High, Low, and Close and whose values are four Series objects. In the resulting DataFrame object, Index(['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'], dtype='object') is the row index, Index(['Open', 'High', 'Low', 'Close'], dtype='object') is the column index, and the values are returned as an ndarray. In the example above, every Series object has the same index. If one Series object lacks the label '2021-07-09' and its value, the resulting DataFrame object holds NaN at that position. For example:

import pandas as pd
d = {
    "Open": pd.Series([136, 137, 140, 143, 141, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "High": pd.Series([137, 140, 143, 144, 144, 145], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "Low": pd.Series([135, 137, 140, 142, 140, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "Close": pd.Series([137, 139, 142, 144, 143], index = ['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08'])
}
df = pd.DataFrame(d)
print(df)
print(df.index)
print(df.columns)
print(df.values)
print(type(df.values))
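
To locate such gaps programmatically, one option is DataFrame.isna. The sketch below uses a shortened, three-day version of the data (an assumption made to keep the output small) and also shows a side effect: because NaN is a floating-point value, the incomplete column ends up stored as float64.

```python
import pandas as pd

d = {
    "Open": pd.Series([136, 137, 140], index=["2021-07-01", "2021-07-02", "2021-07-06"]),
    # "Close" has no entry for 2021-07-06
    "Close": pd.Series([137, 139], index=["2021-07-01", "2021-07-02"]),
}
df = pd.DataFrame(d)

# isna() marks the cells that were filled in as missing
missing = df.isna()
print(missing)

# NaN is a float, so the incomplete column is stored as float64
print(df.dtypes)
```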

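By default, the indexes of the Series objects become the row index of the DataFrame object, but the row index can also be specified explicitly, in which case the specified index takes precedence. For example: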

import pandas as pd
d = {
    "Open": pd.Series([136, 137, 140, 143, 141, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "High": pd.Series([137, 140, 143, 144, 144, 145], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "Low": pd.Series([135, 137, 140, 142, 140, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "Close": pd.Series([137, 139, 142, 144, 143, 145], index = ['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
}
df = pd.DataFrame(d, index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07'])
print(df)
print(df.index)
print(df.columns)
print(df.values)
print(type(df.values))

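In the example above, the specified index is a proper subset of the Series objects' index. The row index of the resulting DataFrame object is the specified index, and values for row labels that were not specified do not appear in the DataFrame object. Conversely, when the Series objects' index is a proper subset of the specified index, the resulting DataFrame object holds NaN for every label that is absent from the Series objects. For example: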

import pandas as pd
d = {
    "Open": pd.Series([136, 137, 140, 143, 141, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "High": pd.Series([137, 140, 143, 144, 144, 145], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "Low": pd.Series([135, 137, 140, 142, 140, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "Close": pd.Series([137, 139, 142, 144, 143, 145], index = ['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
}
df = pd.DataFrame(d, index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09', '2021-07-10'])
print(df)
print(df.index)
print(df.columns)
print(df.values)
print(type(df.values))

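In the example above, every value in the 2021-07-10 row is NaN. When the column index is not specified, the DataFrame object uses the dict's keys as its column index. As with the row index, the column index can be specified explicitly, and the specified index takes precedence. For example: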

import pandas as pd
d = {
    "Open": pd.Series([136, 137, 140, 143, 141, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "High": pd.Series([137, 140, 143, 144, 144, 145], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "Low": pd.Series([135, 137, 140, 142, 140, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "Close": pd.Series([137, 139, 142, 144, 143, 145], index = ['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
}
df = pd.DataFrame(d, index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'], columns=['Open', 'High', 'Low'])
print(df)
print(df.index)
print(df.columns)
print(df.values)
print(type(df.values))

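In the example above, the specified column index is a proper subset of the dict's key set. The column index of the resulting DataFrame object is the specified index, and values for keys that were not specified do not appear in the DataFrame object. When the dict's key set is a proper subset of the specified column index, the columns that have no matching key in the dict are filled with NaN. For example: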

import pandas as pd
d = {
    "Open": pd.Series([136, 137, 140, 143, 141, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "High": pd.Series([137, 140, 143, 144, 144, 145], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "Low": pd.Series([135, 137, 140, 142, 140, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
    "Close": pd.Series([137, 139, 142, 144, 143, 145], index = ['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
}
df = pd.DataFrame(d, index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'], columns=['Open', 'High', 'Low', 'Close', 'Volume'])
print(df)
print(df.index)
print(df.columns)
print(df.values)
print(type(df.values))

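In the example above, the specified column Volume has no corresponding data, so every value in the Volume column of the resulting DataFrame object is NaN. That covers creating a DataFrame object from a dict of Series objects; the following subsections describe several other ways to create one.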
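
The row and column rules from this section can also apply at the same time. The following sketch uses a reduced three-day data set (an assumption for brevity) and passes an index and a columns list that each drop one existing label and add one new label.

```python
import pandas as pd

dates = ["2021-07-01", "2021-07-02", "2021-07-06"]
d = {
    "Open": pd.Series([136, 137, 140], index=dates),
    "High": pd.Series([137, 140, 143], index=dates),
    "Close": pd.Series([137, 139, 142], index=dates),
}

df = pd.DataFrame(
    d,
    index=["2021-07-02", "2021-07-06", "2021-07-10"],  # drops 07-01, adds 07-10
    columns=["Open", "Close", "Volume"],               # drops High, adds Volume
)
print(df)

# The added row and the added column contain only missing values
print(df.loc["2021-07-10"].isna().all(), df["Volume"].isna().all())
```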

3.2 Creating a DataFrame from a dict of ndarrays

import pandas as pd
import numpy as np
d = {
    "Open": np.array([136, 137, 140, 143, 141, 142]),
    "High": np.array([137, 140, 143, 144, 144, 145]),
    "Low": np.array([135, 137, 140, 142, 140, 142]),
    "Close": np.array([137, 139, 142, 144, 143, 145])
}
df = pd.DataFrame(d)
print(df)

In the code above, no index is specified, so the default generated index is used. The index can also be specified explicitly:

import pandas as pd
import numpy as np
d = {
    "Open": np.array([136, 137, 140, 143, 141, 142]),
    "High": np.array([137, 140, 143, 144, 144, 145]),
    "Low": np.array([135, 137, 140, 142, 140, 142]),
    "Close": np.array([137, 139, 142, 144, 143, 145])
}
df = pd.DataFrame(d, index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
print(df)
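
Unlike Series objects, plain ndarrays carry no labels, so pandas has nothing to align on: the specified index has to have the same length as the arrays. The sketch below, using a shortened data set and a hypothetical raised flag, shows that a mismatch is rejected with a ValueError instead of being filled with NaN.

```python
import numpy as np
import pandas as pd

d = {
    "Open": np.array([136, 137, 140]),
    "Close": np.array([137, 139, 142]),
}

# Three values per column but only two index labels
try:
    pd.DataFrame(d, index=["2021-07-01", "2021-07-02"])
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```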

3.3 Creating a DataFrame from a dict of lists

import pandas as pd
d = {
    "Open": [136, 137, 140, 143, 141, 142],
    "High": [137, 140, 143, 144, 144, 145],
    "Low": [135, 137, 140, 142, 140, 142],
    "Close": [137, 139, 142, 144, 143, 145]
}
df = pd.DataFrame(d, index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
print(df)

In the code above, the DataFrame object is created from a dict of lists.

3.4 Creating a DataFrame from a list of dicts

import pandas as pd
d = [
    {"Open": 136, "High": 137, "Low": 135, "Close": 137},
    {"Open": 137, "High": 140, "Low": 137, "Close": 139},
    {"Open": 140, "High": 143, "Low": 140, "Close": 142},
    {"Open": 143, "High": 144, "Low": 142, "Close": 144},
    {"Open": 141, "High": 144, "Low": 140, "Close": 143},
    {"Open": 142, "High": 145, "Low": 142, "Close": 145}
]
df = pd.DataFrame(d, index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
print(df)

In the code above, the DataFrame object is created from a list of dicts; the keys of each dict become the column labels.
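
Because each dict is one row, the dicts do not all need the same keys. As a sketch, assuming a recent pandas release where columns appear in order of first occurrence, any key missing from a row is filled with NaN. The records list and the Volume value are made up for illustration.

```python
import pandas as pd

records = [
    {"Open": 136, "High": 137},
    {"Open": 137},                  # no "High" key in this record
    {"Open": 140, "Volume": 1000},  # illustrative column the other records lack
]
df = pd.DataFrame(records)
print(df)
```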

3.5 Creating a DataFrame from a list of tuples

import pandas as pd
d = [
    (136, 137, 135, 137),
    (137, 140, 137, 139),
    (140, 143, 140, 142),
    (143, 144, 142, 144),
    (141, 144, 140, 143),
    (142, 145, 142, 145)
]
df = pd.DataFrame(d, index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'], columns=['Open', 'High', 'Low', 'Close'])
print(df)

In the code above, the DataFrame object is created from a list of tuples; the index parameter specifies the row index and the columns parameter specifies the column index.
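
Tuples carry no labels of their own, so when index and columns are omitted, both axes fall back to default integer labels. A minimal sketch with two of the rows from the example above:

```python
import pandas as pd

rows = [
    (136, 137, 135, 137),
    (137, 140, 137, 139),
]

# No index or columns given: both axes get default integer labels
df = pd.DataFrame(rows)
print(df)
print(list(df.index), list(df.columns))
```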
