Pandas is an open source data analysis tool, widely used for data cleaning, data processing, and data analysis. Anyone who regularly works with data is probably familiar with it. As data volumes grow, the limitations of Pandas become more prominent, and it can be frustrating to use on big data, so people turn to better-suited tools such as big data processing frameworks like PySpark. Pandas 2.0 is a major new version of the Pandas library, and it brings some important improvements and new features. Let's take a look at what this release has to offer.
Native time zone support
Pandas 2.0 supports time zone-aware timestamps natively, which makes time series data easier to handle. This support already existed in earlier releases and carries over to 2.0. Pandas can convert between time zones and perform calculations on time zone-aware values.
import pandas as pd

# Create a series of time zone-aware timestamps
ts = pd.Series(pd.date_range('2022-01-01 00:00:00', periods=3, freq='H', tz='Europe/London'))

# Convert the series to another time zone
ts_utc = ts.dt.tz_convert('UTC')

# Show the original and the converted timestamps
print(ts)
print(ts_utc)

Output:

0   2022-01-01 00:00:00+00:00
1   2022-01-01 01:00:00+00:00
2   2022-01-01 02:00:00+00:00
dtype: datetime64[ns, Europe/London]
0   2022-01-01 00:00:00+00:00
1   2022-01-01 01:00:00+00:00
2   2022-01-01 02:00:00+00:00
dtype: datetime64[ns, UTC]
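The January timestamps above look identical in both zones because London is on GMT in winter. As a minimal sketch of a calculation where the time zone actually matters, the example below (with dates chosen for illustration, not taken from the original article) localizes two naive timestamps around the UK switch to summer time on 2022-03-27 and measures the elapsed time between them.

```python
import pandas as pd

# Two naive timestamps (no time zone attached), one on each side of
# the UK daylight saving change on 2022-03-27.
naive = pd.Series(pd.to_datetime(["2022-03-26 12:00", "2022-03-27 12:00"]))

# Attach the London time zone, then convert to UTC.
london = naive.dt.tz_localize("Europe/London")
utc = london.dt.tz_convert("UTC")

# Local noon to local noon is only 23 hours of absolute time,
# because one hour is skipped when the clocks go forward.
elapsed = utc.iloc[1] - utc.iloc[0]

print(london)
print(utc)
print(elapsed)
```

`tz_localize` attaches a zone to naive values, while `tz_convert` changes the zone of values that already have one.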
Typed columns
Pandas 2.0 supports typed columns, which help users manage data types and process data more efficiently. Users can specify the data type of each column when creating a data frame (explicit dtypes were also accepted by earlier releases), which makes the types in the data frame clear and explicit.
import pandas as pd
import numpy as np

# Create a data frame with explicitly typed columns
df = pd.DataFrame({
    'A': pd.Series(np.random.randn(5), dtype='float32'),
    'B': pd.Series(np.random.randint(0, 10, 5), dtype='int32'),
    'C': pd.Series(np.random.choice(['foo', 'bar', 'baz'], 5), dtype='category')
})

# Show the data type of each column
print(df.dtypes)

Output:

A     float32
B       int32
C    category
dtype: object
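To illustrate the efficiency point, here is a minimal sketch that applies the same three dtypes to an existing data frame with `astype` and compares memory use before and after. The 1,000-row size and the seeded random generator are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the data is reproducible

# A data frame with pandas' default dtypes (64-bit numbers, plain strings).
df = pd.DataFrame({
    "A": rng.standard_normal(1000),
    "B": rng.integers(0, 10, 1000),
    "C": rng.choice(["foo", "bar", "baz"], 1000),
})

# Convert several columns at once by passing a column-to-dtype mapping.
typed = df.astype({"A": "float32", "B": "int32", "C": "category"})

# deep=True also counts the memory held by the string values.
before = df.memory_usage(deep=True).sum()
after = typed.memory_usage(deep=True).sum()

print(typed.dtypes)
print(f"before: {before} bytes, after: {after} bytes")
```

Narrower numeric types lose range and precision, so this trade-off only makes sense when the values are known to fit.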
Nullability of Datetime types
Pandas 2.0 treats datetime columns as nullable, which helps users handle missing values. A missing datetime is represented by NaT ("not a time"), a marker that earlier releases also provided, so such values need no additional processing.
import pandas as pd

# Create a datetime series that contains a missing value
dt = pd.Series([pd.Timestamp('2022-01-01'), pd.NaT, pd.Timestamp('2022-01-03')])

# Show the series with the missing value
print(dt)

Output:

0   2022-01-01
1          NaT
2   2022-01-03
dtype: datetime64[ns]
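As a short sketch of what handling these values "without additional processing" can look like, the example below reuses the same three-element series and applies a few standard missing-value operations to it. The choice of operations is illustrative, not taken from the original article.

```python
import pandas as pd

dt = pd.Series([pd.Timestamp("2022-01-01"), pd.NaT, pd.Timestamp("2022-01-03")])

# Flag the missing entries.
missing = dt.isna()

# Fill each gap with the previous valid timestamp.
filled = dt.ffill()

# Reductions such as min and max skip NaT by default.
span = dt.max() - dt.min()

print(missing)
print(filled)
print(span)
```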
Improved grouping operations
Pandas 2.0 offers flexible grouping operations that make it easy to group and aggregate data. Users can group a data frame by several columns at once and aggregate the remaining columns in a single call, which keeps data processing code short.
import pandas as pd
import numpy as np

# Create a data frame with several columns
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8)
})

# Group by columns A and B and take the mean of the remaining columns (C and D)
grouped = df.groupby(['A', 'B']).mean()

# Show the grouped data
print(grouped)

Output (truncated in the source; the values differ on every run because the data is random):

                  C         D
A   B
bar one    1.224593  1.277185
    three -0.672583  0
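To show several aggregations computed in one pass, here is a minimal sketch using named aggregation. It keeps the A and B keys from the example above but swaps the random numbers for fixed values so the result is reproducible. The output column names (`c_mean`, `d_sum`, `n`) are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "D": [10, 20, 30, 40, 50, 60, 70, 80],
})

# Named aggregation: each keyword becomes an output column, defined by
# a (source column, aggregation function) pair.
summary = df.groupby(["A", "B"]).agg(
    c_mean=("C", "mean"),
    d_sum=("D", "sum"),
    n=("C", "size"),
)

print(summary)
```

Each group gets its own row, indexed by the (A, B) pair, so a single value can be looked up with `summary.loc[("foo", "one"), "c_mean"]`.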
Improved IO performance
Pandas 2.0 improves IO performance, allowing users to read and write data more quickly. It copes better with large datasets, in part because readers such as read_csv can now return PyArrow-backed columns, and it continues to support compressed files when reading and writing.
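To make the reading and writing claims concrete, here is a minimal sketch that writes a gzip-compressed CSV and reads it back in chunks, so the whole file never has to sit in memory at once. The temporary file and the column names (`id`, `value`) are made up for the illustration. Both `compression` and `chunksize` are long-standing pandas options rather than 2.0 additions. Pandas 2.0 additionally accepts `dtype_backend="pyarrow"` in `read_csv`, which requires the optional pyarrow package and is not used here.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Sample data; the column names are assumptions for this sketch.
df = pd.DataFrame({"id": np.arange(10_000), "value": np.arange(10_000) * 0.5})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv.gz")

    # Write a gzip-compressed CSV.
    df.to_csv(path, index=False, compression="gzip")

    # Read it back 2,500 rows at a time; the compression is inferred
    # from the .gz extension.
    total = 0.0
    rows = 0
    with pd.read_csv(path, chunksize=2_500) as reader:
        for chunk in reader:
            total += chunk["value"].sum()
            rows += len(chunk)

print(rows, total)
```

Chunked reading trades some speed for a bounded memory footprint, which suits aggregations that can be computed incrementally.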
Better memory management
Pandas 2.0 improves memory management, making it easier for users to work with large datasets. Examples include optional PyArrow-backed data types and continued work on Copy-on-Write, which avoids unnecessary copies of data.
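Independently of version-specific internals, users can inspect and reduce a dataset's footprint themselves. The sketch below is a general technique rather than a 2.0 feature. It uses an invented series of small integers, measures its size with `memory_usage`, and shrinks it with `pd.to_numeric(..., downcast=...)`.

```python
import numpy as np
import pandas as pd

# 100,000 small integers stored as 64-bit values (illustrative data).
s = pd.Series(np.arange(100_000, dtype="int64") % 100)

# Downcast to the smallest unsigned integer type that holds every value.
small = pd.to_numeric(s, downcast="unsigned")

# index=False measures only the values, not the index.
before = s.memory_usage(index=False)
after = small.memory_usage(index=False)

print(s.dtype, before)
print(small.dtype, after)
```

The values are unchanged by the downcast, and only the storage width differs.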
Summary
Pandas 2.0 brings many important improvements and new features that make Pandas even more powerful and flexible. If you are a data analyst or data scientist, then Pandas 2.0 is definitely worth a try!
The new highlights in Pandas 2.0 first appeared in Note of the Lost Little Bookboy.
This article is reposted from https://xugaoxiang.com/2023/03/22/pandas-2-features/