What’s New in Pandas 2.0

Pandas is an open-source data analysis library, widely used for data cleaning, data processing, and data analysis. Anyone who works with data regularly will already be familiar with it. As data volumes keep growing, the limitations of pandas become more prominent: it can be painful to use on very large datasets, and in those cases people often reach for a more suitable tool, such as a big data processing framework like PySpark. Pandas 2.0 is a major release of the library and brings some important improvements and new features. Let's take a look at what has changed.

Native time zone support

Pandas 2.0 supports time zone-aware time series natively (this support was already present in earlier releases and carries over). A timestamp column can carry its time zone in the dtype, be converted to another time zone, and take part in calculations without losing the zone information.

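    import pandas as pd

    # Create a series of time zone-aware timestamps.
    # 'h' is the hourly alias; the uppercase 'H' is deprecated since pandas 2.2.
    ts = pd.Series(pd.date_range('2022-01-01 00:00:00', periods=3, freq='h', tz='Europe/London'))

    # Convert the series to another time zone
    ts_utc = ts.dt.tz_convert('UTC')

    # Show the original and the converted timestamps
    print(ts)
    print(ts_utc)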

Output:

    0   2022-01-01 00:00:00+00:00
    1   2022-01-01 01:00:00+00:00
    2   2022-01-01 02:00:00+00:00
    dtype: datetime64[ns, Europe/London]
    0   2022-01-01 00:00:00+00:00
    1   2022-01-01 01:00:00+00:00
    2   2022-01-01 02:00:00+00:00
    dtype: datetime64[ns, UTC]
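London is on UTC in January, which is why the two outputs above look identical apart from the dtype. The difference between attaching a zone and converting to one is easier to see with a zone that has a non-zero offset. The following is a minimal sketch; the `Asia/Tokyo` target and the variable names are illustrative choices, not part of the original example.

```python
import pandas as pd

# Naive timestamps: no time zone attached yet.
naive = pd.Series(pd.date_range("2022-01-01 00:00:00", periods=3, freq="h"))

# tz_localize attaches a zone and keeps the wall-clock time unchanged.
london = naive.dt.tz_localize("Europe/London")

# tz_convert keeps the instant and changes the wall-clock representation.
tokyo = london.dt.tz_convert("Asia/Tokyo")

# Aware timestamps in different zones still compare and subtract by instant.
same_instant = london.iloc[0] == tokyo.iloc[0]
elapsed = tokyo.iloc[2] - london.iloc[0]

print(tokyo)
print(same_instant)
print(elapsed)
```

Mixing naive and aware timestamps in one comparison raises an error, so it usually pays to localize data as soon as it is loaded.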

Typed columns

Every column in a pandas data frame has an explicit data type, and choosing it deliberately helps both correctness and efficiency. Users can specify the data type of each column when creating a data frame, which makes the schema explicit and can reduce memory use. Explicit dtypes were available before 2.0; what 2.0 adds in this area is broader support for PyArrow-backed data types.

    import pandas as pd
    import numpy as np

    # Create a data frame with an explicit dtype for each column
    df = pd.DataFrame({
        'A': pd.Series(np.random.randn(5), dtype='float32'),
        'B': pd.Series(np.random.randint(0, 10, 5), dtype='int32'),
        'C': pd.Series(np.random.choice(['foo', 'bar', 'baz'], 5), dtype='category')
    })

    # Show the data type of each column
    print(df.dtypes)

Output:

    A     float32
    B       int32
    C    category
    dtype: object
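Data often arrives with default dtypes and is typed afterwards. As a small sketch of that workflow, the example below builds a frame with whatever dtypes pandas infers and then casts all columns in one call with `DataFrame.astype` and a column-to-dtype mapping. The seeded generator and the names `raw` and `typed` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is repeatable.
rng = np.random.default_rng(0)

# No dtypes given: pandas infers 64-bit numbers and a generic string column.
raw = pd.DataFrame({
    "A": rng.standard_normal(5),
    "B": rng.integers(0, 10, 5),
    "C": rng.choice(["foo", "bar", "baz"], 5),
})

# Cast every column at once with a column -> dtype mapping.
typed = raw.astype({"A": "float32", "B": "int32", "C": "category"})

print(raw.dtypes)
print(typed.dtypes)
```

`astype` returns a new data frame and leaves `raw` untouched, so the two schemas can be compared side by side.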

Nullability of Datetime types

Datetime columns in pandas are nullable: a missing value is represented by pd.NaT ("not a time"), so a series can mix valid timestamps and missing ones while keeping its datetime64 dtype. This behaviour is not new in 2.0, but it means missing datetimes need no special-case handling.

    import pandas as pd

    # Create a datetime series that contains a missing value
    dt = pd.Series([pd.Timestamp('2022-01-01'), pd.NaT, pd.Timestamp('2022-01-03')])

    # Show the series, including the missing value
    print(dt)

Output:

    0   2022-01-01
    1          NaT
    2   2022-01-03
    dtype: datetime64[ns]
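To illustrate what handling missing datetimes looks like in practice, here is a short sketch that detects, fills, and skips the `NaT` entry from the series above. The fill value `2022-01-02` is an arbitrary choice for the example.

```python
import pandas as pd

dt = pd.Series([pd.Timestamp("2022-01-01"), pd.NaT, pd.Timestamp("2022-01-03")])

# NaT is detected by the same isna() used for other missing values.
missing = dt.isna()

# Fill the gap with an explicit timestamp; the dtype stays datetime64.
filled = dt.fillna(pd.Timestamp("2022-01-02"))

# Reductions skip NaT by default, so no pre-filtering is needed.
span = dt.max() - dt.min()

print(missing.tolist())
print(filled)
print(span)
```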

Improved grouping operations

Grouping and aggregation remain central to pandas in 2.0. Users can group a data frame by several columns at once and aggregate the remaining columns in a single call, which keeps multi-key summaries short and readable.

    import pandas as pd
    import numpy as np

    # Create a data frame with several columns
    df = pd.DataFrame({
        'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
        'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
        'C': np.random.randn(8),
        'D': np.random.randn(8)
    })

    # Group by columns A and B and take the mean of the remaining columns (C and D)
    grouped = df.groupby(['A', 'B']).mean()

    # Show the grouped result
    print(grouped)

Output (truncated in the source; the values change on every run because the data is random):

                      C         D
    A   B
    bar one    1.224593  1.277185
        three -0.672583  0
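When different columns need different aggregations, named aggregation keeps everything in one `groupby` call. The sketch below reuses the same key columns; the output column names (`c_mean`, `d_max`, `rows`) and the seeded generator are assumptions for illustration, and named aggregation itself is a long-standing pandas feature rather than something new in 2.0.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is repeatable.
rng = np.random.default_rng(42)

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": rng.standard_normal(8),
    "D": rng.standard_normal(8),
})

# Each keyword names an output column: (source column, aggregation).
summary = df.groupby(["A", "B"]).agg(
    c_mean=("C", "mean"),
    d_max=("D", "max"),
    rows=("C", "count"),
)

print(summary)
```

The result is indexed by the (A, B) pairs that actually occur in the data, with one column per named aggregation.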

Improved IO performance

Pandas 2.0 aims to make reading and writing data faster. Readers such as read_csv can optionally use PyArrow to parse files and to back the resulting columns, which helps with large datasets, and the readers and writers handle common compression formats transparently.
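As a rough sketch of compressed IO, the example below writes a data frame to a gzip-compressed CSV and reads it back with explicit column dtypes. The file name, the column names, and the temporary directory are illustrative assumptions. It sticks to the default parser so that it runs without optional dependencies; if PyArrow is installed, passing `engine="pyarrow"` to `read_csv` is one option worth benchmarking.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"id": np.arange(1000), "value": np.arange(1000) * 0.5})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv.gz")

    # Write a gzip-compressed CSV without the index column.
    df.to_csv(path, index=False, compression="gzip")
    size_on_disk = os.path.getsize(path)

    # Compression is inferred from the .gz suffix; dtypes are set up front
    # so pandas does not have to guess them or upcast to 64 bits.
    restored = pd.read_csv(path, dtype={"id": "int32", "value": "float32"})

print(size_on_disk)
print(restored.dtypes)
```

Declaring dtypes at read time avoids a second pass to downcast the columns after loading.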

Better memory management

Pandas 2.0 improves memory management, making it easier to work with large datasets. The main levers are the opt-in Copy-on-Write mode, which avoids unnecessary defensive copies, and more compact column types, which lower the memory footprint of a data frame.
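One way to see the effect of column types on memory is to measure a data frame before and after choosing more compact dtypes. This is a minimal sketch with made-up data; the column names and sizes are assumptions, and the exact byte counts depend on the pandas version and platform.

```python
import pandas as pd

# 100,000 rows with a low-cardinality text column and a small integer column.
df = pd.DataFrame({
    "city": ["London", "Paris", "Tokyo", "Paris"] * 25_000,
    "visits": range(100_000),
})

# deep=True also counts the memory held by the string values themselves.
before = df.memory_usage(deep=True).sum()

compact = df.assign(
    # Repeated strings are stored once, plus small integer codes per row.
    city=df["city"].astype("category"),
    # Pick the smallest unsigned integer type that fits the values.
    visits=pd.to_numeric(df["visits"], downcast="unsigned"),
)
after = compact.memory_usage(deep=True).sum()

print(before, after)
print(compact.dtypes)
```

The categorical conversion pays off only when a column has few distinct values relative to its length; for mostly unique strings it can cost more memory than it saves.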

Summary

Pandas 2.0 brings many important improvements and new features that make the library more powerful and flexible. If you are a data analyst or data scientist, Pandas 2.0 is definitely worth a try!

The post "The new highlights in Pandas 2.0" first appeared on Note of the Lost Little Bookboy.

This article is reposted from https://xugaoxiang.com/2023/03/22/pandas-2-features/