Advanced Pandas Tutorial: Time Handling

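Introduction
Time is one of the most frequently used data types in data processing. Beyond the datetime64 and timedelta64 data types from NumPy, pandas also consolidates features from other Python libraries such as scikits.timeseries.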
Time categories
pandas has four time-related types:

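  1. Date times: a date and time, optionally with a time zone. Similar to datetime.datetime from the standard library.
  2. Time deltas: an absolute duration. Similar to datetime.timedelta from the standard library.
  3. Time spans: a span of time defined by a point in time and its associated frequency.
  4. Date offsets: a relative duration that follows calendar arithmetic. Similar to dateutil.relativedelta.relativedelta.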
The four types are summarized in this table:

| Type         | Scalar class | Array class    | pandas data type                     | Primary creation method         |
|--------------|--------------|----------------|--------------------------------------|---------------------------------|
| Date times   | Timestamp    | DatetimeIndex  | datetime64[ns] or datetime64[ns, tz] | to_datetime or date_range       |
| Time deltas  | Timedelta    | TimedeltaIndex | timedelta64[ns]                      | to_timedelta or timedelta_range |
| Time spans   | Period       | PeriodIndex    | period[freq]                         | Period or period_range          |
| Date offsets | DateOffset   | None           | None                                 | DateOffset                      |
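
To make the four types concrete, here is a minimal sketch that builds one scalar of each kind and combines them. The variable names and the specific dates are illustrative choices, not taken from the original article.

```python
import pandas as pd

ts = pd.Timestamp("2012-05-01 09:30")    # date time: a single point in time
delta = pd.Timedelta(hours=36)           # time delta: an absolute duration
span = pd.Period("2012-05", freq="M")    # time span: the whole month of May 2012
offset = pd.DateOffset(months=1)         # date offset: a calendar-aware shift

shifted_abs = ts + delta    # moves forward by exactly 36 hours
shifted_cal = ts + offset   # moves to the same day and time in the next month
in_span = span.start_time <= ts <= span.end_time  # does ts fall inside the span?

print(shifted_abs, shifted_cal, in_span)
```

The Timedelta moves the timestamp by a fixed amount of elapsed time, while the DateOffset moves it by one calendar month regardless of how many days that month has.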
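Here is a usage example (the sessions in this article assume import datetime, import numpy as np, and import pandas as pd):
In [19]: pd.Series(range(3), index=pd.date_range("2000", freq="D", periods=3))
Out[19]:
2000-01-01    0
2000-01-02    1
2000-01-03    2
Freq: D, dtype: int64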

The null value for the types above is NaT (not a time):
In [24]: pd.Timestamp(pd.NaT)
Out[24]: NaT

In [25]: pd.Timedelta(pd.NaT)
Out[25]: NaT

In [26]: pd.Period(pd.NaT)
Out[26]: NaT

# Equality acts as np.nan would
In [27]: pd.NaT == pd.NaT
Out[27]: False
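
Because NaT never compares equal to itself, missing timestamps have to be detected with isna rather than ==. The following is a small sketch under that assumption; the sample data and the fill value are made up for illustration.

```python
import pandas as pd

# A datetime Series with one missing entry (None becomes NaT)
s = pd.Series(pd.to_datetime(["2012-05-01", None, "2012-05-03"]))

mask = s.isna()                   # element-wise missing-value check
n_missing = int(mask.sum())       # number of NaT entries
filled = s.fillna(pd.Timestamp("2012-05-02"))  # replace NaT with a chosen value

print(mask.tolist(), n_missing)
print(pd.NaT == pd.NaT, pd.isna(pd.NaT))
```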

Timestamp
Timestamp is the most basic time type. It can be created from a datetime object, from a string, or from individual components:
In [28]: pd.Timestamp(datetime.datetime(2012, 5, 1))
Out[28]: Timestamp('2012-05-01 00:00:00')

In [29]: pd.Timestamp("2012-05-01")
Out[29]: Timestamp('2012-05-01 00:00:00')

In [30]: pd.Timestamp(2012, 5, 1)
Out[30]: Timestamp('2012-05-01 00:00:00')
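
As a quick check, the three construction styles above can be compared directly, and the resulting Timestamp exposes its calendar fields as attributes. This is only a sketch; the variable names are illustrative.

```python
import datetime

import pandas as pd

a = pd.Timestamp(datetime.datetime(2012, 5, 1))  # from a datetime object
b = pd.Timestamp("2012-05-01")                   # from a string
c = pd.Timestamp(2012, 5, 1)                     # from year, month, day

same = (a == b == c)  # all three describe the same instant
fields = (a.year, a.month, a.day, a.day_name())

print(same, fields)
```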

DatetimeIndex
A list of Timestamp objects used as an index is automatically converted to a DatetimeIndex:
In [33]: dates = [
   ....:     pd.Timestamp("2012-05-01"),
   ....:     pd.Timestamp("2012-05-02"),
   ....:     pd.Timestamp("2012-05-03"),
   ....: ]
   ....:

In [34]: ts = pd.Series(np.random.randn(3), dates)

In [35]: type(ts.index)
Out[35]: pandas.core.indexes.datetimes.DatetimeIndex

In [36]: ts.index
Out[36]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)

In [37]: ts
Out[37]:
2012-05-01    0.469112
2012-05-02   -0.282863
2012-05-03   -1.509059
dtype: float64
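
Once a Series carries a DatetimeIndex, values can be looked up by Timestamp and the index exposes vectorized date fields. The sketch below uses fixed values instead of random numbers so the result is reproducible; that substitution is an assumption made for illustration.

```python
import pandas as pd

dates = [
    pd.Timestamp("2012-05-01"),
    pd.Timestamp("2012-05-02"),
    pd.Timestamp("2012-05-03"),
]
# Fixed values (instead of np.random.randn) keep the example deterministic
ts = pd.Series([10.0, 20.0, 30.0], index=dates)

is_dt_index = isinstance(ts.index, pd.DatetimeIndex)
value = ts.loc[pd.Timestamp("2012-05-02")]  # label-based lookup by timestamp
days = ts.index.day.tolist()                # day-of-month for every index entry

print(is_dt_index, value, days)
```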

date_range and bdate_range
A DatetimeIndex can also be created with date_range:
In [74]: start = datetime.datetime(2011, 1, 1)

In [75]: end = datetime.datetime(2012, 1, 1)

In [76]: index = pd.date_range(start, end)

In [77]: index
Out[77]:
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',
               '2011-01-09', '2011-01-10',
               ...
               '2011-12-23', '2011-12-24', '2011-12-25', '2011-12-26',
               '2011-12-27', '2011-12-28', '2011-12-29', '2011-12-30',
               '2011-12-31', '2012-01-01'],
              dtype='datetime64[ns]', length=366, freq='D')

date_range defaults to calendar days, whereas bdate_range defaults to business days:
In [78]: index = pd.bdate_range(start, end)

In [79]: index
Out[79]:
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
               '2011-01-07', '2011-01-10', '2011-01-11', '2011-01-12',
               '2011-01-13', '2011-01-14',
               ...
               '2011-12-19', '2011-12-20', '2011-12-21', '2011-12-22',
               '2011-12-23', '2011-12-26', '2011-12-27', '2011-12-28',
               '2011-12-29', '2011-12-30'],
              dtype='datetime64[ns]', length=260, freq='B')

Both functions accept the start, end, and periods parameters, together with freq:
In [84]: pd.bdate_range(end=end, periods=20)
In [83]: pd.date_range(start, end, freq="W")
In [86]: pd.date_range("2018-01-01", "2018-01-05", periods=5)
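
The calls above omit their output, so here is a small sketch showing that different combinations of start, end, periods, and freq can describe the same range. The dates are illustrative and not from the original article.

```python
import pandas as pd

# start + end (freq defaults to calendar days)
by_end = pd.date_range("2018-01-01", "2018-01-05")
# start + periods + freq
by_periods = pd.date_range("2018-01-01", periods=5, freq="D")
# end + periods + freq, counting backwards from the end
backwards = pd.date_range(end="2018-01-05", periods=5, freq="D")
# business days only: the weekend of 6-7 January 2018 is skipped
business = pd.bdate_range("2018-01-01", "2018-01-07")

print(list(by_end) == list(by_periods) == list(backwards))
print(business)
```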

origin
The origin parameter of to_datetime changes the reference point from which numeric values are counted:
In [67]: pd.to_datetime([1, 2, 3], unit="D", origin=pd.Timestamp("1960-01-01"))
Out[67]: DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)

By default origin='unix', meaning the reference point is 1970-01-01 00:00:00.
In [68]: pd.to_datetime([1, 2, 3], unit="D")
Out[68]: DatetimeIndex(['1970-01-02', '1970-01-03', '1970-01-04'], dtype='datetime64[ns]', freq=None)
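
One way to read the origin parameter is as plain addition: each number is multiplied by the unit and added to the origin. The sketch below checks that reading by hand; the manual list is an illustrative construction.

```python
import pandas as pd

origin = pd.Timestamp("1960-01-01")
idx = pd.to_datetime([1, 2, 3], unit="D", origin=origin)

# The same result computed manually: origin + n days
manual = [origin + pd.Timedelta(days=n) for n in [1, 2, 3]]

# With the default origin, day 1 is one day after the Unix epoch
unix_day_one = pd.to_datetime([1], unit="D")[0]

print(list(idx) == manual, unix_day_one)
```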

Formatting
The format parameter tells to_datetime how to parse a time string:
In [51]: pd.to_datetime("2010/11/12", format="%Y/%m/%d")
Out[51]: Timestamp('2010-11-12 00:00:00')

In [52]: pd.to_datetime("12-11-2010 00:00", format="%d-%m-%Y %H:%M")
Out[52]: Timestamp('2010-11-12 00:00:00')
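
An explicit format matters most when a string is ambiguous. As a sketch, the same text parses to two different dates depending on whether the day or the month comes first, and strftime goes the other way, from Timestamp back to text.

```python
import pandas as pd

text_in = "12-11-2010 00:00"

day_first = pd.to_datetime(text_in, format="%d-%m-%Y %H:%M")    # 12 November
month_first = pd.to_datetime(text_in, format="%m-%d-%Y %H:%M")  # 11 December

# Render a Timestamp back to a string with the same directive syntax
text_out = day_first.strftime("%Y/%m/%d")

print(day_first, month_first, text_out)
```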

Period
A Period represents a span of time and is usually used together with freq:
In [31]: pd.Period("2011-01")
Out[31]: Period('2011-01', 'M')

In [32]: pd.Period("2012-05", freq="D")
Out[32]: Period('2012-05-01', 'D')

Arithmetic works directly on a Period: adding or subtracting an integer shifts it by that many multiples of its freq:
In [345]: p = pd.Period("2012", freq="A-DEC")

In [346]: p + 1
Out[346]: Period('2013', 'A-DEC')

In [347]: p - 3
Out[347]: Period('2009', 'A-DEC')

In [348]: p = pd.Period("2012-01", freq="2M")

In [349]: p + 2
Out[349]: Period('2012-05', '2M')

In [350]: p - 1
Out[350]: Period('2011-11', '2M')

Note that offsets and timedelta-like values can only be added to a Period when they are compatible with its freq, so that the result keeps the same freq. Otherwise an error is raised.
In [352]: p = pd.Period("2014-07-01 09:00", freq="H")

In [353]: p + pd.offsets.Hour(2)
Out[353]: Period('2014-07-01 11:00', 'H')

In [354]: p + datetime.timedelta(minutes=120)
Out[354]: Period('2014-07-01 11:00', 'H')

In [355]: p + np.timedelta64(7200, "s")
Out[355]: Period('2014-07-01 11:00', 'H')
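
The compatibility rule is easiest to see with a failing case. This sketch adds a whole number of hours to an hourly Period, which works, and then half an hour, which cannot be expressed at hourly frequency. The lowercase alias "h" is used because recent pandas releases prefer it over "H". The exact exception class may vary between versions, so both ValueError and TypeError are caught here as an assumption.

```python
import pandas as pd

p = pd.Period("2014-07-01 09:00", freq="h")

# Two hours is a whole multiple of the hourly frequency
ok = p + pd.Timedelta(hours=2)

# Thirty minutes is not, so the addition is rejected
try:
    p + pd.Timedelta(minutes=30)
    failed = False
except (ValueError, TypeError):
    failed = True

print(ok, failed)
```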

A list of Period objects used as an index is automatically converted to a PeriodIndex:
In [38]: periods = [pd.Period("2012-01"), pd.Period("2012-02"), pd.Period("2012-03")]

In [39]: ts = pd.Series(np.random.randn(3), periods)

In [40]: type(ts.index)
Out[40]: pandas.core.indexes.period.PeriodIndex

In [41]: ts.index
Out[41]: PeriodIndex(['2012-01', '2012-02', '2012-03'], dtype='period[M]', freq='M')

In [42]: ts
Out[42]:
2012-01   -1.135632
2012-02    1.212112
2012-03   -0.173215
Freq: M, dtype: float64

A PeriodIndex can be created with pd.period_range:
In [359]: prng = pd.period_range("1/1/2011", "1/1/2012", freq="M")

In [360]: prng
Out[360]:
PeriodIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05', '2011-06',
             '2011-07', '2011-08', '2011-09', '2011-10', '2011-11', '2011-12',
             '2012-01'],
            dtype='period[M]', freq='M')

It can also be created directly with the PeriodIndex constructor:
In [361]: pd.PeriodIndex(["2011-1", "2011-2", "2011-3"], freq="M")
Out[361]: PeriodIndex(['2011-01', '2011-02', '2011-03'], dtype='period[M]', freq='M')
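
Since each Period covers a span rather than an instant, it has both a start and an end. The sketch below builds the same monthly range as above and inspects those boundaries. The use of to_timestamp, start_time, and end_time goes slightly beyond what the article covers and is included only as an illustration.

```python
import pandas as pd

prng = pd.period_range("2011-01", "2012-01", freq="M")

# Convert every period to the Timestamp at which it begins
starts = prng.to_timestamp()

# Boundaries of the first period (January 2011)
first_start = prng[0].start_time
first_end = prng[0].end_time

print(len(prng), starts[0], first_start, first_end)
```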

DateOffset
A DateOffset is a frequency object. Like a Timedelta it represents a duration, but it follows calendar rules. For example, a Timedelta of one day is always exactly 24 hours, whereas a DateOffset of one day can span 23, 24, or 25 hours depending on daylight saving time.
# This particular day contains a daylight saving time transition
In [144]: ts = pd.Timestamp("2016-10-30 00:00:00", tz="Europe/Helsinki")

# Respects absolute time
In [145]: ts + pd.Timedelta(days=1)
Out[145]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')

# Respects calendar time
In [146]: ts + pd.DateOffset(days=1)
Out[146]: Timestamp('2016-10-31 00:00:00+0200', tz='Europe/Helsinki')

In [147]: friday = pd.Timestamp("2018-01-05")

In [148]: friday.day_name()
Out[148]: 'Friday'

# Add 2 business days (Friday --> Tuesday)
In [149]: two_business_days = 2 * pd.offsets.BDay()

# Note: DateOffset.apply is deprecated and may be unavailable in recent pandas
# releases; the addition in In [151] is the equivalent form
In [150]: two_business_days.apply(friday)
Out[150]: Timestamp('2018-01-09 00:00:00')

In [151]: friday + two_business_days
Out[151]: Timestamp('2018-01-09 00:00:00')

In [152]: (friday + two_business_days).day_name()
Out[152]: 'Tuesday'
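
The claim that a calendar day can last 25 hours can be verified by measuring the elapsed time between the two results. This is a sketch reusing the Helsinki timestamp from the session above; it assumes time zone data is available in the environment.

```python
import pandas as pd

# Midnight on the day when clocks in Helsinki are set back by one hour
ts = pd.Timestamp("2016-10-30 00:00:00", tz="Europe/Helsinki")

absolute = ts + pd.Timedelta(days=1)    # exactly 24 elapsed hours
calendar = ts + pd.DateOffset(days=1)   # same wall-clock time on the next day

# Subtracting tz-aware timestamps yields the real elapsed time
absolute_length = absolute - ts
calendar_length = calendar - ts

print(absolute_length, calendar_length)
```

The DateOffset result lands on midnight of the next calendar day, which on this date is 25 elapsed hours later.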

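DateOffsets and frequency strings are closely related. The available date offsets and their associated frequency strings are listed below:

| Date Offset                              | Frequency String | Description                                         |
|------------------------------------------|------------------|-----------------------------------------------------|
| DateOffset                               | None             | Generic offset class                                |
| BDay or BusinessDay                      | 'B'              | Business day (weekday)                              |
| CDay or CustomBusinessDay                | 'C'              | Custom business day                                 |
| Week                                     | 'W'              | One week                                            |
| WeekOfMonth                              | 'WOM'            | The x-th day of the y-th week of each month         |
| LastWeekOfMonth                          | 'LWOM'           | The x-th day of the last week of each month         |
| MonthEnd                                 | 'M'              | Calendar month end                                  |
| MonthBegin                               | 'MS'             | Calendar month begin                                |
| BMonthEnd or BusinessMonthEnd            | 'BM'             | Business month end                                  |
| BMonthBegin or BusinessMonthBegin        | 'BMS'            | Business month begin                                |
| CBMonthEnd or CustomBusinessMonthEnd     | 'CBM'            | Custom business month end                           |
| CBMonthBegin or CustomBusinessMonthBegin | 'CBMS'           | Custom business month begin                         |
| SemiMonthEnd                             | 'SM'             | 15th (or other day of month) and calendar month end |
| SemiMonthBegin                           |                  |                                                     |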
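
A frequency string and its offset class are interchangeable wherever a freq is expected. The sketch below shows this for 'B' and BDay, and uses 'MS' to generate month starts. The dates are illustrative, and only aliases that are stable across pandas versions are used, since some of the aliases listed in the table have been renamed in recent releases.

```python
import pandas as pd

# The same business-day range, once by alias and once by offset object
by_alias = pd.date_range("2018-01-01", periods=4, freq="B")
by_object = pd.date_range("2018-01-01", periods=4, freq=pd.offsets.BDay())

# 'MS' rolls forward to the first month start on or after the given date
month_starts = pd.date_range("2018-01-15", periods=3, freq="MS")
month_labels = [d.strftime("%Y-%m-%d") for d in month_starts]

print(list(by_alias) == list(by_object), month_labels)
```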