Advanced pandas tutorial: time handling
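Introduction
Time is one of the most commonly used data types in data processing. Besides NumPy's datetime64 and timedelta64 dtypes, pandas also consolidates features from other Python libraries such as scikits.timeseries.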
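Time concepts
pandas has four time-related concepts:
- Date times: a specific date and time, optionally with a time zone. Similar to datetime.datetime in the standard library.
- Time deltas: an absolute duration. Similar to datetime.timedelta in the standard library.
- Time spans: a span of time defined by a point in time and its associated frequency.
- Date offsets: a relative duration that follows calendar arithmetic. Similar to dateutil.relativedelta.relativedelta.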
Concept | Scalar class | Array class | pandas data type | Primary creation method |
Date times | Timestamp | DatetimeIndex | datetime64[ns] or datetime64[ns, tz] | to_datetime or date_range |
Time deltas | Timedelta | TimedeltaIndex | timedelta64[ns] | to_timedelta or timedelta_range |
Time spans | Period | PeriodIndex | period[freq] | Period or period_range |
Date offsets | DateOffset | None | None | DateOffset |
In [19]: pd.Series(range(3), index=pd.date_range("2000", freq="D", periods=3))
Out[19]:
2000-01-01    0
2000-01-02    1
2000-01-03    2
Freq: D, dtype: int64
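To make the table concrete, here is a minimal sketch that builds one scalar and one array-like object for each concept. The variable names are illustrative only, and the frequencies were chosen arbitrarily for the demonstration.

```python
import pandas as pd

# One scalar of each kind from the table.
stamp = pd.Timestamp("2000-01-01")      # date time
delta = pd.Timedelta(days=1)            # time delta
span = pd.Period("2000-01", freq="M")   # time span
offset = pd.DateOffset(months=1)        # date offset (has no array class)

# The matching array-like classes, built with the creation helpers.
stamps = pd.date_range("2000-01-01", periods=3, freq="D")
deltas = pd.timedelta_range(start="1 day", periods=3, freq="D")
spans = pd.period_range("2000-01", periods=3, freq="M")

for obj in (stamp, delta, span, offset, stamps, deltas, spans):
    print(type(obj).__name__)

print(stamp + delta)    # shift by an absolute duration
print(stamp + offset)   # shift by one calendar month
```

Adding the Timedelta moves the timestamp by exactly 24 hours, while adding the DateOffset moves it to the same day of the next month.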
The null value for each of these types is NaT:
In [24]: pd.Timestamp(pd.NaT)
Out[24]: NaT
In [25]: pd.Timedelta(pd.NaT)
Out[25]: NaT
In [26]: pd.Period(pd.NaT)
Out[26]: NaT
# Equality acts as np.nan would
In [27]: pd.NaT == pd.NaT
Out[27]: False
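Because NaT never compares equal to anything, including itself, missing timestamps have to be detected with isna or notna rather than ==. A small sketch, using made-up sample values:

```python
import pandas as pd

# A series with one missing timestamp (sample data for illustration).
values = pd.Series(pd.to_datetime(["2012-05-01", None, "2012-05-03"]))

# Equality against NaT is False everywhere, so it cannot find the gap.
print((values == pd.NaT).tolist())

# isna() is the reliable check.
mask = values.isna()
print(mask.tolist())

# Drop the missing entries before further processing.
cleaned = values.dropna()
print(len(cleaned))
```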
Timestamp
Timestamp is the most basic time type. It can be created from a datetime object, from a string, or from integer components:
In [28]: pd.Timestamp(datetime.datetime(2012, 5, 1))
Out[28]: Timestamp('2012-05-01 00:00:00')
In [29]: pd.Timestamp("2012-05-01")
Out[29]: Timestamp('2012-05-01 00:00:00')
In [30]: pd.Timestamp(2012, 5, 1)
Out[30]: Timestamp('2012-05-01 00:00:00')
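The session above assumes that datetime and pandas have already been imported. The following self-contained sketch adds the imports and shows a few attributes of a Timestamp. The time of day is an arbitrary value chosen for the example.

```python
import datetime

import pandas as pd

ts = pd.Timestamp("2012-05-01 13:45")

# Individual calendar components are available as attributes.
print(ts.year, ts.month, ts.day, ts.hour, ts.minute)
print(ts.day_name())

# A Timestamp built from a datetime object compares equal to one
# built from the equivalent string.
same = pd.Timestamp(datetime.datetime(2012, 5, 1, 13, 45))
print(ts == same)

# Convert back to a standard-library datetime when needed.
print(ts.to_pydatetime())
```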
DatetimeIndex
A list of Timestamp objects used as an index is automatically converted to a DatetimeIndex:
In [33]: dates = [
   ....:     pd.Timestamp("2012-05-01"),
   ....:     pd.Timestamp("2012-05-02"),
   ....:     pd.Timestamp("2012-05-03"),
   ....: ]
   ....:
In [34]: ts = pd.Series(np.random.randn(3), dates)
In [35]: type(ts.index)
Out[35]: pandas.core.indexes.datetimes.DatetimeIndex
In [36]: ts.index
Out[36]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)
In [37]: ts
Out[37]:
2012-05-01    0.469112
2012-05-02   -0.282863
2012-05-03   -1.509059
dtype: float64
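One practical benefit of a DatetimeIndex is that rows can be selected with date strings. The sketch below goes slightly beyond the text above: it assumes a 40-day series filled with consecutive integers so that the selected rows are easy to recognize.

```python
import numpy as np
import pandas as pd

# 40 consecutive days starting on 2012-05-01 (illustrative data).
index = pd.date_range("2012-05-01", periods=40, freq="D")
ts = pd.Series(np.arange(40), index=index)

# A partial date string selects every row that falls inside it.
may = ts.loc["2012-05"]
print(len(may))

# String slices on a DatetimeIndex include both endpoints.
window = ts.loc["2012-05-30":"2012-06-02"]
print(window.tolist())
```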
date_range and bdate_range
A DatetimeIndex can also be created with date_range:
In [74]: start = datetime.datetime(2011, 1, 1)
In [75]: end = datetime.datetime(2012, 1, 1)
In [76]: index = pd.date_range(start, end)
In [77]: index
Out[77]:
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',
               '2011-01-09', '2011-01-10',
               ...
               '2011-12-23', '2011-12-24', '2011-12-25', '2011-12-26',
               '2011-12-27', '2011-12-28', '2011-12-29', '2011-12-30',
               '2011-12-31', '2012-01-01'],
              dtype='datetime64[ns]', length=366, freq='D')
By default, date_range generates a range of calendar days, while bdate_range generates a range of business days:
In [78]: index = pd.bdate_range(start, end)
In [79]: index
Out[79]:
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
               '2011-01-07', '2011-01-10', '2011-01-11', '2011-01-12',
               '2011-01-13', '2011-01-14',
               ...
               '2011-12-19', '2011-12-20', '2011-12-21', '2011-12-22',
               '2011-12-23', '2011-12-26', '2011-12-27', '2011-12-28',
               '2011-12-29', '2011-12-30'],
              dtype='datetime64[ns]', length=260, freq='B')
Both functions accept the start, end, and periods parameters, which can be combined with freq:
In [84]: pd.bdate_range(end=end, periods=20)
In [83]: pd.date_range(start, end, freq="W")
In [86]: pd.date_range("2018-01-01", "2018-01-05", periods=5)
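The three calls above are shown without their output. The following sketch reruns them as a standalone script and prints the length and endpoints of each result. The variable names are chosen here for illustration.

```python
import datetime

import pandas as pd

start = datetime.datetime(2011, 1, 1)
end = datetime.datetime(2012, 1, 1)

# The 20 business days ending at (or just before) `end`.
last_bdays = pd.bdate_range(end=end, periods=20)
print(len(last_bdays), last_bdays[0], last_bdays[-1])

# Weekly dates between start and end; "W" anchors on Sundays.
weekly = pd.date_range(start, end, freq="W")
print(len(weekly), weekly[0], weekly[-1])

# With start, end and periods given, the points are evenly spaced.
five = pd.date_range("2018-01-01", "2018-01-05", periods=5)
print(list(five.strftime("%Y-%m-%d")))
```

Since 2012-01-01 falls on a Sunday, the business-day range ends on the preceding Friday, 2011-12-30.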
origin
When to_datetime converts numeric values, the origin parameter changes the reference point from which those values are counted:
In [67]: pd.to_datetime([1, 2, 3], unit="D", origin=pd.Timestamp("1960-01-01"))
Out[67]: DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)
By default origin='unix', so the reference point is 1970-01-01 00:00:00:
In [68]: pd.to_datetime([1, 2, 3], unit="D")
Out[68]: DatetimeIndex(['1970-01-02', '1970-01-03', '1970-01-04'], dtype='datetime64[ns]', freq=None)
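A short sketch that puts the two origins side by side. The gap between the results is constant and equals the distance between the two reference points. The epoch-seconds value at the end is an extra example that does not come from the article.

```python
import pandas as pd

days = [1, 2, 3]

from_unix = pd.to_datetime(days, unit="D")
from_1960 = pd.to_datetime(days, unit="D", origin=pd.Timestamp("1960-01-01"))

# Every element differs by the same amount: the distance between origins.
gap = from_unix - from_1960
print(gap)
print(pd.Timestamp("1970-01-01") - pd.Timestamp("1960-01-01"))

# The same mechanism converts Unix epoch seconds.
print(pd.to_datetime(1_000_000_000, unit="s"))
```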
Formatting
The format parameter tells to_datetime how the input string is laid out:
In [51]: pd.to_datetime("2010/11/12", format="%Y/%m/%d")
Out[51]: Timestamp('2010-11-12 00:00:00')
In [52]: pd.to_datetime("12-11-2010 00:00", format="%d-%m-%Y %H:%M")
Out[52]: Timestamp('2010-11-12 00:00:00')
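Real input often contains values that do not match the expected format. As a hedged sketch, the errors="coerce" option of to_datetime (not covered in the text above) turns unparseable entries into NaT instead of raising. The sample strings are invented.

```python
import pandas as pd

raw = ["12-11-2010 00:00", "31-12-2010 18:30", "not a date"]

# Day comes first in this layout, so the format must say so explicitly.
parsed = pd.to_datetime(raw, format="%d-%m-%Y %H:%M", errors="coerce")
print(parsed)

# The entry that did not match the format became NaT.
print(parsed.isna().tolist())

# strftime goes the other way: from a timestamp back to a string.
print(parsed[1].strftime("%Y/%m/%d %H:%M"))
```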
Period
A Period represents a span of time and is usually used together with freq:
In [31]: pd.Period("2011-01")
Out[31]: Period('2011-01', 'M')
In [32]: pd.Period("2012-05", freq="D")
Out[32]: Period('2012-05-01', 'D')
Arithmetic can be applied to a Period directly. Adding or subtracting an integer shifts the Period by that many multiples of its own freq:
In [345]: p = pd.Period("2012", freq="A-DEC")
In [346]: p + 1
Out[346]: Period('2013', 'A-DEC')
In [347]: p - 3
Out[347]: Period('2009', 'A-DEC')
In [348]: p = pd.Period("2012-01", freq="2M")
In [349]: p + 2
Out[349]: Period('2012-05', '2M')
In [350]: p - 1
Out[350]: Period('2011-11', '2M')
Note that arithmetic on a Period only works with operands whose frequency is compatible with its own. This also applies to offsets and timedelta-like values:
In [352]: p = pd.Period("2014-07-01 09:00", freq="H")
In [353]: p + pd.offsets.Hour(2)
Out[353]: Period('2014-07-01 11:00', 'H')
In [354]: p + datetime.timedelta(minutes=120)
Out[354]: Period('2014-07-01 11:00', 'H')
In [355]: p + np.timedelta64(7200, "s")
Out[355]: Period('2014-07-01 11:00', 'H')
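The following sketch illustrates both sides of the compatibility rule with a monthly Period: same-frequency arithmetic works, while mixing frequencies raises an error. Monthly and daily frequencies are used here as an arbitrary choice for the demonstration.

```python
import pandas as pd

p = pd.Period("2012-01", freq="M")

# Integer arithmetic moves by whole months.
print(p + 1)

# A Period covers a span, so it has a start and an end.
print(p.start_time, p.end_time)

# Subtracting two periods of the same frequency gives a multiple
# of that frequency.
diff = pd.Period("2012-05", freq="M") - p
print(diff.n)

# Mixing frequencies is rejected. pandas raises IncompatibleFrequency,
# which is a subclass of ValueError.
try:
    p - pd.Period("2012-01-15", freq="D")
    mixed_ok = True
except ValueError as exc:
    mixed_ok = False
    print(type(exc).__name__)
```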
Period objects used as an index are automatically converted to a PeriodIndex:
In [38]: periods = [pd.Period("2012-01"), pd.Period("2012-02"), pd.Period("2012-03")]
In [39]: ts = pd.Series(np.random.randn(3), periods)
In [40]: type(ts.index)
Out[40]: pandas.core.indexes.period.PeriodIndex
In [41]: ts.index
Out[41]: PeriodIndex(['2012-01', '2012-02', '2012-03'], dtype='period[M]', freq='M')
In [42]: ts
Out[42]:
2012-01   -1.135632
2012-02    1.212112
2012-03   -0.173215
Freq: M, dtype: float64
A PeriodIndex can be created with pd.period_range:
In [359]: prng = pd.period_range("1/1/2011", "1/1/2012", freq="M")
In [360]: prng
Out[360]:
PeriodIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05', '2011-06',
             '2011-07', '2011-08', '2011-09', '2011-10', '2011-11', '2011-12',
             '2012-01'],
            dtype='period[M]', freq='M')
It can also be created directly with the PeriodIndex constructor:
In [361]: pd.PeriodIndex(["2011-1", "2011-2", "2011-3"], freq="M")
Out[361]: PeriodIndex(['2011-01', '2011-02', '2011-03'], dtype='period[M]', freq='M')
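A PeriodIndex can be converted to timestamps or to another frequency. The sketch below uses to_timestamp and asfreq, two methods the article does not cover, so treat it as a supplementary illustration rather than part of the original walkthrough.

```python
import pandas as pd

prng = pd.period_range("2011-01", "2011-03", freq="M")

# The first instant of each month as a DatetimeIndex.
starts = prng.to_timestamp(how="start")
print(starts)

# Re-express each month as the daily period at its end.
month_ends = prng.asfreq("D", how="end")
print(month_ends)
```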
DateOffset
A DateOffset is a frequency object. It is similar to a Timedelta in that it represents a duration, but it follows calendar rules. A Timedelta of one day is always exactly 24 hours, whereas a DateOffset of one day can span 23, 24, or 25 hours depending on daylight saving time transitions.
# This particular day contains a daylight saving time transition
In [144]: ts = pd.Timestamp("2016-10-30 00:00:00", tz="Europe/Helsinki")
# Respects absolute time
In [145]: ts + pd.Timedelta(days=1)
Out[145]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')
# Respects calendar time
In [146]: ts + pd.DateOffset(days=1)
Out[146]: Timestamp('2016-10-31 00:00:00+0200', tz='Europe/Helsinki')
In [147]: friday = pd.Timestamp("2018-01-05")
In [148]: friday.day_name()
Out[148]: 'Friday'
# Add 2 business days (Friday --> Tuesday)
In [149]: two_business_days = 2 * pd.offsets.BDay()
In [150]: two_business_days.apply(friday)
Out[150]: Timestamp('2018-01-09 00:00:00')
In [151]: friday + two_business_days
Out[151]: Timestamp('2018-01-09 00:00:00')
In [152]: (friday + two_business_days).day_name()
Out[152]: 'Tuesday'
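The difference between calendar-aware and absolute durations also shows up without time zones. The sketch below uses only the + operator, because, as far as I know, the apply method on offsets is no longer available in recent pandas releases. The month-end dates are examples chosen here.

```python
import pandas as pd

friday = pd.Timestamp("2018-01-05")
two_bdays = 2 * pd.offsets.BDay()

# Business-day arithmetic skips the weekend.
print((friday + two_bdays).day_name())

jan31 = pd.Timestamp("2018-01-31")

# A calendar month lands on the last valid day of February ...
print(jan31 + pd.DateOffset(months=1))

# ... while a fixed 30-day Timedelta overshoots into March.
print(jan31 + pd.Timedelta(days=30))

# Offsets can also snap a date forward to the next anchor point.
print(pd.offsets.MonthEnd().rollforward(pd.Timestamp("2018-02-10")))
```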
Date offsets and frequency strings are closely related. The table below lists the available date offsets and their associated frequency strings. The strings shown are those of the pandas version this tutorial was written against; newer releases rename some of them (for example, pandas 2.2 replaces 'M' with 'ME' for month end).
Date Offset | Frequency String | Description |
DateOffset | None | Generic offset class |
BDay or BusinessDay | 'B' | Business day (weekday) |
CDay or CustomBusinessDay | 'C' | Custom business day |
Week | 'W' | One week |
WeekOfMonth | 'WOM' | The x-th day of the y-th week of each month |
LastWeekOfMonth | 'LWOM' | The x-th day of the last week of each month |
MonthEnd | 'M' | Calendar month end |
MonthBegin | 'MS' | Calendar month begin |
BMonthEnd or BusinessMonthEnd | 'BM' | Business month end |
BMonthBegin or BusinessMonthBegin | 'BMS' | Business month begin |
CBMonthEnd or CustomBusinessMonthEnd | 'CBM' | Custom business month end |
CBMonthBegin or CustomBusinessMonthBegin | 'CBMS' | Custom business month begin |
SemiMonthEnd | 'SM' | The 15th (or another day_of_month) and the calendar month end |
SemiMonthBegin |