lcut函数python pythonljust函数( 五 )


In [14]:
def trasform_date(d):day,month,year = d.split('-')month = months[month]return "20" year '-' str(month) '-' day
In [17]:
#将表中日期格式转换为'yyyy-mm-dd' 。日期格式,通过函数加map方式进行转换table['contb_receipt_dt'] = table['contb_receipt_dt'].apply(trasform_date)
In [18]:
table.head()
Out[18]:
cmte_id cand_id cand_nm contbr_nm contbr_city contbr_st contbr_zip contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp file_num party
0 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e 08 RETIRED RETIRED 250.0 2011-6-20 NaN NaN NaN SA17A 736166 Republican
1 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e 08 RETIRED RETIRED 50.0 2011-6-23 NaN NaN NaN SA17A 736166 Republican
2 C00410118 P20002978 Bachmann, Michelle SMITH, LANIER LANETT AL 3.68633e 08 INFORMATION REQUESTED INFORMATION REQUESTED 250.0 2011-7-05 NaN NaN NaN SA17A 749073 Republican
3 C00410118 P20002978 Bachmann, Michelle BLEVINS, DARONDA PIGGOTT AR 7.24548e 08 NONE RETIRED 250.0 2011-8-01 NaN NaN NaN SA17A 749073 Republican
4 C00410118 P20002978 Bachmann, Michelle WARDENBURG, HAROLD HOT SPRINGS NATION AR 7.19016e 08 NONE RETIRED 300.0 2011-6-20 NaN NaN NaN SA17A 736166 Republican
In [19]:
#查看老兵(捐献者职业)DISABLED VETERAN主要支持谁:查看老兵们捐赠给谁的钱最多table['contbr_occupation'] == 'DISABLED VETERAN'
Out[19]:
0False1False2False3False4False5False6False7False8False9False10False11False12False13False14False15False16False17False18False19False20False21False22False23False24False25False26False27False28False29False...536011False536012False536013False536014False536015False536016False536017False536018False536019False536020False536021False536022False536023False536024False536025False536026False536027False536028False536029False536030False536031False536032False536033False536034False536035False536036False536037False536038False536039False536040FalseName: contbr_occupation, Length: 536041, dtype: bool
In [21]:
old_bing_df = table.loc[table['contbr_occupation'] == 'DISABLED VETERAN']
In [22]:
old_bing_df.groupby(by='cand_nm')['contb_receipt_amt'].sum()
Out[22]:
cand_nmCain, Herman300.00Obama, Barack4205.00Paul, Ron2425.49Santorum, Rick250.00Name: contb_receipt_amt, dtype: float64
In [23]:
table['contb_receipt_amt'].max()
Out[23]:
1944042.43
In [24]:
#找出候选人的捐赠者中,捐赠金额最大的人的职业以及捐献额.通过query("查询条件来查找捐献人职业")table.query('contb_receipt_amt == 1944042.43')
Out[24]:
cmte_id cand_id cand_nm contbr_nm contbr_city contbr_st contbr_zip contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp file_num party
176127 C00431445 P80003338 Obama, Barack OBAMA VICTORY FUND 2012 - UNITEMIZED CHICAGO IL 60680 NaN NaN 1944042.43 2011-12-31 NaN X * SA18 763233 Democrat
来源:
如何用 Python 从海量文本抽取主题代码
我们在Jupyter Notebook中新建一个Python 2笔记本 , 起名为topic-model 。
为了处理表格数据 , 我们依然使用数据框工具Pandas 。先调用它 。
import pandas as pd
然后读入我们的数据文件datascience.csv,注意它的编码是中文GB18030,不是Pandas默认设置的编码,所以此处需要显式指定编码类型,以免出现乱码错误 。
df = pd.read_csv("datascience.csv", encoding='gb18030')
我们来看看数据框的头几行,以确认读取是否正确 。
df.head()
显示结果如下:
没问题,头几行内容所有列都正确读入 , 文字显式正常 。我们看看数据框的长度,以确认数据是否读取完整 。
df.shape
执行的结果为:
(1024, 3)
行列数都与我们爬取到的数量一致 , 通过 。

推荐阅读