
Learning Python Data Analysis Easily by Comparison with Excel

Author: 山草紐斜齊, posted under 書法 on 2019-05-21

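I had long wanted to tie Excel and Python together to reinforce what I was learning about pandas. On Zhihu I came across the book 《對比excel，輕鬆學習python資料分析》 (Learning Python Data Analysis Easily by Comparison with Excel), mentioned by @天天, and soon afterwards found the column article by 蘇克1900, 像 Excel 一樣使用 python 進行資料分析 (Using Python for Data Analysis the Way You Use Excel). The article is thorough, so I retyped its code in Jupyter Notebook for side-by-side study. Following the book's table of contents, it is divided into the sections below: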

1. Creating the data table

import numpy as np
import pandas as pd

# Create the data table
df = pd.DataFrame(
    {
        "id": [1001, 1002, 1003, 1004, 1005, 1006],
        "date": pd.date_range('20130102', periods=6),
        "city": ['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
        "age": [23, 44, 54, 32, 34, 32],
        "category": ['100-A', '100-B', '110-A', '110-C', '210-A', '130-F'],
        "price": [1200, np.nan, 2133, 5433, np.nan, 4432],
    },
    columns=['id', 'date', 'city', 'category', 'age', 'price'],
)
df

# Printed result
     id        date       city  category  age   price
0  1001  2013-01-02    Beijing     100-A   23  1200.0
1  1002  2013-01-03         SH     100-B   44     NaN
2  1003  2013-01-04  guangzhou     110-A   54  2133.0
3  1004  2013-01-05   Shenzhen     110-C   32  5433.0
4  1005  2013-01-06   shanghai     210-A   34     NaN
5  1006  2013-01-07    BEIJING     130-F   32  4432.0

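2. Inspecting the data table

# Check the table's dimensions; the Excel counterpart is Ctrl+Down and Ctrl+Right
df.shape
(6, 6)

# Table information
df.info()

# Check the data types; in Excel you select a cell and read the number format shown on the Home tab
df.dtypes

# Check for missing values; the Excel counterpart is Ctrl+G (Go To) to locate blanks
df.isnull()

# List the unique values; in Excel this is done by color-marking unique values with "Conditional Formatting"
df['city'].unique()

# View the table's values
df.values

# View the column names
df.columns

# View the first 10 rows
df.head(10)

# View the last 10 rows
df.tail(10)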
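`df.isnull()` returns a whole table of booleans, which gets hard to scan once the table grows. As a minimal sketch, assuming a cut-down copy of the table above (the name `demo` is only for this illustration), summing that boolean table gives a count of missing values per column, and `nunique()` gives the number of distinct values per column:

```python
import numpy as np
import pandas as pd

# A cut-down copy of the table above; `demo` is a name used only in this sketch.
demo = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004, 1005, 1006],
    "city": ["Beijing ", "SH", " guangzhou ", "Shenzhen", "shanghai", "BEIJING "],
    "price": [1200, np.nan, 2133, 5433, np.nan, 4432],
})

# True counts as 1, so summing the boolean table counts the nulls in each column.
null_counts = demo.isnull().sum()
print(null_counts)

# Number of distinct values per column (NaN is not counted by default).
distinct_counts = demo.nunique()
print(distinct_counts)
```

Before any cleaning, every city string is distinct because of the stray spaces and mixed case, which is exactly what the next section deals with.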

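3. Cleaning the data table

# Handle missing values (drop or fill); the Excel counterpart is Find and Replace. Drop the rows that contain missing values
# (dropna returns a new table; df itself is unchanged unless the result is assigned back)
df.dropna(how="any")

# Fill missing values in the table with the number 0
df.fillna(value=0)

# Fill missing values in price with the column mean
df['price'] = df['price'].fillna(df['price'].mean())

# Clean up whitespace: strip the spaces from the city field
df['city'] = df['city'].map(str.strip)

# Convert case
df['city'] = df['city'].str.lower()

# Change the data type; in Excel this is done with "Format Cells"
df['price'].astype('int')

# Rename a column
df.rename(columns={'category': 'category-size'})

# Remove duplicates; Excel has "Remove Duplicates" under the Data tab
df['city'].drop_duplicates()
df['city'].drop_duplicates(keep='last')  # keep the last occurrence of each duplicate

# Modify and replace values; in Excel, "Find and Replace" does this
df['city'].replace('sh', 'shanghai')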
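Most of the calls in this section return a new object and leave `df` untouched, so a result only sticks if it is assigned back. The sketch below, on a small stand-in table (`demo` is a name made up for the example), chains the city clean-up steps and assigns the result, then fills and converts price in the order that avoids an error:

```python
import numpy as np
import pandas as pd

# `demo` is a stand-in table used only in this sketch.
demo = pd.DataFrame({
    "city": ["Beijing ", "SH", " guangzhou ", "Shenzhen", "shanghai", "BEIJING "],
    "price": [1200, np.nan, 2133, 5433, np.nan, 4432],
})

# Strip whitespace, lower-case, then replace the abbreviation, and assign back.
# Series.replace matches whole values here, so "shenzhen" is left alone.
demo["city"] = demo["city"].str.strip().str.lower().replace("sh", "shanghai")

# Fill missing prices with the column mean, then convert to integers.
# astype(int) raises on NaN, so the fill has to come first.
demo["price"] = demo["price"].fillna(demo["price"].mean()).astype(int)

print(demo)
```

Note that `astype(int)` truncates rather than rounds, so the filled mean of 3299.5 becomes 3299.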

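4. Preprocessing the data

# Merging data tables
# First create the df1 table
df1 = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
    "gender": ['male', 'female', 'male', 'female', 'male', 'female', 'male', 'female'],
    "pay": ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y'],
    "m-point": [10, 12, 20, 40, 40, 40, 30, 20],
})

# Match and merge the tables in inner mode. Excel has no direct table-merge feature; it can be done step by step with VLOOKUP
# (with no on argument, merge joins on the column the two tables share, here id)
df_inner = pd.merge(df, df1, how="inner")
df_inner

# Other join modes
df_left = pd.merge(df, df1, how='left')
df_right = pd.merge(df, df1, how='right')
df_outer = pd.merge(df, df1, how='outer')
df_outer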
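To see what the join modes actually do, it helps to try them on two tiny tables where not every id matches. This is only a sketch with made-up tables (`left` and `right` are illustrative names); it names the key explicitly with `on="id"` and uses `indicator=True`, which adds a `_merge` column saying which side each row came from:

```python
import pandas as pd

# Two made-up tables that share ids 1002 and 1003 only.
left = pd.DataFrame({"id": [1001, 1002, 1003], "city": ["beijing", "shanghai", "guangzhou"]})
right = pd.DataFrame({"id": [1002, 1003, 1004], "gender": ["female", "male", "female"]})

# Name the key explicitly instead of relying on shared column names.
inner = pd.merge(left, right, on="id", how="inner")

# indicator=True adds a `_merge` column: left_only, right_only or both.
outer = pd.merge(left, right, on="id", how="outer", indicator=True)

print(inner)
print(outer)
```

The inner join keeps only the ids present on both sides, while the outer join keeps all four ids and fills the missing side with NaN.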

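# Set an index column
df_inner.set_index('id')

# Sort (by index, by value); in Excel the sort buttons under the Data tab sort the table directly
df_inner.sort_index()
df_inner.sort_values(by=['age'])  # sorting by a column requires the by argument

# Group the data; in Excel, numeric values can be grouped with an approximate-match VLOOKUP or with a PivotTable
df_inner['group'] = np.where(df_inner['price'] > 3000, 'high', 'low')  # somewhat like Excel's IF function
df_inner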
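`np.where` covers the two-way high/low split. For more than two bands, which is closer to what an approximate-match VLOOKUP does in Excel, `pd.cut` is one option. A small sketch follows; the bin edges and labels are invented for the illustration and are not from the original article:

```python
import pandas as pd

# Prices as they look after the mean fill in section 3 (3299.5 shown truncated).
prices = pd.Series([1200, 3299, 2133, 5433, 3299, 4432], name="price")

# Bin edges and labels are made up for this illustration.
# Intervals are right-closed by default: (0, 2000], (2000, 4000], (4000, 10000].
bands = pd.cut(prices, bins=[0, 2000, 4000, 10000], labels=["low", "mid", "high"])

print(bands.tolist())
```

The result is a categorical Series, so it can be assigned to a new column and used directly in `groupby`.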

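# Flag the rows that satisfy several conditions at once
df_inner.loc[(df_inner['city'] == 'beijing') & (df_inner['price'] > 4000), 'sign'] = 1
df_inner

# Split a column; Excel offers "Text to Columns" under the Data tab
# Split the values of the category field in turn and build a table from them; the index is df_inner's index and the column names are category and size
split = pd.DataFrame((x.split('-') for x in df_inner['category']), index=df_inner.index, columns=['category', 'size'])

# Match the split table back onto the original df_inner table and keep the result, since later sections use the size column
# (only size is merged back, so the existing category column is not duplicated as category_x / category_y)
df_inner = pd.merge(df_inner, split[['size']], right_index=True, left_index=True)  # right_index and left_index act as the unique key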
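The same split can also be written with the string accessor, which saves building a DataFrame by hand. This is a sketch of an alternative rather than the article's method; `demo`, `parts` and the column name `category_code` are names chosen for the example:

```python
import pandas as pd

demo = pd.DataFrame({"category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"]})

# expand=True returns one column per piece, aligned on demo's index.
parts = demo["category"].str.split("-", expand=True)
parts.columns = ["category_code", "size"]  # column names chosen for this sketch

# join aligns on the index, like merge(left_index=True, right_index=True).
result = demo.join(parts)
print(result)
```

Because the new column names do not clash with `category`, no `_x`/`_y` suffixes appear in the joined table.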

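5. Extracting data

# Extract by label (loc): loc selects by the table's index labels. iloc (position-based) and ix (deprecated, then removed in pandas 1.0) are left out for now
# Extract a single row by index label
df_inner.loc[3]

# Extract a range of rows by index label (with loc the end label is included)
df_inner.loc[0:5]

# Reset the index
df_inner.reset_index()

# Set the date as the index
df_inner.set_index('date')

# Extract all rows up to and including the 4th
# (set_index returns a new table, so the slice is chained onto it; on the default integer index this slice raises an error)
df_inner.set_index('date').loc[:"2013-01-04"]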
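Since `loc` slices by label and includes the end label, it behaves differently from ordinary Python slicing. A minimal sketch of the difference from position-based `iloc`, on a made-up table whose index is deliberately not 0, 1, 2, ...:

```python
import pandas as pd

# A made-up table; the index labels are chosen so labels and positions differ.
demo = pd.DataFrame(
    {"city": ["beijing", "shanghai", "guangzhou", "shenzhen"], "age": [23, 44, 54, 32]},
    index=[10, 20, 30, 40],
)

by_label = demo.loc[10:30]     # labels 10, 20 and 30: the end label is included
by_position = demo.iloc[0:2]   # positions 0 and 1: the end position is excluded

print(by_label)
print(by_position)
```

On the default 0..n-1 index the two look almost the same, which is why `df_inner.loc[0:5]` above returns six rows rather than five.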

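# Extract by condition (range and condition values): use isin to test whether the values in city are beijing
df_inner['city'].isin(['beijing'])  # returns booleans
df_inner.loc[df_inner['city'].isin(['beijing'])]  # returns the rows where the boolean is True

# Value extraction can also do work similar to column splitting, pulling a specified part out of a combined value
category = df_inner['category']
pd.DataFrame(category.str[:3])  # extract the first three characters and build a table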

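6. Filtering data

# Combine "and", "or" and "not" conditions with greater-than, less-than and equals to filter the data, then count and sum. This is similar to Excel's filter feature and its COUNTIFS and SUMIFS functions
# Filter by condition (and &, or |, not !=); Excel's Data tab offers "Filter" for filtering a table by different conditions
# Filter with an "and" condition: age greater than 25 and city equal to beijing
df_inner.loc[(df_inner['age'] > 25) & (df_inner['city'] == 'beijing')]

# Filter with an "or" condition: age greater than 25 or city equal to beijing. Six rows match after filtering
df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing')]

# Filter with a "not" condition (city not equal to beijing), combined here with age > 25 by "or"; keep five columns and sort the result by age
# (the older sort method raises an error here, so sort_values is used)
df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] != 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(by=['age'])

# Add the price field and the sum function to the earlier "or" filter to total price over the filtered rows, equivalent to Excel's SUMIFS
df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing'), ['id', 'city', 'age', 'category', 'gender', 'price']].sort_values(by=['age']).price.sum()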
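The long one-liners above are easier to read if the condition is stored in a variable first. Here is a sketch of the COUNTIFS / SUMIFS idea on a small stand-in table (`demo`, `mask`, `countifs` and `sumifs` are names made up for the example; the cities are already cleaned and the prices are rounded):

```python
import pandas as pd

# Stand-in table; values mirror the cleaned table, with the filled price shown as 3299.
demo = pd.DataFrame({
    "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
    "age": [23, 44, 54, 32, 34, 32],
    "price": [1200, 3299, 2133, 5433, 3299, 4432],
})

# Each comparison needs its own parentheses because & and | bind tighter than > and ==.
mask = (demo["age"] > 25) & (demo["city"] == "beijing")

countifs = int(mask.sum())                   # number of matching rows
sumifs = int(demo.loc[mask, "price"].sum())  # total price of the matching rows

# ~ negates a whole mask, which is the general form of "not".
not_beijing = demo.loc[~(demo["city"] == "beijing")]

print(countifs, sumifs, len(not_beijing))
```

Only the last row is both older than 25 and in beijing, so the count is 1 and the sum is that row's price.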

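# Another way to filter is the query function
df_inner.query('city == ["beijing", "shanghai"]')  # use double quotes inside the single-quoted string, otherwise it raises an error

# Sum price over the filtered result
df_inner.query('city == ["beijing", "shanghai"]').price.sum()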

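7. Summarizing data

# Excel uses Subtotal and PivotTables to summarize data along particular dimensions; the main functions in Python are groupby and pivot_table
# Subtotals: count every column, grouped by city
df_inner.groupby('city').count()
pd.crosstab(df_inner['city'], df_inner['age'])  # crosstab gives a count summary for a single column (here age) broken down by city

# Count only the id column
df_inner.groupby('city')['id'].count()

# Count by two fields
df_inner.groupby(['city', 'size'])['id'].count()

# Besides counting and summing, the grouped data can be computed along several measures at once: summarize price by city and compute its count, total and mean
# (the string names 'sum' and 'mean' avoid the warning that newer pandas versions emit for np.sum and np.mean)
df_inner.groupby('city')['price'].agg([len, 'sum', 'mean'])  # len here is the number of items in each group, not the length of a string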
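When several measures are computed at once, the default column names (`len`, `sum`, `mean`) are not very descriptive. One option is named aggregation, sketched here on a stand-in table; the output column names `n`, `total` and `average` are invented for the example:

```python
import pandas as pd

# Stand-in table with cleaned cities and rounded prices.
demo = pd.DataFrame({
    "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
    "price": [1200, 3299, 2133, 5433, 3299, 4432],
})

# keyword = aggregation name; the keyword becomes the output column name.
summary = demo.groupby("city")["price"].agg(
    n="count",       # number of non-null prices in each city
    total="sum",
    average="mean",
)
print(summary)
```

Unlike `len`, `"count"` skips missing values, so the two only agree once the nulls in price have been filled.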

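# Pivoting; Excel's Insert tab offers "PivotTable" for summarizing a table along particular dimensions
# Set city as the row field, size as the column field and price as the value field. Compute the count and the total of price, with row and column totals
pd.pivot_table(df_inner, index=['city'], columns=['size'], values=['price'], aggfunc=[len, 'sum'], fill_value=0, margins=True)  # margins adds the row and column totals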

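8. Statistics

# Mainly covers sampling, standard deviation, covariance and the correlation coefficient
# Sampling: Excel's Data Analysis feature provides sampling; in Python the sample function does it
# Simple random sample
df_inner.sample(n=3)

# Set the sampling weights by hand
weights = [0, 0, 0, 0, 0.5, 0.5]
df_inner.sample(n=2, weights=weights)

# sample also takes a replace argument, which sets whether a row is put back after being drawn
# Sample without replacement
df_inner.sample(n=6, replace=False)

# Sample with replacement: with replace=True the same row can be drawn more than once
df_inner.sample(n=6, replace=True)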
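Every run of `sample` draws different rows, which makes notebook results hard to compare. A small sketch, assuming a made-up `demo` table, of fixing the draw with `random_state` and of what zero weights do (the seed values are arbitrary):

```python
import pandas as pd

demo = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006]})

# The same random_state gives the same rows on every run.
first = demo.sample(n=3, random_state=42)
second = demo.sample(n=3, random_state=42)

# Rows with weight 0 are never drawn, so only the last two ids can appear.
weighted = demo.sample(n=2, weights=[0, 0, 0, 0, 0.5, 0.5], random_state=0)

print(first["id"].tolist(), weighted["id"].tolist())
```

With two rows requested, no replacement, and only two non-zero weights, the weighted draw always returns both of those rows; only their order can vary.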

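# Descriptive statistics: Excel's Data Analysis provides descriptive statistics; in Python the describe function does it
df_inner.describe()

# Descriptive statistics for the table
df_inner.describe().round(2).T  # round sets the number of decimal places, T transposes

# Standard deviation: the std function computes the standard deviation of a given column
df_inner['price'].std()

# Covariance: the cov function computes the covariance between two fields, or between all fields of a table
df_inner['price'].cov(df_inner['m-point'])

# Correlation analysis: the corr function performs it and returns the correlation coefficient
df_inner['price'].corr(df_inner['m-point'])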
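Covariance and correlation are closely related: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks that by hand on stand-in columns (`demo` and `by_hand` are names made up for the example, and the filled price is rounded to 3299):

```python
import pandas as pd

demo = pd.DataFrame({
    "price": [1200, 3299, 2133, 5433, 3299, 4432],
    "m-point": [10, 12, 20, 40, 40, 40],
})

cov = demo["price"].cov(demo["m-point"])
corr = demo["price"].corr(demo["m-point"])  # Pearson by default

# Pearson correlation = covariance / (std of x * std of y).
# cov() and std() both use the sample (n - 1) convention by default.
by_hand = cov / (demo["price"].std() * demo["m-point"].std())

print(round(corr, 4), round(by_hand, 4))
```

This is why the covariance depends on the units of the two columns while the correlation always falls between -1 and 1.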

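# Correlation analysis for the whole table
df_inner.corr(numeric_only=True)  # numeric_only=True (pandas 1.5 and later) leaves out the text and date columns, which would otherwise raise an error in pandas 2.x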
