Chi-Square Binning Explained, with Applications

Author: 風控獵人 · Published: 2022-02-11

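I have recently been studying the scorecard modeling workflow. Feature processing relies on binning, a basic and widely used technique, so this article takes a detailed look at one binning method: chi-square binning.

Binning discretizes a continuous variable. Age, for example, can be binned into 0-18, 18-30, 30-45, and 45-60. It is a routine step in building a scorecard. But first, a question: why bin at all? Could we feed age into the model directly? We could. The reason we usually do not is that a scorecard needs strong business interpretability, and how much binning matters depends on your modeling algorithm. If you use machine learning algorithms such as xgb or lgb, the model is hard to interpret anyway, so skipping binning is acceptable there.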

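The main benefits of binning are:

Binned features are more robust to outliers. If age contains an anomalous value of 300, binning simply places it in the >80 bin, whereas feeding the raw value into the model would distort it badly.

After discretization a single variable becomes several bin indicators, each with its own weight. This introduces nonlinearity into a logistic regression model and improves its expressive power and fit.

Discretization also simplifies the logistic regression model and lowers the risk of overfitting.

Missing values can enter the model as a separate category.

Inner products of sparse vectors are fast to compute, and the results are easy to store and to extend.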
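To make the first and fourth points concrete, here is a minimal sketch with pandas. The ages, bin edges, and labels are made up for illustration; they are not taken from any real scorecard.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including an outlier (300) and a missing value
age = pd.Series([15, 25, 40, 58, 83, 300, np.nan], name="age")

# Illustrative bin edges; the last bin is open-ended so outliers fall into it
edges = [0, 18, 30, 45, 60, 80, np.inf]
labels = ["0-18", "18-30", "30-45", "45-60", "60-80", ">80"]
age_bin = pd.cut(age, bins=edges, labels=labels)

# Treat missing values as their own category
age_bin = age_bin.cat.add_categories("missing").fillna("missing")

# One-hot encode: each bin becomes a separate column with its own model weight
dummies = pd.get_dummies(age_bin, prefix="age")
print(age_bin.tolist())
```

The outlier 300 ends up in the same bin as 83, so it can no longer dominate a linear model the way the raw value would.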

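Now for chi-square binning itself. It helps to understand the chi-square test first, because chi-square binning is built on it; specifically, on the chi-square test of independence.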

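The chi-square test

The chi-square test analyzes the frequencies of categorical data. It has two main uses: the goodness-of-fit test and the test of independence (contingency table analysis).

Goodness-of-fit test

The goodness-of-fit test applies to a single categorical variable. From an assumed population distribution we compute the expected frequency of each category, compare it with the observed frequency, and judge whether the two differ significantly. For example, on the Titanic data we can ask whether the sex ratio among survivors matches the sex ratio among all passengers, which tells us whether survival is related to sex. Loosely, it asks whether an X is related to Y.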
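A goodness-of-fit test can be sketched with `scipy.stats.chisquare`. The counts and proportions below are invented for illustration and are not the real Titanic figures.

```python
from scipy.stats import chisquare

# Made-up counts: survivors by sex, and the share of each sex among all passengers
observed = [110, 230]            # [male survivors, female survivors]
passenger_share = [0.65, 0.35]   # [male, female], illustrative proportions

# Expected counts if survivors had the same sex ratio as all passengers
total = sum(observed)
expected = [total * p for p in passenger_share]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p_value:.3g}")
```

A small p-value means the observed counts deviate from the expected ones by more than chance would explain.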

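Test of independence

The test of independence works on two categorical variables and checks whether they are independent or associated. For example, does the quality of a raw material depend on where it comes from? Loosely, it asks whether one X is independent of another X.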
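For the raw-material example, a test of independence might look like the sketch below, using `scipy.stats.chi2_contingency`. The contingency table is made up.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up contingency table: rows = place of origin, columns = quality grade
table = np.array([[52, 64, 24],    # origin A: [grade 1, grade 2, grade 3]
                  [60, 59, 52],    # origin B
                  [50, 65, 74]])   # origin C

# Returns the statistic, the p-value, the degrees of freedom,
# and the expected counts under independence
stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {stat:.2f}, dof = {dof}, p = {p_value:.3g}")
```

The degrees of freedom are (rows - 1) x (columns - 1), which is 4 for this 3 x 3 table.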

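Steps of the chi-square test

The chi-square test is a hypothesis test and follows the usual procedure:

State the hypothesis, for example that the two variables are independent.

Compute the expected frequencies from the observed frequencies of each category.

Compute the chi-square statistic from the observed and expected frequencies using the chi-square formula.

Using the degrees of freedom and a significance level chosen in advance, look up the critical value in the chi-square distribution table and compare it with the statistic from the previous step.

Decide whether to reject the null hypothesis.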
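The five steps can be traced by hand in a few lines. This is a sketch on a made-up 2 x 2 table; `chi2.ppf` stands in for the printed chi-square table.

```python
import numpy as np
from scipy.stats import chi2

# Step 1: H0 says the row variable and the column variable are independent.
# Made-up table: rows = two groups, columns = [default, non-default]
observed = np.array([[20, 80],
                     [40, 60]])

# Step 2: expected count of each cell = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Step 3: chi-square statistic
stat = ((observed - expected) ** 2 / expected).sum()

# Step 4: critical value for the chosen significance level and degrees of freedom
alpha = 0.05
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
critical = chi2.ppf(1 - alpha, df=dof)

# Step 5: reject H0 when the statistic exceeds the critical value
reject_h0 = stat > critical
print(stat, critical, reject_h0)
```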

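Chi-square binning in a scorecard

The following uses age to show how chi-square binning works in scorecard modeling: first a concrete example, then the theory.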

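First, sort age in ascending order and put each distinct age value in its own bin. Count the defaults and non-defaults in each bin. Then merge bins as follows:

With bins 1, 2, 3, 4, pair up adjacent bins to get three pairs: (1, 2), (2, 3), (3, 4). Compute the chi-square value of each pair.

Find the pair with the smallest chi-square value and merge its two bins. If (2, 3) has the smallest value, merge 2 and 3, so after this round the bins are 1, 23, 4.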
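One round of this procedure can be sketched as follows. The counts are invented, and `pair_chi2` is a helper name introduced here for illustration; it computes the ordinary contingency-table statistic for two adjacent bins.

```python
import numpy as np

def pair_chi2(a, b):
    """Chi-square statistic of the 2 x k table formed by two adjacent bins."""
    table = np.array([a, b], dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()
    # Cells with zero expected count contribute nothing
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(expected > 0, (table - expected) ** 2 / expected, 0.0)
    return float(terms.sum())

# Made-up [default, non-default] counts for bins 1, 2, 3, 4
bins = [[10, 90], [30, 70], [32, 68], [60, 40]]

# Chi-square value of each adjacent pair: (1, 2), (2, 3), (3, 4)
pair_values = [pair_chi2(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]

# Merge the pair with the smallest value by adding the counts
i = int(np.argmin(pair_values))
merged = bins[:i] + [[x + y for x, y in zip(bins[i], bins[i + 1])]] + bins[i + 2:]
print(pair_values, merged)
```

Bins 2 and 3 have nearly the same default rate (30% and 32%), so their pair has by far the smallest value and is the one merged.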

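The rationale behind the method: if two adjacent intervals have very similar class distributions, they can be merged; otherwise they should stay separate. A low chi-square value indicates similar class distributions.

I also worked through a simple derivation of this core claim that a smaller chi-square value means more similar distributions. It shows that if the two bins to be merged have exactly the same class distribution, the chi-square value of the pair is 0.

The statistic behind chi-square binning is the usual one for a contingency table: for a pair of adjacent bins, sum (observed - expected)^2 / expected over both bins and all classes, where the expected count of a cell is its row total times its column total divided by the pair's total.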
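Written out, this is the standard pairwise statistic, followed by a short argument for the zero case. The notation (A, E, R, C, N, p) is introduced here for illustration.

```latex
% A_{ij}: observed count of class j in bin i (i = 1, 2 for the two adjacent bins)
% R_i = \sum_j A_{ij}, \quad C_j = \sum_i A_{ij}, \quad N = R_1 + R_2
\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{(A_{ij} - E_{ij})^2}{E_{ij}},
\qquad E_{ij} = \frac{R_i \, C_j}{N}

% If both bins share the same class proportions p_j, then A_{ij} = R_i \, p_j, so
C_j = (R_1 + R_2)\, p_j = N p_j
\;\Longrightarrow\;
E_{ij} = \frac{R_i \, N p_j}{N} = R_i \, p_j = A_{ij}
\;\Longrightarrow\;
\chi^2 = 0
```

The further the two class distributions drift apart, the larger the gaps between observed and expected counts become, and the larger the statistic.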

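The steps above describe a single round. Without a stopping condition the algorithm keeps merging until only one bin is left, so in practice we set stopping conditions:

A chi-square threshold

A limit on the number of bins

As a rule of thumb, the threshold is the chi-square critical value at a confidence level of 0.90, 0.95, or 0.99 for the chosen degrees of freedom (the code below defaults to 4), and the number of bins is usually capped at 5. The degrees of freedom for chi-square binning equal the number of classes of the target variable minus one, so a binary default / non-default target gives 1.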
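Here is a minimal sketch of both stopping conditions at work, assuming a binary target (1 degree of freedom) and made-up counts. `pair_chi2` and `chimerge` are illustrative names; the loop stops as soon as either condition is met, which is also how the article's code further down combines them.

```python
import numpy as np
from scipy.stats import chi2

def pair_chi2(a, b):
    """Chi-square statistic of the 2 x k table formed by two adjacent bins."""
    table = np.array([a, b], dtype=float)
    expected = (table.sum(axis=1, keepdims=True)
                * table.sum(axis=0, keepdims=True) / table.sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(expected > 0, (table - expected) ** 2 / expected, 0.0)
    return float(terms.sum())

def chimerge(bins, threshold, max_bins):
    """Merge adjacent bins while there are more than max_bins bins and the
    smallest pairwise chi-square value is still below the threshold."""
    bins = [list(b) for b in bins]
    while len(bins) > max(max_bins, 1):
        values = [pair_chi2(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = int(np.argmin(values))
        if values[i] >= threshold:
            break  # every adjacent pair differs significantly
        bins[i] = [x + y for x, y in zip(bins[i], bins[i + 1])]
        del bins[i + 1]
    return bins

# Critical values for 1 degree of freedom at the usual confidence levels
thresholds = {conf: chi2.ppf(conf, df=1) for conf in (0.90, 0.95, 0.99)}

# Made-up [default, non-default] counts for six initial bins
bins = [[10, 90], [12, 88], [30, 70], [32, 68], [60, 40], [62, 38]]
result = chimerge(bins, threshold=thresholds[0.95], max_bins=3)
print(thresholds, result)
```

A higher confidence level gives a larger threshold, so more pairs count as "similar enough" and merging goes further.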

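Below is an implementation of chi-square binning. Read it carefully: it will sharpen your coding and deepen your understanding of the method. Be sure to read it through in one go, because you will forget it once you are done anyway.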

## Hand-written chi-square optimal binning

import numpy as np
import pandas as pd
from scipy.stats import chi2


def get_chi2(X, col):
    '''
    Compute the initial chi-square statistic of every distinct value of col.
    X must contain col and a binary 'Defaulter' column (1 = default).
    '''
    # Overall default rate, used as the expected default rate of every bin
    pos_cnt = X['Defaulter'].sum()
    all_cnt = X['Defaulter'].count()
    expected_ratio = float(pos_cnt) / all_cnt

    # Drop missing values; groupby sorts the distinct values in ascending order
    df = X[[col, 'Defaulter']].dropna()
    grouped = df.groupby(col)['Defaulter'].agg(['sum', 'count']).reset_index()

    # Per-bin statistic: (observed defaults - expected defaults)^2 / expected defaults
    expected_pos = grouped['count'] * expected_ratio
    chi_square = (grouped['sum'] - expected_pos) ** 2 / expected_pos

    # Collect the results in a DataFrame, one row per initial bin
    chi_result = pd.DataFrame({col: grouped[col],
                               'chi_square': chi_square,
                               'pos_cnt': grouped['sum'],
                               'expected_pos_cnt': expected_pos})
    return chi_result

def chiMerge(chi_result, maxInterval=5):
    '''
    Merge bins until at most maxInterval bins remain.
    '''
    group_cnt = len(chi_result)
    # While there are more bins than allowed, keep merging
    while group_cnt > maxInterval:
        # Row of the bin with the smallest chi-square value
        min_index = chi_result['chi_square'].idxmin()
        # First bin: merge it into the next bin, so the merged row keeps
        # the larger value as its label (the upper edge of the bin)
        if min_index == 0:
            chi_result = merge_chiSquare(chi_result, min_index, min_index + 1)
        # Last bin: merge the previous bin into it
        elif min_index == group_cnt - 1:
            chi_result = merge_chiSquare(chi_result, min_index - 1, min_index)
        # Middle bin: merge with whichever neighbor has the smaller chi-square value
        else:
            if chi_result.loc[min_index - 1, 'chi_square'] > chi_result.loc[min_index + 1, 'chi_square']:
                chi_result = merge_chiSquare(chi_result, min_index, min_index + 1)
            else:
                chi_result = merge_chiSquare(chi_result, min_index - 1, min_index)
        group_cnt = len(chi_result)
    return chi_result

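def cal_chisqure_threshold(dfree=4, cf=0.1):
    '''
    Return the chi-square critical value for the given degrees of freedom
    and significance level cf (upper-tail probability).
    cf must be one of the levels listed in percents.
    '''
    percents = [0.95, 0.90, 0.5, 0.1, 0.05, 0.025, 0.01, 0.005]
    ## Critical-value table: one row per degree of freedom (1..29), one column per level
    df = pd.DataFrame(np.array([chi2.isf(percents, df=i) for i in range(1, 30)]))
    df.columns = percents
    df.index = df.index + 1
    return df.loc[dfree, cf]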

def chiMerge_chisqure(chi_result, dfree=4, cf=0.1, maxInterval=5):
    '''
    Merge bins using both a chi-square threshold and a limit on the number of bins.
    '''
    threshold = cal_chisqure_threshold(dfree, cf)
    min_chiSquare = chi_result['chi_square'].min()
    group_cnt = len(chi_result)
    # Keep merging while the smallest chi-square value is below the threshold
    # and there are still more than maxInterval bins
    while min_chiSquare < threshold and group_cnt > maxInterval:
        min_index = chi_result['chi_square'].idxmin()
        # First bin: merge it into the next bin, so the merged row keeps
        # the larger value as its label (the upper edge of the bin)
        if min_index == 0:
            chi_result = merge_chiSquare(chi_result, min_index, min_index + 1)
        # Last bin: merge the previous bin into it
        elif min_index == group_cnt - 1:
            chi_result = merge_chiSquare(chi_result, min_index - 1, min_index)
        # Middle bin: merge with whichever neighbor has the smaller chi-square value
        else:
            if chi_result.loc[min_index - 1, 'chi_square'] > chi_result.loc[min_index + 1, 'chi_square']:
                chi_result = merge_chiSquare(chi_result, min_index, min_index + 1)
            else:
                chi_result = merge_chiSquare(chi_result, min_index - 1, min_index)
        min_chiSquare = chi_result['chi_square'].min()
        group_cnt = len(chi_result)
    return chi_result

def merge_chiSquare(chi_result, index, mergeIndex, a='expected_pos_cnt',
                    b='pos_cnt', c='chi_square'):
    '''
    Merge row index into row mergeIndex and recompute the chi-square value.
    mergeIndex is the row that holds the merged bin.
    '''
    chi_result = chi_result.copy()  # do not modify the caller's DataFrame
    chi_result.loc[mergeIndex, a] = chi_result.loc[mergeIndex, a] + chi_result.loc[index, a]
    chi_result.loc[mergeIndex, b] = chi_result.loc[mergeIndex, b] + chi_result.loc[index, b]
    ## Chi-square value of the merged bin: (observed - expected)^2 / expected
    chi_result.loc[mergeIndex, c] = (chi_result.loc[mergeIndex, b] - chi_result.loc[mergeIndex, a]) ** 2 / chi_result.loc[mergeIndex, a]
    chi_result = chi_result.drop([index])
    ## Reset the index so rows are numbered 0..n-1 again
    chi_result = chi_result.reset_index(drop=True)
    return chi_result

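## Run chi-square binning on every feature, using the threshold for the given degrees of freedom
## train is the training DataFrame and must include the 'Defaulter' column;
## train_X is assumed to hold the feature columns only
chi_train_X = train_X.copy()
chi_result_all = dict()
for col in chi_train_X.columns:
    print("start get " + col + " chi2 result")
    chi2_result = get_chi2(train, col)
    chi2_merge = chiMerge_chisqure(chi2_result, dfree=4, cf=0.05, maxInterval=5)
    chi_result_all[col] = chi2_merge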
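Once the bins are fixed, applying them is straightforward. The sketch below assumes the cut points have already been read off the binning result; the data and the `upper_edges` values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; in practice the cut points come from the binning result
data = pd.DataFrame({
    "age": [19, 23, 31, 35, 42, 47, 52, 58, 63, 70],
    "Defaulter": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})
upper_edges = [30, 45, 60]   # illustrative upper edges of the first three bins

# Right-closed intervals: (-inf, 30], (30, 45], (45, 60], (60, inf]
edges = [-np.inf] + upper_edges + [np.inf]
data["age_bin"] = pd.cut(data["age"], bins=edges)

# Sample count, number of defaults, and default rate per bin
summary = data.groupby("age_bin", observed=False)["Defaulter"].agg(["count", "sum", "mean"])
print(summary)
```

A table like this is a quick sanity check that the default rate differs meaningfully from bin to bin before the bins go into a scorecard.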

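[Author]: Labryant

[Original WeChat official account]: 風控獵人

[About]: Strategy analyst at a startup, motivated and working hard to improve. Nothing is settled yet; any of us could be the dark horse.

[Reposting]: Please credit the source when reposting. Thank you!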
