[Risk Control Modeling] Scorecard Development Based on Logistic Regression (I)

Author: 蜀道難. Posted in: Sports. Date: 2020-03-29

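Foreword:

The general workflow for building a risk control model is [business definition] -> [risk definition] -> [risk mitigation] -> [risk strategy]. This post covers only the [risk mitigation] part, that is, training a scorecard with an algorithm, and uses a logistic regression (LogisticRegression) model to walk through scorecard modeling.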

I. Understanding the logistic regression algorithm

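Logistic regression assumes that the data follow a Bernoulli distribution, estimates the parameters by maximum likelihood, and solves for them with gradient descent, with the goal of classifying the data into two classes.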
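To make that definition concrete, here is a minimal sketch of logistic regression fitted by batch gradient descent on the Bernoulli negative log-likelihood. The function name `fit_logistic_gd`, the learning rate, the iteration count and the synthetic data are all illustrative assumptions; they are not taken from the author's scripts.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iter=5000):
    """Fit weights w and bias b by batch gradient descent.

    Minimizes the mean negative Bernoulli log-likelihood
    -mean(y*log(p) + (1-y)*log(1-p)) with p = sigmoid(X @ w + b).
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        # Gradient of the mean negative log-likelihood
        grad_w = X.T @ (p - y) / n_samples
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic data (assumption): one feature, true log-odds = 2*x - 1
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1))
y = (rng.random(2000) < sigmoid(2.0 * X[:, 0] - 1.0)).astype(float)

w, b = fit_logistic_gd(X, y)
pred = (sigmoid(X @ w + b) >= 0.5).astype(float)
accuracy = np.mean(pred == y)
```

The fitted `w` and `b` should land close to the coefficients used to generate the data, which is a quick sanity check that the gradient is implemented correctly.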

For more on the algorithm, see this article:

II. Scorecard modeling workflow

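Developing an application scorecard model based on logistic regression:

① Data preparation: collect and consolidate data on existing customers, define the target variable, and exclude specific samples.

② Exploratory data analysis: assess the value distribution of each variable and handle outliers and missing values.

③ Data preprocessing: variable selection, variable binning, WOE transformation, and sampling.

④ Model development: fit the model with logistic regression.

⑤ Model evaluation: common evaluation measures such as ROC, KS, and AUC.

⑥ Scorecard generation.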

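Points to watch during data cleaning:

[Target variable definition]: credit scorecards and anti-fraud scorecards define Y differently; the days-past-due threshold for a bad sample should be set from roll-rate analysis; all samples should share the same performance window.

[Variable and sample screening]: drop variables and samples with severe missingness.

[Missing value imputation]: options include mean fill, fixed-value fill, and random-forest-based fill; the choice should weigh both the missing rate and the business meaning of the variable.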
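The screening and imputation points above can be sketched with pandas as follows. The column names, the toy values and the 0.5 missing-rate cutoff are hypothetical and only illustrate the mechanics; in practice the cutoff and the fill method depend on the business context.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): column names and values are illustrative only
df = pd.DataFrame({
    "age": [25, np.nan, 40, 33, np.nan, 51],
    "income": [3000.0, 5200.0, np.nan, 4100.0, 6100.0, 2800.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan, np.nan],
    "target": [0, 0, 1, 0, 1, 0],
})

MAX_MISSING_RATE = 0.5  # hypothetical cutoff; choose it from the business context

# 1) Drop variables whose missing rate is too high
missing_rate = df.isna().mean()
keep_cols = missing_rate[missing_rate <= MAX_MISSING_RATE].index
cleaned = df[keep_cols].copy()

# 2) Mean fill for a continuous variable
cleaned["income"] = cleaned["income"].fillna(cleaned["income"].mean())

# 3) Fixed-value fill when "missing" may itself be informative
cleaned["age"] = cleaned["age"].fillna(-1)
```

A fixed sentinel such as -1 keeps missing values in a bin of their own during later binning, which is often preferable when missingness correlates with the target.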

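As shown in the figure below, the overall modeling work has four parts: 1. organize the base data in the database (01_InputSQL); 2. merge the base data (02_LongFeatureList); 3. exploratory data analysis (03_Exploratory_Data Analysis); 4. scorecard model development (04_Scorecard).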

Project structure

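III. Scorecard model development

This section focuses on the model development part (04_Scorecard): feature binning, WOE monotonicity adjustment, WOE transformation, model fitting, and model evaluation.

The Python scripts for the model development part are shared here.

GitHub link: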

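1. Feature binning

Feature binning methods include equal-frequency binning, equal-width binning, Best KS binning, categorical binning, and chi-square binning. In practice several methods are often combined. Taking chi-square binning as an example: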
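Before the chi-square version, the difference between the two simplest methods, equal-width and equal-frequency binning, can be seen with `pd.cut` and `pd.qcut`. The skewed toy variable and the choice of five bins below are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Skewed toy variable (assumption), e.g. an income-like feature
x = pd.Series(rng.exponential(scale=1000.0, size=1000))

# Equal-width: every bin spans the same range of values
equal_width = pd.cut(x, bins=5)

# Equal-frequency: every bin holds roughly the same number of samples
equal_freq = pd.qcut(x, q=5, duplicates="drop")

# Counts per bin, in bin order
width_counts = equal_width.value_counts(sort=False)
freq_counts = equal_freq.value_counts(sort=False)
```

On skewed data, equal-width bins leave most samples in the first bin and very few in the last, which is one reason equal-frequency binning is the usual starting point for chi-square merging.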

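# Chi-square binning, used for numeric variables
import numpy as np
import pandas as pd
import scipy.stats
import matplotlib.pyplot as plt

def graphforbestbin(DF, X, Y, n, q, graph=False):
    '''
    Automatic optimal binning function based on the chi-square test.

    Parameters:
    DF: input data
    X: name of the column to bin
    Y: name of the label column for the binned data
    n: number of bins to keep
    q: initial number of bins
    graph: whether to plot the IV curve

    Intervals are open on the left and closed on the right: (]
    '''
    DF = DF[[X, Y]].copy()

    # Start from q equal-frequency bins
    DF["qcut"], bins = pd.qcut(DF[X], retbins=True, q=q, duplicates="drop")
    # Good (Y == 0) and bad (Y == 1) counts per bin; observed=False keeps empty bins as 0
    coount_y0 = DF.loc[DF[Y] == 0].groupby("qcut", observed=False).count()[Y]
    coount_y1 = DF.loc[DF[Y] == 1].groupby("qcut", observed=False).count()[Y]
    # Each entry is (lower edge, upper edge, good count, bad count)
    num_bins = [*zip(bins, bins[1:], coount_y0, coount_y1)]

    # Make sure every bin contains both good and bad samples
    for i in range(q):
        # A first bin with a zero count is merged into the next bin
        if 0 in num_bins[0][2:]:
            num_bins[0:2] = [(
                num_bins[0][0],
                num_bins[1][1],
                num_bins[0][2] + num_bins[1][2],
                num_bins[0][3] + num_bins[1][3])]
            continue
        # Any later bin with a zero count is merged into the previous bin
        for i in range(len(num_bins)):
            if 0 in num_bins[i][2:]:
                num_bins[i-1:i+1] = [(
                    num_bins[i-1][0],
                    num_bins[i][1],
                    num_bins[i-1][2] + num_bins[i][2],
                    num_bins[i-1][3] + num_bins[i][3])]
                break
        else:
            break

    def get_woe(num_bins):
        columns = ["min", "max", "count_0", "count_1"]
        df = pd.DataFrame(num_bins, columns=columns)
        df["total"] = df.count_0 + df.count_1
        df["percentage"] = df.total / df.total.sum()
        df["bad_rate"] = df.count_1 / df.total
        df["good%"] = df.count_0 / df.count_0.sum()
        df["bad%"] = df.count_1 / df.count_1.sum()
        df["woe"] = np.log(df["good%"] / df["bad%"])
        return df

    def get_iv(df):
        rate = df["good%"] - df["bad%"]
        iv = np.sum(rate * df.woe)
        return iv

    IV = []
    axisx = []
    # Defined up front so the function still returns a table when no merge is needed
    bins_df = get_woe(num_bins)
    while len(num_bins) > n:
        # Chi-square p-value for every pair of adjacent bins
        pvs = []
        for i in range(len(num_bins) - 1):
            x1 = num_bins[i][2:]
            x2 = num_bins[i+1][2:]
            pv = scipy.stats.chi2_contingency([x1, x2])[1]
            pvs.append(pv)

        # Merge the most similar pair (largest p-value)
        i = pvs.index(max(pvs))
        num_bins[i:i+2] = [(
            num_bins[i][0],
            num_bins[i+1][1],
            num_bins[i][2] + num_bins[i+1][2],
            num_bins[i][3] + num_bins[i+1][3])]

        bins_df = get_woe(num_bins)
        axisx.append(len(num_bins))
        IV.append(get_iv(bins_df))

    if graph:
        plt.figure()
        plt.plot(axisx, IV)
        plt.xticks(axisx)
        plt.xlabel("number of box")
        plt.ylabel("IV")
        plt.show()
    return bins_df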

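Plot the learning curve of each feature across different numbers of bins, and choose the bin count where the slope of the IV curve changes most as the optimal number of bins. In the figure below, the IV of this feature rises as the number of bins increases. The IV changes noticeably at 3 to 4 bins, so 3 or 4 can be chosen as the optimal number of bins.

Feature binning learning curve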
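As a small worked example of the two quantities behind that curve, the sketch below computes WOE and IV for a three-bin table, using the same definitions as the binning code: WOE = ln(good% / bad%) and IV = sum((good% - bad%) * WOE). The bin labels and counts are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy bin counts (assumption): count_0 = good samples, count_1 = bad samples
bins = pd.DataFrame({
    "bin": ["(18, 30]", "(30, 45]", "(45, 60]"],
    "count_0": [200, 500, 300],
    "count_1": [40, 40, 20],
})

# Share of all good / all bad samples that falls into each bin
bins["good%"] = bins["count_0"] / bins["count_0"].sum()
bins["bad%"] = bins["count_1"] / bins["count_1"].sum()

# Weight of evidence per bin and information value of the feature
bins["woe"] = np.log(bins["good%"] / bins["bad%"])
iv = ((bins["good%"] - bins["bad%"]) * bins["woe"]).sum()
```

Here the first bin holds 20% of the good samples but 40% of the bad ones, so its WOE is ln(0.5) and negative; the feature's IV comes out at about 0.20.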

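2. WOE monotonicity adjustment

For the selected variables, WOE is generally monotonic in the probability of default. If a U shape or some other curve appears, the variable should be re-checked for problems. Because the goal is interpretability, a U shape is acceptable for variables where it makes business sense, such as age.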

# Adjust WOE monotonicity
# -------------------------------------------------------------
def judge_increasing(L):
    """
    :param L: list
    :return: whether the list is strictly increasing
    """
    return all(x < y for x, y in zip(L, L[1:]))

def judge_decreasing(L):
    """
    :param L: list
    :return: whether the list is strictly decreasing
    """
    return all(x > y for x, y in zip(L, L[1:]))

# iv_final, bins_of_final, train_data and binning_self are defined elsewhere and not shown in this post
col = list(iv_final[iv_final.IV > 0.05].col.unique())
bins_of_final_rewoe = bins_of_final.copy()

for cc in col:
    cut = bins_of_final[cc]
    # Convert to a plain list so that woe_lst[0] and slicing are positional
    woe_lst = iv_final[iv_final['col'] == cc].woe.tolist()
    if woe_lst[0] > 0:
        while not judge_decreasing(woe_lst):
            judge_list = [x > y for x, y in zip(woe_lst, woe_lst[1:])]
            # Cut points between each pair of bins that breaks monotonicity
            index_list = [i + 1 for i, j in enumerate(judge_list) if not j]
            new_cut = [j for i, j in enumerate(cut) if i not in index_list]
            bins_of_final_rewoe[cc] = new_cut
            bin_df_cc, iv_value_cc = binning_self(train_data, cc, 'target', cut=new_cut)
            woe_lst = bin_df_cc['woe'].tolist()
            cut = new_cut
    elif woe_lst[0] < 0:
        while not judge_increasing(woe_lst):
            judge_list = [x < y for x, y in zip(woe_lst, woe_lst[1:])]
            index_list = [i + 1 for i, j in enumerate(judge_list) if not j]
            new_cut = [j for i, j in enumerate(cut) if i not in index_list]
            bins_of_final_rewoe[cc] = new_cut
            bin_df_cc, iv_value_cc = binning_self(train_data, cc, 'target', cut=new_cut)
            woe_lst = bin_df_cc['woe'].tolist()
            cut = new_cut
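Since the script above depends on helpers that are not shown, here is a self-contained sketch of the same idea on plain bin counts. It is a simplified variant rather than the author's procedure: it merges one offending pair of adjacent bins per pass instead of dropping all offending cut points and re-binning. The function names and the toy counts are assumptions.

```python
import numpy as np

def woe_from_counts(good, bad):
    """WOE per bin, defined as ln(good% / bad%)."""
    good = np.asarray(good, dtype=float)
    bad = np.asarray(bad, dtype=float)
    return np.log((good / good.sum()) / (bad / bad.sum()))

def merge_until_monotone(good, bad):
    """Merge adjacent bins until WOE is strictly monotone.

    The direction is taken from the first and last bin; each pass merges
    the first adjacent pair that breaks that direction.
    """
    good, bad = list(good), list(bad)
    while len(good) > 2:
        woe = woe_from_counts(good, bad)
        increasing = woe[-1] > woe[0]
        diffs = np.diff(woe)
        violations = np.where(diffs <= 0)[0] if increasing else np.where(diffs >= 0)[0]
        if len(violations) == 0:
            break
        i = int(violations[0])
        # Replace bins i and i+1 with their sum
        good[i:i + 2] = [good[i] + good[i + 1]]
        bad[i:i + 2] = [bad[i] + bad[i + 1]]
    return good, bad

# Toy good/bad counts per bin (assumption); every bin has both classes
good = [100, 300, 250, 350]
bad = [30, 30, 35, 5]
merged_good, merged_bad = merge_until_monotone(good, bad)
merged_woe = woe_from_counts(merged_good, merged_bad)
```

In this toy case the third bin dips below the second, so the two are merged and the remaining three bins have strictly increasing WOE.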

3. WOE transformation

Map the base data to WOE values.

def get_woe(df, col, y, bins):
    df = df[[col, y]].copy()
    df["cut"] = pd.cut(df[col], bins)
    # Good (0) and bad (1) counts per bin
    bins_df = df.groupby("cut")[y].value_counts().unstack()
    woe = bins_df["woe"] = np.log((bins_df[0] / bins_df[0].sum()) / (bins_df[1] / bins_df[1].sum()))
    return woe

# Store the WOE of every feature in a dictionary
woeall = {}
for col in bins_of_final:
    woeall[col] = get_woe(model_data, col, "target", bins_of_final[col])

# Map all WOE values onto the original data
model_woe = pd.DataFrame(index=model_data.index)
for col in bins_of_final:
    model_woe[col] = pd.cut(model_data[col], bins_of_final[col]).map(woeall[col])

4. Model training

Use the Logistic Regression algorithm in sklearn to instantiate the model, fit it on the training data, and output the model coefficients.

# LR and skplt are taken to be these aliases
from sklearn.linear_model import LogisticRegression as LR
import scikitplot as skplt

# Instantiate
lr = LR()
# Fit the model on the training data
lr = lr.fit(X, y)
lr.score(vali_X, vali_y)

'''
lr = LR(penalty='l1', solver='liblinear', C=0.5, max_iter=100)
# Fit the model on the training data
lr = lr.fit(X, y)
lr.score(vali_X, vali_y)
'''

'''
# Try to improve the logistic regression with learning curves over C and max_iter
c_1 = np.linspace(0.01, 10, 20)
score = []
for i in c_1:
    lr = LR(solver='liblinear', C=i).fit(X, y)
    score.append(lr.score(vali_X, vali_y))
plt.figure()
plt.plot(c_1, score)
plt.show()
'''

# Check the model on the ROC curve
vali_proba_df = pd.DataFrame(lr.predict_proba(vali_X))
skplt.metrics.plot_roc(vali_y, vali_proba_df,
                       plot_micro=False, figsize=(6, 6),
                       plot_macro=False)

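5. Model evaluation

Evaluate the model with the ROC curve and the KS curve.

The AUC is 0.67 and the KS is 0.27, so overall the model performs only moderately.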

skplt.metrics.plot_roc(vali_y, vali_proba_df,
                       plot_micro=False, figsize=(6, 6),
                       plot_macro=False)

def plot_model_ks(y_label, y_pred):
    """
    Plot the KS curve.
    param:
        y_label: true y values, list/array
        y_pred: predicted y values, list/array
    return:
        the KS curve
    """
    pred_list = list(y_pred)
    label_list = list(y_label)
    total_bad = sum(label_list)
    total_good = len(label_list) - total_bad
    items = sorted(zip(pred_list, label_list), key=lambda x: x[0])
    # Sweep 200 evenly spaced thresholds across the range of predictions
    step = (max(pred_list) - min(pred_list)) / 200

    pred_bin = []
    good_rate = []
    bad_rate = []
    ks_list = []
    for i in range(1, 201):
        idx = min(pred_list) + i * step
        pred_bin.append(idx)
        # Labels of all samples scored below the current threshold
        label_bin = [x[1] for x in items if x[0] < idx]
        bad_num = sum(label_bin)
        good_num = len(label_bin) - bad_num
        goodrate = good_num / total_good
        badrate = bad_num / total_bad
        ks = abs(goodrate - badrate)
        good_rate.append(goodrate)
        bad_rate.append(badrate)
        ks_list.append(ks)

    fig = plt.figure(figsize=(6, 4))
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(pred_bin, good_rate, color='green', label='good_rate')
    ax.plot(pred_bin, bad_rate, color='red', label='bad_rate')
    ax.plot(pred_bin, ks_list, color='blue', label='good-bad')
    ax.set_title('KS:{:.3f}'.format(max(ks_list)))
    ax.legend(loc='best')
    return plt.show()

y_pred = lr.predict_proba(vali_X)[:, 1]
plot_model_ks(vali_y, y_pred=y_pred)
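When only the numbers are needed rather than the plots, AUC and KS can also be read off the ROC curve: KS is the largest gap between the cumulative bad rate (TPR) and the cumulative good rate (FPR). The sketch below uses sklearn's `roc_auc_score` and `roc_curve` on synthetic scores; the data and the variable names are assumptions, not the article's validation set.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
# Synthetic validation labels and scores (assumption): bad samples (1)
# tend to receive a higher predicted probability than good samples (0)
y_true = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])
y_score = np.concatenate([rng.beta(2, 5, size=900), rng.beta(3, 4, size=100)])

auc = roc_auc_score(y_true, y_score)

# KS: maximum distance between TPR and FPR over all thresholds
fpr, tpr, thresholds = roc_curve(y_true, y_score)
ks = np.max(tpr - fpr)
```

Because this evaluates every distinct score as a threshold, the KS it returns can be slightly larger than one computed on a fixed grid of 200 thresholds.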

6. Building the scorecard

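# Score scaling: Score = A - B * ln(odds)
PDO = 20            # assumption: points to double the odds; the value is missing from the source
BASE_ODDS = 1 / 60  # assumption: odds at the 600-point base; the value is missing from the source
B = PDO / np.log(2)
A = 600 + B * np.log(BASE_ODDS)

base_score = A - B * lr.intercept_[0]  # lr.intercept_: the intercept

file = r"D:\Scorecard\ScoreData.csv"
with open(file, "w") as fdata:
    fdata.write("base_score,{}\n".format(base_score))

# Append one score table per feature: points for each bin
for i, col in enumerate(X.columns):  # [*enumerate(X.columns)]
    score = woeall[col] * (-B * lr.coef_[0][i])
    score.name = "Score"
    score.index.name = col
    score.to_csv(file, header=True, mode="a")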
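The scaling behind a scorecard of this kind is the linear map Score = A - B * ln(odds), where B is set by the points needed to double the odds (PDO) and A by a base score at chosen base odds. Only the base score of 600 appears in the text; the PDO of 20 and the base odds of 1:60 in the sketch below are assumptions chosen for illustration, as are the function names.

```python
import numpy as np

# Assumed scaling constants: 600 points at bad:good odds of 1:60,
# and 20 points to double the odds (PDO)
BASE_SCORE = 600
BASE_ODDS = 1 / 60
PDO = 20

B = PDO / np.log(2)
A = BASE_SCORE + B * np.log(BASE_ODDS)

def score_from_odds(odds):
    """Map bad:good odds to a score; lower odds of default give a higher score."""
    return A - B * np.log(odds)

def score_from_proba(p_bad):
    """Convert a predicted bad probability to a score via odds = p / (1 - p)."""
    return score_from_odds(p_bad / (1 - p_bad))

s_base = score_from_odds(BASE_ODDS)        # score at the base odds
s_double = score_from_odds(2 * BASE_ODDS)  # score when the odds double
```

With these constants the base odds map to exactly 600 points and doubling the odds of default costs exactly 20 points, which is what makes scorecard points easy to explain to the business.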

Score distribution of good and bad samples (training set)

Score distribution of good and bad samples (test set)

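IV. Summary

1. Compared with machine learning algorithms such as xgboost, logistic regression has only moderate predictive performance, but it wins on interpretability and is widely used for scorecards. Without any parameter tuning, the xgboost model reached an AUC of 0.65, not far from the AUC of 0.67 of the carefully tuned logistic regression.

2. The results show that this model still has considerable room for improvement. Areas to improve: 1) define the target variable strictly and keep the performance window consistent across samples; 2) use more feature variables: only a little over 20 candidate variables had an IV above 0.05 this time, which is a major reason for the weak model performance; 3) bin features more carefully: this time every variable was forced to have monotonic WOE, and future binning should take actual business logic into account; 4) the final regression coefficients have mixed signs, which suggests multicollinearity that should be addressed in future work; 5) improve sample selection, both by collecting more valid samples and by raising the bad rate in the sample, which was below 3% this time.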
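One common way to check the multicollinearity mentioned in point 4) is the variance inflation factor (VIF). The sketch below is a plain NumPy version on synthetic columns; the function name `vif` and the data are assumptions, and the article itself does not state which diagnostic the author intends to use.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (n_samples x n_features).

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on all other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic WOE-like columns (assumption): c is nearly a copy of a
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vifs = vif(np.column_stack([a, b, c]))
```

Columns with a large VIF (a rule of thumb often quoted is above 10) are candidates for removal or merging before refitting the logistic regression.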
