A Practical Case Study in LightGBM Optimization
0. Background and Modeling Approach
1. Background
This case study uses the data from the Kaggle competition "Santander Customer Satisfaction". It is an imbalanced binary-classification problem, and the objective is to maximize AUC (the area under the ROC curve). Competition link: Santander Customer Satisfaction | Kaggle. The competition has since closed.
2. Modeling Approach
This document uses Microsoft's open-source LightGBM algorithm for classification, which runs extremely fast.
1) Read the data;
2) Parallel computation: the lightgbm package can parallelize through its own parameters, so the doParallel and foreach packages are not needed;
3) Feature selection: the mlr package is used to keep the features that account for 99% of the cumulative chi.squared score;
4) Parameter tuning: tune the parameters of lgb.cv step by step, repeating until the results are satisfactory;
5) Prediction: build the LightGBM model with the tuned parameter values and output predictions. The program in this case study reaches an AUC of 0.833386, above the top score on the Private Leaderboard (0.829072).
3. The LightGBM Algorithm
A detailed introduction to LightGBM is available at Microsoft/LightGBM. Since the project does not publish explicit mathematical formulas, they are not covered here; see the GitHub project page if needed.
4. Contact and About the Author
For questions about the algorithm, contact E-mail: sugs01@outlook.com
Su Gaosheng holds an M.S. in Statistics from Southwestern University of Finance and Economics and works at China Telecom on data analysis and modeling for existing enterprise customers. Research interest: machine learning.
I. Reading the Data
options(java.parameters = '-Xmx8g') ## needed for feature selection later; must be set before loading the packages
library(readr)
lgb_tr1 <- read_csv('C:/Users/Administrator/Documents/kaggle/scs_lgb/train.csv')
lgb_te1 <- read_csv('C:/Users/Administrator/Documents/kaggle/scs_lgb/test.csv')
II. Data Exploration
1. Set up parallel computation
library(dplyr)
library(mlr)
library(parallelMap)
parallelStartSocket(2)
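(Once the mlr-based steps below are finished, the socket workers can be released with parallelStop().)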
2. First look at each column
summarizeColumns(lgb_tr1) %>% View()
3. Handle missing values
# impute missing values by mean (integer and numeric columns)
imp_tr1 <- impute(
  as.data.frame(lgb_tr1),
  classes = list(
    integer = imputeMean(),
    numeric = imputeMean()
  )
)
imp_te1 <- impute(
  as.data.frame(lgb_te1),
  classes = list(
    integer = imputeMean(),
    numeric = imputeMean()
  )
)
After imputation:
summarizeColumns(imp_tr1$data) %>% View()
4. Check the class proportions in the training data: the classes are imbalanced
table(lgb_tr1$TARGET)
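For reference, the imbalance ratio that motivates the weight search range used later can be computed directly; a minimal sketch:
cls <- table(lgb_tr1$TARGET)
cls[1] / cls[2]  ## about 24.27: roughly 24 negatives for every positive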
5. Remove constant columns from both datasets
lgb_tr2 <- removeConstantFeatures(imp_tr1$data)
lgb_te2 <- removeConstantFeatures(imp_te1$data)
6. Keep only the columns shared by the training and test sets
tr2_name <- data.frame(tr2_name = colnames(lgb_tr2))
te2_name <- data.frame(te2_name = colnames(lgb_te2))
tr2_name_inner <- tr2_name %>%
  inner_join(te2_name, by = c('tr2_name' = 'te2_name'))
TARGET <- data.frame(TARGET = lgb_tr2$TARGET)
lgb_tr2 <- lgb_tr2[, tr2_name_inner$tr2_name[2:nrow(tr2_name_inner)]]  ## start at 2: the first shared column appears to be the ID, not a feature
lgb_te2 <- lgb_te2[, tr2_name_inner$tr2_name[2:nrow(tr2_name_inner)]]
lgb_tr2 <- cbind(lgb_tr2, TARGET)
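An equivalent, more compact formulation of the block above using base R set operations (an alternative, not meant to be run in addition to it; it assumes, as above, that the first shared column is the ID):
shared <- intersect(colnames(lgb_tr2), colnames(lgb_te2))[-1]  ## drop the ID column
lgb_te2 <- lgb_te2[, shared]
lgb_tr2 <- cbind(lgb_tr2[, shared], TARGET)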
7. Notes:
1) Because LightGBM is used, the data is not standardized;
2) LightGBM is extremely efficient and runs fast on data under 1 GB even without feature selection, but feature selection is performed here to speed things up further;
3) This case study goes straight to feature selection without engineering derived variables, because the features' real-world meanings are unknown, which makes it hard to generate sensible derivations.
III. Feature Selection: Chi-Squared Test
library(lightgbm)
1. Trial run for the weight parameter; it will be refined further below
grid_search <- expand.grid(
  weight = seq(1, 30, 2)
  ## table(lgb_tr1$TARGET)[1] / table(lgb_tr1$TARGET)[2] = 24.27261,
  ## so the weight is searched over [1, 30]
)
lgb_rate_1 <- numeric(length = nrow(grid_search))
set.seed(0)
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr2$TARGET * grid_search[i, 'weight'] + 1) / sum(lgb_tr2$TARGET * grid_search[i, 'weight'] + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr2[, 1:300]),
    label = lgb_tr2$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc'
  )
  # cross-validation
  lgb_tr2_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    learning_rate = 0.1,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  lgb_rate_1[i] <- unlist(lgb_tr2_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr2_mod$record_evals$valid$auc$eval))]
}
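The unlist(...)[length(...)] idiom for pulling the final validation AUC out of an lgb.cv result recurs throughout this document; it can be wrapped in a small helper. A minimal sketch (the name last_cv_auc is ad hoc, not part of the lightgbm API):
## extract the last recorded validation AUC from an lgb.cv result
last_cv_auc <- function(cv_mod) {
  auc_path <- unlist(cv_mod$record_evals$valid$auc$eval)
  auc_path[length(auc_path)]
}
## usage: lgb_rate_1[i] <- last_cv_auc(lgb_tr2_mod)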
library(ggplot2)
grid_search$perf <- lgb_rate_1
ggplot(grid_search, aes(x = weight, y = perf)) +
  geom_point()
The plot shows that AUC is not very sensitive to the weight, peaking at weight = 5.
2. Feature selection
1) Run the filter
lgb_tr2$TARGET <- factor(lgb_tr2$TARGET)
lgb.task <- makeClassifTask(data = lgb_tr2, target = 'TARGET')
## note: oversample() does plain random oversampling of the minority class; despite the object name, this is not SMOTE
lgb.task.smote <- oversample(lgb.task, rate = 5)
fv_time <- system.time(
  fv <- generateFilterValuesData(
    lgb.task.smote,
    method = c('chi.squared')
    ## information gain or the chi-squared test both work here; the random forest filter is not recommended, as it is extremely slow
    ## filtering by IV (information value) is also worth a try
    ## feature engineering sets the ceiling on the target metric (AUC here), so the choice of filter can itself be treated as a hyperparameter
  )
)
2) Plot the filter values
# plotFilterValues(fv)
plotFilterValuesGGVIS(fv)
3) Keep the features accounting for 99% of the cumulative chi.squared score (LightGBM is efficient enough that more variables can be kept)
Note: the X in "top X% of chi.squared" can itself be treated as a hyperparameter.
fv_data2 <- fv$data %>%
  arrange(desc(chi.squared)) %>%
  mutate(chi_gain_cul = cumsum(chi.squared) / sum(chi.squared))
fv_data2_filter <- fv_data2 %>% filter(chi_gain_cul <= 0.99)
dim(fv_data2_filter) ## the number of predictors is roughly halved
fv_feature <- fv_data2_filter$name
lgb_tr3 <- lgb_tr2[, c(fv_feature, 'TARGET')]
lgb_te3 <- lgb_te2[, fv_feature]
4) Write out the data
write_csv(lgb_tr3, 'C:/users/Administrator/Documents/kaggle/scs_lgb/lgb_tr3_chi.csv')
write_csv(lgb_te3, 'C:/users/Administrator/Documents/kaggle/scs_lgb/lgb_te3_chi.csv')
IV. The Algorithm: First Round of Tuning
lgb_tr <- rxImport('C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_tr3_chi.csv')
lgb_te <- rxImport('C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_te3_chi.csv')
## tip: to save memory, read lgb_te only when it is time to predict
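rxImport ships with Microsoft R (RevoScaleR) and is not part of open-source R; the same step can be done with readr, already loaded above. A minimal sketch:
lgb_tr <- as.data.frame(read_csv('C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_tr3_chi.csv'))
lgb_te <- as.data.frame(read_csv('C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_te3_chi.csv'))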
library(lightgbm)
1. Tuning the weight parameter
grid_search <- expand.grid(
  weight = 1:30
)
perf_weight_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * grid_search[i, 'weight'] + 1) / sum(lgb_tr$TARGET * grid_search[i, 'weight'] + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc'
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    learning_rate = 0.1,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_weight_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
library(ggplot2)
grid_search$perf <- perf_weight_1
ggplot(grid_search, aes(x = weight, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows AUC peaking at weight = 4 and declining afterwards.
2. Tuning the learning_rate parameter
grid_search <- expand.grid(
  learning_rate = 2 ^ (-(8:1))
)
perf_learning_rate_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_learning_rate_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_learning_rate_1
ggplot(grid_search, aes(x = learning_rate, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows AUC peaking at learning_rate = 2^(-5), but the differences across 2^(-6) through 2^(-3) are tiny, so learning_rate = 0.125 is used to speed up training.
3. Tuning the num_leaves parameter
grid_search <- expand.grid(
  learning_rate = 0.125,
  num_leaves = seq(50, 800, 50)
)
perf_num_leaves_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_num_leaves_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_num_leaves_1
ggplot(grid_search, aes(x = num_leaves, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows AUC peaking at num_leaves = 650.
4. Tuning the min_data_in_leaf parameter
grid_search <- expand.grid(
  learning_rate = 0.125,
  num_leaves = 650,
  min_data_in_leaf = 2 ^ (1:7)
)
perf_min_data_in_leaf_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    min_data_in_leaf = grid_search[i, 'min_data_in_leaf']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_min_data_in_leaf_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_min_data_in_leaf_1
ggplot(grid_search, aes(x = min_data_in_leaf, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows AUC is insensitive to min_data_in_leaf, so it is left at the default.
5. Tuning the max_bin parameter
grid_search <- expand.grid(
  learning_rate = 0.125,
  num_leaves = 650,
  max_bin = 2 ^ (5:10)
)
perf_max_bin_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_max_bin_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_max_bin_1
ggplot(grid_search, aes(x = max_bin, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows AUC peaking at max_bin = 2^10, so max_bin needs a further round of fine-tuning.
6. Fine-tuning the max_bin parameter
grid_search <- expand.grid(
  learning_rate = 0.125,
  num_leaves = 650,
  max_bin = 100 * (6:15)
)
perf_max_bin_2 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_max_bin_2[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_max_bin_2
ggplot(grid_search, aes(x = max_bin, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows AUC peaking at max_bin = 1000.
7. Tuning the min_data_in_bin parameter
grid_search <- expand.grid(
  learning_rate = 0.125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 2 ^ (1:9)
)
perf_min_data_in_bin_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_min_data_in_bin_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_min_data_in_bin_1
ggplot(grid_search, aes(x = min_data_in_bin, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows AUC peaking at min_data_in_bin = 8, but the variation is minuscule, so no further adjustment is made.
8. Tuning the feature_fraction parameter
grid_search <- expand.grid(
  learning_rate = 0.125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = seq(0.5, 1, 0.02)
)
perf_feature_fraction_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_feature_fraction_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_feature_fraction_1
ggplot(grid_search, aes(x = feature_fraction, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows AUC peaking at feature_fraction = 0.62; it stays stable and strong over [0.60, 0.62] and declines from 0.64 onward.
9. Tuning the min_sum_hessian parameter
grid_search <- expand.grid(
  learning_rate = 0.125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = 0.62,
  min_sum_hessian = seq(0, 0.02, 0.001)
)
perf_min_sum_hessian_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_min_sum_hessian_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_min_sum_hessian_1
ggplot(grid_search, aes(x = min_sum_hessian, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows AUC peaking at min_sum_hessian = 0.005; values in [0.002, 0.005] are recommended, with a decline beyond 0.005.
10. Tuning the lambda parameters
grid_search <- expand.grid(
  learning_rate = 0.125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = 0.62,
  min_sum_hessian = 0.005,
  lambda_l1 = seq(0, 0.01, 0.002),
  lambda_l2 = seq(0, 0.01, 0.002)
)
perf_lambda_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_lambda_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_lambda_1
ggplot(data = grid_search, aes(x = lambda_l1, y = perf)) +
  geom_point() +
  facet_wrap(~ lambda_l2, nrow = 5)
The plot suggests lambda_l1 = 0 and lambda_l2 = 0.
11. Tuning the drop_rate parameter
grid_search <- expand.grid(
  learning_rate = 0.125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = 0.62,
  min_sum_hessian = 0.005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = seq(0, 1, 0.1)
)
perf_drop_rate_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_drop_rate_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_drop_rate_1
ggplot(data = grid_search, aes(x = drop_rate, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows AUC peaking at drop_rate = 0.2, with 0, 0.2, and 0.5 all performing well; overall the variation across [0, 1] is small.
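(Caveat: drop_rate and max_drop are DART-specific parameters; the parameter lists in this document never set boosting = 'dart', so under LightGBM's default gbdt boosting they are likely ignored, and the variation seen here and in the next subsection is plausibly cross-validation noise.)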
12. Tuning the max_drop parameter
grid_search <- expand.grid(
  learning_rate = 0.125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = 0.62,
  min_sum_hessian = 0.005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = 0.2,
  max_drop = seq(1, 10, 2)
)
perf_max_drop_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_max_drop_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_max_drop_1
ggplot(data = grid_search, aes(x = max_drop, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows AUC peaking at max_drop = 5, with little variation over [1, 10].
V. Second Round of Tuning
1. Tuning the weight parameter
grid_search <- expand.grid(
  learning_rate = 0.125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = 0.62,
  min_sum_hessian = 0.005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = 0.2,
  max_drop = 5
)
perf_weight_2 <- numeric(length = 20) ## the loop below scans weight = 1:20, so the vector needs length 20
for(i in 1:20){
  lgb_weight <- (lgb_tr$TARGET * i + 1) / sum(lgb_tr$TARGET * i + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[1, 'learning_rate'],
    num_leaves = grid_search[1, 'num_leaves'],
    max_bin = grid_search[1, 'max_bin'],
    min_data_in_bin = grid_search[1, 'min_data_in_bin'],
    feature_fraction = grid_search[1, 'feature_fraction'],
    min_sum_hessian = grid_search[1, 'min_sum_hessian'],
    lambda_l1 = grid_search[1, 'lambda_l1'],
    lambda_l2 = grid_search[1, 'lambda_l2'],
    drop_rate = grid_search[1, 'drop_rate'],
    max_drop = grid_search[1, 'max_drop']
  )
  # cross-validation (learning_rate comes from params, so it is not passed again here)
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_weight_2[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
library(ggplot2)
ggplot(data.frame(num = 1:length(perf_weight_2), perf = perf_weight_2), aes(x = num, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows AUC stabilizing once weight >= 3, with the maximum at weight = 7.
2. Tuning the learning_rate parameter
grid_search <- expand.grid(
  learning_rate = seq(0.05, 0.5, 0.03),
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = 0.62,
  min_sum_hessian = 0.005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = 0.2,
  max_drop = 5
)
perf_learning_rate_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_learning_rate_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_learning_rate_1
ggplot(data = grid_search, aes(x = learning_rate, y = perf)) +
  geom_point() +
  geom_smooth()
Conclusion: AUC is maximized at learning_rate = 0.11.
3. Tuning the num_leaves parameter
grid_search <- expand.grid(
  learning_rate = 0.11,
  num_leaves = seq(100, 800, 50),
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = 0.62,
  min_sum_hessian = 0.005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = 0.2,
  max_drop = 5
)
perf_num_leaves_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_num_leaves_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_num_leaves_1
ggplot(data = grid_search, aes(x = num_leaves, y = perf)) +
  geom_point() +
  geom_smooth()
Conclusion: AUC is maximized at num_leaves = 200.
4. Tuning the max_bin parameter
grid_search <- expand.grid(
  learning_rate = 0.11,
  num_leaves = 200,
  max_bin = seq(100, 1500, 100),
  min_data_in_bin = 8,
  feature_fraction = 0.62,
  min_sum_hessian = 0.005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = 0.2,
  max_drop = 5
)
perf_max_bin_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_max_bin_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_max_bin_1
ggplot(data = grid_search, aes(x = max_bin, y = perf)) +
  geom_point() +
  geom_smooth()
Conclusion: AUC is maximized at max_bin = 600; 400 and 800 are also acceptable.
5. Tuning the min_data_in_bin parameter
grid_search <- expand.grid(
  learning_rate = 0.11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = seq(5, 50, 5),
  feature_fraction = 0.62,
  min_sum_hessian = 0.005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = 0.2,
  max_drop = 5
)
perf_min_data_in_bin_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_min_data_in_bin_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_min_data_in_bin_1
ggplot(data = grid_search, aes(x = min_data_in_bin, y = perf)) +
  geom_point() +
  geom_smooth()
Conclusion: AUC is maximized at min_data_in_bin = 45; 25 is also acceptable.
6. Tuning the feature_fraction parameter
grid_search <- expand.grid(
  learning_rate = 0.11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = seq(0.5, 0.9, 0.02),
  min_sum_hessian = 0.005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = 0.2,
  max_drop = 5
)
perf_feature_fraction_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_feature_fraction_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_feature_fraction_1
ggplot(data = grid_search, aes(x = feature_fraction, y = perf)) +
  geom_point() +
  geom_smooth()
Conclusion: AUC is maximized at feature_fraction = 0.54; 0.56 and 0.58 also perform well.
7. Tuning the min_sum_hessian parameter
grid_search <- expand.grid(
  learning_rate = 0.11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = 0.54,
  min_sum_hessian = seq(0.001, 0.008, 0.0005),
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = 0.2,
  max_drop = 5
)
perf_min_sum_hessian_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_min_sum_hessian_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_min_sum_hessian_1
ggplot(data = grid_search, aes(x = min_sum_hessian, y = perf)) +
  geom_point() +
  geom_smooth()
Conclusion: AUC is maximized at min_sum_hessian = 0.0065; 0.003 and 0.0055 are also acceptable.
8. Tuning the lambda parameters
grid_search <- expand.grid(
  learning_rate = 0.11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = 0.54,
  min_sum_hessian = 0.0065,
  lambda_l1 = seq(0, 0.001, 0.0002),
  lambda_l2 = seq(0, 0.001, 0.0002),
  drop_rate = 0.2,
  max_drop = 5
)
perf_lambda_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_lambda_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_lambda_1
ggplot(data = grid_search, aes(x = lambda_l1, y = perf)) +
  geom_point() +
  facet_wrap(~ lambda_l2, nrow = 5)
Conclusion: lambda is negatively correlated with AUC overall; lambda_l1 = 0.0002 and lambda_l2 = 0.0004 are chosen.
9. Tuning the drop_rate parameter
grid_search <- expand.grid(
  learning_rate = 0.11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = 0.54,
  min_sum_hessian = 0.0065,
  lambda_l1 = 0.0002,
  lambda_l2 = 0.0004,
  drop_rate = seq(0, 0.5, 0.05),
  max_drop = 5
)
perf_drop_rate_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_drop_rate_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_drop_rate_1
ggplot(data = grid_search, aes(x = drop_rate, y = perf)) +
  geom_point()
Conclusion: the maximum occurs at drop_rate = 0.4; 0.15 and 0.25 are also acceptable.
10. Tuning the max_drop parameter
grid_search <- expand.grid(
  learning_rate = 0.11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = 0.54,
  min_sum_hessian = 0.0065,
  lambda_l1 = 0.0002,
  lambda_l2 = 0.0004,
  drop_rate = 0.4,
  max_drop = seq(1, 29, 2)
)
perf_max_drop_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_max_drop_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_max_drop_1
ggplot(data = grid_search, aes(x = max_drop, y = perf)) +
  geom_point()
Conclusion: the maximum occurs at max_drop = 14, though the differences are minuscule.
VI. Prediction
1) Weights
lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
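With weight = 7, each positive case gets a raw weight of 8 against 1 for each negative case; dividing by the sum simply normalizes the weights to sum to one.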
2) Training dataset
lgb_train <- lgb.Dataset(
  data = data.matrix(lgb_tr[, 1:148]),
  label = lgb_tr$TARGET,
  free_raw_data = FALSE,
  weight = lgb_weight
)
3) Training
# parameters
params <- list(
  objective = 'binary', ## set explicitly: without it, lightgbm() would fall back to its regression default
  metric = 'auc',
  learning_rate = 0.11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = 0.54,
  min_sum_hessian = 0.0065,
  lambda_l1 = 0.0002,
  lambda_l2 = 0.0004,
  drop_rate = 0.4,
  max_drop = 14
)
# model
lgb_mod <- lightgbm(
  params = params,
  data = lgb_train,
  nrounds = 300,
  early_stopping_rounds = 10,
  num_threads = 2
)
# prediction
lgb.pred <- predict(lgb_mod, data.matrix(lgb_te))
4) Results
lgb.pred2 <- matrix(unlist(lgb.pred), ncol = 1)
lgb.pred3 <- data.frame(lgb.pred2)
5) Output
write.csv(lgb.pred3, 'C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb.pred1_tr.csv')
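A Kaggle submission needs an ID column alongside the predictions. Assuming the raw test set lgb_te1 is still in memory and carries its ID column, a minimal sketch:
submission <- data.frame(ID = lgb_te1$ID, TARGET = lgb.pred)
write_csv(submission, 'C:/Users/Administrator/Documents/kaggle/scs_lgb/submission.csv')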
Note: some advice for readers still in school:
1. In school, the data used to test machine-learning algorithms is usually small, so you can try most algorithms and most R functions; for example, the randomForest package is fine for testing random forests, and with somewhat more data you can set up parallel execution. But once the data reaches the GB scale, even a parallelized randomForest can no longer cope and memory overflows; functions from a commercial R distribution are recommended at that point;
2. School work focuses mainly on theory, and the test data is usually fairly clean; real-world data structures are generally more complex.