LightGBM Algorithm Tuning: A Practical Case Study

Author: Mu Ye | Published 2018-03-04

Part 0: Case Background and Modeling Approach

1. Background

The data for this case comes from the Kaggle competition "Santander Customer Satisfaction". It is an imbalanced binary-classification problem, and the objective is to maximize AUC (the area under the ROC curve). Competition link: Santander Customer Satisfaction | Kaggle. The competition has since ended.

2. Modeling Approach

This write-up uses Microsoft's open-source LightGBM algorithm for classification, which trains extremely fast.

1) Read the data;

2) Parallel computation: the lightgbm package can parallelize through its own parameters, so the doParallel and foreach packages are not needed;

3) Feature selection: use the mlr package to keep the features covering 99% of the cumulative chi.squared score;

4) Tuning: adjust the parameters of lgb.cv step by step, iterating until the results are satisfactory;

5) Prediction: build the LightGBM model with the tuned parameter values and output predictions. The program in this case reaches an AUC of 0.833386, above the top Private Leaderboard score (0.829072).

3. The LightGBM Algorithm

A detailed introduction to LightGBM is available at Microsoft/LightGBM. The project does not publish explicit mathematical formulas, so none are reproduced here; see the GitHub project page if needed.

4. Contact and About the Author

For questions about the algorithm, e-mail: sugs01@outlook.com

Su Gaosheng holds a master's degree in statistics from Southwestern University of Finance and Economics and now works at China Telecom, where he is responsible for data analysis and modeling of existing enterprise customers. Research interest: machine learning.

Part I: Reading the Data

options(java.parameters = '-Xmx8g') ## needed for feature selection below; must be set before loading the packages

library(readr)

lgb_tr1 <- read_csv('C:/Users/Administrator/Documents/kaggle/scs_lgb/train.csv')
lgb_te1 <- read_csv('C:/Users/Administrator/Documents/kaggle/scs_lgb/test.csv')

Part II: Data Exploration

1. Set up parallel computation

library(dplyr)

library(mlr)

library(parallelMap)

parallelStartSocket(2)
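A small side note (not in the original): parallelMap keeps the socket workers alive until they are stopped explicitly, so once the mlr feature-filtering work below is finished they can be released:

parallelStop()  # parallelMap; shuts down the socket cluster started above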

2. Initial look at each column

summarizeColumns(lgb_tr1) %>% View()

3. Handle missing values

# impute missing values in integer/numeric columns by the column mean

imp_tr1 <- impute(
    as.data.frame(lgb_tr1),
    classes = list(
        integer = imputeMean(),
        numeric = imputeMean()
    )
)

imp_te1 <- impute(
    as.data.frame(lgb_te1),
    classes = list(
        integer = imputeMean(),
        numeric = imputeMean()
    )
)

# after imputation
summarizeColumns(imp_tr1$data) %>% View()

4. Check the class ratio of the training data (the classes are imbalanced)

table(lgb_tr1$TARGET)
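To quantify the imbalance (and to motivate the weight range searched later), a quick sketch; the hard-coded level names '0' and '1' assume the usual 0/1 coding of TARGET:

# majority-to-minority ratio; on this dataset it is roughly 24.3,
# which is why weight is searched over [1, 30] below
class_counts <- table(lgb_tr1$TARGET)
as.numeric(class_counts['0'] / class_counts['1'])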

5. Drop constant columns from both datasets

lgb_tr2 <- removeConstantFeatures(imp_tr1$data)

lgb_te2 <- removeConstantFeatures(imp_te1$data)

6. Keep only the columns shared by the training and test sets

tr2_name <- data.frame(tr2_name = colnames(lgb_tr2))
te2_name <- data.frame(te2_name = colnames(lgb_te2))

tr2_name_inner <- tr2_name %>%
    inner_join(te2_name, by = c('tr2_name' = 'te2_name'))

TARGET <- data.frame(TARGET = lgb_tr2$TARGET)

# index 2:n skips the first shared column (the ID column)
lgb_tr2 <- lgb_tr2[, tr2_name_inner$tr2_name[2:nrow(tr2_name_inner)]]
lgb_te2 <- lgb_te2[, tr2_name_inner$tr2_name[2:nrow(tr2_name_inner)]]
lgb_tr2 <- cbind(lgb_tr2, TARGET)
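An equivalent, more compact formulation with base R set operations (a sketch, not from the original; it assumes the ID column is literally named 'ID', and it would replace the block above rather than follow it):

common_cols <- setdiff(intersect(colnames(lgb_tr2), colnames(lgb_te2)), 'ID')
lgb_tr2 <- cbind(lgb_tr2[, common_cols], TARGET)
lgb_te2 <- lgb_te2[, common_cols]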

7. Notes:

1) Because LightGBM is used, the data is not standardized;

2) LightGBM is efficient enough to run very fast on data under 1 GB even without feature selection, but feature selection is performed here to speed things up further;

3) Features are filtered directly, without creating derived variables, because the real-world meaning of the features is unknown and generating variables blindly is unwise.

Part III: Feature Selection (Chi-Squared Test)

library(lightgbm)

1. Trial run for the weight parameter (to be refined later)

grid_search <- expand.grid(
    weight = seq(1, 30, 2)
)

## table(lgb_tr1$TARGET)[1] / table(lgb_tr1$TARGET)[2] = 24.27261,
## hence weight is searched over [1, 30]

lgb_rate_1 <- numeric(length = nrow(grid_search))

set.seed(0)

for (i in 1:nrow(grid_search)) {
    # positives get weight (w + 1), negatives weight 1, normalized to sum to 1
    w <- grid_search[i, 'weight']
    lgb_weight <- (lgb_tr2$TARGET * w + 1) / sum(lgb_tr2$TARGET * w + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr2[, 1:300]),
        label = lgb_tr2$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc'
    )

    # cross-validation
    lgb_tr2_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        learning_rate = .1,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    lgb_rate_1[i] <- unlist(lgb_tr2_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr2_mod$record_evals$valid$auc$eval))]
}
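The long final expression, which recurs throughout this post, just pulls the validation AUC of the last (early-stopped) boosting round out of the lgb.cv record. A small helper (a sketch, not part of the original code) could replace every occurrence:

# final validation AUC from an lgb.cv result
last_cv_auc <- function(cv_mod) {
    aucs <- unlist(cv_mod$record_evals$valid$auc$eval)
    aucs[length(aucs)]
}
## usage: lgb_rate_1[i] <- last_cv_auc(lgb_tr2_mod)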

library(ggplot2)

grid_search$perf <- lgb_rate_1

ggplot(grid_search, aes(x = weight, y = perf)) +
    geom_point()

The plot shows that AUC is not very sensitive to the weight; it peaks at weight = 5.

2. Feature selection

1) Compute filter values

lgb_tr2$TARGET <- factor(lgb_tr2$TARGET)

lgb.task <- makeClassifTask(data = lgb_tr2, target = 'TARGET')
lgb.task.smote <- oversample(lgb.task, rate = 5)

fv_time <- system.time(
    fv <- generateFilterValuesData(
        lgb.task.smote,
        method = c('chi.squared')
    )
)

## Information gain or chi-squared both work here; the random forest
## importance filter is not recommended, as it is extremely slow.
## The IV (information value) method is also worth trying.
## Feature engineering sets the upper bound on the target metric (AUC here),
## so the choice of filter method can itself be treated as a hyperparameter.
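A sketch of what "treat the filter method as a hyperparameter" could look like (not in the original; 'chi.squared' and 'information.gain' are standard mlr filter names, and comparing the downstream CV AUC per method is left to the reader):

# compute filter values under several methods; keep whichever ranking
# yields the best downstream cross-validated AUC
filter_methods <- c('chi.squared', 'information.gain')
fv_by_method <- lapply(filter_methods, function(m) {
    generateFilterValuesData(lgb.task.smote, method = m)
})
names(fv_by_method) <- filter_methods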

2) Plot the filter values

# plotFilterValues(fv)

plotFilterValuesGGVIS(fv)

3) Keep the features covering 99% of the cumulative chi.squared score (LightGBM is efficient enough that a generous number of variables can be kept)

Note: the cutoff X in "keep X% of cumulative chi.squared" can itself be treated as a hyperparameter.

fv_data2 <- fv$data %>%
    arrange(desc(chi.squared)) %>%
    mutate(chi_gain_cul = cumsum(chi.squared) / sum(chi.squared))

fv_data2_filter <- fv_data2 %>% filter(chi_gain_cul <= 0.99)

dim(fv_data2_filter) ## about half of the predictors are dropped

fv_feature <- fv_data2_filter$name

lgb_tr3 <- lgb_tr2[, c(fv_feature, 'TARGET')]
lgb_te3 <- lgb_te2[, fv_feature]

4) Write out the data

write_csv(lgb_tr3, 'C:/users/Administrator/Documents/kaggle/scs_lgb/lgb_tr3_chi.csv')
write_csv(lgb_te3, 'C:/users/Administrator/Documents/kaggle/scs_lgb/lgb_te3_chi.csv')

Part IV: The Algorithm

lgb_tr <- rxImport('C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_tr3_chi.csv')
lgb_te <- rxImport('C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_te3_chi.csv')

## rxImport comes from RevoScaleR (Microsoft R); read_csv works just as well here
## tip: load lgb_te only at prediction time to save memory

library(lightgbm)

1. Tune the weight parameter

grid_search <- expand.grid(
    weight = 1:30
)

perf_weight_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    w <- grid_search[i, 'weight']
    lgb_weight <- (lgb_tr$TARGET * w + 1) / sum(lgb_tr$TARGET * w + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc'
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        learning_rate = .1,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_weight_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

library(ggplot2)

grid_search$perf <- perf_weight_1

ggplot(grid_search, aes(x = weight, y = perf)) +
    geom_point() +
    geom_smooth()

The plot shows AUC peaking at weight = 4 and declining afterwards.

2. Tune learning_rate

grid_search <- expand.grid(
    learning_rate = 2 ^ (-(8:1))
)

perf_learning_rate_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_learning_rate_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_learning_rate_1

ggplot(grid_search, aes(x = learning_rate, y = perf)) +
    geom_point() +
    geom_smooth()

The plot shows AUC peaking at learning_rate = 2^(-5), but the differences across 2^(-(6:3)) are tiny, so learning_rate = .125 is used to speed up training.

3. Tune num_leaves

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = seq(50, 800, 50)
)

perf_num_leaves_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_num_leaves_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_num_leaves_1

ggplot(grid_search, aes(x = num_leaves, y = perf)) +
    geom_point() +
    geom_smooth()

The plot shows AUC peaking at num_leaves = 650.

4. Tune min_data_in_leaf

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    min_data_in_leaf = 2 ^ (1:7)
)

perf_min_data_in_leaf_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        min_data_in_leaf = grid_search[i, 'min_data_in_leaf']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_min_data_in_leaf_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_min_data_in_leaf_1

ggplot(grid_search, aes(x = min_data_in_leaf, y = perf)) +
    geom_point() +
    geom_smooth()

The plot shows AUC is insensitive to min_data_in_leaf, so it is left at its default.

5. Tune max_bin

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 2 ^ (5:10)
)

perf_max_bin_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_max_bin_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_max_bin_1

ggplot(grid_search, aes(x = max_bin, y = perf)) +
    geom_point() +
    geom_smooth()

The plot shows AUC peaking at max_bin = 2^10, so max_bin is fine-tuned next.

6. Fine-tune max_bin

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 100 * (6:15)
)

perf_max_bin_2 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_max_bin_2[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_max_bin_2

ggplot(grid_search, aes(x = max_bin, y = perf)) +
    geom_point() +
    geom_smooth()

The plot shows AUC peaking at max_bin = 1000.

7. Tune min_data_in_bin

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 1000,
    min_data_in_bin = 2 ^ (1:9)
)

perf_min_data_in_bin_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_min_data_in_bin_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_min_data_in_bin_1

ggplot(grid_search, aes(x = min_data_in_bin, y = perf)) +
    geom_point() +
    geom_smooth()

The plot shows AUC peaking at min_data_in_bin = 8, but the variation is extremely small, so no further adjustment is made; 8 is carried forward.

8. Tune feature_fraction

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 1000,
    min_data_in_bin = 8,
    feature_fraction = seq(.5, 1, .02)
)

perf_feature_fraction_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_feature_fraction_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_feature_fraction_1

ggplot(grid_search, aes(x = feature_fraction, y = perf)) +
    geom_point() +
    geom_smooth()

The plot shows AUC peaking at feature_fraction = .62 and holding steady over [.60, .62]; from .64 onwards it declines.

9. Tune min_sum_hessian

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 1000,
    min_data_in_bin = 8,
    feature_fraction = .62,
    min_sum_hessian = seq(0, .02, .001)
)

perf_min_sum_hessian_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_min_sum_hessian_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_min_sum_hessian_1

ggplot(grid_search, aes(x = min_sum_hessian, y = perf)) +
    geom_point() +
    geom_smooth()

The plot shows AUC peaking at min_sum_hessian = 0.005; values in [0.002, 0.005] are recommended, and AUC declines beyond 0.005.

10. Tune the lambda parameters

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 1000,
    min_data_in_bin = 8,
    feature_fraction = .62,
    min_sum_hessian = .005,
    lambda_l1 = seq(0, .01, .002),
    lambda_l2 = seq(0, .01, .002)
)

perf_lamda_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_lamda_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_lamda_1

ggplot(data = grid_search, aes(x = lambda_l1, y = perf)) +
    geom_point() +
    facet_wrap(~ lambda_l2, nrow = 5)

The plot suggests lambda_l1 = 0 and lambda_l2 = 0.

11. Tune drop_rate

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 1000,
    min_data_in_bin = 8,
    feature_fraction = .62,
    min_sum_hessian = .005,
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = seq(0, 1, .1)
)

perf_drop_rate_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_drop_rate_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_drop_rate_1

ggplot(data = grid_search, aes(x = drop_rate, y = perf)) +
    geom_point() +
    geom_smooth()

The plot shows AUC peaking at drop_rate = 0.2, with 0, .2, and .5 all performing well; the variation over [0, 1] is small.
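One caveat worth flagging (not in the original): in LightGBM, drop_rate and max_drop only take effect when the boosting type is 'dart'; under the default gbdt boosting they are ignored. A hedged sketch of the extra parameter that would make these two tuning steps bite:

# drop_rate / max_drop are DART-specific; enable DART explicitly
params <- list(
    objective = 'binary',
    metric = 'auc',
    boosting = 'dart',   # required for drop_rate / max_drop to apply
    drop_rate = .2
)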

12. Tune max_drop

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 1000,
    min_data_in_bin = 8,
    feature_fraction = .62,
    min_sum_hessian = .005,
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = .2,
    max_drop = seq(1, 10, 2)
)

perf_max_drop_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_max_drop_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_max_drop_1

ggplot(data = grid_search, aes(x = max_drop, y = perf)) +
    geom_point() +
    geom_smooth()

The plot shows AUC peaking at max_drop = 5, with little variation over [1, 10].

Part V: Second Round of Tuning

1. Tune the weight parameter again

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 1000,
    min_data_in_bin = 8,
    feature_fraction = .62,
    min_sum_hessian = .005,
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = .2,
    max_drop = 5
)

perf_weight_2 <- numeric(length = 20)  # one slot per weight value tested below

for (i in 1:20) {
    lgb_weight <- (lgb_tr$TARGET * i + 1) / sum(lgb_tr$TARGET * i + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters (single grid row; only the weight varies in this loop)
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[1, 'learning_rate'],
        num_leaves = grid_search[1, 'num_leaves'],
        max_bin = grid_search[1, 'max_bin'],
        min_data_in_bin = grid_search[1, 'min_data_in_bin'],
        feature_fraction = grid_search[1, 'feature_fraction'],
        min_sum_hessian = grid_search[1, 'min_sum_hessian'],
        lambda_l1 = grid_search[1, 'lambda_l1'],
        lambda_l2 = grid_search[1, 'lambda_l2'],
        drop_rate = grid_search[1, 'drop_rate'],
        max_drop = grid_search[1, 'max_drop']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_weight_2[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

library(ggplot2)

ggplot(data.frame(num = 1:length(perf_weight_2), perf = perf_weight_2), aes(x = num, y = perf)) +
    geom_point() +
    geom_smooth()

The plot shows AUC stabilizing for weight >= 3, with the maximum at weight = 7.

2. Tune learning_rate again

grid_search <- expand.grid(
    learning_rate = seq(.05, .5, .03),
    num_leaves = 650,
    max_bin = 1000,
    min_data_in_bin = 8,
    feature_fraction = .62,
    min_sum_hessian = .005,
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = .2,
    max_drop = 5
)

perf_learning_rate_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_learning_rate_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_learning_rate_1

ggplot(data = grid_search, aes(x = learning_rate, y = perf)) +
    geom_point() +
    geom_smooth()

Conclusion: AUC peaks at learning_rate = .11.

3. Tune num_leaves again

grid_search <- expand.grid(
    learning_rate = .11,
    num_leaves = seq(100, 800, 50),
    max_bin = 1000,
    min_data_in_bin = 8,
    feature_fraction = .62,
    min_sum_hessian = .005,
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = .2,
    max_drop = 5
)

perf_num_leaves_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_num_leaves_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_num_leaves_1

ggplot(data = grid_search, aes(x = num_leaves, y = perf)) +
    geom_point() +
    geom_smooth()

Conclusion: AUC peaks at num_leaves = 200.

4. Tune max_bin again

grid_search <- expand.grid(
    learning_rate = .11,
    num_leaves = 200,
    max_bin = seq(100, 1500, 100),
    min_data_in_bin = 8,
    feature_fraction = .62,
    min_sum_hessian = .005,
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = .2,
    max_drop = 5
)

perf_max_bin_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_max_bin_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_max_bin_1

ggplot(data = grid_search, aes(x = max_bin, y = perf)) +
    geom_point() +
    geom_smooth()

Conclusion: AUC peaks at max_bin = 600; 400 and 800 are also acceptable values.

5. Tune min_data_in_bin again

grid_search <- expand.grid(
    learning_rate = .11,
    num_leaves = 200,
    max_bin = 600,
    min_data_in_bin = seq(5, 50, 5),
    feature_fraction = .62,
    min_sum_hessian = .005,
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = .2,
    max_drop = 5
)

perf_min_data_in_bin_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_min_data_in_bin_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_min_data_in_bin_1

ggplot(data = grid_search, aes(x = min_data_in_bin, y = perf)) +
    geom_point() +
    geom_smooth()

Conclusion: AUC peaks at min_data_in_bin = 45; 25 is also an acceptable value.

6. Tune feature_fraction again

grid_search <- expand.grid(
    learning_rate = .11,
    num_leaves = 200,
    max_bin = 600,
    min_data_in_bin = 45,
    feature_fraction = seq(.5, .9, .02),
    min_sum_hessian = .005,
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = .2,
    max_drop = 5
)

perf_feature_fraction_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_feature_fraction_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_feature_fraction_1

ggplot(data = grid_search, aes(x = feature_fraction, y = perf)) +
    geom_point() +
    geom_smooth()

Conclusion: AUC peaks at feature_fraction = .54; .56 and .58 also perform well.

7. Tune min_sum_hessian again

grid_search <- expand.grid(
    learning_rate = .11,
    num_leaves = 200,
    max_bin = 600,
    min_data_in_bin = 45,
    feature_fraction = .54,
    min_sum_hessian = seq(.001, .008, .0005),
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = .2,
    max_drop = 5
)

perf_min_sum_hessian_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_min_sum_hessian_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_min_sum_hessian_1

ggplot(data = grid_search, aes(x = min_sum_hessian, y = perf)) +
    geom_point() +
    geom_smooth()

Conclusion: AUC peaks at min_sum_hessian = 0.0065; 0.003 and 0.0055 are acceptable values.

8. Tune the lambda parameters again

grid_search <- expand.grid(
    learning_rate = .11,
    num_leaves = 200,
    max_bin = 600,
    min_data_in_bin = 45,
    feature_fraction = .54,
    min_sum_hessian = 0.0065,
    lambda_l1 = seq(0, .001, .0002),
    lambda_l2 = seq(0, .001, .0002),
    drop_rate = .2,
    max_drop = 5
)

perf_lambda_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_lambda_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_lambda_1

ggplot(data = grid_search, aes(x = lambda_l1, y = perf)) +
    geom_point() +
    facet_wrap(~ lambda_l2, nrow = 5)

Conclusion: the lambdas are overall negatively correlated with AUC; lambda_l1 = .0002 and lambda_l2 = .0004 are chosen.

9. Tune drop_rate again

grid_search <- expand.grid(
    learning_rate = .11,
    num_leaves = 200,
    max_bin = 600,
    min_data_in_bin = 45,
    feature_fraction = .54,
    min_sum_hessian = 0.0065,
    lambda_l1 = .0002,
    lambda_l2 = .0004,
    drop_rate = seq(0, .5, .05),
    max_drop = 5
)

perf_drop_rate_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_drop_rate_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_drop_rate_1

ggplot(data = grid_search, aes(x = drop_rate, y = perf)) +
    geom_point()

Conclusion: AUC peaks at drop_rate = .4; .15 and .25 are acceptable values.

10. Tune max_drop again

grid_search <- expand.grid(
    learning_rate = .11,
    num_leaves = 200,
    max_bin = 600,
    min_data_in_bin = 45,
    feature_fraction = .54,
    min_sum_hessian = 0.0065,
    lambda_l1 = .0002,
    lambda_l2 = .0004,
    drop_rate = .4,
    max_drop = seq(1, 29, 2)
)

perf_max_drop_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
    lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )

    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )

    perf_max_drop_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_max_drop_1

ggplot(data = grid_search, aes(x = max_drop, y = perf)) +
    geom_point()

Conclusion: AUC peaks at max_drop = 14, though the differences are minute.

Part VI: Prediction

1) Weights

lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

2) Training dataset

lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
)

3) Training

# parameters (objective and metric as in the CV runs above)
params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = .11,
    num_leaves = 200,
    max_bin = 600,
    min_data_in_bin = 45,
    feature_fraction = .54,
    min_sum_hessian = 0.0065,
    lambda_l1 = .0002,
    lambda_l2 = .0004,
    drop_rate = .4,
    max_drop = 14
)

# model
lgb_mod <- lightgbm(
    params = params,
    data = lgb_train,
    nrounds = 300,
    early_stopping_rounds = 10,
    num_threads = 2
)

# prediction
lgb.pred <- predict(lgb_mod, data.matrix(lgb_te))

4) Results

lgb.pred2 <- matrix(unlist(lgb.pred), ncol = 1)
lgb.pred3 <- data.frame(lgb.pred2)

5) Output

write.csv(lgb.pred3, 'C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb.pred1_tr.csv')
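A hedged note: the Kaggle competition expects a two-column submission (ID, TARGET), while the file above holds only predictions. Assuming the test IDs were set aside earlier (they are dropped from lgb_te3 in Part III), something like the following would produce a submission-ready file; te_id is an assumption, not a variable from the original code:

# build a submission file; te_id is assumed to hold the test-set ID column
# (e.g. saved from lgb_te1$ID before feature selection)
submission <- data.frame(ID = te_id, TARGET = lgb.pred)
write.csv(submission, 'C:/Users/Administrator/Documents/kaggle/scs_lgb/submission.csv',
          row.names = FALSE)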

Note: some advice for readers still in school:

1. At school, the datasets used to test machine-learning algorithms are usually small, so you can try most algorithms and most R functions; for random forests, for instance, the randomForest package is fine, and with somewhat more data you can add parallelism. But once the data reaches the GB scale, even a parallelized randomForest cannot cope and memory overflows; functions from a professional R distribution are recommended in that case.

2. Coursework focuses on theory and its test data is usually clean; real-world data structures are generally more complex.
