Reposted by 數據STUDIO | Author: 劉秋言
Feature selection is an important part of the modeling process.
Tree-based algorithms such as Random Forest, LightGBM, and XGBoost all return a built-in Feature Importance, but many studies have shown that this importance is biased.
Is there a better way to select features? One method commonly used by top Kaggle competitors is Permutation Importance. The idea was first proposed by Breiman (2001)[1] and later refined by Fisher, Rudin, and Dominici (2018)[2].
In this article, using real Kaggle Amex data, you will learn what is wrong with the model's default Feature Importance, what Permutation Importance is, what its strengths and weaknesses are, and how to implement and use it in code.
In addition, the code in this article uses the GPU for dataframe processing as well as for XGB training and prediction, which is 10 to 100 times faster than the CPU version. Specifically, data is processed with RAPIDS cuDF; the model is trained with the GPU version of XGBoost, using DeviceQuantileDMatrix as the data loader to reduce memory usage; and predictions are served with RAPIDS ForestInference (FIL).
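As a rough sketch of how these pieces fit together (not the article's actual pipeline; the file name, column name, and parameters below are placeholders), the GPU stack looks roughly like this. The appendix additionally streams training batches through DeviceQuantileDMatrix via a custom DataIter.
import cudf
import xgboost as xgb
from cuml import ForestInference

# Dataframe processing on GPU with cuDF (file and column names are placeholders)
df = cudf.read_parquet('some_table.parquet')
X, y = df.drop(columns=['label']), df['label']

# Train the GPU version of XGBoost
dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'binary:logistic', 'tree_method': 'gpu_hist',
          'predictor': 'gpu_predictor'}
booster = xgb.train(params, dtrain, num_boost_round=100)
booster.save_model('model.xgb')

# Fast GPU inference with RAPIDS ForestInference (FIL)
fil = ForestInference.load('model.xgb', model_type='xgboost', output_class=True)
preds = fil.predict_proba(X)[1]  # probability of the positive class, as used in the appendix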
Contents
What is wrong with the model's default Feature Importance?
What is Permutation Importance?
What are its strengths and weaknesses?
Validation on the Amex dataset
01 What is wrong with the model's default Feature Importance?
Strobl et al.[3] pointed out as early as 2007 that the model's default Feature Importance favors continuous features and high-cardinality categorical features. This is easy to understand: for a continuous or high-cardinality categorical feature, it is easier to find a split point at a tree node, in other words easier to overfit.
Another problem is that Feature Importance essentially measures how much the trained model relies on a feature; it says nothing about how well the feature generalizes to unseen data (such as the test set). The bias of the default Feature Importance becomes even more severe when the training and test distributions drift apart.
As an extreme example, suppose we generate some random X and binary labels y and keep boosting with XGB. As the number of iterations grows, the training AUC approaches 1 while the validation AUC keeps hovering around 0.5. Even so, the default Feature Importance will still assign very high importance to some features: they helped the model overfit and reach an AUC close to 1 on the training set, but they are in fact meaningless.
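A minimal sketch of this extreme example (sizes and hyperparameters are illustrative, not the article's settings):
import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_tr, X_va = rng.normal(size=(5000, 50)), rng.normal(size=(5000, 50))
y_tr, y_va = rng.integers(0, 2, 5000), rng.integers(0, 2, 5000)  # pure-noise labels

model = xgb.XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.1)
model.fit(X_tr, y_tr)

print('train AUC:', roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1]))  # approaches 1
print('valid AUC:', roc_auc_score(y_va, model.predict_proba(X_va)[:, 1]))  # hovers around 0.5
# The default importances still single out "important" features on pure noise
print('top importances:', np.sort(model.feature_importances_)[::-1][:5])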
02 What is Permutation Importance?
Permutation Importance is a feature selection method that effectively addresses the two problems described above.
Permutation Importance randomly shuffles a feature to break its original relationship with y. If shuffling a feature noticeably increases the model's loss on the validation set, the feature is important. If shuffling it has no effect on the validation loss, or even decreases it, the feature is unimportant to the model, or even harmful.
(Figure: an example of shuffling a single feature)
The importance of each feature is computed as follows:
1. Split the data into a train set and a validation set.
2. Train the model on the train set, predict on the validation set, and evaluate the model (e.g. compute AUC).
3. Loop over the features to compute each one's importance:
(3.1) randomly shuffle that single feature in the validation set;
(3.2) using the model trained in step 2, predict on the validation set again and re-evaluate the model;
(3.3) the difference between the evaluations from step 2 and step 3.2 is the feature's importance.
The steps in Python code (model denotes an already trained model):
import numpy as np

def permutation_importances(model, X, y, metric):
    # Baseline score of the trained model on the validation data
    baseline = metric(model, X, y)
    imp = []
    for col in X.columns:
        save = X[col].copy()
        # Shuffle one column, re-score, then restore the column
        X[col] = np.random.permutation(X[col])
        m = metric(model, X, y)
        X[col] = save
        imp.append(baseline - m)
    return np.array(imp)
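A minimal usage sketch of the function above (the auc_metric wrapper is a hypothetical helper, not from the article; it assumes an sklearn-style classifier and a held-out validation set):
from sklearn.metrics import roc_auc_score

def auc_metric(model, X, y):
    # Higher is better, so baseline - m is positive for genuinely useful features
    return roc_auc_score(y, model.predict_proba(X)[:, 1])

# model: an already fitted classifier; X_valid, y_valid: the validation data
# importances = permutation_importances(model, X_valid, y_valid, auc_metric)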
03 What are the strengths and weaknesses of Permutation Importance?
3.1 Strengths
It can be used with any model, not only tree-based ones: linear regression, neural networks, anything.
It shows no preference for continuous features or high-cardinality categorical features.
It reflects a feature's ability to generalize, which is especially valuable when the data distribution drifts.
Compared with iteratively adding or removing features, it requires no retraining of the model, which greatly reduces the cost. However, repeatedly running prediction in a loop still takes a fair amount of time.
3.2 Weaknesses
Shuffling is random, so the shuffles must be repeated several times to ensure statistical significance.
It underestimates the importance of highly correlated features; the default Feature Importance suffers from the same problem.
04 Validation on the Amex dataset
The theory may sound a bit dry, so let's verify the effect of Permutation Importance directly on the Kaggle Amex dataset. For readability, the code is attached at the end of the article.
To account for the randomness of Permutation Importance, we split the data into 10 folds and shuffle each feature 10 times per fold, i.e. 100 shuffles per feature in total, and then average the resulting change in the evaluation metric to ensure statistical significance. How to shuffle more economically is a topic for another time.
Here we compare three feature-importance rankings:
the model's default Feature Importance;
Permutation Importance;
normalized Permutation Importance, i.e. Permutation Importance divided by the standard deviation over the 100 shuffles. This accounts for the randomness: given similar Permutation Importance, a smaller standard deviation means the importance estimate is more stable (see the sketch below).
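A minimal pandas sketch of this normalization (assuming a long-format dataframe with one row per feature and shuffle holding the relative metric drop; the column names are illustrative and mirror the convention used in the appendix code):
import pandas as pd

def normalize_permutation_importance(df: pd.DataFrame) -> pd.DataFrame:
    # df columns: 'feature', 'metric_diff_ratio' (relative AUC drop vs. the unshuffled baseline)
    agg = df.groupby('feature')['metric_diff_ratio'].agg(['mean', 'std', 'count'])
    agg['z'] = agg['mean'] / agg['std']  # normalized Permutation Importance
    return agg.sort_values('z', ascending=False)

# In the appendix, a feature is dropped when its z falls below that of a purely random feature.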
To keep the experiment simple, 300 features were randomly sampled. With these 300 features, the model's 10-fold AUC is 0.9597.
The table below shows how the model's AUC changes as the number of features increases under each importance ranking (values taken from the code-appendix outputs, rounded to four decimal places):
| # features | Default Feature Importance | Permutation Importance | Normalized Permutation Importance |
| --- | --- | --- | --- |
| 25 | 0.9520 | 0.9555 | 0.9557 |
| 50 | 0.9552 | 0.9582 | 0.9584 |
| 75 | 0.9576 | 0.9589 | 0.9590 |
| 100 | 0.9582 | 0.9594 | 0.9595 |
| 125 | 0.9587 | 0.9596 | 0.9597 |
| 150 | 0.9591 | 0.9598 | 0.9597 |
| 175 | 0.9593 | 0.9598 | 0.9599 |
| 200 | 0.9596 | 0.9597 | 0.9600 |
| 225 | 0.9596 | 0.9599 | 0.9598 |
| 250 | 0.9598 | 0.9597 | 0.9597 |
From this we can draw the following conclusions:
Permutation Importance ranks features better than the model's default Feature Importance: whenever fewer than 250 features are used, the features selected by the Permutation Importance ranking give a better model.
As the number of features grows, the AUC gap between the default Feature Importance ranking and the Permutation Importance ranking shrinks. This also indirectly confirms that the Permutation Importance ranking is better: with few features, the two rankings select quite different subsets; as more features are added, the overlap between the two subsets grows and the difference naturally shrinks.
The normalized Permutation Importance is only marginally better than plain Permutation Importance, which matters little in real business settings but can be worth something in modeling competitions.
Code appendix
# !pip install kaggle
# !mkdir /root/.kaggle
# !cp /storage/kaggle.json /root/.kaggle/kaggle.json
# !chmod 600 /root/.kaggle/kaggle.json
# !kaggle datasets download -d raddar/amex-data-integer-dtypes-parquet-format -p /storage/data --unzip
import os
os.listdir('/storage/data')
['train.parquet',
'__notebook_source__.ipynb',
'cust_type.parquet',
'test.parquet',
'train_labels.csv']
import pandas as pd, numpy as np
pd.set_option('mode.chained_assignment', None)
import cupy, cudf
from numba import cuda
from cuml import ForestInference
import joblib
import time
import matplotlib.pyplot as plt
import pyarrow.parquet as pq
import random
from tqdm import tqdm
import pickle
import random
import gc
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.metrics import roc_auc_score
import xgboost as xgb
print('XGB Version',xgb.__version__)
XGB Version 1.6.1
CWD = '/storage/data'  # change CWD to your data directory
SEED = 42
NAN_VALUE = -127  # will fit in int8
FOLDS = 10
RAW_CAT_FEATURES = ["B_30","B_38","D_114","D_116","D_117","D_120","D_126","D_63","D_64","D_66","D_68"]
Utils
def fast_auc(y_true, y_prob):
    # Fast AUC on GPU with cupy: sort by predicted probability and accumulate false counts
    y_true = cupy.asarray(y_true)
    y_true = y_true[cupy.argsort(y_prob)]
    cumfalses = cupy.cumsum(1 - y_true)
    nfalse = cumfalses[-1]
    auc = (y_true * cumfalses).sum()
    auc = auc / (nfalse * (len(y_true) - nfalse))
    return auc

def eval_auc(preds, dtrain):
    labels = dtrain.get_label()
    return 'auc', fast_auc(labels, preds), True
# NEEDED WITH DeviceQuantileDMatrix BELOW
class IterLoadForDMatrix(xgb.core.DataIter):
    def __init__(self, df=None, features=None, target=None, batch_size=256*512):
        self.features = features
        self.target = target
        self.df = df
        self.it = 0  # set iterator to 0
        self.batch_size = batch_size
        self.batches = int(np.ceil(len(df) / self.batch_size))
        super().__init__()

    def reset(self):
        '''Reset the iterator'''
        self.it = 0

    def next(self, input_data):
        '''Yield next batch of data.'''
        if self.it == self.batches:
            return 0  # Return 0 when there's no more batch.
        a = self.it * self.batch_size
        b = min((self.it + 1) * self.batch_size, len(self.df))
        dt = cudf.DataFrame(self.df.iloc[a:b])
        input_data(data=dt[self.features], label=dt[self.target])  #, weight=dt['weight'])
        self.it += 1
        return 1
def train_xgb_params(params, train, val_folds, features, early_stopping_rounds, num_boost_round, verbose_eval, data_seed, model_seed, ver='model', save_dir='./'):
    importances = []
    oof = []
    skf = KFold(n_splits=FOLDS, shuffle=True, random_state=data_seed)
    for fold in val_folds:
        Xy_train = IterLoadForDMatrix(train.loc[train['fold'] != fold], features, 'target')
        X_valid = train.loc[train['fold'] == fold, features]
        y_valid = train.loc[train['fold'] == fold, 'target']
        dtrain = xgb.DeviceQuantileDMatrix(Xy_train, max_bin=256)
        dvalid = xgb.DMatrix(data=X_valid, label=y_valid)
        # TRAIN MODEL FOLD K
        # XGB MODEL PARAMETERS
        xgb_parms = {
            'eval_metric': 'auc',
            'objective': 'binary:logistic',
            'tree_method': 'gpu_hist',
            'predictor': 'gpu_predictor',
            'random_state': model_seed,
        }
        xgb_parms.update(params)
        model = xgb.train(xgb_parms,
                          dtrain=dtrain,
                          evals=[(dtrain, 'train'), (dvalid, 'valid')],
                          num_boost_round=num_boost_round,
                          early_stopping_rounds=early_stopping_rounds,
                          verbose_eval=verbose_eval,
                          )
        best_iter = model.best_iteration
        if save_dir:
            model.save_model(os.path.join(save_dir, f'{ver}_fold{fold}_{best_iter}.xgb'))
        # GET FEATURE IMPORTANCE FOR FOLD K
        dd = model.get_score(importance_type='weight')
        df = pd.DataFrame({'feature': dd.keys(), f'importance_{fold}': dd.values()})
        df = df.set_index('feature')
        importances.append(df)
        # INFER OOF FOLD K
        oof_preds = model.predict(dvalid, iteration_range=(0, model.best_iteration + 1))
        if verbose_eval:
            acc = fast_auc(y_valid.values, oof_preds)
            print(f'acc_{fold}', acc)
        # SAVE OOF
        df = train.loc[train['fold'] == fold, ['customer_ID', 'target']].copy()
        df['oof_pred'] = oof_preds
        oof.append(df)
        del Xy_train, df, X_valid, y_valid, dtrain, dvalid, model
        _ = gc.collect()
    importances = pd.concat(importances, axis=1)
    oof = cudf.concat(oof, axis=0, ignore_index=True).set_index('customer_ID')
    acc = fast_auc(oof.target.values, oof.oof_pred.values)
    return acc, oof, importances
def clear_file(data_dir):
    files = [c for c in os.listdir(data_dir) if '.ipynb_checkpoints' not in c]
    for file in files:
        os.remove(os.path.join(data_dir, file))
Split Data
train = cudf.read_csv(os.path.join(CWD, 'train_labels.csv'))
train['customer_ID'] = train['customer_ID'].str[-16:].str.hex_to_int().astype('int64')
train_cust = train[['customer_ID']]
CUST_SPLITS = train_cust.sample(frac=0.2, random_state=SEED)
CUST_SPLITS['fold'] = 9999
train_cv_cust = train_cust[~train_cust['customer_ID'].isin(CUST_SPLITS['customer_ID'].values)].reset_index(drop=True)
skf = KFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
for fold, (train_idx, valid_idx) in enumerate(skf.split(train_cv_cust)):
    df = train_cv_cust.loc[valid_idx, ['customer_ID']]
    df['fold'] = fold
    CUST_SPLITS = cudf.concat([CUST_SPLITS, df])
CUST_SPLITS = CUST_SPLITS.set_index('customer_ID')
CUST_SPLITS.to_csv('CUST_SPLITS.csv')
del train, train_cust
_ = gc.collect()
CUST_SPLITS['fold'].value_counts()
Data
def read_file(path='', usecols=None):
    # LOAD DATAFRAME
    if usecols is not None:
        df = cudf.read_parquet(path, columns=usecols)
    else:
        df = cudf.read_parquet(path)
    # REDUCE DTYPE FOR CUSTOMER AND DATE
    df['customer_ID'] = df['customer_ID'].str[-16:].str.hex_to_int().astype('int64')
    df.S_2 = cudf.to_datetime(df.S_2)
    # SORT BY CUSTOMER AND DATE (so agg('last') works correctly)
    # df = df.sort_values(['customer_ID','S_2'])
    # df = df.reset_index(drop=True)
    # FILL NAN
    df = df.fillna(NAN_VALUE)
    print('shape of data:', df.shape)
    return df
print('Reading train data...')
train = read_file(path=os.path.join(CWD, 'train.parquet'))
def process_and_feature_engineer(df):
    # FEATURE ENGINEERING FROM
    # https://www.kaggle.com/code/huseyincot/amex-agg-data-how-it-created
    all_cols = [c for c in list(df.columns) if c not in ['customer_ID', 'S_2']]
    cat_features = ["B_30","B_38","D_114","D_116","D_117","D_120","D_126","D_63","D_64","D_66","D_68"]
    num_features = [col for col in all_cols if col not in cat_features]
    test_num_agg = df.groupby("customer_ID")[num_features].agg(['mean', 'std', 'min', 'max', 'last'])
    test_num_agg.columns = ['_'.join(x) for x in test_num_agg.columns]
    test_cat_agg = df.groupby("customer_ID")[cat_features].agg(['count', 'last', 'nunique'])
    test_cat_agg.columns = ['_'.join(x) for x in test_cat_agg.columns]
    df = cudf.concat([test_num_agg, test_cat_agg], axis=1)
    del test_num_agg, test_cat_agg
    print('shape after engineering', df.shape)
    return df
train = process_and_feature_engineer(train)
train = train.fillna(NAN_VALUE)
# ADD TARGETS
targets = cudf.read_csv(os.path.join(CWD, 'train_labels.csv'))
targets['customer_ID'] = targets['customer_ID'].str[-16:].str.hex_to_int().astype('int64')
targets = targets.set_index('customer_ID')
train = train.merge(targets, left_index=True, right_index=True, how='left')
train.target = train.target.astype('int8')
del targets
# NEEDED TO MAKE CV DETERMINISTIC (cudf merge above randomly shuffles rows)
train = train.sort_index().reset_index()
# FEATURES
FEATURES = list(train.columns[1:-1])
print(f'There are {len(FEATURES)} features!')
Reading train data...
shape of data: (5531451, 190)
shape after engineering (458913, 918)
There are 918 features!
# ADD CUST SPLITS
CUST_SPLITS = cudf.read_csv('CUST_SPLITS.csv').set_index('customer_ID')
train = train.set_index('customer_ID').merge(CUST_SPLITS, left_index=True, right_index=True, how='left')
test = train[train['fold'] == 9999]
train = train[train['fold'] != 9999]
train = train.sort_index().reset_index()
train.head()
Permutation
def get_permutation_importance(model_path, data, model_features, shuffle_features, shuffle_times=5):
    permutation_importance = pd.DataFrame(columns=['feature', 'metric', 'shuffle_idx'])
    model = ForestInference.load(model_path, model_type='xgboost', output_class=True)
    preds = model.predict_proba(data[model_features])[1].values
    acc = fast_auc(data['target'].values, preds)
    permutation_importance.loc[permutation_importance.shape[0]] = ['original', cupy.asnumpy(acc), 0]
    for col in tqdm(shuffle_features):
        value = data[col].copy().values
        for i in np.arange(shuffle_times):
            np.random.seed(i)
            data[col] = cupy.random.permutation(data[col].copy().values)
            preds = model.predict_proba(data[model_features])[1].values
            new_acc = fast_auc(data['target'].values, preds)
            permutation_importance.loc[permutation_importance.shape[0]] = [col, cupy.asnumpy(new_acc), i]
        data[col] = value
    return permutation_importance
def agg_permu_files(permu_dir, count_threshold, retrain_idx):
    permu_files = [c for c in os.listdir(permu_dir) if ('.csv' in c) and (f'permutation_importance_{retrain_idx}' in c)]
    permutation_importance_all = []
    for file in permu_files:
        df = pd.read_csv(os.path.join(permu_dir, file))
        df['permut_idx'] = file.split('_')[-2]
        permutation_importance_all.append(df)
    permutation_importance_all = pd.concat(permutation_importance_all)
    # original_acc
    original_acc = permutation_importance_all[permutation_importance_all['feature'] == 'original'].set_index(['fold', 'permut_idx'])[['metric']]
    original_acc.rename({'metric': 'metric_ori'}, axis=1, inplace=True)
    permutation_importance_all = permutation_importance_all.set_index(['fold', 'permut_idx']).merge(original_acc, left_index=True, right_index=True, how='left')
    permutation_importance_all['metric_diff_ratio'] = (permutation_importance_all['metric_ori'] - permutation_importance_all['metric']) / permutation_importance_all['metric_ori']
    permutation_importance_all.reset_index(inplace=True)
    # random_acc
    random_acc = permutation_importance_all[permutation_importance_all['feature'] == 'random'].groupby(['permut_idx'])[['metric_diff_ratio']].agg(['mean', 'std'])
    random_acc.columns = ['random_mean', 'random_std']
    permutation_importance_all.reset_index(inplace=True)
    permutation_importance_agg = permutation_importance_all[permutation_importance_all['feature'] != 'random'].groupby(['feature', 'permut_idx'])['metric_diff_ratio'].agg(['min', 'max', 'mean', 'std', 'count'])
    permutation_importance_agg['z'] = permutation_importance_agg['mean'] / permutation_importance_agg['std']
    if count_threshold:
        permutation_importance_agg = permutation_importance_agg[permutation_importance_agg['count'] == count_threshold]
    permutation_importance_agg = permutation_importance_agg.reset_index().set_index('permut_idx')
    permutation_importance_agg['z'] = permutation_importance_agg['mean'] / permutation_importance_agg['std']
    permutation_importance_agg = permutation_importance_agg.merge(random_acc, left_index=True, right_index=True, how='left')
    permutation_importance_agg['random_z'] = permutation_importance_agg['random_mean'] / permutation_importance_agg['random_std']
    permutation_importance_agg = permutation_importance_agg.reset_index().set_index('feature')
    return permutation_importance_agg
params_13m = {
'max_depth':4,
'learning_rate':0.1,
'subsample':0.8,
'colsample_bytree':0.1,
}
# ADD RANDOM FEATURE
np.random.seed(SEED)
train['random'] = np.random.normal(loc=1, scale=1, size=train.shape[0])
np.random.seed(SEED)
np.random.shuffle(FEATURES)
FEATURES = FEATURES[:300]
print(FEATURES[:10])
['D_107_max', 'D_41_mean', 'D_77_min', 'R_19_mean', 'D_131_min', 'B_41_max', 'D_76_max', 'D_91_max', 'D_54_mean', 'D_84_max']
permu_dir = './permut_13m_ver001_retrain'
if not os.path.isdir(permu_dir): os.mkdir(permu_dir)
clear_file(permu_dir)
retrain_idx = 1
val_folds = [0,1,2,3,4,5,6,7,8,9]
shuffle_times = 10
while True:
    # Features
    sub_features_num = len(FEATURES)
    feature_split_num = max(round(len(FEATURES) / sub_features_num), 1)
    sub_features_num = len(FEATURES) // feature_split_num
    print(f'FEATURES: {len(FEATURES)}; retrain_idx: {retrain_idx}; feature_split_num: {feature_split_num}; sub_features_num: {sub_features_num}')
    start_idx = 0
    for sub_features_idx in range(feature_split_num):
        if sub_features_idx == feature_split_num - 1:
            end_idx = len(FEATURES)
        else:
            end_idx = start_idx + sub_features_num
        sub_features = FEATURES[start_idx: end_idx]
        # Train
        acc, oof, importances = train_xgb_params(params_13m,
                                                 train,
                                                 features=sub_features + ['random'],
                                                 val_folds=val_folds,
                                                 early_stopping_rounds=100,
                                                 num_boost_round=9999,
                                                 verbose_eval=False,
                                                 data_seed=SEED,
                                                 model_seed=SEED,
                                                 save_dir=permu_dir,
                                                 ver=f'13M_{retrain_idx}_{sub_features_idx}')
        importances.to_csv(os.path.join(permu_dir, 'original_importances.csv'))
        # Permutation Importance
        permutation_importance_list = []
        for fold in val_folds:
            print(f'........ sub_features_idx: {sub_features_idx}; start_idx: {start_idx}, end_idx: {end_idx}, fold: {fold};')
            model_file = [c for c in os.listdir(permu_dir) if f'13M_{retrain_idx}_{sub_features_idx}_fold{fold}' in c]
            if len(model_file) > 1:
                raise ValueError(f'There are more than one model file: {model_file}')
            model_file = model_file[0]
            model = xgb.Booster()
            model_path = os.path.join(permu_dir, model_file)
            # PERMUTATION IMPORTANCE
            importances_fold = importances[f'importance_{fold}']
            shuffle_features = importances_fold[importances_fold.notnull()].index.to_list()
            permutation_importance = get_permutation_importance(model_path,
                                                                train.loc[train['fold'] == fold, :],
                                                                model_features=sub_features + ['random'],
                                                                shuffle_features=shuffle_features,
                                                                shuffle_times=shuffle_times)
            permutation_importance['retrain_idx'] = retrain_idx
            permutation_importance['sub_features_idx'] = sub_features_idx
            permutation_importance['fold'] = fold
            permutation_importance.to_csv(os.path.join(permu_dir, f'permutation_importance_{retrain_idx}_{sub_features_idx}_fold{fold}.csv'), index=False)
            permutation_importance_list.append(permutation_importance)
        start_idx = end_idx
    permutation_importance_agg = agg_permu_files(permu_dir, count_threshold=len(val_folds)*shuffle_times, retrain_idx=retrain_idx)
    drop_features = permutation_importance_agg[permutation_importance_agg['z'] < permutation_importance_agg['random_z']].index.to_list()
    permutation_importance_agg['is_drop'] = permutation_importance_agg.index.isin(drop_features)
    permutation_importance_agg.to_csv(os.path.join(permu_dir, f'permutation_importance_agg_{retrain_idx}.csv'))
    FEATURES = [c for c in FEATURES if c not in drop_features]
    retrain_idx += 1
    break
    # if len(drop_features) == 0:
    #     break
Baseline
val_folds = [0,1,2,3,4,5,6,7,8,9]
acc, oof, importances = train_xgb_params(params_13m,
                                         train,
                                         features=FEATURES,
                                         val_folds=val_folds,
                                         early_stopping_rounds=100,
                                         num_boost_round=9999,
                                         verbose_eval=False,
                                         data_seed=SEED,
                                         model_seed=SEED,
                                         save_dir=False)
print('num_features', len(FEATURES), 'auc', acc)
num_features 300 auc 0.9597397304150984
Feature selection experiments
val_folds = [0,1,2,3,4,5,6,7,8,9]
Default Feature Importance
original_importances = pd.read_csv(os.path.join(permu_dir, 'original_importances.csv'))
original_importances = original_importances.set_index('feature')
original_importances['mean'] = original_importances.mean(axis=1)
original_importances['std'] = original_importances.std(axis=1)
original_importances.reset_index(inplace=True)
features_mean = original_importances.sort_values('mean', ascending=False)['feature'].to_list()
num_features = 25
while True:
    acc, oof, importances = train_xgb_params(params_13m,
                                             train,
                                             features=features_mean[:num_features],
                                             val_folds=val_folds,
                                             early_stopping_rounds=100,
                                             num_boost_round=9999,
                                             verbose_eval=False,
                                             data_seed=SEED,
                                             model_seed=SEED,
                                             save_dir=False)
    print('num_features', num_features, 'auc', acc)
    num_features += 25
    if num_features > permutation_importance_agg.shape[0]:
        break
num_features 25 auc 0.9519786413209202
num_features 50 auc 0.9551949305682249
num_features 75 auc 0.957583263622448
num_features 100 auc 0.9581601072672994
num_features 125 auc 0.958722652426215
num_features 150 auc 0.959060598689679
num_features 175 auc 0.9593140320399974
num_features 200 auc 0.9595746364492562
num_features 225 auc 0.9596106789864496
num_features 250 auc 0.959798236138387
Mean drop ratio (Permutation Importance)
permutation_importance_agg = pd.read_csv(os.path.join(permu_dir, 'permutation_importance_agg_1.csv'))
features_mean = permutation_importance_agg.sort_values('mean', ascending=False)['feature'].to_list()
num_features = 25
while True:
    acc, oof, importances = train_xgb_params(params_13m,
                                             train,
                                             features=features_mean[:num_features],
                                             val_folds=val_folds,
                                             early_stopping_rounds=100,
                                             num_boost_round=9999,
                                             verbose_eval=False,
                                             data_seed=SEED,
                                             model_seed=SEED,
                                             save_dir=False)
    print('num_features', num_features, 'auc', acc)
    num_features += 25
    if num_features > permutation_importance_agg.shape[0]:
        break
num_features 25 auc 0.955488431492996
num_features 50 auc 0.9582052598376949
num_features 75 auc 0.9589016974455882
num_features 100 auc 0.9593989000375628
num_features 125 auc 0.9595675126936822
num_features 150 auc 0.9597684801181003
num_features 175 auc 0.9598145119359532
num_features 200 auc 0.9597407810213192
num_features 225 auc 0.9598664237091002
num_features 250 auc 0.959691966613858
Normalized mean drop ratio (standardized Permutation Importance)
permutation_importance_agg = pd.read_csv(os.path.join(permu_dir, 'permutation_importance_agg_1.csv'))
features_z = permutation_importance_agg.sort_values('z', ascending=False)['feature'].to_list()
num_features = 25
while True:
    acc, oof, importances = train_xgb_params(params_13m,
                                             train,
                                             features=features_z[:num_features],
                                             val_folds=val_folds,
                                             early_stopping_rounds=100,
                                             num_boost_round=9999,
                                             verbose_eval=False,
                                             data_seed=SEED,
                                             model_seed=SEED,
                                             save_dir=False)
    print('num_features', num_features, 'auc', acc)
    num_features += 25
    if num_features > permutation_importance_agg.shape[0]:
        break
num_features 25 auc 0.9556982881629269
num_features 50 auc 0.9584120117990971
num_features 75 auc 0.9589902318791576
num_features 100 auc 0.9595072118772046
num_features 125 auc 0.9597150474489736
num_features 150 auc 0.9597403596967278
num_features 175 auc 0.9599117083347268
num_features 200 auc 0.9599500114626974
num_features 225 auc 0.9598069184316719
num_features 250 auc 0.959729845054988
References
1. Breiman, Leo. "Random Forests." Machine Learning 45 (1): 5-32 (2001).
2. Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. "All models are wrong, but many are useful: Learning a variable's importance by studying an entire class of prediction models simultaneously." http://arxiv.org/abs/1801.01489 (2018).
3. Strobl, C., et al. "Bias in random forest variable importance measures: Illustrations, sources and a solution." BMC Bioinformatics 8: 25 (2007).
Source: https://zhuanlan.zhihu.com/p/563422607
--end--