
Permutation Importance has arrived: a brand-new improvement on feature importance (with code)

2024-04-02
Reposted by 數據STUDIO | Author: 劉秋言

Feature selection is an important step in the modeling process.

Tree-based algorithms such as Random Forest, LightGBM, and XGBoost all return the model's default Feature Importance, but many studies have shown that this importance is biased.

Is there a better way to select features? One method commonly used by grandmaster-level competitors on Kaggle is Permutation Importance. The idea was first proposed by Breiman (2001)[1] and later refined by Fisher, Rudin, and Dominici (2018)[2].

In this article, using real data from the Kaggle Amex competition, you will see what is wrong with a model's default Feature Importance, what Permutation Importance is, what its strengths and weaknesses are, and how to implement and use it in code.

In addition, the code in this article uses the GPU both for dataframe processing and for training and predicting with the XGB model, which is 10x to 100x faster than the CPU version. Specifically, the data is processed with RAPIDS cuDF; the model is trained with the GPU version of XGB, using DeviceQuantileDMatrix as the data loader to reduce memory usage; and predictions are made with RAPIDS cuML's ForestInference.
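
Below is a minimal sketch of this GPU stack (not the article's notebook; it assumes a CUDA GPU with RAPIDS cuDF/cuML and the GPU build of XGBoost installed, and the file and column names 'train.parquet' / 'target' are placeholders). The full pipeline is in the code appendix at the end.

    import cudf
    import xgboost as xgb
    from cuml import ForestInference

    # dataframe processing stays on the GPU via cuDF
    df = cudf.read_parquet('train.parquet')
    X, y = df.drop(columns=['target']), df['target']

    # DeviceQuantileDMatrix quantizes the data directly on the device,
    # which keeps peak GPU memory well below a plain DMatrix
    dtrain = xgb.DeviceQuantileDMatrix(X, label=y, max_bin=256)
    booster = xgb.train({'objective': 'binary:logistic', 'tree_method': 'gpu_hist'},
                        dtrain, num_boost_round=100)
    booster.save_model('model.xgb')

    # cuML ForestInference (FIL) reloads the saved model for fast GPU batch scoring
    fil = ForestInference.load('model.xgb', model_type='xgboost', output_class=True)
    proba = fil.predict_proba(X)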

Contents

  • What's wrong with the model's default Feature Importance?

  • What is Permutation Importance?

  • What are its advantages and disadvantages?

  • Validation on the Amex data

    01 What's wrong with the model's default Feature Importance?

    Strobl et al. [3] pointed out as early as 2007 that a model's default Feature Importance favors continuous variables and high-cardinality categorical variables. This is easy to understand: for continuous or high-cardinality variables it is easier to find a split point at a tree node, which is to say easier to overfit.

    Another problem is that Feature Importance essentially measures how much the trained model depends on a variable; it says nothing about how well the variable generalizes to unseen data (such as a test set). When the distribution shifts between the training and test sets, the bias of the default Feature Importance becomes even worse.

    As an extreme example, suppose we randomly generate some features X and binary labels y and keep boosting with XGB. As the number of iterations grows, the training AUC approaches 1, while the validation AUC keeps hovering around 0.5. Even so, the default Feature Importance will still assign very high importance to some variables. Those variables help the model overfit and reach a near-1 AUC on the training set, but they are in fact meaningless.
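
    A hedged sketch of this experiment (illustrative sizes and parameters, not from the original article):

    import numpy as np
    import xgboost as xgb
    from sklearn.metrics import roc_auc_score

    rng = np.random.RandomState(0)
    X = rng.normal(size=(2000, 50))
    y = rng.randint(0, 2, size=2000)      # labels carry no signal at all
    X_tr, X_va, y_tr, y_va = X[:1500], X[1500:], y[:1500], y[1500:]

    model = xgb.XGBClassifier(n_estimators=500, max_depth=6)
    model.fit(X_tr, y_tr)
    print('train AUC', roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1]))   # approaches 1
    print('valid AUC', roc_auc_score(y_va, model.predict_proba(X_va)[:, 1]))   # stays near 0.5
    # yet the default importance still singles out a few "important" columns
    print(np.argsort(model.feature_importances_)[::-1][:5])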

    02 What is Permutation Importance?

    Permutation Importance is a feature selection method that effectively solves both problems described above.

    Permutation Importance randomly shuffles a variable to break its original relationship with y. If shuffling a variable noticeably increases the model's loss on the validation set, the variable is important. If shuffling a variable leaves the validation loss unchanged, or even lowers it, the variable is unimportant to the model, or even harmful.

    [Figure: example of shuffling a variable]

    The importance of each variable is computed with the following steps:

  • 1. Split the data into a train set and a validation set

  • 2. Train the model on train, predict on validation, and evaluate the model (e.g. compute the AUC)

  • 3. Loop over the variables and compute each one's importance:

  • (3.1) Randomly shuffle the single variable on validation;

  • (3.2) Use the model trained in step 2 to predict on validation again and re-evaluate the model;

  • (3.3) The difference between the evaluation from step 2 and from step 3.2 is the variable's importance

  • Python code for these steps (model is an already-trained model):

    import numpy as np

    def permutation_importances(model, X, y, metric):
        baseline = metric(model, X, y)
        imp = []
        for col in X.columns:
            save = X[col].copy()
            X[col] = np.random.permutation(X[col])
            m = metric(model, X, y)
            X[col] = save
            imp.append(baseline - m)
        return np.array(imp)
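
    A hedged usage example (the metric function and the variable names model, X_valid, y_valid are illustrative, not from the article): the metric callable scores the trained model on the validation data, so each importance value is simply the baseline score minus the score after shuffling that column.

    import pandas as pd
    from sklearn.metrics import roc_auc_score

    def auc_metric(model, X, y):
        # higher is better, so a positive (baseline - shuffled) drop means the column matters
        return roc_auc_score(y, model.predict_proba(X)[:, 1])

    imp = permutation_importances(model, X_valid, y_valid, auc_metric)
    ranking = pd.Series(imp, index=X_valid.columns).sort_values(ascending=False)
    print(ranking.head(10))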

    03 What are the advantages and disadvantages of Permutation Importance?

    3.1 Advantages

  • It can be used with any model, not just tree-based ones: linear regression, neural networks, anything.

  • It has no preference for continuous variables or high-cardinality categorical variables.

  • It reflects a variable's ability to generalize, which is especially valuable when the data distribution shifts.

  • Compared with iteratively adding or removing variables, it requires no retraining of the model, which greatly reduces the cost; repeatedly running predictions does still take a fair amount of time, though.

    3.2 Disadvantages

  • Shuffling is random, so it has to be repeated several times to ensure the results are statistically significant.

  • The importance of highly correlated variables is underestimated; the model's default Feature Importance has the same problem.

    04 Validation on the Amex data

    Kaggle Amex default prediction competition: https://www.kaggle.com/competitions/amex-default-prediction

    The theory may sound like a headache, so let's verify how well Permutation Importance works directly on the Kaggle Amex data. For readability, the code is attached at the end of the article.

    To account for the randomness of Permutation Importance, we split the data into 10 folds and shuffle each variable 10 times per fold, so each variable is shuffled 100 times in total, and we average the resulting change in the evaluation metric to ensure statistical significance. How to shuffle more efficiently to save resources is a topic for another time.

    Here we compare three ways of ranking variable importance:

    1. The model's default Feature Importance

    2. Permutation Importance

    3. Standardized Permutation Importance: Permutation Importance divided by the standard deviation over the 100 shuffles. This accounts for randomness: when two variables have a similar Permutation Importance, the one with the smaller standard deviation has the more stable importance estimate. (A small sketch of this aggregation follows this list.)
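
    A minimal sketch of this aggregation (assuming, for illustration only, a long-format DataFrame runs with one row per fold x shuffle repeat and a metric_drop column holding baseline AUC minus shuffled AUC; the appendix code uses a slightly different layout):

    import pandas as pd

    # runs columns (assumed): feature, fold, shuffle_idx, metric_drop
    agg = runs.groupby('feature')['metric_drop'].agg(['mean', 'std', 'count'])
    agg['z'] = agg['mean'] / agg['std']                          # standardized Permutation Importance
    ranking_plain = agg['mean'].sort_values(ascending=False)     # ranking 2 above
    ranking_std = agg['z'].sort_values(ascending=False)          # ranking 3 above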

    To simplify the experiment, 300 variables were selected at random. Using these 300 variables, the model's 10-fold AUC is 0.9597.

    The table below shows how the model's AUC changes as the number of variables grows, under each importance ranking:

    [Table/figure: change in model performance under the different variable-importance rankings]

    From this we can draw the following conclusions:

    1. Permutation Importance ranks variables better than the model's default Feature Importance. Whenever fewer than 250 variables are used, the model built from the Permutation Importance ranking performs better.

    2. As the number of variables grows, the AUC gap between the two rankings shrinks. This also indirectly shows that the Permutation Importance ranking is better: with few variables, the two rankings select quite different variable sets, and as more variables are added the two selections increasingly overlap, so the gap narrows.

    3. Standardized Permutation Importance is only marginally better than plain Permutation Importance, which matters little in real business settings but can be worth something in modeling competitions.

    Code appendix

    # !pip install kaggle
    # !mkdir /root/.kaggle
    # !cp /storage/kaggle.json /root/.kaggle/kaggle.json
    # !chmod 600 /root/.kaggle/kaggle.json
    # !kaggle datasets download -d raddar/amex-data-integer-dtypes-parquet-format -p /storage/data --unzip

    import os
    os.listdir('/storage/data')

    ['train.parquet',
    '__notebook_source__.ipynb',
    'cust_type.parquet',
    'test.parquet',
    'train_labels.csv']

    import pandas as pd, numpy as np 
    pd.set_option('mode.chained_assignment', None)
    import cupy, cudf
    from numba import cuda
    from cuml import ForestInference
    import joblib
    import time
    import matplotlib.pyplot as plt
    import pyarrow.parquet as pq
    import random
    from tqdm import tqdm
    import pickle
    import random
    import gc
    from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
    from sklearn.metrics import roc_auc_score
    import xgboost as xgb
    print('XGB Version',xgb.__version__)

    XGB Version 1.6.1

    CWD = '/storage/data'  # replace CWD with your data directory
    SEED = 42
    NAN_VALUE = -127  # will fit in int8
    FOLDS = 10
    RAW_CAT_FEATURES = ["B_30","B_38","D_114","D_116","D_117","D_120","D_126","D_63","D_64","D_66","D_68"]

    Utils

    def fast_auc(y_true, y_prob):
        y_true = cupy.asarray(y_true)
        y_true = y_true[cupy.argsort(y_prob)]
        cumfalses = cupy.cumsum(1 - y_true)
        nfalse = cumfalses[-1]
        auc = (y_true * cumfalses).sum()
        auc = auc / (nfalse * (len(y_true) - nfalse))
        return auc

    def eval_auc(preds, dtrain):
        labels = dtrain.get_label()
        return 'auc', fast_auc(labels, preds), True

    # NEEDED WITH DeviceQuantileDMatrix BELOW
    class IterLoadForDMatrix(xgb.core.DataIter):
        def __init__(self, df=None, features=None, target=None, batch_size=256*512):
            self.features = features
            self.target = target
            self.df = df
            self.it = 0  # set iterator to 0
            self.batch_size = batch_size
            self.batches = int(np.ceil(len(df) / self.batch_size))
            super().__init__()

        def reset(self):
            '''Reset the iterator'''
            self.it = 0

        def next(self, input_data):
            '''Yield next batch of data.'''
            if self.it == self.batches:
                return 0  # Return 0 when there's no more batch.
            a = self.it * self.batch_size
            b = min((self.it + 1) * self.batch_size, len(self.df))
            dt = cudf.DataFrame(self.df.iloc[a:b])
            input_data(data=dt[self.features], label=dt[self.target])  # , weight=dt['weight'])
            self.it += 1
            return 1

    def train_xgb_params(params, train, val_folds, features, early_stopping_rounds, num_boost_round, verbose_eval, data_seed, model_seed, ver='model', save_dir='./'):
        importances = []
        oof = []
        skf = KFold(n_splits=FOLDS, shuffle=True, random_state=data_seed)
        for fold in val_folds:
            Xy_train = IterLoadForDMatrix(train.loc[train['fold'] != fold], features, 'target')
            X_valid = train.loc[train['fold'] == fold, features]
            y_valid = train.loc[train['fold'] == fold, 'target']
            dtrain = xgb.DeviceQuantileDMatrix(Xy_train, max_bin=256)
            dvalid = xgb.DMatrix(data=X_valid, label=y_valid)
            # TRAIN MODEL FOLD K
            # XGB MODEL PARAMETERS
            xgb_parms = {
                'eval_metric': 'auc',
                'objective': 'binary:logistic',
                'tree_method': 'gpu_hist',
                'predictor': 'gpu_predictor',
                'random_state': model_seed,
            }
            xgb_parms.update(params)
            model = xgb.train(xgb_parms,
                              dtrain=dtrain,
                              evals=[(dtrain, 'train'), (dvalid, 'valid')],
                              num_boost_round=num_boost_round,
                              early_stopping_rounds=early_stopping_rounds,
                              verbose_eval=verbose_eval)
            best_iter = model.best_iteration
            if save_dir:
                model.save_model(os.path.join(save_dir, f'{ver}_fold{fold}_{best_iter}.xgb'))
            # GET FEATURE IMPORTANCE FOR FOLD K
            dd = model.get_score(importance_type='weight')
            df = pd.DataFrame({'feature': dd.keys(), f'importance_{fold}': dd.values()})
            df = df.set_index('feature')
            importances.append(df)
            # INFER OOF FOLD K
            oof_preds = model.predict(dvalid, iteration_range=(0, model.best_iteration + 1))
            if verbose_eval:
                acc = fast_auc(y_valid.values, oof_preds)
                print(f'acc_{fold}', acc)
            # SAVE OOF
            df = train.loc[train['fold'] == fold, ['customer_ID', 'target']].copy()
            df['oof_pred'] = oof_preds
            oof.append(df)
            del Xy_train, df, X_valid, y_valid, dtrain, dvalid, model
            _ = gc.collect()
        importances = pd.concat(importances, axis=1)
        oof = cudf.concat(oof, axis=0, ignore_index=True).set_index('customer_ID')
        acc = fast_auc(oof.target.values, oof.oof_pred.values)
        return acc, oof, importances

    def clear_file(data_dir):
        files = [c for c in os.listdir(data_dir) if '.ipynb_checkpoints' not in c]
        for file in files:
            os.remove(os.path.join(data_dir, file))
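
    An optional sanity check (not part of the original notebook; it assumes a GPU with cupy available): fast_auc should agree with sklearn's roc_auc_score up to floating-point error.

    y_check = cupy.random.randint(0, 2, size=10000).astype('float32')
    p_check = cupy.random.rand(10000)
    print(float(fast_auc(y_check, p_check)),
          roc_auc_score(cupy.asnumpy(y_check), cupy.asnumpy(p_check)))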

    Split Data

    train = cudf.read_csv(os.path.join(CWD, 'train_labels.csv'))
    train['customer_ID'] = train['customer_ID'].str[-16:].str.hex_to_int().astype('int64')
    train_cust = train[['customer_ID']]
    CUST_SPLITS = train_cust.sample(frac=0.2, random_state=SEED)
    CUST_SPLITS['fold'] = 9999
    train_cv_cust = train_cust[~train_cust['customer_ID'].isin(CUST_SPLITS['customer_ID'].values)].reset_index(drop=True)
    skf = KFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
    for fold, (train_idx, valid_idx) in enumerate(skf.split(train_cv_cust)):
        df = train_cv_cust.loc[valid_idx, ['customer_ID']]
        df['fold'] = fold
        CUST_SPLITS = cudf.concat([CUST_SPLITS, df])
    CUST_SPLITS = CUST_SPLITS.set_index('customer_ID')
    CUST_SPLITS.to_csv('CUST_SPLITS.csv')
    del train, train_cust
    _ = gc.collect()
    CUST_SPLITS['fold'].value_counts()


    Data

    def read_file(path='', usecols=None):
        # LOAD DATAFRAME
        if usecols is not None:
            df = cudf.read_parquet(path, columns=usecols)
        else:
            df = cudf.read_parquet(path)
        # REDUCE DTYPE FOR CUSTOMER AND DATE
        df['customer_ID'] = df['customer_ID'].str[-16:].str.hex_to_int().astype('int64')
        df.S_2 = cudf.to_datetime(df.S_2)
        # SORT BY CUSTOMER AND DATE (so agg('last') works correctly)
        # df = df.sort_values(['customer_ID','S_2'])
        # df = df.reset_index(drop=True)
        # FILL NAN
        df = df.fillna(NAN_VALUE)
        print('shape of data:', df.shape)
        return df

    print('Reading train data...')
    train = read_file(path=os.path.join(CWD, 'train.parquet'))

    def process_and_feature_engineer(df):
        # FEATURE ENGINEERING FROM
        # https://www.kaggle.com/code/huseyincot/amex-agg-data-how-it-created
        all_cols = [c for c in list(df.columns) if c not in ['customer_ID', 'S_2']]
        cat_features = ["B_30","B_38","D_114","D_116","D_117","D_120","D_126","D_63","D_64","D_66","D_68"]
        num_features = [col for col in all_cols if col not in cat_features]
        test_num_agg = df.groupby("customer_ID")[num_features].agg(['mean', 'std', 'min', 'max', 'last'])
        test_num_agg.columns = ['_'.join(x) for x in test_num_agg.columns]
        test_cat_agg = df.groupby("customer_ID")[cat_features].agg(['count', 'last', 'nunique'])
        test_cat_agg.columns = ['_'.join(x) for x in test_cat_agg.columns]
        df = cudf.concat([test_num_agg, test_cat_agg], axis=1)
        del test_num_agg, test_cat_agg
        print('shape after engineering', df.shape)
        return df

    train = process_and_feature_engineer(train)
    train = train.fillna(NAN_VALUE)
    # ADD TARGETS
    targets = cudf.read_csv(os.path.join(CWD, 'train_labels.csv'))
    targets['customer_ID'] = targets['customer_ID'].str[-16:].str.hex_to_int().astype('int64')
    targets = targets.set_index('customer_ID')
    train = train.merge(targets, left_index=True, right_index=True, how='left')
    train.target = train.target.astype('int8')
    del targets
    # NEEDED TO MAKE CV DETERMINISTIC (cudf merge above randomly shuffles rows)
    train = train.sort_index().reset_index()
    # FEATURES
    FEATURES = list(train.columns[1:-1])
    print(f'There are {len(FEATURES)} features!')

    Reading train data...
    shape of data: (5531451, 190)
    shape after engineering (458913, 918)
    There are 918 features!

    # ADD CUST SPLITS
    CUST_SPLITS = cudf.read_csv('CUST_SPLITS.csv').set_index('customer_ID')
    train = train.set_index('customer_ID').merge(CUST_SPLITS, left_index=True, right_index=True, how='left')
    test = train[train['fold'] == 9999]
    train = train[train['fold'] != 9999]
    train = train.sort_index().reset_index()
    train.head()

    Permutation

    def get_permutation_importance(model_path, data, model_features, shuffle_features, shuffle_times=5):
        permutation_importance = pd.DataFrame(columns=['feature', 'metric', 'shuffle_idx'])
        model = ForestInference.load(model_path, model_type='xgboost', output_class=True)
        preds = model.predict_proba(data[model_features])[1].values
        acc = fast_auc(data['target'].values, preds)
        permutation_importance.loc[permutation_importance.shape[0]] = ['original', cupy.asnumpy(acc), 0]
        for col in tqdm(shuffle_features):
            value = data[col].copy().values
            for i in np.arange(shuffle_times):
                np.random.seed(i)
                data[col] = cupy.random.permutation(data[col].copy().values)
                preds = model.predict_proba(data[model_features])[1].values
                new_acc = fast_auc(data['target'].values, preds)
                permutation_importance.loc[permutation_importance.shape[0]] = [col, cupy.asnumpy(new_acc), i]
            data[col] = value
        return permutation_importance

    def agg_permu_files(permu_dir, count_threshold, retrain_idx):
        permu_files = [c for c in os.listdir(permu_dir) if ('.csv' in c) and (f'permutation_importance_{retrain_idx}' in c)]
        permutation_importance_all = []
        for file in permu_files:
            df = pd.read_csv(os.path.join(permu_dir, file))
            df['permut_idx'] = file.split('_')[-2]
            permutation_importance_all.append(df)
        permutation_importance_all = pd.concat(permutation_importance_all)
        # original_acc
        original_acc = permutation_importance_all[permutation_importance_all['feature'] == 'original'].set_index(['fold', 'permut_idx'])[['metric']]
        original_acc.rename({'metric': 'metric_ori'}, axis=1, inplace=True)
        permutation_importance_all = permutation_importance_all.set_index(['fold', 'permut_idx']).merge(original_acc, left_index=True, right_index=True, how='left')
        permutation_importance_all['metric_diff_ratio'] = (permutation_importance_all['metric_ori'] - permutation_importance_all['metric']) / permutation_importance_all['metric_ori']
        permutation_importance_all.reset_index(inplace=True)
        # random_acc
        random_acc = permutation_importance_all[permutation_importance_all['feature'] == 'random'].groupby(['permut_idx'])[['metric_diff_ratio']].agg(['mean', 'std'])
        random_acc.columns = ['random_mean', 'random_std']
        permutation_importance_all.reset_index(inplace=True)
        permutation_importance_agg = permutation_importance_all[permutation_importance_all['feature'] != 'random'].groupby(['feature', 'permut_idx'])['metric_diff_ratio'].agg(['min', 'max', 'mean', 'std', 'count'])
        permutation_importance_agg['z'] = permutation_importance_agg['mean'] / permutation_importance_agg['std']
        if count_threshold:
            permutation_importance_agg = permutation_importance_agg[permutation_importance_agg['count'] == count_threshold]
        permutation_importance_agg = permutation_importance_agg.reset_index().set_index('permut_idx')
        permutation_importance_agg['z'] = permutation_importance_agg['mean'] / permutation_importance_agg['std']
        permutation_importance_agg = permutation_importance_agg.merge(random_acc, left_index=True, right_index=True, how='left')
        permutation_importance_agg['random_z'] = permutation_importance_agg['random_mean'] / permutation_importance_agg['random_std']
        permutation_importance_agg = permutation_importance_agg.reset_index().set_index('feature')
        return permutation_importance_agg

    params_13m = {
        'max_depth': 4,
        'learning_rate': 0.1,
        'subsample': 0.8,
        'colsample_bytree': 0.1,
    }
    # ADD RANDOM FEATURE
    np.random.seed(SEED)
    train['random'] = np.random.normal(loc=1, scale=1, size=train.shape[0])
    np.random.seed(SEED)
    np.random.shuffle(FEATURES)
    FEATURES = FEATURES[:300]
    print(FEATURES[:10])



    ['D_107_max', 'D_41_mean', 'D_77_min', 'R_19_mean', 'D_131_min', 'B_41_max', 'D_76_max', 'D_91_max', 'D_54_mean', 'D_84_max']

    permu_dir = './permut_13m_ver001_retrain'
    if not os.path.isdir(permu_dir): os.mkdir(permu_dir)
    clear_file(permu_dir)
    retrain_idx = 1
    val_folds = [0,1,2,3,4,5,6,7,8,9]
    shuffle_times = 10
    while True:
        # Features
        sub_features_num = len(FEATURES)
        feature_split_num = max(round(len(FEATURES) / sub_features_num), 1)
        sub_features_num = len(FEATURES) // feature_split_num
        print(f'FEATURES: {len(FEATURES)}; retrain_idx: {retrain_idx}; feature_split_num: {feature_split_num}; sub_features_num: {sub_features_num}')
        start_idx = 0
        for sub_features_idx in range(feature_split_num):
            if sub_features_idx == feature_split_num - 1:
                end_idx = len(FEATURES)
            else:
                end_idx = start_idx + sub_features_num
            sub_features = FEATURES[start_idx: end_idx]
            # Train
            acc, oof, importances = train_xgb_params(params_13m,
                                                     train,
                                                     features=sub_features + ['random'],
                                                     val_folds=val_folds,
                                                     early_stopping_rounds=100,
                                                     num_boost_round=9999,
                                                     verbose_eval=False,
                                                     data_seed=SEED,
                                                     model_seed=SEED,
                                                     save_dir=permu_dir,
                                                     ver=f'13M_{retrain_idx}_{sub_features_idx}')
            importances.to_csv(os.path.join(permu_dir, 'original_importances.csv'))
            # Permutation Importance
            permutation_importance_list = []
            for fold in val_folds:
                print(f'........ sub_features_idx: {sub_features_idx}; start_idx: {start_idx}, end_idx: {end_idx}, fold: {fold};')
                model_file = [c for c in os.listdir(permu_dir) if f'13M_{retrain_idx}_{sub_features_idx}_fold{fold}' in c]
                if len(model_file) > 1:
                    raise ValueError(f'There are more than one model file: {model_file}')
                model_file = model_file[0]
                model = xgb.Booster()
                model_path = os.path.join(permu_dir, model_file)
                # PERMUTATION IMPORTANCE
                importances_fold = importances[f'importance_{fold}']
                shuffle_features = importances_fold[importances_fold.notnull()].index.to_list()
                permutation_importance = get_permutation_importance(model_path,
                                                                    train.loc[train['fold'] == fold, :],
                                                                    model_features=sub_features + ['random'],
                                                                    shuffle_features=shuffle_features,
                                                                    shuffle_times=shuffle_times)
                permutation_importance['retrain_idx'] = retrain_idx
                permutation_importance['sub_features_idx'] = sub_features_idx
                permutation_importance['fold'] = fold
                permutation_importance.to_csv(os.path.join(permu_dir, f'permutation_importance_{retrain_idx}_{sub_features_idx}_fold{fold}.csv'), index=False)
                permutation_importance_list.append(permutation_importance)
            start_idx = end_idx
        permutation_importance_agg = agg_permu_files(permu_dir, count_threshold=len(val_folds) * shuffle_times, retrain_idx=retrain_idx)
        drop_features = permutation_importance_agg[permutation_importance_agg['z'] < permutation_importance_agg['random_z']].index.to_list()
        permutation_importance_agg['is_drop'] = permutation_importance_agg.index.isin(drop_features)
        permutation_importance_agg.to_csv(os.path.join(permu_dir, f'permutation_importance_agg_{retrain_idx}.csv'))
        FEATURES = [c for c in FEATURES if c not in drop_features]
        retrain_idx += 1
        break
        # if len(drop_features) == 0:
        #     break

    Baseline

    val_folds = [0,1,2,3,4,5,6,7,8,9]
    acc, oof, importances = train_xgb_params(params_13m,
                                             train,
                                             features=FEATURES,
                                             val_folds=val_folds,
                                             early_stopping_rounds=100,
                                             num_boost_round=9999,
                                             verbose_eval=False,
                                             data_seed=SEED,
                                             model_seed=SEED,
                                             save_dir=False)
    print('num_features', len(FEATURES), 'auc', acc)

    num_features 300 auc 0.9597397304150984

    Feature selection experiments

    val_folds = [0,1,2,3,4,5,6,7,8,9]

    Default Feature Importance

    original_importances = pd.read_csv(os.path.join(permu_dir, 'original_importances.csv'))
    original_importances = original_importances.set_index('feature')
    original_importances['mean'] = original_importances.mean(axis=1)
    original_importances['std'] = original_importances.std(axis=1)
    original_importances.reset_index(inplace=True)
    features_mean = original_importances.sort_values('mean', ascending=False)['feature'].to_list()
    num_features = 25
    while True:
        acc, oof, importances = train_xgb_params(params_13m,
                                                 train,
                                                 features=features_mean[:num_features],
                                                 val_folds=val_folds,
                                                 early_stopping_rounds=100,
                                                 num_boost_round=9999,
                                                 verbose_eval=False,
                                                 data_seed=SEED,
                                                 model_seed=SEED,
                                                 save_dir=False)
        print('num_features', num_features, 'auc', acc)
        num_features += 25
        if num_features > permutation_importance_agg.shape[0]:
            break

    num_features 25 auc 0.9519786413209202
    num_features 50 auc 0.9551949305682249
    num_features 75 auc 0.957583263622448
    num_features 100 auc 0.9581601072672994
    num_features 125 auc 0.958722652426215
    num_features 150 auc 0.959060598689679
    num_features 175 auc 0.9593140320399974
    num_features 200 auc 0.9595746364492562
    num_features 225 auc 0.9596106789864496
    num_features 250 auc 0.959798236138387

    Mean drop ratio (Permutation Importance)

    permutation_importance_agg = pd.read_csv(os.path.join(permu_dir, 'permutation_importance_agg_1.csv'))
    features_mean = permutation_importance_agg.sort_values('mean', ascending=False)['feature'].to_list()
    num_features = 25
    while True:
        acc, oof, importances = train_xgb_params(params_13m,
                                                 train,
                                                 features=features_mean[:num_features],
                                                 val_folds=val_folds,
                                                 early_stopping_rounds=100,
                                                 num_boost_round=9999,
                                                 verbose_eval=False,
                                                 data_seed=SEED,
                                                 model_seed=SEED,
                                                 save_dir=False)
        print('num_features', num_features, 'auc', acc)
        num_features += 25
        if num_features > permutation_importance_agg.shape[0]:
            break

    num_features 25 auc 0.955488431492996
    num_features 50 auc 0.9582052598376949
    num_features 75 auc 0.9589016974455882
    num_features 100 auc 0.9593989000375628
    num_features 125 auc 0.9595675126936822
    num_features 150 auc 0.9597684801181003
    num_features 175 auc 0.9598145119359532
    num_features 200 auc 0.9597407810213192
    num_features 225 auc 0.9598664237091002
    num_features 250 auc 0.959691966613858

    Standardized mean drop ratio

    permutation_importance_agg = pd.read_csv(os.path.join(permu_dir, 'permutation_importance_agg_1.csv'))
    features_z = permutation_importance_agg.sort_values('z', ascending=False)['feature'].to_list()
    num_features = 25
    while True:
        acc, oof, importances = train_xgb_params(params_13m,
                                                 train,
                                                 features=features_z[:num_features],
                                                 val_folds=val_folds,
                                                 early_stopping_rounds=100,
                                                 num_boost_round=9999,
                                                 verbose_eval=False,
                                                 data_seed=SEED,
                                                 model_seed=SEED,
                                                 save_dir=False)
        print('num_features', num_features, 'auc', acc)
        num_features += 25
        if num_features > permutation_importance_agg.shape[0]:
            break

    num_features 25 auc 0.9556982881629269
    num_features 50 auc 0.9584120117990971
    num_features 75 auc 0.9589902318791576
    num_features 100 auc 0.9595072118772046
    num_features 125 auc 0.9597150474489736
    num_features 150 auc 0.9597403596967278
    num_features 175 auc 0.9599117083347268
    num_features 200 auc 0.9599500114626974
    num_features 225 auc 0.9598069184316719
    num_features 250 auc 0.959729845054988

    References

    1. Breiman, Leo. "Random Forests." Machine Learning 45 (1). Springer: 5-32 (2001).

    2. Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. "All models are wrong, but many are useful: Learning a variable's importance by studying an entire class of prediction models simultaneously." http://arxiv.org/abs/1801.01489 (2018).

    3. Strobl, C., et al. "Bias in random forest variable importance measures: Illustrations, sources and a solution." BMC Bioinformatics, 2007, vol. 8, pg. 25.

    Source: https://zhuanlan.zhihu.com/p/563422607

    --end--