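Author: 韩信子@ShowMeAI

Data analysis in practice series: www.showmeai.tech/tutorials/4…

Machine learning in practice series: www.showmeai.tech/tutorials/4…

Article URL: www.showmeai.tech/article-det…

Notice: All rights reserved. To republish, contact the platform and the author and credit the source.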


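Introduction

Over the past few years, customer satisfaction with airlines has been climbing steadily. As air travel restarts after the pause caused by the COVID-19 pandemic, satisfaction with flying is getting more attention, and customers keep raising the same familiar complaints: uncomfortable seats, cramped space, delays, and substandard facilities.

Airlines are also paying closer attention to customer satisfaction and working to improve it. For an airline, excellent customer service is key to sales and customer retention; conversely, poor customer service ratings lead to customer churn and a damaged reputation.

In this project we analyze and model airline satisfaction data, predict satisfaction, and identify the core factors that influence it.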

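Data and environment

The main development environment is Jupyter Notebook with Python 3.9. The dependencies are Pandas, NumPy, Seaborn, and Matplotlib for exploratory data analysis, XGBoost and Scikit-Learn for modeling and tuning, and SHAP for model interpretability.

ShowMeAI covers these libraries in detail in its hands-on articles; see the following tutorial series and articles:

Data analysis in practice: Python data analysis tutorial

Machine learning in practice: a step-by-step machine learning series

Machine learning interpretability in practice with SHAP

The dataset used here is the Kaggle airline passenger satisfaction dataset. It is stored as CSV files and comes pre-split into an 80% training set and a 20% test set; the target column is `satisfaction`. It can be downloaded from ShowMeAI's Baidu Netdisk link.

Dataset download (Baidu Netdisk): reply『实战』to the official account『ShowMeAI研究中心』, or follow the link for this article: [36] "Airline passenger satisfaction" scenario data analysis, modeling, and business attribution, "Airline Passenger Satisfaction" dataset

⭐ ShowMeAI official GitHub: github.com/ShowMeAI-Hu…

The data columns are as follows: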

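Field	Description	Values
Gender	Passenger gender	Female, Male
Customer Type	Customer type	Loyal customer, disloyal customer
Age	Passenger age	-
Type of Travel	Purpose of travel	Personal Travel, Business Travel
Class	Cabin class	Business, Eco, Eco Plus
Flight Distance	Flight distance	-
Inflight wifi service	Rating of in-flight WiFi service	0: Not applicable; 1-5
Departure/Arrival time convenient	Rating of departure/arrival time convenience	-
Ease of Online booking	Rating of online booking	-
Gate location	Rating of gate location	-
Food and drink	Rating of in-flight food and drink	-
Online boarding	Rating of online boarding	-
Seat comfort	Rating of seat comfort	-
Inflight entertainment	Rating of in-flight entertainment	-
On-board service	Rating of on-board service	-
Leg room service	Rating of leg room	-
Baggage handling	Rating of baggage handling	-
Checkin service	Rating of check-in service	-
Inflight service	Rating of in-flight service	-
Cleanliness	Rating of cleanliness	-
Departure Delay in Minutes	Departure delay in minutes	-
Arrival Delay in Minutes	Arrival delay in minutes	-
satisfaction	Overall satisfaction (target)	satisfied, neutral or dissatisfied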

Data overview and cleaning

Data overview

We start by importing the libraries, applying some basic settings, and reading the data.

# Import libraries
import pandas as pd
import numpy as np
import scipy.stats as sp
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
# Plot defaults
from matplotlib import rcParams
# Font size
rcParams['font.size'] = 12
# Figure size
rcParams['figure.figsize'] = 7, 5
# Read the data
air_train_df = pd.read_csv('air-train.csv')
air_test_df = pd.read_csv('air-test.csv')
air_train_df.head()

air_train_df.satisfaction.value_counts()
# Output:
# neutral or dissatisfied    58879
# satisfied                  45025
# Name: satisfaction, dtype: int64
air_train_df.info()
air_test_df.info()

The output is shown below. The data contains 129,880 rows and 25 columns in total, pre-split into a training set with 103,904 rows (19.8 MB) and a test set with 25,976 rows (5 MB).

Training Data Set (air_train_df):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 25 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Unnamed: 0                         103904 non-null  int64  
 1   id                                 103904 non-null  int64  
 2   Gender                             103904 non-null  object 
 3   Customer Type                      103904 non-null  object 
 4   Age                                103904 non-null  int64  
 5   Type of Travel                     103904 non-null  object 
 6   Class                              103904 non-null  object 
 7   Flight Distance                    103904 non-null  int64  
 8   Inflight wifi service              103904 non-null  int64  
 9   Departure/Arrival time convenient  103904 non-null  int64  
 10  Ease of Online booking             103904 non-null  int64  
 11  Gate location                      103904 non-null  int64  
 12  Food and drink                     103904 non-null  int64  
 13  Online boarding                    103904 non-null  int64  
 14  Seat comfort                       103904 non-null  int64  
 15  Inflight entertainment             103904 non-null  int64  
 16  On-board service                   103904 non-null  int64  
 17  Leg room service                   103904 non-null  int64  
 18  Baggage handling                   103904 non-null  int64  
 19  Checkin service                    103904 non-null  int64  
 20  Inflight service                   103904 non-null  int64  
 21  Cleanliness                        103904 non-null  int64  
 22  Departure Delay in Minutes         103904 non-null  int64  
 23  Arrival Delay in Minutes           103594 non-null  float64
 24  satisfaction                       103904 non-null  object 
dtypes: float64(1), int64(19), object(5)
memory usage: 19.8+ MB
Testing Set (air_test_df):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25976 entries, 0 to 25975
Data columns (total 25 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Unnamed: 0                         25976 non-null  int64  
 1   id                                 25976 non-null  int64  
 2   Gender                             25976 non-null  object 
 3   Customer Type                      25976 non-null  object 
 4   Age                                25976 non-null  int64  
 5   Type of Travel                     25976 non-null  object 
 6   Class                              25976 non-null  object 
 7   Flight Distance                    25976 non-null  int64  
 8   Inflight wifi service              25976 non-null  int64  
 9   Departure/Arrival time convenient  25976 non-null  int64  
 10  Ease of Online booking             25976 non-null  int64  
 11  Gate location                      25976 non-null  int64  
 12  Food and drink                     25976 non-null  int64  
 13  Online boarding                    25976 non-null  int64  
 14  Seat comfort                       25976 non-null  int64  
 15  Inflight entertainment             25976 non-null  int64  
 16  On-board service                   25976 non-null  int64  
 17  Leg room service                   25976 non-null  int64  
 18  Baggage handling                   25976 non-null  int64  
 19  Checkin service                    25976 non-null  int64  
 20  Inflight service                   25976 non-null  int64  
 21  Cleanliness                        25976 non-null  int64  
 22  Departure Delay in Minutes         25976 non-null  int64  
 23  Arrival Delay in Minutes           25893 non-null  float64
 24  satisfaction                       25976 non-null  object 
dtypes: float64(1), int64(19), object(5)
memory usage: 5.0+ MB

Each dataset has 19 int columns, 1 float column, and 5 object (string-valued categorical) columns, one of which is the target.
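
The figures in the `info()` output can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers copied from that output; the variable names are illustrative and not part of the original notebook.

```python
# Row counts and non-null counts copied from the info() output above
train_rows, test_rows = 103904, 25976
train_arrival_non_null, test_arrival_non_null = 103594, 25893

total_rows = train_rows + test_rows
train_share = train_rows / total_rows

# Missing values in "Arrival Delay in Minutes" = rows minus non-null entries
train_missing = train_rows - train_arrival_non_null
test_missing = test_rows - test_arrival_non_null

print(total_rows, round(train_share, 3), train_missing, test_missing)
```

This confirms the 80/20 split and gives the number of missing arrival-delay values that the cleaning step has to handle.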

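Data cleaning

Next we clean the data:

The `id` and `Unnamed: 0` columns carry no information, so we drop them.
The arrival delay column is a float while the departure delay column is an integer. Before further analysis we convert both to float so they are consistent.
For the categorical variables we normalize both the column names and the column values (all lower case) so that they can be encoded reliably during modeling.
The Arrival Delay column also has missing values: 310 in the training set and 83 in the test set. We fill them with the simplest option, the column mean.
The rating columns should hold scores from 1 to 5. Some rows contain 0, which we treat as dirty data and remove.
We aggregate the delay information into unified columns: whether the flight was delayed at all (on departure or arrival) and the total delay time.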

def clean_data(orig_df):
    '''
    Applies 7 steps to the dataframe to clean the data.
    1. Drop unnecessary columns.
    2. Make the datatypes of the delay columns uniform.
    3. Normalize column names.
    4. Normalize text values in columns.
    5. Impute numeric null values with the mean value of the column.
    6. Drop rows with "zero" values in the ranked categorical variables.
    7. Create aggregated flight delay columns.
    Return: cleaned DataFrame, ready for analysis; final encoding still to be applied.
    '''
    df = orig_df.copy()
    # 1. Drop unnecessary columns
    df = df.drop(['Unnamed: 0', 'id'], axis=1)
    # 2. Make the departure delay a float, matching the arrival delay
    df['Departure Delay in Minutes'] = df['Departure Delay in Minutes'].astype(float)
    # 3. Normalize column names: lower-case them, then replace '/', ' ' and '-'
    #    with underscores so that columns can be accessed with dot notation
    df.columns = df.columns.str.lower()
    for special_char in "/ -":
        df.columns = [col.replace(special_char, '_') for col in df.columns]
    # 4. Normalize text values in columns
    cat_cols = ['gender', 'customer_type', 'class', 'type_of_travel', 'satisfaction']
    for column in cat_cols:
        df[column] = df[column].str.lower()
    # 5. Impute the nulls in the arrival delay column with the column mean
    #    (these nulls cannot safely be treated as zero) and round to one decimal.
    #    The result is assigned back; an unassigned round() call has no effect.
    arrival = df['arrival_delay_in_minutes']
    df['arrival_delay_in_minutes'] = arrival.fillna(arrival.mean()).round(1)
    # 6. Drop rows where a rating column contains 0: these columns are meant
    #    to be ranked on a scale from 1 to 5
    rank_list = ["inflight_wifi_service", "departure_arrival_time_convenient", "ease_of_online_booking", "gate_location",
                 "food_and_drink", "online_boarding", "seat_comfort", "inflight_entertainment", "on_board_service",
                 "leg_room_service", "baggage_handling", "checkin_service", "inflight_service", "cleanliness"]
    df = df[(df[rank_list] != 0).all(axis=1)].copy()
    # 7. Create aggregated and categorical flight delay columns
    df['total_delay_time'] = df['departure_delay_in_minutes'] + df['arrival_delay_in_minutes']
    df['was_flight_delayed'] = np.where(df['total_delay_time'] > 0, 'yes', 'no')
    return df
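
`clean_data` fills each dataset with its own column mean. A common alternative, shown here only as a sketch on made-up values, is to compute the mean on the training set and reuse it for the test set, so that no test-set statistic enters preprocessing. The frames `toy_train` and `toy_test` are hypothetical and stand in for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the cleaned train and test sets
toy_train = pd.DataFrame({"arrival_delay_in_minutes": [0.0, 10.0, np.nan, 20.0]})
toy_test = pd.DataFrame({"arrival_delay_in_minutes": [5.0, np.nan]})

# mean() skips NaN by default, so this is (0 + 10 + 20) / 3
train_mean = toy_train["arrival_delay_in_minutes"].mean()

# Reuse the training mean for both frames
toy_train["arrival_delay_in_minutes"] = toy_train["arrival_delay_in_minutes"].fillna(train_mean)
toy_test["arrival_delay_in_minutes"] = toy_test["arrival_delay_in_minutes"].fillna(train_mean)

print(toy_test)
```

With only 83 missing values in the test set the practical difference is likely small, but the pattern carries over to any statistic learned during preprocessing.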

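Exploratory analysis

With the data loaded and cleaned, we dig further into it. This is the EDA (exploratory data analysis) stage.

How is the target variable (customer satisfaction) distributed?

We start with the target variable, customer satisfaction. It is the label for modeling and is a categorical field.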

air_train_cleaned = clean_data(air_train_df)
air_test_cleaned = clean_data(air_test_df)
fig = plt.figure(figsize = (10,7))
air_train_cleaned.satisfaction.value_counts(normalize = True).plot(kind='bar', alpha = 0.9, rot=0)
plt.title('Customer satisfaction')
plt.ylabel('Percent')
plt.show()

Overall the labels are fairly balanced: roughly 55% neutral or dissatisfied and 45% satisfied. With this distribution we do not need to resample the data.
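
As a quick cross-check, the class shares can also be computed from the raw `value_counts()` output printed earlier. Note that these counts come from the training set before cleaning, while the chart is drawn from the cleaned data, so the two need not match exactly.

```python
# Counts copied from the value_counts() output of the raw training set
counts = {"neutral or dissatisfied": 58879, "satisfied": 45025}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```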

Gender and customer type vs. satisfaction

with sns.axes_style(style = 'ticks'):
    d = sns.histplot(x = "gender",  hue= 'satisfaction', data = air_train_cleaned,  
                     stat = 'percent', multiple="dodge", palette = 'Set1')

By gender, men and women look much the same; overall satisfaction probably depends more on other factors.

with sns.axes_style(style = 'ticks'):
    d = sns.histplot(x = "customer_type",  hue= 'satisfaction', data = air_train_cleaned, 
                     stat = 'percent', multiple="dodge", palette = 'Set1')

By customer loyalty, loyal customers have a somewhat higher share of satisfied passengers, which matches intuition.

Cabin class vs. satisfaction

with sns.axes_style(style = 'ticks'):
    d = sns.histplot(x = "class",  hue= 'satisfaction', data = air_train_cleaned,
                     stat = 'percent', multiple="dodge", palette = 'Set1')

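Looking at satisfaction separately for Eco, Eco Plus, and Business passengers, the distribution above shows a fundamental difference between passengers in the premium cabin (Business) and passengers in the economy cabins (Eco or Eco Plus).

Next we compare passengers travelling for personal reasons with those travelling on business.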

with sns.axes_style(style = 'ticks'):
    d = sns.histplot(x = "type_of_travel",  hue= 'satisfaction', data = air_train_cleaned,
                     stat = 'percent', multiple="dodge", palette = 'Set1')

The plot shows a very clear difference in satisfaction between business travelers and leisure travelers.

Age group vs. satisfaction

with sns.axes_style('white'):
    g = sns.catplot(x = 'age', data = air_train_cleaned,  
                    kind = 'count', hue = 'satisfaction', order = range(7, 80),
                    height = 8.27, aspect=18.7/8.27, legend = False,
                   palette = 'Set1')
plt.legend(loc='upper right');

sns.violinplot(data = air_train_cleaned, x = "satisfaction", y = "age", palette='Set1')

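The plots above show the relationship between age and satisfaction. The result is interesting: the 37-61 age group differs clearly from the other groups, with passengers far more satisfied with their experience. We also see that, within this group, satisfaction rises steadily with age.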
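
The visual impression can be turned into numbers by binning age and computing the share of satisfied passengers per bin. The sketch below runs on a small made-up frame; the bin edges follow the 37-61 range mentioned above, and the column `age_group` is introduced only for illustration.

```python
import pandas as pd

# Hypothetical toy data; on the real data, use air_train_cleaned instead
toy = pd.DataFrame({
    "age": [22, 25, 40, 45, 50, 58, 65, 70],
    "satisfaction": ["neutral or dissatisfied", "neutral or dissatisfied",
                     "satisfied", "satisfied", "neutral or dissatisfied", "satisfied",
                     "neutral or dissatisfied", "satisfied"],
})

# Right-closed bins: (0, 36], (36, 61], (61, 120]
toy["age_group"] = pd.cut(toy["age"], bins=[0, 36, 61, 120],
                          labels=["<=36", "37-61", ">=62"])

# Share of satisfied passengers in each age group
satisfied_rate = (
    toy.assign(is_satisfied=toy["satisfaction"].eq("satisfied"))
       .groupby("age_group", observed=True)["is_satisfied"]
       .mean()
)
print(satisfied_rate)
```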

Flight distance vs. satisfaction

sns.violinplot(data = air_train_cleaned, x = "satisfaction", y = "flight_distance", palette = 'Set1')

By flight distance alone we see no clear difference in satisfaction, and the vast majority of passengers fly 1,000 miles or less.

Flight distance vs. individual service ratings

score_cols = ["inflight_wifi_service", "departure_arrival_time_convenient", "ease_of_online_booking",
              "gate_location","food_and_drink", "online_boarding", "seat_comfort", "inflight_entertainment", 
              "on_board_service","leg_room_service", "baggage_handling", "checkin_service", "inflight_service","cleanliness"]
plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)
# Loop through scored columns
for n, score_col in enumerate(score_cols):
    # Add a new subplot iteratively
    ax = plt.subplot(4, 4, n + 1)
    # Filter df and plot scored column on new axis
    sns.violinplot(data = air_train_cleaned, 
                   x = score_col, 
                   y = 'flight_distance', 
                   hue = "satisfaction",
                   split = True,
                   ax = ax,
                   palette = 'Set1')
    # Chart formatting
    ax.set_title(score_col)
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
          fancybox=True, shadow=True, ncol=5)
    ax.set_xlabel("")

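The violin plots above cross flight distance with passengers' ratings of each service dimension, split by satisfaction. Viewed this way, flight distance has a fairly large effect on customer satisfaction.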

Age vs. individual service ratings

plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)
# Loop through scored columns
for n, score_col in enumerate(score_cols):
    # Add a new subplot iteratively
    ax = plt.subplot(4, 4, n + 1)
    # Filter df and plot scored column on new axis
    sns.violinplot(data = air_train_cleaned, 
                   x = score_col, 
                   y = 'age', 
                   hue = "satisfaction",
                   split = True,
                   ax = ax,
                   palette = 'Set1')
    # Chart formatting
    ax.set_title(score_col)
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
          fancybox=True, shadow=True, ncol=5)
    ax.set_xlabel("")

In the same way, the plots above break down the service ratings by age. In most of these distributions the 37-60 age group shows a clear peak.

Cabin class and purpose of travel vs. individual service ratings

plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)
# Loop through scored columns
for n, score_col in enumerate(score_cols):
    # Add a new subplot iteratively
    ax = plt.subplot(4, 4, n + 1)
    # Filter df and plot scored column on new axis
    sns.violinplot(data = air_train_cleaned, 
                   x = 'class', 
                   y = score_col, 
                   hue = "satisfaction",
                   split = True,
                   ax = ax,
                   palette = 'Set1')
    # Chart formatting
    ax.set_title(score_col)
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
          fancybox=True, shadow=True, ncol=5)
    ax.set_xlabel("")

plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)
# Loop through scored columns
for n, score_col in enumerate(score_cols):
    # Add a new subplot iteratively
    ax = plt.subplot(4, 4, n + 1)
    # Filter df and plot scored column on new axis
    sns.violinplot(data = air_train_cleaned, 
                   x = 'type_of_travel', 
                   y = score_col, 
                   hue = "satisfaction",
                   split = True,
                   ax = ax,
                   palette = 'Set1')
    # Chart formatting
    ax.set_title(score_col)
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
          fancybox=True, shadow=True, ncol=5)
    ax.set_xlabel("")

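In the same way, the plots above break down the service ratings by cabin class and by purpose of travel. Both strongly influence passenger satisfaction. In-flight WiFi service, online boarding, seat comfort, in-flight entertainment, on-board service, and leg room all show clear peaks for both satisfied and dissatisfied passengers.

One interesting detail is in-flight WiFi service: satisfaction with it appears to have a large effect on overall trip satisfaction for Eco and Eco Plus passengers, but little effect for Business passengers.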

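Data processing and feature selection

Data processing / feature engineering

Before the data goes into a model it has to be encoded. For the categorical variables we use ordinal encoding, with the code below. Ordinal encoding is acceptable here because the rating categories have an inherent order and because we will use non-linear models such as XGBoost.

For more detail on feature engineering, see the ShowMeAI tutorial:

Machine learning in practice | A complete guide to feature engineering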

from sklearn.preprocessing import OrdinalEncoder
def encode_data(orig_df):
    '''
    Encodes the remaining categorical variables of the data frame so that it is ready for model ingestion.
    Inputs:
        Dataframe
    Manipulations:
        Encoding of categorical variables.
    Return:
        Encoded dataframe
    '''
    df = orig_df.copy()
    # Ordinal encoding of the scored rating columns (ratings 1-5 become codes 0-4)
    encoder = OrdinalEncoder()
    df[score_cols] = encoder.fit_transform(df[score_cols])
    # Map the binary categories to 0/1 (assigned back rather than replaced in place)
    df['was_flight_delayed'] = df['was_flight_delayed'].map({'no': 0, 'yes': 1})
    df['satisfaction'] = df['satisfaction'].map({'neutral or dissatisfied': 0, 'satisfied': 1})
    df['customer_type'] = df['customer_type'].map({'disloyal customer': 0, 'loyal customer': 1})
    df['type_of_travel'] = df['type_of_travel'].map({'personal travel': 0, 'business travel': 1})
    df['gender'] = df['gender'].map({'male': 0, 'female': 1})
    # One-hot encode the cabin class
    encoded_df = pd.get_dummies(df, columns = ['class'])
    return encoded_df
# Encode the training and test sets
air_train_encoded = encode_data(air_train_cleaned)
air_test_encoded = encode_data(air_test_cleaned)
# Correlation between each feature and the target column
train_corr = air_train_encoded.corr()[['satisfaction']]
plt.figure(figsize=(10, 12))
heatmap = sns.heatmap(train_corr.sort_values(by='satisfaction', ascending=False), 
                      vmin=-1, vmax=1, annot=True, cmap='Blues')
heatmap.set_title('Feature Correlation with Target Variable', fontdict={'fontsize':14});
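
`encode_data` fits a fresh `OrdinalEncoder` on whichever frame it receives. That is harmless as long as every rating value appears in both the training and the test set, but if a value were missing from one of them the codes would silently shift. The sketch below illustrates the effect on made-up data and shows the usual remedy of fitting once on the training data; `toy_train` and `toy_test` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: ratings 1 and 3 never appear in the test frame
toy_train = pd.DataFrame({"seat_comfort": [1, 2, 3, 4, 5]})
toy_test = pd.DataFrame({"seat_comfort": [2, 4, 5]})

# Fit once on the training data, then reuse the same mapping for the test data
encoder = OrdinalEncoder()
train_codes = encoder.fit_transform(toy_train[["seat_comfort"]]).ravel()
test_codes = encoder.transform(toy_test[["seat_comfort"]]).ravel()

# For comparison: refitting on the test data alone produces a different mapping
refit_codes = OrdinalEncoder().fit_transform(toy_test[["seat_comfort"]]).ravel()

print(test_codes)   # codes consistent with the training mapping
print(refit_codes)  # codes shifted because some categories are missing
```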

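Feature selection

For better model quality and more efficient training, we follow feature engineering with feature selection. Here we use Scikit-Learn's built-in feature selection: SelectKBest as the selector, with the chi-squared statistic as the scoring function (chi-squared is a reasonable choice because the dataset contains several categorical variables).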

# Pre-processing and scaling dataset for feature selection
from sklearn import preprocessing
r_scaler = preprocessing.MinMaxScaler()
r_scaler.fit(air_train_encoded)
air_train_scaled = pd.DataFrame(r_scaler.transform(air_train_encoded), columns = air_train_encoded.columns)
air_train_scaled.head()
# Feature selection: apply SelectKBest with chi2 to keep the 10 highest-scoring features
from sklearn.feature_selection import SelectKBest, chi2
X = air_train_scaled.loc[:,air_train_scaled.columns!='satisfaction']
y = air_train_scaled['satisfaction']
selector = SelectKBest(chi2, k = 10)
selector.fit(X, y)
X_new = selector.transform(X)
features = (X.columns[selector.get_support(indices=True)])
features

Output:

Index(['type_of_travel', 'inflight_wifi_service', 'online_boarding',
       'seat_comfort', 'inflight_entertainment', 'on_board_service',
       'leg_room_service', 'cleanliness', 'class_business', 'class_eco'],
      dtype='object')

The features kept by K-Best are travel type, in-flight WiFi service, online boarding, seat comfort, in-flight entertainment, on-board service, leg room, cleanliness, and cabin class (Business or Eco).
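
To see what the chi-squared score rewards, here is a small sketch on synthetic data: one feature is built to depend on the label and one is pure noise. The data and the names `informative` and `noise` are invented for illustration; only `SelectKBest` and `chi2` come from the article.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

# chi2 requires non-negative features, which is why the article applies MinMaxScaler first
informative = y * 4 + rng.integers(0, 2, size=200)  # strongly tied to the label
noise = rng.integers(0, 5, size=200)                # independent of the label
X = np.column_stack([informative, noise])

selector = SelectKBest(chi2, k=1).fit(X, y)
print(selector.scores_)        # one chi-squared score per feature
print(selector.get_support())  # boolean mask of the selected features
```

The selector keeps the feature whose values differ most between the two classes, which is the same criterion that picked the ten columns above.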

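Modeling

Next we build models on the prepared data. We train a logistic regression model, an AdaBoost classifier, a random forest classifier, a naive Bayes classifier, and an XGBoost classifier, and evaluate them on training accuracy, test accuracy, precision, recall, and ROC AUC.

Library imports and data preparation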

import sklearn
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import CategoricalNB
import xgboost
from xgboost import XGBClassifier
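# Features chosen in the feature-selection step
features = list(features)
# Target variable
target = 'satisfaction'
# Build train and test arrays; dtype=float avoids an object array when the
# one-hot columns are boolean, and the 1-D targets avoid column-vector warnings
X_train = air_train_encoded[features].to_numpy(dtype=float)
X_test = air_test_encoded[features].to_numpy(dtype=float)
y_train = air_train_encoded[target].to_numpy()
y_test = air_test_encoded[target].to_numpy()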

Computing the evaluation metrics

import time
from resource import getrusage, RUSAGE_SELF  # Unix only
from sklearn.metrics import (accuracy_score, roc_auc_score, precision_score, recall_score,
                             ConfusionMatrixDisplay, RocCurveDisplay)
# Model evaluation and result plotting
def get_model_metrics(model, X_train, X_test, y_train, y_test):
    '''
    Fits the given model and returns the metrics specified below.
    Inputs: 
        model, X_train, X_test, y_train, y_test
    Output: 
        Model output metrics, confusion matrix, ROC AUC curve
    '''
    # Mark of current time when model began running
    t0 = time.time()
    # Fit the model on the training data and run predictions on test data
    model.fit(X_train,  y_train)
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:,1]
    # Training accuracy, as a comparative metric
    train_score = model.score(X_train, y_train)
    # Test accuracy, precision, and recall from the hard predictions
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    # ROC AUC from the predicted probability of the positive class
    roc = roc_auc_score(y_test, y_pred_proba)
    # Time taken: current time minus the start time
    time_taken = time.time() - t0
    # Peak resident memory of the whole process so far. ru_maxrss is reported in
    # kilobytes on Linux, so dividing by 1024 gives megabytes (macOS reports bytes).
    memory_used = int(getrusage(RUSAGE_SELF).ru_maxrss / 1024)
    # Outputting the metrics of the model performance
    print("Accuracy on Training = {}".format(train_score))
    print("Accuracy on Test = {} • Precision = {}".format(accuracy, precision))
    print("Recall = {} • ROC Area under Curve = {}".format(recall, roc))
    print("Time taken = {} seconds • Peak memory = {} MB".format(time_taken, memory_used))
    # Confusion matrix and ROC curve. plot_confusion_matrix and plot_roc_curve were
    # removed from recent scikit-learn releases; the Display classes replace them.
    ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap = plt.cm.Blues, normalize = 'all')
    RocCurveDisplay.from_estimator(model, X_test, y_test)
    plt.show()
    return model, train_score, accuracy, precision, recall, roc, time_taken, memory_used
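
The accuracy, precision, and recall returned by `get_model_metrics` all derive from the four cells of the confusion matrix. The sketch below computes them by hand on a made-up set of labels and checks the results against scikit-learn; `y_true` and `y_pred` are invented for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = satisfied, 0 = neutral or dissatisfied
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found

print(accuracy, precision, recall)
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```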

Modeling and tuning

① Logistic regression

# Modeling and hyperparameter tuning
clf = LogisticRegression()
params = {'C': [0.1, 0.5, 1, 5, 10]}
rscv = RandomizedSearchCV(estimator = clf,
                         param_distributions = params,
                         scoring = 'f1',
                         n_iter = 10,
                         verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)
# Parameter object to be passed through to function activation
params = rscv.best_params_
print("Best parameters:", params)
model_lr = LogisticRegression(**params)
model_lr, train_lr, accuracy_lr, precision_lr, recall_lr, roc_lr, tt_lr, mu_lr = get_model_metrics(model_lr, X_train, X_test, y_train, y_test)
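
Every search in this article uses `n_iter = 10`, so how much of each grid is actually explored depends on the grid size. The sketch below counts the candidates with `ParameterGrid` for three of the grids used in this article. When a grid built from lists has fewer combinations than `n_iter`, `RandomizedSearchCV` simply tries all of them.

```python
from sklearn.model_selection import ParameterGrid

# Grids copied from the logistic regression, AdaBoost, and XGBoost searches
grids = {
    "logistic regression": {"C": [0.1, 0.5, 1, 5, 10]},
    "adaboost": {"n_estimators": [25, 50, 75, 100, 125, 150],
                 "learning_rate": [0.2, 0.4, 0.6, 0.8, 1.0]},
    "xgboost": {"max_depth": [3, 5, 6, 10, 15, 20],
                "learning_rate": [0.01, 0.1, 0.2, 0.3],
                "n_estimators": [100, 500, 1000]},
}

n_iter = 10
sizes = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}
for name, n_candidates in sizes.items():
    sampled = min(n_iter, n_candidates)
    print(f"{name}: {sampled} of {n_candidates} combinations evaluated")
```

For logistic regression the search is therefore exhaustive, while for the larger grids only a small random subset is tried; raising `n_iter` trades run time for coverage.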

② Random forest

clf = RandomForestClassifier()
params = { 'max_depth': [5, 10, 15, 20, 25, 30],
           'max_leaf_nodes': [10, 20, 30, 40, 50],
           'min_samples_split': [2, 3, 4, 5]}
rscv = RandomizedSearchCV(estimator = clf,
                         param_distributions = params,
                         scoring = 'f1',
                         n_iter = 10,
                         verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)
# Parameter object to be passed through to function activation
params = rscv.best_params_
print("Best parameters:", params)
model_rf = RandomForestClassifier(**params)
model_rf, train_rf, accuracy_rf, precision_rf, recall_rf, roc_rf, tt_rf, mu_rf = get_model_metrics(model_rf, X_train, X_test, y_train, y_test)

③ AdaBoost

clf = AdaBoostClassifier()
params = { 'n_estimators': [25, 50, 75, 100, 125, 150],
           'learning_rate': [0.2, 0.4, 0.6, 0.8, 1.0]}
rscv = RandomizedSearchCV(estimator = clf,
                         param_distributions = params,
                         scoring = 'f1',
                         n_iter = 10,
                         verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)
# Parameter object to be passed through to function activation
params = rscv.best_params_
print("Best parameters:", params)
model_ada = AdaBoostClassifier(**params)
# Saving output metrics
model_ada, train_ada, accuracy_ada, precision_ada, recall_ada, roc_ada, tt_ada, mu_ada = get_model_metrics(model_ada, X_train, X_test, y_train, y_test)

④ Naive Bayes

clf = CategoricalNB()
params = { 'alpha': [0.0001, 0.001, 0.1, 1, 10, 100, 1000],
           'min_categories': [6, 8, 10]}
rscv = RandomizedSearchCV(estimator = clf,
                         param_distributions = params,
                         scoring = 'f1',
                         n_iter = 10,
                         verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)
# Parameter object to be passed through to function activation
params = rscv.best_params_
print("Best parameters:", params)
model_cnb = CategoricalNB(**params)
# Saving Output Metrics
model_cnb, train_cnb, accuracy_cnb, precision_cnb, recall_cnb, roc_cnb, tt_cnb, mu_cnb = get_model_metrics(model_cnb, X_train, X_test, y_train, y_test)

⑤ XGBoost

clf = XGBClassifier()
params = { 'max_depth': [3, 5, 6, 10, 15, 20],
           'learning_rate': [0.01, 0.1, 0.2, 0.3],
           'n_estimators': [100, 500, 1000]}
rscv = RandomizedSearchCV(estimator = clf,
                         param_distributions = params,
                         scoring = 'f1',
                         n_iter = 10,
                         verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)
# Parameter object to be passed through to function activation
params = rscv.best_params_
print("Best parameters:", params)
model_xgb = XGBClassifier(**params)
# Saving Output Metrics
model_xgb, train_xgb, accuracy_xgb, precision_xgb, recall_xgb, roc_xgb, tt_xgb, mu_xgb = get_model_metrics(model_xgb, X_train, X_test, y_train, y_test)

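Overall comparison

Below is an overall comparison of the results. Every model was tuned, and each reaches at least 88% accuracy on the training data and at least 87% on the test data.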

training_scores = [train_lr, train_rf, train_ada, train_cnb, train_xgb]
accuracy = [accuracy_lr, accuracy_rf, accuracy_ada, accuracy_cnb, accuracy_xgb]
roc_scores = [roc_lr, roc_rf, roc_ada, roc_cnb, roc_xgb]
precision = [precision_lr, precision_rf, precision_ada, precision_cnb, precision_xgb]
recall = [recall_lr, recall_rf, recall_ada, recall_cnb, recall_xgb]
time_scores = [tt_lr, tt_rf, tt_ada, tt_cnb, tt_xgb]
memory_scores = [mu_lr, mu_rf, mu_ada, mu_cnb, mu_xgb]
model_data = {'Model': ['Logistic Regression', 'Random Forest', 'Adaptive Boost',
                       'Categorical Bayes', 'Extreme Gradient Boost'],
            'Accuracy on Training' : training_scores,
            'Accuracy on Test' : accuracy,
            'ROC AUC Score' : roc_scores,
            'Precision' : precision,
            'Recall' : recall,
            'Time Elapsed (seconds)' : time_scores,
            'Memory Consumed (MB)': memory_scores}
model_data = pd.DataFrame(model_data)
model_data

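We ultimately choose XGBoost. It performs best, with strong results on both training and test data: on the test set its ROC AUC is 0.98, its precision 0.95, and its recall 0.92.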

plt.rcParams["figure.figsize"] = (25,15)
ax1 = model_data.plot.bar(x = 'Model', y = ["Accuracy on Training", "Accuracy on Test", "ROC AUC Score", 
                                            "Precision", "Recall"], 
                          cmap = 'coolwarm')
ax1.legend()
ax1.set_title("Model Comparison", fontsize = 18)
ax1.set_xlabel('Model', fontsize = 14)
ax1.set_ylabel('Result', fontsize = 14, color = 'Black');
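
The searches above optimize F1 (`scoring = 'f1'`), while the comparison table reports precision and recall separately. The two are linked by the harmonic mean, so F1 can be recovered from the reported numbers. The sketch assumes the XGBoost figures quoted in the text are fractions (0.95 and 0.92).

```python
# Test-set precision and recall quoted for XGBoost in the text
precision, recall = 0.95, 0.92

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))
```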

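Model interpretability

Beyond obtaining a well-performing model, an important part of applying machine learning in practice is explaining the model in the context of the business scenario, which helps the business improve afterwards. We can do this with XGBoost's built-in feature importance and with SHAP.

For an introduction to the SHAP library, see the ShowMeAI article:

Machine learning interpretability in practice with SHAP

XGBoost feature importance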

from xgboost import plot_importance
model_xgb.get_booster().feature_names = ['type_of_travel', 'inflight_wifi_service', 'online_boarding',
       'seat_comfort', 'inflight_entertainment', 'on_board_service',
       'leg_room_service', 'cleanliness', 'class_business', 'class_eco']
plot_importance(model_xgb)
plt.show()

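The most important features according to XGBoost are, in order: seat comfort, online boarding, in-flight entertainment, on-board service quality, leg room, in-flight WiFi, and cleanliness. Note that `plot_importance` ranks features by `weight` by default, that is, by how often a feature is used to split.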

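SHAP model and feature interpretability

To analyze feature impact with SHAP, we first serialize the model with Python's pickle library. We then create a SHAP explainer from the model and the selected features and apply it to the X_train dataset.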

import os
import pickle
import shap
# Save the trained model, creating the target directory if it does not exist
os.makedirs('./Models', exist_ok=True)
with open('./Models/model_xgb.pkl', 'wb') as f:
    pickle.dump(model_xgb, f)
explainer = shap.Explainer(model_xgb, feature_names = features)
shap_values = explainer(X_train)
shap.initjs()
shap.summary_plot(shap_values, X_train, class_names=model_xgb.classes_)

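If we take the mean absolute SHAP value as the measure of feature importance, in-flight WiFi service is the most influential feature in the data, followed closely by travel type and online boarding.

For almost every feature, high values (mostly meaning high satisfaction with that dimension) push the prediction up, while low values push it down.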
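
SHAP values are based on Shapley values: each feature's contribution is its marginal effect on the prediction, averaged over all orders in which features can be switched on. The sketch below computes exact Shapley values by brute force for a tiny made-up scoring function. It is only an illustration of the idea, not the algorithm the `shap` library uses for tree models, and `toy_model`, `x`, and `baseline` are hypothetical.

```python
from itertools import permutations

def toy_model(x):
    # Hypothetical scoring function with an interaction between features 0 and 1
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[1] + 3.0 * x[2]

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x relative to a baseline input."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = f(current)
        for i in order:
            current[i] = x[i]       # switch feature i from baseline to its actual value
            value = f(current)
            phi[i] += value - prev  # marginal contribution in this ordering
            prev = value
    # Average the marginal contributions over all orderings
    return [p / len(orderings) for p in phi]

x = [1.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))
```

The contributions add up to the difference between the prediction and the baseline prediction, and the interaction term is shared equally between the two features involved. This additivity is what allows the summary plot to be read as a decomposition of each prediction.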

Impact of the in-flight WiFi service feature

shap.plots.scatter(shap_values[:, "inflight_wifi_service"], color=shap_values)

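We take the most important feature, in-flight WiFi, for further analysis. In the plot above, the x-axis is the in-flight WiFi rating, the y-axis is the SHAP value, and the color distinguishes travel type (personal travel is encoded as 0, business travel as 1).

We observe:

Personal travel passengers: a high WiFi rating contributes more positively to a prediction of high satisfaction, while a low WiFi rating contributes more to a prediction of low satisfaction.
Business travel passengers: regardless of their WiFi experience, a share of them are satisfied (positive SHAP values outnumber negative ones).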

Impact of the online boarding feature

shap.plots.scatter(shap_values[:, "online_boarding"], color=shap_values)

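The SHAP analysis for the online boarding feature is shown above. For both personal and business travel, a low online boarding rating pushes the predicted satisfaction down.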

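Summary

In this article we took the air travel scenario, carried out a detailed data analysis of airline passenger satisfaction, built predictive models, and analyzed model interpretability.

Our best model reached 95% accuracy and an AUC of 0.987. The model explanation shows that the most important factors for satisfaction are in-flight WiFi service, online boarding, in-flight entertainment quality, food and drink, seat comfort, cabin cleanliness, and leg room.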

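References

Airline Passenger Satisfaction dataset (Kaggle)
Analysis of why passengers are dissatisfied with US airlines (CNN)
News: As planes fill up and fares rise, traveler satisfaction declines (CNBC)
Data analysis in practice: Python data analysis tutorial: www.showmeai.tech/tutorials/4…
Machine learning in practice: a step-by-step machine learning series: www.showmeai.tech/tutorials/4…
Machine learning interpretability in practice with SHAP: showmeai.tech/article-det…
Machine learning in practice | A complete guide to feature engineering: showmeai.tech/article-det…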
