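Author: 韩信子@ShowMeAI

Data Analysis in Practice series: www.showmeai.tech/tutorials/4…

Machine Learning in Practice series: www.showmeai.tech/tutorials/4…

Article URL: www.showmeai.tech/article-det…

Notice: All rights reserved. To repost, please contact the platform and the author, and credit the source.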

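Accommodation is one of the first things we worry about when we travel. Abroad, the home-sharing model represented by Airbnb has transformed the hotel industry, and many travelers now prefer booking an Airbnb to booking a hotel. In China, platforms such as Meituan and Fliggy also host a large number of homestay listings.

In today's open internet era, can we collect listing data and build a machine learning model that predicts listing prices, so that we can plan trips with better information? We certainly can. Below, ShowMeAI uses Airbnb listing data for the Greater Manchester area (as of March 2022) to walk through the full process of data analysis, mining, and modeling. The same approach can be applied to the domestic platforms we are familiar with.

The project and the Airbnb listing data come from Inside Airbnb, a site that publishes data and advocacy about Airbnb's impact on residential communities. The data can be obtained from that source, or from ShowMeAI's Baidu Netdisk, where we have stored the project data.

Dataset download (Baidu Netdisk): reply『实战』to the WeChat official account『ShowMeAI研究中心』, or follow the link to get the『Airbnb民宿数据』dataset for this article, [22] Airbnb-based homestay price prediction model.

⭐ Official ShowMeAI GitHub: github.com/ShowMeAI-Hu…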

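Business Questions

Before mining and modeling, we usually need a solid understanding of the business scenario and the data. We first list the business questions we care about in this scenario, and then answer them through data analysis and mining.

Which areas or towns have the most Airbnb listings?
Which property types are the most popular?
What do Airbnb listing prices look like in the Greater Manchester area?
How are listings distributed across hosts?
Which room types are available in the Greater Manchester area?
How can a machine learning model be set up to predict Airbnb listing prices in this area?
Which features matter most when predicting Airbnb listing prices in the Greater Manchester area?

Loading and Exploring the Data

We first import the analysis, mining, and modeling libraries used in this article.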

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm, trange
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingRegressor
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.inspection import permutation_importance
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Next we read the listing data for the Greater Manchester area.

gm_listings = pd.read_csv('gm_listings-2.csv')
gm_calendar = pd.read_csv('calendar-2.csv')
gm_reviews = pd.read_csv('reviews-2.csv')

A first look at the data:

gm_listings.head()

gm_listings.shape
# (3584, 74)
gm_listings.columns

gm_calendar.head()

gm_reviews.head()

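From this first look, the Greater Manchester listings dataset contains 3584 rows and 74 columns, covering information about hosts, listing types, areas, and ratings.

Data Cleaning

Data cleaning is a core step of the feature engineering stage in machine learning modeling. For the methods involved, see the corresponding ShowMeAI tutorial.

Machine Learning in Practice | A Complete Guide to Feature Engineering

Column Cleanup

The dataset has many columns, some of them messy, so some cleaning is needed. Columns holding URLs and similar metadata contribute little to the final prediction, so we drop them.

# Drop URL columns and other columns not used for modeling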
def drop_function(df):
    df = df.drop(columns=['listing_url', 'description', 'host_thumbnail_url', 'host_picture_url', 'latitude', 'longitude', 'picture_url', 'host_url', 'host_location', 'neighbourhood', 'neighbourhood_cleansed', 'host_about', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'calendar_last_scraped'])
    return df
gm_df = drop_function(gm_listings)

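The data is much cleaner after these columns are dropped.

Handling Missing Values

The data also contains missing values, which we analyze and handle:

# Percentage of missing values per column
(gm_df.isnull().sum()/gm_df.shape[0])* 100

This gives the following result: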

id                                                0.000000
scrape_id                                         0.000000
last_scraped                                      0.000000
name                                              0.000000
neighborhood_overview                            41.266741
host_id                                           0.000000
host_name                                         0.000000
host_since                                        0.000000
host_response_time                               10.212054
host_response_rate                               10.212054
host_acceptance_rate                              5.636161
host_is_superhost                                 0.000000
host_neighbourhood                               91.657366
host_listings_count                               0.000000
host_total_listings_count                         0.000000
host_verifications                                0.000000
host_has_profile_pic                              0.000000
host_identity_verified                            0.000000
neighbourhood_group_cleansed                      0.000000
property_type                                     0.000000
room_type                                         0.000000
accommodates                                      0.000000
bathrooms                                       100.000000
bathrooms_text                                    0.306920
bedrooms                                          4.687500
beds                                              2.120536
amenities                                         0.000000
price                                             0.000000
minimum_nights                                    0.000000
maximum_nights                                    0.000000
minimum_minimum_nights                            0.000000
maximum_minimum_nights                            0.000000
minimum_maximum_nights                            0.000000
maximum_maximum_nights                            0.000000
minimum_nights_avg_ntm                            0.000000
maximum_nights_avg_ntm                            0.000000
calendar_updated                                100.000000
number_of_reviews                                 0.000000
number_of_reviews_ltm                             0.000000
number_of_reviews_l30d                            0.000000
first_review                                     19.810268
last_review                                      19.810268
review_scores_rating                             19.810268
review_scores_accuracy                           20.089286
review_scores_cleanliness                        20.089286
review_scores_checkin                            20.089286
review_scores_communication                      20.089286
review_scores_location                           20.089286
review_scores_value                              20.089286
license                                         100.000000
instant_bookable                                  0.000000
calculated_host_listings_count                    0.000000
calculated_host_listings_count_entire_homes       0.000000
calculated_host_listings_count_private_rooms      0.000000
calculated_host_listings_count_shared_rooms       0.000000
reviews_per_month                                19.810268
dtype: float64

We handle missing values differently depending on how much is missing:

Columns with a high missing ratio: license, calendar_updated, bathrooms, and host_neighbourhood are more than 90% NaN, and neighborhood_overview is 41% NaN and holds free text. We drop these columns outright.
Numeric columns with few missing values: we fill them with the column mean, which leaves each column's mean unchanged (at the cost of slightly shrinking its variance). These include bedrooms, beds, review_scores_rating, review_scores_accuracy, and the other score columns.
Categorical columns such as bathrooms_text and host_response_time: we fill them with the mode.

# Drop columns with a high missing ratio
def drop_function_2(df):
    df = df.drop(columns=['license', 'calendar_updated', 'bathrooms', 'host_neighbourhood', 'neighborhood_overview'])
    return df
gm_df = drop_function_2(gm_df)
# Mean imputation
def input_mean(df, column_list):
    for column in column_list:
        # Assign the result back instead of calling fillna(inplace=True) on a column selection
        df[column] = df[column].fillna(df[column].mean())
    return df
column_list = ['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
              'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
              'review_scores_value', 'reviews_per_month',
              'bedrooms', 'beds']
gm_df = input_mean(gm_df, column_list)
# Mode imputation
def input_mode(df, column_list):
    for column in column_list:
        # mode() returns a Series; [0] picks the most frequent value
        df[column] = df[column].fillna(df[column].mode()[0])
    return df
column_list = ['first_review', 'last_review', 'bathrooms_text', 'host_acceptance_rate', 
               'host_response_rate', 'host_response_time']
gm_df = input_mode(gm_df, column_list)
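
To see what the two fill strategies do, here is a minimal sketch on a toy frame. The values are invented for illustration and are not taken from the Airbnb dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from the Airbnb dataset
toy = pd.DataFrame({
    "beds": [1.0, 2.0, np.nan, 3.0],
    "host_response_time": ["within an hour", None, "within an hour", "within a day"],
})

# Numeric column: fill the gap with the column mean (the mean itself stays at 2.0)
toy["beds"] = toy["beds"].fillna(toy["beds"].mean())

# Categorical column: fill the gap with the most frequent value
toy["host_response_time"] = toy["host_response_time"].fillna(
    toy["host_response_time"].mode()[0]
)

print(toy)
```

Mean imputation keeps the column mean but pulls the filled rows toward the center, so it is best suited to columns where only a small share of values is missing.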

Field Encoding

Columns such as host_is_superhost and instant_bookable store true/false as the strings 't' and 'f'. We replace these with 1 and 0.

gm_df = gm_df.replace({'host_is_superhost': 't', 'host_has_profile_pic': 't', 'host_identity_verified': 't', 'has_availability': 't', 'instant_bookable': 't'}, 1)
gm_df = gm_df.replace({'host_is_superhost': 'f', 'host_has_profile_pic': 'f', 'host_identity_verified': 'f', 'has_availability': 'f', 'instant_bookable': 'f'}, 0)

Let's check the value distribution after the replacement:

gm_df['host_is_superhost'].value_counts()

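Converting Field Formats

The price column is still a string containing symbols such as "$". We clean it and convert it to a numeric type.

def string_to_int(df, column):
    # Strip the currency symbol and thousands separators (literal match, not regex)
    df[column] = df[column].str.replace("$", "", regex=False)
    df[column] = df[column].str.replace(",", "", regex=False)
    # Convert to a numeric type, then truncate to int
    df[column] = pd.to_numeric(df[column]).astype(int)
    return df
gm_df = string_to_int(gm_df, 'price')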
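
The price cleanup can be tried on a few hand-written strings. This is a small sketch with a hypothetical helper named `parse_price`. It passes `regex=False` because `$` is an end-of-string anchor when the pattern is treated as a regular expression.

```python
import pandas as pd

def parse_price(series: pd.Series) -> pd.Series:
    """Turn strings such as '$1,250.00' into floats (illustrative helper)."""
    cleaned = (
        series.str.replace("$", "", regex=False)   # literal dollar sign
              .str.replace(",", "", regex=False)   # thousands separator
    )
    return pd.to_numeric(cleaned)

prices = parse_price(pd.Series(["$79.00", "$1,250.00", "$8.00"]))
print(prices.tolist())
```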

Encoding List-Valued Fields

Fields such as host_verifications and amenities hold list-formatted values. We encode them as dummy variables.

# Inspect the list-valued fields
gm_df_copy = gm_df.copy()
gm_df_copy['amenities'].head()

gm_df_copy['host_verifications'].head()

# Dummy-variable encoding: strip quotes and brackets (literal matches), then split on commas
gm_df_copy['amenities'] = gm_df_copy['amenities'].str.replace('"', '', regex=False)
gm_df_copy['amenities'] = gm_df_copy['amenities'].str.replace(']', "", regex=False)
gm_df_copy['amenities'] = gm_df_copy['amenities'].str.replace('[', "", regex=False)
df_amenities = gm_df_copy['amenities'].str.get_dummies(sep = ",")
gm_df_copy['host_verifications'] = gm_df_copy['host_verifications'].str.replace("'", "", regex=False)
gm_df_copy['host_verifications'] = gm_df_copy['host_verifications'].str.replace(']', "", regex=False)
gm_df_copy['host_verifications'] = gm_df_copy['host_verifications'].str.replace('[', "", regex=False)
df_host_ver = gm_df_copy['host_verifications'].str.get_dummies(sep = ",")

The encoded result looks like this:

df_amenities.head()
df_host_ver.head()

# Drop the original columns
gm_df = gm_df.drop(['host_verifications', 'amenities'], axis=1)
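
As a compact illustration of the same idea, the sketch below encodes three invented amenity strings. It assumes the list items are separated by a comma followed by a space, and therefore splits on `", "` so that no column name carries a leading blank.

```python
import pandas as pd

# Hypothetical raw values in a list-like string format
raw = pd.Series(['["Wifi", "Kitchen"]', '["Kitchen", "Gym"]', '[]'])

# Remove brackets and double quotes in one pass
cleaned = raw.str.replace(r'[\[\]"]', "", regex=True)

# One 0/1 column per distinct item; an empty list becomes a row of zeros
dummies = cleaned.str.get_dummies(sep=", ")
print(dummies)
```

Splitting on a bare comma would treat "Kitchen" and " Kitchen" as different amenities, so it is worth checking the resulting column names after encoding.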

Exploratory Data Analysis

Next we carry out a more thorough exploratory data analysis.

For the libraries used in the EDA part, see the ShowMeAI cheat sheets and tutorials for quick reference.

Data Science Library Cheat Sheets | Pandas Cheat Sheet

Illustrated Data Analysis: From Beginner to Expert tutorial series

Which Areas Have the Most Listings?

gm_df['neighbourhood_group_cleansed'].value_counts()

# Build a dataframe of listing counts per town
bar_data = (gm_df['neighbourhood_group_cleansed'].value_counts()
            .rename_axis('Towns').reset_index(name='number_of_listings'))
# Share of all listings held by each town
bar_data['fraction_of_total'] = bar_data['number_of_listings'] / bar_data['number_of_listings'].sum()
# Sort ascending so that the largest bar is drawn at the top of the horizontal chart
bar_data = bar_data.sort_values(by='fraction_of_total')
# Plot
bar_data.plot(kind='barh', x ='Towns', y='fraction_of_total', figsize=(8,6))
plt.title('Towns with the Most listings');
plt.xlabel('Fraction of Total Listings');

Manchester itself holds the majority of listings in the Greater Manchester area, 53% of the total (1849), followed by Salford with 17% and Trafford with 9%.

Price Distribution of Airbnb Listings in the Greater Manchester Area

gm_df['price'].mean(), gm_df['price'].min(), gm_df['price'].max(),gm_df['price'].median()
# (143.47600446428572, 8, 7372, 79.0)

The mean listing price is $143 and the median is $79. The highest price observed in the dataset is $7372.

# Define the price bands
labels = ['$0 - $100', '$100 - $200', '$200 - $300', '$300 - $400', '$400 - $500', '$500 - $1000', '$1000 - $8000']
price_cuts = pd.cut(gm_df['price'], bins = [0, 100, 200, 300, 400, 500, 1000, 8000], right=True, labels= labels)
# Build a dataframe from the price bands
price_clusters = pd.DataFrame(price_cuts).rename(columns={'price': 'price_clusters'})
# Append it to the original dataframe
gm_df = pd.concat([gm_df, price_clusters], axis=1)
# Plot the distribution
def price_cluster_plot(df, column, title):    
    plt.figure(figsize=(8,6));
    yx = sb.histplot(data = df[column]);
    total = float(df[column].count())
    for p in yx.patches:
        width = p.get_width()
        height = p.get_height()
        yx.text(p.get_x() + p.get_width()/2.,height+5, '{:1.1f}%'.format((height/total)*100), ha='center')
    yx.set_title(title);
    plt.xticks(rotation=90)
    return yx
price_cluster_plot(gm_df, column='price_clusters', 
                   title='Price distribution of Airbnb Listings in the Greater Manchester Area');

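The analysis and plot above show that 65.4% of listings are priced between $0 and $100, and 23.4% between $100 and $200. The distribution also has a pronounced long tail. The very expensive listings can be treated as outliers, and they may affect our analysis.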
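
The banding step relies on `pd.cut` with `right=True`, which makes every band closed on the right: a price of exactly 100 falls into the first band, not the second. A small sketch with invented prices and a shortened set of bands shows this.

```python
import pandas as pd

# Hypothetical prices chosen to sit on and around the band edges
prices = pd.Series([8, 100, 101, 245, 7372])

bands = pd.cut(
    prices,
    bins=[0, 100, 200, 300, 8000],
    right=True,  # intervals are (0, 100], (100, 200], ...
    labels=["$0 - $100", "$100 - $200", "$200 - $300", "$300 - $8000"],
)

# Share of listings per band, kept in band order
share = bands.value_counts(normalize=True, sort=False)
print(share)
```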

Which Property Types Are the Most Popular?

# Aggregate by property type and sort by review count
ax = gm_df.groupby('property_type').agg(
    median_rating=('review_scores_rating', 'median'),number_of_reviews=('number_of_reviews', 'max')).sort_values(
by='number_of_reviews', ascending=False).reset_index()
ax.head()

Among the 10 most-reviewed property types, Entire rental unit has the most reviews, followed by Private room in rental unit.

# Visualize the top 10 property types (head(10) returns exactly 10 rows)
bx = ax.head(10)
bx =sb.boxplot(data =bx, x='median_rating', y='property_type')
bx.set_xlim(4.5, 5)
plt.title('Most Enjoyed Property types');
plt.xlabel('Median Rating');
plt.ylabel('Property Type')

Distribution of Listings Across Hosts

# Hosts with the most listings, as a percentage of all listings
host_df = (gm_df['host_name'].value_counts(normalize=True) * 100).rename_axis('name').reset_index(name='perc_count')
host_df.head(10)

# Combined share of the top 10 hosts
host_df['perc_count'].head(10).sum()

The analysis above shows that the 10 hosts with the most listings account for 13.6% of all listings.
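
The top-host share is a normalized `value_counts` followed by a sum over the first rows. A toy sketch with invented host names follows. It uses `head(k)` because label-based slicing such as `.loc[:k]` includes the end label and would return k + 1 rows on a default integer index.

```python
import pandas as pd

# Hypothetical host names, one entry per listing
hosts = pd.Series(["Ann", "Ann", "Ann", "Bob", "Bob", "Cy", "Di", "Ed", "Flo", "Gus"])

# Percentage of listings per host, sorted from largest to smallest
share = hosts.value_counts(normalize=True) * 100

# Combined share of the two largest hosts: 30% + 20%
top2 = share.head(2).sum()
print(top2)
```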

Distribution of Room Types in the Greater Manchester Area

gm_df['room_type'].value_counts()

# Plot the distribution
zx = sb.countplot(data=gm_df, x='room_type')
total = float(gm_df['room_type'].count())
for p in zx.patches:
    height = p.get_height()
    # Label each bar with its share of all listings
    zx.text(p.get_x() + p.get_width()/2.,height+5, '{:1.1f}%'.format((height/total)*100), ha='center')
zx.set_title('Plot showing different type of rooms available');
plt.xlabel('Room')

Most listings are entire homes/apartments, 60% of the total, followed by private rooms at 39%. Shared rooms and hotel rooms account for 0.7% and 0.5% of listings respectively.

Machine Learning Modeling

Below we use regression models to estimate listing prices.

Feature Engineering

For more on feature engineering, see the corresponding ShowMeAI tutorial.

Machine Learning in Practice | A Complete Guide to Feature Engineering

We first apply feature engineering to the raw data to obtain features suitable for modeling.

# Look at the dataset at this point
gm_df.head()

# Dataset for regression
gm_regression_df = gm_df.copy()
# Drop columns that are not useful for modeling
gm_regression_df = gm_regression_df.drop(columns=['id', 'scrape_id', 'last_scraped', 'name', 'host_id', 'host_since', 'first_review', 'last_review', 'price_clusters', 'host_name'])
# Look at the data again
gm_regression_df.head()

The host_response_rate and host_acceptance_rate columns carry a percent sign, so they need a little more cleaning.

# Strip the percent sign and convert to numeric
gm_regression_df['host_response_rate'] =  gm_regression_df['host_response_rate'].str.replace("%", "", regex=False)
gm_regression_df['host_acceptance_rate'] =  gm_regression_df['host_acceptance_rate'].str.replace("%", "", regex=False)
# Convert to int
gm_regression_df['host_response_rate'] = pd.to_numeric(gm_regression_df['host_response_rate']).astype(int)
gm_regression_df['host_acceptance_rate'] =  pd.to_numeric(gm_regression_df['host_acceptance_rate']).astype(int)
# Check the result
gm_regression_df['host_response_rate'].head()

The bathrooms_text column mixes numbers and text, so we process it as well.

# Inspect the raw column
gm_regression_df['bathrooms_text'].value_counts()

# Copy the bathroom text into a new column for the rows that mention `text`
def split_bathroom(df, column, text, new_column):
    mask = df[column].str.contains(text, case=False)
    df.loc[mask, new_column] = df.loc[mask, column]
    return df
# Apply the function above
gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='shared', new_column='shared_bath')
gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='private', new_column='private_bath')
# Inspect the shared_bath column
gm_regression_df['shared_bath'].value_counts()

# Inspect the private_bath column
gm_regression_df['private_bath'].value_counts()

# Mask shared/private entries so that only plain bathrooms still contain "bath"
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace(
    r'(shared|private) (half-)?baths?', 'sb', case=False, regex=True)
gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='bath', new_column='bathrooms_new')
# Keep only the leading token: the count, or a word for half-baths
for col in ['shared_bath', 'private_bath', 'bathrooms_new']:
    gm_regression_df[col] = gm_regression_df[col].str.split(' ').str[0]
# Fill missing values with 0
gm_regression_df = gm_regression_df.fillna(0)
# A half-bath without a leading number counts as 0.5
gm_regression_df['shared_bath'] = gm_regression_df['shared_bath'].replace(to_replace='Shared', value=0.5)
gm_regression_df['private_bath'] = gm_regression_df['private_bath'].replace(to_replace='Private', value=0.5)
gm_regression_df['bathrooms_new'] = gm_regression_df['bathrooms_new'].replace(to_replace='Half-bath', value=0.5)
# Convert to numeric, keeping fractional counts such as 0.5 and 1.5
for col in ['shared_bath', 'private_bath', 'bathrooms_new']:
    gm_regression_df[col] = pd.to_numeric(gm_regression_df[col])
# The raw text column is no longer needed once the three numeric columns exist
gm_regression_df = gm_regression_df.drop(columns=['bathrooms_text'])
# Inspect the processed columns
gm_regression_df[['shared_bath', 'private_bath', 'bathrooms_new']].head()
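
The parsing rule behind these three columns can also be written as a small stand-alone function. The sketch below is a simplified alternative, not the article's code: `parse_bathrooms` is a hypothetical helper, and it assumes that a half-bath written without a leading number counts as 0.5.

```python
import re

def parse_bathrooms(text):
    """Return (count, kind) for strings like '1.5 shared baths' or 'Half-bath'."""
    lowered = text.lower()
    if "shared" in lowered:
        kind = "shared"
    elif "private" in lowered:
        kind = "private"
    else:
        kind = "plain"
    match = re.search(r"\d+(\.\d+)?", lowered)
    # Assumption: no leading number means a half-bath, counted as 0.5
    count = float(match.group()) if match else 0.5
    return count, kind

for sample in ["1 shared bath", "1.5 baths", "Private half-bath", "Half-bath"]:
    print(sample, "->", parse_bathrooms(sample))
```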

Next we encode the categorical columns. Depending on what each column means, we use either ordinal encoding or one-hot encoding.

# Ordinal encoding
def encoder(df):
    for column in ['neighbourhood_group_cleansed', 'property_type']:
        labels = df[column].astype('category').cat.categories.tolist()
        # Map each category (in sorted order) to an integer starting from 1
        mapping = {label: code for code, label in enumerate(labels, start=1)}
        df[column] = df[column].map(mapping)
        print({column: mapping})
    return df 
gm_regression_df = encoder(gm_regression_df)

For host_response_time and room_type we use one-hot encoding (dummy variables).

# dtype=int keeps the dummy columns numeric (0/1) rather than boolean
host_dummy = pd.get_dummies(gm_regression_df['host_response_time'], prefix='host_response', dtype=int)
room_dummy = pd.get_dummies(gm_regression_df['room_type'], prefix='room_type', dtype=int)
# Append the encoded columns
gm_regression_df = pd.concat([gm_regression_df, host_dummy, room_dummy], axis=1)
# Drop the original columns
gm_regression_df = gm_regression_df.drop(columns=['host_response_time', 'room_type'])
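
The difference between the two encodings is easiest to see side by side. A minimal sketch on an invented three-row column:

```python
import pandas as pd

# Hypothetical toy column
toy = pd.DataFrame({"room_type": ["Private room", "Entire home/apt", "Private room"]})

# Ordinal encoding: categories are sorted, then numbered from 1
toy["room_type_code"] = toy["room_type"].astype("category").cat.codes + 1

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(toy["room_type"], prefix="room_type", dtype=int)

print(toy)
print(one_hot)
```

Ordinal codes impose an order on the categories (here simply alphabetical), which a linear model will read as a numeric scale. One-hot columns avoid that at the cost of more features.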

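We then lightly process the df_amenities frame built earlier and append it to the feature set.

# Total count per amenity, with the amenity name as a regular column
df_3 = df_amenities.sum().rename_axis('amenities').reset_index(name='count')
# Take the first 150 amenity names
features = df_3['amenities'][:150].to_list()
amenities_updated = df_amenities.filter(items=features)
gm_regression_df = pd.concat([gm_regression_df, amenities_updated], axis=1)

Check the dimensions of the final data:

gm_regression_df.shape
# (3584, 198)

We end up with 198 columns. To limit multicollinearity among the features, we use the variance inflation factor (VIF) to select features for the models. Features with a VIF above 10 are removed, because most of their variance can already be expressed and explained by other features in the dataset.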

# Compute the VIF of every feature
vif_model = gm_regression_df.drop(['price'], axis=1)
# Use a plain float array so the auxiliary regressions receive numeric input
vif_values = vif_model.values.astype(float)
vif_df = pd.DataFrame()
vif_df['feature'] = vif_model.columns
vif_df['VIF'] = [variance_inflation_factor(vif_values, i) for i in range(vif_values.shape[1])]
# Keep features with a VIF of at most 10
vif_df_new = vif_df[vif_df['VIF']<=10]
feature_list =  vif_df_new['feature'].to_list()
# Select the data for these features
model_df = gm_regression_df.filter(items=(feature_list))
model_df.head()
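
For intuition, the VIF of feature i is 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on all the other features. The NumPy sketch below computes it directly on synthetic data. It is an illustration, not the statsmodels implementation: the helper `vif` is hypothetical, and it adds an intercept to the auxiliary regression, whereas statsmodels, as far as I know, regresses on exactly the columns it is given.

```python
import numpy as np

def vif(X: np.ndarray, i: int) -> float:
    """VIF of column i: 1 / (1 - R^2) from regressing column i on the others."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    # Assumption: an intercept column is added to the auxiliary regression
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + b + rng.normal(scale=0.05, size=200)  # nearly a linear combination of a and b

X = np.column_stack([a, b, c])
vifs = [vif(X, i) for i in range(X.shape[1])]
vif_independent = vif(np.column_stack([a, b]), 0)
print(vifs, vif_independent)
```

With the near-duplicate column c present, all three features get a VIF far above 10. With only the two independent columns, the VIF stays close to 1.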

We then join the price target column back on to build the complete dataset.

price_col = gm_regression_df['price']
model_df = model_df.join(price_col)

Machine Learning Algorithms

Here we use several typical regression algorithms, including linear regression, RandomForestRegressor, Lasso regression, and GradientBoostingRegressor.

For how to apply these algorithms, see the corresponding ShowMeAI tutorials and articles.

Machine Learning in Practice: Hands-On Machine Learning series

Machine Learning in Practice | Getting Started with SKLearn and Simple Examples

Machine Learning in Practice | The Complete SKLearn Usage Guide

Linear Regression

def linear_reg(df, test_size=0.3, random_state=42):
    '''
    Build the model and return the evaluation results.
    Input: dataframe holding the features and the price column
    Output: feature importance and evaluation metrics (RMSE and R-squared)
    '''
    X = df.drop(columns=['price'])
    y = df[['price']]
    X_columns = X.columns
    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size, random_state=random_state)
    # Linear regression model
    clf = LinearRegression()
    # Candidate parameters
    parameters = {
                  'n_jobs': [1, 2, 5, 10, 100],
                  'fit_intercept': [True, False]
                  }
    # Grid search with cross-validation
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=3, verbose=3)  
    cv.fit(X_train,y_train)
    # Predict on the test set
    pred = cv.predict(X_test)
    # Evaluate the model
    r2 = r2_score(y_test, pred)
    mse = mean_squared_error(y_test, pred)
    rmse = mse **.5
    # Best parameters
    best_par = cv.best_params_
    coefficients = cv.best_estimator_.coef_
    # Feature importance: absolute value of each coefficient
    importance = np.abs(coefficients)
    feature_importance = pd.DataFrame(importance, columns=X_columns).T
    feature_importance.columns = ['importance']
    feature_importance = feature_importance.sort_values('importance', ascending=False)
    print("The model performance for testing set")
    print("--------------------------------------")
    print('RMSE is {}'.format(rmse))
    print('R2 score is {}'.format(r2))
    print("\n")
    return feature_importance, rmse, r2
linear_feat_importance, linear_rmse, linear_r2 = linear_reg(model_df)
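
The two metrics returned by these functions are simple to compute by hand, which helps when reading the results. The sketch below implements RMSE and R-squared with NumPy on invented numbers and shows how a single badly predicted expensive listing dominates the RMSE.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical prices: three ordinary listings and one expensive outlier
y_true = [50, 80, 120, 700]
y_pred = [60, 90, 110, 300]

print(rmse(y_true[:3], y_pred[:3]))  # errors of 10 on every ordinary listing
print(rmse(y_true, y_pred))          # the outlier's error of 400 dominates
print(r2(y_true, y_pred))
```

Because the errors are squared before averaging, a handful of extreme prices can inflate RMSE even when most predictions are close.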

Random Forest

# Random forest modeling
def random_forest(df):
    '''
    Build the model and return the evaluation results.
    Input: dataframe holding the features and the price column
    Output: evaluation metrics (RMSE and R-squared), best parameters, and permutation feature importance
    '''
    X = df.drop(['price'], axis=1)
    X_columns = X.columns
    y = df['price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    # Random forest model
    clf = RandomForestRegressor()
    # Candidate parameters (each key may appear only once in the dict)
    parameters = {
                'n_estimators': [50, 100, 200, 300, 400],
                'max_depth': [80, 90, 100]
                     }
    # Grid search with cross-validation
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=5, verbose=3)
    model = cv
    model.fit(X_train, y_train)
    # Predict on the test set
    pred = model.predict(X_test)
    # Evaluate the model
    mse = mean_squared_error(y_test, pred)
    rmse = mse**.5
    r2 = r2_score(y_test, pred)
    # Best hyperparameters
    best_par = model.best_params_
    # Permutation feature importance on the test set
    r = permutation_importance(model, X_test, y_test,
                           n_repeats=10,
                           random_state=0)
    perm = pd.DataFrame(columns=['AVG_Importance'], index=[i for i in X_train.columns])
    perm['AVG_Importance'] = r.importances_mean
    perm = perm.sort_values(by='AVG_Importance', ascending=False);
    return rmse, r2, best_par, perm
# Run the model
r_forest_rmse, r_forest_r2, r_fores_best_params, r_forest_importance = random_forest(model_df)

The run produces the following log:

Fitting 5 folds for each of 15 candidates, totalling 75 fits
[CV 1/5] END ..................max_depth=80, n_estimators=50; total time=   2.4s
[CV 2/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 3/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 4/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 5/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 1/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 2/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 3/5] END .................max_depth=80, n_estimators=100; total time=   3.9s
[CV 4/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 5/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 1/5] END .................max_depth=80, n_estimators=200; total time=   7.5s
[CV 2/5] END .................max_depth=80, n_estimators=200; total time=   7.7s
[CV 3/5] END .................max_depth=80, n_estimators=200; total time=   7.7s
[CV 4/5] END .................max_depth=80, n_estimators=200; total time=   7.6s
[CV 5/5] END .................max_depth=80, n_estimators=200; total time=   7.6s
[CV 1/5] END .................max_depth=80, n_estimators=300; total time=  11.3s
[CV 2/5] END .................max_depth=80, n_estimators=300; total time=  11.4s
[CV 3/5] END .................max_depth=80, n_estimators=300; total time=  11.7s
[CV 4/5] END .................max_depth=80, n_estimators=300; total time=  11.4s
[CV 5/5] END .................max_depth=80, n_estimators=300; total time=  11.4s
[CV 1/5] END .................max_depth=80, n_estimators=400; total time=  15.1s
[CV 2/5] END .................max_depth=80, n_estimators=400; total time=  16.4s
[CV 3/5] END .................max_depth=80, n_estimators=400; total time=  15.6s
[CV 4/5] END .................max_depth=80, n_estimators=400; total time=  15.2s
[CV 5/5] END .................max_depth=80, n_estimators=400; total time=  15.6s
[CV 1/5] END ..................max_depth=90, n_estimators=50; total time=   1.9s
[CV 2/5] END ..................max_depth=90, n_estimators=50; total time=   1.9s
[CV 3/5] END ..................max_depth=90, n_estimators=50; total time=   2.0s
[CV 4/5] END ..................max_depth=90, n_estimators=50; total time=   2.0s
[CV 5/5] END ..................max_depth=90, n_estimators=50; total time=   2.0s
[CV 1/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 2/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 3/5] END .................max_depth=90, n_estimators=100; total time=   4.0s
[CV 4/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 5/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 1/5] END .................max_depth=90, n_estimators=200; total time=   8.7s
[CV 2/5] END .................max_depth=90, n_estimators=200; total time=   8.1s
[CV 3/5] END .................max_depth=90, n_estimators=200; total time=   8.1s
[CV 4/5] END .................max_depth=90, n_estimators=200; total time=   7.7s
[CV 5/5] END .................max_depth=90, n_estimators=200; total time=   8.0s
[CV 1/5] END .................max_depth=90, n_estimators=300; total time=  11.6s
[CV 2/5] END .................max_depth=90, n_estimators=300; total time=  11.8s
[CV 3/5] END .................max_depth=90, n_estimators=300; total time=  12.2s
[CV 4/5] END .................max_depth=90, n_estimators=300; total time=  12.0s
[CV 5/5] END .................max_depth=90, n_estimators=300; total time=  13.2s
[CV 1/5] END .................max_depth=90, n_estimators=400; total time=  15.6s
[CV 2/5] END .................max_depth=90, n_estimators=400; total time=  15.9s
[CV 3/5] END .................max_depth=90, n_estimators=400; total time=  16.1s
[CV 4/5] END .................max_depth=90, n_estimators=400; total time=  15.7s
[CV 5/5] END .................max_depth=90, n_estimators=400; total time=  15.8s
[CV 1/5] END .................max_depth=100, n_estimators=50; total time=   1.9s
[CV 2/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 3/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 4/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 5/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 1/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 2/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 3/5] END ................max_depth=100, n_estimators=100; total time=   4.1s
[CV 4/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 5/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 1/5] END ................max_depth=100, n_estimators=200; total time=   7.8s
[CV 2/5] END ................max_depth=100, n_estimators=200; total time=   7.9s
[CV 3/5] END ................max_depth=100, n_estimators=200; total time=   8.1s
[CV 4/5] END ................max_depth=100, n_estimators=200; total time=   7.9s
[CV 5/5] END ................max_depth=100, n_estimators=200; total time=   7.8s
[CV 1/5] END ................max_depth=100, n_estimators=300; total time=  11.8s
[CV 2/5] END ................max_depth=100, n_estimators=300; total time=  12.0s
[CV 3/5] END ................max_depth=100, n_estimators=300; total time=  12.8s
[CV 4/5] END ................max_depth=100, n_estimators=300; total time=  11.4s
[CV 5/5] END ................max_depth=100, n_estimators=300; total time=  11.5s
[CV 1/5] END ................max_depth=100, n_estimators=400; total time=  15.1s
[CV 2/5] END ................max_depth=100, n_estimators=400; total time=  15.3s
[CV 3/5] END ................max_depth=100, n_estimators=400; total time=  15.6s
[CV 4/5] END ................max_depth=100, n_estimators=400; total time=  15.3s
[CV 5/5] END ................max_depth=100, n_estimators=400; total time=  15.3s

The final random forest results are:

r_forest_rmse, r_forest_r2
# (218.7941962807868, 0.4208644494689676)
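
The random forest function ranks features by permutation importance: shuffle one column, re-score the model, and record how much the score drops. The NumPy sketch below reproduces that idea on synthetic data with a plain least-squares model. The helper `permutation_importance_sketch` is hypothetical and much simpler than scikit-learn's `permutation_importance`.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in R^2 when each column is shuffled (illustrative only)."""
    rng = np.random.default_rng(seed)

    def score(M):
        resid = y - predict(M)
        return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

    base = score(X)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            shuffled = X.copy()
            shuffled[:, j] = rng.permutation(shuffled[:, j])
            drops[j] += base - score(shuffled)
    return drops / n_repeats

# Synthetic data: only column 0 influences the target
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

# Fit a linear model by least squares and use it as the predictor
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
importances = permutation_importance_sketch(lambda M: M @ coef, X, y)
print(importances)
```

Shuffling the informative column destroys the fit, while shuffling the irrelevant column barely changes the score, so the first importance is large and the second is close to zero.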

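GBDT

def GBDT_model(df):
    '''
    Build the model and return the evaluation results.
    Input: dataframe holding the features and the price column
    Output: R-squared, MSE, RMSE, and the top 10 feature importances
    '''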
    X = df.drop(['price'], axis=1)
    Y = df['price']
    X_columns = X.columns
    X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42)
    clf = GradientBoostingRegressor()
    parameters = {
                'learning_rate': [0.1, 0.5, 1],
                'min_samples_leaf': [10, 20, 40 , 60]
                     }
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=5, verbose=3)
    model = cv
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    r2 = r2_score(y_test, pred)
    mse = mean_squared_error(y_test, pred)
    rmse = mse**.5
    coefficients = model.best_estimator_.feature_importances_
    importance = np.abs(coefficients)
    feature_importance = pd.DataFrame(importance, index= X_columns,
                                      columns=['importance']).sort_values('importance', ascending=False)[:10]
    return r2, mse, rmse, feature_importance
GBDT_r2, GBDT_mse, GBDT_rmse, GBDT_feature_importance = GBDT_model(model_df)
GBDT_r2, GBDT_rmse
# (0.46352992147034244, 210.58063809645563)

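Results & Analysis

So far the random forest behaves most stably. The GradientBoostingRegressor ensemble reaches a higher R² (0.46 versus 0.42), but its RMSE (about 211) is still high relative to the median price of 79. Boosting models are sensitive to outliers, so this is probably caused by the outliers in the dataset.

Next we try an optimization: remove the outliers from the dataset and see whether model performance improves.

Improving the Results

The outliers were already spotted earlier during exploration. We handle them with a statistical rule.

# Compute the upper price bound from the interquartile range
q3, q1 = np.percentile(model_df['price'], [75, 25])
iqr = q3 - q1
q3 + (iqr*1.5)
# Result: 245.0

We treat any price above $245 as an outlier and remove it.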

new_model_df = model_df[model_df['price']<=245]
# Plot the price distribution after removing the outliers
sb.histplot(new_model_df['price'])
plt.title('New price distribution in the dataset')
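
The cutoff follows the usual interquartile-range rule: anything above Q3 + 1.5 × IQR is treated as an outlier. A worked sketch on six invented prices:

```python
import numpy as np

# Hypothetical prices with one extreme value
prices = np.array([40, 60, 79, 100, 150, 7372])

q1, q3 = np.percentile(prices, [25, 75])   # 64.75 and 137.5 with linear interpolation
iqr = q3 - q1                              # 72.75
upper_fence = q3 + 1.5 * iqr               # 137.5 + 109.125 = 246.625

kept = prices[prices <= upper_fence]
print(upper_fence, kept)
```

Only the extreme value is dropped. The fence is computed from the quartiles, so the outlier itself has little influence on where the cutoff lands.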

We re-run the algorithms on the filtered data:

linear_feat_importance, linear_rmse, linear_r2 = linear_reg(new_model_df)
r_forest_rmse, r_forest_r2, r_fores_best_params, r_forest_importance = random_forest(new_model_df)
GBDT_r2, GBDT_mse, GBDT_rmse, GBDT_feature_importance = GBDT_model(new_model_df)

This gives a new set of RMSE and R² values for the three models.

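Feature Importance Analysis

So, according to our model, which factors matter most when predicting Airbnb listing prices in the Greater Manchester area?

r_feature_importance = r_forest_importance.reset_index()
r_feature_importance = r_feature_importance.rename(columns={'index':'Feature'})
r_feature_importance[:15]

# Plot the 15 most important features
r_feature_importance[:15].sort_values(by='AVG_Importance').plot(kind='barh', x='Feature', y='AVG_Importance', figsize=(8,6));
plt.title('Top 15 Most Important Features');

The important factors reported by our model include:

accommodates: the maximum number of guests the listing can accommodate.
bathrooms_new: the number of bathrooms not labeled as shared or private.
minimum_nights: the minimum number of nights that can be booked.
number_of_reviews: the total number of reviews.
Free street parking: the presence of free street parking is the amenity that matters most to the model's pricing.
Gym: gym facilities.

Summary & Outlook

Through in-depth analysis and modeling of Airbnb data, we built an understanding of the homestay rental scenario and a price estimation model for it. There is more we can do to improve model performance and make the estimates more accurate, for example:

More thorough feature engineering, building more effective features from the business context.
Trying models such as xgboost, lightgbm, and catboost.
Tuning hyperparameters more deeply with methods such as Bayesian optimization.
Introducing deep learning and neural network methods.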

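References

Data Science Library Cheat Sheets | Pandas Cheat Sheet: www.showmeai.tech/article-det…
Illustrated Data Analysis: From Beginner to Expert tutorial series: www.showmeai.tech/tutorials/3…
Machine Learning in Practice: Hands-On Machine Learning series: www.showmeai.tech/tutorials/4…
Machine Learning in Practice | Getting Started with SKLearn and Simple Examples: www.showmeai.tech/article-det…
Machine Learning in Practice | The Complete SKLearn Usage Guide: www.showmeai.tech/article-det…
Machine Learning in Practice | A Complete Guide to Feature Engineering: www.showmeai.tech/article-det…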
