I. AI-Based Prediction and Analysis of School Bullying Victims

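1. Background

The dataset comes from the Global School-Based Student Health Survey (GSHS), a school-based survey that uses a self-administered questionnaire to collect data on young people's health behaviors and on protective factors related to the leading causes of morbidity and mortality.
The survey was conducted in Argentina in 2018, with 56,981 students taking part.
The school response rate was 86%, the student response rate was 74%, and the overall response rate was 63%.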

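2. Data Description

Field (DataFrame column): Description
Bullied on school property in past 12 months (曩昔12个月内遭到校内霸凌): Bullied on school property during the past 12 months
Bullied not on school property in past 12 months (曩昔12个月内校外遭到霸凌): Bullied somewhere other than school property during the past 12 months
Cyber bullied in past 12 months (曩昔12个月内遭到网络霸凌): Cyberbullied during the past 12 months
Custom Age (年纪): Age (custom field)
Sex (性别): Sex
Physically attacked (身体遭到进犯): Was physically attacked
Physical fighting (身体对立): Was in a physical fight
Felt lonely (感到孤单): Felt lonely
Close friends (密切的朋友): Close friends
Miss school no permission (未经校园答应的矿工天数): Days of school missed without permission
Other students kind and helpful (其他学生的好心和协助): Other students were kind and helpful
Parents understand problems (爸爸妈妈了解问题): Whether parents were aware of the student's problems
Most of the time or always felt lonely (大部分时间或总是感到孤单): Felt lonely most of the time or always
Missed classes or school without permission (未经答应而缺课或旷课): Missed classes or school without permission
Were underweight (体重过轻): Whether the student was underweight
Were overweight (体重过重): Whether the student was overweight
Were obese (肥壮): Whether the student was obese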

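3. Problem Statement

Bullying is commonly associated with loneliness, a lack of close friends, poor communication with parents, and absence from school (see, for example, Nansel et al., "Bullying Behaviors Among US Youth: Prevalence and Association With Psychosocial Adjustment"). The survey data suggest that students who are bullied are more often underweight, overweight, or obese.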

II. Data Processing

1. Reading the Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import os
df = pd.read_csv("data/data208695/Bullying_Dataset.csv",sep=',')
df.head()
序号 曩昔12个月内遭到校内霸凌 曩昔12个月内校外遭到霸凌 曩昔12个月内遭到网络霸凌 年纪 性别 身体遭到进犯 身体对立 感到孤单 密切的朋友 未经校园答应的矿工天数 其他学生的好心和协助 爸爸妈妈了解问题 大部分时间或总是感到孤单 未经答应而缺课或旷课 体重过轻 体重过重 肥壮
0 1 Yes Yes 13 Female 0 times 0 times Always 2 10 or more days Never Always Yes Yes
1 2 No No No 13 Female 0 times 0 times Never 3 or more 0 days Sometimes Always No No
2 3 No No No 14 Male 0 times 0 times Never 3 or more 0 days Sometimes Always No No No No No
3 4 No No No 16 Male 0 times 2 or 3 times Never 3 or more 0 days Sometimes No No No No No
4 5 No No No 13 Female 0 times 0 times Rarely 3 or more 0 days Most of the time Most of the time No No
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56981 entries, 0 to 56980
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   序号             56981 non-null  int64 
 1   曩昔12个月内遭到校内霸凌  56981 non-null  object
 2   曩昔12个月内校外遭到霸凌  56981 non-null  object
 3   曩昔12个月内遭到网络霸凌  56981 non-null  object
 4   年纪             56981 non-null  object
 5   性别             56981 non-null  object
 6   身体遭到进犯         56981 non-null  object
 7   身体对立           56981 non-null  object
 8   感到孤单           56981 non-null  object
 9   密切的朋友          56981 non-null  object
 10  未经校园答应的矿工天数    56981 non-null  object
 11  其他学生的好心和协助     56981 non-null  object
 12  爸爸妈妈了解问题         56981 non-null  object
 13  大部分时间或总是感到孤单   56981 non-null  object
 14  未经答应而缺课或旷课     56981 non-null  object
 15  体重过轻           56981 non-null  object
 16  体重过重           56981 non-null  object
 17  肥壮             56981 non-null  object
dtypes: int64(1), object(17)
memory usage: 7.8+ MB

2. Counting Missing Values

Entries that contain only whitespace are replaced with NaN:

df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
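  • ^ matches the start of the string
  • $ matches the end of the string
  • \s matches a whitespace character
  • * means zero or more repetitions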
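To see what the pattern does in practice, here is a minimal sketch on a made-up three-row frame. The `toy` frame and its values are illustrative only and are not taken from the survey data. Empty strings and whitespace-only strings become NaN, while real answers are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "" and whitespace-only cells stand in for blank answers
toy = pd.DataFrame({
    "a": ["Yes", "", "  "],
    "b": ["No", "\t", "Always"],
})

# Same pattern as in the notebook: the whole cell must be zero or more whitespace characters
cleaned = toy.replace(r'^\s*$', np.nan, regex=True)

# Count missing values per column after the replacement
missing_counts = cleaned.isnull().sum()
print(missing_counts)
```

Because the pattern is anchored with ^ and $, a cell such as "Most of the time" is untouched even though it contains spaces.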
df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
df.isnull().sum()
序号                   0
曩昔12个月内遭到校内霸凌     1239
曩昔12个月内校外遭到霸凌      489
曩昔12个月内遭到网络霸凌      571
年纪                 108
性别                 536
身体遭到进犯             240
身体对立               268
感到孤单               366
密切的朋友             1076
未经校园答应的矿工天数       1864
其他学生的好心和协助        1559
爸爸妈妈了解问题            2373
大部分时间或总是感到孤单       366
未经答应而缺课或旷课        1864
体重过轻             20929
体重过重             20929
肥壮               20929
dtype: int64

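The 体重过轻 (underweight), 体重过重 (overweight), and 肥壮 (obese) columns each have 20,929 missing values, so they can be left out of the feature set.

3. Visualizing Missing Values

The main concern here is rendering Chinese text. AI Studio now ships with a built-in Chinese font, but it has to be declared explicitly.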

import warnings
warnings.filterwarnings("ignore")
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager
%matplotlib inline
# Configure matplotlib to render Chinese text
matplotlib.rcParams['font.sans-serif'] = ['FZSongYi-Z13S'] # default font
matplotlib.rcParams['axes.unicode_minus'] = False # keep the minus sign '-' from rendering as a box in saved figures
# Percentage of missing values in each column
missing_perc = (df.isnull().sum() / len(df)) * 100
# Sort in descending order
missing_perc_sorted = missing_perc.sort_values(ascending=False)
# Running (cumulative) sum of the per-column percentages
cumulative_perc = missing_perc_sorted.cumsum()
# Draw a Pareto-style chart
fig, ax1 = plt.subplots(figsize=(10, 6))
ax1.bar(missing_perc_sorted.index, missing_perc_sorted.values, color='tab:blue')
ax1.set_xlabel('特征')  # "Feature"
ax1.set_ylabel('缺失值比例', color='tab:blue')  # "Share of missing values"
# Rotate the x-axis labels on the visible axis so they stay readable
ax1.tick_params(axis='x', rotation=90)
# Add a second y-axis for the cumulative percentage
ax2 = ax1.twinx()
ax2.plot(missing_perc_sorted.index, cumulative_perc.values, color='tab:red', marker='o')
ax2.set_ylabel('累积百分数', color='tab:red')  # "Cumulative percentage"
# Show the figure
plt.show()

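[Figure: Pareto-style chart of the missing-value percentage per feature, with the cumulative percentage on a second y-axis]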

4. Dropping Feature Columns with Many Missing Values

# Drop columns with a high proportion of missing values
df.drop(['肥壮', '体重过轻', '体重过重'], axis=1, inplace=True)
# Drop the remaining rows that still contain missing values
df=df.dropna()
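The three weight columns are each missing 20,929 of 56,981 values (about 37%), while every other column is missing less than 5%. Instead of naming the columns by hand, the same result could be reached with a threshold. The following is only a sketch: the helper `drop_sparse_columns` and the 30% cutoff are my own assumptions and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame: pd.DataFrame, max_missing_share: float = 0.3) -> pd.DataFrame:
    """Drop columns whose share of missing values exceeds the cutoff,
    then drop any rows that still contain a missing value."""
    missing_share = frame.isnull().mean()  # fraction of NaN per column
    keep = missing_share[missing_share <= max_missing_share].index
    return frame[keep].dropna()

# Hypothetical toy data: "weight_flag" is 75% missing, "answer" is 25% missing
toy = pd.DataFrame({
    "answer": ["Yes", "No", np.nan, "No"],
    "weight_flag": [np.nan, np.nan, "No", np.nan],
})

cleaned = drop_sparse_columns(toy)
print(cleaned)
```

On the survey data, a call such as `drop_sparse_columns(df)` would be expected to remove the same three columns, since they are the only ones above the cutoff.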

5. Checking for Missing Values

# Count the non-null values in each column
non_null_counts = df.count()
# Check whether every column has the same non-null count
if non_null_counts.nunique() == 1:
    print("Total null values:", df.isnull().sum().sum())
else:
    print("Columns have different counts of non-null values.")
Total null values: 0
df.head()
序号 曩昔12个月内遭到校内霸凌 曩昔12个月内校外遭到霸凌 曩昔12个月内遭到网络霸凌 年纪 性别 身体遭到进犯 身体对立 感到孤单 密切的朋友 未经校园答应的矿工天数 其他学生的好心和协助 爸爸妈妈了解问题 大部分时间或总是感到孤单 未经答应而缺课或旷课
1 2 No No No 13 Female 0 times 0 times Never 3 or more 0 days Sometimes Always No No
2 3 No No No 14 Male 0 times 0 times Never 3 or more 0 days Sometimes Always No No
4 5 No No No 13 Female 0 times 0 times Rarely 3 or more 0 days Most of the time Most of the time No No
5 6 No No No 13 Male 0 times 1 time Never 3 or more 0 days Most of the time Always No No
6 7 No No No 14 Female 1 time 0 times Sometimes 3 or more 0 days Most of the time Always No No

III. Feature Processing

1. Encoding Categorical Features

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Exclude the 序号 (row number) column
columns=df.columns[1:]
print(len(columns))
for column in columns:
    print(column)
14
曩昔12个月内遭到校内霸凌
曩昔12个月内校外遭到霸凌
曩昔12个月内遭到网络霸凌
年纪
性别
身体遭到进犯
身体对立
感到孤单
密切的朋友
未经校园答应的矿工天数
其他学生的好心和协助
爸爸妈妈了解问题
大部分时间或总是感到孤单
未经答应而缺课或旷课
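for column in columns:
    df[column]=le.fit_transform(df[column])
    print(f"Finished encoding column {column}")
Finished encoding column 曩昔12个月内遭到校内霸凌
Finished encoding column 曩昔12个月内校外遭到霸凌
Finished encoding column 曩昔12个月内遭到网络霸凌
Finished encoding column 年纪
Finished encoding column 性别
Finished encoding column 身体遭到进犯
Finished encoding column 身体对立
Finished encoding column 感到孤单
Finished encoding column 密切的朋友
Finished encoding column 未经校园答应的矿工天数
Finished encoding column 其他学生的好心和协助
Finished encoding column 爸爸妈妈了解问题
Finished encoding column 大部分时间或总是感到孤单
Finished encoding column 未经答应而缺课或旷课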
df.head()
序号 曩昔12个月内遭到校内霸凌 曩昔12个月内校外遭到霸凌 曩昔12个月内遭到网络霸凌 年纪 性别 身体遭到进犯 身体对立 感到孤单 密切的朋友 未经校园答应的矿工天数 其他学生的好心和协助 爸爸妈妈了解问题 大部分时间或总是感到孤单 未经答应而缺课或旷课
1 2 0 0 0 2 0 0 0 2 3 0 4 0 0 0
2 3 0 0 0 3 1 0 0 2 3 0 4 0 0 0
4 5 0 0 0 2 0 0 0 3 3 0 1 1 0 0
5 6 0 0 0 2 1 0 1 2 3 0 1 0 0 0
6 7 0 0 0 3 0 1 0 4 3 0 1 0 0 0
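One detail worth noting in the encoded table: LabelEncoder numbers classes in sorted (alphabetical) order, which is why Never became 2, Rarely 3, and Sometimes 4 in the 感到孤单 column. For frequency-style answers that is not the natural order. The sketch below contrasts it with an explicit ordering; the `frequency_order` list is my assumption about the intended ranking and is not defined anywhere in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# The five answer levels that appear in the frequency-style columns
answers = pd.Series(["Never", "Rarely", "Sometimes", "Most of the time", "Always"])

# LabelEncoder sorts the class labels, so the codes follow alphabetical order
alphabetical = LabelEncoder().fit_transform(answers)

# Assumed ranking from least to most frequent (not taken from the notebook)
frequency_order = ["Never", "Rarely", "Sometimes", "Most of the time", "Always"]
ordinal = pd.Categorical(answers, categories=frequency_order, ordered=True).codes

for answer, a_code, o_code in zip(answers, alphabetical, ordinal):
    print(f"{answer:<17} alphabetical={a_code} ordinal={o_code}")
```

Tree-based models are fairly tolerant of arbitrary codes, but distance-based and linear models (KNN, SVC, logistic regression) treat the codes as magnitudes, so an ordering that matches the meaning may matter there.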

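2. Splitting the Dataset

from sklearn.model_selection import train_test_split
# Split the data into a training set and a test set
# Target: 身体遭到进犯 (Physically attacked); 序号 is only a row number, so it is dropped from the features
X = df.drop(['序号', '身体遭到进犯'], axis=1)
y = df['身体遭到进犯']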
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2023)
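The split above is purely random. When the target classes are imbalanced, passing `stratify` to `train_test_split` keeps the class proportions the same in both subsets. This is a sketch on synthetic labels; the class counts in `y_demo` are invented for illustration and say nothing about the real distribution.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced labels: 80% / 15% / 5%
y_demo = np.array([0] * 800 + [1] * 150 + [2] * 50)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(len(y_demo), 3))

# stratify=y_demo asks for the same class mix in the train and test subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2023, stratify=y_demo
)

train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("train:", train_share)
print("test: ", test_share)
```

With the notebook's variables the equivalent call would add `stratify=y` to the existing `train_test_split(X, y, ...)` line.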

IV. Model Training

!pip install catboost

1. Decision Tree Classifier

import xgboost
import lightgbm
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
# Train the decision tree classifier
clf = DecisionTreeClassifier(random_state=1024)
clf.fit(X_train, y_train)
accuracy_list = []
# Predict on the test set
y_pred = clf.predict(X_test)
from sklearn.metrics import accuracy_score
# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.4f%%" % (accuracy * 100.0))
accuracy_list.append(accuracy*100)
Accuracy: 76.2585%
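An accuracy figure is easier to interpret next to a trivial baseline that always predicts the most frequent class. The following is a sketch using scikit-learn's DummyClassifier on made-up labels (`y_toy` is invented); with the real data one would fit it on X_train, y_train and score it on X_test, y_test.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up imbalanced labels: the majority class covers 80% of the rows
y_toy = np.array([0] * 80 + [1] * 15 + [2] * 5)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print("Baseline accuracy: %.4f%%" % (baseline_acc * 100.0))
```

If a trained model scores close to this baseline, most of its accuracy may come from the class imbalance rather than from the features.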

2. Random Forest (RandomForestClassifier)

from sklearn.ensemble import RandomForestClassifier
r_clf = RandomForestClassifier(max_features=0.5, max_depth=15, random_state=1)
r_clf.fit(X_train, y_train)
r_pred = r_clf.predict(X_test)
r_acc = accuracy_score(y_test, r_pred)
print(r_acc)
accuracy_list.append(100*r_acc)
0.8279972416510688

3. Logistic Regression

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
log_reg_pred = log_reg.predict(X_test)
log_reg_acc = accuracy_score(y_test, log_reg_pred)
print(log_reg_acc)
accuracy_list.append(100*log_reg_acc)
0.8344990641316127

4. Support Vector Machine (SVC)

sv_clf = SVC()
sv_clf.fit(X_train, y_train)
sv_clf_pred = sv_clf.predict(X_test)
sv_clf_acc = accuracy_score(y_test, sv_clf_pred)
print(sv_clf_acc)
accuracy_list.append(100* sv_clf_acc)
0.8360752635208354

5. K-Nearest Neighbors (KNeighborsClassifier)

kn_clf = KNeighborsClassifier(n_neighbors=6)
kn_clf.fit(X_train, y_train)
kn_pred = kn_clf.predict(X_test)
kn_acc = accuracy_score(y_test, kn_pred)
print(kn_acc)
accuracy_list.append(100*kn_acc)
0.8304600531967293

6. Gradient Boosting Classifier

gradientboost_clf = GradientBoostingClassifier(max_depth=2, random_state=1)
gradientboost_clf.fit(X_train,y_train)
gradientboost_pred = gradientboost_clf.predict(X_test)
gradientboost_acc = accuracy_score(y_test, gradientboost_pred)
print(gradientboost_acc)
accuracy_list.append(100*gradientboost_acc)
0.8364693133681411

7. XGBRF Classifier (XGBRFClassifier)

xgb_clf = xgboost.XGBRFClassifier(max_depth=3, random_state=1)
xgb_clf.fit(X_train,y_train)
xgb_pred = xgb_clf.predict(X_test)
xgb_acc = accuracy_score(y_test, xgb_pred)
accuracy_list.append(100*xgb_acc)
print(xgb_acc)
[23:09:17] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
0.8360752635208354

8. LGBMClassifier

lgb_clf = lightgbm.LGBMClassifier(max_depth=2, random_state=4)
lgb_clf.fit(X_train,y_train)
lgb_pred = lgb_clf.predict(X_test)
lgb_acc = accuracy_score(y_test, lgb_pred)
print(lgb_acc)
accuracy_list.append(100*lgb_acc)
0.8363708009063147

9. CatBoost Classifier

cat_clf = CatBoostClassifier()
cat_clf.fit(X_train,y_train)
cat_pred = cat_clf.predict(X_test)
cat_acc = accuracy_score(y_test, cat_pred)
print(cat_acc)
accuracy_list.append(100*cat_acc)
989:	learn: 0.4191992	total: 12.3s	remaining: 124ms
990:	learn: 0.4190779	total: 12.3s	remaining: 112ms
991:	learn: 0.4189944	total: 12.3s	remaining: 99.5ms
992:	learn: 0.4189137	total: 12.3s	remaining: 87ms
993:	learn: 0.4188195	total: 12.4s	remaining: 74.6ms
994:	learn: 0.4187774	total: 12.4s	remaining: 62.2ms
995:	learn: 0.4187095	total: 12.4s	remaining: 49.7ms
996:	learn: 0.4186093	total: 12.4s	remaining: 37.3ms
997:	learn: 0.4185544	total: 12.4s	remaining: 24.9ms
998:	learn: 0.4184855	total: 12.4s	remaining: 12.4ms
999:	learn: 0.4183650	total: 12.4s	remaining: 0us
0.8307555905822086

V. Comparing Model Results

print(accuracy_list)
[76.25849669983253, 82.79972416510688, 83.44990641316127, 83.60752635208354, 83.04600531967293, 83.6469313368141, 83.60752635208354, 83.63708009063147, 83.07555905822086]
model_list = ['DecisionTree', 'RandomForest', 'Logistic Regression', 'SVC','KNearestNeighbours',
              'GradientBooster', 'XGBRF','LGBM', 'CatBoostClassifier']
plt.rcParams['figure.figsize']=20,8
sns.set_style('darkgrid')
ax = sns.barplot(x=model_list, y=accuracy_list, palette = "husl", saturation =2.0)
plt.xlabel('Classifier Models', fontsize = 20 )
plt.ylabel('% of Accuracy', fontsize = 20)
plt.title('Accuracy of different Classifier Models', fontsize = 20)
plt.xticks(fontsize = 12, horizontalalignment = 'center', rotation = 8)
plt.yticks(fontsize = 12)
for i in ax.patches:
    width, height = i.get_width(), i.get_height()
    x, y = i.get_xy() 
    ax.annotate(f'{round(height,2)}%', (x + width/2, y + height*1.02), ha='center', fontsize = 'x-large')
plt.show()

[Figure: bar chart "Accuracy of different Classifier Models"; x-axis: Classifier Models, y-axis: % of Accuracy]
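The same comparison can be read off as a ranked list. The sketch below reuses the model names and accuracy values printed above; only the sorting and formatting are new.

```python
# Names and accuracies copied from the output above
model_list = ['DecisionTree', 'RandomForest', 'Logistic Regression', 'SVC', 'KNearestNeighbours',
              'GradientBooster', 'XGBRF', 'LGBM', 'CatBoostClassifier']
accuracy_list = [76.25849669983253, 82.79972416510688, 83.44990641316127,
                 83.60752635208354, 83.04600531967293, 83.6469313368141,
                 83.60752635208354, 83.63708009063147, 83.07555905822086]

# Pair each model with its accuracy and sort from best to worst
ranked = sorted(zip(model_list, accuracy_list), key=lambda pair: pair[1], reverse=True)
for name, acc in ranked:
    print(f"{name:<20} {acc:.2f}%")

best_name, best_acc = ranked[0]
# Spread between the best model and the second-worst one (everything except the decision tree)
top8_spread = best_acc - ranked[-2][1]
print(f"Best: {best_name}; spread across the top eight models: {top8_spread:.2f} points")
```

Apart from the single decision tree, all models land within one percentage point of each other, so the ranking among them should be read with caution.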

SVC (support vector machine), XGBRF, and LightGBM took especially long to train, so I won't be using them again!
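The notebook does not record training times, so the remark above is an impression rather than a measurement. A sketch of how fit time could be logged next to accuracy follows. It runs on synthetic data from `make_classification` with two fast models chosen only to keep the example quick; the `results` dictionary layout is my own.

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey features and target
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

models = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()           # wall-clock timer around fit only
    model.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start
    acc = accuracy_score(y_te, model.predict(X_te))
    results[name] = {"accuracy": acc, "fit_seconds": fit_seconds}
    print(f"{name:<20} accuracy={acc:.4f} fit={fit_seconds:.3f}s")
```

Swapping the notebook's own classifiers and X_train, y_train into the `models` dictionary would give a per-model timing to back up the choice of which ones to drop.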

Project link: aistudio.baidu.com/aistudio/pr…
