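How Much Do Data Scientists Earn? A Full Data Analysis and Visualization ⛵
  • Author: 韩信子@ShowMeAI
  • Data Analysis in Practice series: www.showmeai.tech/tutorials/4…
  • AI Jobs & Guides series: www.showmeai.tech/tutorials/4…
  • Article URL: www.showmeai.tech/article-det…
  • Notice: All rights reserved. To republish, contact the platform and the author and credit the source.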

Introduction

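Data science keeps growing in popularity across the internet, healthcare, telecom, retail, sports, aviation, the arts, and many other fields. On Glassdoor's list of the best jobs in America, data science ranked third, with about 10,071 job openings in 2022.

Beyond the appeal of data itself, salaries for data science roles draw a lot of attention. In this article, ShowMeAI uses data to examine the following questions:

  • Which jobs are most common in data science?
  • Which country pays the most and offers the most opportunities?
  • What is the typical salary range?
  • How important is experience level for a data scientist?
  • Data science: full-time vs. freelance
  • Which data science job has the highest salary?
  • Which data science job has the highest average salary?
  • Minimum and maximum pay for data science professionals
  • What size are the companies that hire data science professionals?
  • Is pay related to company size?
  • What is the split between WFH (work from home) and WFO (work from office)?
  • How do data science salaries grow from year to year?
  • If someone is looking for a data science job, what would you suggest they search for online?
  • If you have a few years of junior experience, what size of company should you consider moving to?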

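About the data

The dataset used here is the Data Science Job Salaries dataset, which can be downloaded from ShowMeAI's Baidu Netdisk link.

Dataset download (Baidu Netdisk): reply 『实战』 to the WeChat official account 『ShowMeAI研究中心』 to get the 『ds_salaries dataset』 for article [37], Data Scientist Salary Analysis and Visualization with pandasql and plotly.

ShowMeAI official GitHub: github.com/ShowMeAI-Hu…

The dataset has 11 columns. Their names and meanings are:

Column              Meaning
work_year           Year the salary was paid
experience_level    Experience level at the time the salary was paid
employment_type     Type of employment
job_title           Job title
salary              Total gross salary paid
salary_currency     Currency of the salary paid
salary_in_usd       Salary converted to US dollars
employee_residence  Employee's primary country of residence
remote_ratio        Overall amount of work done remotely
company_location    Country/region of the employer's main office
company_size        Company size based on number of employees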

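This analysis uses Pandas and SQL. ShowMeAI's data analysis tutorial and the matching cheat-sheet articles are a good way to study the tools systematically and practice hands-on:

Illustrated Data Analysis: From Beginner to Proficient tutorial series

Programming Language Cheat Sheets | SQL Cheat Sheet

Data Science Library Cheat Sheets | Pandas Cheat Sheet

Data Science Library Cheat Sheets | Matplotlib Cheat Sheet

Importing libraries

We start by importing the libraries we need: pandas to read the data, and Plotly and matplotlib for visualization. Because this article also analyzes the data with SQL, we use the pandasql library as well.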

# For loading data
import pandas as pd
import numpy as np
# For SQL queries
import pandasql as ps
# For plotting graphs / visualization
import plotly.graph_objects as go
import plotly.express as px
from plotly.offline import iplot
import plotly.figure_factory as ff
import plotly.io as pio
import seaborn as sns
import matplotlib.pyplot as plt
# To show graph below the code or on same notebook
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
# To convert country code to country name
import country_converter as coco
import warnings
warnings.filterwarnings('ignore')

Loading the dataset

The dataset we downloaded is in CSV format, so we can read it with the read_csv method.

# Loading data
salaries = pd.read_csv('ds_salaries.csv')

To look at the first five records, we can use the salaries.head() method.

Doing the same task with pandasql looks like this:

# Helper that runs a SQL query against the DataFrames in scope
def query(sql):
    return ps.sqldf(sql)
# Showing Top 5 rows of data
query("""
        SELECT * 
        FROM salaries 
        LIMIT 5
""")

Output: the same first five rows as salaries.head().
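For readers curious about what the query helper does: by default pandasql copies the DataFrames it finds into an in-memory SQLite database and runs the SQL there. A rough equivalent that uses only pandas and the standard-library sqlite3 module might look like the sketch below. The toy rows and the run_sql name are invented for illustration; they are not part of the dataset or of pandasql.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the salaries DataFrame (values are invented)
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Engineer", "Data Analyst"],
    "salary_in_usd": [120000, 110000, 80000],
})

def run_sql(sql, **tables):
    """Load each DataFrame into in-memory SQLite, run sql, return a DataFrame."""
    con = sqlite3.connect(":memory:")
    try:
        for name, frame in tables.items():
            frame.to_sql(name, con, index=False)
        return pd.read_sql_query(sql, con)
    finally:
        con.close()

first_two = run_sql("SELECT * FROM salaries LIMIT 2", salaries=toy)
print(first_two)
```

Passing the tables explicitly, as here, avoids relying on variable lookup in the calling scope, at the cost of a slightly longer call.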

Data preprocessing

The first column of the dataset, "Unnamed: 0", is not useful, so we drop it before the analysis:

salaries = salaries.drop('Unnamed: 0', axis = 1)
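A column named "Unnamed: 0" usually appears when a CSV was saved together with its row index. Assuming that is what happened here, an alternative to dropping the column is to tell read_csv to treat the first column as the index. The sketch below uses a small inline CSV with invented values rather than the real file.

```python
import io
import pandas as pd

# Inline CSV mimicking a file saved with its row index (invented values)
csv_text = ",work_year,salary_in_usd\n0,2020,79833\n1,2021,150000\n"

# Default read: the unnamed first column becomes a data column
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0: the first column is used as the index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```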

Next we check the dataset for missing values:

salaries.isna().sum()

Output:

work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

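The dataset has no missing values, so no missing-value handling is needed. The employee_residence and company_location columns use short country codes, which we map to full country names to make them easier to read: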

# Converting countries code to country names
salaries["employee_residence"] = coco.convert(names=salaries["employee_residence"], to="name")
salaries["company_location"] = coco.convert(names=salaries["company_location"], to="name")

The experience_level column in this dataset encodes experience levels with the following abbreviations:

  • EN: Entry Level
  • MI: Mid level
  • SE: Senior Level
  • EX: Expert Level

To make them easier to read, we replace these abbreviations with the full names.

# Replacing values in column - experience_level :
salaries['experience_level'] = query("""SELECT 
                                          REPLACE(
                                            REPLACE(
                                              REPLACE(
                                                REPLACE(
                                                  experience_level, 'MI', 'Mid level'), 
                                                                    'SE', 'Senior Level'), 
                                                                    'EN', 'Entry Level'), 
                                                                    'EX', 'Expert Level') 
                                        FROM 
                                          salaries""")
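The nested REPLACE calls work, but they get harder to read as the number of codes grows. In pandas the same recoding is often written with Series.map and a dictionary. The sketch below runs on a toy Series (invented data) and reuses the labels from the SQL above.

```python
import pandas as pd

# Abbreviation -> full name, using the same labels as the SQL above
LEVELS = {
    "EN": "Entry Level",
    "MI": "Mid level",
    "SE": "Senior Level",
    "EX": "Expert Level",
}

codes = pd.Series(["EN", "MI", "SE", "EX", "SE"])  # toy data
full = codes.map(LEVELS)
print(full.tolist())
```

One design difference: map returns NaN for any code missing from the dictionary, so unexpected values show up instead of passing through unchanged.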

In the same way, we replace the employment type codes with their full names:

  • FT: Full Time
  • PT: Part Time
  • CT: Contract
  • FL: Freelance
# Replacing values in column - employment_type :
salaries['employment_type'] = query("""SELECT 
                                          REPLACE(
                                            REPLACE(
                                              REPLACE(
                                                REPLACE(
                                                  employment_type, 'PT', 'Part Time'), 
                                                                    'FT', 'Full Time'), 
                                                                    'FL', 'Freelance'), 
                                                                    'CT', 'Contract') 
                                        FROM 
                                          salaries""")

The company size column is handled the same way:

  • S: Small
  • M: Medium
  • L: Large
# Replacing values in column - company_size :
salaries['company_size'] = query("""SELECT 
                                       REPLACE(
                                         REPLACE(
                                           REPLACE(
                                             company_size, 'M', 'Medium'), 
                                                           'L', 'Large'), 
                                                           'S', 'Small') 
                                    FROM 
                                       salaries""")

We also recode the remote ratio column so it is easier to read:

# Replacing values in column - remote_ratio :
salaries['remote_ratio'] = query("""SELECT 
                                        REPLACE(
                                          REPLACE(
                                            REPLACE(
                                              remote_ratio, '100', 'Fully Remote'), 
                                                            '50', 'Partially Remote'), 
                                                            '0', 'Non Remote Work') 
                                    FROM 
                                      salaries""")
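With nested REPLACE the order of the substitutions matters, because REPLACE works on substrings: '0' also occurs inside '50' and '100'. The sketch below uses the standard-library sqlite3 module and a three-row toy table (an assumption for illustration) to compare the order used above, a wrong order, and a CASE expression that compares whole values.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (remote_ratio INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(0,), (50,), (100,)])

def column(sql):
    """Return the first column of the query result as a list."""
    return [row[0] for row in con.execute(sql)]

# Same order as above: '100' first, then '50', then '0'
good = column("""
    SELECT REPLACE(REPLACE(REPLACE(remote_ratio,
           '100', 'Fully Remote'), '50', 'Partially Remote'), '0', 'Non Remote Work')
    FROM t ORDER BY remote_ratio
""")

# Replacing '0' first also rewrites the zeros inside 50 and 100
bad = column("""
    SELECT REPLACE(REPLACE(REPLACE(remote_ratio,
           '0', 'Non Remote Work'), '50', 'Partially Remote'), '100', 'Fully Remote')
    FROM t ORDER BY remote_ratio
""")

# CASE compares whole values, so the order of the branches does not matter
safe = column("""
    SELECT CASE remote_ratio
             WHEN 100 THEN 'Fully Remote'
             WHEN 50  THEN 'Partially Remote'
             WHEN 0   THEN 'Non Remote Work'
           END
    FROM t ORDER BY remote_ratio
""")
con.close()
print(good, bad, safe, sep="\n")
```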

After preprocessing, the coded columns hold readable full names.

Data analysis & visualization

Which data science jobs are most common?

top10_jobs = query("""
                    SELECT job_title,
                    Count(*) AS job_count
                    FROM salaries
                    GROUP BY job_title
                    ORDER BY job_count DESC
                    LIMIT 10
""")

We draw a bar chart to make this easier to see:

data = go.Bar(x = top10_jobs['job_title'], y = top10_jobs['job_count'],
             text = top10_jobs['job_count'], textposition = 'inside',
             textfont = dict(size = 12,
                            color = 'white'),
             marker = dict(color = px.colors.qualitative.Alphabet,
                          opacity = 0.9,
                          line_color = 'black',
                          line_width = 1))
layout = go.Layout(title = {'text': "<b>Top 10 Data Science Jobs</b>", 
                            'x':0.5, 'xanchor': 'center'},
                   xaxis = dict(title = '<b>Job Title</b>', tickmode = 'array'),
                   yaxis = dict(title = '<b>Total</b>'),
                   width = 900,
                   height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()
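The COUNT / GROUP BY / ORDER BY / LIMIT query above has a compact pandas counterpart in value_counts. The sketch below runs on a toy Series of job titles (invented data) and produces the same two columns, job_title and job_count.

```python
import pandas as pd

titles = pd.Series(
    ["Data Scientist", "Data Engineer", "Data Scientist",
     "Data Analyst", "Data Engineer", "Data Scientist"],
    name="job_title",
)  # toy data

top_jobs = (
    titles.value_counts()            # rows per title, sorted descending
          .head(10)                  # keep at most the ten most frequent
          .rename_axis("job_title")
          .reset_index(name="job_count")
)
print(top_jobs)
```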

Market share of data science positions

fig = px.pie(top10_jobs, values='job_count',
              names='job_title', 
              color_discrete_sequence = px.colors.qualitative.Alphabet)
fig.update_layout(title = {'text': "<b>Distribution of job positions</b>", 
                            'x':0.5, 'xanchor': 'center'},
                   width = 900,
                   height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

Countries with the most data science jobs

top10_com_loc = query("""
                    SELECT company_location AS company,
                    Count(*) AS job_count
                    FROM salaries
                    GROUP BY company
                    ORDER BY job_count DESC
                    LIMIT 10
""")
data = go.Bar(x = top10_com_loc['company'], y = top10_com_loc['job_count'],
             textfont = dict(size = 12,
                            color = 'white'),
             marker = dict(color = px.colors.qualitative.Alphabet,
                          opacity = 0.9,
                          line_color = 'black',
                          line_width = 1))
layout = go.Layout(title = {'text': "<b>Top 10 Data Science Countries</b>", 
                            'x':0.5, 'xanchor': 'center'},
                   xaxis = dict(title = '<b>Countries</b>', tickmode = 'array'),
                   yaxis = dict(title = '<b>Total</b>'),
                   width = 900,
                   height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

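The chart above shows that the United States has the most data science job opportunities. Next we look at salaries around the world; run the code below to see the visualization.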

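# Note: df is the same object as salaries, so company_country is also added to salaries
df = salaries
df["company_country"] = coco.convert(names = salaries["company_location"], to = 'name_short')
temp_df = df.groupby('company_country')['salary_in_usd'].sum().reset_index()
# Take the log of the aggregated totals (not the per-row salaries) so rows stay aligned
temp_df['salary_scale'] = np.log10(temp_df['salary_in_usd'])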
fig = px.choropleth(temp_df, locationmode = 'country names', locations = "company_country",
                   color = "salary_scale", hover_name = "company_country",
                   hover_data = temp_df[['salary_in_usd']], 
                    color_continuous_scale = 'Jet',
                   )
fig.update_layout(title={'text':'<b>Salaries across the World</b>', 
                         'xanchor': 'center','x':0.5})
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()
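The map is colored by the base-10 logarithm of each country's total salary. The log compresses the range, so one country with a very large total does not flatten the color scale for all the others. The sketch below shows the aggregation and scaling step on invented data; A, B, and C are placeholder country names. Because the totals are sums, they also grow with the number of records per country.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "company_country": ["A", "A", "A", "B", "C"],
    "salary_in_usd": [100_000, 150_000, 250_000, 50_000, 5_000],
})  # invented values

totals = toy.groupby("company_country")["salary_in_usd"].sum().reset_index()
# Each factor of 10 in the total becomes a step of 1 on the color scale
totals["salary_scale"] = np.log10(totals["salary_in_usd"])
print(totals)
```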

Average salary (by currency)

df = salaries[['salary_currency','salary_in_usd']].groupby(['salary_currency'], as_index = False).mean().set_index('salary_currency').reset_index().sort_values('salary_in_usd', ascending = False)
#Selecting top 14
df = df.iloc[:14]
fig = px.bar(df, x = 'salary_currency',
            y = 'salary_in_usd',
            color = 'salary_currency',
            color_discrete_sequence = px.colors.qualitative.Safe,
            )
fig.update_layout(title={'text':'<b>Average salary as a function of currency</b>', 
                         'xanchor': 'center','x':0.5},
                 xaxis_title = '<b>Currency</b>',
                 yaxis_title = '<b>Mean Salary</b>')
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

People paid in US dollars earn the most, followed by those paid in Swiss francs and Singapore dollars.
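A group mean says little when the group holds only a handful of records, so it can help to report the group size next to the mean. The sketch below does this with named aggregation on invented data; the column names mean_salary and n are illustrative choices.

```python
import pandas as pd

toy = pd.DataFrame({
    "salary_currency": ["USD", "USD", "USD", "CHF", "EUR", "EUR"],
    "salary_in_usd": [150_000, 120_000, 90_000, 122_000, 70_000, 60_000],
})  # invented values

by_currency = (
    toy.groupby("salary_currency")["salary_in_usd"]
       .agg(mean_salary="mean", n="count")   # mean and number of records per currency
       .sort_values("mean_salary", ascending=False)
       .reset_index()
)
print(by_currency)
```

In this toy data the top-ranked currency is backed by a single record, which is the kind of case the n column makes visible.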

df = salaries[['company_country','salary_in_usd']].groupby(['company_country'], as_index = False).mean().set_index('company_country').reset_index().sort_values('salary_in_usd', ascending = False)
#Selecting top 14
df = df.iloc[:14]
fig = px.bar(df, x = 'company_country',
            y = 'salary_in_usd',
            color = 'company_country',
            color_discrete_sequence = px.colors.qualitative.Dark2,
            )
fig.update_layout(title = {'text': "<b>Average salary as a function of company location</b>", 
                            'x':0.5, 'xanchor': 'center'},
                   xaxis = dict(title = '<b>Company Location</b>', tickmode = 'array'),
                   yaxis = dict(title = '<b>Mean Salary</b>'),
                   width = 900,
                   height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

Distribution of experience levels in data science jobs

job_exp = query("""
            SELECT experience_level, Count(*) AS job_count
            FROM salaries
            GROUP BY experience_level
            ORDER BY job_count ASC
""")
data = go.Bar(x = job_exp['job_count'], y = job_exp['experience_level'],
              orientation = 'h', text = job_exp['job_count'],
             marker = dict(color = px.colors.qualitative.Alphabet,
                          opacity = 0.9,
                          line_color = 'white',
                          line_width = 2))
layout = go.Layout(title = {'text': "<b>Jobs on Experience Levels</b>",
                           'x':0.5, 'xanchor':'center'},
                  xaxis = dict(title='<b>Total</b>', tickmode = 'array'),
                  yaxis = dict(title='<b>Experience lvl</b>'),
                  width = 900,
                  height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor = '#f1e7d2', 
                  paper_bgcolor = '#f1e7d2')
fig.show()

The chart above shows that most data science jobs are at the senior level, and very few are at the expert level.

Distribution of employment types in data science jobs

job_emp = query("""
SELECT employment_type,
COUNT(*) AS job_count
FROM salaries
GROUP BY employment_type
ORDER BY job_count ASC
""")
data =  go.Bar(x = job_emp['job_count'], y = job_emp['employment_type'], 
               orientation ='h',text = job_emp['job_count'],
               textposition ='outside',
               marker = dict(color = px.colors.qualitative.Alphabet,
                             opacity = 0.9,
                             line_color = 'white',
                             line_width = 2))
layout = go.Layout(title = {'text': "<b>Jobs on Employment Type</b>",
                           'x':0.5, 'xanchor': 'center'},
                   xaxis = dict(title='<b>Total</b>', tickmode = 'array'),
                   yaxis =dict(title='<b>Emp Type lvl</b>'),
                   width = 900,
                   height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor = '#f1e7d2', 
                  paper_bgcolor = '#f1e7d2')
fig.show()

The chart above shows that most data scientists work full time, while contractors and freelancers are far fewer.

Trend in the number of data science jobs

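# Order by year so the line is drawn chronologically
# (ORDER BY 'job count' would sort by a constant string, not by the count)
job_year = query("""
    SELECT work_year, COUNT(*) AS 'job count'
    FROM salaries
    GROUP BY work_year
    ORDER BY work_year
""")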
data = go.Scatter(x = job_year['work_year'], y = job_year['job count'],
                  marker = dict(size = 20,
                                line_width = 1.5,
                                line_color = 'white',
                                color = px.colors.qualitative.Alphabet),
                  line = dict(color = '#ED7D31', width = 4), mode = 'lines+markers')
layout  = go.Layout(title = {'text' : "<b><i>Data Science jobs Growth (2020 to 2022)</i></b>",
                             'x' : 0.5, 'xanchor' : 'center'},
                    xaxis = dict(title = '<b>Year</b>'),
                    yaxis = dict(title = '<b>Jobs</b>'),
                    width = 900,
                    height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_xaxes(tickvals = [2020, 2021, 2022])
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()
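Raw counts per year can also be expressed as shares of all records, which is how a statement such as "X% of the records come from 2022" is obtained. The sketch below uses value_counts(normalize=True) on a toy Series of years (invented data).

```python
import pandas as pd

years = pd.Series([2020, 2021, 2021, 2022, 2022, 2022, 2022, 2022])  # toy data

share = (
    years.value_counts(normalize=True)  # fraction of rows per year
         .sort_index()                  # chronological order for plotting
         .mul(100)                      # convert to percent
         .round(1)
)
print(share)
```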

Distribution of data science salaries

salary_usd = query("""
                    SELECT salary_in_usd 
                    FROM salaries
""")
import matplotlib.pyplot as plt
plt.figure(figsize = (20, 8))
sns.set(rc = {'axes.facecolor' : '#f1e7d2',
             'figure.facecolor' : '#f1e7d2'})
p = sns.histplot(salary_usd["salary_in_usd"], 
                kde = True, alpha = 1, fill = True,
                edgecolor = 'black', linewidth = 1)
p.axes.lines[0].set_color("orange")
plt.title("Data Science Salary Distribution \n", fontsize = 25)
plt.xlabel("Salary", fontsize = 18)
plt.ylabel("Count", fontsize = 18)
plt.show()
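A histogram like this one is often summarized by its quartiles and interquartile range (IQR), the span that contains the middle half of the values. The sketch below computes them with the standard-library statistics module on invented salaries; the "inclusive" method is one of several quantile conventions, so other tools may give slightly different numbers.

```python
import statistics

# Invented salaries with one large outlier at the top
salaries_usd = [40_000, 60_000, 80_000, 100_000, 120_000,
                140_000, 160_000, 180_000, 400_000]

# n=4 returns the three cut points: Q1, median, Q3
q1, median, q3 = statistics.quantiles(salaries_usd, n=4, method="inclusive")
iqr = q3 - q1
print(q1, median, q3, iqr)
```

Unlike the mean, the quartiles barely move when the single outlier changes, which is why they suit right-skewed salary data.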

Top 10 highest-paid data science jobs

# Group by job title so each title appears once, with its highest salary
salary_hi10 = query("""
    SELECT job_title,
    MAX(salary_in_usd) AS salary
    FROM salaries
    GROUP BY job_title
    ORDER BY salary DESC
    LIMIT 10
""")
data = go.Bar(x = salary_hi10['salary'],
             y = salary_hi10['job_title'],
             orientation = 'h',
             text = salary_hi10['salary'],
             textposition = 'inside',
             insidetextanchor = 'middle',
              textfont = dict(size = 13,
                             color = 'black'),
              marker = dict(color = px.colors.qualitative.Alphabet,
                           opacity = 0.9,
                           line_color = 'black',
                           line_width = 1))
layout = go.Layout(title = {'text': "<b>Top 10 Highest paid Data Science Jobs</b>",
                           'x':0.5,
                           'xanchor': 'center'},
                   xaxis = dict(title = '<b>salary</b>', tickmode = 'array'),
                   yaxis = dict(title = '<b>Job Title</b>'),
                   width = 900,
                   height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

Principal Data Engineer is the highest-paid job in data science.
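A ranking by MAX per job title lists each title once with its single highest salary, and grouping by the title is what makes that happen. The sketch below shows the pattern with the standard-library sqlite3 module on invented rows; the alias top_salary is an illustrative name.

```python
import sqlite3

rows = [
    ("Data Scientist", 120_000),
    ("Data Scientist", 412_000),
    ("Principal Data Engineer", 600_000),
    ("Data Analyst", 80_000),
]  # invented rows

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE salaries (job_title TEXT, salary_in_usd INTEGER)")
con.executemany("INSERT INTO salaries VALUES (?, ?)", rows)

# One output row per title, carrying that title's highest salary
top = con.execute("""
    SELECT job_title, MAX(salary_in_usd) AS top_salary
    FROM salaries
    GROUP BY job_title
    ORDER BY top_salary DESC
""").fetchall()
con.close()
print(top)
```

Because the ranking rests on one record per title, a single unusually high salary is enough to put a title at the top.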

Average salary and ranking by job title

salary_av10 = query("""
    SELECT job_title,
    ROUND(AVG(salary_in_usd)) AS salary
    FROM salaries
    GROUP BY job_title
    ORDER BY salary DESC
    LIMIT 10
""")
data = go.Bar(x = salary_av10['salary'],
             y = salary_av10['job_title'],
             orientation = 'h',
             text = salary_av10['salary'],
             textposition = 'inside',
             insidetextanchor = 'middle',
              textfont = dict(size = 13,
                             color = 'white'),
              marker = dict(color = px.colors.qualitative.Alphabet,
                           opacity = 0.9,
                           line_color = 'white',
                           line_width = 2))
layout = go.Layout(title = {'text': "<b>Top 10 Average paid Data Science Jobs</b>",
                           'x':0.5,
                           'xanchor': 'center'},
                   xaxis = dict(title = '<b>salary</b>', tickmode = 'array'),
                   yaxis = dict(title = '<b>Job Title</b>'),
                   width = 900,
                   height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

Data science salary trend

# Order by year so the line is drawn chronologically
salary_year = query("""
    SELECT ROUND(AVG(salary_in_usd)) AS salary,
    work_year AS year
    FROM salaries
    GROUP BY year
    ORDER BY year
""")
data = go.Scatter(x = salary_year['year'],
                 y = salary_year['salary'],
                 marker = dict(size = 20,
                 line_width = 1.5,
                 line_color = 'black',
                 color = '#ED7D31'),
                 line = dict(color = 'black', width = 4), mode = 'lines+markers')
layout = go.Layout(title = {'text' : "<b>Data Science Salary Growth (2020 to 2022) </b>",
                            'x' : 0.5,
                            'xanchor' : 'center'},
                   xaxis = dict(title = '<b>Year</b>'),
                   yaxis = dict(title = '<b>Salary</b>'),
                   width = 900,
                   height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_xaxes(tickvals = [2020, 2021, 2022])
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

Experience level & salary

salary_exp = query("""
    SELECT experience_level AS 'Experience Level',
    salary_in_usd AS Salary
    FROM salaries
""")
fig = px.violin(salary_exp, x = 'Experience Level', y = 'Salary', color = 'Experience Level', box = True)
fig.update_layout(title = {'text': "<b>Salary on Experience Level</b>",
                            'xanchor': 'center','x':0.5},
                   xaxis = dict(title = '<b>Experience level</b>'),
                   yaxis = dict(title = '<b>salary</b>'),
                   width = 900,
                   height = 600)
fig.update_layout(paper_bgcolor= '#f1e7d2', 
                  plot_bgcolor = '#f1e7d2', 
                  showlegend = False)
fig.show()

Salary trend by experience level

# Take the median of salary_in_usd only; most other columns are text after preprocessing
tmp_df = salaries.groupby(['work_year', 'experience_level'], as_index = False)['salary_in_usd'].median()
fig = px.line(tmp_df, x='work_year', y='salary_in_usd', color='experience_level', symbol="experience_level")
fig.update_layout(title = {'text': "<b>Median Salary Trend By Experience Level</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  xaxis = dict(title = '<b>Working Year</b>', tickvals = [2020, 2021, 2022], tickmode = 'array'),
                  yaxis = dict(title = '<b>Salary</b>'),
                  width = 900,
                  height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

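Observations:
  1. During the COVID-19 pandemic (2020 to 2021), expert-level salaries were very high but showed a partial decline.
  2. After 2021, salaries for expert-level and senior-level staff rose.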
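Observations like these are easier to check against a small table of medians, with years as rows and experience levels as columns. The sketch below builds such a table with pivot_table on invented data.

```python
import pandas as pd

toy = pd.DataFrame({
    "work_year": [2020, 2020, 2021, 2021, 2021, 2021],
    "experience_level": ["Entry Level", "Senior Level", "Entry Level",
                         "Entry Level", "Senior Level", "Senior Level"],
    "salary_in_usd": [50_000, 120_000, 55_000, 65_000, 130_000, 150_000],
})  # invented values

# Rows: year, columns: experience level, cells: median salary
medians = toy.pivot_table(index="work_year", columns="experience_level",
                          values="salary_in_usd", aggfunc="median")
print(medians)
```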

Year & salary distribution

year_gp = salaries.groupby('work_year')
hist_data = [year_gp.get_group(2020)['salary_in_usd'],
             year_gp.get_group(2021)['salary_in_usd'],
            year_gp.get_group(2022)['salary_in_usd']]
group_labels = ['2020', '2021', '2022']
fig = ff.create_distplot(hist_data, group_labels, show_hist = False)
fig.update_layout(title = {'text': "<b>Salary Distribution By Working Year</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  xaxis = dict(title = '<b>Salary</b>'),
                  yaxis = dict(title = '<b>Kernel Density</b>'),
                  width = 900,
                  height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

Employment type & salary

salary_emp = query("""
    SELECT employment_type AS 'Employment Type',
    salary_in_usd AS Salary
    FROM salaries
""")
fig = px.box(salary_emp,x='Employment Type',y='Salary',
       color = 'Employment Type')
fig.update_layout(title = {'text': "<b>Salary by Employment Type</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  xaxis = dict(title = '<b>Employment Type</b>'),
                  yaxis = dict(title = '<b>Salary</b>'),
                  width = 900,
                  height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

Company size distribution

comp_size = query("""
                SELECT company_size,
                COUNT(*) AS count
                FROM salaries
                GROUP BY company_size
""")
import plotly.graph_objects as go
data = go.Pie(labels = comp_size['company_size'], 
              values = comp_size['count'].values,
              hoverinfo = 'label',
              hole = 0.5,
              textfont_size = 16,
              textposition = 'auto')
fig = go.Figure(data = data)
fig.update_layout(title = {'text': "<b>Company Size</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  xaxis = dict(title = '<b></b>'),
                  yaxis = dict(title = '<b></b>'),
                  width = 900,
                  height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

Share of experience levels by company size

df = salaries.groupby(['company_size', 'experience_level']).size()
comp_s = np.round(df['Small'].values / df['Small'].values.sum(),2)
comp_m = np.round(df['Medium'].values / df['Medium'].values.sum(),2)
comp_l = np.round(df['Large'].values / df['Large'].values.sum(),2)
fig = go.Figure()
categories = ['Entry Level', 'Expert Level','Mid level','Senior Level']
fig.add_trace(go.Scatterpolar(
    r = comp_s,
    theta = categories,
    fill = 'toself',
    name = 'Company Size S'))
fig.add_trace(go.Scatterpolar(
    r = comp_m,
    theta = categories,
    fill = 'toself',
    name = 'Company Size M'))
fig.add_trace(go.Scatterpolar(
    r = comp_l,
    theta = categories,
    fill = 'toself',
    name = 'Company Size L'))
fig.update_layout(
    polar = dict(
    radialaxis = dict(range = [0, 0.6])),
    showlegend = True,
)
fig.update_layout(title = {'text': "<b>Proportion of Experience Level In Different Company Sizes</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  xaxis = dict(title = '<b></b>'),
                  yaxis = dict(title = '<b></b>'),
                  width = 900,
                  height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()
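The proportions above are computed by indexing the grouped counts, which assumes every company size contains all four experience levels, in the same alphabetical order as the categories list. A sketch of an alternative that does not depend on this is pd.crosstab with normalize="index", which fills missing combinations with 0. The rows below are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "company_size": ["Small", "Small", "Medium", "Medium",
                     "Medium", "Medium", "Large", "Large"],
    "experience_level": ["Entry Level", "Mid level", "Senior Level", "Senior Level",
                         "Mid level", "Entry Level", "Senior Level", "Expert Level"],
})  # invented rows

# Each row (company size) is normalized to sum to 1
shares = pd.crosstab(toy["company_size"], toy["experience_level"],
                     normalize="index").round(2)
print(shares)
```

A row of this table, for example shares.loc["Small"].values, could then be passed as r to go.Scatterpolar with shares.columns as theta.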

Company size & salary

salary_size = query("""
    SELECT company_size AS 'Company size',
    salary_in_usd AS Salary
    FROM salaries
""")
fig = px.box(salary_size, x='Company size', y = 'Salary',
             color = 'Company size')
fig.update_layout(title = {'text': "<b>Salary by Company size</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  xaxis = dict(title = '<b>Company size</b>'),
                  yaxis = dict(title = '<b>Salary</b>'),
                  width = 900,
                  height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

Share of WFH (work from home) and WFO (work from office)

rem_type = query("""
    SELECT remote_ratio,
    COUNT(*) AS total
    FROM salaries
    GROUP BY remote_ratio
""")
data = go.Pie(labels = rem_type['remote_ratio'], values = rem_type['total'].values,
             hoverinfo = 'label',
             hole = 0.4,
             textfont_size = 18,
             textposition = 'auto')
fig = go.Figure(data = data)
fig.update_layout(title = {'text': "<b>Remote Ratio</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  width = 900,
                  height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

How remote type affects salary

salary_remote = query("""
    SELECT remote_ratio AS 'Remote type',
    salary_in_usd AS Salary
    From salaries
""")
fig = px.box(salary_remote, x = 'Remote type', y = 'Salary', color = 'Remote type')
fig.update_layout(title = {'text': "<b>Salary by Remote Type</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  xaxis = dict(title = '<b>Remote type</b>'),
                  yaxis = dict(title = '<b>Salary</b>'),
                  width = 900,
                  height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

Experience level & remote ratio

exp_remote = salaries.groupby(['experience_level', 'remote_ratio']).count()
exp_remote.reset_index(inplace = True)
fig = px.histogram(exp_remote, x = 'experience_level',
                  y = 'work_year', color = 'remote_ratio',
                  barmode = 'group',
                  text_auto = True)
fig.update_layout(title = {'text': "<b>Respondent Count In Different Experience Level Based on Remote Ratio</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  xaxis = dict(title = '<b>Experience Level</b>'),
                  yaxis = dict(title = '<b>Number of Respondents</b>'),
                  width = 900,
                  height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

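Conclusions

  • The three most common data science positions are Data Scientist, Data Engineer, and Data Analyst.

  • Data science jobs are becoming more popular: the share of employees rose from 11.9% in 2020 to 52.4% in 2022.

  • The United States has the most data science companies.

  • The IQR of the salary distribution runs from 62.7k to 150k.

  • Most data science employees are at the senior level, and far fewer are at the expert level.

  • Most data science employees work full time; very few are contractors or freelancers.

  • Principal Data Engineer is the highest-paid data science job.

  • The lowest data science salary (entry-level experience) is 4,000 USD, and the highest, for expert-level experience, is 600,000 USD.

  • Company mix: 53.7% medium-sized companies, 32.6% large companies, and 13.7% small companies.

  • Pay is also affected by company size: larger companies pay higher salaries.

  • 62.8% of data science jobs are fully remote, 20.9% are non-remote, and 16.3% are partially remote.

  • Data science salaries grow with time and accumulated experience.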

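References

  • Glassdoor
  • pandasql
  • Data Science Job Salaries dataset (Kaggle)
  • Illustrated Data Analysis: From Beginner to Proficient tutorial series: www.showmeai.tech/tutorials/3…
  • Programming Language Cheat Sheets | SQL Cheat Sheet: www.showmeai.tech/article-det…
  • Data Science Library Cheat Sheets | Pandas Cheat Sheet: www.showmeai.tech/article-det…
  • Data Science Library Cheat Sheets | Matplotlib Cheat Sheet: www.showmeai.tech/article-det…

Recommended reading

  • Data Analysis in Practice series: www.showmeai.tech/tutorials/4…
  • Machine Learning Data Analysis in Practice series: www.showmeai.tech/tutorials/4…
  • Deep Learning Data Analysis in Practice series: www.showmeai.tech/tutorials/4…
  • TensorFlow Data Analysis in Practice series: www.showmeai.tech/tutorials/4…
  • PyTorch Data Analysis in Practice series: www.showmeai.tech/tutorials/4…
  • NLP Data Analysis in Practice series: www.showmeai.tech/tutorials/4…
  • CV Data Analysis in Practice series: www.showmeai.tech/tutorials/4…
  • AI Interview Question Bank series: www.showmeai.tech/tutorials/4…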
