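About the author:

Bi Haicheng is a senior development engineer at Qunar (去哪儿旅行), mainly responsible for development and operations of the network platform, the hardware platform, and Kubernetes.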

I. Background

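HPA (Horizontal Pod Autoscaler) is an important Kubernetes feature: it scales a workload out and in automatically based on metrics such as CPU and memory, which has many obvious advantages over manual scaling. Because our Kubernetes deployment spans multiple clusters, we need a multi-cluster HPA. Users only configure the CPU, memory, and custom metric thresholds that suit their appcode's resource usage, together with the minimum and maximum replica counts the appcode needs overall. The multi-cluster HPA then allocates the minimum and maximum replica counts of each cluster automatically according to the cluster weights.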

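Cluster weights can be set in two ways: manually or automatically. Manual weights keep the user in control. For example, during data-center maintenance or the upgrade of certain components, an operator sets a cluster's weight to 0 so that applications are no longer released to that cluster. The drawback is that manual weights are not adjusted in real time: resource usage in each cluster changes constantly, so the weights should follow current usage. This led to the second way, automatic adjustment, which computes the resources currently remaining in each cluster and sets the weights accordingly. If several appcodes scale out concurrently, a resource lock is also needed. The current scheme combines the strengths of both: it supports manual and automatic adjustment, and manual adjustment takes priority over automatic adjustment.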
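As a rough illustration of the weighting scheme described above, the following sketch splits an appcode's overall replica bounds across clusters in proportion to their weights and lets a manual weight override the automatic one. The function names, the largest-remainder rounding, and the sample weights are assumptions made for this example, not the actual multi-cluster HPA implementation.

```python
from typing import Dict, Optional


def effective_weights(auto: Dict[str, float],
                      manual: Dict[str, Optional[float]]) -> Dict[str, float]:
    """Manual weights take priority; clusters without one keep the automatic weight."""
    return {
        cluster: manual[cluster] if manual.get(cluster) is not None else weight
        for cluster, weight in auto.items()
    }


def split_replicas(total: int, weights: Dict[str, float]) -> Dict[str, int]:
    """Split `total` replicas by weight with largest-remainder rounding (assumed policy)."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("at least one cluster needs a positive weight")
    exact = {c: total * w / weight_sum for c, w in weights.items()}
    result = {c: int(v) for c, v in exact.items()}
    leftover = total - sum(result.values())
    # Hand the remaining replicas to the clusters with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda c: exact[c] - result[c], reverse=True)
    for cluster in by_fraction[:leftover]:
        result[cluster] += 1
    return result


# Hypothetical weights: automatic ones from remaining resources, one manual override.
auto_weights = {"cluster-a": 50.0, "cluster-b": 30.0, "cluster-c": 20.0}
manual_weights = {"cluster-c": 0.0}  # e.g. cluster-c is under maintenance
weights = effective_weights(auto_weights, manual_weights)
min_per_cluster = split_replicas(4, weights)
max_per_cluster = split_replicas(10, weights)
print(min_per_cluster, max_per_cluster)
```

With a weight of 0, cluster-c receives no replicas, which matches the maintenance case mentioned above.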

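Given this understanding of HPA, many people may assume that HPA always reduces resource usage, since its most obvious trait is scaling in at off-peak times and scaling out at peak times, allocating resources on demand. But is that what actually happens? Are the configured minimum and maximum replica counts reasonable? If the minimum replica count is set too high, can the workload still scale in?

What benefits can HPA bring?

  • Higher resource utilization
  • Lower labor cost (no need to adjust the number of servers by hand)
  • A more detailed and intuitive view of HPA's running state and of applications' resource needs, which helps operations decisions

How can these benefits be demonstrated?

The metrics used to measure them are:

  • Number of scale-outs
  • Number of scale-ins
  • Number of times scale-out reached the upper limit
  • Number of times scale-in reached the lower limit
  • HPA thresholds (CPU, memory, custom)
  • Minimum replica count
  • Maximum replica count
  • Replica count and mean CPU usage during peak hours (if the workload stays at the maximum replica count throughout peak hours, the HPA upper limit can be raised appropriately)
  • Replica count and mean CPU usage during off-peak hours (if the workload stays at the minimum replica count throughout off-peak hours, the HPA lower limit can be lowered appropriately)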
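The last two points can be read as simple rules of thumb. The sketch below encodes them. The `HpaDayStats` fields, the tolerance of 0.5 replicas, and the wording of the hints are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HpaDayStats:
    """One appcode's daily figures (hypothetical field names)."""
    min_replicas: int
    max_replicas: int
    peak_pods: float      # average replica count during peak hours
    offpeak_pods: float   # average replica count during off-peak hours


def tuning_hints(stats: HpaDayStats, tolerance: float = 0.5) -> List[str]:
    """Return suggestions when the workload sits at its replica bounds."""
    hints = []
    if stats.peak_pods >= stats.max_replicas - tolerance:
        hints.append("peak hours stay at max replicas: consider raising the HPA upper limit")
    if stats.offpeak_pods <= stats.min_replicas + tolerance:
        hints.append("off-peak hours stay at min replicas: consider lowering the HPA lower limit")
    return hints


pinned = tuning_hints(HpaDayStats(min_replicas=4, max_replicas=10, peak_pods=10.0, offpeak_pods=4.0))
healthy = tuning_hints(HpaDayStats(min_replicas=2, max_replicas=10, peak_pods=6.0, offpeak_pods=3.0))
print(pinned, healthy)
```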

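II. HPA statistics

Data collection

Of the metrics above, the scale-out count (uc), the scale-in count (dc), the number of times scale-out reached the upper limit (maxuc), and the number of times scale-in reached the lower limit (mindc) have to be collected. We create a new table, tbxxx (the HPA metrics table), to store (uc, dc, maxuc, mindc).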

create table tb_hpa_xxxx(
    id SERIAL PRIMARY KEY,
    appcode varchar(256),
    uc int DEFAULT 0,
    dc int DEFAULT 0,
    maxuc int DEFAULT 0,
    mindc int DEFAULT 0,
    create_time timestamptz NOT NULL DEFAULT now(),
    update_time timestamptz NOT NULL DEFAULT now()
);
COMMENT ON TABLE tb_hpa_xxxx IS 'HPA metrics collection';
COMMENT ON COLUMN tb_hpa_xxxx.id IS 'auto-increment ID';
COMMENT ON COLUMN tb_hpa_xxxx.appcode IS 'appcode';
COMMENT ON COLUMN tb_hpa_xxxx.uc IS 'number of scale-outs';
COMMENT ON COLUMN tb_hpa_xxxx.dc IS 'number of scale-ins';
COMMENT ON COLUMN tb_hpa_xxxx.maxuc IS 'number of times the maximum replica count was reached';
COMMENT ON COLUMN tb_hpa_xxxx.mindc IS 'number of times the minimum replica count was reached';
COMMENT ON COLUMN tb_hpa_xxxx.create_time IS 'creation time';
COMMENT ON COLUMN tb_hpa_xxxx.update_time IS 'update time';
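The article does not show how the four counters are filled. One plausible way, sketched here purely as an assumption, is to classify every replica change observed for an appcode: a change upward counts as a scale-out, a change downward as a scale-in, and reaching a bound additionally bumps maxuc or mindc. The function name and the sample events are hypothetical.

```python
from collections import Counter


def count_scaling_event(counters: Counter, old: int, new: int,
                        min_replicas: int, max_replicas: int) -> None:
    """Update uc/dc/maxuc/mindc for one replica change (assumed counting rules)."""
    if new > old:
        counters["uc"] += 1
        if new >= max_replicas:
            counters["maxuc"] += 1
    elif new < old:
        counters["dc"] += 1
        if new <= min_replicas:
            counters["mindc"] += 1


counters = Counter()
# (old, new) replica counts for a made-up day: two scale-outs, then two scale-ins.
for old, new in [(2, 4), (4, 6), (6, 3), (3, 2)]:
    count_scaling_event(counters, old, new, min_replicas=2, max_replicas=6)
print(dict(counters))
```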

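Scaling count statistics

Using the data collected above together with the HPA configuration, compute for each appcode and env: uc (number of scale-outs), dc (number of scale-ins), maxuc (number of times scale-out reached the upper limit), mindc (number of times scale-in reached the lower limit), the HPA thresholds (the HPA configuration: CPU, memory, and custom metric thresholds), and the minimum and maximum replica counts across all clusters.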

select G.*,
   N.min_replicas,
   N.max_replicas
from (
        select A.deployment_base as env,
            A.appcode,
            A.annotations as hpa,
            coalesce(M.uc, 0) as uc,
            coalesce(M.dc, 0) as dc,
            coalesce(M.maxuc, 0) as maxuc,
            coalesce(M.mindc, 0) as mindc
        from (
                 select appcode,
                    deployment_base,
                    detail->'metadata'->'annotations' as annotations
                    from tb_k8s_hpaxxx
                    where dep_status = 0
                      and status = 0
                    group by appcode,
                      deployment_base,
                      detail->'metadata'->'annotations'
                ) A
                left join (
                    select appcode,
                       env_name,
                       sum(uc) as uc,
                       sum(dc) as dc,
                       sum(maxuc) as maxuc,
                       sum(mindc) as mindc
                    from tb_hpa_metrics
                    where create_time >= '2022-06-10' 
                       and create_time < '2022-06-11'
                    group by appcode,
                       env_name
                 ) M on M.appcode = A.appcode
                 and M.env_name = A.deployment_base
    ) G
    left join tb_k8s_appcode_hpa N on G.appcode = N.appcode
    and G.env = N.deployment_base;

Does Kubernetes HPA always reduce resource usage? Sharing our HPA observability practice!

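III. Container CPU usage statistics

When using Prometheus to compute the CPU usage of containers in a Kubernetes environment, you will often see CPU usage above 100%. The metrics involved are explained below.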

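1. container_spec_cpu_period

The length of the CFS scheduling time window used when the container's CPU is limited, also called the container's CPU period. It is usually 100,000 microseconds.

2. container_spec_cpu_quota

The total CPU time the container may use in each period. A quota of 700,000 means the container can use 7 × 100,000 microseconds of CPU time per period, that is, 7 CPUs. It normally corresponds to the Kubernetes resources.limits.cpu value.

3. container_spec_cpu_shares

The container's relative share of the host's CPU. It is a weight rather than an amount of time: Kubernetes derives it from resources.requests.cpu at 1024 shares per CPU, so a request of 500m (0.5 CPU) becomes 512 shares.

4. container_cpu_usage_seconds_total

The cumulative CPU time consumed by the container, in seconds. Note that it is summed over all cores the container uses, so its per-second rate can exceed 1 (that is, 100% of one core).

5. container_cpu_system_seconds_total

The cumulative CPU time the container has consumed in kernel mode, in seconds.

6. container_cpu_user_seconds_total

The cumulative CPU time the container has consumed in user mode, in seconds.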

(Official reference: github.com/google/cadv… )
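To see why the reported usage can exceed 100%, it helps to compute it by hand. The sketch below takes two samples of container_cpu_usage_seconds_total and expresses the per-second rate either against a single core or against the container's limit (quota divided by period). The sample values and function names are made up for illustration.

```python
def cpu_rate(prev_seconds: float, curr_seconds: float, interval_seconds: float) -> float:
    """CPU cores consumed per second between two samples of the cumulative counter."""
    return (curr_seconds - prev_seconds) / interval_seconds


def limit_cores(quota_us: int, period_us: int) -> float:
    """CPU limit in cores, from container_spec_cpu_quota and container_spec_cpu_period."""
    return quota_us / period_us


# Two samples of container_cpu_usage_seconds_total taken 60 s apart (invented numbers).
rate = cpu_rate(prev_seconds=1200.0, curr_seconds=1410.0, interval_seconds=60.0)
limit = limit_cores(quota_us=700_000, period_us=100_000)

usage_vs_one_core = rate * 100        # relative to one core: can be far above 100
usage_vs_limit = rate / limit * 100   # relative to the container's own limit
print(usage_vs_one_core, usage_vs_limit)
```

Here the container keeps 3.5 cores busy, which reads as 350% when measured against one core but as 50% of its 7-core limit. Dividing by the limit is what keeps the figure within 0 to 100%.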


Query the average of each cluster's P50, P90, and P99:

select appcode,
      avg(p50) as p50,
      avg(p90) as p90,
      avg(p99) as p99,
      avg(mean) as mean
from tb_cpu_usage_statx
where sampling_point = 'day'
     and stat_start >= '2022-06-08'
     and stat_end < '2022-06-09'
group by appcode;


Query the P50, P90, and P99 of each cluster's P50, P90, and P99:

select appcode,
       percentile_cont(0.5) within group (
           order by p50
       ) as p50,
       percentile_cont(0.9) within group (
           order by p90
       ) as p90, 
       percentile_cont(0.99) within group (
           order by p99
       ) as p99,
       avg(mean) as mean
from tb_cpu_usage_stat
where sampling_point = 'day'
    and stat_start >= '2022-06-08' 
    and stat_end < '2022-06-09'
group by appcode;


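The P90 of the per-cluster P90 values is not the same as the P90 computed over all clusters together. The figure gives an example; P99 and P50 behave the same way.

[Figure: example comparing the P90 of per-cluster P90 values with the P90 computed without splitting by cluster]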

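So using the existing pre-aggregated data directly would be inaccurate. The peak and off-peak CPU usage has to be recomputed from the raw data.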
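A small numeric experiment makes the difference concrete. The helper below mirrors the linear interpolation that PostgreSQL's percentile_cont performs. The two clusters and their CPU samples are invented for this sketch.

```python
from typing import Dict, List, Sequence


def percentile_cont(values: Sequence[float], fraction: float) -> float:
    """Percentile with linear interpolation between the two nearest ordered values."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


# Invented CPU usage samples: cluster-a has two bursts, cluster-b is flat.
samples: Dict[str, List[float]] = {
    "cluster-a": [10, 10, 10, 10, 10, 10, 10, 10, 90, 90],
    "cluster-b": [20, 20, 20, 20, 20, 20, 20, 20, 20, 20],
}

per_cluster_p90 = [percentile_cont(v, 0.9) for v in samples.values()]
p90_of_p90 = percentile_cont(per_cluster_p90, 0.9)

all_samples = [x for v in samples.values() for x in v]
global_p90 = percentile_cont(all_samples, 0.9)

print(per_cluster_p90, p90_of_p90, global_p90)
```

With these samples the P90 of the per-cluster P90 values comes out at roughly 83, while the P90 over all raw samples is roughly 27, which is why the statistics have to be recomputed from the raw rows.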

Peak-period CPU usage for a single day

(2022-06-08 08:00 to 2022-06-08 23:00)

select appcode,
   percentile_cont(0.5) within group (
      order by cpu_usage 
   ) as p50,
   percentile_cont(0.9) within group (
      order by cpu_usage
   ) as p90,
   percentile_cont(0.99) within group (
      order by cpu_usage
   ) as p99,
   avg(cpu_usage)
from tb_container_cpu_usage_seconds_total
where collect_time >= '2022-06-08 08:00:00'
   and collect_time <= '2022-06-08 22:59:59'
group by appcode;

The result is shown in the figure:

[Figure: result of the peak-period CPU usage query]

Off-peak CPU usage for a single day

(2022-06-08 23:00 to 2022-06-08 23:59:59 and 2022-06-08 00:00 to 2022-06-08 07:59:59)

select appcode,
   percentile_cont(0.5) within group (
      order by cpu_usage 
   ) as p50,
   percentile_cont(0.9) within group ( 
      order by cpu_usage
   ) as p90,
   percentile_cont(0.99) within group (
      order by cpu_usage 
   ) as p99,
   avg(cpu_usage)
from tb_container_cpu_xxx
where (
      collect_time >= '2022-06-08 23:00:00'
      and collect_time <= '2022-06-08 23:59:59'
   )
   or (
      collect_time >= '2022-06-08 00:00:00'
      and collect_time <= '2022-06-08 07:59:59'
   )
group by appcode;

The result is shown in the figure:

[Figure: result of the off-peak CPU usage query]

Off-peak pod count statistics

select appcode,
   round(sum(pod_replicas_avail) / 9.0, 2) as pods
from tb_k8s_resource
where (
        ( 
         record_time >= '2022-06-09 23:00:00' 
         and record_time <= '2022-06-09 23:59:59'
         )
         or (
             record_time >= '2022-06-09 00:00:00' 
             and record_time <= '2022-06-09 07:59:59' 
         ) 
    )
group by appcode;

The result is shown in the figure:

[Figure: result of the off-peak pod count query]

Peak pod count statistics

select appcode,
   round(sum(pod_replicas_avail) / 15.0, 2) as pods
from tb_k8s_resource
where record_time >= '2022-06-09 08:00:00'
   and record_time <= '2022-06-09 22:59:59'
group by appcode;
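The divisors 15.0 and 9.0 in the two pod queries are the lengths in hours of the peak window (08:00 to 23:00) and the off-peak window (23:00 to 08:00), which suggests that tb_k8s_resource is sampled hourly. Under that assumption, the same calculation can be sketched in Python as follows. The constants, function names, and sample records are invented for illustration.

```python
import datetime
from typing import Iterable, Tuple

PEAK_START_HOUR = 8    # inclusive
PEAK_END_HOUR = 23     # exclusive


def is_peak(ts: datetime.datetime) -> bool:
    """True when the timestamp falls inside the peak window."""
    return PEAK_START_HOUR <= ts.hour < PEAK_END_HOUR


def average_pods(records: Iterable[Tuple[datetime.datetime, int]]) -> Tuple[float, float]:
    """Return (peak_avg, offpeak_avg) from hourly (record_time, pods) records of one day."""
    peak_total = offpeak_total = 0
    for ts, pods in records:
        if is_peak(ts):
            peak_total += pods
        else:
            offpeak_total += pods
    peak_hours = PEAK_END_HOUR - PEAK_START_HOUR   # 15 hours
    offpeak_hours = 24 - peak_hours                # 9 hours
    return round(peak_total / peak_hours, 2), round(offpeak_total / offpeak_hours, 2)


day = datetime.date(2022, 6, 9)
# One invented record per hour: 6 pods during peak hours, 2 pods otherwise.
records = [
    (datetime.datetime.combine(day, datetime.time(hour)), 6 if 8 <= hour < 23 else 2)
    for hour in range(24)
]
peak_avg, offpeak_avg = average_pods(records)
print(peak_avg, offpeak_avg)
```

If records are missing for some hours, dividing by a fixed 15 or 9 underestimates the average. Dividing by the number of distinct hours actually present would be more robust.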

The result is shown in the figure:

[Figure: result of the peak pod count query]

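IV. Report data

Write the daily statistics computed above into a table to keep a history, which makes it easy to compute weekly and monthly statistics later.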
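As a hedged sketch of how the stored daily rows might later be rolled up into a weekly view, the function below sums the scaling counters and averages the daily mean CPU usage per (appcode, env). The function name and the sample rows are assumptions. Percentile columns are deliberately left out, because averaging daily percentiles does not yield a true weekly percentile, and the average of daily means is itself only an approximation when the days hold different numbers of samples.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def weekly_rollup(daily_rows: Iterable[dict]) -> Dict[Tuple[str, str], dict]:
    """Sum the scaling counters and average the daily mean CPU usage per (appcode, env)."""
    grouped = defaultdict(list)
    for row in daily_rows:
        grouped[(row["appcode"], row["env_name"])].append(row)
    result = {}
    for key, rows in grouped.items():
        result[key] = {
            "uc": sum(r["uc"] for r in rows),
            "dc": sum(r["dc"] for r in rows),
            "hcpu_mean": sum(r["hcpu_mean"] for r in rows) / len(rows),
            "lcpu_mean": sum(r["lcpu_mean"] for r in rows) / len(rows),
        }
    return result


# Two invented daily rows for the same appcode and env.
daily = [
    {"appcode": "app1", "env_name": "prod", "uc": 1, "dc": 2, "hcpu_mean": 40.0, "lcpu_mean": 10.0},
    {"appcode": "app1", "env_name": "prod", "uc": 3, "dc": 1, "hcpu_mean": 60.0, "lcpu_mean": 20.0},
]
weekly = weekly_rollup(daily)
print(weekly)
```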

create table tb_hpa_report_xxx(
   id SERIAL PRIMARY KEY,
   appcode varchar(256),
   env_name varchar(256),
   uc int DEFAULT 0,
   dc int DEFAULT 0,
   maxuc int DEFAULT 0,
   mindc int DEFAULT 0,
   cpu int DEFAULT 0, 
   mem int DEFAULT 0,
   cname VARCHAR(512) DEFAULT '',
   cval int DEFAULT 0,
   min_replicas int DEFAULT 0,
   max_replicas int DEFAULT 0,
   hcpu_p50 numeric(10,4) DEFAULT 0,
   hcpu_p90 numeric(10,4) DEFAULT 0, 
   hcpu_p99 numeric(10,4) DEFAULT 0,
   hcpu_mean numeric(10,4) DEFAULT 0,  
   lcpu_p50 numeric(10,4)  DEFAULT 0,
   lcpu_p90 numeric(10,4)  DEFAULT 0, 
   lcpu_p99 numeric(10,4)  DEFAULT 0,  
   lcpu_mean numeric(10,4) DEFAULT 0, 
   record_time timestamptz, 
   create_time timestamptz NOT NULL DEFAULT now(),
   update_time timestamptz NOT NULL DEFAULT now()
);
COMMENT ON TABLE tb_hpa_report_xxx IS 'HPA data report';
COMMENT ON COLUMN tb_hpa_report_xxx.id IS 'auto-increment ID';
COMMENT ON COLUMN tb_hpa_report_xxx.appcode IS 'appcode';
COMMENT ON COLUMN tb_hpa_report_xxx.env_name IS 'env_name';
COMMENT ON COLUMN tb_hpa_report_xxx.uc IS 'number of scale-outs';
COMMENT ON COLUMN tb_hpa_report_xxx.dc IS 'number of scale-ins';
COMMENT ON COLUMN tb_hpa_report_xxx.maxuc IS 'number of times the maximum replica count was reached';
COMMENT ON COLUMN tb_hpa_report_xxx.mindc IS 'number of times the minimum replica count was reached';
COMMENT ON COLUMN tb_hpa_report_xxx.cpu IS 'CPU threshold';
COMMENT ON COLUMN tb_hpa_report_xxx.mem IS 'memory threshold';
COMMENT ON COLUMN tb_hpa_report_xxx.cname IS 'custom metric name';
COMMENT ON COLUMN tb_hpa_report_xxx.cval IS 'custom metric threshold';
COMMENT ON COLUMN tb_hpa_report_xxx.min_replicas IS 'minimum replica count';
COMMENT ON COLUMN tb_hpa_report_xxx.max_replicas IS 'maximum replica count';
COMMENT ON COLUMN tb_hpa_report_xxx.hcpu_p50 IS 'peak CPU usage, p50';
COMMENT ON COLUMN tb_hpa_report_xxx.hcpu_p90 IS 'peak CPU usage, p90';
COMMENT ON COLUMN tb_hpa_report_xxx.hcpu_p99 IS 'peak CPU usage, p99';
COMMENT ON COLUMN tb_hpa_report_xxx.hcpu_mean IS 'peak CPU usage, mean';
COMMENT ON COLUMN tb_hpa_report_xxx.lcpu_p50 IS 'off-peak CPU usage, p50';
COMMENT ON COLUMN tb_hpa_report_xxx.lcpu_p90 IS 'off-peak CPU usage, p90';
COMMENT ON COLUMN tb_hpa_report_xxx.lcpu_p99 IS 'off-peak CPU usage, p99';
COMMENT ON COLUMN tb_hpa_report_xxx.lcpu_mean IS 'off-peak CPU usage, mean';
COMMENT ON COLUMN tb_hpa_report_xxx.record_time IS 'statistics date';
COMMENT ON COLUMN tb_hpa_report_xxx.create_time IS 'creation time';
COMMENT ON COLUMN tb_hpa_report_xxx.update_time IS 'update time';

V. Code implementation

A scheduled task computes the daily HPA and CPU usage statistics and writes them into the report table.

"""HPA statistics."""
import datetime
import logging

import sentry_sdk
from sqlalchemy import text

from server.conf.conf import CONF
from server.db.base import Base
from server.db.hpa import HPA
from server.db.model.meta import commit_on_success, db, try_catch_db_exception
from server.db.model.model import HpaReportModel
from server.libs.decorators import statsd_index
from server.libs.error import Error
from server.libs.mail import SendMail
from server.libs.qtalk import SendQtalkMsg

LOG = logging.getLogger('gunicorn.access')


class HpaReport(Base):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    @try_catch_db_exception
    @commit_on_success
    def stats_hpa_updown(self, start_time, end_time):
        """Scaling counts plus HPA thresholds and replica limits per appcode and env."""
        rows = db.session.execute(text(
            """
            select G.*,
                N.min_replicas,
                N.max_replicas
            from (
                select A.deployment_base as env,
                    A.appcode,
                    A.annotations as hpa,
                    coalesce(M.uc, 0) as uc,
                    coalesce(M.dc, 0) as dc,
                    coalesce(M.maxuc, 0) as maxuc,
                    coalesce(M.mindc, 0) as mindc
                from (
                    select appcode,
                        deployment_base,
                        detail->'metadata'->'annotations' as annotations
                    from tb_k8s_hpa_rec
                    where dep_status = 0
                        and status = 0
                    group by appcode,
                        deployment_base,
                        detail->'metadata'->'annotations'
                ) A
                left join (
                    select appcode,
                        env_name,
                        sum(uc) as uc,
                        sum(dc) as dc,
                        sum(maxuc) as maxuc,
                        sum(mindc) as mindc
                    from tb_hpa_metrics
                    where create_time >= :start_time
                        and create_time <= :end_time
                    group by appcode,
                        env_name
                ) M on M.appcode = A.appcode
                    and M.env_name = A.deployment_base
            ) G
            left join tb_k8s_appcode_hpa N on G.appcode = N.appcode
                and G.env = N.deployment_base;
            """
        ), {"start_time": start_time, "end_time": end_time})
        LOG.info(f'stats_hpa_updown: {rows.rowcount}')
        return self.rows_as_dicts(rows.cursor)

    @try_catch_db_exception
    @commit_on_success
    def stats_high_time_cpu(self, start_time, end_time):
        """CPU usage percentiles and mean per appcode during the peak window."""
        LOG.info(f'stats_high_time_cpu: {start_time}, {end_time}')
        rows = db.session.execute(text(
            """
            select appcode,
                percentile_cont(0.5) within group (
                    order by cpu_usage
                ) as p50,
                percentile_cont(0.9) within group (
                    order by cpu_usage
                ) as p90,
                percentile_cont(0.99) within group (
                    order by cpu_usage
                ) as p99,
                avg(cpu_usage)
            from tb_container_cpu_usage_seconds_total
            where collect_time >= :start_time
                and collect_time <= :end_time
            group by appcode
            """
        ), {"start_time": start_time, "end_time": end_time})
        LOG.info(f'stats_high_time_cpu: {rows.rowcount}')
        return self.rows_as_dicts(rows.cursor)

    @try_catch_db_exception
    @commit_on_success
    def stats_high_time_pods(self, start_time, end_time):
        """Average pod count per appcode during the peak window."""
        LOG.info(f'stats_high_time_pods: {start_time}, {end_time}')
        rows = db.session.execute(text(
            """
            select appcode,
                round(sum(pod_replicas_avail) / 15.0, 2) as pods
            from tb_k8s_resource
            where record_time >= :start_time
                and record_time <= :end_time
            group by appcode
            """
        ), {"start_time": start_time, "end_time": end_time})
        LOG.info(f'stats_high_time_pods: {rows.rowcount}')
        return self.rows_as_dicts(rows.cursor)

    @try_catch_db_exception
    @commit_on_success
    def stats_low_time_pods(self, s1, e1, s2, e2):
        """Average pod count per appcode during the off-peak period.

        The off-peak period has two parts
        (2022-06-08 23:00 - 2022-06-08 23:59:59 and 2022-06-08 00:00 - 2022-06-08 07:59:59).
        @param s1 start_time1 start of off-peak part 1
        @param e1 end_time1 end of off-peak part 1
        @param s2 start_time2 start of off-peak part 2
        @param e2 end_time2 end of off-peak part 2
        """
        LOG.info(f'stats_low_time_pods: {s1}, {e1}, {s2}, {e2}')
        rows = db.session.execute(text(
            """
            select appcode,
                round(sum(pod_replicas_avail) / 9.0, 2) as pods
            from tb_k8s_resource
            where (
                    (
                        record_time >= :s1
                        and record_time <= :e1
                    )
                    or (
                        record_time >= :s2
                        and record_time <= :e2
                    )
                )
            group by appcode
            """
        ), {"s1": s1, "e1": e1, "s2": s2, "e2": e2})
        LOG.info(f'stats_low_time_pods: {rows.rowcount}')
        return self.rows_as_dicts(rows.cursor)

    @staticmethod
    def rows_as_dicts(cursor):
        """convert tuple result to dict with cursor"""
        col_names = [i[0] for i in cursor.description]
        return [dict(zip(col_names, row)) for row in cursor]

    @try_catch_db_exception
    @commit_on_success
    def stats_low_time_cpu(self, s1, e1, s2, e2):
        """CPU usage percentiles and mean per appcode during the off-peak period.

        The off-peak period has two parts
        (2022-06-08 23:00 - 2022-06-08 23:59:59 and 2022-06-08 00:00 - 2022-06-08 07:59:59).
        @param s1 start_time1 start of off-peak part 1
        @param e1 end_time1 end of off-peak part 1
        @param s2 start_time2 start of off-peak part 2
        @param e2 end_time2 end of off-peak part 2
        """
        LOG.info(f'stats_low_time_cpu: {s1}, {e1}, {s2}, {e2}')
        rows = db.session.execute(text(
            """
            select appcode,
                percentile_cont(0.5) within group (
                    order by cpu_usage
                ) as p50,
                percentile_cont(0.9) within group (
                    order by cpu_usage
                ) as p90,
                percentile_cont(0.99) within group (
                    order by cpu_usage
                ) as p99,
                avg(cpu_usage)
            from tb_container_cpu_usage_seconds_total
            where (
                    collect_time >= :s1
                    and collect_time <= :e1
                )
                or (
                    collect_time >= :s2
                    and collect_time <= :e2
                )
            group by appcode
            """
        ), {"s1": s1, "e1": e1, "s2": s2, "e2": e2})
        LOG.info(f'stats_low_time_cpu: {rows.rowcount}')
        return self.rows_as_dicts(rows.cursor)

    @statsd_index('hpa_report.sendmail')
    @commit_on_success
    def send_report_form(self, day):
        """Email the report for one day as an HTML table."""
        try:
            start = datetime.datetime.combine(day, datetime.time(0, 0, 0))
            end = datetime.datetime.combine(day, datetime.time(23, 59, 59))
            q = HpaReportModel.query.filter(
                HpaReportModel.record_time >= start,
                HpaReportModel.record_time <= end
            ).order_by(
                HpaReportModel.uc.desc(),
                HpaReportModel.dc.desc(),
                HpaReportModel.maxuc.desc(),
                HpaReportModel.mindc.desc()
            )
            count = q.count()
            day_data = q.all()
            cell = ""
            if count > 0:
                # One table row per (appcode, env) report record.
                for stat in day_data:
                    cell += f"""
                        <tr>
                            <td>{stat.appcode}</td>
                            <td>{stat.env_name}</td>
                            <td>{stat.min_replicas}</td>
                            <td>{stat.max_replicas}</td>
                            <td>{stat.cpu}</td>
                            <td>{stat.mem}</td>
                            <td>{stat.cname}:{stat.cval}</td>
                            <td>{stat.uc}</td>
                            <td>{stat.dc}</td>
                            <td>{stat.maxuc}</td>
                            <td>{stat.mindc}</td>
                            <td>{round(stat.hpods, 2)}</td>
                            <td>{round(stat.hcpu_mean, 2)}%</td>
                            <td>{round(stat.lpods, 2)}</td>
                            <td>{round(stat.lcpu_mean, 2)}%</td>
                        </tr>"""
            content = f"""
                <div>
                    <h2>{day} 00:00:00 to 23:59:59</h2>
                    <h3>Peak (08:00-23:00), off-peak (23:00-08:00)</h3>
                    <table border='1' cellpadding='1' cellspacing='0'>
                    <tr>
                        <th>Appcode</th>
                        <th>Env</th>
                        <th>Min replicas</th>
                        <th>Max replicas</th>
                        <th>CPU scale-out threshold</th>
                        <th>Memory scale-out threshold</th>
                        <th>Custom scale-out threshold</th>
                        <th>Scale-outs</th>
                        <th>Scale-ins</th>
                        <th>Times at max replicas</th>
                        <th>Times at min replicas</th>
                        <th>Peak replicas</th>
                        <th>Peak mean CPU usage</th>
                        <th>Off-peak replicas</th>
                        <th>Off-peak mean CPU usage</th>
                    </tr>
                    {cell}
                    </table>
                </div><br><br>"""
            SendMail.send_mail(
                CONF.notice_user.users.split(','),
                "HPA scaling counts and CPU usage statistics",
                content)
            SendQtalkMsg.send_msg(
                CONF.notice_user.users.split(','),
                'HPA scaling counts and CPU usage statistics report sent')
        except Exception as ex:
            sentry_sdk.capture_exception()
            SendQtalkMsg.send_msg(
                ['haicheng.bi'],
                f'HPA scaling counts and CPU usage statistics failed: {ex}')

    @try_catch_db_exception
    @commit_on_success
    @statsd_index('hpa_report.save_stats_result')
    def save_stats_result(self, day):
        """Save the HPA and CPU statistics for one day.

        :param day: datetime.date, the day to compute statistics for
        """
        if not isinstance(day, datetime.date):
            raise Error("param day is invalid type, we need datetime.date type.")
        LOG.info(f'save_stats_result: {day}')
        hpa_start = datetime.datetime.combine(day, datetime.time(0, 0, 0))
        hpa_end = datetime.datetime.combine(day, datetime.time(23, 59, 59))
        hpa_stats_rows = self.stats_hpa_updown(hpa_start, hpa_end)
        # Peak window: 08:00-23:00
        h_start = datetime.datetime.combine(day, datetime.time(8, 0, 0))
        h_end = datetime.datetime.combine(day, datetime.time(22, 59, 59))
        # Off-peak windows: 23:00-24:00 and 00:00-08:00
        l_s1 = datetime.datetime.combine(day, datetime.time(23, 0, 0))
        l_e1 = datetime.datetime.combine(day, datetime.time(23, 59, 59))
        l_s2 = datetime.datetime.combine(day, datetime.time(0, 0, 0))
        l_e2 = datetime.datetime.combine(day, datetime.time(7, 59, 59))
        hcpu_stats_rows = self.stats_high_time_cpu(h_start, h_end)
        lcpu_stats_rows = self.stats_low_time_cpu(l_s1, l_e1, l_s2, l_e2)
        hpods_stats_rows = self.stats_high_time_pods(h_start, h_end)
        lpods_stats_rows = self.stats_low_time_pods(l_s1, l_e1, l_s2, l_e2)
        cpus_rows = {}
        pods_rows = {}
        report_rows = {}
        for row in hpods_stats_rows:
            appcode = row.get('appcode', '')
            pods_rows[appcode] = {
                'appcode': appcode,
                'hpods': row.get('pods', 0)
            }
        for row in lpods_stats_rows:
            appcode = row.get('appcode', '')
            # setdefault: an appcode may have off-peak rows but no peak rows.
            pods_rows.setdefault(appcode, {}).update({
                'appcode': appcode,
                'lpods': row.get('pods', 0)
            })
        for row in hcpu_stats_rows:
            appcode = row.get('appcode', '')
            # avg(cpu_usage) has no alias, so PostgreSQL names the column "avg".
            cpus_rows[appcode] = {
                'appcode': appcode,
                'hcpu_p50': row.get('p50', 0),
                'hcpu_p90': row.get('p90', 0),
                'hcpu_p99': row.get('p99', 0),
                'hcpu_mean': row.get('avg', 0),
            }
        for row in lcpu_stats_rows:
            appcode = row.get('appcode', '')
            cpus_rows.setdefault(appcode, {'appcode': appcode}).update({
                'lcpu_p50': row.get('p50', 0),
                'lcpu_p90': row.get('p90', 0),
                'lcpu_p99': row.get('p99', 0),
                'lcpu_mean': row.get('avg', 0),
            })
        for row in hpa_stats_rows:
            appcode = row.get('appcode', '')
            env_name = row.get('env', '')
            hpa = row.get('hpa')
            if not hpa:
                continue
            key = f'{appcode}-{env_name}'
            report_rows[key] = {
                'appcode': appcode,
                'env_name': env_name,
                'uc': row.get('uc', 0),
                'dc': row.get('dc', 0),
                'maxuc': row.get('maxuc', 0),
                'mindc': row.get('mindc', 0),
                'cpu': int(hpa.get('cpuTargetUtilization', 0)),
                'mem': int(hpa.get('memoryTargetValue', 0)),
                'cname': hpa.get('customName', ''),
                'cval': int(hpa.get('customTargetValue', 0)),
                'min_replicas': row.get('min_replicas', 0),
                'max_replicas': row.get('max_replicas', 0),
            }
            report_rows[key].update(cpus_rows.get(appcode, {}))
            report_rows[key].update(pods_rows.get(appcode, {}))
        # Replace any rows already stored for this day, then insert the new ones.
        HpaReportModel.query.filter(
            HpaReportModel.record_time == day
        ).delete()
        for value in report_rows.values():
            model = HpaReportModel(record_time=day, **value)
            db.session.add(model)

2. The result is shown in the figure:

[Figure: statistics result]

3. When the statistics are finished, the report is sent by email

[Figure: report email]

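VI. Result verification

  • The data is complete (it covers the full list of applications that have HPA enabled)

  • The scaling counts are accurate (the scale-out and scale-in counts match what actually happened)

  • CPU usage is accurate (peak and off-peak)

  • Pod counts are accurate (peak and off-peak)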

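Taking t_where_go as an example, the checks are as follows:

  1. Replica count check

[Figure: replica count check]

  2. CPU usage and pod count check

[Figure: CPU usage and pod count check]

  3. Data completeness check

a. The appcode and env of the HPA-enabled applications match the appcode and env in the HPA configuration table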

   select A.appcode, A.deployment_base, M.appcode, M.deployment_base
   from tb_k8s_appcode_hpa A
      left join(
         select deployment_base,  
             appcode 
         from tb_k8s_hpa_rec 
         where dep_status != 1 
         group by appcode,
             deployment_base
     ) M on A.appcode = M.appcode and A.deployment_base = M.deployment_base;

b. The number of rows in the scaling report equals the number of records that have HPA enabled and not temporarily disabled

   select count(*) from (select appcode,deployment_base from tb_k8s_hpa_xxx where status = 0 and dep_status = 0 group by appcode, deployment_base)M;
   select count(*) from tb_hpa_report_form where record_time = '2022-06-10';
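Both completeness checks amount to comparing two sets of (appcode, env) pairs. Comparing counts only tells you that something is off, whereas a set difference also tells you which pairs are missing. The following sketch assumes the pairs have already been fetched from the two tables. The function name and the sample pairs are hypothetical.

```python
from typing import Iterable, Set, Tuple

Key = Tuple[str, str]  # (appcode, deployment_base)


def completeness_diff(configured: Iterable[Key],
                      reported: Iterable[Key]) -> Tuple[Set[Key], Set[Key]]:
    """Return (missing, unexpected) pairs between the HPA config and the report."""
    configured_set, reported_set = set(configured), set(reported)
    missing = configured_set - reported_set      # HPA enabled but absent from the report
    unexpected = reported_set - configured_set   # in the report but not HPA enabled
    return missing, unexpected


# Invented sample pairs.
configured_pairs = [("app1", "prod"), ("app2", "prod"), ("app3", "beta")]
reported_pairs = [("app1", "prod"), ("app2", "prod"), ("app4", "prod")]
missing, unexpected = completeness_diff(configured_pairs, reported_pairs)
print(missing, unexpected)
```

The report is complete when both returned sets are empty.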


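VII. Summary

Statistics work demands highly accurate data. How is the data collected? How is it cleaned? How is it assembled? All of this has to be thought through, and getting the raw data accurate and concise matters a great deal. Ideally, which data to collect and which data will be needed later should be decided before the software is developed. That is how better software gets built.

With the data above we can answer the question from the beginning of the article: HPA does not necessarily reduce resource usage. It has to be configured sensibly. If the configuration is unreasonable, for example a minimum replica count larger than the number of replicas actually needed, no scaling will happen at all.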
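The effect of an oversized minimum can be shown with the replica formula from the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), clamped to the configured bounds. The sketch below is a simplification: the real controller also applies a tolerance and stabilization behavior, which are left out here, and the numbers are invented.

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int, max_replicas: int) -> int:
    """HPA replica formula clamped to [min_replicas, max_replicas] (simplified)."""
    raw = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, raw))


# Off-peak: 10 replicas at 10% CPU against a 50% target would only need 2 replicas.
sensible = desired_replicas(10, 10.0, 50.0, min_replicas=2, max_replicas=20)
too_high = desired_replicas(10, 10.0, 50.0, min_replicas=10, max_replicas=20)
print(sensible, too_high)
```

With a minimum of 2 the workload shrinks to 2 replicas, but with a minimum of 10 it stays at 10 no matter how low the usage falls, so no resources are saved.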