1.概述

作为一名后端开发工程师,排查系统问题用得最多的手段之一就是查看系统日志,在当下主要的分布式集群环境中一般使用ELK(Elasticsearch , Logstash, Kibana)来统一收集日志,以便后续查看日志定位追踪相关问题。但是在并发情况下,大量的系统用户即多线程并发访问后端服务导致同一个请求的日志记录不再是连续相邻的,此时多个请求的日志是一起串行输出到文件中,所以我们筛选出指定请求的全部相关日志还是比较麻烦的,同时当后端异步处理功能逻辑以及微服务的下游服务调用日志追踪也有着相同的问题。

为了快速排查、定位、解决日常反馈的系统问题,我们就必须解决上面所说的查看请求日志的痛点。解决方案就是:每个请求都使用一个唯一标识traceId来追踪全部的链路显示在日志中,并且不修改原有的打印方式(代码无入侵),然后使用使用Logback的MDC机制日志模板中加入traceId标识,取值方式为%X{traceId} 。这样在收集的日志文件中就可以看到每行日志有一个tracceId值,每个请求的值都不一样,这样我们就可以根据traceId查询过滤出一次请求的所有上下文日志了。

项目推荐:基于SpringBoot2.x、SpringCloud和SpringCloudAlibaba企业级系统架构底层框架封装,解决业务开发时常见的非功能性需求,防止重复造轮子,方便业务快速开发和企业技术栈框架统一管理。引入组件化的思想实现高内聚低耦合并且高度可配置化,做到可插拔。严格控制包依赖和统一版本管理,做到最少化依赖。注重代码规范和注释,非常适合个人学习和企业使用

Github地址:github.com/plasticene/…

Gitee地址:gitee.com/plasticene3…

微信公众号Shepherd进阶笔记

2.实现方案

MDC(Mapped Diagnostic Context,映射调试上下文)log4jlogback 提供的一种方便在多线程条件下记录日志的功能。MDC 可以看成是一个与当前线程绑定的Map,可以往其中添加键值对。MDC 中包含的内容可以被同一线程中执行的代码所访问。当前线程的子线程会继承其父线程中的 MDC 的内容。当需要记录日志时,只需要从MDC 中获取所需的信息即可。MDC的内容则由程序在适当的时候保存进去。对于一个 Web 应用来说,通常是在请求被处理的最开始保存这些数据。

由于MDC内部使用的是ThreadLocal所以只有本线程才有效,子线程和下游的服务MDC里的值会丢失;所以方案主要的难点是解决traceId值的传递问题,需要重点关注一下两点:

  • MDCtraceId数据如何传递给下游服务,下游服务如何接收traceId并放入MDC
  • 异步的情况下(线程池)如何把traceId值传给子线程。

2.1 设置日志模板

无论是我们的项目使用的是log4j还是logback框架,我们都需要先调整日志配置文件的日志格式如下:

<!-- 日志格式 -->
<property name="CONSOLE_LOG_PATTERN" value="[%X{traceId}] [%-5p] [%d{yyyy-MM-dd HH:mm:ss.SSS}] [%t@${PID}]  %c %M : %m%n"/>

这样才能有效地把traceId收集到日志文件中。

2.2 请求上下文设置traceId并有效传递下游服务

按照上面说的,每个请求使用一个唯一标识traceId来追踪一次请求的全部日志,这就要求我们的traceId必须保证唯一性,不然就会出现请求日志混乱问题,是绝对不允许的。这里我们利用hutool框架的生成id工具IdUtil来生成唯一值,可以生成uuid或者使用雪花算法Snowflake生成唯一id都可以,因为这里id是记录在日志文件中做唯一标识用的,所以对id字符类型,递增性那些没啥要求,只要唯一标识即可,按照之前习惯,我就用雪花算法生成唯一id标识了。

生成traceId并放入到MDC上下文中

public class WebTraceFilter extends OncePerRequestFilter {
​
​
  @Override
  protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                  FilterChain filterChain) throws IOException, ServletException {
    try {
      String traceId = request.getHeader(MDCTraceUtils.TRACE_ID_HEADER);
      if (StrUtil.isEmpty(traceId)) {
        MDCTraceUtils.addTrace();
       } else {
        MDCTraceUtils.putTrace(traceId);
       }
      filterChain.doFilter(request, response);
     } finally {
      MDCTraceUtils.removeTrace();
     }
   }
}

这里通过一个过滤器来设置traceId放入到MDC中,可以将该过滤器的执行优先级设置比较靠前,这样就可以有效保证我们一次请求上下文的日志中都有traceId了。同时这个过滤器我们是集成在自定义实现的web starter中,公司的所有服务都会引用web starter集成该过滤器,意味着只要我们请求下游服务时添加了traceId这个header,下游服务执行到该过滤器时就会拿到上游服务传递过来的traceId值放入到当前服务的MDC中。MDCTraceUtils工具类代码如下:

public class MDCTraceUtils {
  /**
   * 追踪id的名称
   */
  public static final String KEY_TRACE_ID = "traceId";
​
  /**
   * 日志链路追踪id信息头
   */
  public static final String TRACE_ID_HEADER = "x-traceId-header";
​
​
  /**
   * 创建traceId并赋值MDC
   */
  public static void addTrace() {
    String traceId = createTraceId();
    MDC.put(KEY_TRACE_ID, traceId);
   }
​
  /**
   * 赋值MDC
   */
  public static void putTrace(String traceId) {
    MDC.put(KEY_TRACE_ID, traceId);
   }
​
  /**
   * 获取MDC中的traceId值
   */
  public static String getTraceId() {
    return MDC.get(KEY_TRACE_ID);
   }
​
  /**
   * 清除MDC的值
   */
  public static void removeTrace() {
    MDC.remove(KEY_TRACE_ID);
   }
​
  /**
   * 创建traceId
   */
  public static String createTraceId() {
    return IdUtil.getSnowflake().nextIdStr();
   }
​
}

接下来我们就来演示下traceId如何在服务间有效传递。无论是微服务间的服务调用还是单体项目的调用下游服务,我都建议使用Spring Cloud框架中的openfeign组件进行服务间接口调用,如果对组件openfeign不太熟悉的,可以看看之前我总结的 openfeign实现原理进行了解。这里就用openFeign进行模拟服务间调用下游服务获取车间列表的接口

@FeignClient(name = "workshopService", url = "http://127.0.0.1:16688/textile", path = "/workshop")
public interface WorkshopService {
  @GetMapping("/list/temp")
  ResponseVO<List<WorkshopDTO>> getList();
}

增加feign拦截器,继续把当前服务的traceId值传递给下游服务

public class FeignInterceptor implements RequestInterceptor {
  @Override
  public void apply(RequestTemplate requestTemplate) {
    ServletRequestAttributes requestAttributes = (ServletRequestAttributes) RequestContextHolder.getRequestAttributes();
    // 传递请求相关header
    if (requestAttributes != null) {
      HttpServletRequest request = requestAttributes.getRequest();
      Enumeration<String> headerNames = request.getHeaderNames();
      if (headerNames != null) {
        while (headerNames.hasMoreElements()) {
          String name = headerNames.nextElement();
          // 跳过 content-length
          if (Objects.equals("content-length", name)){
            continue;
           }
          String value = request.getHeader(name);
          requestTemplate.header(name, value);
         }
       }
     }
    // 传递日志追踪的traceId
    String traceId = MDCTraceUtils.getTraceId();
    if (StringUtils.isNotBlank(traceId)) {
      requestTemplate.header(MDCTraceUtils.TRACE_ID_HEADER, traceId);
     }
   }
}

可以看到这里主要完成传递请求的header,traceId这个header单独处理,这是因为webTraceFilter过滤器中只把traceId放入了MDC中,并没有吧traceId放入到请求的header中,servlet层的filter过滤器Spring不建议修改请求的参数,包括header,改起来也比较麻烦,所以这里需要单独处理传递。当然这里的拦截器FeignInterceptor和上面的过滤器WebTraceFilter都需要注入到Spring容器中。

编写代码进行接口调用测试

  @GetMapping("/trace")
  public void testTrace() {
    log.info("开始执行咯");
    BaseQuery query = new BaseQuery();
    ResponseVO<List<WorkshopDTO>> responseVO = workshopService.getList();
    log.info("接口返回结果:{}", responseVO);
   }

执行日志打印如下,当前服务的日志:

[1675794072381583360] [INFO ] [2023-07-03 17:10:16.289] [http-nio-18888-exec-5@24417]  com.plasticene.boot.web.core.aop.ApiLogPrintAspect timeAround : Request Info : {"ip":"127.0.0.1","url":"http://127.0.0.1:18888/fds/test/trace","httpMethod":"GET","classMethod":"com.plasticene.fast.controller.TestController.testTrace","requestParams":null}
[1675794072381583360] [INFO ] [2023-07-03 17:10:16.299] [http-nio-18888-exec-5@24417]  com.plasticene.fast.controller.TestController testTrace$original$mZGAheRd : 开始执行咯
[1675794072381583360] [INFO ] [2023-07-03 17:10:17.087] [http-nio-18888-exec-5@24417]  com.plasticene.fast.controller.TestController testTrace$original$mZGAheRd : 接口返回结果:ResponseVO(code=200, msg=OK, data=[WorkshopDTO(id=3, orgId=4, name=检验车间, location=杭州市西湖区, remark=这里是最严格的, machineCount=null)])
[1675794072381583360] [INFO ] [2023-07-03 17:10:17.088] [http-nio-18888-exec-5@24417]  com.plasticene.boot.web.core.aop.ApiLogPrintAspect timeAround : Response result:  null
[1675794072381583360] [INFO ] [2023-07-03 17:10:17.089] [http-nio-18888-exec-5@24417]  com.plasticene.boot.web.core.aop.ApiLogPrintAspect timeAround : time cost:  805

traceId为:1675794072381583360,看看下游服务textile的日志如下:

[1675794072381583360] [INFO ] [2023-07-03 17:10:16.438] [http-nio-16688-exec-1@24461]  com.plasticene.boot.web.core.aop.ApiLogPrintAspect timeAround : Request Info : {"ip":"127.0.0.1","url":"http://127.0.0.1:16688/textile/workshop/list/temp","httpMethod":"GET","classMethod":"com.plasticene.textile.controller.WorkshopController.getAllList","requestParams":null}
[1675794072381583360] [DEBUG] [2023-07-03 17:10:16.939] [http-nio-16688-exec-1@24461]  com.plasticene.textile.dao.WorkshopDAO.selectList debug : ==>  Preparing: SELECT id, org_id, name, location, remark, create_time, update_time, creator, updater FROM workshop WHERE (org_id = ?) ORDER BY id DESC
[1675794072381583360] [DEBUG] [2023-07-03 17:10:16.972] [http-nio-16688-exec-1@24461]  com.plasticene.textile.dao.WorkshopDAO.selectList debug : ==> Parameters: 4(Integer)
[1675794072381583360] [DEBUG] [2023-07-03 17:10:17.008] [http-nio-16688-exec-1@24461]  com.plasticene.textile.dao.WorkshopDAO.selectList debug : <==    Total: 1
[1675794072381583360] [INFO ] [2023-07-03 17:10:17.029] [http-nio-16688-exec-1@24461]  com.plasticene.boot.web.core.aop.ApiLogPrintAspect timeAround : Response result:  [{"id":3,"orgId":4,"name":"检验车间","location":"杭州市西湖区","remark":"这里是最严格的","machineCount":null}]
[1675794072381583360] [INFO ] [2023-07-03 17:10:17.040] [http-nio-16688-exec-1@24461]  com.plasticene.boot.web.core.aop.ApiLogPrintAspect timeAround : time cost:  621

可以看到两个服务的traceId都是一样的,这就说明我们的traceId有效传递了。

当然我们也可以使用Spring自带的RestTemplate、或者httpClient、OkHttp3等框架进行接口调用,只要请求接口时设置traceId这个header即可,使用restTemplate客户端调接口时,还可以通过扩展点ClientHttpRequestInterceptor接口的实现类对请求进行拦截处理进行统一traceIdheader设置,这样就不用每个接口请求都要设置一遍,尽量减少重复劳动做到优雅不过时。这里不在展示详细,请自我去实现。

2.3 异步父子线程traceId传递

上面说过MDC内部使用的是ThreadLocal,所以只有本线程才有效,子线程和下游的服务MDC里的值会丢失。我们项目服务使用的logback日志框架,所以我们需要重写logback的LogbackMDCAdapter,由于logback的MDC实现内部使用的是ThreadLocal不能传递子线程,所以需要重写替换为阿里的TransmittableThreadLocalTransmittableThreadLocal 是Alibaba开源的、用于解决在使用线程池等会池化复用线程的执行组件情况下,提供ThreadLocal值的传递功能,解决异步执行时上下文传递的问题。官方文档地址:github.com/alibaba/tra…

重写logback的LogbackMDCAdapter,自定义实现TtlMDCAdapter类,其实就是把LogbackMDCAdapterThreadLocal换成TransmittableThreadLocal即可,其他代码都是一样的。

/**
 *重构{@link LogbackMDCAdapter}类,搭配TransmittableThreadLocal实现父子线程之间的数据传递
 *
 * @author fjzheng
 * @version 1.0
 * @date 2022/7/14 13:50
 */
public class TtlMDCAdapter implements MDCAdapter {
  private final ThreadLocal<Map<String, String>> copyOnInheritThreadLocal = new TransmittableThreadLocal<>();
​
  private static final int WRITE_OPERATION = 1;
  private static final int MAP_COPY_OPERATION = 2;
​
  private static TtlMDCAdapter mtcMDCAdapter;
​
  /**
   * keeps track of the last operation performed
   */
  private final ThreadLocal<Integer> lastOperation = new ThreadLocal<>();
​
  static {
    mtcMDCAdapter = new TtlMDCAdapter();
    MDC.mdcAdapter = mtcMDCAdapter;
   }
​
  public static MDCAdapter getInstance() {
    return mtcMDCAdapter;
   }
​
  private Integer getAndSetLastOperation(int op) {
    Integer lastOp = lastOperation.get();
    lastOperation.set(op);
    return lastOp;
   }
​
  private static boolean wasLastOpReadOrNull(Integer lastOp) {
    return lastOp == null || lastOp == MAP_COPY_OPERATION;
   }
​
  private Map<String, String> duplicateAndInsertNewMap(Map<String, String> oldMap) {
    Map<String, String> newMap = Collections.synchronizedMap(new HashMap<>());
    if (oldMap != null) {
      // we don't want the parent thread modifying oldMap while we are
      // iterating over it
      synchronized (oldMap) {
        newMap.putAll(oldMap);
       }
     }
​
    copyOnInheritThreadLocal.set(newMap);
    return newMap;
   }
​
  /**
   * Put a context value (the <code>val</code> parameter) as identified with the
   * <code>key</code> parameter into the current thread's context map. Note that
   * contrary to log4j, the <code>val</code> parameter can be null.
   * <p/>
   * <p/>
   * If the current thread does not have a context map it is created as a side
   * effect of this call.
   *
   * @throws IllegalArgumentException in case the "key" parameter is null
   */
  @Override
  public void put(String key, String val) {
    if (key == null) {
      throw new IllegalArgumentException("key cannot be null");
     }
​
    Map<String, String> oldMap = copyOnInheritThreadLocal.get();
    Integer lastOp = getAndSetLastOperation(WRITE_OPERATION);
​
    if (wasLastOpReadOrNull(lastOp) || oldMap == null) {
      Map<String, String> newMap = duplicateAndInsertNewMap(oldMap);
      newMap.put(key, val);
     } else {
      oldMap.put(key, val);
     }
   }
​
  /**
   * Remove the the context identified by the <code>key</code> parameter.
   * <p/>
   */
  @Override
  public void remove(String key) {
    if (key == null) {
      return;
     }
    Map<String, String> oldMap = copyOnInheritThreadLocal.get();
    if (oldMap == null) {
      return;
     }
​
    Integer lastOp = getAndSetLastOperation(WRITE_OPERATION);
​
    if (wasLastOpReadOrNull(lastOp)) {
      Map<String, String> newMap = duplicateAndInsertNewMap(oldMap);
      newMap.remove(key);
     } else {
      oldMap.remove(key);
     }
​
   }
​
​
  /**
   * Clear all entries in the MDC.
   */
  @Override
  public void clear() {
    lastOperation.set(WRITE_OPERATION);
    copyOnInheritThreadLocal.remove();
   }
​
  /**
   * Get the context identified by the <code>key</code> parameter.
   * <p/>
   */
  @Override
  public String get(String key) {
    final Map<String, String> map = copyOnInheritThreadLocal.get();
    if ((map != null) && (key != null)) {
      return map.get(key);
     } else {
      return null;
     }
   }
​
  /**
   * Get the current thread's MDC as a map. This method is intended to be used
   * internally.
   */
  public Map<String, String> getPropertyMap() {
    lastOperation.set(MAP_COPY_OPERATION);
    return copyOnInheritThreadLocal.get();
   }
​
  /**
   * Returns the keys in the MDC as a {@link Set}. The returned value can be
   * null.
   */
  public Set<String> getKeys() {
    Map<String, String> map = getPropertyMap();
​
    if (map != null) {
      return map.keySet();
     } else {
      return null;
     }
   }
​
  /**
   * Return a copy of the current thread's context map. Returned value may be
   * null.
   */
  @Override
  public Map<String, String> getCopyOfContextMap() {
    Map<String, String> hashMap = copyOnInheritThreadLocal.get();
    if (hashMap == null) {
      return null;
     } else {
      return new HashMap<>(hashMap);
     }
   }
​
  @Override
  public void setContextMap(Map<String, String> contextMap) {
    lastOperation.set(WRITE_OPERATION);
​
    Map<String, String> newMap = Collections.synchronizedMap(new HashMap<>());
    newMap.putAll(contextMap);
​
    // the newMap replaces the old one for serialisation's sake
    copyOnInheritThreadLocal.set(newMap);
   }
}

接下来只需要实现程序启动时加载上我们自己实现的TtlMDCAdapter:

/**
 *
 * 初始化TtlMDCAdapter实例,并替换MDC中的adapter对象
 *
 * @author fjzheng
 * @version 1.0
 * @date 2022/7/14 13:55
 */
public class TtlMDCAdapterInitializer implements ApplicationContextInitializer<ConfigurableApplicationContext> {
  @Override
  public void initialize(ConfigurableApplicationContext applicationContext) {
    //加载TtlMDCAdapter实例
    TtlMDCAdapter.getInstance();
   }
}

这样我们在异步多线程情况下MDCtraceId值就能正常传递,下面来看看测试示例:

  // 定义线程池
  private ThreadFactory namedThreadFactory = new ThreadFactoryBuilder()
       .setNameFormat("letter-pool-%d").build();
  private ExecutorService fixedThreadPool = new ThreadPoolExecutor(Runtime.getRuntime().availableProcessors()*2,
      Runtime.getRuntime().availableProcessors() * 40,
      0L,
      TimeUnit.MILLISECONDS,
      new LinkedBlockingQueue<Runnable>(Runtime.getRuntime().availableProcessors() * 20),
      namedThreadFactory);
    
  // 测试接口
  @GetMapping("/async")
  public void testAsync() {
    log.info("打印日志了");
    fixedThreadPool.execute(()->{
      log.info("异步执行了");
      try {
        Student student = null;
        String name = student.getName();
       } catch (Exception e) {
        log.error("异步报错了:", e);
       }
​
     });
   }

执行结果日志打印如下:

[1675805796950241280] [INFO ] [2023-07-03 17:56:51.683] [http-nio-18888-exec-8@24417]  com.plasticene.boot.web.core.aop.ApiLogPrintAspect timeAround : Request Info : {"ip":"127.0.0.1","url":"http://127.0.0.1:18888/fds/test","httpMethod":"GET","classMethod":"com.plasticene.fast.controller.TestController.test","requestParams":null}
[1675805796950241280] [INFO ] [2023-07-03 17:56:51.698] [http-nio-18888-exec-8@24417]  com.plasticene.fast.controller.TestController test$original$mZGAheRd : 打印日志了
[1675805796950241280] [INFO ] [2023-07-03 17:56:51.700] [http-nio-18888-exec-8@24417]  com.plasticene.boot.web.core.aop.ApiLogPrintAspect timeAround : Response result:  null
[1675805796950241280] [INFO ] [2023-07-03 17:56:51.700] [http-nio-18888-exec-8@24417]  com.plasticene.boot.web.core.aop.ApiLogPrintAspect timeAround : time cost:  24
[1675805796950241280] [INFO ] [2023-07-03 17:56:51.700] [letter-pool-1@24417]  com.plasticene.fast.controller.TestController lambda$test$0 : 异步执行了
[1675805796950241280] [ERROR] [2023-07-03 17:56:51.704] [letter-pool-1@24417]  com.plasticene.fast.controller.TestController lambda$test$0 : 异步报错了:
java.lang.NullPointerException: null
  at com.plasticene.fast.controller.TestController.lambda$test$0(TestController.java:93)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)

3.总结

以上全部就是关于Spring Boot如何实现分布式日志链路追踪的相关知识点。工欲善其事,必先利其器,我们要想快速通过日志定位系统问题,就必须通过traceId高效查找一次请求的全部上下文日志,包括异步执行的逻辑。