Background

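While developing a display-related feature on a project, we ran into an FD leak. After navigating through just a few screens, the HWC process had leaked a large number of FDs, quickly exceeding 200. The hardware platform for this project is a Qualcomm chip, the software platform is Android 10, and the kernel version is Linux 4.19. On Qualcomm platforms, the HAL-layer service for the display module is vendor.qti.hardware.display.composer-service, referred to as HWC below.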

The PID of vendor.qti.hardware.display.composer-service is 864:
XXXXXX:/proc/864/fd # ls -l
// a large number of anon_inode:sync_file file descriptors are present
lr-x------ 1 system graphics 64 2022-07-29 00:04 342 -> anon_inode:sync_file
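
The manual `ls -l` check above can also be automated. The following is a minimal C++ sketch, not part of the original investigation: `CountFdsWithPrefix` is a hypothetical helper name, and it assumes a Linux-style /proc filesystem and permission to read the target process's fd directory.

```cpp
#include <dirent.h>
#include <unistd.h>

#include <climits>
#include <string>

// Hypothetical helper: count the entries in /proc/<pid_dir>/fd whose symlink
// target starts with `prefix`, e.g. CountFdsWithPrefix("864", "anon_inode:sync_file").
// `pid_dir` is a numeric PID string or "self". Returns -1 if the directory
// cannot be opened.
int CountFdsWithPrefix(const std::string &pid_dir, const std::string &prefix) {
  const std::string dir_path = "/proc/" + pid_dir + "/fd";
  DIR *dir = opendir(dir_path.c_str());
  if (dir == nullptr) return -1;

  int count = 0;
  while (dirent *entry = readdir(dir)) {
    const std::string link = dir_path + "/" + entry->d_name;
    char target[PATH_MAX];
    ssize_t len = readlink(link.c_str(), target, sizeof(target) - 1);
    if (len < 0) continue;  // "." and ".." are not symlinks, skip them
    // rfind(prefix, 0) == 0 is a "starts with" check.
    if (std::string(target, static_cast<size_t>(len)).rfind(prefix, 0) == 0) ++count;
  }
  closedir(dir);
  return count;
}
```

Calling this periodically and watching the count for "anon_inode:sync_file" grow gives the same signal as the directory listing, without reading the output by eye.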

Analysis process

Initial analysis

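Because the leaked fds point to anon_inode:sync_file and the leak occurs in the HWC process, we can be fairly sure it is related to fences. However, HWC operates on fences in too many places, so locating the problem by starting from the HWC code directly is difficult. Since the leak is fast and reproduces reliably, some debugging techniques can be used to narrow the scope.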

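The idea is to identify the caller at the moment an FD leak is detected. First we need to find where FDs are allocated. In Linux, FDs are per-process: every process has its own FD table, and an FD is a small non-negative integer. Files that a process opens itself usually start at 3, because 0, 1, and 2 are normally taken by stdin, stdout, and stderr. FD allocation is done in the get_unused_fd_flags function. With that background, the analysis proper can begin.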
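
The allocation rule matters for what follows: POSIX specifies that dup returns the lowest-numbered unused descriptor, so numbers that keep climbing mean earlier descriptors are still open. Below is a minimal user-space sketch of that behavior. `DupMany` and `CloseAll` are hypothetical helper names, and the sketch assumes a single-threaded process so that nothing else allocates descriptors in between.

```cpp
#include <unistd.h>

#include <vector>

// Hypothetical helper: duplicate `fd` up to `n` times and return the new
// descriptor numbers in the order the kernel handed them out.
std::vector<int> DupMany(int fd, int n) {
  std::vector<int> fds;
  for (int i = 0; i < n; ++i) {
    int new_fd = dup(fd);
    if (new_fd < 0) break;  // e.g. EMFILE once RLIMIT_NOFILE is reached
    fds.push_back(new_fd);
  }
  return fds;
}

// Hypothetical helper: close every descriptor in `fds`.
void CloseAll(const std::vector<int> &fds) {
  for (int fd : fds) close(fd);
}
```

If the descriptors returned by `DupMany` are never closed, each call yields strictly larger numbers. If they are closed with `CloseAll` first, the next call gets the same numbers back. That difference is why a high FD value is a usable trigger for catching the leaking path.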

Kernel call stack analysis

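  1. First, add code that prints the call stack at the point where FDs are allocated. Because get_unused_fd_flags has too many call sites, an FD value of 500 is used as the trigger condition, so that in practice only the call stacks of leaking allocations are printed. A function trace_for_leak is defined here to print the call stack when a leak occurs.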
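/* Debug-only helper: log the fd and dump the current kernel call stack. */
int trace_for_leak(int fd)
{
  printk(KERN_WARNING "trace_for_leak fd:%d\n", fd);
  WARN(1, "fd:%d\n", fd);  /* WARN() prints the register dump and call trace */
  return 0;
}
EXPORT_SYMBOL(trace_for_leak);

int get_unused_fd_flags(unsigned flags)
{
  int fd = __alloc_fd(current->files, 0, rlimit(RLIMIT_NOFILE), flags);
  /* Only a leaking process is expected to reach fd numbers above 500. */
  if (fd > 500) {
    trace_for_leak(fd);
  }
  return fd;
}
EXPORT_SYMBOL(get_unused_fd_flags);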
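  2. Capture the log. All of the call stacks are identical, which indicates that there is only one leak site. The call stack shows that the leak is caused by a user-space call to dup, which identifies the leaking system call. However, dup is also called in a great many places in HWC, so the scope still has to be narrowed further.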
<KERNEL|(2)HwBinder:878_3 >: [  363.776621] fd:2668
<KERNEL|(2)HwBinder:878_3 >: [  363.776644] WARNING: CPU: 2 PID: 1403 at /home/lizhigang/disk1/src/neo3/kernel/msm-4.19/fs/file.c:545 trace_for_leak+0x34/0x48
<KERNEL|(2)HwBinder:878_3 >: [  363.776647] Modules linked in:
<KERNEL|(2)HwBinder:878_3 >: [  363.776655] CPU: 2 PID: 1403 Comm: HwBinder:878_3 Tainted: G S      W         4.19.81-perf+ #13
<KERNEL|(2)HwBinder:878_3 >: [  363.776658] Hardware name: Qualcomm Technologies, Inc. kona MTP (DT)
<KERNEL|(2)HwBinder:878_3 >: [  363.776663] pstate: 60400005 (nZCv daif +PAN -UAO)
<KERNEL|(2)HwBinder:878_3 >: [  363.776666] pc  trace_for_leak+0x34/0x48
<KERNEL|(2)HwBinder:878_3 >: [  363.776670] lr  trace_for_leak+0x34/0x48
<KERNEL|(2)HwBinder:878_3 >: [  363.776673] sp  ffffff8019519de0
<KERNEL|(2)HwBinder:878_3 >: [  363.776675] x29: ffffff8019519df0 x28: ffffffca9ceb1d80
<KERNEL|(2)HwBinder:878_3 >: [  363.776681] x27: 0000000000000000 x26: 0000000000000000
<KERNEL|(2)HwBinder:878_3 >: [  363.776688] x25: 0000000056000000 x24: 0000000000000000
<KERNEL|(2)HwBinder:878_3 >: [  363.776694] x23: ffffffca9ceb1d80 x22: 0000000000000126
<KERNEL|(2)HwBinder:878_3 >: [  363.776700] x21: 0000000000000017 x20: ffffffc9a8bff900
<KERNEL|(2)HwBinder:878_3 >: [  363.776706] x19: 0000000000000a6c x18: 00000000000000b0
<KERNEL|(2)HwBinder:878_3 >: [  363.776712] x17: 00000000000000b0 x16: 0000000000000028
<KERNEL|(2)HwBinder:878_3 >: [  363.776718] x15: ffffff9816703c9c x14: 0000000000003638
<KERNEL|(2)HwBinder:878_3 >: [  363.776724] x13: 0000000000000004 x12: 0000000000000000
<KERNEL|(2)HwBinder:878_3 >: [  363.776730] x11: 0000000000000001 x10: 0000000000000007
<KERNEL|(2)HwBinder:878_3 >: [  363.776736] x9 : d74c738a6b760b00 x8 : d74c738a6b760b00
<KERNEL|(2)HwBinder:878_3 >: [  363.776742] x7 : 0000000000000000 x6 : ffffff9817a55a9c
<KERNEL|(2)HwBinder:878_3 >: [  363.776748] x5 : ffffff8019519aa8 x4 : 000000000000000c
<KERNEL|(2)HwBinder:878_3 >: [  363.776754] x3 : 0000000038363632 x2 : 000000000000000c
<KERNEL|(2)HwBinder:878_3 >: [  363.776760] x1 : 0000000000000000 x0 : 000000000000000c
<KERNEL|(2)HwBinder:878_3 >: [  363.776767] Call trace:
<KERNEL|(2)HwBinder:878_3 >: [  363.776772] trace_for_leak+0x34/0x48
<KERNEL|(2)HwBinder:878_3 >: [  363.776777] get_unused_fd_flags+0x44/0x54
<KERNEL|(2)HwBinder:878_3 >: [  363.776781] ksys_dup+0x30/0x98
<KERNEL|(2)HwBinder:878_3 >: [  363.776785] __arm64_sys_dup+0x1c/0x2c
<KERNEL|(2)HwBinder:878_3 >: [  363.776792] el0_svc_common+0xa4/0x16c
<KERNEL|(2)HwBinder:878_3 >: [  363.776796] el0_svc_handler+0x7c/0x98
<KERNEL|(2)HwBinder:878_3 >: [  363.776801] el0_svc+0x8/0xc
<KERNEL|(2)HwBinder:878_3 >: [  363.776804] ---[ end trace 8608268fa69eeafc ]---
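
The threshold trap used in the kernel patch above can be illustrated in user space as well. This is only an analogy to clarify the technique, not what was done in the investigation: `DupWithLeakTrap` and its callback parameter are hypothetical names, and a real callback would capture a backtrace instead of just recording the value.

```cpp
#include <unistd.h>

#include <functional>

// Hypothetical user-space analog of the get_unused_fd_flags patch: duplicate
// `fd` and invoke `on_suspect` only when the new descriptor number exceeds
// `threshold`. Ordinary allocations with small numbers stay silent, so only
// the leaking call path produces reports.
int DupWithLeakTrap(int fd, int threshold,
                    const std::function<void(int)> &on_suspect) {
  int new_fd = dup(fd);
  if (new_fd > threshold) {
    on_suspect(new_fd);
  }
  return new_fd;
}
```

The threshold trades noise for latency: a low value fires on normal allocations, while a high value delays the first report until the leak has grown past it.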

User-space call stack analysis

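  1. The kernel stack shows that the FD leak is caused directly by a user-space system call, so the user-space call stack is needed as well, and the kernel and user stacks should be viewed together. This can be done with bpftrace, a tool built on eBPF, which can print the kernel stack and the user stack together. Note that the user-space binary has to be replaced with one that includes the symbol table; otherwise only addresses are printed and the function names cannot be seen. In an Android build tree, the executables with symbol tables are under ./out/target/product/XXXX/symbols/, so the vendor.qti.hardware.display.composer-service from that directory is used to replace the corresponding file on the device. Then run bpftrace; a single one-line command is enough.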

bpftrace -e 'kprobe:trace_for_leak { @[probe, pid, tid, kstack,ustack] = count(); }'

The output of the run is shown below:

XXXXXXX:/data/local/tmp/bpftools # ./bpftrace -e 'kprobe:trace_for_leak { @[probe, pid, tid, kstack,ustack] = count(); }'
tar: invalid tar format
Attaching 1 probe...
^C
@[kprobe:trace_for_leak, 861, 1022,
    trace_for_leak+0
    ksys_dup+48
    __arm64_sys_dup+28
    el0_svc_common+164
    el0_svc_handler+124
    el0_svc+8
,
    dup+8
    sdm::HWDeviceDRM::Commit(sdm::HWLayers*)+224
    sdm::HWPeripheralDRM::Commit(sdm::HWLayers*)+244
    sdm::DisplayBase::Commit(sdm::LayerStack*)+632
    sdm::DisplayBuiltIn::Commit(sdm::LayerStack*)+1040
    sdm::HWCDisplay::CommitLayerStack()+4228
    sdm::HWCDisplayBuiltIn::Present(int*)+632
    vendor::qti::hardware::display::composer::V2_1::implementation::QtiComposerClient::CommandReader::presentDisplay(unsigned long, int&, std::__1::vector<unsigned long, std::__1::allocator<unsigned long> >&, std::__1::vector<int, std::__1::allocator<int> >&)+2544
    vendor::qti::hardware::display::composer::V2_1::implementation::QtiComposerClient::CommandReader::parseCommonCmd(android::hardware::graphics::composer::V2_3::IComposerClient::Command, unsigned short)+10284
    vendor::qti::hardware::display::composer::V2_1::implementation::QtiComposerClient::CommandReader::parse()+472
    vendor::qti::hardware::display::composer::V2_1::implementation::QtiComposerClient::executeCommands_2_3(unsigned int, android::hardware::hidl_vec<android::hardware::hidl_handle> const&, std::__1::function<void (android::hardware::graphics::composer::V2_1::Error, bool, unsigned int, android::hardware::hidl_vec<android::hardware::hidl_handle> const&)>)+124
    android::hardware::graphics::composer::V2_2::BnHwComposerClient::_hidl_executeCommands_2_2(android::hidl::base::V1_0::BnHwBase*, android::hardware::Parcel const&, android::hardware::Parcel*, std::__1::function<void (android::hardware::Parcel&)>)+532
    vendor::qti::hardware::display::composer::V2_1::BnHwQtiComposerClient::onTransact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+2984
    android::hardware::BHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+76
    android::hardware::IPCThreadState::getAndExecuteCommand()+1044
    android::hardware::IPCThreadState::joinThreadPool(bool)+156
    0x7dcef2fa58
    android::Thread::_threadLoop(void*)+332
    __pthread_start(void*)+40
    __start_thread+68
]: 216
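  2. The call stack confirms that the dup system call is invoked directly inside HWDeviceDRM::Commit and causes the leak. This matches the earlier analysis, and the scope is now much narrower: the problem is inside HWDeviceDRM::Commit.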

Code analysis

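  1. At this point the problem is essentially pinned to the HWDeviceDRM::Commit function. The function is simple. After looking at each function it calls and ruling out the other two, the problem is narrowed further to HWDeviceDRM::AtomicCommit.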
DisplayError HWDeviceDRM::Commit(HWLayers *hw_layers) {
  DTRACE_SCOPED();
  DisplayError err = kErrorNone;
  registry_.Register(hw_layers);
  if (default_mode_) {
    err = DefaultCommit(hw_layers);
  } else {
    err = AtomicCommit(hw_layers);
  }
  return err;
}
  2. The code in HWDeviceDRM::AtomicCommit calls Sys::dup_ in two places. Add a log statement at each of them to print the FD value.
DisplayError HWDeviceDRM::AtomicCommit(HWLayers *hw_layers) {
  // ...
  for (uint32_t i = 0; i < hw_layer_info.hw_layers.size(); i++) {
    Layer &layer = hw_layer_info.hw_layers.at(i);
    HWRotatorSession *hw_rotator_session = &hw_layers->config[i].hw_rotator_session;
    if (hw_rotator_session->mode == kRotatorOffline) {
      hw_rotator_session->output_buffer.release_fence_fd = Sys::dup_(release_fence);
      DLOGW("output_buffer fd:%d", hw_rotator_session->output_buffer.release_fence_fd);
    } else {
      layer.input_buffer.release_fence_fd = -1;
      if (!enable_cac_) {
        // The layer.input_buffer.release_fence_fd created by dup here is never
        // closed, which leaks the fd. With the log added, the fd value grows rapidly.
        layer.input_buffer.release_fence_fd = Sys::dup_(release_fence);
        DLOGW("input_buffer fd:%d", layer.input_buffer.release_fence_fd);
      }
    }
  }
  // ...
}
  3. Run again and capture the log. The input_buffer fd value increases rapidly, which confirms the location of the FD leak.
864   864 W SDM     : HWDeviceDRM::AtomicCommit: input_buffer fd:386
864   864 W SDM     : HWDeviceDRM::AtomicCommit: input_buffer fd:387
864   864 W SDM     : HWDeviceDRM::AtomicCommit: input_buffer fd:388
864   864 W SDM     : HWDeviceDRM::AtomicCommit: input_buffer fd:389
864   864 W SDM     : HWDeviceDRM::AtomicCommit: input_buffer fd:390
864   864 W SDM     : HWDeviceDRM::AtomicCommit: input_buffer fd:391
864   864 W SDM     : HWDeviceDRM::AtomicCommit: input_buffer fd:392
864   864 W SDM     : HWDeviceDRM::AtomicCommit: input_buffer fd:393

Summary

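The analysis above pinpointed the exact line of code that causes the FD leak. The root cause is layer.input_buffer.release_fence_fd in HWDeviceDRM::AtomicCommit: the descriptor created by dup is never closed correctly, so it leaks. Once the root cause is known, the fix is straightforward: close the fd correctly and the leak goes away. With that, the problem is resolved.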
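
One way to make "close it correctly" hard to forget is to give every duplicated fence descriptor a single owner that closes it automatically. The following is a minimal C++ sketch under stated assumptions, not the actual fix applied in HWC: `UniqueFd` and `StoreReleaseFence` are hypothetical names that do not exist in the SDM code, and where the close belongs in the real code depends on which component owns the release fence.

```cpp
#include <unistd.h>

#include <utility>

// Hypothetical RAII owner for a file descriptor (not part of SDM/HWC).
// The descriptor is closed when the owner is destroyed or given a new value.
class UniqueFd {
 public:
  explicit UniqueFd(int fd = -1) : fd_(fd) {}
  ~UniqueFd() { Reset(); }

  UniqueFd(const UniqueFd &) = delete;
  UniqueFd &operator=(const UniqueFd &) = delete;
  UniqueFd(UniqueFd &&other) noexcept : fd_(other.Release()) {}
  UniqueFd &operator=(UniqueFd &&other) noexcept {
    if (this != &other) Reset(other.Release());
    return *this;
  }

  int Get() const { return fd_; }
  // Give up ownership without closing; the caller must close the result.
  int Release() { return std::exchange(fd_, -1); }
  // Close the currently owned descriptor (if any) and take ownership of `fd`.
  void Reset(int fd = -1) {
    if (fd_ >= 0) close(fd_);
    fd_ = fd;
  }

 private:
  int fd_;
};

// Hypothetical stand-in for the per-layer assignment in AtomicCommit: store a
// duplicate of `release_fence` in `slot`, closing whatever the slot held before.
void StoreReleaseFence(UniqueFd &slot, int release_fence) {
  slot.Reset(dup(release_fence));
}
```

With a raw int field, every assignment of a new dup result silently drops the previous descriptor. With an owner like this, repeated commits reuse the same one or two descriptor numbers instead of climbing the way the log above shows.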