Listening for the Android ANR Signal and Capturing All Thread Stack Traces

I covered how ANR works in an earlier article; if you are interested, take a look: [Framework] Understanding Android ANR in Depth

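When AMS sends the ANR signal to an app process, the process's Signal Catcher thread catches it and dumps the stacks of all threads into the /data/anr directory. Reading that directory requires root. On an emulator this is easy, because adb root grants root access directly. On an ordinary phone it is harder, but the files can be exported with the adb bugreport command.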

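So there are ways to get the ANR dump file offline, but they are cumbersome. Android does not provide a dedicated interface for ANR callbacks, and in production there is no way to retrieve the dump file from a user's device. This article therefore shows how to listen for the ANR signal and capture the dump produced at ANR time.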

Listening for the ANR signal

On Android the ANR signal is SIGQUIT. It is blocked by default, so a handler installed for it would never be called; we need to unblock it first:

sigset_t sig_sets;
sigemptyset(&sig_sets);
sigaddset(&sig_sets, SIGQUIT);
pthread_sigmask(SIG_UNBLOCK, &sig_sets, nullptr);
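To confirm that the unblock actually took effect, the current mask can be queried with the same pthread_sigmask() call. The following is a minimal sketch; the helper names isSignalBlocked and unblockSignal are made up for illustration and are not part of the article's library.

```cpp
#include <pthread.h>
#include <signal.h>

// Hypothetical helper: true if `sig` is currently blocked in the calling thread.
static bool isSignalBlocked(int sig) {
    sigset_t current;
    sigemptyset(&current);
    // Passing nullptr as the new set only queries the current mask.
    pthread_sigmask(SIG_SETMASK, nullptr, &current);
    return sigismember(&current, sig) == 1;
}

// Hypothetical helper: unblocks `sig` in the calling thread.
// Returns 0 on success or an error number on failure.
static int unblockSignal(int sig) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, sig);
    return pthread_sigmask(SIG_UNBLOCK, &set, nullptr);
}
```

Note that the signal mask is per thread, so pthread_sigmask() only changes the mask of the thread that calls it.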

Once it is unblocked, we can install our own signal handler:

struct sigaction sigAction{};
sigfillset(&sigAction.sa_mask);
sigAction.sa_flags = SA_RESTART | SA_ONSTACK | SA_SIGINFO;
sigAction.sa_sigaction = anrSignalHandler;
// sigaction() returns 0 on success, or -1 with the reason in errno.
int ret = sigaction(SIGQUIT, &sigAction, nullptr);
if (ret == 0) {
    LOGD("Monitor anr signal success.");
} else {
    LOGE("Monitor anr signal fail: %d", errno);
}

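anrSignalHandler in the code above is a pointer to our handler function, and it is registered through sigaction(). The third parameter of sigaction() receives the previous action: pass a pointer to a struct sigaction and the old action is written to that address. With the old handler in hand, we could forward the signal to it after receiving it.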

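I do not save the original handler here, though. I tried, but calling back into the original handler after receiving the signal produced an error, and I still do not know why. So I use a different way to pass the signal on to the original handling logic, described below.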
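For reference, saving the previous action and forwarding to it safely could look roughly like the sketch below. It is not the article's code: gOldAction, gHandled, demoHandler and installDemoHandler are hypothetical names. The key point is that the saved action may be SIG_DFL or SIG_IGN, which are not callable functions, so it must be checked before being invoked.

```cpp
#include <signal.h>

static struct sigaction gOldAction;         // previous action, filled in by sigaction()
static volatile sig_atomic_t gHandled = 0;  // set by our handler

static void demoHandler(int sig, siginfo_t *info, void *uc) {
    gHandled = 1;
    if (gOldAction.sa_flags & SA_SIGINFO) {
        // Previous action was a three-argument handler.
        if (gOldAction.sa_sigaction != nullptr) {
            gOldAction.sa_sigaction(sig, info, uc);
        }
    } else if (gOldAction.sa_handler != SIG_DFL && gOldAction.sa_handler != SIG_IGN) {
        // Previous action was a real one-argument handler.
        gOldAction.sa_handler(sig);
    }
    // SIG_DFL / SIG_IGN are markers, not functions, so they are never called.
}

// Installs demoHandler for `sig` and stores the previous action in gOldAction.
static int installDemoHandler(int sig) {
    struct sigaction sa{};
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = demoHandler;
    return sigaction(sig, &sa, &gOldAction);
}
```

One possible explanation for the error, offered only as a guess: ART's Signal Catcher waits for SIGQUIT synchronously instead of registering a handler, so the saved action is probably SIG_DFL, and calling it as a function would crash.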

Now let's look at my signal handler:

static void anrSignalHandler(int sig, siginfo_t *sig_info, void *uc) {
    LOGD("Receive anr signal.");
    // Sender pid, read from the raw siginfo padding.
    int fromPid1 = sig_info->_si_pad[3];
    int fromPid2 = sig_info->_si_pad[4];
    int myPid = getpid();
    if (fromPid1 != myPid && fromPid2 != myPid) {
        // The signal came from another process: treat it as an ANR.
        pthread_mutex_lock(lock);
        if (dumpState == NO_DUMP) {
            dumpState = WAITING_ANR_DUMP;
        } else {
            LOGE("Skip dump anr, because state: %d", dumpState);
        }
        pthread_mutex_unlock(lock);
    }
    // Forward SIGQUIT to the Signal Catcher thread so it still writes its dump.
    syscall(SYS_tgkill, myPid, gSignalCatcherTid, SIGQUIT);
}

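As mentioned earlier, the ANR signal is sent to the app process by AMS, so for a real ANR the sender is never our own process. Our process can also send SIGQUIT to itself, simply by calling kill(), so we check that the sender is not our own process before doing the ANR handling. After receiving the ANR signal we also need to forward it to the Signal Catcher thread, which is done with syscall(SYS_tgkill, myPid, gSignalCatcherTid, SIGQUIT);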
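The sender check can also be illustrated with the standard si_pid field of siginfo_t. This sketch swaps in si_pid for the raw _si_pad offsets used in the handler above; gSenderPid, recordSender, installSenderRecorder and lastSignalCameFromSelf are hypothetical names.

```cpp
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t gSenderPid = -1;  // pid of the last sender seen

static void recordSender(int, siginfo_t *info, void *) {
    // si_pid is filled in for signals sent with kill() or tgkill().
    gSenderPid = info->si_pid;
}

// Installs recordSender for `sig`; SA_SIGINFO is required to receive siginfo_t.
static int installSenderRecorder(int sig) {
    struct sigaction sa{};
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = recordSender;
    return sigaction(sig, &sa, nullptr);
}

// True if the last recorded signal was sent by this process itself.
static bool lastSignalCameFromSelf() {
    return gSenderPid == getpid();
}
```

A signal raised by the process itself reports its own pid, while one sent by another process, such as system_server during an ANR, reports that process's pid.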

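That raises the next question: how do we get the tid of Signal Catcher? On Linux, /proc/[pid] holds a lot of information about a process, and /proc/[pid]/task contains one entry per thread of that process. Each entry is a directory named after the tid, and the comm file inside it contains the thread's name.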

OPD2A0:/proc/26483/task $ ls
16343  16346  16348  16350  16354  16357  16374  16377  16379  16381  16392  16394  16396  16398  16400  16402  16405  16412  16577  22976  22978
16344  16347  16349  16351  16355  16365  16376  16378  16380  16390  16393  16395  16397  16399  16401  16404  16407  16576  16814  22977  26483

So by reading these files we can find the tid for a given thread name, and the other way around.
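The reverse direction, from tid to thread name, is a single file read. A minimal sketch, assuming /proc is mounted; readThreadName and currentTid are hypothetical helpers, not part of the article's code:

```cpp
#include <cstdio>
#include <string>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: kernel thread id of the calling thread.
static pid_t currentTid() {
    return static_cast<pid_t>(syscall(SYS_gettid));
}

// Hypothetical helper: reads /proc/self/task/<tid>/comm.
// Returns an empty string if the thread does not exist or cannot be read.
static std::string readThreadName(pid_t tid) {
    char path[64];
    snprintf(path, sizeof(path), "/proc/self/task/%d/comm", static_cast<int>(tid));
    FILE *fp = fopen(path, "r");
    if (fp == nullptr) {
        return "";
    }
    char name[64] = {0};
    if (fgets(name, sizeof(name), fp) == nullptr) {
        name[0] = '\0';
    }
    fclose(fp);
    std::string result(name);
    // The kernel terminates comm with a newline; drop it.
    if (!result.empty() && result.back() == '\n') {
        result.pop_back();
    }
    return result;
}
```

For example, readThreadName(currentTid()) returns the name of the calling thread.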

Here is my reference code:

int getSignalCatcherTid() {
    char processPath[MAX_BUFFER_SIZE];
    int size = snprintf(processPath, sizeof(processPath), "/proc/%d/task", getpid());
    if (size < 0 || size >= MAX_BUFFER_SIZE) {
        LOGE("Read proc path fail, read buffer size: %d", size);
        return -1;
    }
    DIR *processDir = opendir(processPath);
    if (processDir == nullptr) {
        LOGE("Read process dir fail.");
        return -1;
    }
    int tid = -1;
    dirent *child;
    while ((child = readdir(processDir)) != nullptr) {
        // Skip "." and ".."; every other entry is a directory named after a tid.
        if (!isNumberStr(child->d_name, 256)) {
            continue;
        }
        char filePath[MAX_BUFFER_SIZE];
        size = snprintf(filePath, sizeof(filePath), "%s/%s/comm", processPath, child->d_name);
        if (size < 0 || size >= MAX_BUFFER_SIZE) {
            continue;
        }
        int fd = open(filePath, O_RDONLY);
        if (fd < 0) {
            continue; // the thread may have exited in the meantime
        }
        char threadName[MAX_BUFFER_SIZE];
        ssize_t len = read(fd, threadName, sizeof(threadName) - 1);
        close(fd);
        if (len <= 0) {
            continue;
        }
        // comm ends with '\n'; terminate the string and strip the newline.
        threadName[len] = '\0';
        if (threadName[len - 1] == '\n') {
            threadName[len - 1] = '\0';
        }
        if (strcmp(threadName, "Signal Catcher") == 0) {
            tid = atoi(child->d_name);
            break;
        }
    }
    closedir(processDir);
    return tid;
}
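The isNumberStr helper used above is not shown in the article. A minimal sketch of what it might look like, assuming the second argument is the maximum number of bytes to inspect:

```cpp
#include <cstddef>

// Hypothetical implementation: true if `str` is a non-empty run of ASCII
// digits terminated by '\0' within the first `maxLen` bytes.
static bool isNumberStr(const char *str, size_t maxLen) {
    if (str == nullptr || maxLen == 0 || str[0] == '\0') {
        return false;
    }
    for (size_t i = 0; i < maxLen; ++i) {
        if (str[i] == '\0') {
            return true;
        }
        if (str[i] < '0' || str[i] > '9') {
            return false;
        }
    }
    return false; // no terminator found within maxLen bytes
}
```

With this definition the "." and ".." entries of /proc/[pid]/task are rejected and only tid directories pass.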

Getting the dump written by the Signal Catcher thread

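We can now observe the ANR signal, but how do we get the dump that the Signal Catcher thread writes? First, note that Signal Catcher is a thread inside our own app process, created when the process starts. To get what it writes, we can use a PLT/GOT hook on its write() call and capture the data being written. I have written about PLT/GOT hooks before; if you are interested, see: A Hands-On Guide to Hooking Native Methods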

I use xHook to do the hooking:


int hookSignalCatcherWrite() {
    int apiLevel = android_get_device_api_level();
    int signalCatcherTid = gSignalCatcherTid;
    if (signalCatcherTid <= 0) {
        signalCatcherTid = getSignalCatcherTid();
        gSignalCatcherTid = signalCatcherTid;
    }
    LOGD("ApiLevel: %d, SignalCatcherTid: %d", apiLevel, signalCatcherTid);
    if (signalCatcherTid <= 0) {
        LOGE("Get Signal Catcher tid fail.");
        return -1;
    }
    const char *writeLibName;
    if (apiLevel >= 30 || apiLevel == 25 || apiLevel == 24) {
        writeLibName = ".*/libc.so$";
    } else if (apiLevel == 29) {
        writeLibName = ".*/libbase.so$";
    } else {
        writeLibName = ".*/libart.so$";
    }
    int ret = xhook_register(writeLibName,
                             "write",
                             (void *) my_write,
                             nullptr);
    LOGD("xhook hook write register result: %d", ret);
    if (ret == 0) {
        ret = xhook_refresh(1);
        LOGD("xhook hook write refresh result: %d", ret);
        return ret;
    } else {
        return ret;
    }
}

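The so library to hook differs across Android versions. I followed what others have done here; the most reliable approach is to read the Android source and check which so the Signal Catcher code is built into.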

Now a quick look at the implementation of our hook function my_write:

ssize_t my_write(int fd, const void *const buf, size_t count) {
    if (gSignalCatcherTid == gettid()) {
        pthread_mutex_lock(lock);
        if (dumpState != NO_DUMP) {
            LOGD("SignalCatcher write count: %zu", count);
            long time = get_time_millis();
            const char *dir;
            if (dumpState == WAITING_STACK_DUMP) {
                dir = gStackTraceDir;
                LOGD("Start stack dump.");
            } else {
                dir = gAnrTraceDir;
                LOGD("Start anr dump.");
            }
            // The output file is named after the current timestamp.
            char stackFileName[MAX_BUFFER_SIZE];
            snprintf(stackFileName, sizeof(stackFileName), "%s/%ld.text", dir, time);
            LOGD("Create stack file: %s", stackFileName);
            int fileFd = open(stackFileName, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
            if (fileFd >= 0) {
                // Copy the dump into our own file.
                write(fileFd, buf, count);
                close(fileFd);
                // Publish the timestamp (also the file name) through gStackNotifyFd.
                write(gStackNotifyFd, &time, sizeof(time));
            } else {
                LOGE("Create file fail: %d", errno);
            }
        }
        pthread_mutex_unlock(lock);
    }
    // Always pass the data on to the real write().
    return origin_write(fd, buf, count);
}

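We first check whether the current thread is Signal Catcher, and then check the state we set ourselves. If both match, we treat the data as the ANR dump we want and write it to our own file.
Finally, the real write() implementation is called.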
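One detail worth keeping in mind when copying the dump: write() may write fewer bytes than requested or be interrupted by a signal. A small retry loop handles both cases. This is a sketch only; writeFully is a hypothetical helper and is not part of the code above.

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Hypothetical helper: keeps calling write() until all `count` bytes are
// written. Returns false on any error other than EINTR.
static bool writeFully(int fd, const void *buf, size_t count) {
    const char *p = static_cast<const char *>(buf);
    while (count > 0) {
        ssize_t n = write(fd, p, count);
        if (n < 0 && errno == EINTR) {
            continue; // interrupted before writing anything: retry
        }
        if (n <= 0) {
            return false;
        }
        p += n;
        count -= static_cast<size_t>(n);
    }
    return true;
}
```

In my_write, the call that copies the buffer into the dump file could be replaced by such a helper so that a large dump is not silently truncated.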

Actively capturing all thread stacks

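Capturing stack dumps through the system's ANR signal is a passive approach. Sometimes we want to know the state of every thread in the app right now. In that case we can send SIGQUIT to the Signal Catcher thread ourselves and pick up the resulting dump through the same hook. The signal is sent the same way as in our custom signal handler, with syscall(SYS_tgkill, myPid, gSignalCatcherTid, SIGQUIT);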
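The mechanics of signalling one specific thread with tgkill can be tried outside Android. The sketch below uses a stand-in thread that waits with sigwait(), and SIGUSR1 instead of SIGQUIT so that it is harmless to run. All names in it (gWaiterTid, gReceivedSignal, waiterThread, sendSignalToWaiter) are hypothetical.

```cpp
#include <atomic>
#include <pthread.h>
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>

static std::atomic<pid_t> gWaiterTid{0};
static std::atomic<int> gReceivedSignal{0};

// Stand-in for a "signal catcher": waits synchronously for SIGUSR1.
static void *waiterThread(void *) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    gWaiterTid.store(static_cast<pid_t>(syscall(SYS_gettid)));
    int sig = 0;
    if (sigwait(&set, &sig) == 0) {
        gReceivedSignal.store(sig);
    }
    return nullptr;
}

// Starts the waiter, signals it by tid, and returns the signal it received.
static int sendSignalToWaiter() {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    // Block SIGUSR1 first; the new thread inherits this mask, as sigwait() requires.
    pthread_sigmask(SIG_BLOCK, &set, nullptr);
    pthread_t thread;
    if (pthread_create(&thread, nullptr, waiterThread, nullptr) != 0) {
        return -1;
    }
    while (gWaiterTid.load() == 0) {
        usleep(1000); // wait until the thread has published its tid
    }
    // tgkill(tgid, tid, sig) delivers the signal to exactly one thread.
    syscall(SYS_tgkill, getpid(), gWaiterTid.load(), SIGUSR1);
    pthread_join(thread, nullptr);
    return gReceivedSignal.load();
}
```

If the signal arrives before the thread reaches sigwait(), it stays pending on that thread and sigwait() returns it immediately.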

Example ANR dump file

// ...
suspend all histogram:	Sum: 165us 99% C.I. 1us-21us Avg: 7.173us Max: 21us
DALVIK THREADS (23):
"Signal Catcher" daemon prio=10 tid=2 Runnable
  | group="system" sCount=0 ucsCount=0 flags=0 obj=0x13600338 self=0xb400007bf3a26000
  | sysTid=5041 nice=-20 cgrp=default sched=0/0 handle=0x7bf4ffbcb0
  | state=R schedstat=( 28127001 5785385 10 ) utm=2 stm=0 core=5 HZ=100
  | stack=0x7bf4f04000-0x7bf4f06000 stackSize=991KB
  | held mutexes= "mutator lock"(shared held)
  native: #00 pc 0000000000570ec4  /apex/com.android.art/lib64/libart.so (art::DumpNativeStack(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, int, BacktraceMap*, char const*, art::ArtMethod*, void*, bool)+148) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #01 pc 0000000000675a24  /apex/com.android.art/lib64/libart.so (art::Thread::DumpStack(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, bool, BacktraceMap*, bool) const+340) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #02 pc 000000000069310c  /apex/com.android.art/lib64/libart.so (art::DumpCheckpoint::Run(art::Thread*)+908) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #03 pc 000000000068ccac  /apex/com.android.art/lib64/libart.so (art::ThreadList::RunCheckpoint(art::Closure*, art::Closure*)+508) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #04 pc 000000000068bf54  /apex/com.android.art/lib64/libart.so (art::ThreadList::Dump(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, bool)+1796) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #05 pc 000000000068b70c  /apex/com.android.art/lib64/libart.so (art::ThreadList::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char> >&)+1340) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #06 pc 000000000063d300  /apex/com.android.art/lib64/libart.so (art::Runtime::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char> >&)+208) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #07 pc 0000000000651dc0  /apex/com.android.art/lib64/libart.so (art::SignalCatcher::HandleSigQuit()+1376) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #08 pc 0000000000650e54  /apex/com.android.art/lib64/libart.so (art::SignalCatcher::Run(void*)+340) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #09 pc 00000000000eb720  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #10 pc 000000000007e2d0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  (no managed stack frames)
"main" prio=5 tid=1 Native
  | group="main" sCount=1 ucsCount=0 flags=1 obj=0x73869160 self=0xb400007c11e10800
  | sysTid=15609 nice=-10 cgrp=default sched=1073741824/0 handle=0x7cbd635500
  | state=S schedstat=( 1086854706 330699698 4068 ) utm=63 stm=45 core=6 HZ=100
  | stack=0x7fd3027000-0x7fd3029000 stackSize=8188KB
  | held mutexes=
  native: #00 pc 0000000000078dec  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #01 pc 00000000002833dc  /apex/com.android.art/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+140) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #02 pc 000000000043bf3c  /apex/com.android.art/lib64/libart.so (art::(anonymous namespace)::CheckJNI::FindClass(_JNIEnv*, char const*) (.llvm.11132044689082360456)+460) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #03 pc 0000000000128ebc  /system/lib64/libandroid_runtime.so (android::NativeDisplayEventReceiver::dispatchVsync(long, android::PhysicalDisplayId, unsigned int, android::gui::VsyncEventData)+92) (BuildId: 4da95a3e8bdc1b6a6682b67c10bdc47e)
  native: #04 pc 00000000000c1820  /system/lib64/libgui.so (android::DisplayEventDispatcher::handleEvent(int, int, void*)+272) (BuildId: 1d69b7a57862392ad7b7712ed6197e18)
  native: #05 pc 000000000001836c  /system/lib64/libutils.so (android::Looper::pollInner(int)+1068) (BuildId: 6038dbf95f76d91eaf842148f10f89ea)
  native: #06 pc 0000000000017ee0  /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+112) (BuildId: 6038dbf95f76d91eaf842148f10f89ea)
  native: #07 pc 000000000016410c  /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44) (BuildId: 4da95a3e8bdc1b6a6682b67c10bdc47e)
  at android.os.MessageQueue.nativePollOnce(Native method)
  at android.os.MessageQueue.next(MessageQueue.java:339)
  at android.os.Looper.loopOnce(Looper.java:186)
  at android.os.Looper.loop(Looper.java:351)
  at android.app.ActivityThread.main(ActivityThread.java:8377)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:584)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1013)
"Jit thread pool worker thread 0" daemon prio=5 tid=4 Native
  | group="system" sCount=1 ucsCount=0 flags=1 obj=0x135c0720 self=0xb400007bf3a47800
  | sysTid=5046 nice=9 cgrp=default sched=0/0 handle=0x7bf4d01cb0
  | state=S schedstat=( 12650002 4618461 48 ) utm=0 stm=0 core=1 HZ=100
  | stack=0x7bf4c02000-0x7bf4c04000 stackSize=1023KB
  | held mutexes=
  native: #00 pc 0000000000078dec  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #01 pc 00000000002833dc  /apex/com.android.art/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+140) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #02 pc 0000000000694b78  /apex/com.android.art/lib64/libart.so (art::ThreadPool::GetTask(art::Thread*)+120) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #03 pc 0000000000693f50  /apex/com.android.art/lib64/libart.so (art::ThreadPoolWorker::Run()+144) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #04 pc 00000000006939cc  /apex/com.android.art/lib64/libart.so (art::ThreadPoolWorker::Callback(void*)+172) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #05 pc 00000000000eb720  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #06 pc 000000000007e2d0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  (no managed stack frames)
"perfetto_hprof_listener" prio=10 tid=8 Native (still starting up)
  | group="" sCount=1 ucsCount=0 flags=1 obj=0x0 self=0xb400007bf3a6f800
  | sysTid=5044 nice=-20 cgrp=default sched=0/0 handle=0x7bf4efdcb0
  | state=S schedstat=( 119385 21461461 4 ) utm=0 stm=0 core=6 HZ=100
  | stack=0x7bf4e06000-0x7bf4e08000 stackSize=991KB
  | held mutexes=
  native: #00 pc 00000000000d5774  /apex/com.android.runtime/lib64/bionic/libc.so (read+4) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #01 pc 000000000001dee4  /apex/com.android.art/lib64/libperfetto_hprof.so (void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, ArtPlugin_Initialize::$_34> >(void*)+260) (BuildId: 13ee3b989b35c4e1d3ac372e558e2961)
  native: #02 pc 00000000000eb720  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #03 pc 000000000007e2d0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  (no managed stack frames)
"binder:15609_1" prio=5 tid=9 Native
  | group="main" sCount=1 ucsCount=0 flags=1 obj=0x13640020 self=0xb400007bf4867400
  | sysTid=5054 nice=-20 cgrp=default sched=0/0 handle=0x7bf42dfcb0
  | state=S schedstat=( 333385 370462 3 ) utm=0 stm=0 core=4 HZ=100
  | stack=0x7bf41e8000-0x7bf41ea000 stackSize=991KB
  | held mutexes=
  native: #00 pc 00000000000d5a54  /apex/com.android.runtime/lib64/bionic/libc.so (__ioctl+4) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #01 pc 00000000000873bc  /apex/com.android.runtime/lib64/bionic/libc.so (ioctl+156) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #02 pc 000000000005f48c  /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+284) (BuildId: 821d5191ea842f908c210c9c338b12f6)
  native: #03 pc 000000000005f788  /system/lib64/libbinder.so (android::IPCThreadState::getAndExecuteCommand()+24) (BuildId: 821d5191ea842f908c210c9c338b12f6)
  native: #04 pc 00000000000600a4  /system/lib64/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+68) (BuildId: 821d5191ea842f908c210c9c338b12f6)
  native: #05 pc 0000000000090048  /system/lib64/libbinder.so (android::PoolThread::threadLoop()+24) (BuildId: 821d5191ea842f908c210c9c338b12f6)
  native: #06 pc 0000000000013550  /system/lib64/libutils.so (android::Thread::_threadLoop(void*)+416) (BuildId: 6038dbf95f76d91eaf842148f10f89ea)
  native: #07 pc 00000000000cc59c  /system/lib64/libandroid_runtime.so (android::AndroidRuntime::javaThreadShell(void*)+140) (BuildId: 4da95a3e8bdc1b6a6682b67c10bdc47e)
  native: #08 pc 00000000000eb720  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #09 pc 000000000007e2d0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  (no managed stack frames)
// ... 

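The file contains the Java and native stacks of every thread, along with thread state, lock information, stack size, and other useful details, all of which help a great deal when analyzing problems.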

Finally

All of the code above is open source and is also published as a standalone aar library; if you are interested, take a look: dumpstack