一:阐明

本文针对相对比较复杂的anr问题。

二:ANR分类阐明

Android ANR深入分析
上图为笔者从android9的代码中梳理出的一切anr。

一切anr都与四大组件相关在一起,全体上可分为两类

  1. 与用户直接交互引起的anr,触发源头为input。
  2. 后台组件操作超时引起的anr。触发源头又分为两类:
    • 第一类是服务和播送,都是直接在ams的线程中触发,
    • 第二类是contentprovider,触发源头为建议contentprovider操作的进程主线程。

而一切anr的原因,能够分为以下几类:

  1. 受anr监听的事情自身超时,如touch事情处理超时。这类anr比较简单定位
  2. 受anr监听的事情无法进入主线程开端处理,如主线程被block或使命忙,等待时刻超时引起anr。
  3. 窗口状况不对,如绘制超时,焦点过错等引起。

ps:主线程堵塞了不一定会引起ANR,有些状况下,主线程或许堵塞了一些时刻,但在这些时刻内,没有事情进入,因而就不触发anr。

不同ANR事情的监听完结

  1. input anr,经过InputDispatcher线程监听(直接用looper的epoll timeout,对于当时inputChannel,是堵塞式的,对其他channel则是非堵塞式的)
  2. service anr,经过AMS的主线程(线程名:ActivityManager)监听(经过Handler的delay,非堵塞式)
  3. broadcast anr,经过AMS的主线程(线程名:ActivityManager)监听(经过Handler的delay,非堵塞式)
  4. ContentProvider anr,经过当时进程的主线程去监听,触发后调用AMS去触发anr(经过Handler的delay,非堵塞式)

下图为典型的input anr触发流程

sequenceDiagram
InputDispatcher->>APPMainThread: send event to app
Note left of APPMainThread:正常状况下,app finsh event 的时刻很短
APPMainThread-->>InputDispatcher: app finish event handle
InputDispatcher->>APPMainThread: send event to app
Note left of APPMainThread:anr时,app主线程在超时时刻内
Note left of APPMainThread:不向InputDispatcher返回状况
InputDispatcher->>APPMainThread: send event to app
Note right of InputDispatcher:鄙人一个event事情时,触发一个5s的计时,
InputDispatcher ->> ActivityManager : triger anr
Note right of InputDispatcher:5s后如果还未收到finish,就触发anr

下图为典型的 service/broadcast anr触发流程

sequenceDiagram
Note right of ActivityManager:正常状况时:
Note right of ActivityManager:在迈进程之前,发一个超时的handler音讯
ActivityManager->> APPMainThread: service/broadcast call
Note left of APPMainThread:ams调用相关的操作
APPMainThread-->>ActivityManager: app service/broadcast finish handle
Note right of ActivityManager:在迈进程之后,移出超时的handler音讯
Note right of ActivityManager:ANR时:
ActivityManager->> APPMainThread: service/broadcast call
Note left of APPMainThread:使用一向无响应,超时音讯届时后发出
Note left of APPMainThread:ams就履行对应的anr事情。

下图为典型的 contentProvider anr触发流程

sequenceDiagram
Note right of APP1MainThread: app1在迈进程调用contentProvider前,发一个超时音讯
APP1MainThread->>APP2MainThread: call contentProvider method
APP2MainThread-->>APP1MainThread : return
Note right of APP1MainThread: 在超时前收到 contentProvider 的调用,移出超时的handler音讯
Note right of APP1MainThread: 在超时前未收到 contentProvider 的调用,则向ams触发anr
APP1MainThread->>ActivityManager: triger anr

三:一些比较典型的复杂anr

1: 主线程闲暇

"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x752a4b08 self=0x7204214c00
  | sysTid=4434 nice=-10 cgrp=default sched=0/0 handle=0x7289e7b548
  | state=S schedstat=( 857618499 263812800 679 ) utm=70 stm=15 core=1 HZ=100
  | stack=0x7fccb8f000-0x7fccb91000 stackSize=8MB
  | held mutexes=
  kernel: (couldn't read /proc/self/task/4434/stack)
  native: #00 pc 000000000006e118  /system/lib64/libc.so (offset 6e000) (__epoll_pwait+8)
  native: #01 pc 0000000000013f8c  /system/lib64/libutils.so (android::Looper::pollInner(int)+144)
  native: #02 pc 0000000000013e60  /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+60)
  native: #03 pc 000000000011f9c0  /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44)
  at android.os.MessageQueue.nativePollOnce(Native method)
  at android.os.MessageQueue.next(MessageQueue.java:326)
  at android.os.Looper.loop(Looper.java:160)
  at android.app.ActivityThread.main(ActivityThread.java:6733)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:506)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:858)

如上trace,出现anr的进程的主线程显闲暇,很多同学在遇到类anr的时分会比较无助,大多数状况下,使用开发的同学会直接转给做系统开发的同学持续跟进。 遇到这类问题时,先确认anr触发时刻和trace dump 时刻差多久,以为附录示例为参考

  • 10-14 15:43:19.393在inputdispatcher中触发anr,
  • 10-14 15:43:19.421 在ams中开端处理anr
  • 10-14 15:43:19.934 开端dump anr进程的调用stack
  • 10-14 15:43:20.239 完结 anr进程的调用栈dump.

在这个示例中,从inputdispatcher发现事情超时,到开端dump anr进程的trace,花了500+ms,到dump完结,一共花了850+ms,这次不算长,也不算短,但实践上在这次中,主线程dump出来的stack是处于闲暇态的。当然,这次anr并不是因为主线现已过了block时刻点,但实践开发debug中,有很多状况下是dump的时分,anr的进程现已过了堵塞点。此次只能根据logcag的信息,找到anr触发的详细时刻,再去看主线程的log.否则就简单在过错的时刻,根据过错的log的剖析。

2:跨线程/进程堵塞

被其他线程block的是很简单定位的,trace中经过lock锁就能够容易找到,这儿就不展开了。

迈进程堵塞:(这儿只包括迈进程的部分,更早的调用栈包括公司包名信息,就不贴出来了)

"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x73ec6b08 self=0x7e38414c00
  | sysTid=2411 nice=-10 cgrp=default sched=0/0 handle=0x7ebdf9b548
  | state=S schedstat=( 1110108606 4113781665 1463 ) utm=72 stm=38 core=0 HZ=100
  | stack=0x7ffbe55000-0x7ffbe57000 stackSize=8MB
  | held mutexes=
  kernel: (couldn't read /proc/self/task/2411/stack)
  native: #00 pc 000000000006e218  /system/lib64/libc.so (__ioctl+4)
  native: #01 pc 0000000000029124  /system/lib64/libc.so (ioctl+136)
  native: #02 pc 000000000005a5e4  /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+256)
  native: #03 pc 000000000005b374  /system/lib64/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+60)
  native: #04 pc 000000000005b1c0  /system/lib64/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+176)
  native: #05 pc 0000000000051318  /system/lib64/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+72)
  native: #06 pc 000000000006a47c  /system/lib64/libmedia.so (android::BpMediaPlayerService::getCodecList() const+248)
  native: #07 pc 00000000001633b4  /system/lib64/libstagefright.so (android::MediaCodecList::getInstance() [clone .cfi]+332)
  native: #08 pc 0000000000034128  /system/lib64/libmedia_jni.so (android_media_MediaCodecList_getCodecCount(_JNIEnv*, _jobject*)+36)
  at android.media.MediaCodecList.native_getCodecCount(Native method)
  at android.media.MediaCodecList.initCodecList(MediaCodecList.java:82)

这调用栈中,能够看到主线程处在binder调用中,详细是哪一个调用,一般看BpBinder::transact前一个调用的办法是什么,这儿是getCodecList,那就在整个anr文件中搜哪个进程在被调用这个办法。

"Binder:658_1" sysTid=951
  #00 pc 00053fac  /system/lib/libc.so (__ioctl+8)
  #01 pc 00021c59  /system/lib/libc.so (ioctl+36)
  #02 pc 00015def  /system/lib/libhwbinder.so (android::hardware::IPCThreadState::talkWithDriver(bool)+190)
  #03 pc 0001696d  /system/lib/libhwbinder.so (android::hardware::IPCThreadState::waitForResponse(android::hardware::Parcel*, int*)+268)
  #04 pc 00015bb7  /system/lib/libhwbinder.so (android::hardware::BpHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+90)
  #05 pc 00028cfb  /system/lib/android.hardware.media.omx@1.0.so (android::hardware::media::omx::V1_0::BpHwOmx::_hidl_allocateNode(android::hardware::IInterface*, android::hardware::details::HidlInstrumentor*, android::hardware::hidl_string const&, android::sp<android::hardware::media::omx::V1_0::IOmxObserver> const&, std::__1::function<void (android::hardware::media::omx::V1_0::Status, android::sp<android::hardware::media::omx::V1_0::IOmxNode> const&)>)+406)
  #06 pc 000296ef  /system/lib/android.hardware.media.omx@1.0.so (android::hardware::media::omx::V1_0::BpHwOmx::allocateNode(android::hardware::hidl_string const&, android::sp<android::hardware::media::omx::V1_0::IOmxObserver> const&, std::__1::function<void (android::hardware::media::omx::V1_0::Status, android::sp<android::hardware::media::omx::V1_0::IOmxNode> const&)>)+78)
  #07 pc 0001dceb  /system/lib/libmedia_omx.so (android::hardware::media::omx::V1_0::utils::LWOmx::allocateNode(char const*, android::sp<android::IOMXObserver> const&, android::sp<android::IOMXNode>*)+98)
  #08 pc 0005d51d  /system/lib/libstagefright.so (android::ACodec::queryCapabilities(char const*, char const*, char const*, bool, android::MediaCodecInfo::CapabilitiesWriter*)+200)
  #09 pc 000a0115  /system/lib/libstagefright.so (android::(anonymous namespace)::queryCapabilities(android::hardware::media::omx::V1_0::IOmxStore::NodeInfo const&, char const*, bool, android::MediaCodecInfo::CapabilitiesWriter*)+76)
  #10 pc 0009fed9  /system/lib/libstagefright.so (android::OmxInfoBuilder::buildMediaCodecList(android::MediaCodecListWriter*)+996)
  #11 pc 0008ffc3  /system/lib/libstagefright.so (android::MediaCodecList::MediaCodecList(std::__1::vector<android::MediaCodecListBuilderBase*, std::__1::allocator<android::MediaCodecListBuilderBase*>>)+234)
  #12 pc 0009015d  /system/lib/libstagefright.so (android::MediaCodecList::getLocalInstance()+68)
  #13 pc 00039279  /system/lib/libmediaplayerservice.so (android::MediaPlayerService::getCodecList() const+36)
  #14 pc 00033ce9  /system/lib/libmedia.so (android::BnMediaPlayerService::onTransact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+640)
  #15 pc 000361cf  /system/lib/libbinder.so (android::BBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+70)
  #16 pc 0003da17  /system/lib/libbinder.so (android::IPCThreadState::executeCommand(int)+410)
  #17 pc 0003d7a3  /system/lib/libbinder.so (android::IPCThreadState::getAndExecuteCommand()+106)
  #18 pc 0003dccb  /system/lib/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+38)
  #19 pc 0005470d  /system/lib/libbinder.so (android::PoolThread::threadLoop()+12)
  #20 pc 0000c1a3  /system/lib/libutils.so (android::Thread::_threadLoop(void*)+166)
  #21 pc 00063cb5  /system/lib/libc.so (__pthread_start(void*)+22)
  #22 pc 0001e0ad  /system/lib/libc.so (__start_thread+22)

搜索发现,在/system/bin/mediaserver进程中,有多个 getCodecList 被调用,这是一个互斥的调用,在此进程中,只有上面显示的这个调用栈在向下调用。 在这个调用中,能够看这个调用最后也是一个binder调用,也处在被其他进程block的状况。 那下一步便是持续找BpHwBinder::transact 的前一个调用的是什么办法,从trace中看是_hidl_allocateNode 接下来持续去追 _hidl_allocateNode的调用。直到找到真实的block点。

附录:一个完好的input anr 流程剖析

下面经过一个完好的input anr触发流程,展现一个anr问题的剖析进程(简单的anr并不需要完好的剖析流程)这个流程分红5步,很多人在anr的trace文件中没有看到直接的问题点时,就直接从第5步所在位置开端剖析,并从这个时刻点向前推算,但这或许并不精确。就如示例中的一样,ActivityManager : ANR in所在时刻点事实上与anr触发时有5s的差距。

正确的过程应该是从Event log am_anr开端,如果没有event log,则向前看,直接找InputDispatcher,如果是input事情超时,会有一个wait queue age,用这个时刻向前推,则能够找到精确的input时刻。

找到精确的input时刻后,再经过anr进程的主线程log,去看这个时刻点上,主线程做了什么,如果前后都没有主线程的log,则有或许是堵塞了,再找anr前最近的主线程log去看,大概率便是那行log之后的操作引起的。

  1. 触发anr的事情

10-14 15:43:14.386 3693 3706 D (input事情,详细log包括公司信息,就不贴出来了)

  1. InputDispatcher 检测到事情超时,触发anr事情

默认超时时刻为5s,超时方式为epoll_wait超时

10-14 15:43:19.393 1946 2036 I InputDispatcher : Application is not responding: AppWindowToken{d0046c6 token=Token{7a154a1 ActivityRecord{e26d508 u0 com.XXX.XXX/.activity.MainActivity t49}}}. It has been 5006.9ms since event, 5006.2ms since wait started. Reason: Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.
10-14 15:43:19.394 1946 2036 I WindowManager : Input event dispatching timed out sending to application AppWindowToken{d0046c6 token=Token{7a154a1 ActivityRecord{e26d508 u0 com.XXX.XXX/.activity.MainActivity t49}}}. Reason: Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.
  1. AMS收到anr事情,开端处理anr事情
10-14 15:43:19.421 1946 1961 I am_anr : [0,4827,com.XXX.XXX,550026829,Input dispatching timed out (Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.)]
  1. 开端顺次dump当时进程的调用栈
10-14 15:43:19.934 1946 1961 I system_server : libdebuggerd_client: started dumping process 4827
10-14 15:43:20.239 1946 1961 I system_server : libdebuggerd_client: done dumping process 4827
10-14 15:43:20.239 1946 1961 I system_server : libdebuggerd_client: started dumping process 1946
10-14 15:43:20.240 1946 1951 I system_server : Thread[2,tid=1951,WaitingInMainSignalCatcherLoop,Thread*=0x6f3f229000,peer=0x12d00000,"Signal Catcher"]: reacting to signal 3
10-14 15:43:20.240 1946 1951 I system_server :
10-14 15:43:20.421 1946 1951 I system_server : Wrote stack traces to '[tombstoned]'
10-14 15:43:20.421 1946 1961 I system_server : libdebuggerd_client: done dumping process 1946
...
10-14 15:43:26.570 1946 1961 I system_server : libdebuggerd_client: started dumping process 5840
10-14 15:43:26.716 1946 1961 I system_server : libdebuggerd_client: done dumping process 5840
  1. 完结调用栈,把cpu的负载状况打印出来
10-14 15:43:26.757 1946 1961 E ActivityManager : ANR in com.XXX.XXX (com.XXX.XXX/.activity.MainActivity)
10-14 15:43:26.757 1946 1961 E ActivityManager : PID: 4827
10-14 15:43:26.757 1946 1961 E ActivityManager : Reason: Input dispatching timed out (Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.)
10-14 15:43:26.757 1946 1961 E ActivityManager : Load: 10.53 / 4.93 / 1.85
10-14 15:43:26.757 1946 1961 E ActivityManager : CPU usage from 0ms to 7320ms later (2022-10-14 15:43:19.396 to 2022-10-14 15:43:26.716):
10-14 15:43:26.757 1946 1961 E ActivityManager : 28% 2227/com.XXX.XXX: 16% user + 12% kernel / faults: 7991 minor 8 major
10-14 15:43:26.757 1946 1961 E ActivityManager : 19% 530/surfaceflinger: 9.5% user + 9.6% kernel / faults: 993 minor
10-14 15:43:26.757 1946 1961 E ActivityManager : 14% 6036/adbd: 1.7% user + 12% kernel / faults: 154514 minor
10-14 15:43:26.757 1946 1961 E ActivityManager : 11% 533/vendor.XXX.hardware.XXX@1.0-service: 9.6% user + 2% kernel / faults: 2508 minor
10-14 15:43:26.757 1946 1961 E ActivityManager : 10% 1946/system_server: 3.9% user + 6.6% kernel / faults: 4631 minor

至此,和anr相关的log信信息现已取完。

实践剖析问题时,能够结合上面的5个时刻节点去找对应的信息。