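Preface

Anyone with some development experience has probably heard of Watchdog. So what is Watchdog? Watch Dogs is a game developed by Ubisoft, and the series is now up to Watch Dogs: Legion. Just kidding. What is Watchdog, and why was it designed? Hearing the name, you probably think of deadlocks right away. It is a service started by SystemServer and is essentially a thread. This time we'll go through the source code to see what it actually does.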

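Prerequisites

Some preparation is needed before reading the source, or you may not be able to follow it. First, you need to understand the Handler mechanism. You also need to understand locks and deadlocks, though I suspect most people only hear about Watchdog after learning about deadlocks. You should at least know what SystemServer does. Understanding the design idea behind Monitor helps, but not knowing it won't stop you from following the main flow here.

Two classes matter in this source: HandlerChecker and Monitor. Roughly, the flow is: use a Handler to post a message to the monitored thread and start timing. If the message is handled within 30 seconds, nothing happens. If it takes more than 30 seconds but less than 60, stack traces are dumped. If it still hasn't been handled after 60 seconds, things blow up.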

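Main flow: source code walkthrough

PS: the source is from API level 29.

First, SystemServer creates and starts this thread (you could also say it starts this service).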

private void startBootstrapServices() {
    ......
    final Watchdog watchdog = Watchdog.getInstance();
    watchdog.start();
    ......
    watchdog.init(mSystemContext, mActivityManagerService);
    ......
}

It's a singleton, so let's look at the constructor.

private Watchdog() {
    super("watchdog");
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);
    // Add checker for main thread.  We only do a quick check since there
    // can be UI running on the thread.
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
            "main thread", DEFAULT_TIMEOUT));
    // Add checker for shared UI thread.
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
            "ui thread", DEFAULT_TIMEOUT));
    // And also check IO thread.
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
            "i/o thread", DEFAULT_TIMEOUT));
    // And the display thread.
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
            "display thread", DEFAULT_TIMEOUT));
    // And the animation thread.
    mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
            "animation thread", DEFAULT_TIMEOUT));
    // And the surface animation thread.
    mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
            "surface animation thread", DEFAULT_TIMEOUT));
    // For the main flow, the Binder threads can be ignored for now
    // Initialize monitor for Binder threads.
    addMonitor(new BinderThreadMonitor());
    mOpenFdMonitor = OpenFdMonitor.create();
    // See the notes on DEFAULT_TIMEOUT.
    assert DB ||
            DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
}

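For the main flow you can ignore the Binder threads for now; I won't go into them here. You can clearly see that the constructor takes the Handlers of some important threads, wraps each one in a HandlerChecker object, and adds it to the list mHandlerCheckers. Put simply, each HandlerChecker gathers the information about one thread, and Watchdog holds a list of these per-thread objects.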

public final class HandlerChecker implements Runnable {
    private final Handler mHandler;
    private final String mName;
    private final long mWaitMax;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
    private boolean mCompleted;
    private Monitor mCurrentMonitor;
    private long mStartTime;
    private int mPauseCount;
    HandlerChecker(Handler handler, String name, long waitMaxMillis) {
        mHandler = handler;
        mName = name;
        mWaitMax = waitMaxMillis;
        mCompleted = true;
    }
    ......
}

Next, let's look at the init method.

public void init(Context context, ActivityManagerService activity) {
    mActivity = activity;
    context.registerReceiver(new RebootRequestReceiver(),
            new IntentFilter(Intent.ACTION_REBOOT),
            android.Manifest.permission.REBOOT, null);
}
final class RebootRequestReceiver extends BroadcastReceiver {
    @Override
    public void onReceive(Context c, Intent intent) {
        if (intent.getIntExtra("nowait", 0) != 0) {
            rebootSystem("Received ACTION_REBOOT broadcast");
            return;
        }
        Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
    }
}
void rebootSystem(String reason) {
    Slog.i(TAG, "Rebooting system because: " + reason);
    IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
    try {
        pms.reboot(false, reason, false);
    } catch (RemoteException ex) {
    }
}

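This is clearly the reboot path: it registers a broadcast receiver and reboots when the broadcast arrives. It isn't part of the main flow, so a quick look is enough.

Now for the key part, the main flow. Watchdog extends Thread, so the start call above ends up in the run method here. Let's get it running.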

@Override
public void run() {
    boolean waitedHalf = false;
    while (true) {
        ......
        synchronized (this) {
            long timeout = CHECK_INTERVAL;
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }
            ......
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                ......
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }
            ......
            if (!fdLimitTriggered) {
                // For now, think of this as the branch taken under normal conditions
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        Slog.i(TAG, "WAITED_HALF");
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(pids, null, null,
                            getInterestingNativePids());
                        waitedHalf = true;
                    }
                    continue;
                }
                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
            } else {
                ......
            }
            ......
        }
        // Dump logs and then exit
        ......
        waitedHalf = false;
    }
}

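I've elided some of the code so it reads more comfortably, mainly because too much code might scare people off.

First there's an infinite loop. It iterates over mHandlerCheckers, the list of HandlerChecker objects we created in the constructor, and calls scheduleCheckLocked on each one.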

public void scheduleCheckLocked() {
    if (mCompleted) {
        // Safe to update monitors in queue, Handler is not in the middle of work
        mMonitors.addAll(mMonitorQueue);
        mMonitorQueue.clear();
    }
    if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
            || (mPauseCount > 0)) {
        mCompleted = true;
        return;
    }
    if (!mCompleted) {
        // we already have a check in flight, so no need
        return;
    }
    mCompleted = false;
    mCurrentMonitor = null;
    mStartTime = SystemClock.uptimeMillis();
    mHandler.postAtFrontOfQueue(this);
}

HandlerChecker holds a list of Monitor objects. Monitor is an interface; other classes implement it and provide the monitor method. More on this later.

public interface Monitor {
    void monitor();
}

mCompleted starts out as true (the constructor sets it).

if (mCompleted) {
    // Safe to update monitors in queue, Handler is not in the middle of work
    mMonitors.addAll(mMonitorQueue);
    mMonitorQueue.clear();
}

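This moves the elements of mMonitorQueue into mMonitors. What is that for? It's a bit hard to explain, so think of it this way: Watchdog's run method is an infinite loop that keeps calling scheduleCheckLocked, and the check itself works on mMonitors. Nobody should be adding elements to that list while a check is iterating over it, or things would get messy. So a newly added Monitor is parked in mMonitorQueue and only merged into mMonitors at the start of a round, when no check is in flight. That's what this block does.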
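The same staging idea can be sketched outside Android. This is only an illustration, not the AOSP code: the class StagedMonitors and its methods add, beginRound and endRound are hypothetical names, and plain strings stand in for Monitor objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the mMonitorQueue / mMonitors pair in HandlerChecker.
class StagedMonitors {
    private final List<String> active = new ArrayList<>();   // plays the role of mMonitors
    private final List<String> pending = new ArrayList<>();  // plays the role of mMonitorQueue
    private boolean completed = true;                        // true means no check is in flight

    // New entries always land in the pending list first.
    synchronized void add(String monitor) {
        pending.add(monitor);
    }

    // Start of a round: pending entries are promoted only if the previous round finished.
    // Returns a snapshot of what this round will check.
    synchronized List<String> beginRound() {
        if (completed) {
            active.addAll(pending);
            pending.clear();
        }
        completed = false;
        return new ArrayList<>(active);
    }

    // End of a round: the next beginRound may promote pending entries again.
    synchronized void endRound() {
        completed = true;
    }
}
```

An entry added while a round is still in flight stays invisible until a later round starts after endRound, which is the behaviour the real code gets from checking mCompleted before addAll.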

if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
        || (mPauseCount > 0)) {
    mCompleted = true;
    return;
}

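If mMonitors is empty and this handler's MessageQueue is polling, or if the checker is paused (mPauseCount > 0), mCompleted is set to true and the method returns right away. isPolling returns true when the Looper thread is idle and waiting for its next message, which is a good sign that the loop is alive and not stuck inside a callback. Think about it: the goal is to decide whether this thread is stuck, and a queue that is waiting for work means it isn't, so with no Monitors to run there is nothing left to check. If this part doesn't make sense, it's worth revisiting the Handler mechanism.

If neither condition holds, we move on.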

// Ignore this for now; let's mark this spot as point A1
if (!mCompleted) {
    // we already have a check in flight, so no need
    return;
}

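You can ignore this block for now. On the first pass mCompleted is true, as seen above, so we continue. Let's mark this spot as point A1; the flow will come back to it later.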

mCompleted = false;
mCurrentMonitor = null;
mStartTime = SystemClock.uptimeMillis();
mHandler.postAtFrontOfQueue(this);

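mCompleted is set to false, mStartTime records the current time as the start of the whole check, and the message is posted with postAtFrontOfQueue. Because this is passed in, the HandlerChecker's own run method is what gets called.

Now for a test of your fundamentals: which thread does this run method execute on?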

@Override
public void run() {
    final int size = mMonitors.size();
    for (int i = 0 ; i < size ; i++) {
        synchronized (Watchdog.this) {
            mCurrentMonitor = mMonitors.get(i);
        }
        mCurrentMonitor.monitor();
    }
    synchronized (Watchdog.this) {
        mCompleted = true;
        mCurrentMonitor = null;
    }
}

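This loops over mMonitors and calls monitor on each one. This is the deadlock-detection logic. For now, think of it this way: if a deadlock has occurred, mCurrentMonitor.monitor() gets stuck right here and never reaches mCompleted = true.

Posting the message through the handler already switches threads: run executes on the thread that owns the handler's Looper, so the Watchdog thread carries on by itself. Let's go back to Watchdog's run method.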
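To make the thread switch concrete, here is a small sketch in plain Java. It assumes a single-thread executor as a stand-in for a Handler bound to another thread's Looper; ThreadHopDemo and runOnWorker are hypothetical names, not Android APIs.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ThreadHopDemo {
    // Posts a task to a worker thread and returns the name of the thread that ran it.
    static String runOnWorker(String workerName) {
        // Assumption: the single-thread executor plays the role of the monitored Looper thread.
        ExecutorService worker =
                Executors.newSingleThreadExecutor(r -> new Thread(r, workerName));
        try {
            Future<String> ranOn = worker.submit(() -> Thread.currentThread().getName());
            return ranOn.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("posted from: " + Thread.currentThread().getName());
        System.out.println("ran on:      " + runOnWorker("monitored-thread"));
    }
}
```

The task is created on the calling thread but executes on the worker, just as HandlerChecker.run is scheduled by the Watchdog thread but executes on the monitored one. With that in mind, here is the part of Watchdog's run method that follows the scheduling loop.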

long start = SystemClock.uptimeMillis();
while (timeout > 0) {
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    try {
        wait(timeout);
        // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
    } catch (InterruptedException e) {
        Log.wtf(TAG, e);
    }
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}

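wait(timeout) blocks the thread, and its state becomes TIMED_WAITING. Here timeout is CHECK_INTERVAL, which is 30 seconds (half of DEFAULT_TIMEOUT). Because wait can return early, the loop recomputes the remaining time from uptimeMillis and keeps waiting until the full interval has passed.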
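The loop shape is worth a closer look, so here is a minimal sketch of the same pattern in plain Java. It assumes System.nanoTime in place of SystemClock.uptimeMillis, and IntervalWaiter with its methods is a hypothetical name used only for this example.

```java
class IntervalWaiter {
    private final Object lock = new Object();

    // Blocks for at least intervalMillis, going back to wait after any early wakeup.
    // Returns the elapsed time in milliseconds, measured with the same clock.
    long waitFullInterval(long intervalMillis) {
        final long start = System.nanoTime();
        long remaining = intervalMillis;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Like the original loop, keep waiting instead of bailing out.
                }
                long elapsed = (System.nanoTime() - start) / 1_000_000;
                remaining = intervalMillis - elapsed;
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Wakes the waiter early; the loop above simply waits for the remainder.
    void poke() {
        synchronized (lock) {
            lock.notifyAll();
        }
    }
}
```

Calling poke from another thread, or a spurious wakeup, shortens one wait call but not the interval as a whole, because the remaining time is always derived from the start timestamp.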

After the 30 seconds, execution reaches this part:

final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
    // The monitors have returned; reset
    waitedHalf = false;
    continue;
} else if (waitState == WAITING) {
    // still waiting but within their configured intervals; back off and recheck
    continue;
} else if (waitState == WAITED_HALF) {
    if (!waitedHalf) {
        Slog.i(TAG, "WAITED_HALF");
        // We've waited half the deadlock-detection interval.  Pull a stack
        // trace and wait another half.
        ArrayList<Integer> pids = new ArrayList<Integer>();
        pids.add(Process.myPid());
        ActivityManagerService.dumpStackTraces(pids, null, null,
            getInterestingNativePids());
        waitedHalf = true;
    }
    continue;
}
private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED;
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}

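evaluateCheckerCompletionLocked simply iterates over the HandlerCheckers, calls getCompletionStateLocked on each one, and returns a single final state based on all of them. I'll explain the states later. First, look at getCompletionStateLocked (it's worth thinking about which thread this method runs on).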

public int getCompletionStateLocked() {
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) {
            return WAITING;
        } else if (latency < mWaitMax) {
            return WAITED_HALF;
        }
    }
    return OVERDUE;
}

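HandlerChecker's getCompletionStateLocked is the counterpart of scheduleCheckLocked.

If mCompleted is true, it returns COMPLETED, which is the normal state. As seen above, mCompleted is true under normal conditions and stays false only while that thread is still stuck. What does "that thread is still stuck" mean? After postAtFrontOfQueue in scheduleCheckLocked, there are two ways to get stuck.

(1) The Message that this Handler's MessageQueue was already processing never finishes, so 30 seconds after postAtFrontOfQueue the run method still hasn't executed.
(2) mCurrentMonitor.monitor() inside run is stuck and is still stuck after 30 seconds. More precisely, the thread is BLOCKED waiting for a lock, so it never reaches mCompleted = true.

In both cases mCompleted is false, and latency measures how long the check has been outstanding. With the default mWaitMax of 60 seconds, a latency under 30 seconds (mWaitMax/2) returns WAITING, 30 seconds or more but under 60 returns WAITED_HALF, and 60 seconds or more returns OVERDUE.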
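The thresholds are easy to check in isolation. Below is a sketch that restates the classification as a pure function; the constant values match the ones Watchdog defines, while the class CompletionState and the method classify are names made up for this example.

```java
class CompletionState {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // Same decision as getCompletionStateLocked, with the inputs passed in explicitly.
    static int classify(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    public static void main(String[] args) {
        long waitMax = 60_000; // assumed default timeout of 60 seconds
        for (long latency : new long[] {10_000, 30_000, 45_000, 60_000}) {
            System.out.println(latency + " ms -> state " + classify(false, latency, waitMax));
        }
    }
}
```

Note that the boundaries are half-open: exactly 30 seconds already counts as WAITED_HALF, and exactly 60 seconds already counts as OVERDUE.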

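Back in evaluateCheckerCompletionLocked, the line state = Math.max(state, hc.getCompletionStateLocked()); exists because several threads are being checked. If even one of them is unhealthy, the method returns the worst state among them. This works because the state constants are ordered by severity.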
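A tiny sketch of that fold, assuming the states are the integers 0 (COMPLETED) through 3 (OVERDUE); WorstState and worstOf are hypothetical names for illustration.

```java
class WorstState {
    // Starts from COMPLETED (0) and keeps the maximum, like evaluateCheckerCompletionLocked.
    static int worstOf(int... states) {
        int worst = 0;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }

    public static void main(String[] args) {
        // Three healthy checkers and one that has waited half the timeout.
        System.out.println(worstOf(0, 0, 2, 0));
    }
}
```

One stuck checker is enough to move the whole round out of COMPLETED, no matter how many other threads answered in time.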

If COMPLETED is returned, this round was normal and the next round begins. If WAITING is returned, then on the next round, when HandlerChecker's scheduleCheckLocked runs, it takes the check at point A1:

if (!mCompleted) {
    // we already have a check in flight, so no need
    return;
}

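In that case there's no need to post the message again or to record a new start time. When WAITED_HALF is returned, dumpStackTraces is called to collect information. When OVERDUE is returned, information is collected and the process is killed so that the system restarts. Below is the source that collects the information and restarts; feel free to skip it.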


......
// If we got here, that means that the system is most likely hung.
// First collect stack traces from all threads of the system process.
// Then kill this process so that the system will restart.
EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
ArrayList<Integer> pids = new ArrayList<>();
pids.add(Process.myPid());
if (mPhonePid > 0) pids.add(mPhonePid);
final File stack = ActivityManagerService.dumpStackTraces(
        pids, null, null, getInterestingNativePids());
// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
SystemClock.sleep(5000);
// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
doSysRq('w');
doSysRq('l');
// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked.  (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
        public void run() {
            // If a watched thread hangs before init() is called, we don't have a
            // valid mActivity. So we can't log the error to dropbox.
            if (mActivity != null) {
                mActivity.addErrorToDropBox(
                        "watchdog", null, "system_server", null, null, null,
                        subject, null, stack, null);
            }
            StatsLog.write(StatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED, subject);
        }
    };
dropboxThread.start();
try {
    dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) {}
IActivityController controller;
synchronized (this) {
    controller = mController;
}
if (controller != null) {
    Slog.i(TAG, "Reporting stuck state to activity controller");
    try {
        Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
        // 1 = keep waiting, -1 = kill system
        int res = controller.systemNotResponding(subject);
        if (res >= 0) {
            Slog.i(TAG, "Activity controller requested to coninue to wait");
            waitedHalf = false;
            continue;
        }
    } catch (RemoteException e) {
    }
}
// Only kill the process if the debugger is not attached.
if (Debug.isDebuggerConnected()) {
    debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
    Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
    Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
    Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
    Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
    WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
    Slog.w(TAG, "*** GOODBYE!");
    Process.killProcess(Process.myPid());
    System.exit(10);
}
waitedHalf = false;

Supplement

Here are the definitions of the four states:

static final int COMPLETED = 0;
static final int WAITING = 1;
static final int WAITED_HALF = 2;
static final int OVERDUE = 3;

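COMPLETED is the normal state, the others are increasingly abnormal, and OVERDUE triggers the restart.

As for Monitor, any implementing class will do as an example. Most people seem to use AMS, so I'll use AMS too.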

public class ActivityManagerService extends IActivityManager.Stub
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {

AMS implements Watchdog.Monitor, and the AMS constructor contains:

Watchdog.getInstance().addMonitor(this);
Watchdog.getInstance().addThread(mHandler);
public void addMonitor(Monitor monitor) {
    synchronized (this) {
        mMonitorChecker.addMonitorLocked(monitor);
    }
}
public void addThread(Handler thread, long timeoutMillis) {
    synchronized (this) {
        final String name = thread.getLooper().getThread().getName();
        mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
    }
}

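Look at addThread first: besides the threads Watchdog adds in its own constructor, it exposes a method so that outside code can add more. addMonitor, in turn, adds the Monitor to the mMonitorQueue of mMonitorChecker, the foreground-thread checker.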

void addMonitorLocked(Monitor monitor) {
    // We don't want to update mMonitors when the Handler is in the middle of checking
    // all monitors. We will update mMonitors on the next schedule if it is safe
    mMonitorQueue.add(monitor);
}

Later, scheduleCheckLocked moves the contents of mMonitorQueue into mMonitors, as covered above. Now let's see how AMS implements monitor.

public void monitor() {
    synchronized (this) { }
}

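On the surface it does nothing, but it does acquire a lock. If another thread is holding that lock at this moment, the call to monitor becomes BLOCKED, and if that lasts long enough the Watchdog times out, as described above.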
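Here is a sketch of why an empty synchronized block works as a probe. It is plain Java and not taken from AOSP: LockProbeDemo, serviceLock and probeWhileHeld are hypothetical names, and latches are used only to make the ordering deterministic.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class LockProbeDemo {
    private final Object serviceLock = new Object();

    // Same shape as the AMS monitor(): returns only once the lock can be acquired.
    void monitor() {
        synchronized (serviceLock) { }
    }

    // Holds serviceLock on one thread, probes it from another, and reports
    // { probe finished while the lock was held, probe finished after release }.
    boolean[] probeWhileHeld(long holdCheckMillis) {
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch probed = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (serviceLock) {
                held.countDown();
                try { release.await(); } catch (InterruptedException ignored) { }
            }
        }, "lock-holder");
        Thread prober = new Thread(() -> { monitor(); probed.countDown(); }, "prober");
        try {
            holder.start();
            held.await();                 // the lock is definitely taken from here on
            prober.start();
            boolean whileHeld = probed.await(holdCheckMillis, TimeUnit.MILLISECONDS);
            release.countDown();          // the holder leaves its synchronized block
            boolean afterRelease = probed.await(5, TimeUnit.SECONDS);
            return new boolean[] { whileHeld, afterRelease };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown();          // never leave the holder parked
        }
    }
}
```

While the holder owns the lock the probe cannot finish, which is the situation in which the real HandlerChecker never gets to set mCompleted = true. As soon as the lock is released, the probe returns immediately.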

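Analysis

First, after reading the source, I find it less impressive overall than the source of some other components. Compared with the thread pool I wrote about in my previous post, the design feels a bit weaker. There are nice parts too, such as the mMonitorQueue and mMonitors design.

Working backwards from the design, why 30 seconds? I can't figure that out. Does 30 seconds mean something, was it just a roughly reasonable value, or was it derived from some principle?

There's also one thing I find puzzling. If anyone knows the answer, please enlighten me.

It's about getCompletionStateLocked: when would it return WAITING? The sequence is: record mStartTime -> sleep 30 seconds -> getCompletionStateLocked. Normally, the current time minus mStartTime inside getCompletionStateLocked must be at least 30 seconds, so the result is either COMPLETED, or WAITED_HALF or OVERDUE. When would it be WAITING?

While reading the source I also found something amusing that's worth sharing. In the run method, in the path that collects information and restarts, there's this comment: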

// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
SystemClock.sleep(5000);

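I didn't expect the official engineers to be this playful.

Finally, back to the title: what does the dog actually do?

If you search online, many people say Watchdog exists to detect deadlocks, which effectively ties Watchdog and deadlocks together. Even the official code has a comment to that effect where SystemServer starts it.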

// Start the watchdog as early as possible so we can crash the system server
// if we deadlock during early boot
traceBeginAndSlog("StartWatchdog");
final Watchdog watchdog = Watchdog.getInstance();
watchdog.start();
traceEnd();

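"if we deadlock during early boot" makes it sound as if Watchdog exists specifically to handle deadlocks. Of course, if a deadlock occurs, mCurrentMonitor.monitor() blocks, so it can be detected. But as I said above, the source shows two situations that cause a stall.

(1) The Message that this Handler's MessageQueue was already processing never finishes, so 30 seconds after postAtFrontOfQueue the run method still hasn't executed.
(2) mCurrentMonitor.monitor() inside run is stuck and is still stuck after 30 seconds. More precisely, the thread is BLOCKED waiting for a lock, so it never reaches mCompleted = true.

In the first case, suppose the previous message is a long-running operation. Then run never executes, and we don't even reach the deadlock check. Granted, the monitored threads are all special ones, and doing long-running work on something like the main thread isn't realistic. In the second case, does mCurrentMonitor.monitor() staying stuck necessarily mean a deadlock? Holding a lock for a long time without releasing it produces the same result.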
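The first case can be reproduced without any lock at all. The sketch below assumes a single-thread executor as a stand-in for a Looper thread, with the slow task playing the message that is already being handled when the probe is posted; SlowMessageDemo and its run method are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class SlowMessageDemo {
    // Returns { probe done while the slow message is running, probe done afterwards }.
    static boolean[] run() {
        ExecutorService looper = Executors.newSingleThreadExecutor();
        CountDownLatch finishSlowWork = new CountDownLatch(1);
        try {
            // A "message" that takes a long time but holds no lock at all.
            looper.submit(() -> {
                try { finishSlowWork.await(); } catch (InterruptedException ignored) { }
            });
            // The watchdog-style probe, stuck behind the slow message.
            Future<?> probe = looper.submit(() -> { });
            boolean doneWhileBusy = probe.isDone();
            finishSlowWork.countDown();       // the slow message finally returns
            probe.get(5, TimeUnit.SECONDS);   // now the probe gets its turn
            return new boolean[] { doneWhileBusy, probe.isDone() };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            finishSlowWork.countDown();
            looper.shutdown();
        }
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println("done while busy: " + r[0] + ", done afterwards: " + r[1]);
    }
}
```

No lock is contended here, yet from the watchdog's point of view the thread looks exactly as unresponsive as a deadlocked one until the slow message returns.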

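So personally I think Watchdog's purpose isn't only to detect deadlocks. It monitors certain threads and guards against two things: locks being held for so long that the threads can't respond, and long-running operations that keep them from responding in time. Consider the definition of a hardware watchdog as well: it periodically checks the chip's internal state and sends a reset signal as soon as something goes wrong. If it were only about detecting deadlocks, it could just as well have been called DeadlockWatchdog.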

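Summary

Watchdog's main flow: start an infinite loop that keeps posting a message to the designated threads and then sleeps for 30 seconds. After waking up, it checks whether the message has been handled. If so, it moves on to the next round. If not, it looks at the time elapsed since the message was posted: under 30 seconds, do nothing; between 30 and 60 seconds, collect information; 60 seconds or more, collect information and restart.

There are other details too, such as using SystemClock.uptimeMillis() to measure time, which I won't cover one by one here.

Overall the design idea is quite good: post a message, wait, then check whether it was handled. It's the same as ANR detection, a process of planting a bomb and then defusing it.

What I'm still unsure about is the 30-second choice and whether there's a rationale behind it, plus the question above: in what situation would the latency be under 30 seconds?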
