Background

If you run the code below to keep allocating memory, physical memory is eventually exhausted and every app process on the phone gets killed. In one crash investigation, the symptom was a crash with no crash log captured; the root cause turned out to be a memory leak that drove the device into low memory. So how can we monitor these low memory killer events online?

// Leak on purpose: allocate without freeing, and touch the pages so they are actually committed.
JNIEXPORT void JNICALL
Java_com_test_Java2C_tm(JNIEnv *env, jclass clazz, jlong size) {
    char *t = (char *) malloc(size);
    if (t != NULL) {
        memset(t, 0, size);
    }
}

Monitoring approach

Since Android R, the system process exposes an API to app processes: an app can call android.app.ActivityManager#getHistoricalProcessExitReasons to query the historical reasons its processes were killed. On every launch we can call this API, inspect the kill records, and report them.

Return a list of ApplicationExitInfo records containing the reasons for the most recent app deaths.
Note: System stores this historical information in a ring buffer and only the most recent records will be returned.
Note: In the case that this application was bound to an external service with flag Context.BIND_EXTERNAL_SERVICE, the process of that external service will be included in this package's exit info.
Params:
packageName: Optional, a null value means match all packages belonging to the caller's UID. If this package belongs to another UID, you must hold Manifest.permission.DUMP in order to retrieve it.
pid: A process ID that used to belong to this package but died later; a value of 0 means to ignore this parameter and return all matching records.
maxNum: The maximum number of results to be returned; a value of 0 means to ignore this parameter and return all matching records.
Returns: a list of ApplicationExitInfo records matching the criteria, sorted from most recent to least recent.
public List<ApplicationExitInfo> getHistoricalProcessExitReasons(@Nullable String packageName,
        @IntRange(from = 0) int pid, @IntRange(from = 0) int maxNum) {
    try {
        ParceledListSlice<ApplicationExitInfo> r = getService().getHistoricalProcessExitReasons(
                packageName, pid, maxNum, mContext.getUserId());
        return r == null ? Collections.emptyList() : r.getList();
    } catch (RemoteException e) {
        throw e.rethrowFromSystemServer();
    }
}

Looking at ApplicationExitInfo, the mReason field roughly distinguishes the causes below. Type 3 is the low memory killer; testing with the leak code above indeed produced 3.

Type  Description
1   process exited itself (System#exit)
2   OS signal (OsConstants#SIGKILL)
3   low memory killer
4   Java crash
5   native crash
6   ANR
7   REASON_INITIALIZATION_FAILURE
8   runtime permission change
9   excessive resource usage
10  killed by the user
11  killed in a device multi-user scenario
12  REASON_DEPENDENCY_DIED
13  REASON_OTHER
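For reporting, mReason can be mapped to a readable label at startup. A minimal sketch: the numeric values are the documented ApplicationExitInfo REASON_* constants, while the label strings are our own naming, not part of the API.

```java
/** Maps ApplicationExitInfo.mReason values to readable labels for reporting. */
public class ExitReasonLabel {
    static String label(int reason) {
        switch (reason) {
            case 1:  return "exit_self";            // REASON_EXIT_SELF (System.exit)
            case 2:  return "signaled";             // REASON_SIGNALED (e.g. SIGKILL)
            case 3:  return "low_memory";           // REASON_LOW_MEMORY (lmkd)
            case 4:  return "crash";                // REASON_CRASH (Java crash)
            case 5:  return "native_crash";         // REASON_CRASH_NATIVE
            case 6:  return "anr";                  // REASON_ANR
            case 7:  return "init_failure";         // REASON_INITIALIZATION_FAILURE
            case 8:  return "permission_change";    // REASON_PERMISSION_CHANGE
            case 9:  return "excessive_resource";   // REASON_EXCESSIVE_RESOURCE_USAGE
            case 10: return "user_requested";       // REASON_USER_REQUESTED
            case 11: return "user_stopped";         // REASON_USER_STOPPED (multi-user)
            case 12: return "dependency_died";      // REASON_DEPENDENCY_DIED
            case 13: return "other";                // REASON_OTHER
            default: return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println(label(3));   // low_memory
    }
}
```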

A few other relevant ApplicationExitInfo fields:

mImportance: the importance (priority) of the process at exit

App in the foreground: 100

App in the background: 400

mPss: memory occupied by the process at exit

mTimestamp: timestamp of the process exit
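Since the records persist in a ring buffer, a launch-time reporter will see the same entries on every start, so it should deduplicate by mTimestamp before uploading. A sketch of that logic, using a plain record as a stand-in for ApplicationExitInfo (the ExitRecord type and the reported-set persistence are hypothetical, not Android APIs):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of startup reporting: dedup exit records by timestamp, keep only low-memory kills.
 *  ExitRecord stands in for ApplicationExitInfo; reason value 3 is REASON_LOW_MEMORY. */
public class ExitReasonReporter {
    record ExitRecord(int reason, long timestamp) {}

    static List<ExitRecord> lowMemoryKillsToReport(List<ExitRecord> records,
                                                   Set<Long> alreadyReported) {
        List<ExitRecord> out = new ArrayList<>();
        for (ExitRecord r : records) {
            // Records survive across launches, so skip timestamps we reported before.
            if (r.reason() == 3 && alreadyReported.add(r.timestamp())) {
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<Long> reported = new HashSet<>(Set.of(100L));   // t=100 was reported last launch
        List<ExitRecord> fresh = lowMemoryKillsToReport(
                List.of(new ExitRecord(3, 100L), new ExitRecord(3, 200L), new ExitRecord(4, 300L)),
                reported);
        System.out.println(fresh.size());   // 1 (only the new low-memory record at t=200)
    }
}
```

In a real app, alreadyReported would be persisted (e.g. in SharedPreferences) between launches.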

Building the metrics

[Figure: Crash Troubleshooting Series, Part 12 | How to monitor low memory kills of your own process]

Device-dimension analysis

A brief look at the low memory kill mechanism

Source code search (based on Android 13)

Searching the source shows that kills with this reason do not happen in the system_server process: they are performed by the lmkd process, which then notifies system_server. In com.android.server.am.ProcessList we find mAppExitInfoSourceLmkd:

/**
 * Bookkeeping low memory kills info from lmkd.
 */
final AppExitInfoExternalSource mAppExitInfoSourceLmkd =
        new AppExitInfoExternalSource("lmkd", ApplicationExitInfo.REASON_LOW_MEMORY);

Debugging with stack traces

In com.android.server.am.LmkdConnection, system_server connects to lmkd over a LocalSocket; after lmkd kills a process it notifies system_server, which then records it:

public boolean connect() {
    synchronized (mLmkdSocketLock) {
        if (mLmkdSocket != null) {
            return true;
        }
        // temporary sockets and I/O streams
        final LocalSocket socket = openSocket();
        if (socket == null) {
            Slog.w(TAG, "Failed to connect to lowmemorykiller, retry later");
            return false;
        }
        final OutputStream ostream;
        final InputStream istream;
        try {
            ostream = socket.getOutputStream();
            istream = socket.getInputStream();
        } catch (IOException ex) {
            IoUtils.closeQuietly(socket);
            return false;
        }
        // execute onConnect callback
        if (mListener != null && !mListener.onConnect(ostream)) {
            Slog.w(TAG, "Failed to communicate with lowmemorykiller, retry later");
            IoUtils.closeQuietly(socket);
            return false;
        }
        // connection established
        mLmkdSocket = socket;
        mLmkdOutputStream = ostream;
        mLmkdInputStream = istream;
        mMsgQueue.addOnFileDescriptorEventListener(mLmkdSocket.getFileDescriptor(),
            EVENT_INPUT | EVENT_ERROR,
            new MessageQueue.OnFileDescriptorEventListener() {
                public int onFileDescriptorEvents(FileDescriptor fd, int events) {
                    return fileDescriptorEventHandler(fd, events);
                }
            }
        );
        mLmkdSocketLock.notifyAll();
    }
    return true;
}

Setting breakpoints confirms the flow. This stack shows system_server handling the unsolicited kill message from lmkd:

        at com.android.server.am.ProcessList$1.handleUnsolicitedMessage(ProcessList.java:810)
        at com.android.server.am.LmkdConnection.processIncomingData(LmkdConnection.java:217)
        at com.android.server.am.LmkdConnection.fileDescriptorEventHandler(LmkdConnection.java:172)
        at com.android.server.am.LmkdConnection.-$$Nest$mfileDescriptorEventHandler(Unknown Source:0)
        at com.android.server.am.LmkdConnection$1.onFileDescriptorEvents(LmkdConnection.java:158)
        at android.os.MessageQueue.dispatchEvents(MessageQueue.java:293)
        at android.os.MessageQueue.nativePollOnce(Native Method)
        at android.os.MessageQueue.next(MessageQueue.java:335)
        at android.os.Looper.loopOnce(Looper.java:161)
        at android.os.Looper.loop(Looper.java:288)
        at android.os.HandlerThread.run(HandlerThread.java:67)
        at com.android.server.ServiceThread.run(ServiceThread.java:44)

And this stack shows AppExitInfoTracker recording the exit reason:

        at com.android.server.am.AppExitInfoTracker.updateExistingExitInfoRecordLocked(AppExitInfoTracker.java:477)
        at com.android.server.am.AppExitInfoTracker.handleNoteProcessDiedLocked(AppExitInfoTracker.java:392)
        at com.android.server.am.AppExitInfoTracker$KillHandler.handleMessage(AppExitInfoTracker.java:1632)
        at android.os.Handler.dispatchMessage(Handler.java:106)
        at android.os.Looper.loopOnce(Looper.java:201)
        at android.os.Looper.loop(Looper.java:288)
        at android.os.HandlerThread.run(HandlerThread.java:67)
        at com.android.server.ServiceThread.run(ServiceThread.java:44)


Analyzing the lmkd process

Logcat: let's use the logcat line below as the case to analyze.

2023-04-13 13:49:24.921 456-456/? I/lowmemorykiller: Kill 'com.test' (9213), uid 10251, oom_score_adj 700 to free 128292kB rss, 50152kB swap; reason: low watermark is breached and swap is low (852kB < 314572kB)

As the log shows, the kill is performed by lmkd, which notifies our system_server process afterwards. Let's follow up in the lmkd code.
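For offline analysis, the fields of such a kill line can be extracted with a regex. A sketch that assumes the exact log format shown above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts taskname, pid, uid, oom_score_adj, rss and swap from an lmkd kill log line. */
public class LmkdLogParser {
    static final Pattern KILL = Pattern.compile(
            "Kill '([^']+)' \\((\\d+)\\), uid (\\d+), oom_score_adj (\\d+) "
            + "to free (\\d+)kB rss, (\\d+)kB swap");

    public static void main(String[] args) {
        String line = "Kill 'com.test' (9213), uid 10251, oom_score_adj 700 "
                + "to free 128292kB rss, 50152kB swap; reason: low watermark is breached "
                + "and swap is low (852kB < 314572kB)";
        Matcher m = KILL.matcher(line);
        if (m.find()) {
            System.out.println(m.group(1) + " adj=" + m.group(4) + " rss=" + m.group(5) + "kB");
            // com.test adj=700 rss=128292kB
        }
    }
}
```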

Code location: system/memory/lmkd

enum lmk_cmd {
    LMK_TARGET = 0,         /* Associate minfree with oom_adj_score */
    LMK_PROCPRIO,           /* Register a process and set its oom_adj_score */
    LMK_PROCREMOVE,         /* Unregister a process */
    LMK_PROCPURGE,          /* Purge all registered processes */
    LMK_GETKILLCNT,         /* Get number of kills */
    LMK_SUBSCRIBE,          /* Subscribe for asynchronous events */
    LMK_PROCKILL,           /* Unsolicited msg to subscribed clients on proc kills */
    LMK_UPDATE_PROPS,       /* Reinit properties */
    LMK_STAT_KILL_OCCURRED, /* Unsolicited msg to subscribed clients on proc kills for statsd log */
    LMK_STAT_STATE_CHANGED, /* Unsolicited msg to subscribed clients on state changed */
};

LMK_PROCKILL is also defined in lmkd.h; we can trace the kill flow through it:

static inline size_t lmkd_pack_set_prockills(LMKD_CTRL_PACKET packet, pid_t pid, uid_t uid) {
    packet[0] = htonl(LMK_PROCKILL);
    packet[1] = htonl(pid);
    packet[2] = htonl(uid);
    return 3 * sizeof(int);
}
static void ctrl_data_write_lmk_kill_occurred(pid_t pid, uid_t uid) {
    LMKD_CTRL_PACKET packet;
    size_t len = lmkd_pack_set_prockills(packet, pid, uid);
    for (int i = 0; i < MAX_DATA_CONN; i++) {
        if (data_sock[i].sock >= 0 && data_sock[i].async_event_mask & 1 << LMK_ASYNC_EVENT_KILL) {
            ctrl_data_write(i, (char*)packet, len);
        }
    }
}
static int kill_one_process(struct proc* procp, int min_oom_score, struct kill_info *ki,
                            union meminfo *mi, struct wakeup_info *wi, struct timespec *tm,
                            struct psi_data *pd) {
    int pid = procp->pid;
    int pidfd = procp->pidfd;
    uid_t uid = procp->uid;
    char *taskname;
    int kill_result;
    int result = -1;
    struct memory_stat *mem_st;
    struct kill_stat kill_st;
    int64_t tgid;
    int64_t rss_kb;
    int64_t swap_kb;
    char buf[PAGE_SIZE];
    char desc[LINE_MAX];
    if (!procp->valid || !read_proc_status(pid, buf, sizeof(buf))) {
        goto out;
    }
    if (!parse_status_tag(buf, PROC_STATUS_TGID_FIELD, &tgid)) {
        ALOGE("Unable to parse tgid from /proc/%d/status", pid);
        goto out;
    }
    if (tgid != pid) {
        ALOGE("Possible pid reuse detected (pid %d, tgid %" PRId64 ")!", pid, tgid);
        goto out;
    }
    // Zombie processes will not have RSS / Swap fields.
    if (!parse_status_tag(buf, PROC_STATUS_RSS_FIELD, &rss_kb)) {
        goto out;
    }
    if (!parse_status_tag(buf, PROC_STATUS_SWAP_FIELD, &swap_kb)) {
        goto out;
    }
    taskname = proc_get_name(pid, buf, sizeof(buf));
    // taskname will point inside buf, do not reuse buf onwards.
    if (!taskname) {
        goto out;
    }
    mem_st = stats_read_memory_stat(per_app_memcg, pid, uid, rss_kb * 1024, swap_kb * 1024);
    snprintf(desc, sizeof(desc), "lmk,%d,%d,%d,%d,%d", pid, ki ? (int)ki->kill_reason : -1,
             procp->oomadj, min_oom_score, ki ? ki->max_thrashing : -1);
    result = lmkd_free_memory_before_kill_hook(procp, rss_kb / page_k, procp->oomadj,
                                               ki ? (int)ki->kill_reason : -1);
    if (result > 0) {
      /*
       * Memory was freed elsewhere; no need to kill. Note: intentionally do not
       * pid_remove(pid) since it was not killed.
       */
      ALOGI("Skipping kill; %ld kB freed elsewhere.", result * page_k);
      return result;
    }
    trace_kill_start(desc);
    start_wait_for_proc_kill(pidfd < 0 ? pid : pidfd);
    kill_result = reaper.kill({ pidfd, pid, uid }, false);
    trace_kill_end();
    if (kill_result) {
        stop_wait_for_proc_kill(false);
        ALOGE("kill(%d): errno=%d", pid, errno);
        /* Delete process record even when we fail to kill so that we don't get stuck on it */
        goto out;
    }
    last_kill_tm = *tm;
    inc_killcnt(procp->oomadj);
    if (ki) {
        kill_st.kill_reason = ki->kill_reason;
        kill_st.thrashing = ki->thrashing;
        kill_st.max_thrashing = ki->max_thrashing;
        ALOGI("Kill '%s' (%d), uid %d, oom_score_adj %d to free %" PRId64 "kB rss, %" PRId64
              "kB swap; reason: %s", taskname, pid, uid, procp->oomadj, rss_kb, swap_kb,
              ki->kill_desc);
    } else {
        kill_st.kill_reason = NONE;
        kill_st.thrashing = 0;
        kill_st.max_thrashing = 0;
        ALOGI("Kill '%s' (%d), uid %d, oom_score_adj %d to free %" PRId64 "kB rss, %" PRId64
              "kb swap", taskname, pid, uid, procp->oomadj, rss_kb, swap_kb);
    }
    killinfo_log(procp, min_oom_score, rss_kb, swap_kb, ki, mi, wi, tm, pd);
    kill_st.uid = static_cast<int32_t>(uid);
    kill_st.taskname = taskname;
    kill_st.oom_score = procp->oomadj;
    kill_st.min_oom_score = min_oom_score;
    kill_st.free_mem_kb = mi->field.nr_free_pages * page_k;
    kill_st.free_swap_kb = get_free_swap(mi) * page_k;
    stats_write_lmk_kill_occurred(&kill_st, mem_st);
    ctrl_data_write_lmk_kill_occurred((pid_t)pid, uid);
    result = rss_kb / page_k;
out:
    /*
     * WARNING: After pid_remove() procp is freed and can't be used!
     * Therefore placed at the end of the function.
     */
    pid_remove(pid);
    return result;
}

The log printed by the ALOGI call in kill_one_process ("Kill '%s' (%d), uid %d, oom_score_adj %d to free ...kB rss, ...kB swap; reason: %s") matches what we saw in the console:

2023-04-13 13:49:24.921 456-456/? I/lowmemorykiller: Kill 'com.shizhuang.duapp' (9213), uid 10251, oom_score_adj 700 to free 128292kB rss, 50152kB swap; reason: low watermark is breached and swap is low (852kB < 314572kB)

Call stack

mp_event_psi -> find_and_kill_process -> kill_one_process -> ctrl_data_write_lmk_kill_occurred

mp_event_psi is triggered by listening on /proc/pressure/memory.

/proc/pressure/memory is a Linux PSI (Pressure Stall Information) file that reports memory pressure. It contains two lines, "some" and "full":
some: the share of wall time during which at least one task was stalled waiting for memory. The avg10/avg60/avg300 fields give the stall percentage over 10/60/300-second windows, and total gives the cumulative stall time in microseconds.
full: the share of wall time during which all non-idle tasks were stalled on memory at the same time, with the same fields.
By reading this file, or registering a trigger on it as lmkd does, userspace can tell how severe the current memory pressure is and react before the system degrades further.
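A minimal sketch of parsing one line of this PSI format (illustrative only; the field names follow the kernel's documented output):

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal parser for one line of /proc/pressure/memory, e.g.
 *  "some avg10=0.00 avg60=1.25 avg300=0.80 total=123456". */
public class PsiMemory {
    public static Map<String, Double> parseLine(String line) {
        Map<String, Double> fields = new HashMap<>();
        String[] parts = line.trim().split("\\s+");
        for (int i = 1; i < parts.length; i++) {          // parts[0] is "some" or "full"
            String[] kv = parts[i].split("=");
            fields.put(kv[0], Double.parseDouble(kv[1]));
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, Double> some =
                parseLine("some avg10=0.00 avg60=1.25 avg300=0.80 total=123456");
        System.out.println(some.get("avg60"));   // 1.25
    }
}
```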

Kill-condition checks behind the log

The swap_is_low condition: SwapFree < SwapTotal * swap_free_low_percentage / 100

  • SwapFree and SwapTotal both come from /proc/meminfo. For example, my device reports SwapTotal: 3145724 kB, and the log shows a threshold of 314572 kB, so swap_free_low_percentage = 10.

The wmark condition: MemFree - CmaFree < WMARK_HIGH, where WMARK_HIGH is the "high" watermark obtained from /proc/zoneinfo.

In short, the less free memory there is, the more easily these conditions are satisfied.

else if (swap_is_low && wmark < WMARK_HIGH) {
        /* Both free memory and swap are low */
        kill_reason = LOW_MEM_AND_SWAP;
        snprintf(kill_desc, sizeof(kill_desc), "%s watermark is breached and swap is low (%"
            PRId64 "kB < %" PRId64 "kB)", wmark < WMARK_LOW ? "min" : "low",
            get_free_swap(&mi) * page_k, swap_low_threshold * page_k);
        /* Do not kill perceptible apps unless below min watermark or heavily thrashing */
        if (wmark > WMARK_MIN && thrashing < thrashing_critical_pct) {
            min_score_adj = PERCEPTIBLE_APP_ADJ + 1;
        }
    }
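The swap_is_low side of that check can be sketched in isolation. An illustrative model, where the 10% default mirrors the swap_free_low_percentage value inferred above (lmkd actually reads it from the ro.lmk.swap_free_low_percentage property):

```java
/** Illustrative sketch of lmkd's swap_is_low check, using values from /proc/meminfo (kB). */
public class SwapLowCheck {
    // Assumed default matching the value inferred from the log above.
    static final long SWAP_FREE_LOW_PERCENTAGE = 10;

    static boolean swapIsLow(long swapFreeKb, long swapTotalKb) {
        long thresholdKb = swapTotalKb * SWAP_FREE_LOW_PERCENTAGE / 100;
        return swapFreeKb < thresholdKb;
    }

    public static void main(String[] args) {
        // Numbers from the log: SwapTotal 3145724 kB -> threshold 314572 kB; free 852 kB.
        System.out.println(swapIsLow(852, 3145724));   // true: 852 < 314572
    }
}
```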

Kill order

Processes are killed via find_and_kill_process.

It kills in descending oom_score_adj order; within the same adj level, the process with the larger RSS footprint is killed first (per the code, lmkd only picks the heaviest task at perceptible-or-lower adj levels; otherwise it kills the tail of the list).

After a kill, system_server is notified via ctrl_data_write_lmk_kill_occurred.

 for (i = OOM_SCORE_ADJ_MAX; i >= min_score_adj; i--) {
        struct proc *procp;
        if (!choose_heaviest_task && i <= PERCEPTIBLE_APP_ADJ) {
            /*
             * If we have to choose a perceptible process, choose the heaviest one to
             * hopefully minimize the number of victims.
             */
            choose_heaviest_task = true;
        }
        while (true) {
            procp = choose_heaviest_task ?
                proc_get_heaviest(i) : proc_adj_tail(i);
            if (!procp)
                break;
            killed_size = kill_one_process(procp, min_score_adj, ki, mi, wi, tm, pd);
            if (killed_size >= 0) {
                if (!lmk_state_change_start) {
                    lmk_state_change_start = true;
                    stats_write_lmk_state_changed(STATE_START);
                }
                break;
            }
        }
        if (killed_size) {
            break;
        }
    }
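The selection loop above can be modeled with a small sketch. This is a simplified model that always picks the heaviest process at the highest eligible adj level; the process names and the flat list are hypothetical:

```java
import java.util.List;

/** Simplified model of lmkd victim selection: highest oom_score_adj first,
 *  then largest RSS within that level. */
public class VictimSelection {
    record Proc(String name, int oomScoreAdj, long rssKb) {}

    static Proc chooseVictim(List<Proc> procs, int minScoreAdj) {
        Proc victim = null;
        for (Proc p : procs) {
            if (p.oomScoreAdj() < minScoreAdj) continue;   // below threshold: spare it
            if (victim == null
                    || p.oomScoreAdj() > victim.oomScoreAdj()          // higher adj dies first
                    || (p.oomScoreAdj() == victim.oomScoreAdj()
                        && p.rssKb() > victim.rssKb())) {              // then heaviest RSS
                victim = p;
            }
        }
        return victim;
    }

    public static void main(String[] args) {
        List<Proc> procs = List.of(
                new Proc("com.fg.app", 0, 300_000),
                new Proc("com.cached.a", 900, 120_000),
                new Proc("com.cached.b", 900, 250_000));
        // min_score_adj 201, i.e. spare perceptible-or-better apps (PERCEPTIBLE_APP_ADJ = 200).
        System.out.println(chooseVictim(procs, 201).name());   // com.cached.b
    }
}
```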

How a process's oom_adj reaches lmkd

Whenever a process's oom_adj changes, it is sent to lmkd through sLmkdConnection in ProcessList.writeLmkd:


        at com.android.server.am.ProcessList.writeLmkd(ProcessList.java:1478)
        at com.android.server.am.ProcessList.setOomAdj(ProcessList.java:1404)
        at com.android.server.am.OomAdjuster.applyOomAdjLSP(OomAdjuster.java:2595)
        at com.android.server.am.OomAdjuster.updateAndTrimProcessLSP(OomAdjuster.java:1066)
        at com.android.server.am.OomAdjuster.updateOomAdjInnerLSP(OomAdjuster.java:860)
        at com.android.server.am.OomAdjuster.performUpdateOomAdjPendingTargetsLocked(OomAdjuster.java:730)
        at com.android.server.am.OomAdjuster.updateOomAdjPendingTargetsLocked(OomAdjuster.java:710)
        at com.android.server.am.ActivityManagerService.updateOomAdjPendingTargetsLocked(ActivityManagerService.java:15505)
public static void setOomAdj(int pid, int uid, int amt) {
    // This indicates that the process is not started yet and so no need to proceed further.
    if (pid <= 0) {
        return;
    }
    if (amt == UNKNOWN_ADJ)
        return;
    long start = SystemClock.elapsedRealtime();
    ByteBuffer buf = ByteBuffer.allocate(4 * 4);
    buf.putInt(LMK_PROCPRIO);
    buf.putInt(pid);
    buf.putInt(uid);
    buf.putInt(amt);
    writeLmkd(buf, null);
    long now = SystemClock.elapsedRealtime();
    if ((now-start) > 250) {
        Slog.w("ActivityManager", "SLOW OOM ADJ: " + (now-start) + "ms for pid " + pid
                + " = " + amt);
    }
}
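In the opposite direction, the LMK_PROCKILL packet that lmkd writes in ctrl_data_write_lmk_kill_occurred is simply three network-byte-order ints: the command, the pid and the uid. A minimal decoding sketch for the receiving side (Java's ByteBuffer is big-endian by default, matching htonl):

```java
import java.nio.ByteBuffer;

/** Illustrative decoder for the 12-byte LMK_PROCKILL packet (cmd, pid, uid as big-endian ints). */
public class LmkProcKillPacket {
    static final int LMK_PROCKILL = 6;   // position of LMK_PROCKILL in the lmk_cmd enum above

    static int[] decode(byte[] packet) {
        ByteBuffer buf = ByteBuffer.wrap(packet);   // big-endian by default, same as htonl
        return new int[] { buf.getInt(), buf.getInt(), buf.getInt() };
    }

    public static void main(String[] args) {
        ByteBuffer out = ByteBuffer.allocate(12);   // mirrors lmkd_pack_set_prockills
        out.putInt(LMK_PROCKILL).putInt(9213).putInt(10251);
        int[] decoded = decode(out.array());
        System.out.println(decoded[0] + " " + decoded[1] + " " + decoded[2]);   // 6 9213 10251
    }
}
```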

The AMS scheduleTrimMemory mechanism

In other words: under what circumstances does our Application receive the onTrimMemory callback? Let's analyze it from the AMS side.

  1. First, AMS decides whether memory is at a normal level. This judgment is based on the number of cached processes (processes with a high procState). That makes sense: when memory is tight, lmkd kicks in, cached processes are the easiest to kill, and the cached-process count drops.
  2. When memFactor == ProcessStats.ADJ_MEM_FACTOR_NORMAL, a process whose UI has just been hidden (the user pressed Home or Back, making the app's UI invisible) gets scheduleTrimMemory with TRIM_MEMORY_UI_HIDDEN.
  3. When memFactor != ProcessStats.ADJ_MEM_FACTOR_NORMAL, the scheduleTrimMemory logic runs, with foreground and background processes scheduled separately.
  4. Background processes first (those with app.curProcState >= ActivityManager.PROCESS_STATE_HOME): they are split evenly into three buckets. The highest-priority third gets app.trimMemoryLevel = TRIM_MEMORY_COMPLETE, the middle third TRIM_MEMORY_MODERATE, and the last third TRIM_MEMORY_BACKGROUND; judging by the AMS kill policy, the last third is the most likely to be killed. On the next check, the scheduleTrimMemory logic fires and each process's Application is notified via onTrimMemory.
  5. For foreground processes, a fgTrimLevel is first computed from the number of cached processes numCached; for example, when numCached <= 5, fgTrimLevel = TRIM_MEMORY_RUNNING_LOW. Each process's app.trimMemoryLevel is then assigned, and the next time fgTrimLevel decreases, the scheduleTrimMemory logic is triggered.

So scheduleTrimMemory is not triggered by measuring actual memory usage; it is inferred indirectly from the number of low-priority cached processes. A shrinking cached-process count indicates that memory is tight, and apps are told to release memory.
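The background-process bucketing in step 4 can be sketched as a simplified model. This is not the actual AMS code; only the constant values are the real ComponentCallbacks2 ones, and the rounding of the bucket size is an assumption:

```java
import java.util.Arrays;

/** Simplified model of step 4: split the background process list into thirds
 *  and assign each third a trim level (constants mirror ComponentCallbacks2). */
public class BackgroundTrimLevels {
    static final int TRIM_MEMORY_COMPLETE = 80;
    static final int TRIM_MEMORY_MODERATE = 60;
    static final int TRIM_MEMORY_BACKGROUND = 40;

    static int[] assignLevels(int numBackground) {
        int[] levels = new int[numBackground];
        int third = (numBackground + 2) / 3;            // round up so every process gets a level
        for (int i = 0; i < numBackground; i++) {
            if (i < third)          levels[i] = TRIM_MEMORY_COMPLETE;
            else if (i < 2 * third) levels[i] = TRIM_MEMORY_MODERATE;
            else                    levels[i] = TRIM_MEMORY_BACKGROUND;
        }
        return levels;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(assignLevels(6)));
        // [80, 80, 60, 60, 40, 40]
    }
}
```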