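Translated from: www.scrivano.org/posts/2022-…

The original author, Red Hat engineer Giuseppe Scrivano, looks back on how the startup time of OCI containers became 30 times faster.

When I started working on crun (github.com/containers/…), I was looking for a way to start and stop containers faster by improving the OCI runtime, the component in the OCI stack that ultimately talks to the kernel and sets up the environment where the container runs.

The OCI runtime runs for a very short time, and its job mostly consists of executing a series of syscalls that map directly to the OCI configuration file.

I was surprised to find that such a trivial task could take so long.

Disclaimer: for my tests, I used the default kernel available in the Fedora installation, as well as all the libraries. Besides the fixes described in this blog post, there may have been other changes over the years that affected overall performance.

The same version of crun was used for all the tests below.

For all the tests, I used hyperfine, installed via cargo.

How things were in 2017

To see how far we have come, we need to go back to 2017, or just install an old Fedora image. For the tests below, I used Fedora 24, which is based on Linux kernel 4.5.5.

On a freshly installed Fedora 24, running crun built from the main branch: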

# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
  Time (mean ± σ):     159.2 ms ±  21.8 ms    [User: 43.0 ms, System: 16.3 ms]
  Range (min … max):    73.9 ms … 194.9 ms    39 runs

User time and system time are the amounts of time the process spent in user mode and in kernel mode, respectively.
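To make the user/system split concrete, here is a minimal Python sketch that measures both for a child process, much as hyperfine reports them. The helper name `child_cpu_times` is made up for this illustration, and the sketch assumes a Unix-like system where the standard `resource` module is available.

```python
import resource
import subprocess
import sys


def child_cpu_times(argv):
    """Run argv to completion and return (user_seconds, system_seconds) it consumed."""
    # RUSAGE_CHILDREN accumulates CPU time of children that have been waited for.
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(argv, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return (after.ru_utime - before.ru_utime, after.ru_stime - before.ru_stime)


if __name__ == "__main__":
    user, system = child_cpu_times([sys.executable, "-c", "sum(range(10**6))"])
    print(f"user: {user * 1000:.1f} ms, system: {system * 1000:.1f} ms")
```

Time spent computing in the process itself shows up as user time, while time spent inside syscalls (creating namespaces, mounting file systems) shows up as system time.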

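160 ms is a lot and, as far as I remember, it is similar to what I observed five years ago.

Profiling the OCI runtime immediately showed that most of the user time was spent in libseccomp, compiling the seccomp filter.

To verify this, let's try running a container with the same configuration but without the seccomp profile:

# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
  Time (mean ± σ):     139.6 ms ±  20.8 ms    [User: 4.1 ms, System: 22.0 ms]
  Range (min … max):    61.8 ms … 177.0 ms    47 runs

We used 1/10 of the user time needed before (43 ms -> 4.1 ms), and the overall time improved as well!

So there are mainly two different issues: 1) system time is quite high, and 2) user time is dominated by libseccomp. We need to tackle both.

Let's focus on system time for now; we will come back to seccomp later.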

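System time

Creating and destroying a network namespace

Creating and destroying a network namespace used to be very expensive. The issue can be reproduced with just the unshare tool; on Fedora 24 I get: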

# hyperfine 'unshare -n true'
Benchmark 1: 'unshare -n true'
  Time (mean ± σ):      47.7 ms ±  51.4 ms    [User: 0.6 ms, System: 3.2 ms]
  Range (min … max):     0.0 ms … 190.5 ms    365 runs

That is a long time!

I tried to fix it in the kernel and proposed a patch. Florian Westphal rewrote it in a much better way, and it was merged into the Linux kernel:

commit 8c873e2199700c2de7dbd5eedb9d90d5f109462b
Author: Florian Westphal
Date:   Fri Dec 1 00:21:04 2017 +0100
    netfilter: core: free hooks with call_rcu
    Giuseppe Scrivano says:
      "SELinux, if enabled, registers for each new network namespace 6
        netfilter hooks."
    Cost for this is high.  With synchronize_net() removed:
       "The net benefit on an SMP machine with two cores is that creating a
       new network namespace takes -40% of the original time."
    This patch replaces synchronize_net+kvfree with call_rcu().
    We store rcu_head at the tail of a structure that has no fixed layout,
    i.e. we cannot use offsetof() to compute the start of the original
    allocation.  Thus store this information right after the rcu head.
    We could simplify this by just placing the rcu_head at the start
    of struct nf_hook_entries.  However, this structure is used in
    packet processing hotpath, so only place what is needed for that
    at the beginning of the struct.
    Reported-by: Giuseppe Scrivano
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso
commit 26888dfd7e7454686b8d3ea9ba5045d5f236e4d7
Author: Florian Westphal
Date:   Fri Dec 1 00:21:03 2017 +0100
    netfilter: core: remove synchronize_net call if nfqueue is used
    since commit 960632ece6949b ("netfilter: convert hook list to an array")
    nfqueue no longer stores a pointer to the hook that caused the packet
    to be queued.  Therefore no extra synchronize_net() call is needed after
    dropping the packets enqueued by the old rule blob.
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso
commit 4e645b47c4f000a503b9c90163ad905786b9bc1d
Author: Florian Westphal
Date:   Fri Dec 1 00:21:02 2017 +0100
    netfilter: core: make nf_unregister_net_hooks simple wrapper again
    This reverts commit d3ad2c17b4047
    ("netfilter: core: batch nf_unregister_net_hooks synchronize_net calls").
    Nothing wrong with it.  However, followup patch will delay freeing of hooks
    with call_rcu, so all synchronize_net() calls become obsolete and there
    is no need anymore for this batching.
    This revert causes a temporary performance degradation when destroying
    network namespace, but its resolved with the upcoming call_rcu conversion.
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

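These patches made a huge difference: the time to create and destroy a network namespace has dropped dramatically. Here are the numbers on a modern 5.19.15 kernel:

# hyperfine 'unshare -n true'
Benchmark 1: 'unshare -n true'
  Time (mean ± σ):       1.5 ms ±   0.5 ms    [User: 0.3 ms, System: 1.3 ms]
  Range (min … max):     0.8 ms …   6.7 ms    1907 runs

Mounting mqueue

Mounting mqueue also used to be a relatively expensive operation.

On Fedora 24, it used to look like this: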

# mkdir /tmp/mqueue; hyperfine 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'; rmdir /tmp/mqueue
Benchmark 1: 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'
  Time (mean ± σ):      16.8 ms ±   3.1 ms    [User: 2.6 ms, System: 5.0 ms]
  Range (min … max):     9.3 ms …  26.8 ms    261 runs

In this case too, I tried to fix it and proposed a patch. It was not accepted, but Al Viro came up with a better version that solved the problem:

commit 36735a6a2b5e042db1af956ce4bcc13f3ff99e21
Author: Al Viro
Date:   Mon Dec 25 19:43:35 2017 -0500
    mqueue: switch to on-demand creation of internal mount
    Instead of doing that upon each ipcns creation, we do that the first
    time mq_open(2) or mqueue mount is done in an ipcns.  What's more,
    doing that allows to get rid of mount_ns() use - we can go with
    considerably cheaper mount_nodev(), avoiding the loop over all
    mqueue superblock instances; ipcns->mq_mnt is used to locate preexisting
    instance in O(1) time instead of O(instances) mount_ns() would've
    cost us.
    Based upon the version by Giuseppe Scrivano; I've
    added handling of userland mqueue mounts (original had been broken in
    that area) and added a switch to mount_nodev().
    Signed-off-by: Al Viro

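After this patch, the cost of creating an mqueue mount dropped as well:

# mkdir /tmp/mqueue; hyperfine 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'; rmdir /tmp/mqueue
Benchmark 1: 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'
  Time (mean ± σ):       0.7 ms ±   0.5 ms    [User: 0.5 ms, System: 0.6 ms]
  Range (min … max):     0.0 ms …   3.1 ms    772 runs

Creating and destroying an IPC namespace

I put the work on container startup time aside for a couple of years and picked it up again in early 2020. Another issue I was aware of was the time needed to create and destroy an IPC namespace.

As with the network namespace, the issue can be reproduced with just the unshare tool: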

# hyperfine 'unshare -i true'
Benchmark 1: 'unshare -i true'
  Time (mean ± σ):      10.9 ms ±   2.1 ms    [User: 0.5 ms, System: 1.0 ms]
  Range (min … max):     4.2 ms …  17.2 ms    310 runs

Unlike the previous two attempts, this time the patch I sent was accepted upstream:

commit e1eb26fa62d04ec0955432be1aa8722a97cb52e7
Author: Giuseppe Scrivano
Date:   Sun Jun 7 21:40:10 2020 -0700
    ipc/namespace.c: use a work queue to free_ipc
    the reason is to avoid a delay caused by the synchronize_rcu() call in
    kern_umount() when the mqueue mount is freed.
    the code:
        #define _GNU_SOURCE
        #include <sched.h>
        #include <error.h>
        #include <errno.h>
        #include <stdlib.h>
        int main()
        {
            int i;
            for (i = 0; i < 1000; i++)
                if (unshare(CLONE_NEWIPC) < 0)
                    error(EXIT_FAILURE, errno, "unshare");
        }
    goes from
            Command being timed: "./ipc-namespace"
            User time (seconds): 0.00
            System time (seconds): 0.06
            Percent of CPU this job got: 0%
            Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.05
    to
            Command being timed: "./ipc-namespace"
            User time (seconds): 0.00
            System time (seconds): 0.02
            Percent of CPU this job got: 96%
            Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.03
    Signed-off-by: Giuseppe Scrivano
    Signed-off-by: Andrew Morton
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Waiman Long
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Link: http://lkml.kernel.org/r/20200225145419.527994-1-gscrivan@redhat.com
    Signed-off-by: Linus Torvalds
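For readers who prefer a scripting language, the reproducer from the commit message can be approximated in Python. This is only a sketch under stated assumptions: `os.unshare` exists on Linux with Python 3.12 or newer, the call needs CAP_SYS_ADMIN (typically root), and the helper name `time_unshare_loop` and its injectable `unshare` parameter are made up here so the loop can be exercised without privileges.

```python
import os
import time

# Value of CLONE_NEWIPC from <linux/sched.h>; os.CLONE_NEWIPC exists on Python 3.12+.
CLONE_NEWIPC = getattr(os, "CLONE_NEWIPC", 0x08000000)


def time_unshare_loop(iterations=1000, unshare=None):
    """Call unshare(CLONE_NEWIPC) `iterations` times; return elapsed wall-clock seconds."""
    if unshare is None:
        unshare = os.unshare  # Python 3.12+ on Linux, needs CAP_SYS_ADMIN
    start = time.perf_counter()
    for _ in range(iterations):
        # Each call moves the process into a fresh IPC namespace, so the
        # previous namespace loses its last user and has to be torn down.
        unshare(CLONE_NEWIPC)
    return time.perf_counter() - start


if __name__ == "__main__":
    try:
        print(f"elapsed: {time_unshare_loop():.2f} s")
    except (AttributeError, OSError) as exc:
        print(f"cannot unshare here: {exc}")
```

As in the C version, wall-clock time is the interesting number: the commit message shows almost no CPU time being used while the old kernel waited in synchronize_rcu().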

With this patch, the time to create and destroy an IPC namespace dropped considerably as well, as outlined in the commit message. On a modern 5.19.15 kernel I now get:

# hyperfine 'unshare -i true'
Benchmark 1: 'unshare -i true'
  Time (mean ± σ):       0.1 ms ±   0.2 ms    [User: 0.2 ms, System: 0.4 ms]
  Range (min … max):     0.0 ms …   1.5 ms    1966 runs
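Putting the three kernel-side measurements side by side makes the size of each improvement easier to see. The sketch below only re-uses the mean times reported by hyperfine in this post; the dictionary layout and the `summarize` helper are made up for the illustration, and the three benchmarks are not meant to be added up into a single figure.

```python
# Mean wall-clock times in milliseconds, copied from the hyperfine runs above:
# (before the kernel fix, on kernel 5.19.15).
KERNEL_BENCHMARKS = {
    "unshare -n true": (47.7, 1.5),
    "mount -t mqueue": (16.8, 0.7),
    "unshare -i true": (10.9, 0.1),
}


def summarize(benchmarks):
    """Return {name: (milliseconds saved, speedup factor)} for each benchmark."""
    return {
        name: (round(before - after, 1), round(before / after, 1))
        for name, (before, after) in benchmarks.items()
    }


if __name__ == "__main__":
    for name, (saved, factor) in summarize(KERNEL_BENCHMARKS).items():
        print(f"{name:18s} saved {saved:5.1f} ms  ({factor:.1f}x faster)")
```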

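User time

Kernel time now seems to be under control. What can we do to reduce user time?

As we found out earlier, libseccomp is the main culprit here, so it had to be tackled first; this happened right after the IPC fix in the kernel.

Most of the cost in libseccomp came from the syscall lookup code. The OCI configuration file contains a list of syscalls by name, and each of them is looked up through the seccomp_syscall_resolve_name function, which returns the syscall number for a given syscall name.

libseccomp used to perform a linear search through the syscall table for each syscall name. For x86_64, for example, it looked like this: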

/* NOTE: based on Linux v5.4-rc4 */
const struct arch_syscall_def x86_64_syscall_table[] = { \
	{ "_llseek", __PNR__llseek },
	{ "_newselect", __PNR__newselect },
	{ "_sysctl", 156 },
	{ "accept", 43 },
	{ "accept4", 288 },
	{ "access", 21 },
	{ "acct", 163 },
.....
    };
int x86_64_syscall_resolve_name(const char *name)
{
	unsigned int iter;
	const struct arch_syscall_def *table = x86_64_syscall_table;
	/* XXX - plenty of room for future improvement here */
	for (iter = 0; table[iter].name != NULL; iter++) {
		if (strcmp(name, table[iter].name) == 0)
			return table[iter].num;
	}
	return __NR_SCMP_ERROR;
}

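Building the seccomp profile through libseccomp therefore had a complexity of O(n*m), where n is the number of syscalls in the profile and m is the number of syscalls known to libseccomp.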
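The O(n*m) behaviour can be illustrated with a small Python model of the linear search that counts string comparisons. It is a sketch, not libseccomp code: the table is a five-entry subset using the numeric values from the excerpt above (the `__PNR_` entries are left out), and the function names are made up for the illustration.

```python
# A tiny stand-in for libseccomp's syscall table; the numbers are the
# x86_64 values quoted in the excerpt above.
SYSCALL_TABLE = [
    ("_sysctl", 156),
    ("accept", 43),
    ("accept4", 288),
    ("access", 21),
    ("acct", 163),
]


def resolve_name_linear(name, table=SYSCALL_TABLE):
    """Return (syscall number or None, number of string comparisons performed)."""
    comparisons = 0
    for entry_name, number in table:
        comparisons += 1
        if entry_name == name:
            return number, comparisons
    return None, comparisons


def total_comparisons(profile, table=SYSCALL_TABLE):
    """Count comparisons needed to resolve every name in an OCI-style profile."""
    return sum(resolve_name_linear(name, table)[1] for name in profile)


if __name__ == "__main__":
    profile = ["acct", "acct", "unknown"]
    print(total_comparisons(profile))  # worst case: len(profile) * len(table)
```

With a real table of several hundred syscalls and a profile listing a few hundred names, the number of strcmp calls grows into the tens of thousands for every container start.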

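I followed the advice in the code comment and spent some time trying to fix it. In January 2020, I worked on a patch for libseccomp that solves the problem by using a perfect hash function to look up syscall names.

This is the libseccomp patch: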

commit 9b129c41ac1f43d373742697aa2faf6040b9dfab
Author: Giuseppe Scrivano
Date:   Thu Jan 23 17:01:39 2020 +0100
    arch: use gperf to generate a perfact hash to lookup syscall names
    This patch significantly improves the performance of
    seccomp_syscall_resolve_name since it replaces the expensive strcmp
    for each syscall in the database, with a lookup table.
    The complexity for syscall_resolve_num is not changed and it
    uses the linear search, that is anyway less expensive than
    seccomp_syscall_resolve_name as it uses an index for comparison
    instead of doing a string comparison.
    On my machine, calling 1000 seccomp_syscall_resolve_name_arch and
    seccomp_syscall_resolve_num_arch over the entire syscalls DB passed
    from ~0.45 sec to ~0.06s.
    PM: After talking with Giuseppe I made a number of additional
    changes, some substantial, the highlights include:
    * various style tweaks
    * .gitignore fixes
    * fixed subject line, tweaked the description
    * dropped the arch-syscall-validate changes as they were masking
      other problems
    * extracted the syscalls.csv and file deletions to other patches
      to keep this one more focused
    * fixed the x86, x32, arm, all the MIPS ABIs, s390, and s390x ABIs as
      the syscall offsets were not properly incorporated into this change
    * cleaned up the ABI specific headers
    * cleaned up generate_syscalls_perf.sh and renamed to
      arch-gperf-generate
    * fixed problems with automake's file packaging
    Signed-off-by: Giuseppe Scrivano
    Reviewed-by: Tom Hromatka
    [PM: see notes in the "PM" section above]
    Signed-off-by: Paul Moore

The patch has been merged and released. Building the seccomp profile now has a complexity of O(n), where n is the number of syscalls in the profile.
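The effect of replacing the linear scan with a hash lookup can be sketched in Python. Note the substitution: a Python dict is a general-purpose hash table, not a gperf-generated perfect hash, so this only illustrates the constant-time-per-name idea. The table subset and the `resolve_profile` helper are assumptions made for the example.

```python
# Same illustrative subset of the x86_64 table as quoted earlier in the post.
SYSCALL_NUMBERS = {
    "_sysctl": 156,
    "accept": 43,
    "accept4": 288,
    "access": 21,
    "acct": 163,
}


def resolve_profile(names, numbers=SYSCALL_NUMBERS):
    """Map each syscall name to its number with one hash lookup per name.

    Unknown names map to None, loosely mirroring an error return.
    """
    return {name: numbers.get(name) for name in names}


if __name__ == "__main__":
    print(resolve_profile(["accept", "acct", "no_such_syscall"]))
```

Each name now costs one lookup regardless of the table size, which is where the O(n) bound for the whole profile comes from.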

The improvement is significant. With a recent enough libseccomp:

# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
  Time (mean ± σ):      28.9 ms ±   5.9 ms    [User: 16.7 ms, System: 4.5 ms]
  Range (min … max):    19.1 ms …  41.6 ms    73 runs

User time is now just 16.7 ms. It used to be more than 40 ms, and it is around 4 ms when seccomp is not used at all.

So, using 4.1 ms as the user time cost without seccomp, we have:

time_used_by_seccomp_before = 43.0ms - 4.1ms = 38.9ms
time_used_by_seccomp_after = 16.7ms - 4.1ms = 12.6ms

More than 3 times faster! And syscall lookup is only part of what libseccomp does; a considerable amount of time is also spent compiling the BPF filter.
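The arithmetic behind that figure can be checked in a few lines of Python. The constants are the user times reported by hyperfine in this post; the variable names are chosen for the example.

```python
USER_NO_SECCOMP_MS = 4.1   # user time without a seccomp profile
USER_BEFORE_MS = 43.0      # user time with the old libseccomp
USER_AFTER_MS = 16.7       # user time with the gperf-based lookup

# Subtract the non-seccomp baseline to isolate the cost attributable to seccomp.
seccomp_before = round(USER_BEFORE_MS - USER_NO_SECCOMP_MS, 1)
seccomp_after = round(USER_AFTER_MS - USER_NO_SECCOMP_MS, 1)
speedup = seccomp_before / seccomp_after

print(f"seccomp before: {seccomp_before} ms")
print(f"seccomp after:  {seccomp_after} ms")
print(f"speedup:        {speedup:.2f}x")
```

The ratio comes out at roughly 3.09, which matches the "more than 3 times" claim.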

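BPF filter compilation

Can we do even better?

BPF filter compilation is done by the seccomp_export_bpf function, and it is still quite expensive.

A simple observation is that most containers reuse the same seccomp profile over and over, with few customizations.

So it makes sense to cache the result of the compilation and reuse it whenever possible.

There is a new crun feature to cache the result of the BPF filter compilation. At the time of writing, the patch has not been merged yet, although it is almost there.

With that in place, the cost of compiling the seccomp profile is paid only when the generated BPF filter is not in the cache. This is what we have now: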

# hyperfine 'crun-from-the-future run foo'
Benchmark 1: 'crun-from-the-future run foo'
  Time (mean ± σ):       5.6 ms ±   3.0 ms    [User: 1.0 ms, System: 4.5 ms]
  Range (min … max):     4.2 ms …  26.8 ms    101 runs
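The caching idea can be sketched in Python as follows. This is a hypothetical design for illustration only and does not describe how crun actually implements its cache: the `FilterCache` class, the SHA-256 key over a canonical JSON form of the profile, the simplified profile layout, and the `fake_compile` placeholder are all assumptions.

```python
import hashlib
import json


class FilterCache:
    """In-memory cache keyed by a digest of the seccomp profile (hypothetical design)."""

    def __init__(self, compile_filter):
        self._compile = compile_filter  # the expensive step, e.g. a BPF compilation
        self._entries = {}
        self.misses = 0

    @staticmethod
    def key(profile):
        # Canonical JSON so that key order in the profile does not change the digest.
        canonical = json.dumps(profile, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get(self, profile):
        digest = self.key(profile)
        if digest not in self._entries:
            # Cache miss: pay the compilation cost once and remember the result.
            self.misses += 1
            self._entries[digest] = self._compile(profile)
        return self._entries[digest]


def fake_compile(profile):
    """Placeholder for the real compilation; returns bytes derived from the profile."""
    return repr(sorted(profile["syscalls"])).encode("utf-8")


if __name__ == "__main__":
    cache = FilterCache(fake_compile)
    profile = {"defaultAction": "SCMP_ACT_ERRNO", "syscalls": ["read", "write"]}
    cache.get(profile)
    cache.get(profile)
    print(cache.misses)  # compilation ran only for the first request
```

Because most containers share the same profile, almost every start after the first one would hit the cache and skip the compilation entirely.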

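Conclusion

Over five years, the total time needed to create and destroy an OCI container has gone from almost 160 ms down to a little over 5 ms.

That is almost a 30x improvement!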
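As a quick check of the headline number, the mean `crun run foo` times reported at each stage of this post can be compared against the 2017 baseline. The stage labels below are shorthand chosen for this sketch.

```python
# Mean 'crun run foo' times (ms) reported at each stage of the post.
STAGES = [
    ("Fedora 24, kernel 4.5.5", 159.2),
    ("recent enough libseccomp", 28.9),
    ("cached BPF filter", 5.6),
]

baseline = STAGES[0][1]
for label, mean_ms in STAGES:
    print(f"{label:26s} {mean_ms:6.1f} ms  {baseline / mean_ms:5.1f}x")

overall_speedup = baseline / STAGES[-1][1]
```

The last ratio is a bit over 28, which is where the "almost 30x" figure comes from.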