eBPF is a revolutionary technology with origins in the Linux kernel that can run sandboxed programs in a privileged context such as the operating system kernel. It is used to safely and efficiently extend the capabilities of the kernel without changing kernel source code or loading kernel modules.
Historically, the operating system has always been an ideal place to implement observability, security, and networking functionality due to the kernel's privileged ability to oversee and control the entire system. At the same time, an operating system kernel is hard to evolve due to its central role and its high requirements for stability and security. The rate of innovation at the operating system level has thus traditionally been lower than that of functionality implemented outside the operating system.
eBPF changes this formula fundamentally. By allowing sandboxed programs to run within the operating system, application developers can use eBPF programs to add capabilities to the operating system at runtime. The operating system then guarantees safety and near-native execution efficiency with the aid of a Just-In-Time (JIT) compiler and a verification engine. This has led to a wave of eBPF-based projects covering a wide array of use cases, including next-generation networking, observability, and security functionality.
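To make this concrete, here is a minimal sketch of what such a sandboxed program looks like in restricted C, using libbpf conventions (the tracepoint and names are illustrative, not taken from the Android code discussed below); the verifier checks it at load time, and the JIT then compiles it to native code:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// Illustrative example: log every execve() entry to the trace pipe.
SEC("tracepoint/syscalls/sys_enter_execve")
int trace_execve(void *ctx)
{
    bpf_printk("execve called");  // readable via /sys/kernel/debug/tracing/trace_pipe
    return 0;
}

char _license[] SEC("license") = "GPL";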
Today, eBPF is used extensively to drive a wide variety of use cases: providing high-performance networking and load balancing in modern data centers and cloud-native environments, extracting fine-grained security observability data at low overhead, helping application developers trace applications, providing insights for performance troubleshooting, enforcing preventive application and container runtime security, and much more. The possibilities are endless, and the innovation that eBPF is unlocking has only just begun.
on load_bpf_programs
    # Linux 5.16-rc1 has changed the default to 2 (disabled but changeable),
    # but we need 0
    write /proc/sys/kernel/unprivileged_bpf_disabled 0
    # Enable the eBPF JIT -- but do note that on 64-bit kernels it is likely
    # already force enabled by the kernel config option BPF_JIT_ALWAYS_ON
    write /proc/sys/net/core/bpf_jit_enable 1
    # Enable JIT kallsyms export for privileged users only
    write /proc/sys/net/core/bpf_jit_kallsyms 1
    exec_start bpfloader

service bpfloader /system/bin/bpfloader
    capabilities CHOWN SYS_ADMIN NET_ADMIN
    #
    # Set RLIMIT_MEMLOCK to 1GiB for bpfloader
    #
    # Actually only 8MiB would be needed if bpfloader ran as its own uid.
    #
    # As such we simply use 1GiB as a reasonable approximation of infinity.
    #
    rlimit memlock 1073741824 1073741824
    oneshot
    reboot_on_failure reboot,bpfloader-failed
    updatable
    // Create all the pin subdirectories
    // (this must be done first to allow selinux_context and pin_subdir functionality,
    //  which could otherwise fail with ENOENT during object pinning or renaming,
    //  due to ordering issues)
    for (const auto& location : locations) {
        createSysFsBpfSubDir(location.prefix);
    }
    // Load all ELF objects, create programs and maps, and pin them
    for (const auto& location : locations) {
        if (loadAllElfObjects(location) != 0) {
            ALOGE("=== CRITICAL FAILURE LOADING BPF PROGRAMS FROM %s ===", location.dir);
            ALOGE("If this triggers reliably, you're probably missing kernel options or patches.");
            ALOGE("If this triggers randomly, you might be hitting some memory allocation "
                  "problems or startup script race.");
            ALOGE("--- DO NOT EXPECT SYSTEM TO BOOT SUCCESSFULLY ---");
            sleep(20);
            return 2;
        }
    }
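After a successful load, the programs and maps end up pinned in the BPF filesystem (on Android, under /sys/fs/bpf) so that long-lived daemons such as netd can open them after bpfloader itself has exited; the exact prog_*/map_* pin-name layout is an implementation detail of the loader.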
    if (android::base::SetProperty("bpf.progs_loaded", "1") == false) {
        ALOGE("Failed to set bpf.progs_loaded property");
        return 1;
    }
int loadProg(const char* elfPath, bool* isCritical, const char* prefix,
             const unsigned long long allowedDomainBitmask,
             const bpf_prog_type* allowed, size_t numAllowed) {
    vector<char> license;
    vector<char> critical;
    vector<codeSection> cs;
    vector<unique_fd> mapFds;
    int ret;

    if (!isCritical) return -1;
    *isCritical = false;

    ifstream elfFile(elfPath, ios::in | ios::binary);
    if (!elfFile.is_open()) return -1;

    // Check the license (and whether this object is marked critical)
    ret = readSectionByName("critical", elfFile, critical);
    *isCritical = !ret;

    ret = readSectionByName("license", elfFile, license);
    ...

    // Check the bpfloader version range this object requires
    // the following default values are for bpfloader V0.0 format which does not include them
    unsigned int bpfLoaderMinVer =
            readSectionUint("bpfloader_min_ver", elfFile, DEFAULT_BPFLOADER_MIN_VER);
    unsigned int bpfLoaderMaxVer =
            readSectionUint("bpfloader_max_ver", elfFile, DEFAULT_BPFLOADER_MAX_VER);
    size_t sizeOfBpfMapDef =
            readSectionUint("size_of_bpf_map_def", elfFile, DEFAULT_SIZEOF_BPF_MAP_DEF);
    size_t sizeOfBpfProgDef =
            readSectionUint("size_of_bpf_prog_def", elfFile, DEFAULT_SIZEOF_BPF_PROG_DEF);

    // inclusive lower bound check
    if (BPFLOADER_VERSION < bpfLoaderMinVer) {
        ALOGI("BpfLoader version 0x%05x ignoring ELF object %s with min ver 0x%05x",
              BPFLOADER_VERSION, elfPath, bpfLoaderMinVer);
        return 0;
    }

    // exclusive upper bound check
    if (BPFLOADER_VERSION >= bpfLoaderMaxVer) {
        ALOGI("BpfLoader version 0x%05x ignoring ELF object %s with max ver 0x%05x",
              BPFLOADER_VERSION, elfPath, bpfLoaderMaxVer);
        return 0;
    }

    ALOGI("BpfLoader version 0x%05x processing ELF object %s with ver [0x%05x,0x%05x)",
          BPFLOADER_VERSION, elfPath, bpfLoaderMinVer, bpfLoaderMaxVer);

    if (sizeOfBpfMapDef < DEFAULT_SIZEOF_BPF_MAP_DEF) {
        ALOGE("sizeof(bpf_map_def) of %zu is too small (< %d)", sizeOfBpfMapDef,
              DEFAULT_SIZEOF_BPF_MAP_DEF);
        return -1;
    }

    // Check the size of the embedded bpf_prog_def structures
    if (sizeOfBpfProgDef < DEFAULT_SIZEOF_BPF_PROG_DEF) {
        ALOGE("sizeof(bpf_prog_def) of %zu is too small (< %d)", sizeOfBpfProgDef,
              DEFAULT_SIZEOF_BPF_PROG_DEF);
        return -1;
    }

    ret = readCodeSections(elfFile, cs, sizeOfBpfProgDef, allowed, numAllowed);
    if (ret) {
        ALOGE("Couldn't read all code sections in %s", elfPath);
        return ret;
    }

    /* Just for future debugging */
    if (0) dumpAllCs(cs);

    ret = createMaps(elfPath, elfFile, mapFds, prefix, allowedDomainBitmask, sizeOfBpfMapDef);
    if (ret) {
        ALOGE("Failed to create maps: (ret=%d) in %s", ret, elfPath);
        return ret;
    }

    for (int i = 0; i < (int)mapFds.size(); i++)
        ALOGD("map_fd found at %d is %d in %s", i, mapFds[i].get(), elfPath);

    applyMapRelo(elfFile, mapFds, cs);

    ret = loadCodeSections(elfPath, cs, string(license.data()), prefix, allowedDomainBitmask);
    if (ret) ALOGE("Failed to load programs, loadCodeSections ret=%d", ret);

    return ret;
}
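readSectionUint() is not shown in the excerpt; as a hedged sketch of its assumed contract (read a 4-byte value from the named ELF section, falling back to a default for old v0.0 objects that predate these sections), it could look like this, which is not the verbatim AOSP implementation:

#include <cstring>  // memcpy

static unsigned int readSectionUint(const char* name, std::ifstream& elfFile,
                                    unsigned int defVal) {
    std::vector<char> theBytes;
    if (readSectionByName(name, elfFile, theBytes)) return defVal;  // section absent
    if (theBytes.size() < sizeof(unsigned int)) return defVal;      // section truncated
    unsigned int value;
    std::memcpy(&value, theBytes.data(), sizeof(value));  // assumes host byte order
    return value;
}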
/* Read a section by its index - for ex to get sec hdr strtab blob */
static int readCodeSections(ifstream& elfFile, vector<codeSection>& cs, size_t sizeOfBpfProgDef,
                            const bpf_prog_type* allowed, size_t numAllowed) {
    vector<Elf64_Shdr> shTable;
    int entries, ret = 0;

    ret = readSectionHeadersAll(elfFile, shTable);
    if (ret) return ret;
    entries = shTable.size();

    vector<struct bpf_prog_def> pd;
    ret = readProgDefs(elfFile, pd, sizeOfBpfProgDef);
    if (ret) return ret;
    vector<string> progDefNames;
    ret = getSectionSymNames(elfFile, "progs", progDefNames);
    if (!pd.empty() && ret) return ret;

    for (int i = 0; i < entries; i++) {
        string name;
        codeSection cs_temp;
        cs_temp.type = BPF_PROG_TYPE_UNSPEC;

        ret = getSymName(elfFile, shTable[i].sh_name, name);
        if (ret) return ret;

        enum bpf_prog_type ptype = getSectionType(name);

        if (ptype == BPF_PROG_TYPE_UNSPEC) continue;

        if (!IsAllowed(ptype, allowed, numAllowed)) {
            ALOGE("Program type %s not permitted here", getSectionName(ptype).c_str());
            return -1;
        }

        // This must be done before '/' is replaced with '_'.
        cs_temp.expected_attach_type = getExpectedAttachType(name);

        string oldName = name;
        // convert all slashes to underscores
        std::replace(name.begin(), name.end(), '/', '_');

        cs_temp.type = ptype;
        cs_temp.name = name;

        ret = readSectionByIdx(elfFile, i, cs_temp.data);
        if (ret) return ret;
        ALOGD("Loaded code section %d (%s)", i, name.c_str());

        vector<string> csSymNames;
        ret = getSectionSymNames(elfFile, oldName, csSymNames, STT_FUNC);
        if (ret || !csSymNames.size()) return ret;
        for (size_t i = 0; i < progDefNames.size(); ++i) {
            if (!progDefNames[i].compare(csSymNames[0] + "_def")) {
                cs_temp.prog_def = pd[i];
                break;
            }
        }
        /* Check for rel section */
        // guard with i + 1 so shTable[i + 1] below cannot index past the last header
        if (cs_temp.data.size() > 0 && i + 1 < entries) {
            ret = getSymName(elfFile, shTable[i + 1].sh_name, name);
            if (ret) return ret;
if (name == (".rel" + oldName)) { ret = readSectionByIdx(elfFile, i + 1, cs_temp.rel_data); if (ret) return ret; ALOGD("Loaded relo section %d (%s)", i, name.c_str()); } }
if (cs_temp.data.size() > 0) { cs.push_back(std::move(cs_temp)); ALOGD("Adding section %d to cs list", i); } } return0; }
/*
 * Map section name prefixes to program types, the section name will be:
 *   SECTION(<prefix>/<name-of-program>)
 * For example:
 *   SECTION("tracepoint/sched_switch_func") where sched_switch_func
 *   is the name of the program, and tracepoint is the type.
 *
 * However, be aware that you should not be directly using the SECTION() macro.
 * Instead use the DEFINE_(BPF|XDP)_(PROG|MAP)... & LICENSE/CRITICAL macros.
 */
sectionType sectionNameTypes[] = {
        {"bind4/",      BPF_PROG_TYPE_CGROUP_SOCK_ADDR, BPF_CGROUP_INET4_BIND},
        {"bind6/",      BPF_PROG_TYPE_CGROUP_SOCK_ADDR, BPF_CGROUP_INET6_BIND},
        {"cgroupskb/",  BPF_PROG_TYPE_CGROUP_SKB,       BPF_ATTACH_TYPE_UNSPEC},
        {"cgroupsock/", BPF_PROG_TYPE_CGROUP_SOCK,      BPF_ATTACH_TYPE_UNSPEC},
        {"kprobe/",     BPF_PROG_TYPE_KPROBE,           BPF_ATTACH_TYPE_UNSPEC},
        {"perf_event/", BPF_PROG_TYPE_PERF_EVENT,       BPF_ATTACH_TYPE_UNSPEC},
        {"schedact/",   BPF_PROG_TYPE_SCHED_ACT,        BPF_ATTACH_TYPE_UNSPEC},
        {"schedcls/",   BPF_PROG_TYPE_SCHED_CLS,        BPF_ATTACH_TYPE_UNSPEC},
        {"skfilter/",   BPF_PROG_TYPE_SOCKET_FILTER,    BPF_ATTACH_TYPE_UNSPEC},
        {"tracepoint/", BPF_PROG_TYPE_TRACEPOINT,       BPF_ATTACH_TYPE_UNSPEC},
        {"xdp/",        BPF_PROG_TYPE_XDP,              BPF_ATTACH_TYPE_UNSPEC},
};
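getSectionType(), used by readCodeSections() above, is not quoted; a minimal sketch, assuming it simply prefix-matches the section name against this table, would be:

// Sketch (an assumption, not the verbatim AOSP code): resolve a section name
// such as "tracepoint/sched_switch_func" to its program type by prefix match.
static enum bpf_prog_type getSectionType(std::string& name) {
    for (auto& st : sectionNameTypes)
        if (!name.compare(0, st.name.length(), st.name)) return st.type;
    return BPF_PROG_TYPE_UNSPEC;
}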
ret = readSectionByName("maps", elfFile, mdData); if (ret == -2) return0; // no maps to read if (ret) return ret; ... ret = getSectionSymNames(elfFile, "maps", mapNames); if (ret) return ret;
    for (int i = 0; i < (int)cs.size(); i++) {
        string name = cs[i].name;
        unsigned bpfMinVer = DEFAULT_BPFLOADER_MIN_VER;  // v0.0
        unsigned bpfMaxVer = DEFAULT_BPFLOADER_MAX_VER;  // v1.0
        domain selinux_context = domain::unspecified;
        domain pin_subdir = domain::unspecified;

        if (cs[i].prog_def.has_value()) {
            unsigned min_kver = cs[i].prog_def->min_kver;
            unsigned max_kver = cs[i].prog_def->max_kver;
            ALOGD("cs[%d].name:%s min_kver:%x .max_kver:%x (kvers:%x)", i, name.c_str(),
                  min_kver, max_kver, kvers);
            if (kvers < min_kver) continue;
            if (kvers >= max_kver) continue;

            bpfMinVer = cs[i].prog_def->bpfloader_min_ver;
            bpfMaxVer = cs[i].prog_def->bpfloader_max_ver;
            selinux_context = getDomainFromSelinuxContext(cs[i].prog_def->selinux_context);
            pin_subdir = getDomainFromPinSubdir(cs[i].prog_def->pin_subdir);
            // Note: make sure to only check for unrecognized *after* verifying bpfloader
            // version limits include this bpfloader's version.
        }

        ALOGD("cs[%d].name:%s requires bpfloader version [0x%05x,0x%05x)", i, name.c_str(),
              bpfMinVer, bpfMaxVer);
        if (BPFLOADER_VERSION < bpfMinVer) continue;
        if (BPFLOADER_VERSION >= bpfMaxVer) continue;
        if (unrecognized(pin_subdir)) return -ENOTDIR;

        if (specified(selinux_context)) {
            if (!inDomainBitmask(selinux_context, allowedDomainBitmask)) {
                ALOGE("prog %s has invalid selinux_context of %d (allowed bitmask 0x%llx)",
                      name.c_str(), selinux_context, allowedDomainBitmask);
                return -EINVAL;
            }
            ALOGI("prog %s selinux_context [%32s] -> %d -> '%s' (%s)", name.c_str(),
                  cs[i].prog_def->selinux_context, selinux_context,
                  lookupSelinuxContext(selinux_context), lookupPinSubdir(selinux_context));
        }

        if (specified(pin_subdir)) {
            if (!inDomainBitmask(pin_subdir, allowedDomainBitmask)) {
                ALOGE("prog %s has invalid pin_subdir of %d (allowed bitmask 0x%llx)",
                      name.c_str(), pin_subdir, allowedDomainBitmask);
                return -EINVAL;
            }
            ALOGI("prog %s pin_subdir [%32s] -> %d -> '%s'", name.c_str(),
                  cs[i].prog_def->pin_subdir, pin_subdir, lookupPinSubdir(pin_subdir));
        }

        // strip any potential $foo suffix
        // this can be used to provide duplicate programs
        // conditionally loaded based on running kernel version
        name = name.substr(0, name.find_last_of('$'));
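In other words, one object can ship several variants of the same program gated on kernel version, e.g. (hypothetical names) schedcls/ingress$4_14 and schedcls/ingress$4_19; after the min_kver/max_kver checks above, only the surviving variant is loaded, and it is pinned under the suffix-free name schedcls_ingress. Note that find_last_of() returns npos when there is no '$', in which case substr(0, npos) keeps the whole name unchanged.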
    /**
     * Set firewall rule for uid
     *
     * @param childChain target chain
     * @param uid uid to allow/deny
     * @param firewallRule either FIREWALL_RULE_ALLOW or FIREWALL_RULE_DENY
     * @throws ServiceSpecificException in case of failure, with an error code indicating the
     *                                  cause of the failure.
     */
    public void setUidRule(final int childChain, final int uid, final int firewallRule) {
        throwIfPreT("setUidRule is not available on pre-T devices");
int TrafficController::toggleUidOwnerMap(ChildChain chain, bool enable) {
    std::lock_guard guard(mMutex);
    uint32_t key = UID_RULES_CONFIGURATION_KEY;
    auto oldConfigure = mConfigurationMap.readValue(key);
    if (!oldConfigure.ok()) {
        ALOGE("Cannot read the old configuration from map: %s",
              oldConfigure.error().message().c_str());
        return -oldConfigure.error().code();
    }
    uint32_t match;
    switch (chain) {
        case DOZABLE:
            match = DOZABLE_MATCH;
            break;
        case STANDBY:
            match = STANDBY_MATCH;
            break;
        case POWERSAVE:
            match = POWERSAVE_MATCH;
            break;
        case RESTRICTED:
            match = RESTRICTED_MATCH;
            break;
        case LOW_POWER_STANDBY:
            match = LOW_POWER_STANDBY_MATCH;
            break;
        case OEM_DENY_1:
            match = OEM_DENY_1_MATCH;
            break;
        case OEM_DENY_2:
            match = OEM_DENY_2_MATCH;
            break;
        case OEM_DENY_3:
            match = OEM_DENY_3_MATCH;
            break;
        default:
            return -EINVAL;
    }
    BpfConfig newConfiguration =
            enable ? (oldConfigure.value() | match) : (oldConfigure.value() & ~match);
    Status res = mConfigurationMap.writeValue(key, newConfiguration, BPF_EXIST);
    if (!isOk(res)) {
        ALOGE("Failed to toggleUidOwnerMap(%d): %s", chain, res.msg().c_str());
    }
    return -res.code();
}
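The enable/disable arithmetic is plain bit masking. A worked example (values illustrative), starting from a configuration that already has STANDBY_MATCH set:

// enable  DOZABLE:  STANDBY_MATCH | DOZABLE_MATCH                   -> both bits set
// disable DOZABLE: (STANDBY_MATCH | DOZABLE_MATCH) & ~DOZABLE_MATCH -> STANDBY_MATCH only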
/* The socket must have its spinlock held when we get
 * here, unless it is a TCP_LISTEN socket.
 *
 * We have a potential double-lock case here, so even when
 * doing backlog processing we use the BH locking scheme.
 * This is because we cannot sleep with the original spinlock
 * held.
 */
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
    enum skb_drop_reason reason;
    struct sock *rsk;
    // Suppose the packet we need to handle belongs to a connection in the ESTABLISHED state
    if (sk->sk_state == TCP_ESTABLISHED) {
        /* Fast path */
        struct dst_entry *dst;
#ifdef CONFIG_CGROUP_BPF
static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
{
    .....
    /* The skb will be handled in the
     * bpf_skops_established() or
     * bpf_skops_write_hdr_opt().
     */
    switch (sk->sk_state) {
    case TCP_SYN_RECV:
    case TCP_SYN_SENT:
    case TCP_LISTEN:
        return;
    }
/**
 * __cgroup_bpf_run_filter_sock_ops() - Run a program on a sock
 * @sk: socket to get cgroup from
 * @sock_ops: bpf_sock_ops_kern struct to pass to program. Contains
 *            sk with connection information (IP addresses, etc.) May not contain
 *            cgroup info if it is a req sock.
 * @atype: The type of program to be executed
 *
 * socket passed is expected to be of type INET or INET6.
 *
 * The program type passed in via @atype must be suitable for sock_ops
 * filtering. No further check is performed to assert that.
 *
 * This function will return %-EPERM if an attached program was found
 * and it returned != 1 during execution. In all other cases, 0 is returned.
 */
int __cgroup_bpf_run_filter_sock_ops(struct sock *sk,
                                     struct bpf_sock_ops_kern *sock_ops,
                                     enum cgroup_bpf_attach_type atype)
{
    struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
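For context, a cgroup-attached sock_ops program that this hook would run might look like the following sketch (illustrative, using libbpf conventions; it merely opts in to TCP state-change callbacks and returns 1 so the hook above reports success):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// Illustrative sock_ops program: once a connection is established, request
// state-change callbacks; returning 1 keeps __cgroup_bpf_run_filter_sock_ops
// from reporting -EPERM.
SEC("sockops")
int sockops_example(struct bpf_sock_ops *skops)
{
    switch (skops->op) {
    case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
    case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
        bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_STATE_CB_FLAG);
        break;
    }
    return 1;
}

char _license[] SEC("license") = "GPL";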
// DROP_IF_SET is set of rules that DROP if rule is globally enabled, and per-uid bit is set
#define DROP_IF_SET (STANDBY_MATCH | OEM_DENY_1_MATCH | OEM_DENY_2_MATCH | OEM_DENY_3_MATCH)
// DROP_IF_UNSET is set of rules that should DROP if globally enabled, and per-uid bit is NOT set
#define DROP_IF_UNSET (DOZABLE_MATCH | POWERSAVE_MATCH | RESTRICTED_MATCH | LOW_POWER_STANDBY_MATCH)
    // Warning: funky bit-wise arithmetic: in parallel, for all DROP_IF_SET/UNSET rules
    // check whether the rules are globally enabled, and if so whether the rules are
    // set/unset for the specific uid. DROP if that is the case for ANY of the rules.
    // We achieve this by masking out only the bits/rules we're interested in checking,
    // and negating (via bit-wise xor) the bits/rules that should drop if unset.
    if (enabledRules & (DROP_IF_SET | DROP_IF_UNSET) & (uidRules ^ DROP_IF_UNSET)) return DROP;
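A worked example of the masking trick (values illustrative): suppose only DOZABLE is globally enabled and the uid has not been added to the dozable allowlist:

// enabledRules = DOZABLE_MATCH, uidRules = 0
// uidRules ^ DROP_IF_UNSET         -> every DROP_IF_UNSET bit set (uid bit was 0)
// & (DROP_IF_SET | DROP_IF_UNSET)  -> DOZABLE_MATCH still set
// & enabledRules                   -> DOZABLE_MATCH != 0, so the packet is DROPped
// Conversely, with uidRules = DOZABLE_MATCH the xor clears that bit and the uid PASSes.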
    if (!egress && skb->ifindex != 1) {
        if (uidRules & IIF_MATCH) {
            if (allowed_iif && skb->ifindex != allowed_iif) {
                // Drops packets not coming from lo nor the allowed interface
                // allowed interface=0 is a wildcard and does not drop packets
                return DROP_UNLESS_DNS;
            }
        } else if (uidRules & LOCKDOWN_VPN_MATCH) {
            // Drops packets not coming from lo and rule does not have IIF_MATCH but has
            // LOCKDOWN_VPN_MATCH
            return DROP_UNLESS_DNS;
        }
    }
    return PASS;
}
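As an illustrative scenario (restating the logic above, not quoted from the source): with VPN lockdown active, a uid carrying LOCKDOWN_VPN_MATCH but no IIF_MATCH has every non-loopback ingress packet answered with DROP_UNLESS_DNS, while a uid whose IIF_MATCH is bound to the tun interface's ifindex still receives traffic arriving there; ifindex 1 is loopback and is always exempt.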