/*
 * Higher prio MVP can preempt lower prio MVP.
 *
 * However, the lower prio MVP slice will be longer, since we expect them
 * to be the work horses. For example, binders will have higher MVP prio
 * and can preempt long-running rtg prio tasks, but binders lose their
 * powers within 3 msec, whereas rtg prio tasks can run more than that.
 */
int walt_get_mvp_task_prio(struct task_struct *p)
{
    if (walt_procfs_low_latency_task(p) ||
            walt_pipeline_low_latency_task(p))
        return WALT_LL_PIPE_MVP;

    if (per_task_boost(p) == TASK_BOOST_STRICT_MAX)
        return WALT_TASK_BOOST_MVP;
    if (walt_binder_low_latency_task(p))
        return WALT_BINDER_MVP;
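    /*
     * The remaining cases are elided in the excerpt. In the upstream
     * walt sources the function presumably ends roughly as sketched
     * below (a reconstruction, not verified against this exact kernel
     * version): rtg high-prio tasks get the lowest MVP prio, and
     * everything else is not MVP at all.
     */
    if (task_rtg_high_prio(p))
        return WALT_RTG_MVP;

    return WALT_NOT_MVP;
}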
if (unlikely(walt_disabled))
    return;

if (task && ((task_in_related_thread_group(current) &&
        task->group_leader->prio < MAX_RT_PRIO) ||
        (current->group_leader->prio < MAX_RT_PRIO &&
        task_in_related_thread_group(task))))
    wts->low_latency |= WALT_LOW_LATENCY_BINDER;
else
    /*
     * Clear the low_latency flag if the criterion above is not met.
     * This handles the usecase where WALT_LOW_LATENCY_BINDER is set
     * on a binder thread by one task, and before WALT clears the flag
     * after timer expiry, some other task tries to use the same
     * binder thread.
     *
     * The flag only gets cleared when a binder transaction is
     * initiated and the condition above to set the flag is not
     * satisfied.
     */
    wts->low_latency &= ~WALT_LOW_LATENCY_BINDER;
}
For a binder thread to be marked low latency, the task must satisfy the following conditions:
Either the waker (current) is in a related thread group, which for now can simply be read as a top-app thread, and the wakee's group leader has RT priority; or, symmetrically, the waker's group leader has RT priority and the wakee (task) is in the related thread group.
    /*
     * This can happen during migration, or enq/deq for a prio/class
     * change. If the task was once MVP but got demoted, it will not be
     * MVP again until it goes to sleep.
     */
    if (wts->total_exec > walt_cfs_mvp_task_limit(p))
        return;
    /*
     * We inserted the task at the appropriate position. Take the
     * task runtime snapshot. From now onwards we use this point as a
     * baseline to enforce the slice and demotion.
     */
    if (!wts->total_exec) { /* queue after sleep */
        wts->sum_exec_snapshot_for_total = p->se.sum_exec_runtime;
        wts->sum_exec_snapshot_for_slice = p->se.sum_exec_runtime;
    }
}
total_exec records the total time the task has been running; if it exceeds the MVP task time limit, the task is not added to the MVP task list.
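The limit itself differs by MVP priority: the comment at the top of walt_get_mvp_task_prio says binders lose their powers within 3 msec while rtg prio tasks may run longer. Below is a minimal sketch of how such a limit function could look, assuming the WALT_MVP_SLICE/WALT_MVP_LIMIT constants and the wts->mvp_prio field; the names follow the upstream walt sources but should be treated as assumptions for this exact kernel version:

#define WALT_MVP_SLICE  3000000U                /* 3 ms, in ns */
#define WALT_MVP_LIMIT  (4 * WALT_MVP_SLICE)    /* 12 ms, in ns */

static inline unsigned int walt_cfs_mvp_task_limit(struct task_struct *p)
{
    struct walt_task_struct *wts =
        (struct walt_task_struct *) p->android_vendor_data1;

    /* Binder MVP tasks are high prio but get only a single slice */
    if (wts->mvp_prio == WALT_BINDER_MVP)
        return WALT_MVP_SLICE;

    return WALT_MVP_LIMIT;
}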
    if (list_empty(&wts->mvp_list) || (wts->mvp_list.next == NULL))
        goto out;
    walt_cfs_account_mvp_runtime(rq, rq->curr);
    /*
     * If current is no longer at the head of the MVP queue, we have
     * to re-schedule to see if we can run any other task, including
     * MVP tasks.
     */
    if ((wrq->mvp_tasks.next != &wts->mvp_list) && rq->cfs.h_nr_running > 1)
        resched_curr(rq);
    /*
     * current is not MVP, so the preemption decision
     * is simple.
     */
    if (!curr_is_mvp) {
        if (p_is_mvp)
            goto preempt;
        return; /* CFS decides preemption */
    }
    /*
     * current is MVP. Update its runtime before deciding the
     * preemption.
     */
    walt_cfs_account_mvp_runtime(rq, c);
    resched = (wrq->mvp_tasks.next != &wts_c->mvp_list);
    /*
     * current is no longer eligible to run. It must have been
     * picked (because of MVP) ahead of other tasks in the CFS
     * tree, so drive preemption to pick up the next task from
     * the tree, which also includes picking up the first in
     * the MVP queue.
     */
    if (resched)
        goto preempt;
    /* current is the first in the queue, so no preemption */
    *nopreempt = true;
    trace_walt_cfs_mvp_wakeup_nopreempt(c, wts_c, walt_cfs_mvp_task_limit(c));
    return;

preempt:
    *preempt = true;
    trace_walt_cfs_mvp_wakeup_preempt(p, wts_p, walt_cfs_mvp_task_limit(p));
}
/*
 * MVP task runtime update happens here. Three possibilities:
 *
 * de-activated: The MVP consumed its runtime. Non-MVP tasks can preempt.
 * slice expired: The MVP slice is expired and another MVP can preempt.
 * slice not expired: This MVP task can continue to run.
 */
static void walt_cfs_account_mvp_runtime(struct rq *rq, struct task_struct *curr)
{
    struct walt_rq *wrq = (struct walt_rq *) rq->android_vendor_data1;
    struct walt_task_struct *wts = (struct walt_task_struct *) curr->android_vendor_data1;
    u64 slice;
    unsigned int limit;
lockdep_assert_held(&rq->__lock);
    /*
     * RQ clock update happens in the tick path in the scheduler.
     * Since we drop the lock in the scheduler before calling
     * into the vendor hook, it is possible that the update flags are
     * reset by another rq lock and unlock. Do the update here
     * if required.
     */
    if (!(rq->clock_update_flags & RQCF_UPDATED))
        update_rq_clock(rq);
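    /*
     * The excerpt omits the accounting step that follows. Based on the
     * snapshot fields taken in the enqueue path above, the elapsed
     * totals are presumably derived along these lines (a reconstruction
     * from the upstream walt sources, not verified against this exact
     * kernel):
     */
    wts->total_exec = curr->se.sum_exec_runtime -
                wts->sum_exec_snapshot_for_total;
    slice = curr->se.sum_exec_runtime -
                wts->sum_exec_snapshot_for_slice;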
    /* slice is not expired */
    if (slice < WALT_MVP_SLICE)
        return;
    wts->sum_exec_snapshot_for_slice = curr->se.sum_exec_runtime;
    /*
     * The slice is expired. Check if we have to deactivate the
     * MVP task, otherwise requeue the task in the list so
     * that other MVP tasks get a chance.
     */
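    /*
     * The body that the comment above describes is elided in the
     * excerpt. A sketch of the likely logic, assuming the
     * walt_cfs_deactivate_mvp_task() helper (used later in the dequeue
     * path) and a wrq->num_mvp_tasks counter as in the upstream walt
     * sources: demote the task once total_exec crosses its limit,
     * otherwise rotate it to the tail so the next MVP task gets a turn.
     */
    if (wts->total_exec > walt_cfs_mvp_task_limit(curr)) {
        walt_cfs_deactivate_mvp_task(rq, curr);
        return;
    }

    /* Only one MVP task queued: nothing to rotate */
    if (wrq->num_mvp_tasks == 1)
        return;

    /* Slice expired: re-queue the task at the tail */
    list_del(&wts->mvp_list);
    list_add_tail(&wts->mvp_list, &wrq->mvp_tasks);
}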
    /* We don't have MVP tasks queued */
    if (list_empty(&wrq->mvp_tasks))
        return;
    /* Return the first task from the MVP queue */
    wts = list_first_entry(&wrq->mvp_tasks, struct walt_task_struct, mvp_list);
    mvp = wts_to_ts(wts);
    *p = mvp;
    *se = &mvp->se;
    *repick = true;
    if (simple) {
        for_each_sched_entity((*se)) {
            /*
             * TODO: If CFS_BANDWIDTH is enabled, we might pick
             * from a throttled cfs_rq.
             */
            cfs_rq = cfs_rq_of(*se);
            set_next_entity(cfs_rq, *se);
        }
    }
    if (!list_empty(&wts->mvp_list) && wts->mvp_list.next)
        walt_cfs_deactivate_mvp_task(rq, p);
    /*
     * Reset the exec time during sleep so that it starts
     * from scratch upon the next wakeup. total_exec should
     * be preserved when the task is enq/deq'd while it is on
     * the runqueue.
     */
    if (READ_ONCE(p->__state) != TASK_RUNNING)
        wts->total_exec = 0;
}
VI. Summary
The above is Qualcomm's stock MVP scheduling. We can see the following characteristics:
Broad coverage with coarse admission granularity: tasks are admitted into MVP based on process prio and related-thread-group membership, without checking whether a process is actually involved in drawing, animation, display, or interaction, so the number of processes receiving priority treatment may be excessive.