Added kernel
This commit is contained in:
@@ -0,0 +1,391 @@
|
||||
From mboxrd@z Thu Jan 1 00:00:00 1970
|
||||
Return-Path: <linux-kernel-owner@kernel.org>
|
||||
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
|
||||
aws-us-west-2-korg-lkml-1.web.codeaurora.org
|
||||
X-Spam-Level:
|
||||
X-Spam-Status: No, score=-26.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED,
|
||||
DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,
|
||||
INCLUDES_PATCH,MAILING_LIST_MULTI,MENTIONS_GIT_HOSTING,SPF_HELO_NONE,SPF_PASS,
|
||||
USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no
|
||||
version=3.4.0
|
||||
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
|
||||
by smtp.lore.kernel.org (Postfix) with ESMTP id BA09AC433ED
|
||||
for <linux-kernel@archiver.kernel.org>; Thu, 20 May 2021 06:54:04 +0000 (UTC)
|
||||
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
|
||||
by mail.kernel.org (Postfix) with ESMTP id 99A326108C
|
||||
for <linux-kernel@archiver.kernel.org>; Thu, 20 May 2021 06:54:04 +0000 (UTC)
|
||||
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
|
||||
id S230359AbhETGzY (ORCPT <rfc822;linux-kernel@archiver.kernel.org>);
|
||||
Thu, 20 May 2021 02:55:24 -0400
|
||||
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37854 "EHLO
|
||||
lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
|
||||
with ESMTP id S229534AbhETGzX (ORCPT
|
||||
<rfc822;linux-kernel@vger.kernel.org>);
|
||||
Thu, 20 May 2021 02:55:23 -0400
|
||||
Received: from mail-qk1-x74a.google.com (mail-qk1-x74a.google.com [IPv6:2607:f8b0:4864:20::74a])
|
||||
by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2DB47C061574
|
||||
for <linux-kernel@vger.kernel.org>; Wed, 19 May 2021 23:54:01 -0700 (PDT)
|
||||
Received: by mail-qk1-x74a.google.com with SMTP id z2-20020a3765020000b02903a5f51b1c74so684222qkb.7
|
||||
for <linux-kernel@vger.kernel.org>; Wed, 19 May 2021 23:54:01 -0700 (PDT)
|
||||
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
||||
d=google.com; s=20161025;
|
||||
h=date:message-id:mime-version:subject:from:to:cc;
|
||||
bh=Y3hJAMwzbf34YQU8QX5BsSV2xoCmy36DYto5ZLIStJc=;
|
||||
b=r/V1aR1KHSQ2RwrGIEEbdDV0RqV+tdHJLBnCnPMLdI4quvTDua13dKOHpxS2Rc7bc4
|
||||
6ON9rpxOpEhBMPLS8798xqa4jQBTINTCKNlIi3TpaV8t/shwlViCb4Y9bZ4ng8VEsXp3
|
||||
H2s3DQbb47Iio7YrOnBahF4qBDJl2fkHL257Ao4wgzgG/ZCK2oy5dcipOFrEpQqPk5vO
|
||||
hhTC4Zr1DE3XI+Y+uTozfI8CoAtllv6qL31gAWcycyeN72teVQa9ilaeTdglxhCO9DVG
|
||||
BFkiZH+21Eo3M8PRz4OztnGgRtMvbgNnuUWZ68bnZkO4wMyL6mX2520HA9NQNkGSXLnP
|
||||
74Zg==
|
||||
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
||||
d=1e100.net; s=20161025;
|
||||
h=x-gm-message-state:date:message-id:mime-version:subject:from:to:cc;
|
||||
bh=Y3hJAMwzbf34YQU8QX5BsSV2xoCmy36DYto5ZLIStJc=;
|
||||
b=h+lPKp7mQ6QF8fW3fzT7HQgoaLvOfkKtGjwFNvWMOi8UMz94CGWpTgC4tsEX0PenoK
|
||||
snCz9kDMIR35YO9Dlhz1Ci/04htNK9p+rnvGn7ri/Oin5fFeyVQ15qh33Bgut5m3SKR2
|
||||
imeFBLkWsXGtFd23XCBmjIcNrqZA0LxhIwoYCbrVWSq5H29Eo6C9ab0gmJ1oY0DCPOL/
|
||||
Fi8M2neMwLN09EebwZONh8AGuP0XiL0oSnAGDZAhaaAimfHrPBMMYCrxpjnaGxPG2hY0
|
||||
gvju/bIag6Ug8urHdAAGWsdLaNIsdrIKWlaL76FjcULVwdAARKQifiMMTwJ2JU5y5jMG
|
||||
OKRg==
|
||||
X-Gm-Message-State: AOAM5322gu+Tvm1pCjTiKdWMNb3cz1Z6+VCfYHkB7vDvNRYItvu08gEA
|
||||
/W/WlY6Lc6/4O5nrreOspbq5n77XobE=
|
||||
X-Google-Smtp-Source: ABdhPJy+4EmI1VvFDhlB3errX+0774OdClFY8nQyFqDe9Pqq8FOdLBnXamEbn+N9M1F/HG6sJ6Mw/n7qw/8=
|
||||
X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:595d:62ee:f08:8e83])
|
||||
(user=yuzhao job=sendgmr) by 2002:a0c:e4cd:: with SMTP id g13mr3727631qvm.34.1621493640278;
|
||||
Wed, 19 May 2021 23:54:00 -0700 (PDT)
|
||||
Date: Thu, 20 May 2021 00:53:41 -0600
|
||||
Message-Id: <20210520065355.2736558-1-yuzhao@google.com>
|
||||
Mime-Version: 1.0
|
||||
X-Mailer: git-send-email 2.31.1.751.gd2f1c929bd-goog
|
||||
Subject: [PATCH v3 00/14] Multigenerational LRU Framework
|
||||
From: Yu Zhao <yuzhao@google.com>
|
||||
To: linux-mm@kvack.org
|
||||
Cc: Alex Shi <alexs@kernel.org>, Andi Kleen <ak@linux.intel.com>,
|
||||
Andrew Morton <akpm@linux-foundation.org>,
|
||||
Dave Chinner <david@fromorbit.com>,
|
||||
Dave Hansen <dave.hansen@linux.intel.com>,
|
||||
Donald Carr <sirspudd@gmail.com>,
|
||||
Hillf Danton <hdanton@sina.com>, Jens Axboe <axboe@kernel.dk>,
|
||||
Johannes Weiner <hannes@cmpxchg.org>,
|
||||
Jonathan Corbet <corbet@lwn.net>,
|
||||
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
|
||||
Konstantin Kharlamov <hi-angel@yandex.ru>,
|
||||
Marcus Seyfarth <m.seyfarth@gmail.com>,
|
||||
Matthew Wilcox <willy@infradead.org>,
|
||||
Mel Gorman <mgorman@suse.de>,
|
||||
Miaohe Lin <linmiaohe@huawei.com>,
|
||||
Michael Larabel <michael@michaellarabel.com>,
|
||||
Michal Hocko <mhocko@suse.com>,
|
||||
Michel Lespinasse <michel@lespinasse.org>,
|
||||
Rik van Riel <riel@surriel.com>,
|
||||
Roman Gushchin <guro@fb.com>,
|
||||
Tim Chen <tim.c.chen@linux.intel.com>,
|
||||
Vlastimil Babka <vbabka@suse.cz>,
|
||||
Yang Shi <shy828301@gmail.com>,
|
||||
Ying Huang <ying.huang@intel.com>, Zi Yan <ziy@nvidia.com>,
|
||||
linux-kernel@vger.kernel.org, lkp@lists.01.org,
|
||||
page-reclaim@google.com, Yu Zhao <yuzhao@google.com>
|
||||
Content-Type: text/plain; charset="UTF-8"
|
||||
Precedence: bulk
|
||||
List-ID: <linux-kernel.vger.kernel.org>
|
||||
X-Mailing-List: linux-kernel@vger.kernel.org
|
||||
List-Archive: <https://lore.kernel.org/lkml/>
|
||||
|
||||
What's new in v3
|
||||
================
|
||||
1) Fixed a bug reported by the Arch Linux kernel team:
|
||||
https://github.com/zen-kernel/zen-kernel/issues/207
|
||||
2) Rebased to v5.13-rc2.
|
||||
|
||||
Highlights from v2
|
||||
==================
|
||||
Konstantin Kharlamov <hi-angel@yandex.ru> reported:
|
||||
My success story: I have Archlinux with 8G RAM + zswap + swap. While
|
||||
developing, I have lots of apps opened such as multiple LSP-servers
|
||||
for different langs, chats, two browsers, etc. Usually, my system
|
||||
gets quickly to a point of SWAP-storms, where I have to kill
|
||||
LSP-servers, restart browsers to free memory, etc, otherwise the
|
||||
system lags heavily and is barely usable.
|
||||
|
||||
1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
|
||||
patchset, and I started up by opening lots of apps to create memory
|
||||
pressure, and worked for a day like this. Till now I had *not a
|
||||
single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never
|
||||
getting to the point of 3G in SWAP before without a single
|
||||
SWAP-storm.
|
||||
|
||||
TLDR
|
||||
====
|
||||
The current page reclaim is too expensive in terms of CPU usage and
|
||||
often making poor choices about what to evict. We would like to offer
|
||||
an alternative framework that is performant, versatile and
|
||||
straightforward.
|
||||
|
||||
Repo
|
||||
====
|
||||
git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/53/1253/1
|
||||
|
||||
Problems
|
||||
========
|
||||
Notion of active/inactive
|
||||
-------------------------
|
||||
Data centers need to predict whether a job can successfully land on a
|
||||
machine without actually impacting the existing jobs. The granularity
|
||||
of the active/inactive is too coarse to be useful for job schedulers
|
||||
to make such decisions. In addition, data centers need to monitor
|
||||
their memory utilization for horizontal scaling. The active/inactive
|
||||
cannot give any insight into a pool of machines because aggregating
|
||||
them across multiple machines without a common frame of reference
|
||||
yields no meaningful results.
|
||||
|
||||
Phones and laptops need to make good choices about what to evict,
|
||||
since they are more sensitive to the major faults and the power
|
||||
consumption. Major faults can cause "janks" (slow UI renderings) and
|
||||
negatively impact user experience. The selection between anon and file
|
||||
types has been suboptimal because direct comparisons between them are
|
||||
infeasible based on the notion of active/inactive. On phones and
|
||||
laptops, executable pages are frequently evicted despite the fact that
|
||||
there are many less recently used anon pages. Conversely, on
|
||||
workstations building large projects, anon pages are occasionally
|
||||
swapped out while page cache contains many less recently used pages.
|
||||
|
||||
Fundamentally, the notion of active/inactive has very limited ability
|
||||
to measure temporal locality.
|
||||
|
||||
Incremental scans via rmap
|
||||
--------------------------
|
||||
Each incremental scan picks up at where the last scan left off and
|
||||
stops after it has found a handful of unreferenced pages. For
|
||||
workloads using a large amount of anon memory, incremental scans lose
|
||||
the advantage under sustained memory pressure due to high ratios of
|
||||
the number of scanned pages to the number of reclaimed pages. On top
|
||||
of this, the rmap has complex data structures. And the combined
|
||||
effects typically result in a high amount of CPU usage in the reclaim
|
||||
path.
|
||||
|
||||
Simply put, incremental scans via rmap have no regard for spatial
|
||||
locality.
|
||||
|
||||
Solutions
|
||||
=========
|
||||
Notion of generation numbers
|
||||
----------------------------
|
||||
The notion of generation numbers introduces a temporal dimension. Each
|
||||
generation is a dot on the timeline and it includes all pages that
|
||||
have been referenced since it was created.
|
||||
|
||||
Given an lruvec, scans of anon and file types and selections between
|
||||
them are all based on direct comparisons of generation numbers, which
|
||||
are simple and yet effective.
|
||||
|
||||
A larger number of pages can be spread out across a configurable
|
||||
number of generations, which are associated with timestamps and
|
||||
therefore aggregatable. This is specifically designed for data centers
|
||||
that require working set estimation and proactive reclaim.
|
||||
|
||||
Differential scans via page tables
|
||||
----------------------------------
|
||||
Each differential scan discovers all pages that have been referenced
|
||||
since the last scan. It walks the mm_struct list associated with an
|
||||
lruvec to scan page tables of processes that have been scheduled since
|
||||
the last scan. The cost of each differential scan is roughly
|
||||
proportional to the number of referenced pages it discovers. Page
|
||||
tables usually have good memory locality. The end result is generally
|
||||
a significant reduction in CPU usage, for workloads using a large
|
||||
amount of anon memory.
|
||||
|
||||
For workloads that have extremely sparse page tables, it is still
|
||||
possible to fall back to incremental scans via rmap.
|
||||
|
||||
Framework
|
||||
=========
|
||||
For each lruvec, evictable pages are divided into multiple
|
||||
generations. The youngest generation number is stored in
|
||||
lrugen->max_seq for both anon and file types as they are aged on an
|
||||
equal footing. The oldest generation numbers are stored in
|
||||
lrugen->min_seq[2] separately for anon and file types as clean file
|
||||
pages can be evicted regardless of may_swap or may_writepage. These
|
||||
three variables are monotonically increasing. Generation numbers are
|
||||
truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into
|
||||
page->flags. The sliding window technique is used to prevent truncated
|
||||
generation numbers from overlapping. Each truncated generation number
|
||||
is an index to
|
||||
lrugen->lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]. Evictable
|
||||
pages are added to the per-zone lists indexed by lrugen->max_seq or
|
||||
lrugen->min_seq[2] (modulo MAX_NR_GENS), depending on their types.
|
||||
|
||||
Each generation is then divided into multiple tiers. Tiers represent
|
||||
levels of usage from file descriptors only. Pages accessed N times via
|
||||
file descriptors belong to tier order_base_2(N). Each generation
|
||||
contains at most MAX_NR_TIERS tiers, and they require additional
|
||||
MAX_NR_TIERS-2 bits in page->flags. In contrast to moving across
|
||||
generations which requires the lru lock for the list operations,
|
||||
moving across tiers only involves an atomic operation on page->flags
|
||||
and therefore has a negligible cost. A feedback loop modeled after the
|
||||
PID controller monitors the refault rates across all tiers and decides
|
||||
when to activate pages from which tiers in the reclaim path.
|
||||
|
||||
The framework comprises two conceptually independent components: the
|
||||
aging and the eviction, which can be invoked separately from user
|
||||
space for the purpose of working set estimation and proactive reclaim.
|
||||
|
||||
Aging
|
||||
-----
|
||||
The aging produces young generations. Given an lruvec, the aging scans
|
||||
page tables for referenced pages of this lruvec. Upon finding one, the
|
||||
aging updates its generation number to max_seq. After each round of
|
||||
scan, the aging increments max_seq.
|
||||
|
||||
The aging maintains either a system-wide mm_struct list or per-memcg
|
||||
mm_struct lists, and it only scans page tables of processes that have
|
||||
been scheduled since the last scan.
|
||||
|
||||
The aging is due when both of min_seq[2] reaches max_seq-1, assuming
|
||||
both anon and file types are reclaimable.
|
||||
|
||||
Eviction
|
||||
--------
|
||||
The eviction consumes old generations. Given an lruvec, the eviction
|
||||
scans the pages on the per-zone lists indexed by either of min_seq[2].
|
||||
It first tries to select a type based on the values of min_seq[2].
|
||||
When anon and file types are both available from the same generation,
|
||||
it selects the one that has a lower refault rate.
|
||||
|
||||
During a scan, the eviction sorts pages according to their new
|
||||
generation numbers, if the aging has found them referenced. It also
|
||||
moves pages from the tiers that have higher refault rates than tier 0
|
||||
to the next generation.
|
||||
|
||||
When it finds all the per-zone lists of a selected type are empty, the
|
||||
eviction increments min_seq[2] indexed by this selected type.
|
||||
|
||||
Use cases
|
||||
=========
|
||||
High anon workloads
|
||||
-------------------
|
||||
Our real-world benchmark that browses popular websites in multiple
|
||||
Chrome tabs demonstrates 51% less CPU usage from kswapd and 52% (full)
|
||||
less PSI.
|
||||
|
||||
Without this patchset, the profile of kswapd looks like:
|
||||
31.03% page_vma_mapped_walk
|
||||
25.59% lzo1x_1_do_compress
|
||||
4.63% do_raw_spin_lock
|
||||
3.89% vma_interval_tree_iter_next
|
||||
3.33% vma_interval_tree_subtree_search
|
||||
|
||||
With this patchset, it looks like:
|
||||
49.36% lzo1x_1_do_compress
|
||||
4.54% page_vma_mapped_walk
|
||||
4.45% memset_erms
|
||||
3.47% walk_pte_range
|
||||
2.88% zram_bvec_rw
|
||||
|
||||
In addition, direct reclaim latency is reduced by 22% at 99th
|
||||
percentile and the number of refaults is reduced by 7%. Both metrics
|
||||
are important to phones and laptops as they are highly correlated to
|
||||
user experience.
|
||||
|
||||
High page cache workloads
|
||||
-------------------------
|
||||
Tiers are specifically designed to improve the performance of page
|
||||
cache under memory pressure. The fio/io_uring benchmark shows 14%
|
||||
increase in IOPS when randomly accessing in buffered I/O mode.
|
||||
|
||||
Without this patchset, the profile of fio/io_uring looks like:
|
||||
Children Self Symbol
|
||||
-----------------------------------
|
||||
12.03% 0.03% __page_cache_alloc
|
||||
6.53% 0.83% shrink_active_list
|
||||
2.53% 0.44% mark_page_accessed
|
||||
|
||||
With this patchset, it looks like:
|
||||
Children Self Symbol
|
||||
-----------------------------------
|
||||
9.45% 0.03% __page_cache_alloc
|
||||
0.52% 0.46% mark_page_accessed
|
||||
|
||||
Working set estimation
|
||||
----------------------
|
||||
User space can invoke the aging by writing "+ memcg_id node_id gen
|
||||
[swappiness]" to /sys/kernel/debug/lru_gen. This debugfs interface
|
||||
also provides the birth time and the size of each generation.
|
||||
|
||||
For example, given a pool of machines, a job scheduler periodically
|
||||
invokes the aging to estimate the working set of each machine. And it
|
||||
ranks the machines based on the sizes of their working sets and
|
||||
selects the most ideal ones to land new jobs.
|
||||
|
||||
Proactive reclaim
|
||||
-----------------
|
||||
User space can invoke the eviction by writing "- memcg_id node_id gen
|
||||
[swappiness] [nr_to_reclaim]" to /sys/kernel/debug/lru_gen. Multiple
|
||||
command lines are supported, so does concatenation with delimiters.
|
||||
|
||||
For example, a job scheduler can invoke the eviction if it anticipates
|
||||
new jobs. The savings from proactive reclaim may provide certain SLA
|
||||
when new jobs actually land.
|
||||
|
||||
Yu Zhao (14):
|
||||
include/linux/memcontrol.h: do not warn in page_memcg_rcu() if
|
||||
!CONFIG_MEMCG
|
||||
include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA
|
||||
include/linux/cgroup.h: export cgroup_mutex
|
||||
mm, x86: support the access bit on non-leaf PMD entries
|
||||
mm/vmscan.c: refactor shrink_node()
|
||||
mm/workingset.c: refactor pack_shadow() and unpack_shadow()
|
||||
mm: multigenerational lru: groundwork
|
||||
mm: multigenerational lru: activation
|
||||
mm: multigenerational lru: mm_struct list
|
||||
mm: multigenerational lru: aging
|
||||
mm: multigenerational lru: eviction
|
||||
mm: multigenerational lru: user interface
|
||||
mm: multigenerational lru: Kconfig
|
||||
mm: multigenerational lru: documentation
|
||||
|
||||
Documentation/vm/index.rst | 1 +
|
||||
Documentation/vm/multigen_lru.rst | 143 ++
|
||||
arch/Kconfig | 9 +
|
||||
arch/x86/Kconfig | 1 +
|
||||
arch/x86/include/asm/pgtable.h | 2 +-
|
||||
arch/x86/mm/pgtable.c | 5 +-
|
||||
fs/exec.c | 2 +
|
||||
fs/fuse/dev.c | 3 +-
|
||||
include/linux/cgroup.h | 15 +-
|
||||
include/linux/memcontrol.h | 7 +-
|
||||
include/linux/mm.h | 2 +
|
||||
include/linux/mm_inline.h | 234 +++
|
||||
include/linux/mm_types.h | 107 ++
|
||||
include/linux/mmzone.h | 117 ++
|
||||
include/linux/nodemask.h | 1 +
|
||||
include/linux/page-flags-layout.h | 19 +-
|
||||
include/linux/page-flags.h | 4 +-
|
||||
include/linux/pgtable.h | 4 +-
|
||||
include/linux/swap.h | 4 +-
|
||||
kernel/bounds.c | 6 +
|
||||
kernel/events/uprobes.c | 2 +-
|
||||
kernel/exit.c | 1 +
|
||||
kernel/fork.c | 10 +
|
||||
kernel/kthread.c | 1 +
|
||||
kernel/sched/core.c | 2 +
|
||||
mm/Kconfig | 58 +
|
||||
mm/huge_memory.c | 5 +-
|
||||
mm/khugepaged.c | 2 +-
|
||||
mm/memcontrol.c | 28 +
|
||||
mm/memory.c | 10 +-
|
||||
mm/migrate.c | 2 +-
|
||||
mm/mm_init.c | 6 +-
|
||||
mm/mmzone.c | 2 +
|
||||
mm/rmap.c | 6 +
|
||||
mm/swap.c | 22 +-
|
||||
mm/swapfile.c | 6 +-
|
||||
mm/userfaultfd.c | 2 +-
|
||||
mm/vmscan.c | 2638 ++++++++++++++++++++++++++++-
|
||||
mm/workingset.c | 169 +-
|
||||
39 files changed, 3498 insertions(+), 160 deletions(-)
|
||||
create mode 100644 Documentation/vm/multigen_lru.rst
|
||||
|
||||
--
|
||||
2.31.1.751.gd2f1c929bd-goog
|
||||
|
||||
|
||||
Reference in New Issue
Block a user