-
Notifications
You must be signed in to change notification settings - Fork 0
/
t.mbox
2839 lines (2615 loc) · 114 KB
/
t.mbox
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: <[email protected]>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
by smtp.lore.kernel.org (Postfix) with ESMTP id CF2EAC433EF
for <[email protected]>; Tue, 16 Nov 2021 01:44:44 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
by mail.kernel.org (Postfix) with ESMTP id B85146120F
for <[email protected]>; Tue, 16 Nov 2021 01:44:44 +0000 (UTC)
Received: ([email protected]) by vger.kernel.org via listexpand
id S1344323AbhKPBre (ORCPT
<rfc822;[email protected]>);
Mon, 15 Nov 2021 20:47:34 -0500
Received: from mga07.intel.com ([134.134.136.100]:36057 "EHLO mga07.intel.com"
rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
id S1351588AbhKPBjP (ORCPT <rfc822;[email protected]>);
Mon, 15 Nov 2021 20:39:15 -0500
X-IronPort-AV: E=McAfee;i="6200,9189,10169"; a="297023943"
X-IronPort-AV: E=Sophos;i="5.87,237,1631602800";
d="scan'208";a="297023943"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Nov 2021 17:36:18 -0800
X-IronPort-AV: E=Sophos;i="5.87,237,1631602800";
d="scan'208";a="454262515"
Received: from yhuang6-desk2.sh.intel.com ([10.239.159.101])
by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Nov 2021 17:35:57 -0800
From: Huang Ying <[email protected]>
To: Peter Zijlstra <[email protected]>, Mel Gorman <[email protected]>
Feng Tang <[email protected]>,
Huang Ying <[email protected]>,
Andrew Morton <[email protected]>,
Michal Hocko <[email protected]>,
Rik van Riel <[email protected]>,
Dave Hansen <[email protected]>,
Yang Shi <[email protected]>, Zi Yan <[email protected]>,
Wei Xu <[email protected]>, osalvador <[email protected]>,
Shakeel Butt <[email protected]>
Subject: [PATCH -V10 1/6] NUMA Balancing: add page promotion counter
Date: Tue, 16 Nov 2021 09:35:17 +0800
Message-Id: <[email protected]>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <[email protected]>
References: <[email protected]>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: [email protected]
In a system with multiple memory types, e.g. DRAM and PMEM, the CPU
and DRAM in one socket will be put in one NUMA node as before, while
the PMEM will be put in another NUMA node as described in the
description of the commit c221c0b0308f ("device-dax: "Hotplug"
persistent memory for use like normal RAM"). So, the NUMA balancing
mechanism will identify all PMEM accesses as remote access and try to
promote the PMEM pages to DRAM.
To distinguish the number of the inter-type promoted pages from that
of the inter-socket migrated pages. A new vmstat count is added. The
counter is per-node (count in the target node). So this can be used
to identify promotion imbalance among the NUMA nodes.
Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: osalvador <[email protected]>
Cc: Shakeel Butt <[email protected]>
---
include/linux/mmzone.h | 3 +++
include/linux/node.h | 5 +++++
mm/migrate.c | 13 ++++++++++---
mm/vmstat.c | 3 +++
4 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 58e744b78c2c..eda6d2f09d77 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -210,6 +210,9 @@ enum node_stat_item {
NR_PAGETABLE, /* used for pagetables */
#ifdef CONFIG_SWAP
NR_SWAPCACHE,
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+ PGPROMOTE_SUCCESS, /* promote successfully */
#endif
NR_VM_NODE_STAT_ITEMS
};
diff --git a/include/linux/node.h b/include/linux/node.h
index bb21fd631b16..81bbf1c0afd3 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -181,4 +181,9 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
#define to_node(device) container_of(device, struct node, dev)
+static inline bool node_is_toptier(int node)
+{
+ return node_state(node, N_CPU);
+}
+
#endif /* _LINUX_NODE_H_ */
diff --git a/mm/migrate.c b/mm/migrate.c
index cf25b00f03c8..b7c27abb0e5c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2141,6 +2141,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
pg_data_t *pgdat = NODE_DATA(node);
int isolated;
int nr_remaining;
+ int nr_succeeded;
LIST_HEAD(migratepages);
new_page_t *new;
bool compound;
@@ -2179,7 +2180,8 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
list_add(&page->lru, &migratepages);
nr_remaining = migrate_pages(&migratepages, *new, NULL, node,
- MIGRATE_ASYNC, MR_NUMA_MISPLACED, NULL);
+ MIGRATE_ASYNC, MR_NUMA_MISPLACED,
+ &nr_succeeded);
if (nr_remaining) {
if (!list_empty(&migratepages)) {
list_del(&page->lru);
@@ -2188,8 +2190,13 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
putback_lru_page(page);
}
isolated = 0;
- } else
- count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_pages);
+ }
+ if (nr_succeeded) {
+ count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+ if (!node_is_toptier(page_to_nid(page)) && node_is_toptier(node))
+ mod_node_page_state(NODE_DATA(node), PGPROMOTE_SUCCESS,
+ nr_succeeded);
+ }
BUG_ON(!list_empty(&migratepages));
return isolated;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d701c335628c..53a6e92b1efb 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1242,6 +1242,9 @@ const char * const vmstat_text[] = {
#ifdef CONFIG_SWAP
"nr_swapcached",
#endif
+#ifdef CONFIG_NUMA_BALANCING
+ "pgpromote_success",
+#endif
/* enum writeback_stat_item counters */
"nr_dirty_threshold",
--
2.30.2
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: <[email protected]>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
by smtp.lore.kernel.org (Postfix) with ESMTP id 91564C433EF
for <[email protected]>; Tue, 16 Nov 2021 01:44:56 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
by mail.kernel.org (Postfix) with ESMTP id 7552D61265
for <[email protected]>; Tue, 16 Nov 2021 01:44:56 +0000 (UTC)
Received: ([email protected]) by vger.kernel.org via listexpand
id S1356387AbhKPBru (ORCPT
<rfc822;[email protected]>);
Mon, 15 Nov 2021 20:47:50 -0500
Received: from mga07.intel.com ([134.134.136.100]:36057 "EHLO mga07.intel.com"
rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
id S242478AbhKPBlP (ORCPT <rfc822;[email protected]>);
Mon, 15 Nov 2021 20:41:15 -0500
X-IronPort-AV: E=McAfee;i="6200,9189,10169"; a="297023952"
X-IronPort-AV: E=Sophos;i="5.87,237,1631602800";
d="scan'208";a="297023952"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Nov 2021 17:36:19 -0800
X-IronPort-AV: E=Sophos;i="5.87,237,1631602800";
d="scan'208";a="454262527"
Received: from yhuang6-desk2.sh.intel.com ([10.239.159.101])
by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Nov 2021 17:36:11 -0800
From: Huang Ying <[email protected]>
To: Peter Zijlstra <[email protected]>, Mel Gorman <[email protected]>
Feng Tang <[email protected]>,
Huang Ying <[email protected]>,
Andrew Morton <[email protected]>,
Michal Hocko <[email protected]>,
Rik van Riel <[email protected]>,
Dave Hansen <[email protected]>,
Yang Shi <[email protected]>, Zi Yan <[email protected]>,
Wei Xu <[email protected]>, osalvador <[email protected]>,
Shakeel Butt <[email protected]>
Subject: [PATCH -V10 4/6] memory tiering: hot page selection with hint page fault latency
Date: Tue, 16 Nov 2021 09:35:20 +0800
Message-Id: <[email protected]>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <[email protected]>
References: <[email protected]>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: [email protected]
To optimize page placement in a memory tiering system with NUMA
balancing, the hot pages in the slow memory node need to be
identified. Essentially, the original NUMA balancing implementation
selects the mostly recently accessed (MRU) pages as the hot pages.
But this isn't a very good algorithm to identify the hot pages.
So, in this patch we implemented a better hot page selection
algorithm. Which is based on NUMA balancing page table scanning and
hint page fault as follows,
- When the page tables of the processes are scanned to change PTE/PMD
to be PROT_NONE, the current time is recorded in struct page as scan
time.
- When the page is accessed, hint page fault will occur. The scan
time is gotten from the struct page. And The hint page fault
latency is defined as
hint page fault time - scan time
The shorter the hint page fault latency of a page is, the higher the
probability of their access frequency to be higher. So the hint page
fault latency is a good estimation of the page hot/cold.
But it's hard to find some extra space in struct page to hold the scan
time. Fortunately, we can reuse some bits used by the original NUMA
balancing.
NUMA balancing uses some bits in struct page to store the page
accessing CPU and PID (referring to page_cpupid_xchg_last()). Which
is used by the multi-stage node selection algorithm to avoid to
migrate pages shared accessed by the NUMA nodes back and forth. But
for pages in the slow memory node, even if they are shared accessed by
multiple NUMA nodes, as long as the pages are hot, they need to be
promoted to the fast memory node. So the accessing CPU and PID
information are unnecessary for the slow memory pages. We can reuse
these bits in struct page to record the scan time for them. For the
fast memory pages, these bits are used as before.
The remaining problem is how to determine the hot threshold. It's not
easy to be done automatically. So we provide a sysctl knob:
kernel.numa_balancing_hot_threshold_ms. All pages with hint page
fault latency < the threshold will be considered hot. The system
administrator can determine the hot threshold via various information,
such as PMEM bandwidth limit, the average number of the pages pass the
hot threshold, etc. The default hot threshold is 1 second, which
works well in our performance test.
The downside of the patch is that the response time to the workload
hot spot changing may be much longer. For example,
- A previous cold memory area becomes hot
- The hint page fault will be triggered. But the hint page fault
latency isn't shorter than the hot threshold. So the pages will
not be promoted.
- When the memory area is scanned again, maybe after a scan period,
the hint page fault latency measured will be shorter than the hot
threshold and the pages will be promoted.
To mitigate this,
- If there are enough free space in the fast memory node, the hot
threshold will not be used, all pages will be promoted upon the hint
page fault for fast response.
- If fast response is more important for system performance, the
administrator can set a higher hot threshold.
Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: osalvador <[email protected]>
Cc: Shakeel Butt <[email protected]>
---
include/linux/mm.h | 29 ++++++++++++++++
include/linux/sched/sysctl.h | 1 +
kernel/sched/fair.c | 67 ++++++++++++++++++++++++++++++++++++
kernel/sysctl.c | 7 ++++
mm/huge_memory.c | 13 +++++--
mm/memory.c | 11 +++++-
mm/migrate.c | 12 +++++++
mm/mmzone.c | 17 +++++++++
mm/mprotect.c | 8 ++++-
9 files changed, 160 insertions(+), 5 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..a9ea778eafe0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1393,6 +1393,18 @@ static inline int folio_nid(const struct folio *folio)
}
#ifdef CONFIG_NUMA_BALANCING
+/* page access time bits needs to hold at least 4 seconds */
+#define PAGE_ACCESS_TIME_MIN_BITS 12
+#if LAST_CPUPID_SHIFT < PAGE_ACCESS_TIME_MIN_BITS
+#define PAGE_ACCESS_TIME_BUCKETS \
+ (PAGE_ACCESS_TIME_MIN_BITS - LAST_CPUPID_SHIFT)
+#else
+#define PAGE_ACCESS_TIME_BUCKETS 0
+#endif
+
+#define PAGE_ACCESS_TIME_MASK \
+ (LAST_CPUPID_MASK << PAGE_ACCESS_TIME_BUCKETS)
+
static inline int cpu_pid_to_cpupid(int cpu, int pid)
{
return ((cpu & LAST__CPU_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
@@ -1435,6 +1447,16 @@ static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
return xchg(&page->_last_cpupid, cpupid & LAST_CPUPID_MASK);
}
+static inline unsigned int xchg_page_access_time(struct page *page,
+ unsigned int time)
+{
+ unsigned int last_time;
+
+ last_time = xchg(&page->_last_cpupid,
+ (time >> PAGE_ACCESS_TIME_BUCKETS) & LAST_CPUPID_MASK);
+ return last_time << PAGE_ACCESS_TIME_BUCKETS;
+}
+
static inline int page_cpupid_last(struct page *page)
{
return page->_last_cpupid;
@@ -1450,6 +1472,7 @@ static inline int page_cpupid_last(struct page *page)
}
extern int page_cpupid_xchg_last(struct page *page, int cpupid);
+extern unsigned int xchg_page_access_time(struct page *page, unsigned int time);
static inline void page_cpupid_reset_last(struct page *page)
{
@@ -1462,6 +1485,12 @@ static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
return page_to_nid(page); /* XXX */
}
+static inline unsigned int xchg_page_access_time(struct page *page,
+ unsigned int time)
+{
+ return 0;
+}
+
static inline int page_cpupid_last(struct page *page)
{
return page_to_nid(page); /* XXX */
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bc54c1d75d6d..0ea43b146aee 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -41,6 +41,7 @@ enum sched_tunable_scaling {
#ifdef CONFIG_NUMA_BALANCING
extern int sysctl_numa_balancing_mode;
+extern unsigned int sysctl_numa_balancing_hot_threshold;
#else
#define sysctl_numa_balancing_mode 0
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e476f6d9435..2b78664a5ce2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1026,6 +1026,9 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
unsigned int sysctl_numa_balancing_scan_delay = 1000;
+/* The page with hint page fault latency < threshold in ms is considered hot */
+unsigned int sysctl_numa_balancing_hot_threshold = 1000;
+
struct numa_group {
refcount_t refcount;
@@ -1367,6 +1370,37 @@ static inline unsigned long group_weight(struct task_struct *p, int nid,
return 1000 * faults / total_faults;
}
+static bool pgdat_free_space_enough(struct pglist_data *pgdat)
+{
+ int z;
+ unsigned long enough_mark;
+
+ enough_mark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
+ pgdat->node_present_pages >> 4);
+ for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+ struct zone *zone = pgdat->node_zones + z;
+
+ if (!populated_zone(zone))
+ continue;
+
+ if (zone_watermark_ok(zone, 0,
+ high_wmark_pages(zone) + enough_mark,
+ ZONE_MOVABLE, 0))
+ return true;
+ }
+ return false;
+}
+
+static int numa_hint_fault_latency(struct page *page)
+{
+ unsigned int last_time, time;
+
+ time = jiffies_to_msecs(jiffies);
+ last_time = xchg_page_access_time(page, time);
+
+ return (time - last_time) & PAGE_ACCESS_TIME_MASK;
+}
+
bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
int src_nid, int dst_cpu)
{
@@ -1374,6 +1408,27 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
int dst_nid = cpu_to_node(dst_cpu);
int last_cpupid, this_cpupid;
+ /*
+ * The pages in slow memory node should be migrated according
+ * to hot/cold instead of accessing CPU node.
+ */
+ if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
+ !node_is_toptier(src_nid)) {
+ struct pglist_data *pgdat;
+ unsigned long latency, th;
+
+ pgdat = NODE_DATA(dst_nid);
+ if (pgdat_free_space_enough(pgdat))
+ return true;
+
+ th = sysctl_numa_balancing_hot_threshold;
+ latency = numa_hint_fault_latency(page);
+ if (latency > th)
+ return false;
+
+ return true;
+ }
+
this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
@@ -2592,6 +2647,11 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
if (!p->mm)
return;
+ /* Numa faults statistics are unnecessary for the slow memory node */
+ if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
+ !node_is_toptier(mem_node))
+ return;
+
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
int size = sizeof(*p->numa_faults) *
@@ -2611,6 +2671,13 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
*/
if (unlikely(last_cpupid == (-1 & LAST_CPUPID_MASK))) {
priv = 1;
+ } else if (unlikely(!cpu_online(cpupid_to_cpu(last_cpupid)))) {
+ /*
+ * In memory tiering mode, cpupid of slow memory page is
+ * used to record page access time, so its value may be
+ * invalid during numa balancing mode transition.
+ */
+ return;
} else {
priv = cpupid_match_pid(p, last_cpupid);
if (!priv && !(flags & TNF_NO_GROUP))
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a1be94ea80ba..40432524642a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1811,6 +1811,13 @@ static struct ctl_table kern_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = &three,
},
+ {
+ .procname = "numa_balancing_hot_threshold_ms",
+ .data = &sysctl_numa_balancing_hot_threshold,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
#endif /* CONFIG_NUMA_BALANCING */
{
.procname = "sched_rt_period_us",
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cab8048eb779..1999ef14582e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1430,7 +1430,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
struct page *page;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
int page_nid = NUMA_NO_NODE;
- int target_nid, last_cpupid = -1;
+ int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
bool migrated = false;
bool was_writable = pmd_savedwrite(oldpmd);
int flags = 0;
@@ -1451,7 +1451,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
flags |= TNF_NO_GROUP;
page_nid = page_to_nid(page);
- last_cpupid = page_cpupid_last(page);
+ if (node_is_toptier(page_nid))
+ last_cpupid = page_cpupid_last(page);
target_nid = numa_migrate_prep(page, vma, haddr, page_nid,
&flags);
@@ -1769,6 +1770,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
if (prot_numa) {
struct page *page;
+ bool toptier;
/*
* Avoid trapping faults against the zero page. The read-only
* data is likely to be read-cached on the local CPU and
@@ -1781,13 +1783,18 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
goto unlock;
page = pmd_page(*pmd);
+ toptier = node_is_toptier(page_to_nid(page));
/*
* Skip scanning top tier node if normal numa
* balancing is disabled
*/
if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
- node_is_toptier(page_to_nid(page)))
+ toptier)
goto unlock;
+
+ if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
+ !toptier)
+ xchg_page_access_time(page, jiffies_to_msecs(jiffies));
}
/*
* In case prot_numa, we are under mmap_read_lock(mm). It's critical
diff --git a/mm/memory.c b/mm/memory.c
index 8f1de811a1dc..432c144c30b8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -73,6 +73,7 @@
#include <linux/perf_event.h>
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
+#include <linux/sched/sysctl.h>
#include <trace/events/kmem.h>
@@ -4371,8 +4372,16 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
if (page_mapcount(page) > 1 && (vma->vm_flags & VM_SHARED))
flags |= TNF_SHARED;
- last_cpupid = page_cpupid_last(page);
page_nid = page_to_nid(page);
+ /*
+ * In memory tiering mode, cpupid of slow memory page is used
+ * to record page access time. So use default value.
+ */
+ if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
+ !node_is_toptier(page_nid))
+ last_cpupid = (-1 & LAST_CPUPID_MASK);
+ else
+ last_cpupid = page_cpupid_last(page);
target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid,
&flags);
if (target_nid == NUMA_NO_NODE) {
diff --git a/mm/migrate.c b/mm/migrate.c
index 286c84c014dd..03006bbd4042 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -572,6 +572,18 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
* future migrations of this same page.
*/
cpupid = page_cpupid_xchg_last(&folio->page, -1);
+ /*
+ * If migrate between slow and fast memory node, reset cpupid,
+ * because that is used to record page access time in slow
+ * memory node
+ */
+ if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) {
+ bool f_toptier = node_is_toptier(page_to_nid(&folio->page));
+ bool t_toptier = node_is_toptier(page_to_nid(&newfolio->page));
+
+ if (f_toptier != t_toptier)
+ cpupid = -1;
+ }
page_cpupid_xchg_last(&newfolio->page, cpupid);
folio_migrate_ksm(newfolio, folio);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index eb89d6e018e2..27f9075632ee 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -99,4 +99,21 @@ int page_cpupid_xchg_last(struct page *page, int cpupid)
return last_cpupid;
}
+
+unsigned int xchg_page_access_time(struct page *page, unsigned int time)
+{
+ unsigned long old_flags, flags;
+ unsigned int last_time;
+
+ time >>= PAGE_ACCESS_TIME_BUCKETS;
+ do {
+ old_flags = flags = page->flags;
+ last_time = (flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
+
+ flags &= ~(LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
+ flags |= (time & LAST_CPUPID_MASK) << LAST_CPUPID_PGSHIFT;
+ } while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
+
+ return last_time << PAGE_ACCESS_TIME_BUCKETS;
+}
#endif
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ddc24ca52b12..407559241b58 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -85,6 +85,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (prot_numa) {
struct page *page;
int nid;
+ bool toptier;
/* Avoid TLB flush if possible */
if (pte_protnone(oldpte))
@@ -114,14 +115,19 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
nid = page_to_nid(page);
if (target_node == nid)
continue;
+ toptier = node_is_toptier(nid);
/*
* Skip scanning top tier node if normal numa
* balancing is disabled
*/
if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
- node_is_toptier(nid))
+ toptier)
continue;
+ if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
+ !toptier)
+ xchg_page_access_time(page,
+ jiffies_to_msecs(jiffies));
}
oldpte = ptep_modify_prot_start(vma, addr, pte);
--
2.30.2
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: <[email protected]>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
by smtp.lore.kernel.org (Postfix) with ESMTP id 3EFC1C433EF
for <[email protected]>; Tue, 16 Nov 2021 01:45:05 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
by mail.kernel.org (Postfix) with ESMTP id 250BE61265
for <[email protected]>; Tue, 16 Nov 2021 01:45:05 +0000 (UTC)
Received: ([email protected]) by vger.kernel.org via listexpand
id S1357312AbhKPBr7 (ORCPT
<rfc822;[email protected]>);
Mon, 15 Nov 2021 20:47:59 -0500
Received: from mga07.intel.com ([134.134.136.100]:36068 "EHLO mga07.intel.com"
rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
id S1357178AbhKPBlT (ORCPT <rfc822;[email protected]>);
Mon, 15 Nov 2021 20:41:19 -0500
X-IronPort-AV: E=McAfee;i="6200,9189,10169"; a="297023989"
X-IronPort-AV: E=Sophos;i="5.87,237,1631602800";
d="scan'208";a="297023989"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Nov 2021 17:36:24 -0800
X-IronPort-AV: E=Sophos;i="5.87,237,1631602800";
d="scan'208";a="454262575"
Received: from yhuang6-desk2.sh.intel.com ([10.239.159.101])
by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Nov 2021 17:36:20 -0800
From: Huang Ying <[email protected]>
To: Peter Zijlstra <[email protected]>, Mel Gorman <[email protected]>
Feng Tang <[email protected]>,
Huang Ying <[email protected]>,
Andrew Morton <[email protected]>,
Michal Hocko <[email protected]>,
Rik van Riel <[email protected]>,
Dave Hansen <[email protected]>,
Yang Shi <[email protected]>, Zi Yan <[email protected]>,
Wei Xu <[email protected]>, osalvador <[email protected]>,
Shakeel Butt <[email protected]>
Subject: [PATCH -V10 6/6] memory tiering: adjust hot threshold automatically
Date: Tue, 16 Nov 2021 09:35:22 +0800
Message-Id: <[email protected]>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <[email protected]>
References: <[email protected]>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: [email protected]
It isn't easy for the administrator to determine the hot threshold.
So in this patch, a method to adjust the hot threshold automatically
is implemented. The basic idea is to control the number of the
candidate promotion pages to match the promotion rate limit. If the
hint page fault latency of a page is less than the hot threshold, we
will try to promote the page, and the page is called the candidate
promotion page.
If the number of the candidate promotion pages in the statistics
interval is much more than the promotion rate limit, the hot threshold
will be decreased to reduce the number of the candidate promotion
pages. Otherwise, the hot threshold will be increased to increase the
number of the candidate promotion pages.
To make the above method works, in each statistics interval, the total
number of the pages to check (on which the hint page faults occur) and
the hot/cold distribution need to be stable. Because the page tables
are scanned linearly in NUMA balancing, but the hot/cold distribution
isn't uniform along the address, the statistics interval should be
larger than the NUMA balancing scan period. So in the patch, the max
scan period is used as statistics interval and it works well in our
tests.
The sysctl knob kernel.numa_balancing_hot_threshold_ms becomes the
initial value and max value of the hot threshold.
Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: osalvador <[email protected]>
Cc: Shakeel Butt <[email protected]>
---
include/linux/mmzone.h | 3 ++
include/linux/sched/sysctl.h | 2 ++
kernel/sched/core.c | 15 +++++++++
kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++++++---
kernel/sysctl.c | 3 +-
5 files changed, 81 insertions(+), 6 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f3b044993bc5..4ac0ae1cf15d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -906,6 +906,9 @@ typedef struct pglist_data {
#ifdef CONFIG_NUMA_BALANCING
unsigned long numa_ts;
unsigned long numa_nr_candidate;
+ unsigned long numa_threshold_ts;
+ unsigned long numa_threshold_nr_candidate;
+ unsigned long numa_threshold;
#endif
/* Fields commonly accessed by the page reclaim scanner */
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 7d937adaac0f..ff2c43e8ebac 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -84,6 +84,8 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos);
int sysctl_numa_balancing(struct ctl_table *table, int write, void *buffer,
size_t *lenp, loff_t *ppos);
+int sysctl_numa_balancing_threshold(struct ctl_table *table, int write, void *buffer,
+ size_t *lenp, loff_t *ppos);
int sysctl_schedstats(struct ctl_table *table, int write, void *buffer,
size_t *lenp, loff_t *ppos);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5dcabc98432f..1cca2c8a3423 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4285,6 +4285,18 @@ void set_numabalancing_state(bool enabled)
}
#ifdef CONFIG_PROC_SYSCTL
+static void reset_memory_tiering(void)
+{
+ struct pglist_data *pgdat;
+
+ for_each_online_pgdat(pgdat) {
+ pgdat->numa_threshold = 0;
+ pgdat->numa_threshold_nr_candidate =
+ node_page_state(pgdat, PGPROMOTE_CANDIDATE);
+ pgdat->numa_threshold_ts = jiffies;
+ }
+}
+
int sysctl_numa_balancing(struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos)
{
@@ -4301,6 +4313,9 @@ int sysctl_numa_balancing(struct ctl_table *table, int write,
if (err < 0)
return err;
if (write) {
+ if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
+ (state & NUMA_BALANCING_MEMORY_TIERING))
+ reset_memory_tiering();
sysctl_numa_balancing_mode = state;
__set_numabalancing_state(state);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7912669a2065..daa978d2d70d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1423,6 +1423,54 @@ static bool numa_migration_check_rate_limit(struct pglist_data *pgdat,
return true;
}
+int sysctl_numa_balancing_threshold(struct ctl_table *table, int write, void *buffer,
+ size_t *lenp, loff_t *ppos)
+{
+ int err;
+ struct pglist_data *pgdat;
+
+ if (write && !capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+ if (err < 0 || !write)
+ return err;
+
+ for_each_online_pgdat(pgdat)
+ pgdat->numa_threshold = 0;
+
+ return err;
+}
+
+#define NUMA_MIGRATION_ADJUST_STEPS 16
+
+static void numa_migration_adjust_threshold(struct pglist_data *pgdat,
+ unsigned long rate_limit,
+ unsigned long ref_th)
+{
+ unsigned long now = jiffies, last_th_ts, th_period;
+ unsigned long unit_th, th;
+ unsigned long nr_cand, ref_cand, diff_cand;
+
+ th_period = msecs_to_jiffies(sysctl_numa_balancing_scan_period_max);
+ last_th_ts = pgdat->numa_threshold_ts;
+ if (now > last_th_ts + th_period &&
+ cmpxchg(&pgdat->numa_threshold_ts, last_th_ts, now) == last_th_ts) {
+ ref_cand = rate_limit *
+ sysctl_numa_balancing_scan_period_max / 1000;
+ nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
+ diff_cand = nr_cand - pgdat->numa_threshold_nr_candidate;
+ unit_th = ref_th / NUMA_MIGRATION_ADJUST_STEPS;
+ th = pgdat->numa_threshold ? : ref_th;
+ if (diff_cand > ref_cand * 11 / 10)
+ th = max(th - unit_th, unit_th);
+ else if (diff_cand < ref_cand * 9 / 10)
+ th = min(th + unit_th, ref_th);
+ pgdat->numa_threshold_nr_candidate = nr_cand;
+ pgdat->numa_threshold = th;
+ }
+}
+
bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
int src_nid, int dst_cpu)
{
@@ -1437,19 +1485,25 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
!node_is_toptier(src_nid)) {
struct pglist_data *pgdat;
- unsigned long rate_limit, latency, th;
+ unsigned long rate_limit, latency, th, def_th;
pgdat = NODE_DATA(dst_nid);
- if (pgdat_free_space_enough(pgdat))
+ if (pgdat_free_space_enough(pgdat)) {
+ /* workload changed, reset hot threshold */
+ pgdat->numa_threshold = 0;
return true;
+ }
- th = sysctl_numa_balancing_hot_threshold;
+ def_th = sysctl_numa_balancing_hot_threshold;
+ rate_limit =
+ sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT);
+ numa_migration_adjust_threshold(pgdat, rate_limit, def_th);
+
+ th = pgdat->numa_threshold ? : def_th;
latency = numa_hint_fault_latency(page);
if (latency > th)
return false;
- rate_limit =
- sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT);
return numa_migration_check_rate_limit(pgdat, rate_limit,
thp_nr_pages(page));
}
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7be964eb0d13..38892422ffac 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1816,7 +1816,8 @@ static struct ctl_table kern_table[] = {
.data = &sysctl_numa_balancing_hot_threshold,
.maxlen = sizeof(unsigned int),
.mode = 0644,
- .proc_handler = proc_dointvec,
+ .proc_handler = sysctl_numa_balancing_threshold,
+ .extra1 = SYSCTL_ZERO,
},
{
.procname = "numa_balancing_rate_limit_mbps",
--
2.30.2
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: <[email protected]>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
by smtp.lore.kernel.org (Postfix) with ESMTP id 1F10DC433FE
for <[email protected]>; Tue, 16 Nov 2021 01:45:14 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
by mail.kernel.org (Postfix) with ESMTP id EDAE16120F
for <[email protected]>; Tue, 16 Nov 2021 01:45:13 +0000 (UTC)
Received: ([email protected]) by vger.kernel.org via listexpand
id S1357948AbhKPBsH (ORCPT
<rfc822;[email protected]>);
Mon, 15 Nov 2021 20:48:07 -0500
Received: from mga07.intel.com ([134.134.136.100]:36061 "EHLO mga07.intel.com"
rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
id S1357267AbhKPBlT (ORCPT <rfc822;[email protected]>);
Mon, 15 Nov 2021 20:41:19 -0500
X-IronPort-AV: E=McAfee;i="6200,9189,10169"; a="297023962"
X-IronPort-AV: E=Sophos;i="5.87,237,1631602800";
d="scan'208";a="297023962"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Nov 2021 17:36:20 -0800
X-IronPort-AV: E=Sophos;i="5.87,237,1631602800";
d="scan'208";a="454262532"
Received: from yhuang6-desk2.sh.intel.com ([10.239.159.101])
by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Nov 2021 17:36:15 -0800
From: Huang Ying <[email protected]>
To: Peter Zijlstra <[email protected]>, Mel Gorman <[email protected]>
Feng Tang <[email protected]>,
Huang Ying <[email protected]>,
Andrew Morton <[email protected]>,
Michal Hocko <[email protected]>,
Rik van Riel <[email protected]>,
Dave Hansen <[email protected]>,
Yang Shi <[email protected]>, Zi Yan <[email protected]>,
Wei Xu <[email protected]>, osalvador <[email protected]>,
Shakeel Butt <[email protected]>
Subject: [PATCH -V10 5/6] memory tiering: rate limit NUMA migration throughput
Date: Tue, 16 Nov 2021 09:35:21 +0800
Message-Id: <[email protected]>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <[email protected]>
References: <[email protected]>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: [email protected]
In NUMA balancing memory tiering mode, the hot slow memory pages could
be promoted to the fast memory node via NUMA balancing. But this