[LU-9601] recovery-mds-scale test_failover_mds: test_failover_mds returned 1 Created: 05/Jun/17  Updated: 24/Sep/20

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.10.1, Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.4, Lustre 2.10.5
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James Casper Assignee: Zhenyu Xu
Resolution: Unresolved Votes: 0
Labels: None
Environment:

trevis, failover
clients: SLES12, master branch, v2.9.58, b3591
servers: EL7, ldiskfs, master branch, v2.9.58, b3591


Issue Links:
Related
is related to LU-9977 client ran out of memory when diffing... Resolved
is related to LU-10319 recovery-random-scale, test_fail_clie... Resolved
is related to LU-10221 recovery-mds-scale test_failover_mds:... Open
is related to LU-12067 recovery-mds-scale test failover_mds ... Open
is related to LU-10687 sanity-benchmark test iozone hangs wi... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

https://testing.hpdd.intel.com/test_sessions/e6b87235-1ff0-4e96-a53f-ca46ffe5ed7e

From suite_log:

CMD: trevis-38vm1,trevis-38vm5,trevis-38vm6,trevis-38vm7,trevis-38vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/mpi/gcc/openmpi/bin:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh check_logdir /shared_test/autotest2/2017-05-24/051508-70323187606440 
trevis-38vm1: trevis-38vm1: executing check_logdir /shared_test/autotest2/2017-05-24/051508-70323187606440
trevis-38vm7: trevis-38vm7.trevis.hpdd.intel.com: executing check_logdir /shared_test/autotest2/2017-05-24/051508-70323187606440
trevis-38vm8: trevis-38vm8.trevis.hpdd.intel.com: executing check_logdir /shared_test/autotest2/2017-05-24/051508-70323187606440
pdsh@trevis-38vm1: trevis-38vm6: mcmd: connect failed: No route to host
pdsh@trevis-38vm1: trevis-38vm5: mcmd: connect failed: No route to host
CMD: trevis-38vm1 uname -n
CMD: trevis-38vm5 uname -n
pdsh@trevis-38vm1: trevis-38vm5: mcmd: connect failed: No route to host

 SKIP: recovery-double-scale  SHARED_DIRECTORY should be specified with a shared directory which is accessable on all of the nodes
Stopping clients: trevis-38vm1,trevis-38vm5,trevis-38vm6 /mnt/lustre (opts:)
CMD: trevis-38vm1,trevis-38vm5,trevis-38vm6 running=\$(grep -c /mnt/lustre' ' /proc/mounts);

and

pdsh@trevis-38vm1: trevis-38vm5: mcmd: connect failed: No route to host
pdsh@trevis-38vm1: trevis-38vm6: mcmd: connect failed: No route to host
 auster : @@@@@@ FAIL: clients environments are insane! 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4952:error()
  = /usr/lib64/lustre/tests/test-framework.sh:1736:sanity_mount_check_clients()
  = /usr/lib64/lustre/tests/test-framework.sh:1741:sanity_mount_check()
  = /usr/lib64/lustre/tests/test-framework.sh:3796:setupall()
  = auster:114:reset_lustre()
  = auster:217:run_suite()
  = auster:234:run_suite_logged()
  = auster:298:run_suites()
  = auster:334:main()


Comments
Comment by James Casper [ 05/Jun/17 ]

subsequent test sets:

recovery-small, replay-ost-single, replay-dual, replay-vbr, replay-single:
clients environments are insane!

mmp:
test_* returned 1 & Network not available!

Comment by Sarah Liu [ 08/Jun/17 ]

Duplicate of LU-9600; all caused by DCO-7216.

Comment by James Casper [ 11/Aug/17 ]

This is not a pdsh issue.

subtest 1 (test_failover_mds): 2 MDSs, 2 OSTs, 3 clients
Out of memory: Kill process . . . (onyx-39vm1 console)

subtest 2 (test_failover_ost): 2 MDSs, 2 OSTs, 2 clients
(onyx-39vm1 is gone)

We are only seeing this with SLES failover configs, and DCO-7324 was also opened for this.

Comment by James Nunez (Inactive) [ 05/Sep/17 ]

Looking at the client1 test_log at https://testing.hpdd.intel.com/test_sets/4ef0bae8-8860-11e7-b4b0-5254006e85c2 , we can see that there is a network issue on client2: we get the mcmd "no route to host" error, and then an error saying dump_kernel is an invalid parameter:

16:23:08 (1503530588) waiting for trevis-66vm5 network 5 secs ...
Network not available!
2017-08-23 16:23:11 Terminating clients loads ...
Duration:               86400
Server failover period: 1200 seconds
Exited after:           66 seconds
Number of failovers before exit:
mds1: 1 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
ost5: 0 times
ost6: 0 times
ost7: 0 times
Status: FAIL: rc=1
CMD: trevis-66vm5,trevis-66vm6 test -f /tmp/client-load.pid &&
        { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
pdsh@trevis-66vm1: trevis-66vm5: mcmd: connect failed: No route to host
/usr/lib64/lustre/tests/recovery-mds-scale.sh: line 103: 13774 Killed                  do_node $client "PATH=$PATH MOUNT=$MOUNT ERRORS_OK=$ERRORS_OK 			BREAK_ON_ERROR=$BREAK_ON_ERROR 			END_RUN_FILE=$END_RUN_FILE 			LOAD_PID_FILE=$LOAD_PID_FILE 			TESTLOG_PREFIX=$TESTLOG_PREFIX 			TESTNAME=$TESTNAME 			DBENCH_LIB=$DBENCH_LIB 			DBENCH_SRC=$DBENCH_SRC 			CLIENT_COUNT=$((CLIENTCOUNT - 1)) 			LFS=$LFS 			LCTL=$LCTL 			FSNAME=$FSNAME 			run_${load}.sh"
/usr/lib64/lustre/tests/recovery-mds-scale.sh: line 103: 13914 Killed                  do_node $client "PATH=$PATH MOUNT=$MOUNT ERRORS_OK=$ERRORS_OK 			BREAK_ON_ERROR=$BREAK_ON_ERROR 			END_RUN_FILE=$END_RUN_FILE 			LOAD_PID_FILE=$LOAD_PID_FILE 			TESTLOG_PREFIX=$TESTLOG_PREFIX 			TESTNAME=$TESTNAME 			DBENCH_LIB=$DBENCH_LIB 			DBENCH_SRC=$DBENCH_SRC 			CLIENT_COUNT=$((CLIENTCOUNT - 1)) 			LFS=$LFS 			LCTL=$LCTL 			FSNAME=$FSNAME 			run_${load}.sh"
Dumping lctl log to /test_logs/2017-08-23/lustre-b2_10-el7-x86_64-vs-lustre-b2_10-sles12sp2-x86_64--failover--2_1_1__5__-69876128994560-223221/recovery-mds-scale.test_failover_mds.*.1503530603.log
CMD: trevis-66vm3,trevis-66vm4,trevis-66vm7,trevis-66vm8 /usr/sbin/lctl dk > /test_logs/2017-08-23/lustre-b2_10-el7-x86_64-vs-lustre-b2_10-sles12sp2-x86_64--failover--2_1_1__5__-69876128994560-223221/recovery-mds-scale.test_failover_mds.debug_log.\$(hostname -s).1503530603.log;
         dmesg > /test_logs/2017-08-23/lustre-b2_10-el7-x86_64-vs-lustre-b2_10-sles12sp2-x86_64--failover--2_1_1__5__-69876128994560-223221/recovery-mds-scale.test_failover_mds.dmesg.\$(hostname -s).1503530603.log
trevis-66vm7: invalid parameter 'dump_kernel'
trevis-66vm7: open(dump_kernel) failed: No such file or directory

If we look at client2 (trevis-66vm5), we get an OOM error:

23:07:00:[  908.305066] Lustre: Evicted from MGS (at MGC10.9.6.214@tcp_1) after server handle changed from 0xd3eb788f77c2284b to 0x5174f2bcf88f87f5
23:07:00:[  908.310726] Lustre: MGC10.9.6.214@tcp: Connection restored to MGC10.9.6.214@tcp_1 (at 10.9.6.210@tcp)
23:07:00:[  908.369629] LustreError: 8944:0:(client.c:2982:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff8800d4e20040 x1576564775978512/t4294967305(4294967305) o101->lustre-MDT0000-mdc-ffff8800db2f9800@10.9.6.210@tcp:12/10 lens 952/544 e 0 to 0 dl 1503529448 ref 2 fl Interpret:RP/4/0 rc 301/301
23:07:00:[ 1077.515462] dd invoked oom-killer: gfp_mask=0x24200ca(GFP_HIGHUSER_MOVABLE), nodemask=0, order=0, oom_score_adj=0
23:07:00:[ 1077.515475] dd cpuset=/ mems_allowed=0
23:07:00:[ 1077.515480] CPU: 0 PID: 11285 Comm: dd Tainted: G           OE   N  4.4.59-92.24-default #1
23:07:00:[ 1077.515484] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
23:07:00:[ 1077.515491]  0000000000000000 ffffffff8130f0d0 ffff880119f43bf0 0000000000000000
23:07:00:[ 1077.515493]  ffffffff811f739e 0000000000000000 0000000100000000 000000fae0e8fdf6
23:07:00:[ 1077.515494]  0000000001320122 ffff88011fc15c00 ffff8800d81dd880 ffff8800d916c240
23:07:00:[ 1077.515495] Call Trace:
23:07:00:[ 1077.515564]  [<ffffffff81019a99>] dump_trace+0x59/0x310
23:07:00:[ 1077.515568]  [<ffffffff81019e3a>] show_stack_log_lvl+0xea/0x170
23:07:00:[ 1077.515571]  [<ffffffff8101abc1>] show_stack+0x21/0x40
23:07:00:[ 1077.515583]  [<ffffffff8130f0d0>] dump_stack+0x5c/0x7c
23:07:00:[ 1077.515609]  [<ffffffff811f739e>] dump_header+0x82/0x215
23:07:00:[ 1077.515633]  [<ffffffff811887a4>] oom_kill_process+0x214/0x3f0
23:07:00:[ 1077.515643]  [<ffffffff81188e0d>] out_of_memory+0x43d/0x4a0
23:07:00:[ 1077.515650]  [<ffffffff8118d556>] __alloc_pages_nodemask+0xaf6/0xb20
23:07:00:[ 1077.515666]  [<ffffffff811d5804>] alloc_pages_vma+0xa4/0x220
23:07:00:[ 1077.515679]  [<ffffffff811c5f90>] __read_swap_cache_async+0xf0/0x150
23:07:00:[ 1077.515685]  [<ffffffff811c6004>] read_swap_cache_async+0x14/0x30
23:07:00:[ 1077.515687]  [<ffffffff811c611d>] swapin_readahead+0xfd/0x190
23:07:00:[ 1077.515697]  [<ffffffff811b3bde>] handle_pte_fault+0x10fe/0x14b0
23:07:00:[ 1077.515703]  [<ffffffff811b4e7e>] handle_mm_fault+0x29e/0x550
23:07:00:[ 1077.515714]  [<ffffffff8106469a>] __do_page_fault+0x18a/0x410
23:07:00:[ 1077.515721]  [<ffffffff8106494b>] do_page_fault+0x2b/0x70
23:07:00:[ 1077.515741]  [<ffffffff815e63e8>] page_fault+0x28/0x30
23:07:00:[ 1077.516007] DWARF2 unwinder stuck at page_fault+0x28/0x30
23:07:00:[ 1077.516007] 
23:07:00:[ 1077.516007] Leftover inexact backtrace:
23:07:00:[ 1077.516007] 
23:07:00:[ 1077.520402] Mem-Info:
23:07:00:[ 1077.520411] active_anon:0 inactive_anon:18 isolated_anon:0
23:07:00:[ 1077.520411]  active_file:258394 inactive_file:672357 isolated_file:64
23:07:00:[ 1077.520411]  unevictable:20 dirty:34 writeback:3183 unstable:0
23:07:00:[ 1077.520411]  slab_reclaimable:2810 slab_unreclaimable:11643
23:07:00:[ 1077.520411]  mapped:6475 shmem:0 pagetables:852 bounce:0
23:07:00:[ 1077.520411]  free:21107 free_pcp:0 free_cma:0
23:07:00:[ 1077.520420] Node 0 DMA free:15372kB min:276kB low:344kB high:412kB active_anon:0kB inactive_anon:0kB active_file:144kB inactive_file:264kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15892kB mlocked:0kB dirty:0kB writeback:0kB mapped:72kB shmem:0kB slab_reclaimable:8kB slab_unreclaimable:32kB kernel_stack:16kB pagetables:36kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3216 all_unreclaimable? yes
23:07:00:[ 1077.520421] lowmem_reserve[]: 0 3335 3782 3782 3782
23:07:00:[ 1077.520425] Node 0 DMA32 free:61184kB min:59364kB low:74204kB high:89044kB active_anon:0kB inactive_anon:72kB active_file:909496kB inactive_file:2385320kB unevictable:56kB isolated(anon):0kB isolated(file):256kB present:3653620kB managed:3442064kB mlocked:56kB dirty:136kB writeback:12616kB mapped:22140kB shmem:0kB slab_reclaimable:9568kB slab_unreclaimable:37996kB kernel_stack:1808kB pagetables:2828kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:20051824 all_unreclaimable? yes
23:07:00:[ 1077.520426] lowmem_reserve[]: 0 0 446 446 446
23:07:00:[ 1077.520430] Node 0 Normal free:7872kB min:7940kB low:9924kB high:11908kB active_anon:0kB inactive_anon:0kB active_file:123936kB inactive_file:303844kB unevictable:24kB isolated(anon):0kB isolated(file):0kB present:524288kB managed:457064kB mlocked:24kB dirty:0kB writeback:116kB mapped:3688kB shmem:0kB slab_reclaimable:1664kB slab_unreclaimable:8544kB kernel_stack:672kB pagetables:544kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2658164 all_unreclaimable? yes
23:07:00:[ 1077.520431] lowmem_reserve[]: 0 0 0 0 0
23:07:00:[ 1077.520446] Node 0 DMA: 5*4kB (UE) 3*8kB (ME) 0*16kB 3*32kB (UME) 2*64kB (ME) 2*128kB (UE) 2*256kB (UE) 2*512kB (ME) 3*1024kB (UME) 1*2048kB (E) 2*4096kB (M) = 15372kB
23:07:00:[ 1077.520450] Node 0 DMA32: 1115*4kB (UME) 780*8kB (UME) 682*16kB (UME) 419*32kB (UME) 216*64kB (UME) 79*128kB (UE) 7*256kB (UME) 1*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 61260kB
23:07:00:[ 1077.520454] Node 0 Normal: 222*4kB (UME) 191*8kB (UME) 129*16kB (UME) 62*32kB (UME) 22*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7872kB
23:07:00:[ 1077.520462] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
23:07:00:[ 1077.520470] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
23:07:00:[ 1077.520471] 16822 total pagecache pages
23:07:00:[ 1077.520474] 13 pages in swap cache
23:07:00:[ 1077.520475] Swap cache stats: add 10940, delete 10927, find 313/551
23:07:00:[ 1077.520475] Free swap  = 14296096kB
23:07:00:[ 1077.520475] Total swap = 14338044kB
23:07:00:[ 1077.520476] 1048473 pages RAM
23:07:00:[ 1077.520476] 0 pages HighMem/MovableOnly
23:07:00:[ 1077.520477] 69718 pages reserved
23:07:00:[ 1077.520477] 0 pages hwpoisoned
23:07:00:[ 1077.520478] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
23:07:00:[ 1077.520484] [  437]     0   437    10929      674      25       3     1115             0 systemd-journal
23:07:00:[ 1077.520485] [  506]     0   506     3006      363      10       3      800             0 haveged
23:07:00:[ 1077.520487] [  508]     0   508     9209      719      20       3      165         -1000 systemd-udevd
23:07:00:[ 1077.520488] [  510]   495   510    13123      929      30       3      121             0 rpcbind
23:07:00:[ 1077.520489] [  632]   499   632    10912      877      27       3      150          -900 dbus-daemon
23:07:00:[ 1077.520490] [  655]     0   655     4812      594      14       3       59             0 irqbalance
23:07:00:[ 1077.520492] [  657]     0   657     7448     1031      20       3      269             0 wickedd-dhcp4
23:07:00:[ 1077.520493] [  659]     0   659     7448     1036      21       3      262             0 wickedd-dhcp6
23:07:00:[ 1077.520494] [  660]     0   660     7447     1020      21       3      261             0 wickedd-auto4
23:07:00:[ 1077.520495] [  713]     0   713    87480      965      38       3      266             0 rsyslogd
23:07:00:[ 1077.520496] [  717]     0   717    22580     1020      46       3      221             0 sssd
23:07:00:[ 1077.520497] [  722]     0   722    29731     2007      61       3      300             0 sssd_be
23:07:00:[ 1077.520498] [  750]     0   750    19684     1504      43       3      188             0 sssd_nss
23:07:00:[ 1077.520499] [  751]     0   751    20740     1528      43       3      179             0 sssd_pam
23:07:00:[ 1077.520500] [  752]     0   752    19108     1302      41       3      180             0 sssd_ssh
23:07:00:[ 1077.520502] [  896]     0   896     7480     1040      19       3      299             0 wickedd
23:07:00:[ 1077.520503] [  909]     0   909     1663      439       9       3       29             0 agetty
23:07:00:[ 1077.520504] [  910]     0   910     1663      412       9       3       30             0 agetty
23:07:00:[ 1077.520506] [  916]     0   916     7454     1029      19       3      273             0 wickedd-nanny
23:07:00:[ 1077.520507] [ 1579]     0  1579     2139      425      10       3       40             0 xinetd
23:07:00:[ 1077.520508] [ 1598]     0  1598   164140     1845      65       4     1181             0 automount
23:07:00:[ 1077.520509] [ 1606]     0  1606    11801     1281      28       3      154         -1000 sshd
23:07:00:[ 1077.520510] [ 1608]    74  1608     5863      965      16       3      167             0 ntpd
23:07:00:[ 1077.520511] [ 1610]    74  1610     6916      567      17       3      153             0 ntpd
23:07:00:[ 1077.520512] [ 1635]   493  1635    55351      607      20       3      229             0 munged
23:07:00:[ 1077.520514] [ 1691]     0  1691     5510      601      16       3       67             0 systemd-logind
23:07:00:[ 1077.520515] [ 1793]     0  1793     5218      600      14       3       83             0 master
23:07:00:[ 1077.520516] [ 1798]    51  1798     6256      601      17       3       84             0 pickup
23:07:00:[ 1077.520517] [ 1799]    51  1799     6352      884      18       3      124             0 qmgr
23:07:00:[ 1077.520518] [ 1823]     0  1823     5195      540      16       3      150             0 cron
23:07:00:[ 1077.520521] [11253]     0 11253    14910      657      34       4      172             0 in.mrshd
23:07:00:[ 1077.520522] [11254]     0 11254     2893      569      12       3       77             0 bash
23:07:00:[ 1077.520523] [11259]     0 11259     2893      393      11       3       78             0 bash
23:07:00:[ 1077.520524] [11260]     0 11260     3000      575      11       3      189             0 run_dd.sh
23:07:00:[ 1077.520527] [11285]     0 11285     1061      177       8       3       30             0 dd
23:07:00:[ 1077.520529] Out of memory: Kill process 1598 (automount) score 0 or sacrifice child
23:07:00:[ 1077.520545] Killed process 1598 (automount) total-vm:656560kB, anon-rss:0kB, file-rss:7380kB, shmem-rss:0kB
23:07:00:[ 1077.541270] ntpd invoked oom-killer: gfp_mask=0x24200ca(GFP_HIGHUSER_MOVABLE), nodemask=0, order=0, oom_score_adj=0
23:07:00

Notes:
1. Charlie ran the failover test group on VMs with 4GB of memory, double what is normally used in our autotesting, and those tests also hit OOM; see ATM-606.
2. We only see the OOM error when running the failover test group with SLES12 SP2 clients.

We see this error with master tags 2.10.51 and 2.10.52:
https://testing.hpdd.intel.com/test_sets/ae9561cc-8f3d-11e7-b5c2-5254006e85c2

b2_10 build 5 and 18:
https://testing.hpdd.intel.com/test_sets/2f62f6aa-8d1f-11e7-b50a-5254006e85c2
https://testing.hpdd.intel.com/test_sets/6bd3aa28-8c8c-11e7-b4e0-5254006e85c2
https://testing.hpdd.intel.com/test_sets/1d91689a-8a85-11e7-b45f-5254006e85c2

Comment by James Casper [ 26/Sep/17 ]

2.10.1:
https://testing.hpdd.intel.com/test_sessions/3035e082-1d27-4979-93c7-9b7048c900c1

Comment by James Casper [ 20/Oct/17 ]

2.10.54:
https://testing.hpdd.intel.com/test_sessions/73765244-fb30-4759-a7b6-2f4aaf88cca7

Comment by James Casper [ 20/Oct/17 ]

Looks like dumps on an SLES VM are saved in /var/crash. EL7 saves them in /scratch/dumps.

Copying the SLES dumps to an NFS share does not appear to be set up:

el7:

# mount | grep export
onyx-4.onyx.hpdd.intel.com:/export/scratch on /scratch type nfs4

sles:

# mount | grep export
onyx-3:/export/home/autotest on /home/autotest type nfs4
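
A quick check run on each node type could confirm which of the dump directories mentioned above actually exist locally. This is just an illustrative sketch using the two paths from this ticket, not part of the test framework:

```shell
# Sketch: report which of the crash-dump directories seen in this ticket
# exist on the local node (/var/crash on SLES, /scratch/dumps on EL7).
check_dump_dirs() {
    for d in /var/crash /scratch/dumps; do
        if [ -d "$d" ]; then
            echo "$d: present"
        else
            echo "$d: missing"
        fi
    done
}
check_dump_dirs
```

Running this under pdsh across the cluster would show at a glance which nodes can save dumps at all.
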
Comment by Andreas Dilger [ 06/Dec/17 ]

The system has about 4GB of RAM. There is not a lot of memory in slab objects (only about 40MB). Most of the memory is tied up in inactive_file (about 3GB) and active_file (about 0.5GB), but none of it is reclaimable.
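
As a sanity check on that accounting, the page counts in the Mem-Info dump above can be converted to MB (4 KiB pages). A small awk sketch, with the counts copied from the oom-killer report in this ticket:

```shell
# Sketch: convert "name:pages" counts from an oom-killer Mem-Info dump
# into MB, assuming 4 KiB pages.
meminfo_mb() {
    awk '{
        for (i = 1; i <= NF; i++)
            if (split($i, kv, ":") == 2 && kv[2] ~ /^[0-9]+$/)
                printf "%s: %.0f MB\n", kv[1], kv[2] * 4 / 1024
    }'
}

# Counts copied from the Mem-Info summary in this ticket:
echo "active_file:258394 inactive_file:672357 slab_unreclaimable:11643 free:21107" | meminfo_mb
```

This confirms that file-backed pages dominate the roughly 4GB of RAM while slab usage is small, consistent with the analysis above.
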

It makes sense that a bunch of pages are tied up in active_file for dirty pages and RPC bulk replay, but the pages in inactive_file should be reclaimable. I suspect that there is some bad interaction between how CLIO is tracking pages and the VM page state in the newer SLES kernel that makes it appear to the VM that none of the pages can be reclaimed (e.g. extra page references from DLM locks, OSC extents, etc).

We do have slab callbacks for DLM locks that would release pages, but I'm wondering whether dd is using a single large lock on the whole file, such that the lock cannot be cancelled while it still has dirty pages. This might also relate to LU-9977.

Comment by Brad Hoagland (Inactive) [ 08/Dec/17 ]

Hi YangSheng,

Can you take a look at this one?

Thanks,

Brad

Comment by Yang Sheng [ 12/Dec/17 ]

Looks like SLES12 SP3 has slightly different alloc_page logic from upstream. It introduces two proc parameters:

/proc/sys/vm/pagecache_limit_mb

This tunable sets a limit, in megabytes, on the unmapped pages in the pagecache.
If non-zero, it should not be set below 4 (4MB), or the system might behave erratically. In real-life, much larger limits (a few percent of system RAM / a hundred MBs) will be useful.

Examples:
echo 512 >/proc/sys/vm/pagecache_limit_mb

This sets a baseline limit of 0.5 GiB for the page cache (not the buffer cache!).
As we only consider pagecache pages that are unmapped, currently mapped pages (files that are mmap'ed such as e.g. binaries and libraries as well as SysV shared memory) are not limited by this.
NOTE: The real limit depends on the amount of free memory. Every existing free page allows the page cache to grow 8x the amount of free memory above the set baseline. As soon as the free memory is needed, we free up page cache.

/proc/sys/vm/pagecache_limit_ignore_dirty

But these should have little effect when left at their default values.
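
On a SUSE kernel the current values can be inspected directly. A minimal sketch; the tunable files exist only on SUSE kernels, so it degrades gracefully elsewhere:

```shell
# Sketch: print the SUSE-specific pagecache limit tunables if this kernel
# has them; on non-SUSE kernels the files simply don't exist.
show_pagecache_tunables() {
    for t in pagecache_limit_mb pagecache_limit_ignore_dirty; do
        f="/proc/sys/vm/$t"
        if [ -r "$f" ]; then
            echo "$t = $(cat "$f")"
        else
            echo "$t: not present (not a SUSE kernel?)"
        fi
    done
}
show_pagecache_tunables
```

If both report "not present" on the failing clients, these tunables can be ruled out as a factor.
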

Thanks,
YangSheng

Comment by Andreas Dilger [ 22/Feb/18 ]

Bobijam, it looks like the client is having problems releasing pages from the page cache. I suspect something is going wrong with the CLIO page reference/dirty state on the new kernel that is preventing the pages from being released.

Are you able to reproduce something like this in a VM (e.g. dd very large single file) for debugging?
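
A minimal sketch of such a reproducer, assuming a /mnt/lustre mount point and a hypothetical repro_large_write helper; both the path and the size are assumptions, and the size should exceed client RAM to force reclaim:

```shell
# Sketch: stream a large file through the page cache, then dump the
# file-backed page counters. Mount point and size are assumed values,
# not taken from this ticket.
repro_large_write() {
    mnt=${1:-/mnt/lustre}
    size_mb=${2:-8192}    # choose larger than client RAM
    dd if=/dev/zero of="$mnt/oom_repro" bs=1M count="$size_mb" 2>/dev/null
    # file-backed page counters, in 4 KiB pages
    grep -E '^nr_(active|inactive)_file ' /proc/vmstat
}
```

Sampling /proc/vmstat repeatedly while the dd runs would show whether nr_inactive_file keeps growing without ever being reclaimed, which is what the OOM reports here suggest.
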

Comment by Sarah Liu [ 02/May/18 ]

+1 on master with SLES12 SP3 server/client failover; the client hit "page allocation failure":

https://testing.hpdd.intel.com/test_sets/50068624-4679-11e8-960d-52540065bddc

Comment by Sarah Liu [ 17/May/18 ]

+2 on b2_10 https://testing.hpdd.intel.com/test_sets/b7026366-5880-11e8-abc3-52540065bddc

https://testing.hpdd.intel.com/test_sets/d93466da-5878-11e8-b9d3-52540065bddc

Comment by Jian Yu [ 27/Sep/18 ]

Hi Andreas,

Are you able to reproduce something like this in a VM (e.g. dd very large single file) for debugging?

I provisioned 3 SLES12 SP3 VMs (1 client + 1 MGS/MDS + 1 OSS) on the trevis cluster with the latest master build #3795, and ran dd to create a single 30G file. The command passed:

trevis-59vm1:/usr/lib64/lustre/tests # lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID         5.6G       45.7M        5.0G   1% /mnt/lustre[MDT:0]
lustre-OST0000_UUID        39.0G       49.0M       36.9G   0% /mnt/lustre[OST:0]
lustre-OST0001_UUID        39.0G       49.0M       36.9G   0% /mnt/lustre[OST:1]

filesystem_summary:        78.0G       98.1M       73.9G   0% /mnt/lustre

trevis-59vm1:/usr/lib64/lustre/tests # dd if=/dev/urandom of=/mnt/lustre/large_file_10G bs=1M count=30720
30720+0 records in
30720+0 records out
32212254720 bytes (32 GB, 30 GiB) copied, 2086.88 s, 15.4 MB/s
Generated at Sat Feb 10 02:27:38 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.