[LU-10622] recovery-double-scale test_pairwise_fail: client OOM Created: 07/Feb/18  Updated: 03/Nov/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.5
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Sarah Liu Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

recovery-double-scale test_pairwise_fail - Timeout occurred after 484 mins, last suite running was recovery-double-scale, restarting cluster to continue tests
^^^^^^^^^^^^^ DO NOT REMOVE LINE ABOVE ^^^^^^^^^^^^^

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run:
https://testing.hpdd.intel.com/test_sets/4932564a-fded-11e7-bd00-52540065bddc

test_pairwise_fail failed with the following error:

Timeout occurred after 484 mins, last suite running was recovery-double-scale, restarting cluster to continue tests




 Comments   
Comment by Sarah Liu [ 07/Feb/18 ]

In the client-2 console log linked under recovery-mds-scale, found an OOM:
https://testing.hpdd.intel.com/test_logs/4a7832a4-fded-11e7-bd00-52540065bddc/show_text

[  890.254565] Lustre: DEBUG MARKER: == recovery-double-scale test pairwise_fail: pairwise combination of clients, MDS, and OST failures == 18:42:46 (1516387366)
[  890.406126] Lustre: DEBUG MARKER: PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/comp
[  890.451121] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: dd on onyx-41vm3
[  890.610069] Lustre: DEBUG MARKER: Started client load: dd on onyx-41vm3
[  890.815339] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: tar on onyx-41vm4
[  890.972956] Lustre: DEBUG MARKER: Started client load: tar on onyx-41vm4
[  893.361427] Lustre: DEBUG MARKER: cat /tmp/client-load.pid
[ 1204.465816] Lustre: DEBUG MARKER: /usr/sbin/lctl mark 
[ 1204.465816] ==== START === test 1: failover MDS, then OST ==========
[ 1210.830668] Lustre: DEBUG MARKER: Fri Jan 19 18:48:05 2018
[ 1210.830668] 
[ 1213.997192] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK
[ 1216.173546] Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK
[ 1222.239838] Lustre: DEBUG MARKER: rc=0;
[ 1222.239838] 			val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
[ 1222.239838] 			if [[ $? -eq 0 && $val -ne 0 ]]; then
[ 1222.239838] 				echo $(hostname -s): $val;
[ 1222.239838] 				rc=$val;
[ 1222.239838] 			fi;
[ 1222.239838] 			exit $rc
[ 1222.744739] Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
[ 1223.902089] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Done checking client loads. Failing type1=MDS item1=mds1 ... 
[ 1224.084788] Lustre: DEBUG MARKER: Done checking client loads. Failing type1=MDS item1=mds1 ...
[ 1270.724016] Lustre: 7292:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1516387738/real 1516387738]  req@ffff88001eef7600 x1590047326596256/t0(0) o35->lustre-MDT0000-mdc-ffff88007ba7e800@10.2.8.138@tcp:23/10 lens 512/936 e 0 to 1 dl 1516387745 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[ 1270.726937] Lustre: lustre-MDT0000-mdc-ffff88007ba7e800: Connection to lustre-MDT0000 (at 10.2.8.138@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 1287.728010] Lustre: 3108:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1516387718/real 1516387737]  req@ffff88001eef4f00 x1590047326531120/t0(0) o400->MGC10.2.8.138@tcp@10.2.8.138@tcp:26/25 lens 224/224 e 0 to 1 dl 1516387762 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1287.730909] LustreError: 166-1: MGC10.2.8.138@tcp: Connection to MGS (at 10.2.8.138@tcp) was lost; in progress operations using this service will fail
[ 1292.732016] Lustre: 3109:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1516387723/real 1516387737]  req@ffff88001eef7000 x1590047326560128/t0(0) o400->lustre-MDT0000-mdc-ffff88007ba7e800@10.2.8.138@tcp:12/10 lens 224/224 e 0 to 1 dl 1516387767 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1292.735020] Lustre: 3109:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[ 1297.733277] Lustre: Evicted from MGS (at MGC10.2.8.138@tcp_1) after server handle changed from 0x574864cbfa148717 to 0x25e1f99c947a4e92
[ 1297.734602] Lustre: MGC10.2.8.138@tcp: Connection restored to MGC10.2.8.138@tcp_1 (at 10.2.8.139@tcp)
[ 1297.735509] Lustre: 3109:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1516387728/real 1516387737]  req@ffff88000f130600 x1590047326574624/t0(0) o400->lustre-MDT0000-mdc-ffff88007ba7e800@10.2.8.139@tcp:12/10 lens 224/224 e 0 to 1 dl 1516387772 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1302.738007] Lustre: 3109:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1516387733/real 1516387737]  req@ffff88000f133000 x1590047326593552/t0(0) o400->lustre-MDT0000-mdc-ffff88007ba7e800@10.2.8.139@tcp:12/10 lens 224/224 e 0 to 1 dl 1516387777 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1313.679569] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-41vm8.onyx.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[ 1313.732006] Lustre: 3109:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1516387743/real 1516387743]  req@ffff88000f131200 x1590047326596400/t0(0) o400->lustre-MDT0000-mdc-ffff88007ba7e800@10.2.8.139@tcp:12/10 lens 224/224 e 0 to 1 dl 1516387787 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1313.734755] Lustre: 3109:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[ 1316.920391] Lustre: lustre-MDT0000-mdc-ffff88007ba7e800: Connection restored to 10.2.8.139@tcp (at 10.2.8.139@tcp)
[ 1316.926808] Lustre: DEBUG MARKER: onyx-41vm8.onyx.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[ 1318.510417] Lustre: DEBUG MARKER: /usr/sbin/lctl mark                             Failing type2=OST item2=ost1 ... 
[ 1318.940320] Lustre: DEBUG MARKER: Failing type2=OST item2=ost1 ...
[ 1375.152025] LustreError: 3103:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88005fce2600
[ 1375.153134] LustreError: 3103:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88005fce2600
[ 1375.154243] LustreError: 3100:0:(events.c:199:client_bulk_callback()) event type 1, status -111, desc ffff88005fce2e00
[ 1375.155257] LustreError: 3103:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88005fce2e00
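
If this reproduces, it would help to sample client memory while the failover pairs are running, so we can see what is growing before the panic fires. A minimal sketch of the kind of sampling that could be added to a run (the client names are the two load clients from this session; the interval and log paths are arbitrary, this is not something autotest currently does):

# Poll /proc/meminfo and Lustre's memused counters on each load client every 30s
for client in onyx-41vm3 onyx-41vm4; do
    ssh $client 'while true; do
        echo "== $(date +%s) $(hostname -s) =="
        grep -E "MemFree|SwapFree|Slab" /proc/meminfo
        /usr/sbin/lctl get_param memused memused_max 2>/dev/null
        sleep 30
    done' > /tmp/$client-meminfo.log 2>&1 &
done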
Comment by Sarah Liu [ 07/Feb/18 ]
[ 2029.017611] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[ 2029.017611] 
[ 2029.018521] CPU: 1 PID: 476 Comm: sssd Tainted: G           OE  ------------   3.10.0-693.11.6.el7.x86_64 #1
[ 2029.019426] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 2029.019970] Call Trace:
[ 2029.020219]  [<ffffffff816a5ea1>] dump_stack+0x19/0x1b
[ 2029.020714]  [<ffffffff8169fd64>] panic+0xe8/0x20d
[ 2029.021171]  [<ffffffff81188265>] check_panic_on_oom+0x55/0x60
[ 2029.021728]  [<ffffffff8118865b>] out_of_memory+0x23b/0x4f0
[ 2029.022256]  [<ffffffff816a1d9a>] __alloc_pages_slowpath+0x5d6/0x724
[ 2029.022861]  [<ffffffff8118eaa5>] __alloc_pages_nodemask+0x405/0x420
[ 2029.023476]  [<ffffffff811d6075>] alloc_pages_vma+0xb5/0x200
[ 2029.024010]  [<ffffffff811c63fd>] read_swap_cache_async+0xed/0x160
[ 2029.024597]  [<ffffffff811c6518>] swapin_readahead+0xa8/0x110
[ 2029.025138]  [<ffffffff811b421b>] handle_mm_fault+0xadb/0xfa0
[ 2029.025685]  [<ffffffff810d025e>] ? dequeue_task_fair+0x41e/0x660
[ 2029.026254]  [<ffffffff810c93f5>] ? sched_clock_cpu+0x85/0xc0
[ 2029.026805]  [<ffffffff816b37e4>] __do_page_fault+0x154/0x450
[ 2029.027344]  [<ffffffff816b3bc6>] trace_do_page_fault+0x56/0x150
[ 2029.027915]  [<ffffffff816b325a>] do_async_page_fault+0x1a/0xd0
[ 2029.028479]  [<ffffffff816af928>] async_page_fault+0x28/0x30
[ 2029.029011]  [<ffffffff81219034>] ? do_sys_poll+0x184/0x580
[ 2029.029544]  [<ffffffff81218ff5>] ? do_sys_poll+0x145/0x580
[ 2029.030070]  [<ffffffff816a9700>] ? bit_wait+0x50/0x50
[ 2029.030565]  [<ffffffff816ab6d9>] ? schedule+0x29/0x70
[ 2029.031053]  [<ffffffff81062efe>] ? kvm_clock_get_cycles+0x1e/0x20
[ 2029.031639]  [<ffffffff81217d10>] ? poll_select_copy_remaining+0x150/0x150
[ 2029.032281]  [<ffffffff81217d10>] ? poll_select_copy_remaining+0x150/0x150
[ 2029.032935]  [<ffffffff81217d10>] ? poll_select_copy_remaining+0x150/0x150
[ 2029.033584]  [<ffffffff81217d10>] ? poll_select_copy_remaining+0x150/0x150
[ 2029.034226]  [<ffffffff81217d10>] ? poll_select_copy_remaining+0x150/0x150
[ 2029.034882]  [<ffffffff81217d10>] ? poll_select_copy_remaining+0x150/0x150
[ 2029.035548]  [<ffffffff81217d10>] ? poll_select_copy_remaining+0x150/0x150
[ 2029.036192]  [<ffffffff81217d10>] ? poll_select_copy_remaining+0x150/0x150
[ 2029.036842]  [<ffffffff81217d10>] ? poll_select_copy_remaining+0x150/0x150
[ 2029.037495]  [<ffffffff81219534>] SyS_poll+0x74/0x110
[ 2029.037973]  [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[ 2029.038543]  [<ffffffff816b889d>] ? system_call_after_swapgs+0xca/0x214
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
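
For the record, the client reboots straight into the cgroup init messages above because the test nodes run with vm.panic_on_oom enabled, so the first OOM panics the node instead of letting the OOM killer reap the dd/tar load (the "system-wide" wording in the panic message corresponds to panic_on_oom=1). Checking the setting on a node is just (illustrative only):

# 0 = OOM killer picks a victim, 1 = panic on system-wide OOM, 2 = panic on any OOM
sysctl vm.panic_on_oom
cat /proc/sys/vm/panic_on_oom

The OOM path normally dumps Mem-Info and the task list to the console before panicking, so the full console log linked above is the first place to look for what was actually consuming the memory.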
Comment by Sarah Liu [ 09/Aug/18 ]

Also seen on b2_10 2.10.5 RC1:
https://testing.whamcloud.com/test_sets/7c3dd016-9882-11e8-b0aa-52540065bddc
