[LU-14525] parallel-scale-nfsv4 hangs before any tests can run Created: 16/Mar/21 Updated: 06/Jul/21
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
parallel-scale-nfsv4 hangs before any tests can run. We’ve only seen this issue a handful of times, such as for Lustre 2.12.4.61 (SLES12.4 clients); https://testing.whamcloud.com/test_sets/12a1d345-1f01-45de-9d30-21cbd88ea61f. The test session hangs when we try to unmount the clients. Looking at the suite_log, we see:

Stopping clients: onyx-46vm1,onyx-46vm2 /mnt/lustre (opts:)
CMD: onyx-46vm1,onyx-46vm2 running=\$(grep -c /mnt/lustre' ' /proc/mounts);
if [ \$running -ne 0 ] ; then
echo Stopping client \$(hostname) /mnt/lustre opts:;
lsof /mnt/lustre || need_kill=no;
if [ x != x -a x\$need_kill != xno ]; then
pids=\$(lsof -t /mnt/lustre | sort -u);
if [ -n \"\$pids\" ]; then
kill -9 \$pids;
fi
fi;
while umount /mnt/lustre 2>&1 | grep -q busy; do
echo /mnt/lustre is still busy, wait one second && sleep 1;
done;
fi
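
Note that nothing in this cleanup bounds the wait: lsof and umount can both block indefinitely on a hard NFS mount whose server has gone away, and the busy-retry loop itself has no exit condition other than success. A minimal sketch of a bounded variant; timeout(1), the 30-second budget, and MAX_TRIES are hypothetical hardening for illustration, not what the test framework actually does:

# Hypothetical bounded cleanup loop; illustration only, not the real test-framework code.
MAX_TRIES=30
tries=0
while [ "$tries" -lt "$MAX_TRIES" ]; do
    # timeout(1) keeps an umount that blocks on a dead NFS server from
    # wedging the whole cleanup; 30s is an arbitrary budget.
    if timeout 30 umount /mnt/lustre; then
        break                       # unmounted cleanly
    fi
    echo "/mnt/lustre is still busy, wait one second"
    sleep 1
    tries=$((tries + 1))
done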
Looking at the OSS (vm3) console log, we see client evictions:

[95316.867814] Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: parallel-scale-nfsv4 ============----- Wed May 20 19:03:21 PDT 2020
[95317.080668] Lustre: DEBUG MARKER: -----============= acceptance-small: parallel-scale-nfsv4 ============----- Wed May 20 19:03:21 PDT 2020
[95317.585046] LustreError: 8592:0:(ldlm_lockd.c:681:ldlm_handle_ast_error()) ### client (nid 10.2.8.200@tcp) returned error from blocking AST (req@ffff8dab96ebb600 x1667201457741568 status -107 rc -107), evict it ns: filter-lustre-OST0005_UUID lock: ffff8dabe0c89680/0xb23e2fc63d741a2a lrc: 4/0,0 mode: PW/PW res: [0x80a5b:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400030020 nid: 10.2.8.200@tcp remote: 0xcc7d8b5f5da656a4 expref: 250 pid: 8745 timeout: 95422 lvb_type: 0
[95317.585053] LustreError: 138-a: lustre-OST0005: A client on nid 10.2.8.200@tcp was evicted due to a lock blocking callback time out: rc -107
[95317.585084] LustreError: 24558:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 0s: evicting client at 10.2.8.200@tcp ns: filter-lustre-OST0005_UUID lock: ffff8dabbc77d680/0xb23e2fc63d741aaf lrc: 3/0,0 mode: PW/PW res: [0x80a5c:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 1048576->1179647) flags: 0x60000400030020 nid: 10.2.8.200@tcp remote: 0xcc7d8b5f5da6615c expref: 251 pid: 8623 timeout: 0 lvb_type: 0
[95317.595471] LustreError: 8592:0:(ldlm_lockd.c:681:ldlm_handle_ast_error()) Skipped 6 previous similar messages
[95318.000138] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-46vm1: executing check_config_client \/mnt\/lustre
[95318.006487] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-46vm2: executing check_config_client \/mnt\/lustre
[95318.224807] Lustre: DEBUG MARKER: onyx-46vm1: executing check_config_client /mnt/lustre
[95318.229146] Lustre: DEBUG MARKER: onyx-46vm2: executing check_config_client /mnt/lustre
[95318.452944] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre-ost1' ' /proc/mounts);

On the client (vm2) console log, we see NFS server issues:

[95486.336436] echo Stopping client $(hostname) /mnt/lustre opts:;
[95486.336436] lsof /mnt/lustre || need_kill=no;
[95486.336436] if [ x != x -a x$need_kill != xno ]; then
[95486.336436] pids=$(lsof -t /mnt/lustre | sort -u);
[95486.336436] if
[95626.965822] nfs: server onyx-46vm4 not responding, still trying
[95666.899579] nfs: server onyx-46vm4 not responding, still trying
[98788.703830] sysrq: SysRq : Changing Loglevel
[98788.704447] sysrq: Loglevel set to 8
[98788.705933] sysrq: SysRq : Show State
[98788.706414] task PC stack pid father
[98788.707056] systemd S 0 1 0 0x00000000
[98788.707654] Call Trace:
[98788.708001] ? __schedule+0x292/0x880
[98788.708437] schedule+0x32/0x80
[98788.708805] schedule_hrtimeout_range_clock+0x106/0x110
[98788.709400] ? ep_scan_ready_list.isra.14+0x1af/0x1d0
[98788.709952] ep_poll+0x2b6/0x360
[98788.710378] ? wake_up_q+0x70/0x70
[98788.710958] SyS_epoll_wait+0xb2/0xe0
[98788.710968] do_syscall_64+0x74/0x160
[98788.710979] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[98788.710987] RIP: 0033:0x7f8cb7669523
[98788.712823] RSP: 002b:00007ffc5ae0d4a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e8
[98788.712830] RAX: ffffffffffffffda RBX: 0000561cf19dfde0 RCX: 00007f8cb7669523
[98788.712830] RDX: 0000000000000027 RSI: 00007ffc5ae0d4b0 RDI: 0000000000000004
[98788.712831] RBP: 00007ffc5ae0d790 R08: 00000000000ddb68 R09: 0000561cf09489c0
[98788.712833] R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000001
[98788.716708] R13: 00007ffc5ae0d4b0 R14: ffffffffffffffff R15: 0000000000000000

Another hang:
| Comments |
| Comment by Oleg Drokin [ 17/Mar/21 ] |
The eviction on the OSS is normal: if you check further up the log, you see the local client is force-unmounted, so when the server sends in an AST, it gets rejected and the client is evicted even though the client no longer exists. What I also see in the logs is that the NFS server is stopped and never restarted, so under these circumstances it is probably normal that the clients cannot contact the NFS server. I see they try to do umount /mnt/lustre, which is the NFS mountpoint? If there's a race where the NFS server goes down before the client, that is probably the expected result. Perhaps umount -f on the NFS client would make it more resilient?
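
A minimal sketch of that umount -f idea on the NFS client; only the forced unmount itself comes from the comment above, while the lazy-unmount fallback and the error redirect are added here for illustration:

# Hypothetical hardening of the NFS-client cleanup; not existing test-framework code.
if ! umount /mnt/lustre 2>/dev/null; then
    # A forced unmount aborts in-flight RPCs instead of retrying forever
    # against an NFS server that is already gone.
    umount -f /mnt/lustre || umount -l /mnt/lustre
fi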