Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.14.0
Labels: None
Severity: 3
Description
There are a large number of tests that hang with complaints that the NFS server is not responding while a utility, in this case ‘tee’, blocks. It looks like this issue started in early July 2020. Here are monthly counts, with dates and links to test results, of tests hanging in ‘tee’ (a sketch for scanning console logs for this signature follows the list):
0 times in May 2020
0 times in June 2020
9 times in July 2020 - all Ubuntu 18.04 client testing on Full test group - seen on master
2020-07-09 2.13.54.148 - sanity-lfsck test 40a - https://testing.whamcloud.com/test_sets/e8ef8293-bc76-4049-a29b-7e16d288c317
2020-07-25 2.13.54.178 - ost-pools 13 - https://testing.whamcloud.com/test_sets/77891514-bbe9-4ca9-b80e-fd00e05ab693
12 times in August 2020 - seen on master and b2_12 for Ubuntu 18.04 and el7.8 client testing
2020-08-07 2.13.55.16 - sanity-lfsck test 2d - https://testing.whamcloud.com/test_sets/252b6116-6daa-4e8f-9d34-8dc97ed38a14
42 times in September 2020
2020-09-27 2.13.56 - sanity-lfsck test 40a - https://testing.whamcloud.com/test_sets/d107b9c2-c310-45bc-95a1-02a5403b777d
2020-09-26 2.13.56.1 - sanityn test 34 - https://testing.whamcloud.com/test_sets/e94842bb-b741-4523-92e2-03cfc00e96f6
2020-09-26 2.13.56.1 - replay-single test 90 - https://testing.whamcloud.com/test_sets/3304f735-2911-4d6d-9497-06f8c4df674f
2020-09-26 2.13.56.1 - sanity-flr test 38 - https://testing.whamcloud.com/test_sets/ac72de75-8f1b-4195-b6c4-c91f88066cc5
2020-09-26 2.13.56.1 - sanity-flr test 37 - https://testing.whamcloud.com/test_sets/c44c4353-5dd6-481a-a65e-2bbdaa3ac4c6
2020-09-03 2.13.55.67 - conf-sanity test 123F - https://testing.whamcloud.com/test_sets/56511e72-0ec0-4efb-81d0-accb94b3f43c
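A quick way to find more instances is to scan the console logs from these sessions for the same signature. A minimal sketch, assuming the logs have been downloaded locally as console.*.log (the file names are my assumption, not part of the results above):
# Runs whose console log shows the NFS server going unresponsive
grep -l 'nfs: server .* not responding' console.*.log
# The hung-task report for 'tee', with a few lines of stack context
grep -A 4 'INFO: task tee:' console.*.log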
Let’s look at the logs for the first example of this hang that I can find: sanity-lfsck hanging after skipping test 40a, at https://testing.whamcloud.com/test_sets/e8ef8293-bc76-4049-a29b-7e16d288c317. In the client1 (vm1) console log, we can see the hang:
[21561.340768] Lustre: DEBUG MARKER: == sanity-lfsck test 40a: LFSCK correctly fixes lmm_oi in composite layout =========================== 06:22:39 (1594275759)
[21561.585054] Lustre: DEBUG MARKER: /usr/sbin/lctl mark SKIP: sanity-lfsck test_40a needs \>= 2 MDTs
[21561.795054] Lustre: DEBUG MARKER: SKIP: sanity-lfsck test_40a needs >= 2 MDTs
[21562.007611] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
[21562.007611] if [ $running -ne 0 ] ; then
[21562.007611] echo Stopping client $(hostname) /mnt/lustre opts:;
[21562.007611] lsof /mnt/lustre || need_kill=no;
[21562.007611] if [ x != x -a x$need_kill != xno ]; then
[21562.007611] pids=$(lsof -t /mnt/lustre | sort -u);
[21562.007611] if
[21562.520943] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre2' ' /proc/mounts);
[21562.520943] if [ $running -ne 0 ] ; then
[21562.520943] echo Stopping client $(hostname) /mnt/lustre2 opts:;
[21562.520943] lsof /mnt/lustre2 || need_kill=no;
[21562.520943] if [ x != x -a x$need_kill != xno ]; then
[21562.520943] pids=$(lsof -t /mnt/lustre2 | sort -u);
[21562.520943]
[21591.487956] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true
[21591.733320] Key type lgssc unregistered
[21592.806013] LNet: Removed LNI 10.2.9.15@tcp
[21592.955051] Key type .llcrypt unregistered
[21592.955960] Key type ._llcrypt unregistered
[21788.032741] nfs: server onyx-1.onyx.whamcloud.com not responding, still trying
[21809.536149] nfs: server onyx-4.onyx.whamcloud.com not responding, timed out
[21824.703708] nfs: server onyx-1.onyx.whamcloud.com not responding, timed out
[21824.705110] nfs: server onyx-1.onyx.whamcloud.com not responding, still trying
[21870.974479] INFO: task tee:439 blocked for more than 120 seconds.
[21870.975847] Tainted: G OE 4.15.0-72-generic #81-Ubuntu
[21870.977119] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[21870.978617] tee D 0 439 1321 0x00000000
[21870.979697] Call Trace:
[21870.980283] __schedule+0x24e/0x880
[21870.981029] ? update_curr+0x7a/0x1d0
[21870.981781] ? bit_wait+0x60/0x60
[21870.982484] schedule+0x2c/0x80
[21870.983152] io_schedule+0x16/0x40
[21870.983864] bit_wait_io+0x11/0x60
[21870.984590] __wait_on_bit+0x4c/0x90
[21870.985324] out_of_line_wait_on_bit+0x90/0xb0
[21870.986207] ? bit_waitqueue+0x40/0x40
[21870.987009] nfs_wait_on_request+0x46/0x50 [nfs]
[21870.987947] nfs_lock_and_join_requests+0x121/0x510 [nfs]
[21870.988997] ? __switch_to_asm+0x41/0x70
[21870.989795] ? radix_tree_lookup_slot+0x22/0x50
[21870.990706] nfs_updatepage+0x168/0x990 [nfs]
[21870.991590] nfs_write_end+0x19a/0x4f0 [nfs]
[21870.992464] generic_perform_write+0xf6/0x1b0
[21870.993328] ? _cond_resched+0x19/0x40
[21870.994087] ? _cond_resched+0x19/0x40
[21870.994869] nfs_file_write+0xfd/0x240 [nfs]
[21870.995749] new_sync_write+0xe7/0x140
[21870.996523] __vfs_write+0x29/0x40
[21870.997230] vfs_write+0xb1/0x1a0
[21870.997924] SyS_write+0x5c/0xe0
[21870.998628] do_syscall_64+0x73/0x130
[21870.999390] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[21871.000372] RIP: 0033:0x7f7f4d500154
[21871.001106] RSP: 002b:00007ffcf552c158 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[21871.002526] RAX: ffffffffffffffda RBX: 0000000000000041 RCX: 00007f7f4d500154
[21871.003875] RDX: 0000000000000041 RSI: 00007ffcf552c240 RDI: 0000000000000003
[21871.005198] RBP: 00007ffcf552c240 R08: 0000000000000041 R09: 00007f7f4d9ff540
[21871.006533] R10: 00000000000001b6 R11: 0000000000000246 R12: 000055bf8cceb460
[21871.007870] R13: 0000000000000041 R14: 00007f7f4d7d7760 R15: 0000000000000041
[21884.862103] nfs: server onyx-1.onyx.whamcloud.com not responding, still trying
[21905.917535] nfs: server onyx-1.onyx.whamcloud.com not responding, still trying
[21990.139281] nfs: server onyx-4.onyx.whamcloud.com not responding, timed out
…
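The ‘tee’ process is stuck in uninterruptible sleep inside the NFS write path (nfs_updatepage), so it cannot be killed until the server responds. On a live node this state can be confirmed directly; a minimal sketch, assuming root on the client and sysrq enabled (not taken from this test run):
# List tasks in uninterruptible sleep (D state), such as the 'tee' above
ps -eo pid,stat,wchan:30,cmd | awk 'NR==1 || $2 ~ /^D/'
# Dump kernel stacks of all blocked tasks into the console log
echo w > /proc/sysrq-trigger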
We can see that the MDS (vm4) has trouble during umount:
[21473.702000] Lustre: DEBUG MARKER: /usr/sbin/lctl mark SKIP: sanity-lfsck test_40a needs \>= 2 MDTs
[21473.913989] Lustre: DEBUG MARKER: SKIP: sanity-lfsck test_40a needs >= 2 MDTs
[21475.090937] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
[21475.408647] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds1
[21476.496138] Lustre: lustre-MDT0000: Not available for connect from 10.2.9.17@tcp (stopping)
[21476.497684] Lustre: Skipped 1 previous similar message
[21481.503425] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.9.17@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[21481.506392] LustreError: Skipped 20 previous similar messages
[21483.732374] Lustre: 15103:0:(client.c:2261:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1594275764/real 1594275764] req@ffff8ce7e5553180 x1671696961134016/t0(0) o251->MGC10.2.9.18@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1594275770 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'umount.0'
[21483.737598] Lustre: 15103:0:(client.c:2261:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
[21483.811617] Lustre: server umount lustre-MDT0000 complete
[21483.812679] Lustre: Skipped 4 previous similar messages
[21484.110697] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
[21484.110697] lctl dl | grep ' ST ' || true
[21484.424086] Lustre: DEBUG MARKER: modprobe dm-flakey;
…
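The timed-out request is o251, the MGS disconnect sent to the local MGC during shutdown; despite the timeout, "server umount lustre-MDT0000 complete" shows the unmount itself finished. A minimal sketch for verifying that from the MDS, using standard commands (the sequence is my illustration, not from the logs):
# Should print 0 once the MDT is unmounted
grep -c '/mnt/lustre-mds1 ' /proc/mounts || true
# Should show no devices left in ' ST ' (stopping) state
lctl dl | grep ' ST ' || true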
And the OSS (vm3) has trouble contacting the MGS:
[21473.703626] Lustre: DEBUG MARKER: /usr/sbin/lctl mark SKIP: sanity-lfsck test_40a needs \>= 2 MDTs
[21473.910221] Lustre: DEBUG MARKER: SKIP: sanity-lfsck test_40a needs >= 2 MDTs
[21476.470901] Lustre: lustre-MDT0000-lwp-OST0002: Connection to lustre-MDT0000 (at 10.2.9.18@tcp) was lost; in progress operations using this service will wait for recovery to complete
[21476.474051] Lustre: Skipped 19 previous similar messages
[21486.315508] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true
[21486.669349] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost1
[21487.127074] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
[21487.127074] lctl dl | grep ' ST ' || true
[21487.485015] Lustre: DEBUG MARKER: modprobe dm-flakey;
[21487.485015] dmsetup targets | grep -q flakey
[21487.839735] Lustre: DEBUG MARKER: dmsetup status /dev/mapper/ost1_flakey >/dev/null 2>&1
[21487.852790] Lustre: 18994:0:(client.c:2261:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1594275767/real 1594275767] req@ffff9854dce39f80 x1671696932054080/t0(0) o400->MGC10.2.9.18@tcp@10.2.9.18@tcp:26/25 lens 224/224 e 0 to 1 dl 1594275774 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
[21487.857569] Lustre: 18994:0:(client.c:2261:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
[21487.859321] LustreError: 166-1: MGC10.2.9.18@tcp: Connection to MGS (at 10.2.9.18@tcp) was lost; in progress operations using this service will fail
[21487.861542] LustreError: Skipped 4 previous similar messages
[21488.205971] Lustre: DEBUG MARKER: dmsetup table /dev/mapper/ost1_flakey
[21488.557083] Lustre: DEBUG MARKER: dmsetup remove /dev/mapper/ost1_flakey
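The o400 request that times out is a ping to the MGS, which runs on the MDS (vm4, 10.2.9.18@tcp) and has just been unmounted, so the lost MGS connection looks like expected fallout of the MDS umount rather than an independent failure. A sketch for checking this from the OSS; lctl ping is standard, while the mgc import parameter path is my assumption and may vary by Lustre version:
# Raw LNet reachability of the MGS NID
lctl ping 10.2.9.18@tcp
# MGC import state (parameter path assumed)
lctl get_param mgc.*.import 2>/dev/null | grep -i state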