[LU-14005] Various tests hang on tee/"nfs: server *.whamcloud.com not responding, timed out" Created: 01/Oct/20  Updated: 08/May/23  Resolved: 18/Jun/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Related
is related to LU-15015 test timeout with "tee" hung in nfs_u... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A large number of tests hang with complaints that the NFS server is not responding while a userspace utility, in this case 'tee', is blocked. This issue appears to have started in early July 2020. Here are a few examples, with dates and links to the test sessions, of tests hanging in 'tee' (a sketch of how 'tee' ends up writing to NFS follows the list):

0 times in May 2020
0 times in June 2020
9 times in July 2020 - all Ubuntu 18.04 client testing on Full test group - seen on master
2020-07-09 2.13.54.148 - sanity-lfsck test 40a - https://testing.whamcloud.com/test_sets/e8ef8293-bc76-4049-a29b-7e16d288c317
2020-07-25 2.13.54.178 - ost-pools 13 - https://testing.whamcloud.com/test_sets/77891514-bbe9-4ca9-b80e-fd00e05ab693

12 times in August 2020 - seen on master and b2_12 for Ubuntu 18.04 and el7.8 client testing
2020-08-07 - 2.13.55.16 - sanity-lfsck test 2d - https://testing.whamcloud.com/test_sets/252b6116-6daa-4e8f-9d34-8dc97ed38a14

42 times in September 2020
2020-09-27 2.13.56 - sanity-lfsck test 40a - https://testing.whamcloud.com/test_sets/d107b9c2-c310-45bc-95a1-02a5403b777d
2020-09-26 2.13.56.1 - sanityn test 34 - https://testing.whamcloud.com/test_sets/e94842bb-b741-4523-92e2-03cfc00e96f6
2020-09-26 2.13.56.1 - replay-single test 90 - https://testing.whamcloud.com/test_sets/3304f735-2911-4d6d-9497-06f8c4df674f
2020-09-26 2.13.56.1 - sanity-flr test 38 - https://testing.whamcloud.com/test_sets/ac72de75-8f1b-4195-b6c4-c91f88066cc5
2020-09-26 2.13.56.1 - sanity-flr test 37 - https://testing.whamcloud.com/test_sets/c44c4353-5dd6-481a-a65e-2bbdaa3ac4c6
2020-09-03 2.13.55.67 - conf-sanity test 123F - https://testing.whamcloud.com/test_sets/56511e72-0ec0-4efb-81d0-accb94b3f43c
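
In these runs the harness is, presumably, duplicating each suite's output into a log file with 'tee' on an NFS-mounted log directory, so an unresponsive NFS server leaves 'tee' blocked in the kernel. A minimal sketch of the failing pattern, with hypothetical paths and server names (the real harness layout may differ):

# Hypothetical reproduction of the hang pattern: 'tee' logging to a hard NFS
# mount. With the default "hard" option, NFS writes are retried indefinitely,
# so when the server stops responding, tee blocks in D state inside the kernel
# (nfs_updatepage in the traces below) and the whole test pipeline stalls.
LOGDIR=/mnt/autotest_logs                            # hypothetical NFS log dir
mount -t nfs -o hard logserver:/export/logs "$LOGDIR"
bash sanity-lfsck.sh 2>&1 | tee "$LOGDIR/suite_log"  # hangs if the server stalls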

Let's look at the logs for the first example of this hang that I can find: sanity-lfsck hanging after skipping test 40a at https://testing.whamcloud.com/test_sets/e8ef8293-bc76-4049-a29b-7e16d288c317. In the client1 (vm1) console log, we can see the hang:

[21561.340768] Lustre: DEBUG MARKER: == sanity-lfsck test 40a: LFSCK correctly fixes lmm_oi in composite layout =========================== 06:22:39 (1594275759)
[21561.585054] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  SKIP: sanity-lfsck test_40a needs \>= 2 MDTs
[21561.795054] Lustre: DEBUG MARKER: SKIP: sanity-lfsck test_40a needs >= 2 MDTs
[21562.007611] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
[21562.007611] if [ $running -ne 0 ] ; then
[21562.007611] echo Stopping client $(hostname) /mnt/lustre opts:;
[21562.007611] lsof /mnt/lustre || need_kill=no;
[21562.007611] if [ x != x -a x$need_kill != xno ]; then
[21562.007611]     pids=$(lsof -t /mnt/lustre | sort -u);
[21562.007611]     if 
[21562.520943] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre2' ' /proc/mounts);
[21562.520943] if [ $running -ne 0 ] ; then
[21562.520943] echo Stopping client $(hostname) /mnt/lustre2 opts:;
[21562.520943] lsof /mnt/lustre2 || need_kill=no;
[21562.520943] if [ x != x -a x$need_kill != xno ]; then
[21562.520943]     pids=$(lsof -t /mnt/lustre2 | sort -u);
[21562.520943]    
[21591.487956] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true
[21591.733320] Key type lgssc unregistered
[21592.806013] LNet: Removed LNI 10.2.9.15@tcp
[21592.955051] Key type .llcrypt unregistered
[21592.955960] Key type ._llcrypt unregistered
[21788.032741] nfs: server onyx-1.onyx.whamcloud.com not responding, still trying
[21809.536149] nfs: server onyx-4.onyx.whamcloud.com not responding, timed out
[21824.703708] nfs: server onyx-1.onyx.whamcloud.com not responding, timed out
[21824.705110] nfs: server onyx-1.onyx.whamcloud.com not responding, still trying
[21870.974479] INFO: task tee:439 blocked for more than 120 seconds.
[21870.975847]       Tainted: G           OE    4.15.0-72-generic #81-Ubuntu
[21870.977119] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[21870.978617] tee             D    0   439   1321 0x00000000
[21870.979697] Call Trace:
[21870.980283]  __schedule+0x24e/0x880
[21870.981029]  ? update_curr+0x7a/0x1d0
[21870.981781]  ? bit_wait+0x60/0x60
[21870.982484]  schedule+0x2c/0x80
[21870.983152]  io_schedule+0x16/0x40
[21870.983864]  bit_wait_io+0x11/0x60
[21870.984590]  __wait_on_bit+0x4c/0x90
[21870.985324]  out_of_line_wait_on_bit+0x90/0xb0
[21870.986207]  ? bit_waitqueue+0x40/0x40
[21870.987009]  nfs_wait_on_request+0x46/0x50 [nfs]
[21870.987947]  nfs_lock_and_join_requests+0x121/0x510 [nfs]
[21870.988997]  ? __switch_to_asm+0x41/0x70
[21870.989795]  ? radix_tree_lookup_slot+0x22/0x50
[21870.990706]  nfs_updatepage+0x168/0x990 [nfs]
[21870.991590]  nfs_write_end+0x19a/0x4f0 [nfs]
[21870.992464]  generic_perform_write+0xf6/0x1b0
[21870.993328]  ? _cond_resched+0x19/0x40
[21870.994087]  ? _cond_resched+0x19/0x40
[21870.994869]  nfs_file_write+0xfd/0x240 [nfs]
[21870.995749]  new_sync_write+0xe7/0x140
[21870.996523]  __vfs_write+0x29/0x40
[21870.997230]  vfs_write+0xb1/0x1a0
[21870.997924]  SyS_write+0x5c/0xe0
[21870.998628]  do_syscall_64+0x73/0x130
[21870.999390]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[21871.000372] RIP: 0033:0x7f7f4d500154
[21871.001106] RSP: 002b:00007ffcf552c158 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[21871.002526] RAX: ffffffffffffffda RBX: 0000000000000041 RCX: 00007f7f4d500154
[21871.003875] RDX: 0000000000000041 RSI: 00007ffcf552c240 RDI: 0000000000000003
[21871.005198] RBP: 00007ffcf552c240 R08: 0000000000000041 R09: 00007f7f4d9ff540
[21871.006533] R10: 00000000000001b6 R11: 0000000000000246 R12: 000055bf8cceb460
[21871.007870] R13: 0000000000000041 R14: 00007f7f4d7d7760 R15: 0000000000000041
[21884.862103] nfs: server onyx-1.onyx.whamcloud.com not responding, still trying
[21905.917535] nfs: server onyx-1.onyx.whamcloud.com not responding, still trying
[21990.139281] nfs: server onyx-4.onyx.whamcloud.com not responding, timed out
…
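
The trace shows 'tee' blocked in nfs_updatepage() waiting on an NFS page, and the register dump gives the write() arguments: RDI=0x3, i.e. the write was issued on file descriptor 3. On a live node, commands along these lines (a generic sketch, using PID 439 from the warning above) would confirm which NFS mount the task is stuck on:

pid=439                        # the blocked 'tee' from the hung-task warning
cat /proc/$pid/stack           # kernel stack; should match the trace above
ls -l /proc/$pid/fd/3          # RDI=0x3 in the register dump is the write() fd
grep ' nfs ' /proc/mounts      # which server backs the NFS mounts on this node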

Meanwhile, the MDS (vm4) console log shows some trouble during umount, although the unmount itself completes:

[21473.702000] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  SKIP: sanity-lfsck test_40a needs \>= 2 MDTs
[21473.913989] Lustre: DEBUG MARKER: SKIP: sanity-lfsck test_40a needs >= 2 MDTs
[21475.090937] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
[21475.408647] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds1
[21476.496138] Lustre: lustre-MDT0000: Not available for connect from 10.2.9.17@tcp (stopping)
[21476.497684] Lustre: Skipped 1 previous similar message
[21481.503425] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.9.17@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[21481.506392] LustreError: Skipped 20 previous similar messages
[21483.732374] Lustre: 15103:0:(client.c:2261:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1594275764/real 1594275764]  req@ffff8ce7e5553180 x1671696961134016/t0(0) o251->MGC10.2.9.18@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1594275770 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'umount.0'
[21483.737598] Lustre: 15103:0:(client.c:2261:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
[21483.811617] Lustre: server umount lustre-MDT0000 complete
[21483.812679] Lustre: Skipped 4 previous similar messages
[21484.110697] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
[21484.110697] lctl dl | grep ' ST ' || true
[21484.424086] Lustre: DEBUG MARKER: modprobe dm-flakey;
…
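
The "lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true" marker is the framework checking whether any Lustre targets are still registered after the unmount; annotated below (my reading of the snippet, not harness documentation):

lsmod | grep lnet > /dev/null &&  # only if the LNet modules are still loaded...
  lctl dl | grep ' ST ' ||        # ...list OBD devices left in the 'ST' (stopped) state
  true                            # and never fail the harness on a clean node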

And the OSS (vm3) has trouble contacting the MGS:

[21473.703626] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  SKIP: sanity-lfsck test_40a needs \>= 2 MDTs
[21473.910221] Lustre: DEBUG MARKER: SKIP: sanity-lfsck test_40a needs >= 2 MDTs
[21476.470901] Lustre: lustre-MDT0000-lwp-OST0002: Connection to lustre-MDT0000 (at 10.2.9.18@tcp) was lost; in progress operations using this service will wait for recovery to complete
[21476.474051] Lustre: Skipped 19 previous similar messages
[21486.315508] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true
[21486.669349] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost1
[21487.127074] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
[21487.127074] lctl dl | grep ' ST ' || true
[21487.485015] Lustre: DEBUG MARKER: modprobe dm-flakey;
[21487.485015] 			 dmsetup targets | grep -q flakey
[21487.839735] Lustre: DEBUG MARKER: dmsetup status /dev/mapper/ost1_flakey >/dev/null 2>&1
[21487.852790] Lustre: 18994:0:(client.c:2261:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1594275767/real 1594275767]  req@ffff9854dce39f80 x1671696932054080/t0(0) o400->MGC10.2.9.18@tcp@10.2.9.18@tcp:26/25 lens 224/224 e 0 to 1 dl 1594275774 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
[21487.857569] Lustre: 18994:0:(client.c:2261:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
[21487.859321] LustreError: 166-1: MGC10.2.9.18@tcp: Connection to MGS (at 10.2.9.18@tcp) was lost; in progress operations using this service will fail
[21487.861542] LustreError: Skipped 4 previous similar messages
[21488.205971] Lustre: DEBUG MARKER: dmsetup table /dev/mapper/ost1_flakey
[21488.557083] Lustre: DEBUG MARKER: dmsetup remove /dev/mapper/ost1_flakey
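
The dmsetup lines are the framework tearing down the dm-flakey device it wraps around the OST so it can inject device failures. For reference, a flakey target is managed roughly like this (device name and intervals are hypothetical):

OSTDEV=/dev/vdb                                 # hypothetical OST backing device
SIZE=$(blockdev --getsz "$OSTDEV")              # device size in 512-byte sectors
# table "0 $SIZE flakey $OSTDEV 0 180 0": pass I/O for 180s, then drop it for 0s
dmsetup create ost1_flakey --table "0 $SIZE flakey $OSTDEV 0 180 0"
dmsetup status /dev/mapper/ost1_flakey          # health check, as in the log above
dmsetup remove ost1_flakey                      # teardown, as in the log above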


 Comments   
Comment by Andreas Dilger [ 18/Jun/21 ]

Closing this because it was an infrastructure problem and has not been seen recently.

Comment by Alena Nikitenko [ 29/Nov/21 ]

Noticed a very similar problem in conf-sanity test 83 on 2.12.8: https://testing.whamcloud.com/test_sets/7fd1dbfe-ba9c-4e15-8e23-46f6c55ff7c5

[21566.364915] Lustre: DEBUG MARKER: == conf-sanity test 83: ENOSPACE on OST doesn't cause message VFS: Busy inodes after unmount ... ===== 08:01:39 (1637395299)
[21566.602420] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mount the OST \/dev\/mapper\/ost1_flakey as a ldiskfs filesystem
[21566.792762] Lustre: DEBUG MARKER: mount the OST /dev/mapper/ost1_flakey as a ldiskfs filesystem
[21569.085312] Lustre: DEBUG MARKER: /usr/sbin/lctl mark run llverfs in partial mode on the OST ldiskfs \/mnt\/lustre-ost1
[21569.278540] Lustre: DEBUG MARKER: run llverfs in partial mode on the OST ldiskfs /mnt/lustre-ost1
[21573.302736] Lustre: DEBUG MARKER: /usr/sbin/lctl mark unmount the OST \/dev\/mapper\/ost1_flakey
[21573.502519] Lustre: DEBUG MARKER: unmount the OST /dev/mapper/ost1_flakey
[21592.702480] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
[21592.702480] if [ $running -ne 0 ] ; then
[21592.702480] echo Stopping client $(hostname) /mnt/lustre opts:-f;
[21592.702480] lsof /mnt/lustre || need_kill=no;
[21592.702480] if [ x-f != x -a x$need_kill != xno ]; then
[21592.702480]     pids=$(lsof -t /mnt/lustre | sort -u);
[21592.702480]    
[21593.034838] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre2' ' /proc/mounts);
[21593.034838] if [ $running -ne 0 ] ; then
[21593.034838] echo Stopping client $(hostname) /mnt/lustre2 opts:-f;
[21593.034838] lsof /mnt/lustre2 || need_kill=no;
[21593.034838] if [ x-f != x -a x$need_kill != xno ]; then
[21593.034838]     pids=$(lsof -t /mnt/lustre2 | sort -u);
[21795.718956] nfs: server 10.240.16.204 not responding, still trying
[21811.082846] nfs: server 10.240.16.204 not responding, still trying
[21841.798794] nfs: server 10.240.16.204 not responding, timed out
[21871.498603] INFO: task tee:16545 blocked for more than 120 seconds.
[21871.499751]       Tainted: G           OE    4.15.0-161-generic #169-Ubuntu
[21871.500953] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[21871.502289] tee             D    0 16545   1322 0x00000000
[21871.503270] Call Trace:
[21871.503846]  __schedule+0x24e/0x890
[21871.504523]  ? __cgroup_account_cputime+0x28/0x30
[21871.505388]  ? update_curr+0x7a/0x1d0
[21871.506079]  ? bit_wait+0x60/0x60
[21871.506791]  schedule+0x2c/0x80
[21871.507466]  io_schedule+0x16/0x40
[21871.508267]  bit_wait_io+0x11/0x60
[21871.508917]  __wait_on_bit+0x4c/0x90
[21871.509588]  out_of_line_wait_on_bit+0x90/0xb0
[21871.510403]  ? bit_waitqueue+0x40/0x40
[21871.511182]  nfs_wait_on_request+0x46/0x50 [nfs]
[21871.512034]  nfs_lock_and_join_requests+0x121/0x530 [nfs]
[21871.512991]  ? __switch_to_asm+0x41/0x70
[21871.513723]  ? radix_tree_lookup_slot+0x22/0x50
[21871.514543]  nfs_updatepage+0x168/0x990 [nfs]
[21871.515354]  nfs_write_end+0x19a/0x4f0 [nfs]
[21871.516153]  generic_perform_write+0xf6/0x1b0
[21871.516963]  ? _cond_resched+0x19/0x40
[21871.517665]  ? _cond_resched+0x19/0x40
[21871.518362]  nfs_file_write+0xfd/0x240 [nfs]
[21871.519172]  new_sync_write+0xe7/0x140
[21871.519897]  __vfs_write+0x29/0x40
[21871.520540]  vfs_write+0xb1/0x1a0
[21871.521175]  SyS_write+0x5c/0xe0
[21871.521824]  do_syscall_64+0x73/0x130
[21871.522506]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
[21871.523424] RIP: 0033:0x7f2d3c44f224
[21871.524104] RSP: 002b:00007ffe3e107998 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[21871.525409] RAX: ffffffffffffffda RBX: 000000000000004f RCX: 00007f2d3c44f224
[21871.526645] RDX: 000000000000004f RSI: 00007ffe3e107a80 RDI: 0000000000000003
[21871.527865] RBP: 00007ffe3e107a80 R08: 000000000000004f R09: 00007f2d3c94e540
[21871.529081] R10: 00000000000001b6 R11: 0000000000000246 R12: 000055a51d32d460
[21871.530298] R13: 000000000000004f R14: 00007f2d3c726760 R15: 000000000000004f 