LU-14005: Various tests hang on tee/"nfs: server *.whamcloud.com not responding, timed out"


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Affects Version/s: Lustre 2.14.0
    • Severity: 3

    Description

      There are a large number of tests that hang with the client complaining that the NFS server is not responding while a utility, in this case ‘tee’, hangs blocked on an NFS write. The issue looks like it started in early July. Here are a few examples, with dates and links to the test suites, of tests hanging in ‘tee’ (a log-search sketch follows the list):

      0 times in May 2020
      0 times in June 2020
      9 times in July 2020 - all Ubuntu 18.04 client testing on Full test group - seen on master
      2020-07-09 2.13.54.148 - sanity-lfsck test 40a - https://testing.whamcloud.com/test_sets/e8ef8293-bc76-4049-a29b-7e16d288c317
      2020-07-25 2.13.54.178 - ost-pools 13 - https://testing.whamcloud.com/test_sets/77891514-bbe9-4ca9-b80e-fd00e05ab693

      12 times in August 2020 - seen on master and b2_12 for Ubuntu 18.04 and el7.8 client testing
      2020-08-07 - 2.13.55.16 - sanity-lfsck test 2d - https://testing.whamcloud.com/test_sets/252b6116-6daa-4e8f-9d34-8dc97ed38a14

      42 times in September 2020
      2020-09-27 2.13.56 - sanity-lfsck test 40a - https://testing.whamcloud.com/test_sets/d107b9c2-c310-45bc-95a1-02a5403b777d
      2020-09-26 2.13.56.1 - sanityn test 34 - https://testing.whamcloud.com/test_sets/e94842bb-b741-4523-92e2-03cfc00e96f6
      2020-09-26 2.13.56.1 - replay-single test 90 - https://testing.whamcloud.com/test_sets/3304f735-2911-4d6d-9497-06f8c4df674f
      2020-09-26 2.13.56.1 - sanity-flr test 38 - https://testing.whamcloud.com/test_sets/ac72de75-8f1b-4195-b6c4-c91f88066cc5
      2020-09-26 2.13.56.1 - sanity-flr test 37 - https://testing.whamcloud.com/test_sets/c44c4353-5dd6-481a-a65e-2bbdaa3ac4c6
      2020-09-03 2.13.55.67 - conf-sanity test 123F - https://testing.whamcloud.com/test_sets/56511e72-0ec0-4efb-81d0-accb94b3f43c

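      When scanning a batch of collected console logs for further occurrences, the signature is easy to match with grep. The sketch below is only illustrative; the 'console-logs/' directory layout is hypothetical and not part of autotest:

      # Hypothetical layout: one console log per node per session under console-logs/.
      # List the logs that show both the NFS timeout and a 'tee' task blocked on it.
      grep -l 'nfs: server .*\.whamcloud\.com not responding' console-logs/*.log \
          | xargs -r grep -l 'INFO: task tee:.* blocked for more than 120 seconds'
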
      Let’s look at the logs for the first example of this hang that I can find: sanity-lfsck hanging after skipping test 40a at https://testing.whamcloud.com/test_sets/e8ef8293-bc76-4049-a29b-7e16d288c317. Looking at the client1 (vm1) console log, we can see the hang:

      [21561.340768] Lustre: DEBUG MARKER: == sanity-lfsck test 40a: LFSCK correctly fixes lmm_oi in composite layout =========================== 06:22:39 (1594275759)
      [21561.585054] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  SKIP: sanity-lfsck test_40a needs \>= 2 MDTs
      [21561.795054] Lustre: DEBUG MARKER: SKIP: sanity-lfsck test_40a needs >= 2 MDTs
      [21562.007611] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
      [21562.007611] if [ $running -ne 0 ] ; then
      [21562.007611] echo Stopping client $(hostname) /mnt/lustre opts:;
      [21562.007611] lsof /mnt/lustre || need_kill=no;
      [21562.007611] if [ x != x -a x$need_kill != xno ]; then
      [21562.007611]     pids=$(lsof -t /mnt/lustre | sort -u);
      [21562.007611]     if 
      [21562.520943] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre2' ' /proc/mounts);
      [21562.520943] if [ $running -ne 0 ] ; then
      [21562.520943] echo Stopping client $(hostname) /mnt/lustre2 opts:;
      [21562.520943] lsof /mnt/lustre2 || need_kill=no;
      [21562.520943] if [ x != x -a x$need_kill != xno ]; then
      [21562.520943]     pids=$(lsof -t /mnt/lustre2 | sort -u);
      [21562.520943]    
      [21591.487956] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true
      [21591.733320] Key type lgssc unregistered
      [21592.806013] LNet: Removed LNI 10.2.9.15@tcp
      [21592.955051] Key type .llcrypt unregistered
      [21592.955960] Key type ._llcrypt unregistered
      [21788.032741] nfs: server onyx-1.onyx.whamcloud.com not responding, still trying
      [21809.536149] nfs: server onyx-4.onyx.whamcloud.com not responding, timed out
      [21824.703708] nfs: server onyx-1.onyx.whamcloud.com not responding, timed out
      [21824.705110] nfs: server onyx-1.onyx.whamcloud.com not responding, still trying
      [21870.974479] INFO: task tee:439 blocked for more than 120 seconds.
      [21870.975847]       Tainted: G           OE    4.15.0-72-generic #81-Ubuntu
      [21870.977119] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [21870.978617] tee             D    0   439   1321 0x00000000
      [21870.979697] Call Trace:
      [21870.980283]  __schedule+0x24e/0x880
      [21870.981029]  ? update_curr+0x7a/0x1d0
      [21870.981781]  ? bit_wait+0x60/0x60
      [21870.982484]  schedule+0x2c/0x80
      [21870.983152]  io_schedule+0x16/0x40
      [21870.983864]  bit_wait_io+0x11/0x60
      [21870.984590]  __wait_on_bit+0x4c/0x90
      [21870.985324]  out_of_line_wait_on_bit+0x90/0xb0
      [21870.986207]  ? bit_waitqueue+0x40/0x40
      [21870.987009]  nfs_wait_on_request+0x46/0x50 [nfs]
      [21870.987947]  nfs_lock_and_join_requests+0x121/0x510 [nfs]
      [21870.988997]  ? __switch_to_asm+0x41/0x70
      [21870.989795]  ? radix_tree_lookup_slot+0x22/0x50
      [21870.990706]  nfs_updatepage+0x168/0x990 [nfs]
      [21870.991590]  nfs_write_end+0x19a/0x4f0 [nfs]
      [21870.992464]  generic_perform_write+0xf6/0x1b0
      [21870.993328]  ? _cond_resched+0x19/0x40
      [21870.994087]  ? _cond_resched+0x19/0x40
      [21870.994869]  nfs_file_write+0xfd/0x240 [nfs]
      [21870.995749]  new_sync_write+0xe7/0x140
      [21870.996523]  __vfs_write+0x29/0x40
      [21870.997230]  vfs_write+0xb1/0x1a0
      [21870.997924]  SyS_write+0x5c/0xe0
      [21870.998628]  do_syscall_64+0x73/0x130
      [21870.999390]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      [21871.000372] RIP: 0033:0x7f7f4d500154
      [21871.001106] RSP: 002b:00007ffcf552c158 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [21871.002526] RAX: ffffffffffffffda RBX: 0000000000000041 RCX: 00007f7f4d500154
      [21871.003875] RDX: 0000000000000041 RSI: 00007ffcf552c240 RDI: 0000000000000003
      [21871.005198] RBP: 00007ffcf552c240 R08: 0000000000000041 R09: 00007f7f4d9ff540
      [21871.006533] R10: 00000000000001b6 R11: 0000000000000246 R12: 000055bf8cceb460
      [21871.007870] R13: 0000000000000041 R14: 00007f7f4d7d7760 R15: 0000000000000041
      [21884.862103] nfs: server onyx-1.onyx.whamcloud.com not responding, still trying
      [21905.917535] nfs: server onyx-1.onyx.whamcloud.com not responding, still trying
      [21990.139281] nfs: server onyx-4.onyx.whamcloud.com not responding, timed out
      …
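
      The hung-task report above only fires after the 120-second watchdog interval. When reproducing this interactively, the blocked-task stacks can be dumped on demand instead. A minimal sketch, assuming root on the client node; these are standard kernel interfaces, not something taken from the test logs:

      # Enable sysrq if it is not already on:
      echo 1 > /proc/sys/kernel/sysrq

      # Dump stack traces of all uninterruptible (D-state) tasks to dmesg,
      # the same information the hung-task watchdog printed for 'tee' above:
      echo w > /proc/sysrq-trigger

      # Optionally shorten the hung-task watchdog from its 120 s default so
      # further reports show up sooner in the console log:
      echo 30 > /proc/sys/kernel/hung_task_timeout_secs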
      

      We can see that the MDS (vm4) has trouble during umount:

      [21473.702000] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  SKIP: sanity-lfsck test_40a needs \>= 2 MDTs
      [21473.913989] Lustre: DEBUG MARKER: SKIP: sanity-lfsck test_40a needs >= 2 MDTs
      [21475.090937] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
      [21475.408647] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds1
      [21476.496138] Lustre: lustre-MDT0000: Not available for connect from 10.2.9.17@tcp (stopping)
      [21476.497684] Lustre: Skipped 1 previous similar message
      [21481.503425] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.9.17@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [21481.506392] LustreError: Skipped 20 previous similar messages
      [21483.732374] Lustre: 15103:0:(client.c:2261:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1594275764/real 1594275764]  req@ffff8ce7e5553180 x1671696961134016/t0(0) o251->MGC10.2.9.18@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1594275770 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'umount.0'
      [21483.737598] Lustre: 15103:0:(client.c:2261:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
      [21483.811617] Lustre: server umount lustre-MDT0000 complete
      [21483.812679] Lustre: Skipped 4 previous similar messages
      [21484.110697] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
      [21484.110697] lctl dl | grep ' ST ' || true
      [21484.424086] Lustre: DEBUG MARKER: modprobe dm-flakey;
      …
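
      As an aside, the ptlrpc_expire_one_request line above embeds Unix timestamps ('sent 1594275764 ... dl 1594275770') that can be converted to wall-clock time to line them up with the DEBUG MARKER timestamps in the other console logs. A small sketch with plain date(1), using the values from the log above:

      # Convert the 'sent' and deadline ('dl') timestamps of the timed-out
      # request to the MGC into UTC wall-clock time:
      date -u -d @1594275764   # sent
      date -u -d @1594275770   # dl (deadline)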
      

      And the OSS (vm3) has trouble contacting the MGS:

      [21473.703626] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  SKIP: sanity-lfsck test_40a needs \>= 2 MDTs
      [21473.910221] Lustre: DEBUG MARKER: SKIP: sanity-lfsck test_40a needs >= 2 MDTs
      [21476.470901] Lustre: lustre-MDT0000-lwp-OST0002: Connection to lustre-MDT0000 (at 10.2.9.18@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [21476.474051] Lustre: Skipped 19 previous similar messages
      [21486.315508] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true
      [21486.669349] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost1
      [21487.127074] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
      [21487.127074] lctl dl | grep ' ST ' || true
      [21487.485015] Lustre: DEBUG MARKER: modprobe dm-flakey;
      [21487.485015] 			 dmsetup targets | grep -q flakey
      [21487.839735] Lustre: DEBUG MARKER: dmsetup status /dev/mapper/ost1_flakey >/dev/null 2>&1
      [21487.852790] Lustre: 18994:0:(client.c:2261:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1594275767/real 1594275767]  req@ffff9854dce39f80 x1671696932054080/t0(0) o400->MGC10.2.9.18@tcp@10.2.9.18@tcp:26/25 lens 224/224 e 0 to 1 dl 1594275774 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
      [21487.857569] Lustre: 18994:0:(client.c:2261:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
      [21487.859321] LustreError: 166-1: MGC10.2.9.18@tcp: Connection to MGS (at 10.2.9.18@tcp) was lost; in progress operations using this service will fail
      [21487.861542] LustreError: Skipped 4 previous similar messages
      [21488.205971] Lustre: DEBUG MARKER: dmsetup table /dev/mapper/ost1_flakey
      [21488.557083] Lustre: DEBUG MARKER: dmsetup remove /dev/mapper/ost1_flakey
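
      When chasing the lost MGS connection above, a quick LNet-level check from the OSS can rule out basic connectivity problems. A minimal sketch using standard lctl commands, run as root on the OSS, with the MGS NID taken from the log above:

      # Verify LNet reachability of the MGS NID reported in the log:
      lctl ping 10.2.9.18@tcp

      # List this node's own NIDs for comparison:
      lctl list_nids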
      

People

    Assignee: wc-triage WC Triage
    Reporter: jamesanunez James Nunez (Inactive)
    Votes: 0
    Watchers: 3