Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14525

parallel-scale-nfsv4 hangs before any tests can run

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.12.5
    • None
    • 3
    • 9223372036854775807

    Description

      parallel-scale-nfsv4 hangs before any tests can run. We’ve only seen this issue a handful of times such as for Lustre 2.12.4.61 (SLES12.4 clients); https://testing.whamcloud.com/test_sets/12a1d345-1f01-45de-9d30-21cbd88ea61f. Testing hang when we try to unmont the clients. Looking at the suite_log, we see

      Stopping clients: onyx-46vm1,onyx-46vm2 /mnt/lustre (opts:)
      CMD: onyx-46vm1,onyx-46vm2 running=\$(grep -c /mnt/lustre' ' /proc/mounts);
      if [ \$running -ne 0 ] ; then
      echo Stopping client \$(hostname) /mnt/lustre opts:;
      lsof /mnt/lustre || need_kill=no;
      if [ x != x -a x\$need_kill != xno ]; then
          pids=\$(lsof -t /mnt/lustre | sort -u);
          if [ -n \"\$pids\" ]; then
                   kill -9 \$pids;
          fi
      fi;
      while umount  /mnt/lustre 2>&1 | grep -q busy; do
          echo /mnt/lustre is still busy, wait one second && sleep 1;
      done;
      fi
      

      Looking at the OSS (vm3) console log, we see client evictions

      [95316.867814] Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: parallel-scale-nfsv4 ============----- Wed May 20 19:03:21 PDT 2020
      [95317.080668] Lustre: DEBUG MARKER: -----============= acceptance-small: parallel-scale-nfsv4 ============----- Wed May 20 19:03:21 PDT 2020
      [95317.585046] LustreError: 8592:0:(ldlm_lockd.c:681:ldlm_handle_ast_error()) ### client (nid 10.2.8.200@tcp) returned error from blocking AST (req@ffff8dab96ebb600 x1667201457741568 status -107 rc -107), evict it ns: filter-lustre-OST0005_UUID lock: ffff8dabe0c89680/0xb23e2fc63d741a2a lrc: 4/0,0 mode: PW/PW res: [0x80a5b:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400030020 nid: 10.2.8.200@tcp remote: 0xcc7d8b5f5da656a4 expref: 250 pid: 8745 timeout: 95422 lvb_type: 0
      [95317.585053] LustreError: 138-a: lustre-OST0005: A client on nid 10.2.8.200@tcp was evicted due to a lock blocking callback time out: rc -107
      [95317.585084] LustreError: 24558:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 0s: evicting client at 10.2.8.200@tcp  ns: filter-lustre-OST0005_UUID lock: ffff8dabbc77d680/0xb23e2fc63d741aaf lrc: 3/0,0 mode: PW/PW res: [0x80a5c:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 1048576->1179647) flags: 0x60000400030020 nid: 10.2.8.200@tcp remote: 0xcc7d8b5f5da6615c expref: 251 pid: 8623 timeout: 0 lvb_type: 0
      [95317.595471] LustreError: 8592:0:(ldlm_lockd.c:681:ldlm_handle_ast_error()) Skipped 6 previous similar messages
      [95318.000138] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-46vm1: executing check_config_client \/mnt\/lustre
      [95318.006487] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-46vm2: executing check_config_client \/mnt\/lustre
      [95318.224807] Lustre: DEBUG MARKER: onyx-46vm1: executing check_config_client /mnt/lustre
      [95318.229146] Lustre: DEBUG MARKER: onyx-46vm2: executing check_config_client /mnt/lustre
      [95318.452944] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre-ost1' ' /proc/mounts);
      

      On the client (vm2) console log, we see some nfs server issues

      [95486.336436] echo Stopping client $(hostname) /mnt/lustre opts:;
      [95486.336436] lsof /mnt/lustre || need_kill=no;
      [95486.336436] if [ x != x -a x$need_kill != xno ]; then
      [95486.336436]     pids=$(lsof -t /mnt/lustre | sort -u);
      [95486.336436]     if 
      [95626.965822] nfs: server onyx-46vm4 not responding, still trying
      [95666.899579] nfs: server onyx-46vm4 not responding, still trying
      [98788.703830] sysrq: SysRq : Changing Loglevel
      [98788.704447] sysrq: Loglevel set to 8
      [98788.705933] sysrq: SysRq : Show State
      [98788.706414]   task                        PC stack   pid father
      [98788.707056] systemd         S    0     1      0 0x00000000
      [98788.707654] Call Trace:
      [98788.708001]  ? __schedule+0x292/0x880
      [98788.708437]  schedule+0x32/0x80
      [98788.708805]  schedule_hrtimeout_range_clock+0x106/0x110
      [98788.709400]  ? ep_scan_ready_list.isra.14+0x1af/0x1d0
      [98788.709952]  ep_poll+0x2b6/0x360
      [98788.710378]  ? wake_up_q+0x70/0x70
      [98788.710958]  SyS_epoll_wait+0xb2/0xe0
      [98788.710968]  do_syscall_64+0x74/0x160
      [98788.710979]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      [98788.710987] RIP: 0033:0x7f8cb7669523
      [98788.712823] RSP: 002b:00007ffc5ae0d4a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e8
      [98788.712830] RAX: ffffffffffffffda RBX: 0000561cf19dfde0 RCX: 00007f8cb7669523
      [98788.712830] RDX: 0000000000000027 RSI: 00007ffc5ae0d4b0 RDI: 0000000000000004
      [98788.712831] RBP: 00007ffc5ae0d790 R08: 00000000000ddb68 R09: 0000561cf09489c0
      [98788.712833] R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000001
      [98788.716708] R13: 00007ffc5ae0d4b0 R14: ffffffffffffffff R15: 0000000000000000
      

      Another hang:
      Lustre 2.12.4.47 - CentOS7.7 - https://testing.whamcloud.com/test_sets/790dab9e-22d4-4c0b-8f58-41cff56c5810

      Attachments

        Issue Links

          Activity

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated: