Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14137

parallel-scale-nfsv4 racer_on_nfs server not responding leads to pjdfstest to hang or fail

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.14.0, Lustre 2.12.6
    • None
    • 3
    • 9223372036854775807

    Description

      We’ve seen that racer_on_nfs for parallel-scale-nfsv3/4 can lead to the next test suite not being able to run due to the NFS server not responding; see LU-9940. We see this for the latest test suite to follow parallel-scale-nfsv4, pjdfstest; https://testing.whamcloud.com/test_sets/177f113b-8697-4e90-8a5b-a6b9fc4960d4 .

      In client 1 (vm1) console

      [104844.990816] Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
      [104845.232371] Lustre: DEBUG MARKER: -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
      [104847.239225] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 \(1605188173\)
      [104847.480089] Lustre: DEBUG MARKER: == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 (1605188173)
      [104848.978709] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null || /usr/sbin/lctl lustre_build_version 2>/dev/null || /usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
      [104849.221429] Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests: chown_00 utimensat_08
      [104849.461831] Lustre: DEBUG MARKER: excepting tests: chown_00 utimensat_08
      [105031.551815] nfs: server trevis-29vm4 not responding, still trying
      [105055.615001] nfs: server trevis-29vm4 not responding, still trying
      [105055.616105] nfs: server trevis-29vm4 not responding, timed out
      

      but, if you looks at the last test suite that completed, parallel-scale-nfsv4 racer_on_nfs, you see a segfault

      [104509.071245] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 13:30:35 \(1605187835\)
      [104509.316855] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 13:30:35 (1605187835)
      [104509.542351] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
      [104545.372425] 10[30478]: segfault at 8 ip 00007fbf2f4b07e8 sp 00007fff363082c0 error 4 in ld-2.17.so[7fbf2f4a5000+22000]
      [104554.287433] 3[22368]: segfault at 8 ip 00007f0cf07fb7e8 sp 00007fffde597b50 error 4 in ld-2.17.so[7f0cf07f0000+22000]
      [104597.669765] 4[4673]: segfault at 8 ip 00007f12b63407e8 sp 00007fff862aa300 error 4 in ld-2.17.so[7f12b6335000+22000]
      [104815.417110] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      [104815.783689] Lustre: DEBUG MARKER: rc=0; val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ $? -eq 0 && $val -ne 0 ]]; then echo $(hostname -s): $val; rc=$val; fi; exit $rc
      [104817.231922] Lustre: DEBUG MARKER: dmesg
      [104818.127101] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 \(1605188144\)
      [104818.381560] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 (1605188144)
      [104818.597884] Lustre: DEBUG MARKER: umount -f /mnt/lustre
      

      Looking at the OSS (vm3) console log, we see LNet issues

      [104421.383451] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 13:30:35 (1605187835)
      [104541.739060] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
      [104541.740628] LNetError: Skipped 5 previous similar messages
      [104541.741587] LNetError: 21967:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
      [104541.743275] LNetError: 21967:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 5 previous similar messages
      [104541.744877] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
      [104541.747458] LNetError: Skipped 5 previous similar messages
      [104727.505977] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      [104728.383449] Lustre: DEBUG MARKER: rc=0;
      [104728.383449] 			val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      [104728.383449] 			if [[ $? -eq 0 && $val -ne 0 ]]; then
      [104728.383449] 				echo $(hostname -s): $val;
      [104728.383449] 				rc=$val;
      [104728.383449] 			fi;
      [104728.383449] 			exit $rc
      [104729.290527] Lustre: DEBUG MARKER: dmesg
      [104730.187533] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 \(1605188144\)
      [104730.428143] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 (1605188144)
      [104754.664116] Lustre: lustre-OST0000: Connection restored to 6ef95858-0684-3186-09bf-845f7502685d (at 10.9.5.101@tcp)
      [104754.666078] Lustre: Skipped 6 previous similar messages
      [104755.850798] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == rpc test complete, duration -o sec ================================================================ 13:36:09 \(1605188169\)
      [104756.089115] Lustre: DEBUG MARKER: == rpc test complete, duration -o sec ================================================================ 13:36:09 (1605188169)
      [104756.487966] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-29vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [104756.726639] Lustre: DEBUG MARKER: trevis-29vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [104757.052466] Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
      [104757.308588] Lustre: DEBUG MARKER: -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
      [104759.302601] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 \(1605188173\)
      [104759.557414] Lustre: DEBUG MARKER: == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 (1605188173)
      [104760.753243] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null ||
      [104760.753243] 				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
      [104760.753243] 				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
      [104761.283326] Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests: chown_00 utimensat_08
      [104761.523935] Lustre: DEBUG MARKER: excepting tests: chown_00 utimensat_08
      [104798.827334] Lustre: lustre-OST0003: haven't heard from client 6c4f2798-36a0-1283-9a2a-402a8215fdc6 (at 10.9.5.103@tcp) in 55 seconds. I think it's dead, and I am evicting it. exp ffff908f82928000, cur 1605188213 expire 1605188183 last 1605188158
      [104798.831028] Lustre: Skipped 6 previous similar messages
      [105141.722329] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
      [105141.723961] LNetError: Skipped 6 previous similar messages
      [105141.724931] LNetError: 21965:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
      [105141.726576] LNetError: 21965:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 6 previous similar messages
      [105141.728191] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
      [105141.730788] LNetError: Skipped 6 previous similar messages
      

      In the console log of the MDS (vm4), we see issues during and after parallel-scale-nfsv4 racer_on_nfs

      [104422.879162] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 13:30:35 (1605187835)
      [104428.673403] LustreError: 5794:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x3060:0x0] get parent: rc = -2
      [104428.675745] LustreError: 5794:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 31 previous similar messages
      [104445.738012] LustreError: 5795:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x3031:0x0] get parent: rc = -2
      [104485.005485] LustreError: 5799:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x324c:0x0] get parent: rc = -2
      [104485.008076] LustreError: 5799:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 16 previous similar messages
      [104566.409419] LustreError: 5796:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x3502:0x0] get parent: rc = -2
      [104566.411605] LustreError: 5796:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 63 previous similar messages
      [104704.226358] LustreError: 5794:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x37e0:0x0] get parent: rc = -2
      [104704.228861] LustreError: 5794:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 69 previous similar messages
      [104729.025141] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      [104730.349763] Lustre: DEBUG MARKER: rc=0;
      [104730.349763] 			val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      [104730.349763] 			if [[ $? -eq 0 && $val -ne 0 ]]; then
      [104730.349763] 				echo $(hostname -s): $val;
      [104730.349763] 				rc=$val;
      [104730.349763] 			fi;
      [104730.349763] 			exit $rc
      [104730.802926] Lustre: DEBUG MARKER: dmesg
      [104731.684167] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 \(1605188144\)
      [104731.944230] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 (1605188144)
      [104743.180814] Lustre: DEBUG MARKER: { [[ -e /etc/SuSE-release ]] &&
      [104743.180814] 				 service nfsserver stop; } ||
      [104743.180814] 				 service nfs stop
      [104743.428775] nfsd: last server has exited, flushing export cache
      [104743.619356] Lustre: DEBUG MARKER: sed -i '/lustre/d' /etc/exports
      [104744.102408] Lustre: DEBUG MARKER: exportfs -v
      [104744.498993] Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts
      [104744.870704] Lustre: DEBUG MARKER: lsof -t /mnt/lustre
      [104745.380995] Lustre: DEBUG MARKER: umount -f /mnt/lustre 2>&1
      [104745.568011] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
      [104745.569374] Lustre: Skipped 7 previous similar messages
      [104745.570520] LustreError: 25275:0:(file.c:216:ll_close_inode_openhandle()) lustre-clilmv-ffff96e419fc4000: inode [0x20006e9a6:0x351a:0x0] mdc close failed: rc = -108
      [104745.572860] LustreError: 25275:0:(file.c:216:ll_close_inode_openhandle()) Skipped 9 previous similar messages
      [104745.585316] Lustre: 3529:0:(llite_lib.c:2762:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.5.103@tcp:/lustre/fid: [0x20006e9a6:0x38cd:0x0]/ may get corrupted (rc -108)
      …
      [104745.630358] Lustre: 3529:0:(llite_lib.c:2762:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.5.103@tcp:/lustre/fid: [0x20006e9a6:0x3912:0x0]/ may get corrupted (rc -108)
      [104755.671260] Lustre: Unmounted lustre-client
      [104756.077342] Lustre: MGS: Connection restored to 43522391-bc41-831c-eaa2-9a67ae6d0c48 (at 10.9.5.101@tcp)
      [104756.079241] Lustre: Skipped 4 previous similar messages
      [104757.351305] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == rpc test complete, duration -o sec ================================================================ 13:36:09 \(1605188169\)
      [104757.588086] Lustre: DEBUG MARKER: == rpc test complete, duration -o sec ================================================================ 13:36:09 (1605188169)
      [104757.986890] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-29vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [104758.226754] Lustre: DEBUG MARKER: trevis-29vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [104758.553655] Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
      [104758.798368] Lustre: DEBUG MARKER: -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
      [104760.797535] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 \(1605188173\)
      [104761.044320] Lustre: DEBUG MARKER: == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 (1605188173)
      [104761.408777] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null ||
      [104761.408777] 				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
      [104761.408777] 				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
      [104761.849199] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null ||
      [104761.849199] 				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
      [104761.849199] 				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
      [104762.783875] Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests: chown_00 utimensat_08
      [104763.023125] Lustre: DEBUG MARKER: excepting tests: chown_00 utimensat_08
      [104796.252581] Lustre: lustre-MDT0000: haven't heard from client 6c4f2798-36a0-1283-9a2a-402a8215fdc6 (at 0@lo) in 50 seconds. I think it's dead, and I am evicting it. exp ffff96e3f8f80c00, cur 1605188208 expire 1605188178 last 1605188158
      

      Sometimes pjdfstest does not hang, but just fails; https://testing.whamcloud.com/test_sets/01ee37b4-627e-46cf-abc2-c097b2ce5b69.

      Attachments

        Issue Links

          Activity

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated: