Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Affects Versions: Lustre 2.14.0, Lustre 2.12.6
Description
We’ve seen that racer_on_nfs in parallel-scale-nfsv3/4 can leave the NFS server unresponsive, so the next test suite is unable to run; see LU-9940. In the latest instance, pjdfstest, the suite that follows parallel-scale-nfsv4, is the one affected; https://testing.whamcloud.com/test_sets/177f113b-8697-4e90-8a5b-a6b9fc4960d4 .
In the client 1 (vm1) console log:
[104844.990816] Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
[104845.232371] Lustre: DEBUG MARKER: -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
[104847.239225] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 \(1605188173\)
[104847.480089] Lustre: DEBUG MARKER: == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 (1605188173)
[104848.978709] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null || /usr/sbin/lctl lustre_build_version 2>/dev/null || /usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
[104849.221429] Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests: chown_00 utimensat_08
[104849.461831] Lustre: DEBUG MARKER: excepting tests: chown_00 utimensat_08
[105031.551815] nfs: server trevis-29vm4 not responding, still trying
[105055.615001] nfs: server trevis-29vm4 not responding, still trying
[105055.616105] nfs: server trevis-29vm4 not responding, timed out
But if you look at the last test suite that completed, parallel-scale-nfsv4 racer_on_nfs, you see segfaults:
[104509.071245] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 13:30:35 \(1605187835\)
[104509.316855] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 13:30:35 (1605187835)
[104509.542351] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[104545.372425] 10[30478]: segfault at 8 ip 00007fbf2f4b07e8 sp 00007fff363082c0 error 4 in ld-2.17.so[7fbf2f4a5000+22000]
[104554.287433] 3[22368]: segfault at 8 ip 00007f0cf07fb7e8 sp 00007fffde597b50 error 4 in ld-2.17.so[7f0cf07f0000+22000]
[104597.669765] 4[4673]: segfault at 8 ip 00007f12b63407e8 sp 00007fff862aa300 error 4 in ld-2.17.so[7f12b6335000+22000]
[104815.417110] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[104815.783689] Lustre: DEBUG MARKER: rc=0; val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ $? -eq 0 && $val -ne 0 ]]; then echo $(hostname -s): $val; rc=$val; fi; exit $rc
[104817.231922] Lustre: DEBUG MARKER: dmesg
[104818.127101] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 \(1605188144\)
[104818.381560] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 (1605188144)
[104818.597884] Lustre: DEBUG MARKER: umount -f /mnt/lustre
Looking at the OSS (vm3) console log, we see LNet issues
[104421.383451] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 13:30:35 (1605187835)
[104541.739060] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[104541.740628] LNetError: Skipped 5 previous similar messages
[104541.741587] LNetError: 21967:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[104541.743275] LNetError: 21967:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 5 previous similar messages
[104541.744877] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[104541.747458] LNetError: Skipped 5 previous similar messages
[104727.505977] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[104728.383449] Lustre: DEBUG MARKER: rc=0;
[104728.383449] val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
[104728.383449] if [[ $? -eq 0 && $val -ne 0 ]]; then
[104728.383449] echo $(hostname -s): $val;
[104728.383449] rc=$val;
[104728.383449] fi;
[104728.383449] exit $rc
[104729.290527] Lustre: DEBUG MARKER: dmesg
[104730.187533] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 \(1605188144\)
[104730.428143] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 (1605188144)
[104754.664116] Lustre: lustre-OST0000: Connection restored to 6ef95858-0684-3186-09bf-845f7502685d (at 10.9.5.101@tcp)
[104754.666078] Lustre: Skipped 6 previous similar messages
[104755.850798] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == rpc test complete, duration -o sec ================================================================ 13:36:09 \(1605188169\)
[104756.089115] Lustre: DEBUG MARKER: == rpc test complete, duration -o sec ================================================================ 13:36:09 (1605188169)
[104756.487966] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-29vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[104756.726639] Lustre: DEBUG MARKER: trevis-29vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[104757.052466] Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
[104757.308588] Lustre: DEBUG MARKER: -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
[104759.302601] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 \(1605188173\)
[104759.557414] Lustre: DEBUG MARKER: == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 (1605188173)
[104760.753243] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null ||
[104760.753243] /usr/sbin/lctl lustre_build_version 2>/dev/null ||
[104760.753243] /usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
[104761.283326] Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests: chown_00 utimensat_08
[104761.523935] Lustre: DEBUG MARKER: excepting tests: chown_00 utimensat_08
[104798.827334] Lustre: lustre-OST0003: haven't heard from client 6c4f2798-36a0-1283-9a2a-402a8215fdc6 (at 10.9.5.103@tcp) in 55 seconds. I think it's dead, and I am evicting it. exp ffff908f82928000, cur 1605188213 expire 1605188183 last 1605188158
[104798.831028] Lustre: Skipped 6 previous similar messages
[105141.722329] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[105141.723961] LNetError: Skipped 6 previous similar messages
[105141.724931] LNetError: 21965:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[105141.726576] LNetError: 21965:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 6 previous similar messages
[105141.728191] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[105141.730788] LNetError: Skipped 6 previous similar messages
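The "Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI" messages mean something on the OSS itself is connecting to the LNet acceptor over loopback without presenting a usable NID. A minimal sketch of what I would check on vm3, assuming the acceptor port 7988 from the log above (output formats will differ by version):

# NIDs this node actually advertises; 0.0.0.0@tcp should not be among them
lctl list_nids
lnetctl net show

# which local process is poking the LNet acceptor port over loopback
ss -tnp | grep 7988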
In the console log of the MDS (vm4), we see issues during and after parallel-scale-nfsv4 racer_on_nfs
[104422.879162] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 13:30:35 (1605187835)
[104428.673403] LustreError: 5794:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x3060:0x0] get parent: rc = -2
[104428.675745] LustreError: 5794:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 31 previous similar messages
[104445.738012] LustreError: 5795:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x3031:0x0] get parent: rc = -2
[104485.005485] LustreError: 5799:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x324c:0x0] get parent: rc = -2
[104485.008076] LustreError: 5799:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 16 previous similar messages
[104566.409419] LustreError: 5796:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x3502:0x0] get parent: rc = -2
[104566.411605] LustreError: 5796:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 63 previous similar messages
[104704.226358] LustreError: 5794:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x37e0:0x0] get parent: rc = -2
[104704.228861] LustreError: 5794:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 69 previous similar messages
[104729.025141] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[104730.349763] Lustre: DEBUG MARKER: rc=0;
[104730.349763] val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
[104730.349763] if [[ $? -eq 0 && $val -ne 0 ]]; then
[104730.349763] echo $(hostname -s): $val;
[104730.349763] rc=$val;
[104730.349763] fi;
[104730.349763] exit $rc
[104730.802926] Lustre: DEBUG MARKER: dmesg
[104731.684167] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 \(1605188144\)
[104731.944230] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 (1605188144)
[104743.180814] Lustre: DEBUG MARKER: { [[ -e /etc/SuSE-release ]] &&
[104743.180814] service nfsserver stop; } ||
[104743.180814] service nfs stop
[104743.428775] nfsd: last server has exited, flushing export cache
[104743.619356] Lustre: DEBUG MARKER: sed -i '/lustre/d' /etc/exports
[104744.102408] Lustre: DEBUG MARKER: exportfs -v
[104744.498993] Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts
[104744.870704] Lustre: DEBUG MARKER: lsof -t /mnt/lustre
[104745.380995] Lustre: DEBUG MARKER: umount -f /mnt/lustre 2>&1
[104745.568011] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
[104745.569374] Lustre: Skipped 7 previous similar messages
[104745.570520] LustreError: 25275:0:(file.c:216:ll_close_inode_openhandle()) lustre-clilmv-ffff96e419fc4000: inode [0x20006e9a6:0x351a:0x0] mdc close failed: rc = -108
[104745.572860] LustreError: 25275:0:(file.c:216:ll_close_inode_openhandle()) Skipped 9 previous similar messages
[104745.585316] Lustre: 3529:0:(llite_lib.c:2762:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.5.103@tcp:/lustre/fid: [0x20006e9a6:0x38cd:0x0]/ may get corrupted (rc -108)
…
[104745.630358] Lustre: 3529:0:(llite_lib.c:2762:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.5.103@tcp:/lustre/fid: [0x20006e9a6:0x3912:0x0]/ may get corrupted (rc -108)
[104755.671260] Lustre: Unmounted lustre-client
[104756.077342] Lustre: MGS: Connection restored to 43522391-bc41-831c-eaa2-9a67ae6d0c48 (at 10.9.5.101@tcp)
[104756.079241] Lustre: Skipped 4 previous similar messages
[104757.351305] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == rpc test complete, duration -o sec ================================================================ 13:36:09 \(1605188169\)
[104757.588086] Lustre: DEBUG MARKER: == rpc test complete, duration -o sec ================================================================ 13:36:09 (1605188169)
[104757.986890] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-29vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[104758.226754] Lustre: DEBUG MARKER: trevis-29vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[104758.553655] Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
[104758.798368] Lustre: DEBUG MARKER: -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
[104760.797535] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 \(1605188173\)
[104761.044320] Lustre: DEBUG MARKER: == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 (1605188173)
[104761.408777] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null ||
[104761.408777] /usr/sbin/lctl lustre_build_version 2>/dev/null ||
[104761.408777] /usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
[104761.849199] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null ||
[104761.849199] /usr/sbin/lctl lustre_build_version 2>/dev/null ||
[104761.849199] /usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
[104762.783875] Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests: chown_00 utimensat_08
[104763.023125] Lustre: DEBUG MARKER: excepting tests: chown_00 utimensat_08
[104796.252581] Lustre: lustre-MDT0000: haven't heard from client 6c4f2798-36a0-1283-9a2a-402a8215fdc6 (at 0@lo) in 50 seconds. I think it's dead, and I am evicting it. exp ffff96e3f8f80c00, cur 1605188208 expire 1605188178 last 1605188158
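For what it's worth, the ll_dir_get_parent_fid() errors above are rc = -2 (-ENOENT) on specific FIDs, presumably directories racer had already renamed or unlinked. A hedged sketch of how one could map such a FID back to a path from a node that still has a Lustre client mount (mount point and FID copied from the logs; if racer already removed the object, fid2path will itself return ENOENT):

# resolve a FID from the console log to a path under the client mount
lfs fid2path /mnt/lustre "[0x20006e9a6:0x3060:0x0]"

# reverse check: confirm the FID of a surviving path
lfs path2fid /mnt/lustre/d0.parallel-scale-nfs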
Sometimes pjdfstest does not hang, but simply fails; see https://testing.whamcloud.com/test_sets/01ee37b4-627e-46cf-abc2-c097b2ce5b69.
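Given that the follow-on symptom is just "nfs: server trevis-29vm4 not responding", it may be worth probing the NFS export between suites before starting the next one. A rough sketch of such a check from the client side, using standard rpcinfo/showmount utilities (hostname taken from the log; this is only an illustration, not part of the current test framework):

# is the NFS service still registered with rpcbind and answering?
rpcinfo -t trevis-29vm4 nfs
rpcinfo -t trevis-29vm4 mountd

# is the Lustre re-export still published?
showmount -e trevis-29vm4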
Issue Links
- is related to LU-9940 "posix no sub tests failed:" (Closed)