[LU-14524] sanity-lnet hangs with no tests executed Created: 15/Mar/21  Updated: 07/Sep/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3

Description

sanity-lnet hangs before any tests are run. We first saw this on 2020-01-12 with Lustre 2.13.51: https://testing.whamcloud.com/test_sets/ddd3b65c-354c-11ea-b1e8-52540065bddc. Since then, we have seen this issue 156 times.

Looking at the latest hang at https://testing.whamcloud.com/test_sets/afb02057-aedb-416c-82d1-0d525970b348 (Lustre 2.14.50.160), the errors seen in the client console logs have many similarities with LU-14137, such as the issues with parallel-scale-nfsv4:

[72216.586458] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 08:17:04 (1615796224)
[72217.032596] Lustre: DEBUG MARKER: MDSCOUNT=4 OSTCOUNT=8 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[72419.653697] 8[2606317]: segfault at 8 ip 00007f5294e392b1 sp 00007fff8676ac00 error 4 in ld-2.28.so[7f5294e2d000+29000]
[72419.655815] Code: c0 0f 85 3a 15 00 00 49 8b 83 f0 00 00 00 48 89 85 18 ff ff ff 48 85 c0 0f 85 d0 13 00 00 49 8b 43 68 49 83 bb f8 00 00 00 00 <48> 8b 40 08 48 89 85 40 ff ff ff 0f 84 4e 0c 00 00 45 85 ed 74 59
[72445.594356] 3[2673643]: segfault at 1b00 ip 0000000000001b00 sp 00007fff201d65e0 error 14
[72445.596591] Code: Bad RIP value.
[72475.776786] 4[2757236]: segfault at 8 ip 00007facc802f2b1 sp 00007ffc4feea130 error 4 in ld-2.28.so[7facc8023000+29000]
[72475.779667] Code: c0 0f 85 3a 15 00 00 49 8b 83 f0 00 00 00 48 89 85 18 ff ff ff 48 85 c0 0f 85 d0 13 00 00 49 8b 43 68 49 83 bb f8 00 00 00 00 <48> 8b 40 08 48 89 85 40 ff ff ff 0f 84 4e 0c 00 00 45 85 ed 74 59
[72523.533273] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
[72524.743731] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param catastrophe 2>&1
[72528.080347] Lustre: DEBUG MARKER: dmesg
[72529.332855] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test complete, duration 2680 sec ============================================= 08:22:17 \(1615796537\)

and

[75656.614594] Lustre: DEBUG MARKER: -----============= acceptance-small: sanity-lnet ============----- Mon Mar 15 09:14:37 UTC 2021
[75657.640832] nfs: server onyx-40vm6 not responding, timed out
[75659.885276] Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests: 
[75660.401362] Lustre: DEBUG MARKER: excepting tests:
[75660.873522] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
[75660.873522] if [ $running -ne 0 ] ; then
[75660.873522] echo Stopping client $(hostname) /mnt/lustre opts:;
[75660.873522] lsof /mnt/lustre || need_kill=no;
[75660.873522] if [ x != x -a x$need_kill != xno ]; then
[75660.873522]     pids=$(lsof -t /mnt/lustre | sort -u);
[75660.873522]     if 
[75668.904042] nfs: server onyx-40vm6 not responding, timed out
[75678.119567] nfs: server onyx-40vm6 not responding, timed out

However, the difference here is that pjdfstest runs between parallel-scale-nfsv4 and sanity-lnet, and pjdfstest seems to run with no errors.
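
For triage, a minimal sketch of how a client console log could be scanned for the two signatures quoted above (user-space segfaults and "nfs: server ... not responding" messages), grouped by the test suite named in the most recent DEBUG MARKER line. The function name `summarize_console` and the regular expressions are assumptions of this example, derived only from the log lines in this ticket; they are not part of the Lustre test framework.

```python
import re
from collections import Counter

# Patterns are assumptions based on the console lines quoted in this ticket.
# Suite start markers look like either
#   "DEBUG MARKER: -----===== acceptance-small: <suite> =====-----" or
#   "DEBUG MARKER: == <suite> test <name>: ..."
MARKER_RE = re.compile(
    r"DEBUG MARKER: (?:-----=+ acceptance-small: (\S+) =+-----|== (\S+) test )"
)
NFS_RE = re.compile(r"nfs: server (\S+) not responding")
SEGV_RE = re.compile(r"\]: segfault at ")


def summarize_console(lines):
    """Count segfaults and NFS timeouts per suite seen in DEBUG MARKER lines."""
    suite = "unknown"  # used until the first suite marker is seen
    counts = Counter()
    for line in lines:
        m = MARKER_RE.search(line)
        if m:
            suite = m.group(1) or m.group(2)
            continue
        if NFS_RE.search(line):
            counts[(suite, "nfs_timeout")] += 1
        elif SEGV_RE.search(line):
            counts[(suite, "segfault")] += 1
    return dict(counts)
```

Run against the excerpts above, this would attribute the segfaults to parallel-scale-nfsv4 and the NFS timeouts to sanity-lnet, which matches the ordering described in this ticket. It does not establish that the two are causally related.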

