[LU-14524] sanity-lnet hangs with no tests executed Created: 15/Mar/21 Updated: 07/Sep/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanity-lnet hangs before any tests are run. The first time we saw this happen was on 2020-01-12 for Lustre 2.13.51; https://testing.whamcloud.com/test_sets/ddd3b65c-354c-11ea-b1e8-52540065bddc. Since that time, we’ve seen this issue 156 times.

Looking at the latest hang at https://testing.whamcloud.com/test_sets/afb02057-aedb-416c-82d1-0d525970b348, Lustre 2.14.50.160, there are many similarities with the errors seen in the client console logs and with LU-14137, like issues with parallel-scale-nfsv4:

[72216.586458] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 08:17:04 (1615796224)
[72217.032596] Lustre: DEBUG MARKER: MDSCOUNT=4 OSTCOUNT=8 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[72419.653697] 8[2606317]: segfault at 8 ip 00007f5294e392b1 sp 00007fff8676ac00 error 4 in ld-2.28.so[7f5294e2d000+29000]
[72419.655815] Code: c0 0f 85 3a 15 00 00 49 8b 83 f0 00 00 00 48 89 85 18 ff ff ff 48 85 c0 0f 85 d0 13 00 00 49 8b 43 68 49 83 bb f8 00 00 00 00 <48> 8b 40 08 48 89 85 40 ff ff ff 0f 84 4e 0c 00 00 45 85 ed 74 59
[72445.594356] 3[2673643]: segfault at 1b00 ip 0000000000001b00 sp 00007fff201d65e0 error 14
[72445.596591] Code: Bad RIP value.
[72475.776786] 4[2757236]: segfault at 8 ip 00007facc802f2b1 sp 00007ffc4feea130 error 4 in ld-2.28.so[7facc8023000+29000]
[72475.779667] Code: c0 0f 85 3a 15 00 00 49 8b 83 f0 00 00 00 48 89 85 18 ff ff ff 48 85 c0 0f 85 d0 13 00 00 49 8b 43 68 49 83 bb f8 00 00 00 00 <48> 8b 40 08 48 89 85 40 ff ff ff 0f 84 4e 0c 00 00 45 85 ed 74 59
[72523.533273] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[72524.743731] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param catastrophe 2>&1
[72528.080347] Lustre: DEBUG MARKER: dmesg
[72529.332855] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test complete, duration 2680 sec ============================================= 08:22:17 \(1615796537\)

and

[75656.614594] Lustre: DEBUG MARKER: -----============= acceptance-small: sanity-lnet ============----- Mon Mar 15 09:14:37 UTC 2021
[75657.640832] nfs: server onyx-40vm6 not responding, timed out
[75659.885276] Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests:
[75660.401362] Lustre: DEBUG MARKER: excepting tests:
[75660.873522] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
[75660.873522] if [ $running -ne 0 ] ; then
[75660.873522] echo Stopping client $(hostname) /mnt/lustre opts:;
[75660.873522] lsof /mnt/lustre || need_kill=no;
[75660.873522] if [ x != x -a x$need_kill != xno ]; then
[75660.873522] pids=$(lsof -t /mnt/lustre | sort -u);
[75660.873522] if
[75668.904042] nfs: server onyx-40vm6 not responding, timed out
[75678.119567] nfs: server onyx-40vm6 not responding, timed out

The difference here, however, is that pjdfstest runs between parallel-scale-nfsv4 and sanity-lnet, and pjdfstest seems to run with no errors. |
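
For triaging other occurrences, the comparison above (segfaults during parallel-scale-nfsv4, NFS timeouts once sanity-lnet starts) could be automated with something like the sketch below. It is not part of the Lustre test framework. The function name `summarize` and the regular expressions are assumptions based only on the line shapes in the console excerpts in this ticket, and may need adjusting for other kernels or log formats.

```python
import re
from collections import Counter

# Patterns are assumptions derived from the console lines quoted in this ticket.
# A suite starts either with an "acceptance-small: <suite>" banner or with a
# "== <suite> test ..." marker printed directly after "DEBUG MARKER: ".
MARKER = re.compile(
    r"Lustre: DEBUG MARKER: (?:-----=+ acceptance-small: (\S+)|== (\S+) test )"
)
NFS_TIMEOUT = re.compile(r"nfs: server \S+ not responding, timed out")
SEGFAULT = re.compile(r"\]: segfault at ")


def summarize(lines):
    """Count NFS timeouts and segfaults per test suite in console log lines."""
    suite = "(before first marker)"
    counts = Counter()
    for line in lines:
        m = MARKER.search(line)
        if m:
            # Exactly one of the two groups is set, depending on marker style.
            suite = m.group(1) or m.group(2)
        if NFS_TIMEOUT.search(line):
            counts[(suite, "nfs_timeout")] += 1
        elif SEGFAULT.search(line):
            counts[(suite, "segfault")] += 1
    return counts
```

Feeding it the lines of a saved client console log (for example `summarize(open(path))`, where `path` is a hypothetical local copy of the log) returns a `Counter` keyed by `(suite, event)`, which makes it easy to see whether the NFS timeouts begin before or after the sanity-lnet banner.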