LU-14524: sanity-lnet hangs with no tests executed

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.14.0, Lustre 2.15.0
    • None
    • 3

    Description

      sanity-lnet hangs before any tests are run. We first saw this on 2020-01-12 with Lustre 2.13.51: https://testing.whamcloud.com/test_sets/ddd3b65c-354c-11ea-b1e8-52540065bddc. Since then, we have seen this issue 156 times.

      Looking at the latest hang at https://testing.whamcloud.com/test_sets/afb02057-aedb-416c-82d1-0d525970b348, Lustre 2.14.50.160, the errors seen in the client console logs have many similarities with LU-14137, such as issues with parallel-scale-nfsv4:

      [72216.586458] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 08:17:04 (1615796224)
      [72217.032596] Lustre: DEBUG MARKER: MDSCOUNT=4 OSTCOUNT=8 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
      [72419.653697] 8[2606317]: segfault at 8 ip 00007f5294e392b1 sp 00007fff8676ac00 error 4 in ld-2.28.so[7f5294e2d000+29000]
      [72419.655815] Code: c0 0f 85 3a 15 00 00 49 8b 83 f0 00 00 00 48 89 85 18 ff ff ff 48 85 c0 0f 85 d0 13 00 00 49 8b 43 68 49 83 bb f8 00 00 00 00 <48> 8b 40 08 48 89 85 40 ff ff ff 0f 84 4e 0c 00 00 45 85 ed 74 59
      [72445.594356] 3[2673643]: segfault at 1b00 ip 0000000000001b00 sp 00007fff201d65e0 error 14
      [72445.596591] Code: Bad RIP value.
      [72475.776786] 4[2757236]: segfault at 8 ip 00007facc802f2b1 sp 00007ffc4feea130 error 4 in ld-2.28.so[7facc8023000+29000]
      [72475.779667] Code: c0 0f 85 3a 15 00 00 49 8b 83 f0 00 00 00 48 89 85 18 ff ff ff 48 85 c0 0f 85 d0 13 00 00 49 8b 43 68 49 83 bb f8 00 00 00 00 <48> 8b 40 08 48 89 85 40 ff ff ff 0f 84 4e 0c 00 00 45 85 ed 74 59
      [72523.533273] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      [72524.743731] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param catastrophe 2>&1
      [72528.080347] Lustre: DEBUG MARKER: dmesg
      [72529.332855] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test complete, duration 2680 sec ============================================= 08:22:17 \(1615796537\)
      

      and

      [75656.614594] Lustre: DEBUG MARKER: -----============= acceptance-small: sanity-lnet ============----- Mon Mar 15 09:14:37 UTC 2021
      [75657.640832] nfs: server onyx-40vm6 not responding, timed out
      [75659.885276] Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests: 
      [75660.401362] Lustre: DEBUG MARKER: excepting tests:
      [75660.873522] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
      [75660.873522] if [ $running -ne 0 ] ; then
      [75660.873522] echo Stopping client $(hostname) /mnt/lustre opts:;
      [75660.873522] lsof /mnt/lustre || need_kill=no;
      [75660.873522] if [ x != x -a x$need_kill != xno ]; then
      [75660.873522]     pids=$(lsof -t /mnt/lustre | sort -u);
      [75660.873522]     if 
      [75668.904042] nfs: server onyx-40vm6 not responding, timed out
      [75678.119567] nfs: server onyx-40vm6 not responding, timed out
      

      However, the difference here is that pjdfstest runs between parallel-scale-nfsv4 and sanity-lnet, and pjdfstest seems to run with no errors.
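
      To compare hangs across test sessions, the two symptoms quoted above (user-space segfaults during the racer run, and repeated "nfs: server ... not responding" messages once sanity-lnet starts) could be tallied from a client console log. The following is only a minimal sketch: the function name and the regular expressions are hypothetical, written against the log excerpts in this ticket, and may need adjusting for other kernel or Lustre log formats.

```python
import re

# Hypothetical patterns derived only from the console excerpts in this ticket.
MARKER_RE = re.compile(r"^\[\s*\d+\.\d+\] Lustre: DEBUG MARKER: (.*)$")
SUITE_RE = re.compile(r"acceptance-small: (\S+)")
SEGFAULT_RE = re.compile(r"^\[\s*\d+\.\d+\] \S+\[\d+\]: segfault at ")
NFS_TIMEOUT_RE = re.compile(
    r"^\[\s*\d+\.\d+\] nfs: server \S+ not responding, timed out"
)


def summarize_console_log(lines):
    """Count segfaults and NFS timeouts in a client console log.

    NFS timeouts are also grouped by the most recent
    'acceptance-small: <suite>' debug marker seen before them
    (None if no such marker has been seen yet).
    """
    summary = {"segfaults": 0, "nfs_timeouts": 0, "timeouts_by_suite": {}}
    current_suite = None
    for raw in lines:
        line = raw.strip()
        marker = MARKER_RE.match(line)
        if marker:
            # Track which test suite is starting, if this marker names one.
            suite = SUITE_RE.search(marker.group(1))
            if suite:
                current_suite = suite.group(1)
            continue
        if SEGFAULT_RE.match(line):
            summary["segfaults"] += 1
        elif NFS_TIMEOUT_RE.match(line):
            summary["nfs_timeouts"] += 1
            by_suite = summary["timeouts_by_suite"]
            by_suite[current_suite] = by_suite.get(current_suite, 0) + 1
    return summary
```

      Feeding it the lines of a saved console log (for example, `summarize_console_log(open(path))`, where `path` is wherever the log was downloaded to) gives per-session counts that can be compared between hanging and passing runs.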

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)