LU-10348

replay-ost-single test 5 hangs during OST fail/recovery


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.11.0, Lustre 2.10.4
    • Labels: None
    • Severity: 3

    Description

      replay-ost-single test_5 hangs after starting iozone and waiting for the failed OST to start. Logs for this hang are at https://testing.hpdd.intel.com/test_sets/fd70c104-dae4-11e7-8027-52540065bddc

      There's not much in the console and dmesg logs to determine what the issue is.
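
      For reference, test_5 is essentially "fail the OST while iozone is writing": it starts iozone in the background, fails over ost1, and then waits for iozone to complete. A rough sketch of that flow, paraphrased from the behavior seen in the logs rather than the exact replay-ost-single.sh code (the iozone options and file name here are illustrative):

        # fail() and error() are the usual test-framework helpers
        iozone -i 0 -i 1 -r 4 -s $((10240 * OSTCOUNT)) -f $DIR/$tfile &
        pid=$!
        sleep 8                  # let iozone get some I/O in flight
        fail ost1                # umount, reboot the facet, remount, wait for clients
        wait $pid || error "iozone failed during OST failover"

      In the run above the OST restarts and both clients reconnect, so the test appears to get stuck somewhere after the failover completes, presumably in iozone or the final wait.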

      The last thing seen in the test log on the client after failing the OST is:

      Failing ost1 on trevis-3vm3
      CMD: trevis-3vm3 grep -c /mnt/lustre-ost1' ' /proc/mounts
      Stopping /mnt/lustre-ost1 (opts:) on trevis-3vm3
      CMD: trevis-3vm3 umount -d /mnt/lustre-ost1
      CMD: trevis-3vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      reboot facets: ost1
      Failover ost1 to trevis-3vm3
      06:46:24 (1512542784) waiting for trevis-3vm3 network 900 secs ...
      06:46:24 (1512542784) network interface is UP
      CMD: trevis-3vm3 hostname
      mount facets: ost1
      CMD: trevis-3vm3 test -b /dev/lvm-Role_OSS/P1
      CMD: trevis-3vm3 e2label /dev/lvm-Role_OSS/P1
      Starting ost1:   /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1
      CMD: trevis-3vm3 mkdir -p /mnt/lustre-ost1; mount -t lustre   		                   /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1
      CMD: trevis-3vm3 /usr/sbin/lctl get_param -n health_check
      CMD: trevis-3vm3 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all\" 4 
      trevis-3vm3: trevis-3vm3.trevis.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      CMD: trevis-3vm3 e2label /dev/lvm-Role_OSS/P1 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      CMD: trevis-3vm3 e2label /dev/lvm-Role_OSS/P1 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      CMD: trevis-3vm3 e2label /dev/lvm-Role_OSS/P1 2>/dev/null
      Started lustre-OST0000
      CMD: trevis-3vm1.trevis.hpdd.intel.com,trevis-3vm2 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid 
      trevis-3vm1: trevis-3vm1.trevis.hpdd.intel.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid
      trevis-3vm2: trevis-3vm2.trevis.hpdd.intel.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid
      trevis-3vm1: CMD: trevis-3vm1.trevis.hpdd.intel.com lctl get_param -n at_max
      trevis-3vm2: CMD: trevis-3vm2.trevis.hpdd.intel.com lctl get_param -n at_max
      trevis-3vm2: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 2 sec
      trevis-3vm1: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 2 sec
      

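      The wait_import_state_mount calls above are the test framework polling the client-side OSC import until it reports FULL, i.e. until the client has reconnected to the restarted OST; the at_max lookups are used to size that wait. A hypothetical simplification of the loop (the real helper is wait_import_state() in test-framework.sh; the parameter handling below is illustrative):

        # Poll an import-state parameter until it reports FULL or we time out.
        wait_import_full() {
            local param=$1                    # e.g. osc.lustre-OST0000-osc-*.ost_server_uuid
            local max=${2:-$(lctl get_param -n at_max)}
            local i
            for ((i = 0; i < max; i++)); do
                lctl get_param -n $param 2>/dev/null | grep -q FULL && return 0
                sleep 1
            done
            return 1                          # never reached FULL
        }

      Both clients report FULL after 2 seconds, so the reconnect itself does not look like the part that is stuck.
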
      Looking at the OST (vm3) console log, we see the following, and then the node reboots:

      [52386.276689] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-3vm1.trevis.hpdd.intel.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid
      [52386.308628] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-3vm2.trevis.hpdd.intel.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid
      [52386.442463] Lustre: DEBUG MARKER: trevis-3vm1.trevis.hpdd.intel.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid
      [52386.593174] Lustre: DEBUG MARKER: trevis-3vm2.trevis.hpdd.intel.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid
      [52388.284557] Lustre: lustre-OST0000: deleting orphan objects from 0x0:12940 to 0x0:12969
      [52388.881925] Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 2 sec
      [52389.064745] Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 2 sec
      [52389.105603] Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 2 sec
      [52392.356528] Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 2 sec
      [53031.378937] LNet: Service thread pid 16692 completed after 53.53s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [53033.710662] LNet: Service thread pid 18539 completed after 55.86s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [53074.558626] LNet: Service thread pid 1576 completed after 76.94s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [53290.176331] LNet: Service thread pid 16692 completed after 42.12s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [53584.271884] LNet: Service thread pid 18539 completed after 77.68s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [53808.647337] LNet: Service thread pid 1575 completed after 97.54s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [53864.707967] LNet: Service thread pid 18534 completed after 93.30s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [54237.094039] LNet: Service thread pid 1576 completed after 188.49s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [54335.118180] LNet: Service thread pid 18535 completed after 195.92s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [54527.722072] LNet: Service thread pid 18537 completed after 70.42s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [54527.728118] LNet: Skipped 1 previous similar message
      [54994.845138] LNet: Service thread pid 20092 completed after 133.76s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [54994.851195] LNet: Skipped 2 previous similar messages
      [55644.672185] LNet: Service thread pid 18535 completed after 87.13s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [55644.678120] LNet: Skipped 4 previous similar messages
      [56366.669572] LNet: Service thread pid 18534 completed after 108.69s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [56366.675716] LNet: Skipped 8 previous similar messages
      [57203.664122] LNet: Service thread pid 18534 completed after 87.49s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [57203.670055] LNet: Skipped 6 previous similar messages
      [57422.974061] Lustre: 4713:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1512547818/real 1512547818]  req@ffff88000be18300 x1585999144680672/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.9.4.19@tcp:12/10 lens 224/224 e 0 to 1 dl 1512547825 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [57422.983336] Lustre: 4713:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 35 previous similar messages
      [57422.986340] Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.9.4.19@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [57422.992053] Lustre: Skipped 6 previous similar messages
      [57422.994697] LustreError: 166-1: MGC10.9.4.19@tcp: Connection to MGS (at 10.9.4.19@tcp) was lost; in progress operations using this service will fail
      
      <ConMan> Console [trevis-3vm3] disconnected from <trevis-3:6002> at 12-06 08:10.
      

      Some of these failures don't include the 'too many service threads' (system overloaded) messages in the OST logs.

      This hang looks different from LU-9273.

      Other logs for this test hang are at:
      https://testing.hpdd.intel.com/test_sets/e12a8174-dacb-11e7-8027-52540065bddc
      https://testing.hpdd.intel.com/test_sets/c7b476bc-d7b7-11e7-a066-52540065bddc
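
      Since the console and dmesg logs don't show much, one thing that might help if this reproduces (a suggestion only; none of this was captured in these runs) is to widen the debug mask before the failover and then dump task stacks and the Lustre debug buffer on the client and OSS while the test is hung:

        lctl set_param debug=-1 debug_mb=512         # before the failover: full debug mask, bigger buffer
        # ... once the test is hung, on the stuck node(s):
        echo t > /proc/sysrq-trigger                 # dump all task stacks to the console/dmesg
        lctl dk /tmp/lustre-debug.$(hostname).log    # save the Lustre kernel debug buffer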

    People

      Assignee: WC Triage
      Reporter: James Nunez (Inactive)
      Votes: 0
      Watchers: 2
