Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13185

recovery-small test 101 hangs on OST failover/failback

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.12.3, Lustre 2.14.0
    • 3
    • 9223372036854775807

    Description

      Looking at the client test_log for the hang at https://testing.whamcloud.com/test_sets/c518a288-428c-11ea-b083-52540065bddc, the last thing printed before the hang is

      Started lustre-OST0000
      

      Looking at the OSS (vm6) console log, we see

      [31839.369248] Lustre: lustre-OST0000: Client e8a848d3-e0cf-379f-b910-5e7448e8c6de (at 10.9.0.83@tcp) reconnected, waiting for 3 clients in recovery for 0:57
      [31839.371703] Lustre: Skipped 24 previous similar messages
      [31860.379116] Lustre: lustre-OST0000: Client e8a848d3-e0cf-379f-b910-5e7448e8c6de (at 10.9.0.83@tcp) reconnected, waiting for 3 clients in recovery for 0:36
      [31860.381510] Lustre: Skipped 2 previous similar messages
      [31895.395237] Lustre: lustre-OST0000: Client e8a848d3-e0cf-379f-b910-5e7448e8c6de (at 10.9.0.83@tcp) reconnected, waiting for 3 clients in recovery for 0:01
      [31895.397640] Lustre: Skipped 4 previous similar messages
      [31896.600260] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
      [31896.601650] Lustre: lustre-OST0000: disconnecting 1 stale clients
      [31896.952742] Lustre: lustre-OST0000: Recovery over after 1:06, of 3 clients 2 recovered and 1 was evicted.
      [31896.955583] Lustre: lustre-OST0000: deleting orphan objects from 0x0:75598 to 0x0:75657
      [31988.475076] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
      [31988.476672] LNetError: Skipped 9 previous similar messages
      [31988.477615] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
      [31988.479254] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
      [31988.480974] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
      [31988.483511] LNetError: Skipped 9 previous similar messages
      [32313.626053] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
      [32313.628238] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 34 previous similar messages
      [32608.767216] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
      [32608.768897] LNetError: Skipped 9 previous similar messages
      [32608.770267] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
      [32608.772194] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
      [32608.774014] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
      [32608.777084] LNetError: Skipped 9 previous similar messages
      [32913.908804] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
      [32913.910952] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages
      [33239.062905] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
      [33239.064514] LNetError: Skipped 9 previous similar messages
      [33239.065469] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
      [33239.067127] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
      [33239.068739] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
      [33239.071405] LNetError: Skipped 9 previous similar messages
      [33524.195275] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
      [33524.197461] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages
      [33849.350362] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
      [33849.351963] LNetError: Skipped 9 previous similar messages
      [33849.352892] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
      [33849.354641] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
      [33849.356260] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
      [33849.358975] LNetError: Skipped 9 previous similar messages
      [34134.482730] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
      [34134.484846] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages
      [34459.636819] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
      [34459.638458] LNetError: Skipped 9 previous similar messages
      [34459.639402] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
      [34459.641150] LNetError: 13294:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
      [34459.642761] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
      [34459.645348] LNetError: Skipped 9 previous similar messages
      [34744.769144] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
      [34744.771342] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages
      [35069.924238] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
      [35069.925878] LNetError: Skipped 9 previous similar messages
      [35069.926814] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
      [35069.928553] LNetError: 13293:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
      [35069.930165] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
      [35069.932746] LNetError: Skipped 9 previous similar messages
      [35355.056588] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
      [35355.058740] LNetError: 13301:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages
      

      This test started failing on 13 AUG 2019 and has failed 100% of the time for PPC client testing and started with Lustre 2.12.2.115 and, so far, has only been seen for 2.12.3 and 2.12.4:
      https://testing.whamcloud.com/test_sets/d40a7fde-be73-11e9-a2b6-52540065bddc
      https://testing.whamcloud.com/test_sets/1e950984-cfe0-11e9-a2b6-52540065bddc

      Attachments

        Issue Links

          Activity

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated: