LU-11762

replay-single test 0d fails with 'post-failover df failed'


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.2
    • Labels: None
    • Severity: 3

    Description

      replay-single test_0d fails with 'post-failover df failed' because all clients are evicted and do not recover. Looking at the logs from a recent failure, https://testing.whamcloud.com/test_sets/d34a9c44-fd82-11e8-b970-52540065bddc, in the client test_log we see there is a problem mounting the file system on the second client (vm4):

      Started lustre-MDT0000
      Starting client: trevis-26vm3:  -o user_xattr,flock trevis-26vm6@tcp:/lustre /mnt/lustre
      CMD: trevis-26vm3 mkdir -p /mnt/lustre
      CMD: trevis-26vm3 mount -t lustre -o user_xattr,flock trevis-26vm6@tcp:/lustre /mnt/lustre
      trevis-26vm4: error: invalid path '/mnt/lustre': Input/output error
       replay-single test_0d: @@@@@@ FAIL: post-failover df failed 
      
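      For context, the failing check amounts to running df against the mount point on every client after the failover. Below is a minimal sketch of that check, not the actual test-framework.sh code; the CLIENTS variable and the helper name are hypothetical:

      check_post_failover_df() {
          local mnt=${1:-/mnt/lustre}
          local rc=0
          # df forces a statfs RPC; it fails with EIO (-5) when the client
          # was evicted during recovery and could not reconnect, which is
          # exactly the "invalid path ... Input/output error" seen above.
          for node in $CLIENTS; do     # CLIENTS: space-separated client hostnames
              ssh "$node" "df $mnt" > /dev/null || {
                  echo "post-failover df failed on $node"
                  rc=1
              }
          done
          return $rc
      }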

      Looking at the dmesg log from client 2 (vm4), we see the following errors:

      [44229.221245] LustreError: 166-1: MGC10.9.5.67@tcp: Connection to MGS (at 10.9.5.67@tcp) was lost; in progress operations using this service will fail
      [44254.268743] Lustre: Evicted from MGS (at 10.9.5.67@tcp) after server handle changed from 0x306f28dc59d36b9 to 0x306f28dc59d3cc4
      [44425.483787] LustreError: 11-0: lustre-MDT0000-mdc-ffff88007a5ac800: operation mds_reint to node 10.9.5.67@tcp failed: rc = -107
      [44429.540695] LustreError: 167-0: lustre-MDT0000-mdc-ffff88007a5ac800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
      [44429.542381] LustreError: 29222:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
      [44429.542384] LustreError: 29222:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 15 previous similar messages
      [44429.547526] Lustre: lustre-MDT0000-mdc-ffff88007a5ac800: Connection restored to 10.9.5.67@tcp (at 10.9.5.67@tcp)
      [44429.547533] Lustre: Skipped 1 previous similar message
      [44429.613758] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_0d: @@@@@@ FAIL: post-failover df failed 
      
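      On the evicted client, the state of the MDC import confirms what the console log shows; lctl get_param on the import file is standard on any Lustre client, though the exact output format varies by release:

      # A healthy mount reports "state: FULL"; an evicted client shows
      # EVICTED (or DISCONN while it retries the connection):
      lctl get_param mdc.lustre-MDT0000-mdc-*.import | grep -E 'state:|target:'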

      In the dmesg log for the MDS (vm6), we see:

      [44131.617072] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
      [44135.460894] Lustre: 2440:0:(ldlm_lib.c:1771:extend_recovery_timer()) lustre-MDT0000: extended recovery timer reaching hard limit: 180, extend: 0
      [44196.726935] Lustre: lustre-MDT0000: Denying connection for new client f33a3fe0-b38c-7f20-7b19-3c32e6a1bff3(at 10.9.5.64@tcp), waiting for 2 known clients (0 recovered, 1 in progress, and 0 evicted) already passed deadline 3:05
      [44196.728849] Lustre: Skipped 21 previous similar messages
      [44311.673038] Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
      [44311.673797] Lustre: lustre-MDT0000: disconnecting 1 stale clients
      [44311.674391] Lustre: Skipped 1 previous similar message
      [44311.675031] Lustre: 2500:0:(ldlm_lib.c:1771:extend_recovery_timer()) lustre-MDT0000: extended recovery timer reaching hard limit: 180, extend: 1
      [44311.676331] Lustre: 2500:0:(ldlm_lib.c:2048:target_recovery_overseer()) lustre-MDT0000 recovery is aborted by hard timeout
      [44311.677355] Lustre: 2500:0:(ldlm_lib.c:2048:target_recovery_overseer()) Skipped 2 previous similar messages
      [44311.678318] Lustre: 2500:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      [44311.679301] Lustre: 2500:0:(ldlm_lib.c:2058:target_recovery_overseer()) Skipped 2 previous similar messages
      [44311.680369] LustreError: 2500:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 0 != fo_tot_granted 2097152
      [44311.681531] Lustre: 2500:0:(ldlm_lib.c:1617:abort_req_replay_queue()) @@@ aborted:  req@ffff922b644d6400 x1619523210909360/t0(12884901890) o36->94cd1843-54cb-a4d4-a0d3-b3519f2b7d2a@10.9.5.65@tcp:356/0 lens 512/0 e 3 to 0 dl 1544506121 ref 1 fl Complete:/4/ffffffff rc 0/-1
      [44311.739670] Lustre: lustre-MDT0000: Recovery over after 3:00, of 2 clients 0 recovered and 2 were evicted.
      [44311.930592] Lustre: lustre-MDT0000: Connection restored to e9848982-35c6-9607-086a-2eb07fd9bf44 (at 10.9.5.64@tcp)
      [44311.931571] Lustre: Skipped 46 previous similar messages
      [44315.952804] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_0d: @@@@@@ FAIL: post-failover df failed 
      
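      The recovery window and the evictions reported here can be inspected directly on the MDS; both parameters below are standard Lustre status/tunable files (the sample output is illustrative, and its fields vary with recovery state and release):

      # Live recovery progress on the MDT:
      lctl get_param mdt.lustre-MDT0000.recovery_status
      #   status: RECOVERING
      #   connected_clients: 1/2
      #   evicted_clients: 1
      #   time_remaining: ...
      # The 180s "hard limit" in the log corresponds to the recovery_time_hard tunable:
      lctl get_param mdt.lustre-MDT0000.recovery_time_hard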

      replay-single test_0c also fails with similar messages in the logs: https://testing.whamcloud.com/test_sets/d20239e0-fd79-11e8-a97c-52540065bddc

      More logs for these failures are at:
      https://testing.whamcloud.com/test_sets/ea4338ea-fd67-11e8-8a18-52540065bddc
      https://testing.whamcloud.com/test_sets/9efcb22c-f712-11e8-815b-52540065bddc


    People

      Assignee: James A Simmons (simmonsja)
      Reporter: James Nunez (Inactive) (jamesanunez)
      Votes: 0
      Watchers: 10
