replay-single test 0d fails with 'post-failover df failed'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.2
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      replay-single test_0d fails with 'post-failover df failed' because all clients are evicted and do not recover. Looking at the logs from a recent failure, https://testing.whamcloud.com/test_sets/d34a9c44-fd82-11e8-b970-52540065bddc , in the client test_log, we see there is a problem mounting the file system on the second client (vm4):

      Started lustre-MDT0000
      Starting client: trevis-26vm3:  -o user_xattr,flock trevis-26vm6@tcp:/lustre /mnt/lustre
      CMD: trevis-26vm3 mkdir -p /mnt/lustre
      CMD: trevis-26vm3 mount -t lustre -o user_xattr,flock trevis-26vm6@tcp:/lustre /mnt/lustre
      trevis-26vm4: error: invalid path '/mnt/lustre': Input/output error
       replay-single test_0d: @@@@@@ FAIL: post-failover df failed 
      

      Looking at the dmesg log from client 2 (vm4), we see the following errors:

      [44229.221245] LustreError: 166-1: MGC10.9.5.67@tcp: Connection to MGS (at 10.9.5.67@tcp) was lost; in progress operations using this service will fail
      [44254.268743] Lustre: Evicted from MGS (at 10.9.5.67@tcp) after server handle changed from 0x306f28dc59d36b9 to 0x306f28dc59d3cc4
      [44425.483787] LustreError: 11-0: lustre-MDT0000-mdc-ffff88007a5ac800: operation mds_reint to node 10.9.5.67@tcp failed: rc = -107
      [44429.540695] LustreError: 167-0: lustre-MDT0000-mdc-ffff88007a5ac800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
      [44429.542381] LustreError: 29222:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
      [44429.542384] LustreError: 29222:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 15 previous similar messages
      [44429.547526] Lustre: lustre-MDT0000-mdc-ffff88007a5ac800: Connection restored to 10.9.5.67@tcp (at 10.9.5.67@tcp)
      [44429.547533] Lustre: Skipped 1 previous similar message
      [44429.613758] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_0d: @@@@@@ FAIL: post-failover df failed 
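
The `rc = -107` and `rc = -5` values in the client log above are negated errno codes. As a reading aid, here is a minimal Python sketch that pulls the trailing `rc` out of such lines and looks up its symbolic name. The helper name `decode_rc` and the regular expression are illustrative assumptions, not part of Lustre or its test framework. The lookup uses the errno table of the host running the script, so run it on Linux to match the kernel that produced the log.

```python
import errno
import os
import re

# Matches the trailing "rc = -N" that appears at the end of the error lines
# quoted in this ticket (assumed format, based only on those lines).
RC_PATTERN = re.compile(r"rc = (-?\d+)\s*$")


def decode_rc(line):
    """Return (rc, errno name, description) for a log line, or None if no rc."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    rc = int(match.group(1))
    code = abs(rc)  # kernel return codes are negated errno values
    name = errno.errorcode.get(code, "UNKNOWN")
    return rc, name, os.strerror(code)


client_lines = [
    "[44425.483787] LustreError: 11-0: lustre-MDT0000-mdc-ffff88007a5ac800: "
    "operation mds_reint to node 10.9.5.67@tcp failed: rc = -107",
    "[44429.542381] LustreError: 29222:0:(file.c:4393:ll_inode_revalidate_fini()) "
    "lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5",
]

for line in client_lines:
    print(decode_rc(line))
```

On Linux, 107 is `ENOTCONN` and 5 is `EIO`. The latter matches the 'Input/output error' that the test log reports for /mnt/lustre on trevis-26vm4.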
      

      In the dmesg log for the MDS (vm6), we see:

      [44131.617072] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
      [44135.460894] Lustre: 2440:0:(ldlm_lib.c:1771:extend_recovery_timer()) lustre-MDT0000: extended recovery timer reaching hard limit: 180, extend: 0
      [44196.726935] Lustre: lustre-MDT0000: Denying connection for new client f33a3fe0-b38c-7f20-7b19-3c32e6a1bff3(at 10.9.5.64@tcp), waiting for 2 known clients (0 recovered, 1 in progress, and 0 evicted) already passed deadline 3:05
      [44196.728849] Lustre: Skipped 21 previous similar messages
      [44311.673038] Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
      [44311.673797] Lustre: lustre-MDT0000: disconnecting 1 stale clients
      [44311.674391] Lustre: Skipped 1 previous similar message
      [44311.675031] Lustre: 2500:0:(ldlm_lib.c:1771:extend_recovery_timer()) lustre-MDT0000: extended recovery timer reaching hard limit: 180, extend: 1
      [44311.676331] Lustre: 2500:0:(ldlm_lib.c:2048:target_recovery_overseer()) lustre-MDT0000 recovery is aborted by hard timeout
      [44311.677355] Lustre: 2500:0:(ldlm_lib.c:2048:target_recovery_overseer()) Skipped 2 previous similar messages
      [44311.678318] Lustre: 2500:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      [44311.679301] Lustre: 2500:0:(ldlm_lib.c:2058:target_recovery_overseer()) Skipped 2 previous similar messages
      [44311.680369] LustreError: 2500:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 0 != fo_tot_granted 2097152
      [44311.681531] Lustre: 2500:0:(ldlm_lib.c:1617:abort_req_replay_queue()) @@@ aborted:  req@ffff922b644d6400 x1619523210909360/t0(12884901890) o36->94cd1843-54cb-a4d4-a0d3-b3519f2b7d2a@10.9.5.65@tcp:356/0 lens 512/0 e 3 to 0 dl 1544506121 ref 1 fl Complete:/4/ffffffff rc 0/-1
      [44311.739670] Lustre: lustre-MDT0000: Recovery over after 3:00, of 2 clients 0 recovered and 2 were evicted.
      [44311.930592] Lustre: lustre-MDT0000: Connection restored to e9848982-35c6-9607-086a-2eb07fd9bf44 (at 10.9.5.64@tcp)
      [44311.931571] Lustre: Skipped 46 previous similar messages
      [44315.952804] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_0d: @@@@@@ FAIL: post-failover df failed 
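
When triaging similar failures across many test sets, it can help to extract the recovery outcome from the MDS log automatically. The following is a minimal Python sketch that parses the 'Recovery over' summary line shown above. The function name `recovery_summary` and the regular expression are assumptions made for illustration. The pattern is derived only from the single line quoted in this ticket, so other Lustre releases may word the message differently.

```python
import re

# Pattern derived from the single summary line quoted in this ticket;
# the "(?:was|were)" alternative is a guess at the singular form.
RECOVERY_OVER = re.compile(
    r"Lustre: (?P<target>\S+): Recovery over after "
    r"(?P<minutes>\d+):(?P<seconds>\d+), of (?P<total>\d+) clients "
    r"(?P<recovered>\d+) recovered and (?P<evicted>\d+) (?:was|were) evicted"
)


def recovery_summary(log_text):
    """Return a dict describing the first recovery summary found, or None."""
    match = RECOVERY_OVER.search(log_text)
    if match is None:
        return None
    total = int(match.group("total"))
    evicted = int(match.group("evicted"))
    return {
        "target": match.group("target"),
        "duration_s": int(match.group("minutes")) * 60 + int(match.group("seconds")),
        "total": total,
        "recovered": int(match.group("recovered")),
        "evicted": evicted,
        # True when every known client was evicted, as in this failure.
        "all_evicted": total > 0 and evicted == total,
    }


mds_log = (
    "[44311.739670] Lustre: lustre-MDT0000: Recovery over after 3:00, "
    "of 2 clients 0 recovered and 2 were evicted."
)
summary = recovery_summary(mds_log)
print(summary)
```

For the line in this ticket, the parsed duration is 180 seconds, which equals the hard limit of 180 reported by extend_recovery_timer(), and both known clients are counted as evicted.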
      

      We also see replay-single test_0c fail with similar messages in the logs: https://testing.whamcloud.com/test_sets/d20239e0-fd79-11e8-a97c-52540065bddc

      More logs for these failures are at:
      https://testing.whamcloud.com/test_sets/ea4338ea-fd67-11e8-8a18-52540065bddc
      https://testing.whamcloud.com/test_sets/9efcb22c-f712-11e8-815b-52540065bddc

          Activity

            [LU-11762] replay-single test 0d fails with 'post-failover df failed'

            simmonsja James A Simmons added a comment:
            A new patch landed to fix this problem: https://review.whamcloud.com/#/c/39532/

            simmonsja James A Simmons added a comment:
            Hongchao Zhang, if we revert this patch do you see replay-single 0d start to fail again?
            tappro Mikhail Pershin added a comment (edited):
            There are several reports pointing to this ticket as the cause of failures; see the linked tickets LU-13614 and LU-13339.

            gerrit Gerrit Updater added a comment:
            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37141
            Subject: LU-11762 ldlm: ensure the recovery timer is armed
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 8e9dae2b1aa53b3be114922c825742271be08a0b

            gerrit Gerrit Updater added a comment:
            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36936/
            Subject: LU-11762 ldlm: don't exceed hard timeout
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: bc27e0f6efbdbd256c6459d15391754ce1b36d32

            gerrit Gerrit Updater added a comment:
            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35627/
            Subject: LU-11762 ldlm: ensure the recovery timer is armed
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fe5c801657f9ddb5e148bb6076e476df6ba31bba

            gerrit Gerrit Updater added a comment:
            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36936
            Subject: LU-11762 ldlm: don't exceed hard timeout
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 373b1cb9232caa4457d63dd04bf6d16af83f493f

            simmonsja James A Simmons added a comment:
            LU-12769 has a real fix.

            gerrit Gerrit Updater added a comment:
            Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35627
            Subject: LU-11762 ldlm: ensure the recovery timer is armed
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2ae2c2cc41f3c82393e8a19eb4da1c3644846e93

            simmonsja James A Simmons added a comment:
            The patch that landed was a cleanup patch, but Oleg still sees the failure in his test harness, so I need to work with him to reproduce this problem.

            People

              Assignee: simmonsja James A Simmons
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 10
