
DNE MDT goes back in time after failover, recovery

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.11.0
    • Environment: soak cluster
    • Severity: 3
    • 9223372036854775807

    Description

      After MDT0003 was successfully failed over, various clients reported transaction issues:

      /scratch/logs/syslog/soak-27.log:Oct 17 13:50:47 soak-27 kernel: LustreError: 1917:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146272 was previously committed, server now claims 55835485611)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
      /scratch/logs/syslog/soak-30.log:Oct 17 13:50:48 soak-30 kernel: LustreError: 2448:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146275 was previously committed, server now claims 60130146239)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
      /scratch/logs/syslog/soak-18.log:Oct 17 13:50:50 soak-18 kernel: LustreError: 1966:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146271 was previously committed, server now claims 60130108709)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
      /scratch/logs/syslog/soak-31.log:Oct 17 13:50:50 soak-31 kernel: LustreError: 2451:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146276 was previously committed, server now claims 60130108711)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
      /scratch/logs/syslog/soak-25.log:Oct 17 13:50:51 soak-25 kernel: LustreError: 1862:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130134563 was previously committed, server now claims 55850633401)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
      /scratch/logs/syslog/soak-28.log:Oct 17 13:50:51 soak-28 kernel: LustreError: 1937:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146273 was previously committed, server now claims 55850712527)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
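      For reference, the message above comes from the client's reconnect-time check in ptlrpc_connect_interpret() (import.c:1264), which compares the last-committed transno the server now reports against the highest value the client had previously been told was committed. The following standalone C sketch (plain userspace code with approximate names, not the in-tree implementation) shows that comparison, using the values from the soak-25 line above:

      /* reconnect_check.c - illustrative only.  Build: cc reconnect_check.c */
      #include <stdint.h>
      #include <stdio.h>

      /* Highest transno the server previously reported as committed. */
      struct import_state {
          uint64_t peer_committed_transno;
      };

      static void check_reconnect_transno(const struct import_state *imp,
                                          uint64_t server_last_committed,
                                          const char *target)
      {
          /* On reconnect the server must report at least the previously
           * acknowledged committed transno; a smaller value means committed
           * transactions appear to have been lost. */
          if (server_last_committed < imp->peer_committed_transno)
              printf("%s went back in time (transno %llu was previously "
                     "committed, server now claims %llu)!\n",
                     target,
                     (unsigned long long)imp->peer_committed_transno,
                     (unsigned long long)server_last_committed);
      }

      int main(void)
      {
          /* Values taken from the soak-25 log line above. */
          struct import_state imp = { .peer_committed_transno = 60130134563ULL };

          check_reconnect_transno(&imp, 55850633401ULL, "soaked-MDT0003_UUID");
          return 0;
      }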
      

          Activity

            [LU-10127] DNE MDT goes back in time after failover, recovery

            shadow Alexey Lyashkov added a comment -

            I have not seen this issue since the lov objid file update was fixed, but Vladimir Savelev fixed a multi-slot transno issue around the same time, so one of those changes is likely related to it.

            I think the ticket can be closed as cannot reproduce now.
            shadow Alexey Lyashkov added a comment - edited

            We have replicated this with the IEEL3 code. Currently I have only a crash dump taken before the issue occurred, and the picture is very strange.

            A new client connected 47 s before the crash, and exp_need_sync was false, so the data should already have been committed to storage.

            The client stayed idle until the server crashed (the crash was forced due to an lfsck deadlock on the OST side). After the crash dump was taken the server stayed up, and the client hit this message; it probably refers to a previous mount, but I do not have the information to compare. From my point of view this could be a bug in dt_record_write(): unlike the VFS code, that function does not hold i_mutex during the write. It writes to non-conflicting ranges, but there is no protection against concurrent extent-tree manipulation and use of the same buffer head.

            adilger, what do you think about it?
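            To illustrate the class of race described above (a minimal userspace sketch, not Lustre code, under the assumption that nothing equivalent to i_mutex serializes the shared metadata update), two writers can touch disjoint data ranges yet still lose updates to shared bookkeeping state:

            /*
             * race_sketch.c - illustrative only.  Two threads write to
             * non-overlapping record slots, but both update a shared
             * "blocks_allocated" counter (a stand-in for extent-tree /
             * buffer-head state) without any lock.
             * Build: cc -pthread race_sketch.c -o race_sketch
             */
            #include <pthread.h>
            #include <stdio.h>

            #define RECORD_SIZE 64
            #define ITERATIONS  1000000

            static char file_data[2 * RECORD_SIZE];   /* two disjoint record slots    */
            static long blocks_allocated;             /* shared metadata, unprotected */

            static void *writer(void *arg)
            {
                long slot = (long)arg;

                for (long i = 0; i < ITERATIONS; i++) {
                    /* Non-conflicting data ranges: each thread owns one slot. */
                    file_data[slot * RECORD_SIZE] = (char)i;

                    /* Unserialized read-modify-write: updates can be lost,
                     * analogous to racing metadata updates in the scenario
                     * described above. */
                    blocks_allocated = blocks_allocated + 1;
                }
                return NULL;
            }

            int main(void)
            {
                pthread_t t1, t2;

                pthread_create(&t1, NULL, writer, (void *)0L);
                pthread_create(&t2, NULL, writer, (void *)1L);
                pthread_join(t1, NULL);
                pthread_join(t2, NULL);

                /* Expected 2 * ITERATIONS; anything smaller is a lost update. */
                printf("blocks_allocated = %ld (expected %d)\n",
                       blocks_allocated, 2 * ITERATIONS);
                return 0;
            }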
            di.wang Di Wang added a comment -

            One idea that comes to mind is that these might be old open RPCs that are being replayed, or something?

            I just checked the code and did not see anything unusual in how the committed transno is exchanged during reconnection for DNE. Actually, DNE does not touch that area, as far as I remember. Hmm...

            Cliff: could you please ping me when you are going to run the DNE tests? I would like to collect some debug logs. Thanks.

            cliffw Cliff White (Inactive) added a comment -

            We don't have the debug log. The issue was easy to reproduce, and occurred during a fresh restart after a power-cycle of all nodes. We are currently running non-DNE and should be reformatting this week.

            adilger Andreas Dilger added a comment -

            I originally thought it was a skipped transno, but that usually shows up as only a small difference in transnos. In this case the transnos are very different. Looking at these transno values in hex shows that the old transno is from a previous mount epoch: 55850633401 = 0xd00f508b9, 60130134563 = 0xe00090a23.

            One idea that comes to mind is that these might be old open RPCs that are being replayed, or something?
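            To make the mount-epoch observation concrete, the following standalone sketch (assuming, based on the hex values quoted above, that the epoch occupies the top 32 bits of the 64-bit transno) splits the two transnos from the soak-25 log line into an epoch and an in-epoch counter:

            /* transno_decode.c - illustrative only.  Build: cc transno_decode.c */
            #include <stdint.h>
            #include <stdio.h>

            static void decode_transno(uint64_t transno)
            {
                uint64_t epoch = transno >> 32;           /* mount/boot epoch     */
                uint64_t seq   = transno & 0xffffffffULL; /* counter within epoch */

                printf("transno %llu (0x%llx): epoch 0x%llx, seq 0x%llx\n",
                       (unsigned long long)transno, (unsigned long long)transno,
                       (unsigned long long)epoch, (unsigned long long)seq);
            }

            int main(void)
            {
                decode_transno(55850633401ULL);  /* 0xd00f508b9: older epoch 0xd */
                decode_transno(60130134563ULL);  /* 0xe00090a23: newer epoch 0xe */
                return 0;
            }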
            di.wang Di Wang added a comment -

            Hmm, as I remember, DNE recovery is based on the master MDT transno, i.e. the slave MDT transno should not be used to update the master MDT transno. I suspect this is caused by a recovery failure, i.e. some transnos are being skipped in this case.

            Is this failure easily reproduced? Could you please post the MDT0003 debug log here? Thanks.

            adilger Andreas Dilger added a comment -

            Di, Lai: is this something to be concerned about? It looks almost like DNE recovery is using the wrong transno (maybe from another MDS?) during RPC replay.

            People

              Assignee: wc-triage WC Triage
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: