[LU-10127] DNE MDT goes back in time after failover, recovery Created: 17/Oct/17  Updated: 05/Dec/18  Resolved: 05/Dec/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: soak
Environment:

soak cluster


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After MDT0003 was successfully failed over, various clients reported "went back in time" transaction errors:

/scratch/logs/syslog/soak-27.log:Oct 17 13:50:47 soak-27 kernel: LustreError: 1917:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146272 was previously committed, server now claims 55835485611)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
/scratch/logs/syslog/soak-30.log:Oct 17 13:50:48 soak-30 kernel: LustreError: 2448:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146275 was previously committed, server now claims 60130146239)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
/scratch/logs/syslog/soak-18.log:Oct 17 13:50:50 soak-18 kernel: LustreError: 1966:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146271 was previously committed, server now claims 60130108709)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
/scratch/logs/syslog/soak-31.log:Oct 17 13:50:50 soak-31 kernel: LustreError: 2451:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146276 was previously committed, server now claims 60130108711)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
/scratch/logs/syslog/soak-25.log:Oct 17 13:50:51 soak-25 kernel: LustreError: 1862:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130134563 was previously committed, server now claims 55850633401)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
/scratch/logs/syslog/soak-28.log:Oct 17 13:50:51 soak-28 kernel: LustreError: 1937:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146273 was previously committed, server now claims 55850712527)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
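For reference, the message comes from the client-side reconnect sanity check: the client remembers the highest transno the server ever reported as committed and complains if a reconnect reply claims a lower one. A simplified, self-contained sketch of that comparison (the names below are placeholders, not the actual ptlrpc_connect_interpret() code):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Placeholder for the reconnect-time sanity check; the real check lives in
 * import.c:ptlrpc_connect_interpret() and uses different names. */
static void check_last_committed(const char *target,
                                 uint64_t client_seen_committed,
                                 uint64_t server_claims_committed)
{
        if (server_claims_committed < client_seen_committed)
                fprintf(stderr, "%s went back in time (transno %" PRIu64
                        " was previously committed, server now claims %" PRIu64 ")!\n",
                        target, client_seen_committed, server_claims_committed);
}

int main(void)
{
        /* values from the soak-27 line above */
        check_last_committed("soaked-MDT0003_UUID", 60130146272ULL, 55835485611ULL);
        return 0;
}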


 Comments   
Comment by Andreas Dilger [ 18/Oct/17 ]

Di, Lai,
is this something to be concerned about? It looks almost like DNE recovery is using the wrong transno (maybe from another MDS?) during RPC replay?

Comment by Di Wang [ 19/Oct/17 ]

Hmm, as I remember, the DNE recovery is based on the master MDT transno, i.e. the slave MDT transno should not be used to update the master MDT transno. I suspect this is caused by a recovery failure, i.e. some of the transnos are being skipped in this case.
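To make that concrete, a toy model of the above (hypothetical names, not the actual lod/osp recovery code): in a cross-MDT operation the master MDT owns the transno the client will replay against, and the remote MDT's own counter is never folded back into it.

#include <stdio.h>
#include <stdint.h>

struct mdt { uint64_t last_transno; };

/* Toy cross-MDT update: the master assigns the client-visible transno;
 * the slave only advances its own local counter. */
static uint64_t dne_update(struct mdt *master, struct mdt *slave)
{
        uint64_t transno = ++master->last_transno;  /* what the client tracks */

        ++slave->last_transno;  /* local bookkeeping on the remote MDT only */
        return transno;
}

int main(void)
{
        struct mdt mdt3 = { .last_transno = 60130146000ULL };
        struct mdt mdt0 = { .last_transno = 55835485000ULL };

        printf("client replays against transno %llu\n",
               (unsigned long long)dne_update(&mdt3, &mdt0));
        /* If the remote MDT's (much lower) counter ever leaked into the reply,
         * the client would see the target "go back in time". */
        return 0;
}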

Is this failure easily reproduced? Could you please post the MDT0003 debug log here? Thanks.

Comment by Andreas Dilger [ 19/Oct/17 ]

I originally thought it was a skipped transno, but this usually shows up as only a small difference in transnos. In this case, the transno is very different. Looking at these transno values in hex shows that the old transno is from a previous mount epoch: 55850633401 = 0xd00f508b9, 60130134563 = 0xe00090a23
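Concretely, assuming the usual layout where the top 32 bits of a transno carry the mount/boot epoch and the low 32 bits the sequence, the two values from the soak-25 line decode to different epochs:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* previously committed vs. what the server now claims (soak-25) */
        uint64_t transnos[] = { 60130134563ULL /* 0xe00090a23 */,
                                55850633401ULL /* 0xd00f508b9 */ };

        for (int i = 0; i < 2; i++)
                printf("%llu = epoch %llu, seq %llu\n",
                       (unsigned long long)transnos[i],
                       (unsigned long long)(transnos[i] >> 32),
                       (unsigned long long)(transnos[i] & 0xffffffffULL));
        /* epoch 14 for the previously committed transno, epoch 13 for the one
         * now claimed, i.e. a value from the previous mount of the MDT. */
        return 0;
}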

One idea that comes to mind is that these might be old open RPCs that are being replayed, or something?

Comment by Cliff White (Inactive) [ 23/Oct/17 ]

We don't have the debug log. The issue was easy to reproduce and occurred during a fresh restart after a power-cycle of all nodes. We are currently running non-DNE, and should be reformatting this week.

Comment by Di Wang [ 24/Oct/17 ]
"One idea that comes to mind is that these might be old open RPCs that are being replayed, or something?"

I just checked the code and did not see anything unusual in how the committed transno is exchanged during reconnection for DNE. Actually, DNE does not touch that area, as I remember. Hmm...

Cliff: could you please ping me when you are going to run DNE tests? I would like to collect some debug logs. Thanks.

Comment by Alexey Lyashkov [ 02/Apr/18 ]

We have replicated it with IEEL3 code. Currently I have just a crash dump taken before the issue, and the picture is very strange.

A new client had connected 47s before the crash; exp_need_sync is false, so the data should have been committed on storage.

The client stayed idle until the server crashed (the crash was forced due to an lfsck deadlock on the OST side). The crash dump was taken, the server stayed up, and the client hit this message. It is probably from the previous mount, but I don't have enough information to compare. From my point of view this could be a bug in the dt_record_write function: compared to the VFS code, that function doesn't hold i_mutex during the write. It writes to non-conflicting ranges, but we have no protection against extent tree manipulation and use of the same buffer head.
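A toy illustration of that concern (not Lustre code): two writers touch disjoint data ranges, which is safe by itself, but without the i_mutex-style lock the VFS write path would hold, their updates to shared mapping metadata (standing in here for the extent tree / buffer heads) can race.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

static char file_data[8192];         /* the "file" contents */
static unsigned long blocks_mapped;  /* shared metadata, like an extent count */

struct job { size_t off; size_t len; };

static void *record_write(void *arg)
{
        struct job *j = arg;

        memset(file_data + j->off, 'x', j->len);  /* disjoint ranges: safe */
        blocks_mapped++;  /* unprotected read-modify-write: the race */
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;
        struct job a = { .off = 0,    .len = 4096 };
        struct job b = { .off = 4096, .len = 4096 };

        pthread_create(&t1, NULL, record_write, &a);
        pthread_create(&t2, NULL, record_write, &b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* may print 1 instead of 2 when the unsynchronized updates collide */
        printf("blocks_mapped = %lu\n", blocks_mapped);
        return 0;
}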

adilger, what do you think about it?

Comment by Alexey Lyashkov [ 05/Dec/18 ]

I haven't seen this issue since the lov objid file update was fixed, but Vladimir Savelev fixed a multi-slot transno issue around the same time, so likely one of them is related to it.

I think the ticket can be closed as cannot reproduce now.
