[LU-10127] DNE MDT goes back in time after failover, recovery - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Cannot Reproduce
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.11.0
Labels:
- soak
Environment:
soak cluster

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

After MDT0003 was successfully failed over, various clients reported transaction issues

/scratch/logs/syslog/soak-27.log:Oct 17 13:50:47 soak-27 kernel: LustreError: 1917:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146272 was previously committed, server now claims 55835485611)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
/scratch/logs/syslog/soak-30.log:Oct 17 13:50:48 soak-30 kernel: LustreError: 2448:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146275 was previously committed, server now claims 60130146239)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
/scratch/logs/syslog/soak-18.log:Oct 17 13:50:50 soak-18 kernel: LustreError: 1966:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146271 was previously committed, server now claims 60130108709)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
/scratch/logs/syslog/soak-31.log:Oct 17 13:50:50 soak-31 kernel: LustreError: 2451:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146276 was previously committed, server now claims 60130108711)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
/scratch/logs/syslog/soak-25.log:Oct 17 13:50:51 soak-25 kernel: LustreError: 1862:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130134563 was previously committed, server now claims 55850633401)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
/scratch/logs/syslog/soak-28.log:Oct 17 13:50:51 soak-28 kernel: LustreError: 1937:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146273 was previously committed, server now claims 55850712527)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646

Attachments

Issue Links

mentioned in: Page No Confluence page found with the given URL.

Activity

[LU-10127] DNE MDT goes back in time after failover, recovery

Alexey Lyashkov added a comment - 02/Apr/18 2:51 AM - edited

We have replicated it with IEEL3 code. Currently i have just a crash dump before such issue and picture is very strange.

New client have connected in 47s before crash, exp_need_sync is false, so data should be committed on storage.

Client stay in idle until server crashed (crash forced due lfsck deadlock on ost side), crash dump is done, server stay in up and client hit this message, probably it's previous mount, but i don't have info to have compare. From my point view this can be bug in dt_record_write function. In comparing to the VFS code, that function doesn't hold i_mutex during write, but writes in non conflicting ranges, but we have no protection against extent tree manipulation and using a same buffer head.

adilger, what you think about it ?

Alexey Lyashkov added a comment - 02/Apr/18 2:51 AM - edited We have replicated it with IEEL3 code. Currently i have just a crash dump before such issue and picture is very strange. New client have connected in 47s before crash, exp_need_sync is false, so data should be committed on storage. Client stay in idle until server crashed (crash forced due lfsck deadlock on ost side), crash dump is done, server stay in up and client hit this message, probably it's previous mount, but i don't have info to have compare. From my point view this can be bug in dt_record_write function. In comparing to the VFS code, that function doesn't hold i_mutex during write, but writes in non conflicting ranges, but we have no protection against extent tree manipulation and using a same buffer head. adilger , what you think about it ?

Di Wang added a comment - 24/Oct/17 5:00 PM

One idea that comes to mind is that these might be old open RPCs that are being replayed, or something?

I just checked the code, and did not see anything unusual when exchanging the committed transno during reconnection for DNE. Actually DNE does not touch those area, as I remember. hmm...

Cliff: Could you please ping me when you are going to run DNE tests. I would like to collecting some debug log. Thanks.

Di Wang added a comment - 24/Oct/17 5:00 PM One idea that comes to mind is that these might be old open RPCs that are being replayed, or something? I just checked the code, and did not see anything unusual when exchanging the committed transno during reconnection for DNE. Actually DNE does not touch those area, as I remember. hmm... Cliff: Could you please ping me when you are going to run DNE tests. I would like to collecting some debug log. Thanks.

Cliff White (Inactive) added a comment - 23/Oct/17 6:09 PM

We don't have the debug log. Issue was easy to reproduce, and occurred during a fresh restart after a power-cycle of all nodes. We are currently running non-DNE, should be reformatting this week.

Cliff White (Inactive) added a comment - 23/Oct/17 6:09 PM We don't have the debug log. Issue was easy to reproduce, and occurred during a fresh restart after a power-cycle of all nodes. We are currently running non-DNE, should be reformatting this week.

Andreas Dilger added a comment - 19/Oct/17 7:03 PM

I originally thought it was a skipped transno, but this usually shows up as only a small difference in transnos. In this case, the transno is very different. Looking at these transno values in hex shows that the old transno is from a previous mount epoch: 55850633401 = 0xd00f508b9, 60130134563 = 0xe00090a23

One idea that comes to mind is that these might be old open RPCs that are being replayed, or something?

Andreas Dilger added a comment - 19/Oct/17 7:03 PM I originally thought it was a skipped transno, but this usually shows up as only a small difference in transnos. In this case, the transno is very different. Looking at these transno values in hex shows that the old transno is from a previous mount epoch: 55850633401 = 0xd00f508b9 , 60130134563 = 0xe00090a23 One idea that comes to mind is that these might be old open RPCs that are being replayed, or something?

Di Wang added a comment - 19/Oct/17 5:57 AM

Hmm, as I remember, the DNE recovery is based on the master MDT transno, i.e. the slave MDT transno should not be used to update the master MDT transno. I suspect this is caused by recovery failure, i.e. some of the transno is being skipped in this case.

Is this failure easily reproduced? Could you please post the MDT0003 debug log here? Thanks.

Di Wang added a comment - 19/Oct/17 5:57 AM Hmm, as I remember, the DNE recovery is based on the master MDT transno, i.e. the slave MDT transno should not be used to update the master MDT transno. I suspect this is caused by recovery failure, i.e. some of the transno is being skipped in this case. Is this failure easily reproduced? Could you please post the MDT0003 debug log here? Thanks.

Andreas Dilger added a comment - 18/Oct/17 6:54 PM

Di, Lai,
is this something to be concerned about? It looks almost like DNE recovery is using the wrong transno (maybe from another MDS?) during RPC replay?

Andreas Dilger added a comment - 18/Oct/17 6:54 PM Di, Lai, is this something to be concerned about? It looks almost like DNE recovery is using the wrong transno (maybe from another MDS?) during RPC replay?

People

Assignee:: WC Triage

Reporter:: Cliff White (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 10 Start watching this issue

Dates

Created:: 17/Oct/17 3:34 PM

Updated:: 05/Dec/18 5:39 PM

Resolved:: 05/Dec/18 5:39 PM