Details
- Bug
- Resolution: Cannot Reproduce
- Major
- None
- Lustre 2.11.0
- soak cluster
- 3
- 9223372036854775807
Description
After MDT0003 was successfully failed over, various clients reported transaction issues:
/scratch/logs/syslog/soak-27.log:Oct 17 13:50:47 soak-27 kernel: LustreError: 1917:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146272 was previously committed, server now claims 55835485611)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
/scratch/logs/syslog/soak-30.log:Oct 17 13:50:48 soak-30 kernel: LustreError: 2448:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146275 was previously committed, server now claims 60130146239)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
/scratch/logs/syslog/soak-18.log:Oct 17 13:50:50 soak-18 kernel: LustreError: 1966:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146271 was previously committed, server now claims 60130108709)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
/scratch/logs/syslog/soak-31.log:Oct 17 13:50:50 soak-31 kernel: LustreError: 2451:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146276 was previously committed, server now claims 60130108711)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
/scratch/logs/syslog/soak-25.log:Oct 17 13:50:51 soak-25 kernel: LustreError: 1862:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130134563 was previously committed, server now claims 55850633401)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
/scratch/logs/syslog/soak-28.log:Oct 17 13:50:51 soak-28 kernel: LustreError: 1937:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146273 was previously committed, server now claims 55850712527)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
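To compare how far the server's transno fell behind on each client, the messages can be tallied with a short script. This is only a rough sketch that assumes the syslog format shown in the excerpt above; the regular expression and the `parse_rollbacks` helper are illustrative names and are not part of Lustre. The two sample lines are copied from the soak-27 and soak-30 logs.

```python
import re

# Illustrative pattern for the "went back in time" message in the excerpt;
# it assumes the exact wording shown there.
PATTERN = re.compile(
    r"(?P<host>\S+) kernel: LustreError: .*?"
    r"(?P<target>\S+) went back in time "
    r"\(transno (?P<client>\d+) was previously committed, "
    r"server now claims (?P<server>\d+)\)"
)


def parse_rollbacks(text):
    """Return (host, target, client_transno, server_transno, gap) tuples."""
    rows = []
    for m in PATTERN.finditer(text):
        client = int(m.group("client"))
        server = int(m.group("server"))
        # gap = how many transactions the server appears to have lost
        rows.append((m.group("host"), m.group("target"),
                     client, server, client - server))
    return rows


SAMPLE = """\
/scratch/logs/syslog/soak-27.log:Oct 17 13:50:47 soak-27 kernel: LustreError: 1917:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146272 was previously committed, server now claims 55835485611)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
/scratch/logs/syslog/soak-30.log:Oct 17 13:50:48 soak-30 kernel: LustreError: 2448:0:(import.c:1264:ptlrpc_connect_interpret()) soaked-MDT0003_UUID went back in time (transno 60130146275 was previously committed, server now claims 60130146239)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
"""

if __name__ == "__main__":
    for host, target, client, server, gap in parse_rollbacks(SAMPLE):
        print(f"{host} {target} client={client} server={server} gap={gap}")
```

Running the same helper over the full set of client logs would show whether the gap is similar on every client or varies widely between them.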
We have reproduced this with IEEL3 code. So far I only have a crash dump taken before the issue appeared, and the picture is very strange.
A new client connected 47 s before the crash, and exp_need_sync is false, so its data should have been committed to storage.
The client stayed idle until the server crashed (the crash was forced because of an lfsck deadlock on the OST side). The crash dump was taken, the server came back up, and the client hit this message. It probably relates to a previous mount, but I don't have the information to compare. From my point of view, this could be a bug in the dt_record_write function. Unlike the VFS code, that function doesn't hold i_mutex during the write. It writes to non-conflicting ranges, but we have no protection against extent tree manipulation or use of the same buffer head.
adilger, what do you think about it?