Details
-
Bug
-
Resolution: Fixed
-
Major
-
None
-
None
-
3
Description
The message 'More than one transno' in the Lustre debug log means that several transaction start/stop cycles occurred while handling a single RPC request. This is a potentially dangerous situation, because without proper handling it can lead to lost transactions. For example, if the first transaction was applied and its transno was written to last_rcvd, the transno of the second one is not written. Now suppose that only the first transaction was actually committed: its transaction number is then considered committed too, so the request is never replayed, leaving the second part both uncommitted and not replayed. So every 'More than one transno' message means potential data loss. At the same time, this message is not logged with the D_ERROR mask, so it is not visible on the console.
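The failure scenario above can be sketched with a toy model. This is not Lustre code: the function name, the transno values, and the rule "replay only if the recorded transno is newer than the last committed one" are simplifying assumptions made for illustration.

```python
# Toy model of the scenario, not Lustre code; all names are hypothetical.
def needs_replay(recorded_transno, last_committed):
    """Assume a request is replayed only if its recorded transno is not yet committed."""
    return recorded_transno > last_committed

# One RPC runs two transactions, but only the first transno is recorded.
transnos = [101, 102]
recorded = transnos[0]      # the second transno is never written
last_committed = 101        # crash after only transaction 101 was committed

replayed = needs_replay(recorded, last_committed)
# Transactions that were neither committed nor covered by a replay are lost.
lost = [t for t in transnos if t > last_committed and not replayed]
print(replayed, lost)  # False [102]
```

In this model the request looks fully committed, so nothing triggers a replay and the work of transaction 102 disappears silently.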
Meanwhile, we have a limited way to handle multi-transaction RPCs: if we know this may happen, we store only the latest transno in last_rcvd, and all previous transactions must be reentrant, i.e. they can be applied again without errors after having already been applied. In that case replay will not cause errors if the earlier transactions were committed but the last one wasn't. For such cases the tti_mult_trans flag is used to mark an RPC handler as legitimately multi-transactional.
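A minimal sketch of why reentrancy matters when only the latest transno is stored. It is a toy model under the same assumptions as before (hypothetical names and values, replay decided by comparing the recorded transno with the last committed one), not the actual handler logic.

```python
# Toy model, not Lustre code; all names are hypothetical.
def apply_step(state, key, value):
    """Reentrant update: re-applying an already applied step is a no-op, not an error."""
    if state.get(key) != value:
        state[key] = value

steps = [("a", 1), ("b", 2)]    # two transactions of one RPC (transnos 101 and 102)
recorded = 102                  # only the latest transno is stored
last_committed = 101            # crash: the first step is on disk, the second is not
state = {"a": 1}

if recorded > last_committed:   # the request still counts as uncommitted, so replay it
    for key, value in steps:
        apply_step(state, key, value)
print(state)  # {'a': 1, 'b': 2}
```

Replay re-runs the whole request, so the first step is applied twice. Because it is reentrant, the second application succeeds quietly and the missing step is restored.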
I propose changing the default behavior for the multi-transaction situation to always write the latest transno to the last_rcvd file, as we already do with the tti_mult_trans flag, and to issue a warning message if that was not expected. That way unhandled cases would be reviewed and resolved properly. The downside is replay errors for unhandled cases, but that is better than possible data loss.
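The proposed default could look roughly like the sketch below. It is only an illustration of the policy: `record_transno` and its `mult_trans_expected` parameter are hypothetical stand-ins (the latter playing the role of tti_mult_trans), and Python's `logging` stands in for the console warning.

```python
import logging

log = logging.getLogger("toy_last_rcvd")

def record_transno(transnos, mult_trans_expected):
    """Pick the transno to store for one RPC (assumes a non-empty list).

    Always keep the latest transno; warn if several transactions
    happened in a handler that was not marked as multi-transactional.
    """
    if len(transnos) > 1 and not mult_trans_expected:
        log.warning("More than one transno (%d) in an unmarked RPC handler",
                    len(transnos))
    return max(transnos)

print(record_transno([101], False))        # 101
print(record_transno([101, 102], True))    # 102
print(record_transno([101, 102], False))   # 102, and a warning is logged
```

The unmarked case still stores the latest transno, so the request stays eligible for replay, and the warning points at the handler that needs review.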
Issue Links
- is related to LU-15776 2.15 RC3: lost writes during server fofb by forced panics (Resolved)