[LU-7732] check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF Created: 02/Feb/16  Updated: 26/Jun/19  Resolved: 11/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Di Wang Assignee: Bob Glossman (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
[ 2270.584467] LustreError: 30824:0:(ldlm_lib.c:1900:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4295194948, ql: 2, comp: 2, conn: 4, next: 4295194950, next_update 4295194960 last_committed: 4295194129)
[ 2270.587115] LustreError: 30824:0:(ldlm_lib.c:1900:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4295194951, ql: 2, comp: 2, conn: 4, next: 4295194952, next_update 4295194960 last_committed: 4295194129)
[ 2270.590338] LustreError: 30824:0:(ldlm_lib.c:1900:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4295194955, ql: 2, comp: 2, conn: 4, next: 4295194958, next_update 4295194960 last_committed: 4295194129)
[ 2270.592946] LustreError: 30824:0:(ldlm_lib.c:1900:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4295194959, ql: 2, comp: 2, conn: 4, next: 4295194960, next_update 4295194960 last_committed: 4295194129)

I met a lot such message during replay test. And it turns out those losing transaction numbers are actually coming from open/close request, because the open request (no create) will be destroyed(removed from replay list) after close, no matter if the transaction is committed or not, i.e. open request will not be replayed usually.

So the question here is that should we have to allocate transno to open (no create) request? If it is necessary, then we at least should turn off these annoying console message, since they are not real error.



 Comments   
Comment by Mikhail Pershin [ 15/Apr/16 ]

normally open replays don't issue such error messages. These particular messages show that we have a problem here:
last_commited is 4295194129, replayed request is 4295194948, so it is larger than last_committed value that means there should be no gaps if all clients were connected. I see 4 clients are connected but can't see how many is expected to connect. So that is one of the reason.
If all clients are connected, then we have a possible problem with recovery - some replays are missing or transaction number were generated not sequentially, e.g. several transno per RPCs or so. Maybe this is result of update recovery? This can be not a problem at all, just result of changed recovery basics, e.g. if gaps become possible even after last_committed.

Comment by Di Wang [ 15/Apr/16 ]
normally open replays don't issue such error messages.

From what I see, this is because of open replay. Note: in current implementation, open (non create) request will be destroyed (removed from the replay list) when close is done, and it does not check the committed transaction at all, that is why you will see a lot skip transno here. I do not think this is related with update recovery, at least in theory. Even in single MDT replay, you will see such annoying message.

Comment by Mikhail Pershin [ 15/Apr/16 ]

you are right, that can be the reason of gaps after last_committed. in that case I agree that D_ERROR level is not needed for such messages

Comment by Andreas Dilger [ 03/Jun/16 ]

Mike, is this just a matter of turning this CERROR() into a CDEBUG(), or is there some way we can limit it to specific cases where open is involved? I guess if the RPC is freed then there is no way to know whether it was an open or not?

In any case, this message is spewing to the console during replay-dual testing (e.g. https://testing.hpdd.intel.com/test_logs/a6e8a554-2901-11e6-a0ce-5254006e85c2/show_text) and should really be fixed in some way before 2.9.0 is released. Is there a particular bug/patch that caused this message to appear that this ticket should link to, so we might better understand why it has started to appear?

Comment by Doug Oucharek (Inactive) [ 13/Jul/16 ]

The patch submitted by Olaf was mis-titled and is not a fix for this ticket. There is no fix for this yet.

Comment by Olaf Weber [ 13/Jul/16 ]

Thanks Doug. Apologies for the confusion caused by my typo.

Comment by Bob Glossman (Inactive) [ 13/Jul/16 ]

if the needed fix is just to eliminate use of D_ERROR in this particular error message and always use D_HA, I can easily construct such a patch. If something more clever is required it probably needs an expert on this part of the code.

Comment by Andreas Dilger [ 19/Jul/16 ]

Bob, could you please make a patch changing D_ERROR to D_HA. I'm not sure that is all that is needed, but it is a start.

Comment by Gerrit Updater [ 19/Jul/16 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/21418
Subject: LU-7732 ldlm: silence verbose "waking for gap" log messages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6e3e224388a4224cf68ed0666126ff541197c061

Comment by Gerrit Updater [ 11/Aug/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21418/
Subject: LU-7732 ldlm: silence verbose "waking for gap" log messages
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f073f11b860fcb42707c50b63bed2b2294ceeeba

Comment by Peter Jones [ 11/Aug/16 ]

Landed for 2.9

Generated at Sat Feb 10 02:11:26 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.