[LU-7732] check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF Created: 02/Feb/16 Updated: 26/Jun/19 Resolved: 11/Aug/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Di Wang | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
[ 2270.584467] LustreError: 30824:0:(ldlm_lib.c:1900:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4295194948, ql: 2, comp: 2, conn: 4, next: 4295194950, next_update 4295194960 last_committed: 4295194129) [ 2270.587115] LustreError: 30824:0:(ldlm_lib.c:1900:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4295194951, ql: 2, comp: 2, conn: 4, next: 4295194952, next_update 4295194960 last_committed: 4295194129) [ 2270.590338] LustreError: 30824:0:(ldlm_lib.c:1900:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4295194955, ql: 2, comp: 2, conn: 4, next: 4295194958, next_update 4295194960 last_committed: 4295194129) [ 2270.592946] LustreError: 30824:0:(ldlm_lib.c:1900:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4295194959, ql: 2, comp: 2, conn: 4, next: 4295194960, next_update 4295194960 last_committed: 4295194129) I met a lot such message during replay test. And it turns out those losing transaction numbers are actually coming from open/close request, because the open request (no create) will be destroyed(removed from replay list) after close, no matter if the transaction is committed or not, i.e. open request will not be replayed usually. So the question here is that should we have to allocate transno to open (no create) request? If it is necessary, then we at least should turn off these annoying console message, since they are not real error. |
| Comments |
| Comment by Mikhail Pershin [ 15/Apr/16 ] |
|
normally open replays don't issue such error messages. These particular messages show that we have a problem here: |
| Comment by Di Wang [ 15/Apr/16 ] |
normally open replays don't issue such error messages. From what I see, this is because of open replay. Note: in current implementation, open (non create) request will be destroyed (removed from the replay list) when close is done, and it does not check the committed transaction at all, that is why you will see a lot skip transno here. I do not think this is related with update recovery, at least in theory. Even in single MDT replay, you will see such annoying message. |
| Comment by Mikhail Pershin [ 15/Apr/16 ] |
|
you are right, that can be the reason of gaps after last_committed. in that case I agree that D_ERROR level is not needed for such messages |
| Comment by Andreas Dilger [ 03/Jun/16 ] |
|
Mike, is this just a matter of turning this CERROR() into a CDEBUG(), or is there some way we can limit it to specific cases where open is involved? I guess if the RPC is freed then there is no way to know whether it was an open or not? In any case, this message is spewing to the console during replay-dual testing (e.g. https://testing.hpdd.intel.com/test_logs/a6e8a554-2901-11e6-a0ce-5254006e85c2/show_text) and should really be fixed in some way before 2.9.0 is released. Is there a particular bug/patch that caused this message to appear that this ticket should link to, so we might better understand why it has started to appear? |
| Comment by Doug Oucharek (Inactive) [ 13/Jul/16 ] |
|
The patch submitted by Olaf was mis-titled and is not a fix for this ticket. There is no fix for this yet. |
| Comment by Olaf Weber [ 13/Jul/16 ] |
|
Thanks Doug. Apologies for the confusion caused by my typo. |
| Comment by Bob Glossman (Inactive) [ 13/Jul/16 ] |
|
if the needed fix is just to eliminate use of D_ERROR in this particular error message and always use D_HA, I can easily construct such a patch. If something more clever is required it probably needs an expert on this part of the code. |
| Comment by Andreas Dilger [ 19/Jul/16 ] |
|
Bob, could you please make a patch changing D_ERROR to D_HA. I'm not sure that is all that is needed, but it is a start. |
| Comment by Gerrit Updater [ 19/Jul/16 ] |
|
Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/21418 |
| Comment by Gerrit Updater [ 11/Aug/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21418/ |
| Comment by Peter Jones [ 11/Aug/16 ] |
|
Landed for 2.9 |