Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Affects Version: Lustre 2.16.0
Description
This issue was created by maloo for S Buisson <sbuisson@ddn.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/c7030573-01fa-4889-ae5f-64824986f779
test_141 failed with the following error:
mgc lost locks (12 != 4)
Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/99081 - 4.18.0-477.21.1.el8_8.x86_64
servers: https://build.whamcloud.com/job/lustre-reviews/99081 - 4.18.0-477.21.1.el8_lustre.x86_64
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
recovery-small test_141 - mgc lost locks (12 != 4)
Issue Links
- is related to: LU-13316 frequent "Connection restored" messages without any prior errors/indicators (Resolved)
The majority of failure cases are (confusingly) either "oldc 0 != newc 23" or "oldc 23 != newc 0". This makes me think that the delay between "cancel_lru_locks MGC" and "do_facet ost1 ... lock_count" is racy: each read randomly returns "23" or "0", so sometimes the two counts happen to match (both "0" or both "23") and sometimes they don't.
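As a rough illustration of the suspected race (a hedged sketch only, not the actual test_141 code: the facet name, parameter path, and surrounding structure are inferred from the error message above, and the helpers are assumed to come from the standard Lustre test-framework.sh):

    # Assumes test-framework.sh is sourced, so cancel_lru_locks, do_facet,
    # error and $LCTL are available. Also assumes each node has a single
    # MGC namespace, so the wildcard returns exactly one value.
    mgc_lock_count() {  # hypothetical helper, not part of the framework
        do_facet "$1" "$LCTL get_param -n 'ldlm.namespaces.MGC*.lock_count'"
    }

    cancel_lru_locks MGC
    # Sampled immediately after the cancel: reads 0 if the cancels have
    # already been processed, or the old count (e.g. 23) if they have not.
    oldc=$(mgc_lock_count ost1)

    # ... MGS restart happens here in the real test ...

    # Independently reads 0 or 23, depending on whether the client has
    # re-enqueued its MGC locks yet.
    newc=$(mgc_lock_count ost1)

    # A single unsynchronized comparison: passes or fails purely on timing.
    [ "$oldc" -eq "$newc" ] || error "mgc lost locks ($oldc != $newc)"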
It isn't clear to me why the DLM locks are cancelled before the lock count is read on the OST before it is restarted. If the intent is to check that the client refreshes its locks, then it would make sense to leave some gap (e.g. 3-5s) so that the client can re-enqueue the MGC locks before they are counted. Afterward, the test should use wait_update() to poll until the lock count matches the baseline again, and only report a failure if it never converges.
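A hedged sketch of that restructuring (reusing the hypothetical mgc_lock_count helper from the sketch above; wait_update_facet is the facet-oriented wrapper around wait_update() in test-framework.sh, and the 5s gap and 90s timeout are illustrative values, not taken from the test):

    cancel_lru_locks MGC

    # Leave a short gap so the client can re-enqueue its MGC locks, making
    # the baseline the refreshed count rather than a transient 0.
    sleep 5
    oldc=$(mgc_lock_count ost1)

    # ... restart the MGS here, as the real test does ...

    # Instead of one racy read, poll until the count converges back to the
    # baseline, and only fail the test if it never does within the timeout.
    wait_update_facet ost1 \
        "$LCTL get_param -n 'ldlm.namespaces.MGC*.lock_count'" \
        "$oldc" 90 ||
        error "mgc lost locks (expected $oldc, got $(mgc_lock_count ost1))"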