
recovery-small test_141: mgc lost locks

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.17.0, Lustre 2.16.0
    • Severity: 3

Description

      This issue was created by maloo for S Buisson <sbuisson@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/c7030573-01fa-4889-ae5f-64824986f779

      test_141 failed with the following error:

      mgc lost locks (12 != 4)
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/99081 - 4.18.0-477.21.1.el8_8.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/99081 - 4.18.0-477.21.1.el8_lustre.x86_64

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      recovery-small test_141 - mgc lost locks (12 != 4)

Activity

adilger Andreas Dilger added a comment -

The majority of failure cases are (confusingly) either "oldc 0 != newc 23" or "oldc 23 != newc 0". This makes me think that the delay between "cancel_lru_locks MGC" and "do_facet ost1 ... lock_count" is racy: each read randomly returns "23" or "0", so sometimes the two counts match (both "0" or both "23") and sometimes they don't.

It isn't clear to me why the DLM locks are cancelled before being counted on the OST before it is restarted. If that is to check that the client is refreshing its locks, then it would make sense to leave a gap (e.g. 3-5s) so that the client refreshes the MGC locks before they are counted. Afterward, the test should use wait_update() until the lock counts match before declaring a failure.
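A minimal sketch of that suggested restructuring, assuming the standard test-framework helpers (cancel_lru_locks, do_facet, wait_update, facet_active_host, error). The mgc_lock_count helper, the 5s grace period, the 90s wait limit, and the assumption of a single MGC namespace on ost1 are all illustrative, not taken from the actual fix in change 55591:

    # Hypothetical helper: read the MGC DLM lock count on ost1; assumes
    # the glob matches exactly one MGC namespace on that node.
    mgc_lock_count() {
        do_facet ost1 "$LCTL get_param -n 'ldlm.namespaces.MGC*.lock_count'"
    }

    cancel_lru_locks MGC
    sleep 5    # grace period so the MGC can re-enqueue its config locks

    oldc=$(mgc_lock_count)

    # ... restart the MGS here, as test_141 does ...

    # Poll instead of sampling once: wait until the post-restart count
    # settles back to the pre-restart value before declaring failure.
    wait_update $(facet_active_host ost1) \
        "$LCTL get_param -n 'ldlm.namespaces.MGC*.lock_count'" \
        "$oldc" 90 ||
        error "mgc lost locks ($(mgc_lock_count) != $oldc)"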
yujian Jian Yu added a comment - Lustre 2.16.0 RC5: https://testing.whamcloud.com/test_sets/631b2da4-fd5d-41b6-ab19-f1838b9c4ac5
yujian Jian Yu added a comment -

The failure occurred consistently in failover-part-1 test sessions.

sebastien Sebastien Buisson added a comment -

From the recent test failures, it looks like the MGC lock count is quite erratic. Sometimes the lock count before the MGS restart is greater than after the restart, and sometimes it is the other way around. Sometimes the MGC lock count before the MGS restart is even 0, which is surprising for a running Lustre file system.
See the error messages on this test results page:
https://testing.whamcloud.com/search?status%5B%5D=FAIL&test_set_script_id=f36cabd0-32c3-11e0-a61c-52540025f9ae&sub_test_script_id=be891259-db0f-49c1-bd23-eafc220a3fc8&start_date=2024-09-27&end_date=2024-10-03&source=sub_tests#redirect

I initially introduced recovery-small test_141, but it does not seem that the problem is in the test itself. Unfortunately, I am not an ldlm lock specialist.
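For reference, the counter the test compares can be read by hand on the OSS; a minimal example, where the lock_count value and the NID embedded in the namespace name are illustrative only:

    # Read the DLM lock count held by this node's MGC; the namespace name
    # embeds the MGS NID, hence the glob.
    lctl get_param 'ldlm.namespaces.MGC*.lock_count'
    # e.g. ldlm.namespaces.MGC192.168.1.10@tcp.lock_count=23

    # cancel_lru_locks in the test framework boils down to this: drop all
    # unused locks from the LRU. The count then reads 0 until the MGC
    # re-enqueues its config locks, which is the window where the
    # 0-vs-23 race described above can appear.
    lctl set_param 'ldlm.namespaces.MGC*.lru_size=clear'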
yujian Jian Yu added a comment - +1 on master branch: https://testing.whamcloud.com/test_sets/91343880-9495-40dd-b77f-895ac8f90176

adilger Andreas Dilger added a comment - This is still failing regularly: https://testing.whamcloud.com/search?horizon=2332800&status%5B%5D=FAIL&test_set_script_id=f36cabd0-32c3-11e0-a61c-52540025f9ae&sub_test_script_id=be891259-db0f-49c1-bd23-eafc220a3fc8&source=sub_tests#redirect
pjones Peter Jones added a comment -

Merged for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55591/
            Subject: LU-17165 tests: fix recovery-small test_141
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 755c9c0f78a777b342a42a74aa8fb93d04e7cad8

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55591/ Subject: LU-17165 tests: fix recovery-small test_141 Project: fs/lustre-release Branch: master Current Patch Set: Commit: 755c9c0f78a777b342a42a74aa8fb93d04e7cad8

            "Sebastien Buisson <sbuisson@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55591
            Subject: LU-17165 tests: fix recovery-small test_141
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c7b45f61bda6d8efed4ec7ccda79e2d11aad823d

            gerrit Gerrit Updater added a comment - "Sebastien Buisson <sbuisson@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55591 Subject: LU-17165 tests: fix recovery-small test_141 Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: c7b45f61bda6d8efed4ec7ccda79e2d11aad823d

adilger Andreas Dilger added a comment -

This test was added in patch https://review.whamcloud.com/37344 "LU-13116 mgc: do not lose sptlrpc config lock". It isn't clear whether there is a real problem, or whether some unrelated MGC lock is being canceled during the test.

People

    Assignee: Core Lustre Triage
    Reporter: Maloo
    Votes: 0
    Watchers: 6

Dates

    Created:
    Updated: