[LU-15123] sanity-quota: test_7a Error: 'reintegration failed' Created: 18/Oct/21  Updated: 02/Aug/23  Resolved: 26/Apr/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0, Lustre 2.15.3
Fix Version/s: Lustre 2.16.0, Lustre 2.15.4

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13795 sanity-quota test_7a: Update not seen... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for paf <pfarrell@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/ded8a9a9-b77c-41ca-bb53-105290b2709e



 Comments   
Comment by Chris Horn [ 02/Dec/21 ]

+1 on master - https://testing.whamcloud.com/test_sets/2d408dbe-88f6-428a-ae7e-0c8796fb3207

Comment by Sergey Cheremencev [ 22/Dec/21 ]

+1 on master - https://testing.whamcloud.com/test_sets/f07f8d61-0faa-41ad-9217-c238bd4c2bb0

I guess it fails because OST0000 is still in recovery, waiting for client 1, and that blocks reintegration from starting (and finishing). Eventually it evicts this client:

[15386.910732] Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 5 clients reconnect
...
[15485.424725] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity-quota test_7a: @@@@@@ FAIL: reintegration failed 
[15485.858740] Lustre: DEBUG MARKER: sanity-quota test_7a: @@@@@@ FAIL: reintegration failed
[15486.354991] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest-1/2021-12-15/lustre-reviews_review-dne-zfs-part-4_85112_1_13_4b065c95-2177-4a6a-b5c8-32b025199627//sanity-quota.test_7a.debug_log.$(hostname -s).1639605011.log;
               		dmesg > /autotest/autotest-1/2021-12-15/lustre-review
[15488.876821] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
[15488.878099] Lustre: lustre-OST0000: disconnecting 1 stale clients 
[15489.069798] Lustre: lustre-OST0000: Recovery over after 1:42, of 5 clients 4 recovered and 1 was evicted.

Logs from client1 dmesg:

[15387.000236] Lustre: lustre-OST0000-osc-ffff9247ca727000: Connection to lustre-OST0000 (at 10.240.26.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
[15407.479024] Lustre: lustre-OST0001-osc-ffff9247ca727000: disconnect after 23s idle
[15409.071515] Lustre: lustre-OST0000-osc-ffff9247ca727000: Connection restored to 10.240.26.143@tcp (at 10.240.26.143@tcp)
[15474.038827] LustreError: 11-0: lustre-OST0000-osc-ffff9247ca727000: operation ost_disconnect to node 10.240.26.143@tcp failed: rc = -107
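For triage across many runs, the recovery summary line from the OST log can be reduced to numbers. This is a minimal sketch in Python: the regular expression is an assumption derived from the single summary line quoted above (it is not a format defined by Lustre), and the function name parse_recovery_summary is hypothetical.

```python
import re

# Summary line copied from the lustre-OST0000 console log quoted above.
LINE = ("[15489.069798] Lustre: lustre-OST0000: Recovery over after 1:42, "
        "of 5 clients 4 recovered and 1 was evicted.")

# Assumption: the pattern is inferred from this one line only; other
# wordings (for example a plural form) are not handled.
PATTERN = re.compile(
    r"Recovery over after (?P<min>\d+):(?P<sec>\d+), "
    r"of (?P<total>\d+) clients (?P<ok>\d+) recovered "
    r"and (?P<evicted>\d+) was evicted"
)


def parse_recovery_summary(line):
    """Return recovery duration in seconds and client counts, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "seconds": int(m["min"]) * 60 + int(m["sec"]),
        "total": int(m["total"]),
        "recovered": int(m["ok"]),
        "evicted": int(m["evicted"]),
    }


summary = parse_recovery_summary(LINE)
print(summary)
```

A duration above the test's wait limit together with a non-zero evicted count would match the pattern described in this comment.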

 

Comment by Andreas Dilger [ 24/Mar/22 ]

+3 on master, all on the same patch:
https://testing.whamcloud.com/test_sessions/dbc4c026-0863-41b3-b17c-b695415fa8aa
https://testing.whamcloud.com/test_sessions/3db9b21f-fd9b-49a0-bbd7-5dda5aea1be9
https://testing.whamcloud.com/test_sessions/e22114c8-b11d-4494-ae19-030d82261043

Comment by Andreas Dilger [ 18/Oct/22 ]

Still being hit on master, 14/310 runs in the past week.

Comment by Qian Yingjin [ 17/Nov/22 ]

+1 on master:
https://testing.whamcloud.com/test_sets/7cd51716-a038-4e24-a46b-4a14e83cc1a2

Comment by Alexander Zarochentsev [ 14/Dec/22 ]

+1 on master:
https://testing.whamcloud.com/test_sets/db2d7940-4287-4ff0-9435-81c6520360b7

Comment by Nikitas Angelinas [ 22/Dec/22 ]

+1 on master: https://testing.whamcloud.com/test_sets/dc29fa4d-ef8c-4838-a182-7a544385f4cc

Comment by Nikitas Angelinas [ 17/Jan/23 ]

+1 on master: https://testing.whamcloud.com/test_sets/3d837ba0-73c7-4737-9894-5bbca2c9b479

Comment by Jian Yu [ 18/Apr/23 ]

+1 on b2_15 branch: https://testing.whamcloud.com/test_sets/150ed79f-6d70-4048-b875-56a9bccc54cf

Comment by Alex Zhuravlev [ 19/Apr/23 ]

[13964.128411] Lustre: lustre-OST0000: Imperative Recovery enabled, recovery window shrunk from 60-180 down to 60-180
[13965.655492] Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 5 clients reconnect
...
[14061.885567] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity-quota test_7a: @@@@@@ FAIL: reintegration failed
[14067.469119] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
[14067.470635] Lustre: lustre-OST0000: disconnecting 1 stale clients
[14067.787175] Lustre: lustre-OST0000: Recovery over after 1:42, of 5 clients 4 recovered and 1 was evicted.


Reintegration starts only when recovery is over. In this case recovery was stuck waiting for a missing client (evicted in the end) and took 102 seconds, while test 7a waits 90s at most.
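The timing argument above can be checked with simple arithmetic. The sketch below is in Python rather than the shell used by sanity-quota, and it is not the merged patch. It only models the idea in the patch subject (check reintegration after recovery). The function names are hypothetical, REINTEGRATION_SECONDS is an invented figure that the ticket does not give, and RECOVERY_LIMIT takes the upper bound of the "60-180" window in the log on the assumption that it is the hard recovery limit.

```python
RECOVERY_SECONDS = 102      # "Recovery over after 1:42" in the log above
TEST_TIMEOUT = 90           # maximum wait in test_7a, per the comment above
RECOVERY_LIMIT = 180        # assumption: upper bound of the "60-180" window
REINTEGRATION_SECONDS = 5   # hypothetical; not stated in the ticket


def single_phase_wait(recovery, reintegration, timeout):
    """One clock covers both phases.

    Reintegration cannot start until recovery ends, so the test passes
    only if both phases together fit inside the timeout.
    """
    return recovery + reintegration <= timeout


def two_phase_wait(recovery, reintegration, timeout, recovery_limit):
    """Wait for recovery first, then give reintegration the full timeout."""
    return recovery <= recovery_limit and reintegration <= timeout


old_result = single_phase_wait(
    RECOVERY_SECONDS, REINTEGRATION_SECONDS, TEST_TIMEOUT)
new_result = two_phase_wait(
    RECOVERY_SECONDS, REINTEGRATION_SECONDS, TEST_TIMEOUT, RECOVERY_LIMIT)

print("single-phase wait passes:", old_result)
print("two-phase wait passes:", new_result)
```

With the figures from the log, the single-phase check fails for any reintegration time, because recovery alone (102s) already exceeds the 90s budget.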

Comment by Gerrit Updater [ 19/Apr/23 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50688
Subject: LU-15123 tests: quota reintegration starts after recovery
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e6a2cb8c1aa0a96ec1d6e4603132635459ac0615

Comment by Gerrit Updater [ 26/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50688/
Subject: LU-15123 tests: check quota reintegration after recovery
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4432b6e2824775e292f96e202d6fc0db231bc749

Comment by Peter Jones [ 26/Apr/23 ]

Landed for 2.16

Comment by Gerrit Updater [ 06/Jun/23 ]

"Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51233
Subject: LU-15123 tests: check quota reintegration after recovery
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 8002f7d111f817032ffe8ee3485abf5e4472a148

Comment by Gerrit Updater [ 02/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51233/
Subject: LU-15123 tests: check quota reintegration after recovery
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 13805e3a2d4f520e297bc408d94b9971a6094f9a

Generated at Sat Feb 10 03:15:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.