[LU-14947] recovery-small test_26a: client not evicted from OST Created: 17/Aug/21 Updated: 21/Dec/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.8, Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Mikhail Pershin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | failing_tests | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Qian Yingjin <qian@ddn.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/acf1ad3c-16d3-4cac-9376-40485f9ed5f1
test_26a failed with the following error:
client not evicted from OST
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Alena Nikitenko [ 03/Dec/21 ] |
|
Found a similar issue in the recovery-small test set on 2.12.8: https://testing.whamcloud.com/test_sets/48b57407-656d-46a2-bcb5-2809ffc48c29
recovery-small test_26a: @@@@@@ FAIL: client not evicted from OST
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5919:error()
= /usr/lib64/lustre/tests/test-framework.sh:9154:check_clients_evicted()
= /usr/lib64/lustre/tests/recovery-small.sh:1094:test_26a()
= /usr/lib64/lustre/tests/test-framework.sh:6222:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:6271:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:6111:run_test()
= /usr/lib64/lustre/tests/recovery-small.sh:1097:main()
Dumping lctl log to /autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e//recovery-small.test_26a.*.1637428776.log
CMD: onyx-109vm10,onyx-24vm6,onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 /usr/sbin/lctl dk > /autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e//recovery-small.test_26a.debug_log.\$(hostname -s).1637428776.log; |
| Comment by James Nunez (Inactive) [ 07/Dec/21 ] |
|
I think this is the same issue as reported in the closed ticket
For 2.12.8, we’ve seen it a couple of times in November.
From the master failure, we see that no eviction took place in the expected time frame:
Check state for lustre-OST0000-osc-ffff97ad5fb89800
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0000-osc-ffff97ad5fb89800.state
lustre-OST0000-osc-ffff97ad5fb89800 is evicted at 1634938357
Check state for lustre-OST0001-osc-ffff97ad5fb89800
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0001-osc-ffff97ad5fb89800.state
/usr/lib64/lustre/tests/test-framework.sh: line 9488: ((: > 1634938297: syntax error: operand expected (error token is "> 1634938297")
lustre-OST0001-osc-ffff97ad5fb89800 was not evicted after 1634938297:
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0001-osc-ffff97ad5fb89800.state
- [ 1634938350, DISCONN ]
- [ 1634938350, CONNECTING ]
- [ 1634938351, DISCONN ]
- [ 1634938351, CONNECTING ]
- [ 1634938352, DISCONN ]
- [ 1634938352, CONNECTING ]
- [ 1634938352, RECOVER ]
- [ 1634938352, FULL ]
Check state for lustre-OST0002-osc-ffff97ad5fb89800
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0002-osc-ffff97ad5fb89800.state
/usr/lib64/lustre/tests/test-framework.sh: line 9488: ((: > 1634938297: syntax error: operand expected (error token is "> 1634938297")
lustre-OST0002-osc-ffff97ad5fb89800 was not evicted after 1634938297:
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0002-osc-ffff97ad5fb89800.state
- [ 1634938350, DISCONN ]
- [ 1634938350, CONNECTING ]
- [ 1634938351, DISCONN ]
- [ 1634938351, CONNECTING ]
- [ 1634938352, DISCONN ]
- [ 1634938352, CONNECTING ]
- [ 1634938352, RECOVER ]
- [ 1634938352, FULL ]
Check state for lustre-OST0003-osc-ffff97ad5fb89800
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0003-osc-ffff97ad5fb89800.state
/usr/lib64/lustre/tests/test-framework.sh: line 9488: ((: > 1634938297: syntax error: operand expected (error token is "> 1634938297")
lustre-OST0003-osc-ffff97ad5fb89800 was not evicted after 1634938297:
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0003-osc-ffff97ad5fb89800.state
- [ 1634938350, DISCONN ]
- [ 1634938350, CONNECTING ]
- [ 1634938351, DISCONN ]
- [ 1634938351, CONNECTING ]
- [ 1634938352, DISCONN ]
- [ 1634938352, CONNECTING ]
- [ 1634938352, RECOVER ]
- [ 1634938352, FULL ]
Check state for lustre-OST0004-osc-ffff97ad5fb89800
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0004-osc-ffff97ad5fb89800.state
/usr/lib64/lustre/tests/test-framework.sh: line 9488: ((: > 1634938297: syntax error: operand expected (error token is "> 1634938297")
lustre-OST0004-osc-ffff97ad5fb89800 was not evicted after 1634938297:
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0004-osc-ffff97ad5fb89800.state
- [ 1634938350, DISCONN ]
- [ 1634938350, CONNECTING ]
- [ 1634938351, DISCONN ]
- [ 1634938351, CONNECTING ]
- [ 1634938352, DISCONN ]
- [ 1634938352, CONNECTING ]
- [ 1634938352, RECOVER ]
- [ 1634938352, FULL ]
Check state for lustre-OST0005-osc-ffff97ad5fb89800
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0005-osc-ffff97ad5fb89800.state
/usr/lib64/lustre/tests/test-framework.sh: line 9488: ((: > 1634938297: syntax error: operand expected (error token is "> 1634938297")
lustre-OST0005-osc-ffff97ad5fb89800 was not evicted after 1634938297:
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0005-osc-ffff97ad5fb89800.state
- [ 1634938350, DISCONN ]
- [ 1634938350, CONNECTING ]
- [ 1634938351, DISCONN ]
- [ 1634938351, CONNECTING ]
- [ 1634938352, DISCONN ]
- [ 1634938352, CONNECTING ]
- [ 1634938352, RECOVER ]
- [ 1634938352, FULL ]
Check state for lustre-OST0006-osc-ffff97ad5fb89800
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0006-osc-ffff97ad5fb89800.state
/usr/lib64/lustre/tests/test-framework.sh: line 9488: ((: > 1634938297: syntax error: operand expected (error token is "> 1634938297")
lustre-OST0006-osc-ffff97ad5fb89800 was not evicted after 1634938297:
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0006-osc-ffff97ad5fb89800.state
- [ 1634938350, DISCONN ]
- [ 1634938350, CONNECTING ]
- [ 1634938351, DISCONN ]
- [ 1634938351, CONNECTING ]
- [ 1634938352, DISCONN ]
- [ 1634938352, CONNECTING ]
- [ 1634938352, RECOVER ]
- [ 1634938352, FULL ]
recovery-small test_26a: @@@@@@ FAIL: client not evicted from OST
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:6320:error()
= /usr/lib64/lustre/tests/test-framework.sh:9498:check_clients_evicted()
= /usr/lib64/lustre/tests/recovery-small.sh:1087:test_26a()
In the 2.12.8 failure at https://testing.whamcloud.com/test_sets/0ee7439d-1b67-436e-bbb9-2bc9d0561594, it looks like an eviction did take place in time:
Check state for lustre-OST0001-osc-ffff98697bf0b000
CMD: onyx-66vm8 /usr/sbin/lctl get_param osc.lustre-OST0001-osc-ffff98697bf0b000.state
/usr/lib64/lustre/tests/test-framework.sh: line 9144: ((: > 1636932232: syntax error: operand expected (error token is "> 1636932232")
lustre-OST0001-osc-ffff98697bf0b000 was not evicted after 1636932232:
CMD: onyx-66vm8 /usr/sbin/lctl get_param osc.lustre-OST0001-osc-ffff98697bf0b000.state
- [ 1636932288, CONNECTING ]
- [ 1636932288, EVICTED ]
- [ 1636932288, RECOVER ]
- [ 1636932288, FULL ]
- [ 1636932308, CONNECTING ]
- [ 1636932308, IDLE ]
- [ 1636932313, CONNECTING ]
- [ 1636932313, FULL ] |
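A note on the "((: > 1634938297: syntax error: operand expected" lines in the logs above: bash prints this when the left-hand operand of an arithmetic comparison expands to an empty string, which suggests the check had no eviction timestamp to compare. That includes the 2.12.8 run, where the state history does list an EVICTED entry (1636932288, later than 1636932232). The following is only a sketch of a guarded comparison, not the actual test-framework.sh code. The function names last_evicted_time and evicted_after are hypothetical, and the input is assumed to be the "- [ timestamp, STATE ]" history lines as they appear in the logs.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; NOT the real test-framework.sh code.

# Read an import state history on stdin (lines such as
# " - [ 1636932288, EVICTED ]") and print the timestamp of the last
# EVICTED entry. Prints nothing when the history has no such entry.
last_evicted_time() {
	sed -n 's/.*\[ *\([0-9][0-9]*\), *EVICTED *].*/\1/p' | tail -n 1
}

# Return 0 if the history on stdin shows an eviction strictly after
# the timestamp given as $1, and 1 otherwise.
evicted_after() {
	local before=$1
	local evicted

	evicted=$(last_evicted_time)
	# Without this guard, an empty $evicted turns
	# "(( $evicted > $before ))" into "(( > $before ))", which is
	# the "operand expected" syntax error seen in the logs.
	[[ -n "$evicted" ]] || return 1
	(( evicted > before ))
}
```

Fed the 2.12.8 history above, evicted_after 1636932232 would succeed, while the master histories (no EVICTED entry) would fail cleanly instead of raising a syntax error. Whether the real check loses the timestamp in parsing or never receives it is not something these logs settle.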
| Comment by Vladimir Saveliev [ 14/Dec/21 ] |
|
+1 on master |
| Comment by Vladimir Saveliev [ 15/Dec/21 ] |
|
+1 on master |
| Comment by Vladimir Saveliev [ 20/Dec/21 ] |
This was induced by https://review.whamcloud.com/#/c/43834/ |
| Comment by Colin Faber [ 28/Sep/22 ] |
|
Hi tappro, here's another one we're seeing. Can you please take a look? Thank you! |
| Comment by Andreas Dilger [ 21/Oct/22 ] |
|
There is a patch in LU-12066 (duplicate of this one) which may fix this problem for test_26b, and the same change may also fix test_26a. I would close this as a duplicate, but the error message has changed slightly and this ticket matches the current message, and I don't want to close LU-12066 because it has the patch. First step should be to review/rebase that patch to see if it fixes the problem. |