[LU-14947] recovery-small test_26a: client not evicted from OST Created: 17/Aug/21  Updated: 21/Dec/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.8, Lustre 2.15.0
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Maloo
Assignee: Mikhail Pershin
Resolution: Unresolved
Votes: 0
Labels: failing_tests

Issue Links:
Related
is related to LU-10632 recovery-small test 26a fails with ‘c... Resolved
is related to LU-14748 gcc9 (RHEL 7.x with devtoolset-9) bui... Resolved
is related to LU-12066 recovery-small test 26b fails with “C... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Qian Yingjin <qian@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/acf1ad3c-16d3-4cac-9376-40485f9ed5f1

test_26a failed with the following error:

client not evicted from OST

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
recovery-small test_26a - client not evicted from OST



 Comments   
Comment by Alena Nikitenko [ 03/Dec/21 ]

Found similar issue in recovery-small test set on 2.12.8: https://testing.whamcloud.com/test_sets/48b57407-656d-46a2-bcb5-2809ffc48c29 

recovery-small test_26a: @@@@@@ FAIL: client not evicted from OST 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5919:error()
  = /usr/lib64/lustre/tests/test-framework.sh:9154:check_clients_evicted()
  = /usr/lib64/lustre/tests/recovery-small.sh:1094:test_26a()
  = /usr/lib64/lustre/tests/test-framework.sh:6222:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:6271:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:6111:run_test()
  = /usr/lib64/lustre/tests/recovery-small.sh:1097:main()
Dumping lctl log to /autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e//recovery-small.test_26a.*.1637428776.log
CMD: onyx-109vm10,onyx-24vm6,onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 /usr/sbin/lctl dk > /autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e//recovery-small.test_26a.debug_log.\$(hostname -s).1637428776.log; 
Comment by James Nunez (Inactive) [ 07/Dec/21 ]

I think this is the same issue as reported in the closed ticket LU-10632.

For 2.12.8 we’ve seen it a couple of times in November:
2.12.7.28 - https://testing.whamcloud.com/test_sets/0ee7439d-1b67-436e-bbb9-2bc9d0561594
2.12.8 - https://testing.whamcloud.com/test_sets/48b57407-656d-46a2-bcb5-2809ffc48c29
and for master once in the past two months
2.14.55.43 - https://testing.whamcloud.com/test_sets/d594a8a7-9153-4741-8289-1258864c9c80

From the master failure, we see that OST0000 was evicted, but OST0001 through OST0006 were not evicted in the expected time frame:

Check state for lustre-OST0000-osc-ffff97ad5fb89800
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0000-osc-ffff97ad5fb89800.state
lustre-OST0000-osc-ffff97ad5fb89800 is evicted at 1634938357
Check state for lustre-OST0001-osc-ffff97ad5fb89800
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0001-osc-ffff97ad5fb89800.state
/usr/lib64/lustre/tests/test-framework.sh: line 9488: ((: > 1634938297: syntax error: operand expected (error token is "> 1634938297")
lustre-OST0001-osc-ffff97ad5fb89800 was not evicted after 1634938297:
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0001-osc-ffff97ad5fb89800.state
 - [ 1634938350, DISCONN ]
 - [ 1634938350, CONNECTING ]
 - [ 1634938351, DISCONN ]
 - [ 1634938351, CONNECTING ]
 - [ 1634938352, DISCONN ]
 - [ 1634938352, CONNECTING ]
 - [ 1634938352, RECOVER ]
 - [ 1634938352, FULL ]
Check state for lustre-OST0002-osc-ffff97ad5fb89800
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0002-osc-ffff97ad5fb89800.state
/usr/lib64/lustre/tests/test-framework.sh: line 9488: ((: > 1634938297: syntax error: operand expected (error token is "> 1634938297")
lustre-OST0002-osc-ffff97ad5fb89800 was not evicted after 1634938297:
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0002-osc-ffff97ad5fb89800.state
 - [ 1634938350, DISCONN ]
 - [ 1634938350, CONNECTING ]
 - [ 1634938351, DISCONN ]
 - [ 1634938351, CONNECTING ]
 - [ 1634938352, DISCONN ]
 - [ 1634938352, CONNECTING ]
 - [ 1634938352, RECOVER ]
 - [ 1634938352, FULL ]
Check state for lustre-OST0003-osc-ffff97ad5fb89800
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0003-osc-ffff97ad5fb89800.state
/usr/lib64/lustre/tests/test-framework.sh: line 9488: ((: > 1634938297: syntax error: operand expected (error token is "> 1634938297")
lustre-OST0003-osc-ffff97ad5fb89800 was not evicted after 1634938297:
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0003-osc-ffff97ad5fb89800.state
 - [ 1634938350, DISCONN ]
 - [ 1634938350, CONNECTING ]
 - [ 1634938351, DISCONN ]
 - [ 1634938351, CONNECTING ]
 - [ 1634938352, DISCONN ]
 - [ 1634938352, CONNECTING ]
 - [ 1634938352, RECOVER ]
 - [ 1634938352, FULL ]
Check state for lustre-OST0004-osc-ffff97ad5fb89800
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0004-osc-ffff97ad5fb89800.state
/usr/lib64/lustre/tests/test-framework.sh: line 9488: ((: > 1634938297: syntax error: operand expected (error token is "> 1634938297")
lustre-OST0004-osc-ffff97ad5fb89800 was not evicted after 1634938297:
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0004-osc-ffff97ad5fb89800.state
 - [ 1634938350, DISCONN ]
 - [ 1634938350, CONNECTING ]
 - [ 1634938351, DISCONN ]
 - [ 1634938351, CONNECTING ]
 - [ 1634938352, DISCONN ]
 - [ 1634938352, CONNECTING ]
 - [ 1634938352, RECOVER ]
 - [ 1634938352, FULL ]
Check state for lustre-OST0005-osc-ffff97ad5fb89800
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0005-osc-ffff97ad5fb89800.state
/usr/lib64/lustre/tests/test-framework.sh: line 9488: ((: > 1634938297: syntax error: operand expected (error token is "> 1634938297")
lustre-OST0005-osc-ffff97ad5fb89800 was not evicted after 1634938297:
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0005-osc-ffff97ad5fb89800.state
 - [ 1634938350, DISCONN ]
 - [ 1634938350, CONNECTING ]
 - [ 1634938351, DISCONN ]
 - [ 1634938351, CONNECTING ]
 - [ 1634938352, DISCONN ]
 - [ 1634938352, CONNECTING ]
 - [ 1634938352, RECOVER ]
 - [ 1634938352, FULL ]
Check state for lustre-OST0006-osc-ffff97ad5fb89800
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0006-osc-ffff97ad5fb89800.state
/usr/lib64/lustre/tests/test-framework.sh: line 9488: ((: > 1634938297: syntax error: operand expected (error token is "> 1634938297")
lustre-OST0006-osc-ffff97ad5fb89800 was not evicted after 1634938297:
CMD: trevis-209vm1.trevis.whamcloud.com /usr/sbin/lctl get_param osc.lustre-OST0006-osc-ffff97ad5fb89800.state
 - [ 1634938350, DISCONN ]
 - [ 1634938350, CONNECTING ]
 - [ 1634938351, DISCONN ]
 - [ 1634938351, CONNECTING ]
 - [ 1634938352, DISCONN ]
 - [ 1634938352, CONNECTING ]
 - [ 1634938352, RECOVER ]
 - [ 1634938352, FULL ]
 recovery-small test_26a: @@@@@@ FAIL: client not evicted from OST 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6320:error()
  = /usr/lib64/lustre/tests/test-framework.sh:9498:check_clients_evicted()
  = /usr/lib64/lustre/tests/recovery-small.sh:1087:test_26a()
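
The "((: > 1634938297: syntax error: operand expected" lines in the log above are what bash prints when the left-hand operand of an arithmetic comparison expands to an empty string. The following is a minimal sketch of a guarded comparison, not the code at test-framework.sh line 9488, which may differ. The helper name is_later_than is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper (not part of test-framework.sh): succeed only when
# $1 is a non-empty decimal timestamp strictly greater than $2.
is_later_than() {
	local ts=$1 before=$2

	# Treat an empty or non-numeric value as "no timestamp found" instead of
	# handing it to (( )), where an unquoted empty expansion such as
	# (( $ts > $before )) produces "operand expected".
	[[ $ts =~ ^[0-9]+$ ]] || return 1
	[[ $before =~ ^[0-9]+$ ]] || return 1

	# 10# forces base 10 so a leading zero is never read as octal.
	(( 10#$ts > 10#$before ))
}

# Example using the reference time from the log above and an empty
# eviction timestamp, as if no eviction time had been found.
if is_later_than "" 1634938297; then
	echo "evicted"
else
	echo "not evicted"
fi
```

With a guard of this kind the check still reports a failure when no eviction time is found, but without the arithmetic syntax error in the output.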

In the 2.12.7.28 failure at https://testing.whamcloud.com/test_sets/0ee7439d-1b67-436e-bbb9-2bc9d0561594, it looks like an eviction did take place in time:

Check state for lustre-OST0001-osc-ffff98697bf0b000
CMD: onyx-66vm8 /usr/sbin/lctl get_param osc.lustre-OST0001-osc-ffff98697bf0b000.state
/usr/lib64/lustre/tests/test-framework.sh: line 9144: ((: > 1636932232: syntax error: operand expected (error token is "> 1636932232")
lustre-OST0001-osc-ffff98697bf0b000 was not evicted after 1636932232:
CMD: onyx-66vm8 /usr/sbin/lctl get_param osc.lustre-OST0001-osc-ffff98697bf0b000.state
 - [ 1636932288, CONNECTING ]
 - [ 1636932288, EVICTED ]
 - [ 1636932288, RECOVER ]
 - [ 1636932288, FULL ]
 - [ 1636932308, CONNECTING ]
 - [ 1636932308, IDLE ]
 - [ 1636932313, CONNECTING ]
 - [ 1636932313, FULL ]
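
The history above contains an EVICTED entry at 1636932288, which is later than the reference time 1636932232 printed in the "was not evicted after" message. The following is a rough sketch of scanning a state history for such an entry, assuming the " - [ <epoch>, <STATE> ]" line format shown in these logs. The helper name evicted_since is hypothetical, and this is not the check that check_clients_evicted() actually performs.

```bash
#!/bin/bash
# Hypothetical helper: read state-history lines on stdin and succeed if any
# EVICTED entry has a timestamp strictly greater than $1.
evicted_since() {
	local before=$1 line
	# Matches lines such as " - [ 1636932288, EVICTED ]" and captures the epoch.
	local re='\[ *([0-9]+), *EVICTED *\]'
	local found=1

	while IFS= read -r line; do
		if [[ $line =~ $re ]] && (( 10#${BASH_REMATCH[1]} > 10#$before )); then
			found=0
		fi
	done
	return $found
}

# Example with the first entries of the history shown above.
evicted_since 1636932232 <<'EOF'
 - [ 1636932288, CONNECTING ]
 - [ 1636932288, EVICTED ]
 - [ 1636932288, RECOVER ]
 - [ 1636932288, FULL ]
EOF
echo "exit status: $?"
```

On a live client the same function could presumably be fed from "lctl get_param -n osc.<device>.state", where <device> stands for the OSC device name; that usage is an assumption and has not been checked against the test framework.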
Comment by Vladimir Saveliev [ 14/Dec/21 ]

+1 on master
https://testing.whamcloud.com/test_sets/f4a1c4f1-b91b-4648-abcd-19f452939b6c

Comment by Vladimir Saveliev [ 15/Dec/21 ]

+1 on master
https://testing.whamcloud.com/test_sets/d0a65bcc-9af0-4224-9727-41507dfce00f

Comment by Vladimir Saveliev [ 20/Dec/21 ]

+1 on master
https://testing.whamcloud.com/test_sets/d0a65bcc-9af0-4224-9727-41507dfce00f

This was induced by https://review.whamcloud.com/#/c/43834/

Comment by Colin Faber [ 28/Sep/22 ]

Hi tappro 

Here's another one we're seeing. Can you please take a look? Thank you!

Comment by Andreas Dilger [ 21/Oct/22 ]

There is a patch in LU-12066 (duplicate of this one) which may fix this problem for test_26b, and the same change may also fix test_26a. I would close this as a duplicate, but the error message has changed slightly and this ticket matches the current message, and I don't want to close LU-12066 because it has the patch. First step should be to review/rebase that patch to see if it fixes the problem.

Generated at Sat Feb 10 03:14:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.