[LU-10632] recovery-small test 26a fails with ‘client not evicted from OST’ Created: 07/Feb/18  Updated: 04/Jun/22  Resolved: 06/Apr/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.5
Fix Version/s: Lustre 2.12.7, Lustre 2.15.0

Type: Bug Priority: Blocker
Reporter: James Nunez (Inactive) Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-2461 recovery-small test_26a: @@@@@@ FAIL:... Resolved
Related
is related to LU-2461 recovery-small test_26a: @@@@@@ FAIL:... Resolved
is related to LU-14947 recovery-small test_26a: client not e... Open
is related to LU-12066 recovery-small test 26b fails with “C... Open
is related to LU-15737 recovery-small: ll_ost00 - service th... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

recovery-small test_26a fails with the error

recovery-small test_26a: @@@@@@ FAIL: client not evicted from OST 

The following output comes from the failure at https://testing.hpdd.intel.com/test_sets/fe065d76-0baa-11e8-a7cd-52540065bddc

Looking at the test_log, we see that there is a problem getting the osc.*.state parameter on the client:

CMD: trevis-6vm6.trevis.hpdd.intel.com /usr/sbin/lctl get_param osc.lustre-OST0000-osc-ffff880039c89800.state
/usr/lib64/lustre/tests/test-framework.sh: line 8496: ((: > 1517964926: syntax error: operand expected (error token is "> 1517964926")

Test 26a calls the check_clients_evicted() routine to get the time of eviction and compare it against the “before” time. If the time at which the client osc was evicted can’t be read, there is nothing to compare against “before” and the test fails.

8482 # check that clients "oscs" was evicted after "before"
8483 check_clients_evicted() {
8484         local before=$1
8485         shift
8486         local oscs=${@}
8487         local osc
8488         local rc=0
8489 
8490         for osc in $oscs; do
8491                 ((rc++))
8492                 echo "Check state for $osc"
8493                 local evicted=$(do_facet client $LCTL get_param osc.$osc.state |
8494                         tail -n 3 | awk -F"[ [,]" \
8495                         '/EVICTED ]$/ { if (mx<$5) {mx=$5;} } END { print mx }')
8496                 if (($? == 0)) && (($evicted > $before)); then
8497                         echo "$osc is evicted at $evicted"
8498                         ((rc--))
8499                 fi
8500         done
8501 
8502         [ $rc -eq 0 ] || error "client not evicted from OST"
8503 }

As the code shows, when no value is obtained for $evicted, the arithmetic comparison on line 8496 becomes a syntax error, the return code is never decremented, and the test fails.
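The failure mode can be reproduced outside the test framework. The following is a minimal sketch, not the fix that landed for this ticket: parse_evicted_time and evicted_after are hypothetical helper names, and the state lines fed to them in the usage note are made-up input shaped to match the awk pattern in check_clients_evicted(). It shows how an explicit check for an empty $evicted avoids the "operand expected" syntax error. Note also that in the original code `$?` is tested right after `local evicted=$(...)`, so it reflects the exit status of `local` rather than of the lctl pipeline.

```shell
#!/bin/bash
# Sketch only: parse_evicted_time and evicted_after are hypothetical helpers,
# not functions from test-framework.sh.

# Print the newest EVICTED timestamp found on stdin.
# Prints an empty line when no EVICTED entry is present.
parse_evicted_time() {
	awk -F"[ [,]" \
		'/EVICTED ]$/ { if (mx < $5) { mx = $5 } } END { print mx }'
}

# Succeed only if an eviction time was found and it is later than "before".
evicted_after() {
	local evicted=$1
	local before=$2

	# Guard: with an empty value, (( $evicted > $before )) expands to
	# (( > before )), which bash rejects as a syntax error.
	[[ -n "$evicted" ]] || return 1
	(( evicted > before ))
}
```

For example, `printf ' - [ 200, EVICTED ]\n' | parse_evicted_time` prints 200, and `evicted_after "" 150` returns non-zero without printing a syntax error, so a caller can report the missing eviction record directly.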

There’s nothing obviously wrong in the console and dmesg logs, except that the client console log shows some LustreErrors:

[12802.505434] Lustre: DEBUG MARKER: df
[12805.745429] Lustre: Evicted from MGS (at 10.9.4.63@tcp) after server handle changed from 0x6728fd52bd37c36a to 0x6728fd52bd37c913
[12805.745586] LustreError: 14144:0:(file.c:4097:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
[12805.746256] LustreError: 14144:0:(lmv_obd.c:1387:lmv_statfs()) can't stat MDS #0 (lustre-MDT0000-mdc-ffff880039c89800), error -108
[12805.746260] LustreError: 14144:0:(llite_lib.c:1785:ll_statfs_internal()) md_statfs fails: rc = -108
[12805.760747] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param osc.lustre-OST0000-osc-ffff880039c89800.state

Here are logs for a few other recovery-small test 26a failures:
review-dne-part-1: https://testing.hpdd.intel.com/test_sets/4ec62074-f473-11e7-8c43-52540065bddc
failover: https://testing.hpdd.intel.com/test_sets/bd9510da-f6cd-11e7-bd00-52540065bddc
failover: https://testing.hpdd.intel.com/test_sets/01049780-ffef-11e7-a6ad-52540065bddc
failover: https://testing.hpdd.intel.com/test_sets/4b4b3144-06d6-11e8-a7cd-52540065bddc



 Comments   
Comment by Jian Yu [ 24/Aug/19 ]

There were more than 10 failures last week, which is affecting patch review testing on the master branch.

Comment by Chris Horn [ 24/Apr/20 ]

+1 on master https://testing.whamcloud.com/test_sessions/b2965d82-a459-4188-a035-72180920afb6

Comment by Chris Horn [ 12/Aug/20 ]

+1 on master https://testing.whamcloud.com/test_sets/92b3221e-d412-484a-abc7-53e1247a2d71

Comment by Andreas Dilger [ 04/Mar/21 ]

+1 on master https://testing.whamcloud.com/test_sets/4a1b1ac5-51a3-4ecf-87a6-b45d4e24ced0
and 10 more in the past week.

Comment by Andreas Dilger [ 11/Mar/21 ]

+1 on master https://testing.whamcloud.com/test_sessions/6ef53a85-b6ba-478b-b42a-2b12c33135bb
and 7 more in the past week

Comment by Gerrit Updater [ 11/Mar/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/42006
Subject: LU-10632 tests: recovery-small test_26 idle_timeout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 93e6a185980954e6df072c64831d4465080469e2

Comment by Andreas Dilger [ 23/Mar/21 ]

Test failed 16x in the past week.

Comment by Gerrit Updater [ 06/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/42006/
Subject: LU-10632 tests: recovery-small test_26 idle_timeout
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b4391fcdaf392a50bd1419342eca3b730c077ed2

Comment by Peter Jones [ 06/Apr/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 08/Apr/21 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43237
Subject: LU-10632 tests: recovery-small test_26 idle_timeout
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: f46f05cf20c59dce4703181bbc24928c54717798

Comment by Gerrit Updater [ 05/May/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43237/
Subject: LU-10632 tests: recovery-small test_26 idle_timeout
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: ff8f84a216b8ef432891220971b3ca6d5f1df39d
