[LU-14973] ost-pools test_20 test_23b: mds1:pool add failed testpool; lustre-OST[0000-0006/3] Created: 27/Aug/21  Updated: 27/Oct/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run. The test has failed about 20 times in the past 4 months with this exact error message, and about another 20 times with other error messages that may or may not be related:

https://testing.whamcloud.com/test_sets/f1ed827f-01f0-4992-af7e-ab7f204f6bb8

test_23b failed with the following error:

CMD: trevis-45vm1 lctl pool_new lustre.testpool
trevis-45vm1: Pool lustre.testpool created
CMD: trevis-45vm1 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool 2>/dev/null || echo foo
CMD: trevis-77vm7.trevis.whamcloud.com lctl get_param -n lov.lustre-*.pools.testpool 2>/dev/null || echo foo
CMD: trevis-45vm1 lctl pool_add lustre.testpool lustre-OST[0000-0006/3]
trevis-45vm1: OST lustre-OST0000_UUID added to pool lustre.testpool
trevis-45vm1: OST lustre-OST0003_UUID added to pool lustre.testpool
trevis-45vm1: OST lustre-OST0006_UUID added to pool lustre.testpool
CMD: trevis-45vm1 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool |
				sort -u | tr '\n' ' ' 
:
 ost-pools test_23b: @@@@@@ FAIL: mds1:pool add failed testpool; lustre-OST[0000-0006/3] 

The logs show no errors when adding the OSTs to or removing them from testpool, so I think they are added properly.

I suspect this is a bug or race in ost-pools.sh::add_pool(), where wait_update_facet() compares the get_param output against an expected "$tgt" string that is not formatted exactly the same way get_param prints it (e.g. extra spaces, different OST ordering).
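
To illustrate the kind of formatting mismatch suspected above, here is a minimal standalone sketch. It does not reproduce the real add_pool() or wait_update_facet() code; normalize_list is a hypothetical helper, and the get_param output is simulated with printf. It only shows that a "sort -u | tr '\n' ' '" pipeline, as seen in the log, leaves a trailing space that makes an exact string comparison fail even when the same OSTs are present.

```shell
#!/bin/bash
# Hypothetical helper (not part of ost-pools.sh): turn a whitespace-separated
# list into sorted, unique items joined by single spaces, with no leading or
# trailing whitespace.
normalize_list() {
    echo "$1" | tr -s ' \t\n' '\n' | sed '/^$/d' | sort -u | paste -sd ' ' -
}

# Simulated pool listing (assumption: one OST UUID per line, in arbitrary
# order), passed through the same pipeline that appears in the log.
# Command substitution strips trailing newlines but keeps the trailing space.
actual=$(printf '%s\n' lustre-OST0003_UUID lustre-OST0000_UUID \
    lustre-OST0006_UUID | sort -u | tr '\n' ' ')

# Expected list as a test might build it: no trailing space.
expected="lustre-OST0000_UUID lustre-OST0003_UUID lustre-OST0006_UUID"

if [ "$actual" = "$expected" ]; then
    echo "raw comparison: match"
else
    echo "raw comparison: mismatch"
fi

if [ "$(normalize_list "$actual")" = "$(normalize_list "$expected")" ]; then
    echo "normalized comparison: match"
else
    echo "normalized comparison: mismatch"
fi
```

If the real check turns out to be sensitive to this kind of difference, normalizing both sides before comparing is one possible way to make it robust. Whether that is the actual cause here still needs to be confirmed against the test script.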

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
ost-pools test_23b - mds1:pool add failed testpool; lustre-OST[0000-0006/3]



 Comments   
Comment by Andreas Dilger [ 27/Aug/21 ]

It appears that the test_23b failure is a follow-on from test_20 failing with a similar problem:

CMD: trevis-27vm6 lctl pool_add lustre.testpool2 lustre-OST[0001-0006/2]
trevis-27vm6: OST lustre-OST0001_UUID added to pool lustre.testpool2
trevis-27vm6: OST lustre-OST0003_UUID added to pool lustre.testpool2
trevis-27vm6: OST lustre-OST0005_UUID added to pool lustre.testpool2
CMD: trevis-27vm6 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool2 |
				sort -u | tr '\n' ' ' 
:
ost-pools test_20: @@@@@@ FAIL: mds1:pool add failed testpool2; lustre-OST[0001-0006/2] 