[LU-12060] ost-pools test 24 fails with 'Pool '' not on /mnt/lustre/d24.ost-pools/dir3/f24.ost-pools0:test_85b' Created: 11/Mar/19  Updated: 20/Mar/19  Resolved: 20/Mar/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.10.7
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Patrick Farrell (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9000 ost-pools test_24: Pool '' not on /mn... Closed
is related to LU-12061 Destroyed pools still added for new f... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

ost-pools test_24 fails with 'Pool '' not on /mnt/lustre/d24.ost-pools/dir3/f24.ost-pools0:test_85b'. We only see this failure in full test sessions, so some test suite that runs before ost-pools may be failing or not cleaning up after itself. The pool name in the error message, 'test_85b', is strange for this test. Test 24 fails at line 1440 of the following ost-pools test script:

1415         for i in 1 2 3 4; do
1416                 dir=${POOL_ROOT}/dir${i}
1417                 local pool
1418                 local pool1
1419                 local count
1420                 local count1
1421                 local index
1422                 local size
1423                 local size1
1424 
1425                 createmany -o $dir/${tfile} $numfiles ||
1426                         error "createmany $dir/${tfile} failed!"
1427                 pool=$($LFS getstripe --pool $dir)
1428                 index=$($LFS getstripe -i $dir)
1429                 size=$($LFS getstripe -S $dir)
1430                 count=$($LFS getstripe -c $dir)
1431 
1432                 for file in $dir/*; do
1433                         if [ "$pool" != "" ]; then
1434                                 check_file_in_pool $file $pool
1435                         fi
1436                         pool1=$($LFS getstripe --pool $file)
1437                         count1=$($LFS getstripe -c $file)
1438                         size1=$($LFS getstripe -S $file)
1439                         [[ "$pool" != "$pool1" ]] &&
1440                                 error "Pool '$pool' not on $file:$pool1"
1441                         [[ "$count" != "$count1" ]] &&
1442                                 [[ "$count" != "-1" ]] &&
1443                                         error "Stripe count $count not on"\
1444                                                 "$file:$count1"
1445                         [[ "$count1" != "$OSTCOUNT" ]] &&
1446                                 [[ "$count" = "-1" ]] &&
1447                                         error "Stripe count $count1 not on"\
1448                                                 "$file:$OSTCOUNT"
1449                         [[ "$size" != "$size1" ]] && [[ "$size" != "0" ]] &&
1450                                 error "Stripe size $size not on $file:$size1"
1451                 done
1452         done
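
For reference, check_file_in_pool (called at line 1434 above) lives in the test framework; the following is only an illustrative sketch of what such a helper has to do, not the actual implementation. It verifies that every OST index striping the file is a member of the pool, using the same $LFS/$LCTL/$FSNAME variables the surrounding script uses:

# illustrative sketch only, not the real test-framework helper
check_file_in_pool_sketch() {
	local file=$1 pool=$2
	local ost
	# the obdidx column of the object table is decimal, while pool
	# members are named <fsname>-OST<hex index>_UUID, so convert
	# each index to four hex digits before matching
	for ost in $($LFS getstripe "$file" |
		     awk '/^[[:space:]]+[0-9]+/ { printf "%04x\n", $1 }'); do
		$LCTL pool_list $FSNAME.$pool | grep -q "OST${ost}_UUID" ||
			return 1
	done
	return 0
}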

Looking at a recent 2.10.7 RC1 test failure, https://testing.whamcloud.com/test_sets/8bc401ce-4320-11e9-8e92-52540065bddc, ost-pools is the only failing suite in the entire session. In the MDS (vm4) debug log, we see lod_find_pool() being called with the unknown pool name about ten times:

00010000:00010000:1.0:1552189237.557826:0:12918:0:(ldlm_request.c:504:ldlm_cli_enqueue_local()) ### client-side local enqueue handler, new lock created ns: mdt-lustre-MDT0000_UUID lock: ffff8f36d7c0b200/0xe4fa7e4c800ca51f lrc: 3/0,1 mode: PW/PW res: [0x20006bac1:0x1972a:0x0].0xa55d7462 bits 0x2 rrc: 2 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 12918 timeout: 0 lvb_type: 0
00020000:01000000:1.0:1552189237.557892:0:12918:0:(lod_pool.c:920:lod_find_pool()) lustre-MDT0000-osd: request for an unknown pool (test_85b)
00000004:00080000:1.0:1552189237.557972:0:12918:0:(osp_object.c:1517:osp_create()) lustre-OST0000-osc-MDT0000: Wrote last used FID: [0x100000000:0x20219:0x0], index 0: 0
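
The unknown-pool lookups are easy to spot by hand; a sketch of the checks, using the MDT target name from the log above:

# on the MDS, dump the debug log and count the bad lookups
lctl dk /tmp/debug.txt; grep -c 'unknown pool (test_85b)' /tmp/debug.txt
# list the pools the LOD layer actually knows about
lctl get_param -N lod.lustre-MDT0000-mdtlov.pools.*
lctl pool_list lustre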

We do see that replay-single test_85b creates a test_85b pool, but its log shows the pool being removed and destroyed:

== replay-single test 85b: check the cancellation of unused locks during recovery(EXTENT) ============ 12:13:45 (1552162425)
CMD: trevis-39vm4 lctl pool_new lustre.test_85b
trevis-39vm4: Pool lustre.test_85b created
CMD: trevis-39vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.test_85b 				2>/dev/null || echo foo
CMD: trevis-39vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.test_85b 				2>/dev/null || echo foo
CMD: trevis-39vm1 lctl get_param -n lov.lustre-*.pools.test_85b 		2>/dev/null || echo foo
CMD: trevis-39vm1 lctl get_param -n lov.lustre-*.pools.test_85b 		2>/dev/null || echo foo
CMD: trevis-39vm4 /usr/sbin/lctl pool_add lustre.test_85b lustre-OST0000
trevis-39vm4: OST lustre-OST0000_UUID added to pool lustre.test_85b
before recovery: unused locks count = 100
...
after recovery: unused locks count = 0
CMD: trevis-39vm4 /usr/sbin/lctl pool_remove lustre.test_85b lustre-OST0000
trevis-39vm4: OST lustre-OST0000_UUID removed from pool lustre.test_85b
CMD: trevis-39vm4 /usr/sbin/lctl pool_destroy lustre.test_85b
trevis-39vm4: Pool lustre.test_85b destroyed
...
CMD: trevis-39vm1,trevis-39vm2,trevis-39vm3,trevis-39vm4 dmesg
Destroy the created pools: test_85b
CMD: trevis-39vm4 /usr/sbin/lctl pool_list lustre
PASS 85b (50s)
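
After pool_destroy, the pool should be gone from both the MDS and client views; the same parameters the framework polls above can be checked by hand:

# MDS view (should fail with ENOENT once the pool is destroyed)
lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.test_85b
# client view
lctl get_param -n lov.lustre-*.pools.test_85b
# filesystem-wide list; test_85b should no longer appear
lctl pool_list lustre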

Yet, in the output of later replay-single tests, the test_85b pool still appears. In test 90, lmm_pool is test_85b:

Check getstripe: /usr/bin/lfs getstripe -r --obd lustre-OST0006_UUID
/mnt/lustre/d90.replay-single/all
lmm_stripe_count:  7
lmm_stripe_size:   1048576
lmm_pattern:       1
lmm_layout_gen:    0
lmm_stripe_offset: 3
lmm_pool:          test_85b
	obdidx		 objid		 objid		 group
	     6	          4930	       0x1342	             0 *

/mnt/lustre/d90.replay-single/f6
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       1
lmm_layout_gen:    0
lmm_stripe_offset: 6
lmm_pool:          test_85b
	obdidx		 objid		 objid		 group
	     6	          4931	       0x1343	             0 *
/mnt/lustre/d90.replay-single/all
/mnt/lustre/d90.replay-single/f6
Failover ost7 to trevis-39vm3

Test 132a shows the same thing:

/mnt/lustre/f132a.replay-single
  lcm_layout_gen:  3
  lcm_entry_count: 2
    lcme_id:             1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 3
      lmm_pool:          test_85b
      lmm_objects:
      - 0: { l_ost_idx: 3, l_fid: [0x100030000:0x1382:0x0] }

    lcme_id:             2
    lcme_flags:          init
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  2
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 4
      lmm_pool:          test_85b
      lmm_objects:
      - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x1382:0x0] }
      - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x14e2:0x0] }
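
These layouts mean that files created well after the pool was destroyed are still being tagged with its name (related: LU-12061). A hypothetical by-hand check for that symptom, with example pool and path names:

# set up and immediately tear down a pool (names are examples)
lctl pool_new lustre.scratch
lctl pool_add lustre.scratch lustre-OST0000
lctl pool_remove lustre.scratch lustre-OST0000
lctl pool_destroy lustre.scratch
# a brand-new file should not reference the destroyed pool;
# on an affected run, the destroyed pool name shows up here
mkdir /mnt/lustre/fresh
touch /mnt/lustre/fresh/f0
lfs getstripe --pool /mnt/lustre/fresh/f0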

Looking at results since June 2018, ost-pools test 24 did not fail with the error "Pool '' not on /mnt/lustre/d24.ost-pools/dir3/f24.ost-pools0:test_85b" on any branch between June 2018 and January 2019. In February 2019, the test started failing with this error again. Here are all the failures with this error in 2019:
12-FEB 2.12.51.28 - https://testing.whamcloud.com/test_sets/d9cbd72c-2ea6-11e9-a700-52540065bddc

27-FEB 2.10.6.63 - https://testing.whamcloud.com/test_sets/489d0546-3b35-11e9-913f-52540065bddc
27-FEB 2.12.51.51 - https://testing.whamcloud.com/test_sets/f7d5b686-3b15-11e9-b88b-52540065bddc
28-FEB 2.10.6.63 - https://testing.whamcloud.com/test_sets/6790308a-3b1d-11e9-a5c6-52540065bddc
28-FEB 2.10.6.63 - https://testing.whamcloud.com/test_sets/cec55e7c-3b83-11e9-9646-52540065bddc
28-FEB 2.10.6.63 - https://testing.whamcloud.com/test_sets/aa662b4a-3b62-11e9-913f-52540065bddc
28-FEB 2.10.6.63 - https://testing.whamcloud.com/test_sets/a446daf4-3b56-11e9-913f-52540065bddc
28-FEB 2.10.6.63 - https://testing.whamcloud.com/test_sets/17558fde-3b54-11e9-8f69-52540065bddc
28-FEB 2.10.6.63 - https://testing.whamcloud.com/test_sets/d13c77de-3b3e-11e9-913f-52540065bddc

3-MAR 2.12.51.79 server/2.12.0 clients - https://testing.whamcloud.com/test_sets/c816dce2-3e47-11e9-9720-52540065bddc
6-MAR 2.10.6.79 - https://testing.whamcloud.com/test_sets/07c7f1ca-4053-11e9-9646-52540065bddc
6-MAR 2.10.6.79 - https://testing.whamcloud.com/test_sets/aae422be-405f-11e9-a256-52540065bddc
6-MAR 2.10.6.79 - https://testing.whamcloud.com/test_sets/58314dc4-4061-11e9-9720-52540065bddc
6-MAR 2.10.6.79 - https://testing.whamcloud.com/test_sets/dd263538-4063-11e9-9720-52540065bddc
6-MAR 2.10.6.79 - https://testing.whamcloud.com/test_sets/8aca8b3e-4065-11e9-9720-52540065bddc
6-MAR 2.10.6.79 - https://testing.whamcloud.com/test_sets/e3b30308-406e-11e9-9646-52540065bddc
6-MAR 2.10.6.79 - https://testing.whamcloud.com/test_sets/848095a4-4077-11e9-8e92-52540065bddc
6-MAR 2.10.6.79 - https://testing.whamcloud.com/test_sets/a32fa184-40a4-11e9-92fe-52540065bddc
6-MAR 2.10.6.79 - https://testing.whamcloud.com/test_sets/2f488558-40ac-11e9-8e92-52540065bddc
7-MAR 2.10.6.79 - https://testing.whamcloud.com/test_sets/a4b30b40-40cc-11e9-a256-52540065bddc
9-MAR 2.10.7 RC1 - https://testing.whamcloud.com/test_sets/f6d552f6-42ee-11e9-b98a-52540065bddc
9-MAR 2.10.7 RC1 - https://testing.whamcloud.com/test_sets/12f341da-4300-11e9-9646-52540065bddc
9-MAR 2.10.7 RC1 - https://testing.whamcloud.com/test_sets/df7dc85c-430e-11e9-b98a-52540065bddc
10-MAR 2.10.7 RC1 - https://testing.whamcloud.com/test_sets/8bc401ce-4320-11e9-8e92-52540065bddc
10-MAR 2.10.7 RC1 - https://testing.whamcloud.com/test_sets/5a54930a-4335-11e9-a256-52540065bddc
10-MAR 2.10.7 RC1 - https://testing.whamcloud.com/test_sets/d29d65d4-433b-11e9-b98a-52540065bddc
10-MAR 2.10.7 RC1 - https://testing.whamcloud.com/test_sets/68760eee-4341-11e9-a256-52540065bddc
10-MAR 2.10.7 RC1 - https://testing.whamcloud.com/test_sets/f906e83a-434f-11e9-a256-52540065bddc



 Comments   
Comment by Patrick Farrell (Inactive) [ 11/Mar/19 ]

James,
See:

https://review.whamcloud.com/#/c/33777/

 

That covers almost, but not quite, all of your instances, since the fix is not in 2.12.0 or 2.10.x...

But looking closer, these two runs:
27-FEB 2.12.51.51 - https://testing.whamcloud.com/test_sets/f7d5b686-3b15-11e9-b88b-52540065bddc

12-FEB 2.12.51.28 - https://testing.whamcloud.com/test_sets/d9cbd72c-2ea6-11e9-a700-52540065bddc

in fact have 2.12.0 clients.

So, it's that fix.

Comment by Patrick Farrell (Inactive) [ 20/Mar/19 ]

It's not clear why this started happening all of a sudden, but the issue is resolved by this patch against LU-10070 (which is a feature ticket, so this ticket is not a good candidate to be duped to it):
https://review.whamcloud.com/#/c/33777/
