[LU-12060] ost-pools test 24 fails with 'Pool '' not on /mnt/lustre/d24.ost-pools/dir3/f24.ost-pools0:test_85b'
| Created: | 11/Mar/19 | Updated: | 20/Mar/19 | Resolved: | 20/Mar/19 |
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.10.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Patrick Farrell (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
ost-pools test_24 fails with 'Pool '' not on /mnt/lustre/d24.ost-pools/dir3/f24.ost-pools0:test_85b'. We only see this test fail with this error message in full test sessions, so a test suite that runs before ost-pools may be failing or not cleaning up after itself. The pool name reported in the error is strange: 'test_85b'.

Test 24 is failing at line 1440 of the following ost-pools test script excerpt:

1415 	for i in 1 2 3 4; do
1416 		dir=${POOL_ROOT}/dir${i}
1417 		local pool
1418 		local pool1
1419 		local count
1420 		local count1
1421 		local index
1422 		local size
1423 		local size1
1424 
1425 		createmany -o $dir/${tfile} $numfiles ||
1426 			error "createmany $dir/${tfile} failed!"
1427 		pool=$($LFS getstripe --pool $dir)
1428 		index=$($LFS getstripe -i $dir)
1429 		size=$($LFS getstripe -S $dir)
1430 		count=$($LFS getstripe -c $dir)
1431 
1432 		for file in $dir/*; do
1433 			if [ "$pool" != "" ]; then
1434 				check_file_in_pool $file $pool
1435 			fi
1436 			pool1=$($LFS getstripe --pool $file)
1437 			count1=$($LFS getstripe -c $file)
1438 			size1=$($LFS getstripe -S $file)
1439 			[[ "$pool" != "$pool1" ]] &&
1440 				error "Pool '$pool' not on $file:$pool1"
1441 			[[ "$count" != "$count1" ]] &&
1442 				[[ "$count" != "-1" ]] &&
1443 				error "Stripe count $count not on"\
1444 					"$file:$count1"
1445 			[[ "$count1" != "$OSTCOUNT" ]] &&
1446 				[[ "$count" = "-1" ]] &&
1447 				error "Stripe count $count1 not on"\
1448 					"$file:$OSTCOUNT"
1449 			[[ "$size" != "$size1" ]] && [[ "$size" != "0" ]] &&
1450 				error "Stripe size $size not on $file:$size1"
1451 		done
1452 	done

Looking at a recent 2.10.7 RC1 test failure, https://testing.whamcloud.com/test_sets/8bc401ce-4320-11e9-8e92-52540065bddc, ost-pools is the only failure out of all the test suites. Looking at the MDS (vm4) debug log, we see the MDS being called with a bad pool name about 10 times:

00010000:00010000:1.0:1552189237.557826:0:12918:0:(ldlm_request.c:504:ldlm_cli_enqueue_local()) ### client-side local enqueue handler, new lock created ns: mdt-lustre-MDT0000_UUID lock: ffff8f36d7c0b200/0xe4fa7e4c800ca51f lrc: 3/0,1 mode: PW/PW res: [0x20006bac1:0x1972a:0x0].0xa55d7462 bits 0x2 rrc: 2 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 12918 timeout: 0 lvb_type: 0
00020000:01000000:1.0:1552189237.557892:0:12918:0:(lod_pool.c:920:lod_find_pool()) lustre-MDT0000-osd: request for an unknown pool (test_85b)
00000004:00080000:1.0:1552189237.557972:0:12918:0:(osp_object.c:1517:osp_create()) lustre-OST0000-osc-MDT0000: Wrote last used FID: [0x100000000:0x20219:0x0], index 0: 0

We do see that replay-single test 85b creates a test_85b pool, but it looks like the pool is destroyed:

== replay-single test 85b: check the cancellation of unused locks during recovery(EXTENT) ============ 12:13:45 (1552162425)
CMD: trevis-39vm4 lctl pool_new lustre.test_85b
trevis-39vm4: Pool lustre.test_85b created
CMD: trevis-39vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.test_85b 2>/dev/null || echo foo
CMD: trevis-39vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.test_85b 2>/dev/null || echo foo
CMD: trevis-39vm1 lctl get_param -n lov.lustre-*.pools.test_85b 2>/dev/null || echo foo
CMD: trevis-39vm1 lctl get_param -n lov.lustre-*.pools.test_85b 2>/dev/null || echo foo
CMD: trevis-39vm4 /usr/sbin/lctl pool_add lustre.test_85b lustre-OST0000
trevis-39vm4: OST lustre-OST0000_UUID added to pool lustre.test_85b
before recovery: unused locks count = 100
...
after recovery: unused locks count = 0
CMD: trevis-39vm4 /usr/sbin/lctl pool_remove lustre.test_85b lustre-OST0000
trevis-39vm4: OST lustre-OST0000_UUID removed from pool lustre.test_85b
CMD: trevis-39vm4 /usr/sbin/lctl pool_destroy lustre.test_85b
trevis-39vm4: Pool lustre.test_85b destroyed
...
CMD: trevis-39vm1,trevis-39vm2,trevis-39vm3,trevis-39vm4 dmesg
Destroy the created pools: test_85b
CMD: trevis-39vm4 /usr/sbin/lctl pool_list lustre
PASS 85b (50s)

Yet, looking at output from later replay-single tests, we see that the test_85b pool name still appears in file layouts. From test 90, lmm_pool is test_85b:

Check getstripe: /usr/bin/lfs getstripe -r --obd lustre-OST0006_UUID
/mnt/lustre/d90.replay-single/all
lmm_stripe_count:  7
lmm_stripe_size:   1048576
lmm_pattern:       1
lmm_layout_gen:    0
lmm_stripe_offset: 3
lmm_pool:          test_85b
        obdidx           objid           objid           group
             6            4930          0x1342               0 *
/mnt/lustre/d90.replay-single/f6
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       1
lmm_layout_gen:    0
lmm_stripe_offset: 6
lmm_pool:          test_85b
        obdidx           objid           objid           group
             6            4931          0x1343               0 *
/mnt/lustre/d90.replay-single/all
/mnt/lustre/d90.replay-single/f6
Failover ost7 to trevis-39vm3

Similar is test 132a (a sketch for locating layouts that still reference the pool follows this output):

/mnt/lustre/f132a.replay-single
lcm_layout_gen: 3
lcm_entry_count: 2
lcme_id: 1
lcme_flags: init
lcme_extent.e_start: 0
lcme_extent.e_end: 1048576
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 3
lmm_pool: test_85b
lmm_objects:
- 0: { l_ost_idx: 3, l_fid: [0x100030000:0x1382:0x0] }
lcme_id: 2
lcme_flags: init
lcme_extent.e_start: 1048576
lcme_extent.e_end: EOF
lmm_stripe_count: 2
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 4
lmm_pool: test_85b
lmm_objects:
- 0: { l_ost_idx: 4, l_fid: [0x100040000:0x1382:0x0] }
- 1: { l_ost_idx: 5, l_fid: [0x100050000:0x14e2:0x0] }
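Since the pool itself has been destroyed but its name persists in these layouts, one way to see how widespread the stale references are is to query layouts from a client. The following is a minimal sketch, not taken from the ticket; it assumes the client mount point is /mnt/lustre, and while lfs getstripe --pool appears in the test script above, the use of lfs find's --pool filter here is our own suggestion:

#!/bin/bash
# Sketch: find layouts that still reference a pool that has been destroyed.
FSROOT=/mnt/lustre        # client mount point (assumption for this sketch)
POOL=test_85b             # stale pool name seen in the ost-pools test 24 error

# List every file whose layout names the pool, even though the pool is no
# longer defined on the MDS.
lfs find "$FSROOT" --pool "$POOL"

# Show the pool stored in the layout of a specific file or directory.
lfs getstripe --pool "$FSROOT/d90.replay-single/all"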
Looking at test results from June 2018 through January 2019, ost-pools test 24 did not fail with the error "Pool '' not on /mnt/lustre/d24.ost-pools/dir3/f24.ost-pools0:test_85b" on any branch. Then, in February 2019, the test started failing with this error again. Here are all the failures with this error in 2019:

27-FEB 2.10.6.63 - https://testing.whamcloud.com/test_sets/489d0546-3b35-11e9-913f-52540065bddc
3-MAR 2.12.51.79 server / 2.12.0 clients - https://testing.whamcloud.com/test_sets/c816dce2-3e47-11e9-9720-52540065bddc
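Given the "request for an unknown pool (test_85b)" message in the MDS debug log above, a quick cross-check is whether the pool is actually still defined on the MDS. This is a hedged sketch rather than anything run in the ticket; it assumes it is executed on the MDS node and that the filesystem name is lustre, and it reuses the same lctl commands that replay-single test 85b issues in the output above:

#!/bin/bash
# Sketch: check on the MDS whether a pool is still defined and, if a stale
# pool is left over from an earlier test suite, remove it.
FSNAME=lustre
POOL=test_85b

# Pools the MDS currently knows about for this filesystem.
lctl pool_list "$FSNAME"

# OST membership of the pool on the MDT; fails or prints nothing if the
# pool does not exist (replay-single 85b queries the same parameter).
lctl get_param -n "lod.${FSNAME}-MDT0000-mdtlov.pools.${POOL}"

# Cleanup of a leftover pool, mirroring the pool_destroy call in 85b:
# lctl pool_destroy "${FSNAME}.${POOL}"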
| Comments |
| Comment by Patrick Farrell (Inactive) [ 11/Mar/19 ] |
James, see https://review.whamcloud.com/#/c/33777/

That covers almost, but not quite, all of your instances, since the patch is not in 2.12.0 or 2.10.x... But looking at the 12-FEB failure on 2.12.51.28 - https://testing.whamcloud.com/test_sets/d9cbd72c-2ea6-11e9-a700-52540065bddc - that run did in fact have 2.12.0 clients. So, it's that fix.
| Comment by Patrick Farrell (Inactive) [ 20/Mar/19 ] |
It's not clear why this started happening all of a sudden, but the issue is resolved by the patch referenced above.