[LU-15658] Interop sanity-flr test_0b test_0c test_0e test_0f: verify pool failed != flash Created: 17/Mar/22 Updated: 30/Nov/23 Resolved: 30/May/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0, Lustre 2.15.3 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Vitaly Fertman |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/0e8656f6-a118-4d42-8ce4-e179ce9c7e5c

test_0b failed with the following error:

verify pool failed on /mnt/lustre/d0b.sanity-flr/f0b.sanity-flr: != flash

CMD: trevis-80vm1.trevis.whamcloud.com lctl get_param -n lov.lustre-*.pools.test_0b | sort -u | tr '\n' ' '
Waiting 90 secs for update
CMD: trevis-80vm1.trevis.whamcloud.com lctl get_param -n lov.lustre-*.pools.test_0b | sort -u | tr '\n' ' '
CMD: trevis-80vm1.trevis.whamcloud.com lctl get_param -n lov.lustre-*.pools.test_0b | sort -u | tr '\n' ' '
Updated after 2s: wanted 'lustre-OST0000_UUID lustre-OST0001_UUID lustre-OST0002_UUID lustre-OST0003_UUID lustre-OST0004_UUID lustre-OST0005_UUID lustre-OST0006_UUID ' got 'lustre-OST0000_UUID lustre-OST0001_UUID lustre-OST0002_UUID lustre-OST0003_UUID lustre-OST0004_UUID lustre-OST0005_UUID lustre-OST0006_UUID '
/mnt/lustre/d0b.sanity-flr/f0b.sanity-flr
composite_header:
lcm_magic: 0x0BD60BD0
lcm_size: 1216
lcm_flags: ro
lcm_layout_gen: 6
lcm_mirror_count: 6
lcm_entry_count: 6
components:
- lcme_id: 131074
lcme_mirror_id: 2
lcme_flags: init
lcme_extent.e_start: 0
lcme_extent.e_end: EOF
lcme_offset: 392
lcme_size: 80
sub_layout:
lmm_magic: 0x0BD10BD0
lmm_seq: 0x200039de1
lmm_object_id: 0x2d72
lmm_fid: [0x200039de1:0x2d72:0x0]
lmm_stripe_count: 2
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 2
lmm_objects:
- 0: { l_ost_idx: 2, l_fid: [0x100020000:0x27071:0x0] }
- 1: { l_ost_idx: 3, l_fid: [0x100030000:0x2716f:0x0] }
sanity-flr test_0b: @@@@@@ FAIL: verify pool failed on /mnt/lustre/d0b.sanity-flr/f0b.sanity-flr: != flash
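For reference, a mirrored file with a pool-bound mirror can be created and inspected roughly as sketched below; the two-mirror layout, pool name, and path are illustrative, not the exact sanity-flr test commands:

# Create a two-mirror file whose second mirror is restricted to the "flash"
# pool, then dump the layout; each component of that mirror is expected to
# show "lmm_pool: flash" in the verbose output (absent in the dump above).
lfs mirror create -N -N -p flash /mnt/lustre/d0b.sanity-flr/f0b.sanity-flr
lfs getstripe -v /mnt/lustre/d0b.sanity-flr/f0b.sanity-flr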
|
| Comments |
| Comment by Andreas Dilger [ 17/Mar/22 ] |
|
It looks like this has been failing since 2021-03-02, and the failures appear to have started with patch https://review.whamcloud.com/41815. There were two identical failures on the patch itself before it landed, and the failure has shown up in full testing starting on the day that patch landed on master.
The problem is that the older releases check for the "flash" pool on mirror 2, but the |
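For context, a minimal sketch of the kind of pool check that fails here, assuming the pool name "flash" and mirror id 2 from the failure message (the actual sanity-flr helpers differ):

tf=/mnt/lustre/d0b.sanity-flr/f0b.sanity-flr
# Extract the lmm_pool value of the component with lcme_mirror_id 2; if no
# pool is set on that component, $pool stays empty, producing the
# " != flash" comparison seen in the failure message.
pool=$(lfs getstripe -v "$tf" |
       awk '/lcme_mirror_id: *2$/ { in_comp = 1 }
            in_comp && /lmm_pool/ { print $2; exit }')
[ "$pool" = "flash" ] || echo "verify pool failed on $tf: $pool != flash"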
| Comment by Andreas Dilger [ 17/Mar/22 ] |
|
Hi Vitaly, could you please take a look at this test failure? Per my previous comment, it looks like this has been failing 100% of the time in interop testing since patch https://review.whamcloud.com/41815 landed on 2021-03-22. |
| Comment by Gerrit Updater [ 23/Mar/22 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46913 |
| Comment by Andreas Dilger [ 23/Mar/22 ] |
|
As mentioned in the commit message of the revert patch, I think there are a couple of things wrong with the previous patch:
- the new pool check is done high up in lod_qos_parse_config(), away from where the existing lod_check_index_in_pool() check is called
- the behavior is inconsistent: specifying a single OST index outside the default pool returns an error, while specifying two OSTs outside the pool succeeds
The latter inconsistency became very apparent when I was looking into how the first issue might be fixed properly, by moving the pool check lower down in lod_qos_parse_config() to where the existing lod_check_index_in_pool() is called. It seems odd that the "check single OST" case returns an error while the "check two OSTs" case succeeds.

I'd also be OK with fixing this in a different way than the 46913 revert, but there needs to be some clear reason why the new approach (ignoring the default pool on the parent directory) is better than either consistently returning an error for any invalid OST index, with a clearer error message from "lfs setstripe" to the end user, or alternately accepting that an "out-of-pool" OST index overrides a default pool name.

LUS-9579 was referenced in the commit message, but I can't see that ticket. Is this a problem reported by an end user, or something found in internal testing, or what was the original motivation for this change? |
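The inconsistency described above can be reproduced along these lines (a sketch with hypothetical paths and OST indices, assuming OSTs 7 and 8 are not members of the "flash" pool):

# Parent directory has a default pool
lfs setstripe -p flash /mnt/lustre/dir

# A single OST index outside the pool is rejected with an error...
lfs setstripe -i 7 /mnt/lustre/dir/file1

# ...while a list of OSTs outside the pool is accepted, silently
# overriding the default pool
lfs setstripe -o 7,8 /mnt/lustre/dir/file2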
| Comment by Etienne Aujames [ 24/Mar/22 ] |
|
I have backported https://review.whamcloud.com/41815 to b2_12 because the old behavior could introduce inconsistencies when the pool is inherited: a file could be created in a pool with OSTs that are not in that pool. The CEA relies on pool names for OST space balancing between flash and HDD (via robinhood) and for different types of user projects.

I agree that returning an error only when the OST list is incompatible with the pool is a better solution. It would also be great to have a proper error message returned to the user for "lfs setstripe -o/-i" when the OSTs are incompatible with the pool.

Currently, we cannot override the pool inheritance with "lfs setstripe": we can force the file to use another pool, but we cannot create a file without any pool set, because "-p none" forces inheriting from the parent. |
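The inheritance problem described above can be seen with a sketch like the following (illustrative paths and pool name):

# Parent directory has a default pool
lfs setstripe -p flash /mnt/lustre/dir

# "-p none" is expected to create the file without any pool, but the
# file still inherits "flash" from the parent directory
lfs setstripe -p none /mnt/lustre/dir/file
lfs getstripe --pool /mnt/lustre/dir/file    # still prints "flash"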
| Comment by Andreas Dilger [ 25/Mar/22 ] |
This sounds like a bug that should be fixed. Could you please file a separate LU ticket for this, and look into when this problem was introduced? At the least, using "lfs setstripe -p none" should allow creating a file without a pool. |
| Comment by Andreas Dilger [ 28/Mar/22 ] |
|
I tested with master, 2.14+patches (EXAScaler), and 2.12.8, and wasn't able to create a file without a pool if the parent or root directory had a pool specified. This seems like a major problem if specific OSTs cannot be selected, but any fix should be consistent with other uses. |
| Comment by Etienne Aujames [ 28/Mar/22 ] |
|
Hi Andreas, I will create a ticket for this.

From man lfs-setstripe:

-p, --pool <pool_name>

I have tried to read some of the 2.10 code, and it seems this issue was already present there. |
| Comment by Andreas Dilger [ 28/Mar/22 ] |
|
There has to be some way to create files on specific OSTs, even if there is a default pool specified on the root directory. |
| Comment by Gerrit Updater [ 31/Mar/22 ] |
|
"Vitaly Fertman <vitaly.fertman@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46967 |
| Comment by Gerrit Updater [ 30/May/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46967/ |
| Comment by Peter Jones [ 30/May/22 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 21/Dec/22 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49475 |
| Comment by Gerrit Updater [ 12/Jun/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51285 |