[LU-17037] Tests should run with high and sparse index numbers for OSTs and MDTs Created: 17/Aug/23 Updated: 26/Oct/23 |
|
| Status: | In Progress |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.3 |
| Fix Version/s: | Lustre 2.17.0 |
| Type: | Task | Priority: | Minor |
| Reporter: | Colin Faber | Assignee: | Jian Yu |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | tests |
| Issue Links: |
| Severity: | 4 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
As a long-term effort to improve the overall stability of Lustre, the test suite should be evaluated and modified to allow testing against OST and MDT sets whose index numbers are both high and sparse. In many cases, sites are choosing to deploy flash OSTs within the first 100 slots and to move all HDD OSTs to index slots > 100, or vice versa. This has been shown to introduce issues with some features (most recently a memory corruption issue within pool quotas: Right now the existing test suite assumes that OSTs will exist at certain low index numbers, which are hard-coded into it. This is sub-optimal for catching cases such as |
| Comments |
| Comment by Andreas Dilger [ 17/Aug/23 ] |
|
Note that some test cases already test sparse OST indexes, for example conf-sanity.sh test_81 and test_82a. test-framework.sh already supports non-sequential OST numbers through OST_INDEX_LIST, which specifies the index values, but it might make sense to improve this support as needed. This is "documented" in lustre/tests/cfg/local.sh:
# OST indices can be specified as follows:
# OSTINDEX1="1"
# OSTINDEX2="2"
# OSTINDEX3="4"
# ......
# or
# OST_INDEX_LIST="[1,2,4-6,8]" # [n-m,l-k,...], where n < m and l < k, etc.
#
# The default index value of an individual OST is its facet number minus 1.
# More specific ones override more general ones. See facet_index().
What needs to be done here is to fix the many, many subtests that assume ost1 == OST0000, ost2 == OST0001, etc. (often using the facet number - 1 as the index, or the index number + 1 as the facet name), and instead use helpers that map the facet name/number to the OST number in OST_INDEX_LIST (which is mapped internally to an associative array $OST_INDICES). Some helper functions already exist, but they might need to be updated, and they definitely need to be used more widely:
It might be useful to add some more helper functions to simplify the remapping, like:
There is currently no support for non-contiguous MDT index numbers in test-framework.sh, and I don't think this has been tested anywhere. Until we get MDT pools, I'm not sure if there is much motivation to configure discontiguous index numbers, but I'm sure it will happen somewhere eventually. However, I don't think implementing support for testing this and fixing the many resulting bugs is a priority compared to fixing discontiguous OST support. |
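The OST_INDEX_LIST syntax and the precedence rule quoted from cfg/local.sh in this comment can be illustrated with a small standalone sketch. The function names below (expand_index_list, ost_facet_to_index, ost_index_to_facet) are hypothetical and are not the test-framework.sh helpers. The sketch assumes that OSTINDEX<n> overrides OST_INDEX_LIST, which in turn overrides the default of facet number minus 1.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helper names, not test-framework.sh functions.

# Expand "[1,2,4-6,8]" into the space-separated list "1 2 4 5 6 8".
expand_index_list() {
	local list=${1#\[}	# strip the leading "["
	list=${list%\]}		# strip the trailing "]"
	local item i out=()
	local IFS=,		# split the list on commas
	for item in $list; do
		if [[ $item == *-* ]]; then
			# "n-m" range: emit every value from n to m
			for ((i = ${item%-*}; i <= ${item#*-}; i++)); do
				out+=("$i")
			done
		else
			out+=("$item")
		fi
	done
	IFS=' '			# join the result with spaces
	echo "${out[*]}"
}

# Map a 1-based OST facet number to its index. Assumed precedence:
# OSTINDEX<n>, then the n-th entry of OST_INDEX_LIST, then n - 1.
ost_facet_to_index() {
	local num=$1
	local var=OSTINDEX$num
	if [[ -n ${!var:-} ]]; then
		echo "${!var}"
		return
	fi
	local indices
	read -r -a indices <<< "$(expand_index_list "${OST_INDEX_LIST:-}")"
	echo "${indices[num - 1]:-$((num - 1))}"
}

# Reverse lookup: which facet (ost1..ost$OSTCOUNT) carries a given index?
ost_index_to_facet() {
	local want=$1 n
	for ((n = 1; n <= ${OSTCOUNT:-0}; n++)); do
		if [[ $(ost_facet_to_index "$n") == "$want" ]]; then
			echo "ost$n"
			return 0
		fi
	done
	return 1
}

OST_INDEX_LIST="[1,2,4-6,8]"
OSTCOUNT=6
expand_index_list "$OST_INDEX_LIST"	# 1 2 4 5 6 8
ost_facet_to_index 3			# 4
ost_index_to_facet 8			# ost6
```

A subtest written against lookups like these would not need to assume that ost3 is OST0002. The reverse lookup returns non-zero when no facet carries the requested index.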
| Comment by Andreas Dilger [ 17/Aug/23 ] |
|
Because of the many subtests that need to be fixed, it would probably make sense to split the patches by test script (or maybe into multiple patches for large scripts like sanity.sh) so that they can land independently, unless there are only a few changes in a single file. It might be possible to find which subtests have obvious problems by running "env=OST_INDEX_LIST=[0,10,20,40,55,60,80]" (for OSTCOUNT=8) or similar in autotest (or just set OST_INDEX_LIST in your local test environment) and running through the test scripts multiple times to fix failures as they are hit. A huge number of test failures would probably be hit if there is no OST0000, so that case might be the last to test, after other subtests are fixed. |
| Comment by Jian Yu [ 29/Aug/23 ] |
|
In conf-sanity test_82a(), the random sparse indices for OSTs are generated as follows:
# Format OSTs with random sparse indices.
local i
local index
local ost_indices
local LOV_V1_INSANE_STRIPE_COUNT=65532
for i in $(seq $OSTCOUNT); do
	index=$(((RANDOM * 2) % LOV_V1_INSANE_STRIPE_COUNT))
	ost_indices+=" $index"
done
ost_indices=$(comma_list $ost_indices)
stack_trap "restore_ostindex" EXIT
echo -e "\nFormat $OSTCOUNT OSTs with sparse indices $ost_indices"
OST_INDEX_LIST=[$ost_indices] formatall
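The index arithmetic in the loop above can be checked in isolation. In the sketch below, sparse_index is a hypothetical helper that takes the raw value as an argument instead of reading $RANDOM, which bash draws from 0..32767. It shows that the generated indices are even and stay below 65532, and that two different raw values can map to the same index; the snippet as quoted does not check for duplicates.

```shell
#!/usr/bin/env bash
# Sketch only: sparse_index is a hypothetical helper that isolates the
# arithmetic of the loop so it can be run with fixed inputs.
LOV_V1_INSANE_STRIPE_COUNT=65532

sparse_index() {
	local raw=$1	# stands in for $RANDOM (0..32767 in bash)
	echo $(((raw * 2) % LOV_V1_INSANE_STRIPE_COUNT))
}

sparse_index 0		# 0
sparse_index 16383	# 32766
sparse_index 32766	# 0 (65532 wraps around and collides with raw=0)
sparse_index 32767	# 2
```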
As a quick experiment, I used the same approach as test_82a in cfg/local.sh to set OST_INDEX_LIST to random sparse indices, and then ran runtests. It passed:
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID        95248        4340       82252   6% /mnt/lustre[MDT:0]
lustre-OST090c_UUID       142216        7288      120928   6% /mnt/lustre[OST:2316]
lustre-OST4b24_UUID       142216        9088      119128   8% /mnt/lustre[OST:19236]
lustre-OST9234_UUID       142216        9416      118800   8% /mnt/lustre[OST:37428]
lustre-OST986c_UUID       142216       14568      113648  12% /mnt/lustre[OST:39020]
filesystem_summary:       568864       40360      472504   8% /mnt/lustre
I'm going to push a fortestonly patch that runs the full test group in autotest with the above change to see which subtests fail. |
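The df output in this comment pairs each target name with its decimal index: the four hex digits after "OST" are the index shown in brackets. A minimal sketch of the conversion in both directions follows; the helper names ost_index_to_name and ost_name_to_index are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: both helper names are hypothetical.

# Decimal index -> target name suffix, e.g. 2316 -> OST090c.
ost_index_to_name() {
	printf 'OST%04x\n' "$1"
}

# Target name suffix -> decimal index, e.g. OST986c -> 39020.
ost_name_to_index() {
	local hex=${1#OST}	# drop the "OST" prefix, keep the hex digits
	echo $((16#$hex))	# interpret the digits as base 16
}

for idx in 2316 19236 37428 39020; do
	ost_index_to_name "$idx"
done
# prints OST090c, OST4b24, OST9234, OST986c (one per line)
```

Subtests that build a target name from a facet number would need a conversion like this applied to the mapped index rather than to the facet number minus 1.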
| Comment by Gerrit Updater [ 29/Aug/23 ] |
|
"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52158 |
| Comment by Jian Yu [ 30/Aug/23 ] |
|
I'm vetting the test results in https://review.whamcloud.com/52158 on the master branch. |
| Comment by Andreas Dilger [ 30/Aug/23 ] |
|
I think sanity-quota and ost-pools are also of particular interest, since pools + quota + sparse OST index was the source of the problem. |
| Comment by Jian Yu [ 30/Aug/23 ] |
|
Here are the full-dne-part-{1,2,3} test results with "OST_INDEX_LIST=[0,10,20,40,55,60,80]" and "ENABLE_QUOTA=yes": |
| Comment by Jian Yu [ 30/Aug/23 ] |
|
I just removed the "ENABLE_QUOTA=yes" test parameter and triggered the full group testing again so that the LBUG does not block other test suites. After