[LU-1410] Test failure on test suite sanity, subtest test_200c Created: 15/May/12 Updated: 11/Jun/13 Resolved: 11/Jun/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 1.8.9 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Keith Mannthey (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 4481 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/b43d3072-9ecb-11e1-b567-52540035b04c. The sub-test test_200c failed with the following error:
I got this error when doing a rolling upgrade from 1.8.7 to 2.2.52. The MDS was upgraded to 2.2.52-RHEL6 while the OSTs and clients were 1.8.7 |
| Comments |
| Comment by Andreas Dilger [ 17/May/12 ] |
|
The test log reports: == test 200c: Set pool on a directory ================================= == 11:30:23 |
| Comment by Andreas Dilger [ 17/May/12 ] |
== test 200b: Add targets to a pool ==================================== == 11:30:21
fat-amd-1: add the named OSTs to the pool
fat-amd-1: usage pool_add <fsname>.<poolname> <ostname indexed list>
Updated after 0 sec: wanted '' got ''
Resetting fail_loc on all nodes...done.

It looks like test_200c() failed because test_200b() did not add any OSTs to the pool. So test_200b() needs to be examined to see why it failed (it looks like a syntax error, or maybe an empty OST list?). Looking at other test results, it seems test_200b() had been working properly until this test run. |
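The "Updated after 0 sec: wanted '' got ''" line is telling. A simplified sketch of the wait_update polling pattern (this is our own illustration, not the real test-framework helper, and it counts iterations instead of sleeping) shows why an empty expected list lets a failed pool_add slip through unnoticed:

```shell
# Simplified sketch of the wait_update pattern (not the real helper):
# poll a command until its output matches the expected string.
wait_update_sketch() {
    check_cmd=$1; expect=$2; maxtries=${3:-5}
    i=0
    while [ "$i" -lt "$maxtries" ]; do
        got=$(eval "$check_cmd")
        if [ "$got" = "$expect" ]; then
            echo "Updated after $i sec: wanted '$expect' got '$got'"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}

# When the expected target list is empty (''), the very first poll
# trivially "matches", so the helper reports success even though
# nothing was ever added to the pool.
wait_update_sketch "printf ''" ""
# prints: Updated after 0 sec: wanted '' got ''
```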
| Comment by Peter Jones [ 18/May/12 ] |
|
Keith Could you please look into this one? Thanks Peter |
| Comment by Keith Mannthey (Inactive) [ 18/May/12 ] |
|
Not sure if it is related yet, but from Client 2 (client-5) the tests preceding 200x created an error condition:

Lustre: DEBUG MARKER: == test 171: test libcfs_debug_dumplog_thread stuck in do_exit() ====== == 11:29:38
LustreError: 31901:0:(file.c:281:ll_file_release()) obd_fail_timeout id 50e sleeping for 3000 ms
LustreError: 31901:0:(file.c:281:ll_file_release()) obd_fail_timeout id 50e awake
LustreError: dumping log to /tmp/lustre-log.1337106582.31901
Lustre: DEBUG MARKER: SKIP: sanity test_180 skipping excluded test 180
Lustre: DEBUG MARKER: == test 181: Test open-unlinked dir ======================== == 11:29:48
Lustre: DEBUG MARKER: == test 200a: Create new pool ========================================== == 11:30:07
Lustre: DEBUG MARKER: == test 200b: Add targets to a pool ==================================== == 11:30:21
Lustre: DEBUG MARKER: == test 200c: Set pool on a directory ================================= == 11:30:23
Lustre: DEBUG MARKER: sanity test_200c: @@@@@@ FAIL: Cannot set pool cea1 to /mnt/lustre/d200.pools/dir_tst

Still looking. |
| Comment by Keith Mannthey (Inactive) [ 18/May/12 ] |
|
More likely it is related to:

== test 200b: Add targets to a pool ==================================== == 11:30:21
fat-amd-1: add the named OSTs to the pool
fat-amd-1: usage pool_add <fsname>.<poolname> <ostname indexed list>
Updated after 0 sec: wanted '' got ''
Resetting fail_loc on all nodes...done.

The pool_add usage message suggests the arguments got mangled in the call for 200b. |
| Comment by Keith Mannthey (Inactive) [ 18/May/12 ] |
|
It seems the pool_add didn't end up with valid arguments. It looks like jt_pool_cmd() (in utils/obd.c) returned CMD_HELP and aborted the pool_add, most likely because the wrong number of arguments was passed into the function. The code around this test in sanity.sh has been stable:

test_200b() {
remote_mgs_nodsh && skip "remote MGS with nodsh" && return
TGT=$(for i in $TGTPOOL_LIST; do printf "$FSNAME-OST%04x_UUID " $i; done)
do_facet mgs $LCTL pool_add $FSNAME.$POOL \
$FSNAME-OST[$TGTPOOL_FIRST-$TGTPOOL_MAX/$TGTPOOL_STEP]
wait_update $HOSTNAME "lctl get_param -n lov.$FSNAME-*.pools.$POOL | sort -u | tr '\n' ' ' " "$TGT" ||
error "Add to pool failed"
local lfscount=$($LFS pool_list $FSNAME.$POOL | grep -c "\-OST")
local addcount=$((($TGTPOOL_MAX - $TGTPOOL_FIRST) / $TGTPOOL_STEP + 1))
[ $lfscount -eq $addcount ] ||
error "lfs pool_list bad ost count $lfscount != $addcount"
}
run_test 200b "Add targets to a pool ===================================="
|
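The failure mode can be reproduced with a small, self-contained sketch of how a "first-max/step" bracket range expands into the UUID list that test_200b compares against (expand_range and the sample values are our own illustration, not Lustre code): when the range runs backwards, as in the lustre-OST[1-0/2] seen later in this ticket on a one-OST setup, it expands to nothing, so the expected target list is empty.

```shell
# Hypothetical sketch (not Lustre code) of expanding an OST index range
# of the form "first-max/step" into the UUID list test_200b expects.
expand_range() {
    first=$1; max=$2; step=$3
    out=""
    for i in $(seq "$first" "$step" "$max"); do
        out="$out$(printf 'lustre-OST%04x_UUID ' "$i")"
    done
    printf '%s' "$out"
}

# A valid range for one OST yields one UUID; an inverted range yields
# nothing, leaving TGT empty.
echo "[0-1/2] -> '$(expand_range 0 1 2)'"   # 'lustre-OST0000_UUID '
echo "[1-0/2] -> '$(expand_range 1 0 2)'"   # ''
```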
| Comment by Keith Mannthey (Inactive) [ 26/Jul/12 ] |
|
I have run sanity, and just the 200 tests, in loops on a local test environment (not with rolling upgrades). I have not seen this issue and there are no similar Maloo reports, so the rolling upgrade is likely the key to reproduction. I know this is from a while ago, but do we know what part of the upgrade (MDS or OST) we were at when the test failed? My local VMs are a little busy right now, but I think running sanity test 200 (with debugging) while doing a rolling upgrade is the next step. |
| Comment by Keith Mannthey (Inactive) [ 07/Aug/12 ] |
|
What is the proper rolling-upgrade target at this time? 1.8.7 to master? I think there is now a scrub issue that breaks forward interoperability. If that is not the case, I will give it a go. |
| Comment by Keith Mannthey (Inactive) [ 09/Aug/12 ] |
|
I got scripts from Sarah to run the tests, and I am working on getting a 1.8.8-to-master test run completed. |
| Comment by Keith Mannthey (Inactive) [ 15/Aug/12 ] |
|
Ok, I am still working to get the first retest done. With some direction and help from Sarah, it appears I need physical nodes (I have several virtual nodes right now). Tomorrow I will search out the correct nodes to run this test. |
| Comment by Keith Mannthey (Inactive) [ 15/Aug/12 ] |
|
It appears we can add persistent storage to the VM nodes. I am working to enable this so virtual (easy to get) nodes are able to complete this test. |
| Comment by Keith Mannthey (Inactive) [ 17/Aug/12 ] |
|
With Chris's help I have a virtual node with persistent storage. YEA! I have just kicked off the first test run; I will update when I know more. |
| Comment by Keith Mannthey (Inactive) [ 17/Aug/12 ] |
|
Ok, so the root issue is this line:

pdsh -l root -t 100 -S -w client-12vm3 '(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; sh -c "/usr/sbin/lctl' pool_add lustre.cea1 'lustre-OST[1-0/2]")'

lustre-OST[1-0/2] is the ostname indexed list argument, and it should be lustre-OST[0-1/2] for the one-OST case. Sarah and I are working to find the correct part of the testing macro stack to fix. I have manually tested the lustre-OST[0-1/2] change and it works. I will update when the issue is fixed. |
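One way such a fix could look (a hypothetical sketch only; the actual patch may differ, and build_tgtpool_range and its variable names are our own assumptions) is to derive the range bounds from the OST count and clamp them so a single-OST setup produces a valid range instead of an inverted one:

```shell
# Hypothetical sketch of a guard for the pool target range (not the
# actual patch): clamp the bounds so one OST yields "[0-0/1]" rather
# than an inverted "[1-0/2]".
build_tgtpool_range() {
    ostcount=$1
    first=0
    max=$((ostcount - 1))
    step=2
    # never let the step exceed the number of available indices
    [ "$step" -gt $((max - first + 1)) ] && step=1
    printf '[%d-%d/%d]' "$first" "$max" "$step"
}

echo "1 OST  -> $(build_tgtpool_range 1)"   # [0-0/1]
echo "4 OSTs -> $(build_tgtpool_range 4)"   # [0-3/2]
```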
| Comment by Keith Mannthey (Inactive) [ 17/Aug/12 ] |
|
Well, this issue is in the sanity test. There are 2 versions of this test: the 1.8.8 version and the master version. The main client is 1.8.8, so it uses the older test code. When it upgrades (clients upgrade last) it will use the master test code. This is very likely a b1_8 branch issue with sanity 200 and a 1-OST-only configuration. A full rolling upgrade test (it will take a while) will tell us if a 2.3 change is needed. Very likely this test is fine with 1.8 and 2 or more OSTs. At this point there is no indication of a problem with master, but more testing is needed. |
| Comment by Keith Mannthey (Inactive) [ 20/Aug/12 ] |
|
Master also has this test issue. I have submitted patches for both b1_8 and master for further review and testing. b1_8 master: |
| Comment by Keith Mannthey (Inactive) [ 21/Aug/12 ] |
|
I have given up my nodes, as the initial issue has patches pending. An official retest should be the next step. A patched 1.8 and master will allow the automated runs of the sanity 200 tests to complete on a one-OST setup. |
| Comment by Peter Jones [ 22/Aug/12 ] |
|
Sarah, given that upgrade/downgrade testing is done manually, does knowing what triggers this issue allow you to set up in a way that works around it and completes the rest of the testing? Peter |
| Comment by Sarah Liu [ 22/Aug/12 ] |
|
Hi Peter, yes, I know the workaround for running this test |
| Comment by Peter Jones [ 22/Aug/12 ] |
|
ok then, dropping priority. The patches can still land to improve the flexibility of the test in the long term, but this is really only a problem that will crop up in testing situations, not in production. |
| Comment by Keith Mannthey (Inactive) [ 11/Jun/13 ] |
|
b1_8 patch now landed. |