[LU-5375] Failure on test suite sanity test_151 test_156: roc_hit is not safe to use Created: 19/Jul/14 Updated: 27/Jan/19 Resolved: 12/Sep/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
client and server: lustre-b2_6-rc2 ldiskfs |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 14980 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/97672104-0dca-11e4-b3f5-5254006e85c2.

The sub-test test_151 failed with the following error:
== sanity test 151: test cache on oss and controls ================================= 19:31:03 (1405477863)
CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable osd-*.lustre-OST*.read_cache_enable 2>&1
CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable osd-*.lustre-OST*.read_cache_enable 2>&1
CMD: onyx-40vm8 /usr/sbin/lctl set_param -n obdfilter.lustre-OST*.writethrough_cache_enable=1 osd-*.lustre-OST*.writethrough_cache_enable=1 2>&1
CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.writethrough_cache_enable osd-*.lustre-OST*.writethrough_cache_enable 2>&1
4+0 records in
4+0 records out
16384 bytes (16 kB) copied, 0.00947514 s, 1.7 MB/s
CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.*OST*0000.stats osd-*.*OST*0000.stats 2>&1
CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.*OST*0000.stats osd-*.*OST*0000.stats 2>&1
BEFORE:11 AFTER:12
 sanity test_151: @@@@@@ FAIL: roc_hit is not safe to use |
| Comments |
| Comment by Oleg Drokin [ 21/Jul/14 ] |
|
So this failing code was added as part of |
| Comment by Sarah Liu [ 24/Jul/14 ] |
|
Also seen this during a rolling upgrade from 2.5 ldiskfs to 2.6. Before upgrade: 2.5.2 |
| Comment by Andreas Dilger [ 27/May/15 ] |
|
Haven't seen this in a long time. |
| Comment by Sarah Liu [ 02/Jul/15 ] |
|
Hit this again in interop testing with a lustre-master server (EL7) and a 2.5.3 client: https://testing.hpdd.intel.com/test_sets/e006681c-1250-11e5-bec9-5254006e85c2 |
| Comment by Andreas Dilger [ 07/Oct/15 ] |
|
This is failing between 0 and 5 times per day, maybe twice per day on average. It looks like most of these recent failures (excluding those attributable to

BEFORE:18720 AFTER:18721
 sanity test_151: @@@@@@ FAIL: roc_hit is not safe to use

so the before/after values are only off by one. I suspect this is just a problem with the test script: the roc_hit_init() function is just using "cat $DIR/$tfile" to read the file, and with proper readahead of files smaller than max_readahead_whole it should only do a single read. So roc_hit_init() should be changed to use something like:

    if (( AFTER - BEFORE == 0 || AFTER - BEFORE > 4 )); then
        rm -rf $dir
        error "roc_hit is not safe to use: BEFORE=$BEFORE, AFTER=$AFTER"
    fi

The "rm -rf $dir" at the end of roc_hit_init() should also be changed to just use "rmdir $dir", since that directory should be empty at this point because $file is deleted on each loop iteration. |
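For context, here is a rough sketch of how that relaxed check could sit inside roc_hit_init(). The helpers used (roc_hit, cancel_lru_locks, error, comma_list, osts_nodes, $LFS, $DIR, $tdir, $tfile, OSTCOUNT) come from the Lustre test framework, but the loop body below is an approximation rather than the exact sanity.sh source:

    roc_hit_init() {
        local dir=$DIR/$tdir-check        # hypothetical scratch directory
        local file=$dir/$tfile
        local BEFORE AFTER i

        mkdir -p $dir
        for i in $(seq 0 $((OSTCOUNT - 1))); do
            $LFS setstripe -c 1 -i $i $file
            dd if=/dev/zero of=$file bs=4k count=4 conv=fsync 2>/dev/null

            BEFORE=$(roc_hit)
            cancel_lru_locks osc
            # with readahead of files smaller than max_readahead_whole,
            # one cat should normally mean a single cached read
            cat $file > /dev/null
            AFTER=$(roc_hit)

            echo "BEFORE:$BEFORE AFTER:$AFTER"
            # tolerate a small range instead of requiring an exact delta
            if (( AFTER - BEFORE == 0 || AFTER - BEFORE > 4 )); then
                rm -rf $dir
                error "roc_hit is not safe to use: BEFORE=$BEFORE, AFTER=$AFTER"
            fi
            rm -f $file
        done
        # the directory is empty at this point, so rmdir is enough
        rmdir $dir
    }

The exact number of cache accesses per read can still vary with readahead settings, which is why the check tolerates a delta of up to 4 rather than requiring an exact match. |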
| Comment by Sarah Liu [ 11/Dec/15 ] |
|
Also hit this issue after a rolling downgrade from master/3264 RHEL6.7 to 2.5.5 RHEL6.6 |
| Comment by Saurabh Tandan (Inactive) [ 15/Dec/15 ] |
|
Encountered another instance for Interop master<->2.5.5 |
| Comment by Saurabh Tandan (Inactive) [ 16/Dec/15 ] |
|
Server: Master, Build# 3266, Tag 2.7.64, RHEL 7 |
| Comment by James A Simmons [ 15/Jan/16 ] |
|
I know why this test is failing. Symlinks are being created by the obdfilter into the osd-* layer for writethrough_cache_enable, readcache_max_filesize, read_cache_enable, and brw_stats. This works for ldiskfs but not for ZFS: ZFS only has brw_stats and lacks the rest. This is why sanity test 151 fails for ZFS. |
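As a quick, purely illustrative way to confirm this on a given OSS (parameter names taken from the comment above; exit-code behaviour of lctl list_param on no match is an assumption), one could list each parameter under the osd-* layer and see which are present on ldiskfs versus ZFS backends:

    # run on an OSS node; on ldiskfs all four parameters should resolve
    # under osd-*, while on ZFS typically only brw_stats does
    for p in writethrough_cache_enable readcache_max_filesize \
             read_cache_enable brw_stats; do
        lctl list_param "osd-*.*.$p" > /dev/null 2>&1 &&
            echo "$p: present in osd-*" ||
            echo "$p: missing in osd-*"
    done

On an ldiskfs OST the loop should report all four as present, consistent with the symlink behaviour described above. |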
| Comment by Andreas Dilger [ 15/Jan/16 ] |
|
James, wouldn't that cause review-zfs to fail all the time? It appears that test_151 has checks for read_cache_enable and writethrough_cache_enable, though it does "return 0" instead of "skip" as it probably should. |
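For illustration, converting that early return into a skip could look roughly like the following sketch, using standard test-framework helpers (do_nodes, comma_list, osts_nodes, skip); the exact existing check in test_151 may differ:

    # skip, rather than silently return 0, when the OSS does not
    # expose the read cache controls at all
    if ! do_nodes $(comma_list $(osts_nodes)) \
        "$LCTL get_param -n obdfilter.*.read_cache_enable \
         osd-*.*.read_cache_enable" > /dev/null 2>&1; then
        skip "not cache-capable obdfilter"
        return 0
    fi

That keeps the result visible as a SKIP in the test report instead of a silent pass. |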
| Comment by James A Simmons [ 15/Jan/16 ] |
|
The source of the failure is get_osd_param(). It should be reporting that those parameters don't exist. It's doing a "grep -v 'Found no match'", but that string is not what lctl get_param reports. Running the command manually gives:

[root@ninja11 lustre-OST0000]# lctl get_param -n obdfilter.lustre-OST0000.read_cache_enable

Ah yes, I moved from the custom globerrstr() to the standard strerror(...). The failure in this case is due to the |
| Comment by James A Simmons [ 15/Jan/16 ] |
|
I'm thinking the "grep -v 'Found no match'" test might not always work. I'm exploring testing the return value "$?" of the command; I'd like to test whether "$?" is less than zero. Would something like this work?

     do_nodes $nodes "$LCTL set_param -n obdfilter.$device.$name=$value \
    -    osd-*.$device.$name=$value 2>&1" | grep -v 'Found no match'
    +    osd-*.$device.$name=$value 2>&1" || return [ $? -lt 0 ]

Sorry, not the greatest bash scripter. |
| Comment by Andreas Dilger [ 15/Jan/16 ] |
|
No, because bash always returns positive error numbers and not negative ones. You could check [ $? -ne 0 ], but that might as well just be "return $?", which is also the default behaviour when returning from a function - to return the exit code of the last command. The other question is whether "lctl set_param" actually returns an error code on errors, or just prints a message. In this case, you might be better off using "| egrep -v 'Found no match|no such file or directory'" or similar, to ensure it works for both old and new lctl, since this will also run in interop mode with servers that do not have your patches. Is there a reason you got rid of globerrstr() and went to strerror()? |
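A sketch of how set_osd_param() might apply that suggestion; the function body below is approximated from the snippet quoted earlier in this ticket, not copied from test-framework.sh:

    set_osd_param() {
        local nodes=$1
        local device=${2:-$FSNAME-OST*}
        local name=$3
        local value=$4

        # filter both the old custom message and the strerror()-based one so
        # the helper behaves the same against old and new lctl in interop runs
        do_nodes $nodes "$LCTL set_param -n obdfilter.$device.$name=$value \
            osd-*.$device.$name=$value 2>&1" |
            egrep -v 'Found no match|[Nn]o such file or directory'
    }

Matching both capitalizations of the strerror() text is a defensive assumption on my part; the main point is simply filtering the expected "parameter not found" noise from whichever lctl is installed. |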
| Comment by James A Simmons [ 15/Jan/16 ] |
|
globerrstr() only handled 3 error cases. The move to cfs_get_paths() expanded the possible errors. I have a working solution now; just pushed the patch. |
| Comment by Saurabh Tandan (Inactive) [ 19/Jan/16 ] |
|
Another instance found for interop: EL6.7 Server/2.5.5 Client |
| Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ] |
|
Another instance found for interop tag 2.7.66 - EL6.7 Server/2.5.5 Client, build# 3316
Another instance found for interop tag 2.7.66 - EL7 Server/2.5.5 Client, build# 3316 |
| Comment by Saurabh Tandan (Inactive) [ 24/Feb/16 ] |
|
Another instance found for interop - EL6.7 Server/2.5.5 Client, tag 2.7.90. |
| Comment by James A Simmons [ 10/Sep/18 ] |
|
Can we close this? |
| Comment by Andreas Dilger [ 12/Sep/18 ] |
|
Recent failures reporting this ticket were caused by |
| Comment by James A Simmons [ 25/Jan/19 ] |
|
I'm seeing this bug again |
| Comment by Andreas Dilger [ 27/Jan/19 ] |
|
Recent failures were triggered by |