[LU-5375] Failure on test suite sanity test_151 test_156: roc_hit is not safe to use Created: 19/Jul/14  Updated: 27/Jan/19  Resolved: 12/Sep/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

client and server: lustre-b2_6-rc2 ldiskfs
client is SLES11 SP3


Issue Links:
Related
is related to LU-11889 sanity test 156 fails on ZFS: roc_hit... Open
is related to LU-2902 sanity test_156: NOT IN CACHE: before... Resolved
is related to LU-11347 Do not use pagecache for SSD I/O when... Resolved
is related to LU-2261 Add cache stats to zfs osd Resolved
is related to LU-11607 Reduce repeated function calls in Lus... Resolved
Severity: 3
Rank (Obsolete): 14980

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/97672104-0dca-11e4-b3f5-5254006e85c2.

The sub-test test_151 failed with the following error:

roc_hit is not safe to use

== sanity test 151: test cache on oss and controls ================================= 19:31:03 (1405477863)
CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
CMD: onyx-40vm8 /usr/sbin/lctl set_param -n obdfilter.lustre-OST*.writethrough_cache_enable=1 		osd-*.lustre-OST*.writethrough_cache_enable=1 2>&1
CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.writethrough_cache_enable 		osd-*.lustre-OST*.writethrough_cache_enable 2>&1
4+0 records in
4+0 records out
16384 bytes (16 kB) copied, 0.00947514 s, 1.7 MB/s
CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.*OST*0000.stats 		osd-*.*OST*0000.stats 2>&1
CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.*OST*0000.stats 		osd-*.*OST*0000.stats 2>&1
BEFORE:11 AFTER:12
 sanity test_151: @@@@@@ FAIL: roc_hit is not safe to use 


 Comments   
Comment by Oleg Drokin [ 21/Jul/14 ]

This failing code was added as part of LU-2902, so whoever ends up looking at this might want to look there too.

Comment by Sarah Liu [ 24/Jul/14 ]

Also seen this during a rolling upgrade from 2.5 ldiskfs to 2.6.
After the MDS and OSS were upgraded to 2.6, both clients stayed on 2.5, and sanity test_151 then failed with the same error.

before upgrade: 2.5.2
after upgrade: b2_6-rc2

Comment by Andreas Dilger [ 27/May/15 ]

Haven't seen this in a long time.

Comment by Sarah Liu [ 02/Jul/15 ]

Hit this again in interop testing with a lustre-master server (EL7) and a 2.5.3 client.

https://testing.hpdd.intel.com/test_sets/e006681c-1250-11e5-bec9-5254006e85c2

Comment by Andreas Dilger [ 07/Oct/15 ]

This is failing between 0 and 5 times per day, maybe twice per day on average. It looks like most of these recent failures (excluding those attributable to LU-5030 breaking /proc access completely) are of the form:

BEFORE:18720 AFTER:18721
 sanity test_151: @@@@@@ FAIL: roc_hit is not safe to use 

so the before/after values are only off by one. I suspect this is just a problem with the test script: roc_hit_init() uses cat $DIR/$tfile to read the file, and with proper readahead of files smaller than max_readahead_whole it should only issue a single read. So roc_hit_init() should be changed to use something like:

                if (( AFTER - BEFORE == 0 || AFTER - BEFORE > 4)); then
                        rm -rf $dir
                        error "roc_hit is not safe to use: BEFORE=$BEFORE, AFTER=$AFTER"
                fi

The rm -rf $dir at the end of roc_hit_init() should also be changed to just use rmdir $dir, since the directory should be empty at that point because $file is deleted in each loop iteration.
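
For context, a minimal sketch of how that relaxed check could sit inside roc_hit_init(); the surrounding function body (scratch directory name, write size, setstripe invocation) is an assumption for illustration, not a verbatim copy of sanity.sh:

roc_hit_init() {
        local dir=$DIR/$tdir-check          # assumed scratch directory name
        local file=$dir/$tfile
        local BEFORE AFTER idx

        test_mkdir $dir
        for idx in $(seq 0 $((OSTCOUNT - 1))); do
                # place the file on one OST, write a little data, then read it
                # back so the OSS read cache hit counter should increase
                $LFS setstripe -c 1 -i $idx $file || error "setstripe $idx failed"
                dd if=/dev/urandom of=$file bs=4k count=4 2>/dev/null ||
                        error "writing $file failed"
                cancel_lru_locks osc
                BEFORE=$(roc_hit)
                cat $file > /dev/null
                AFTER=$(roc_hit)
                # accept any small positive delta rather than an exact count,
                # since readahead may issue 1-4 reads for a small file
                if (( AFTER - BEFORE == 0 || AFTER - BEFORE > 4 )); then
                        rm -rf $dir
                        error "roc_hit is not safe to use: BEFORE=$BEFORE, AFTER=$AFTER"
                fi
                rm -f $file
        done
        rmdir $dir                          # empty at this point, rmdir suffices
}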

Comment by Sarah Liu [ 11/Dec/15 ]

also hit this issue after rolling downgrade from master/3264 RHEL6.7 to 2.5.5 RHEL6.6

Comment by Saurabh Tandan (Inactive) [ 15/Dec/15 ]

Encountered another instance for Interop master<->2.5.5
Server: Master, Build# 3266, Tag 2.7.64
Client: 2.5.5, b2_5_fe/62
https://testing.hpdd.intel.com/test_sets/ac332386-9fcc-11e5-a33d-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 16/Dec/15 ]

Server: Master, Build# 3266, Tag 2.7.64 , RHEL 7
Client: 2.5.5, b2_5_fe/62
https://testing.hpdd.intel.com/test_sets/e4f27f18-9fff-11e5-a33d-5254006e85c2

Comment by James A Simmons [ 15/Jan/16 ]

I know why this test is failing. Symlinks are being created by obdfilter into the osd-* layer for writethrough_cache_enable, readcache_max_filesize, read_cache_enable, and brw_stats. This works for ldiskfs but not for ZFS: ZFS only has brw_stats and lacks the rest. This is why sanity test 151 fails for ZFS.

Comment by Andreas Dilger [ 15/Jan/16 ]

James, wouldn't that cause review-zfs to fail all the time? It appears that test_151 has checks for read_cache_enable and writethrough_cache_enable, though it does "return 0" instead of "skip" as it probably should.

Comment by James A Simmons [ 15/Jan/16 ]

The source of the failure is get_osd_param(). It should be reporting that those parameters don't exist. It's doing a grep -v 'Found no match', but that message is no longer what lctl get_param reports. Running the command manually gives:

root@ninja11 lustre-OST0000]# lctl get_param -n obdfilter.lustre-OST0000.read_cache_enable
error: list_param: obdfilter/lustre-OST0000/read_cache_enable: No such file or directory

Ah yes, I moved from the custom globerrstr() to the standard strerror(). The failure in this case is due to the LU-5030 changes.
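
For reference, this is roughly the filter that get_osd_param() applies (paraphrased from test-framework.sh, so treat the exact wording as an assumption):

        # Old lctl printed "Found no match", which the grep stripped out.
        # New lctl prints "error: ...: No such file or directory" instead,
        # which passes straight through and ends up in the caller's output.
        do_nodes $nodes "$LCTL get_param -n obdfilter.$device.$name \
                osd-*.$device.$name 2>&1" | grep -v 'Found no match'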

Comment by James A Simmons [ 15/Jan/16 ]

I'm thinking the "grep -v 'Found no match'" test might not always work. I'm exploring testing the return value "$?" of the command; I'd like to check whether "$?" is less than zero. Would something like this work?

do_nodes $nodes "$LCTL set_param -n obdfilter.$device.$name=$value \
-              osd-*.$device.$name=$value 2>&1" | grep -v 'Found no match'
+              osd-*.$device.$name=$value 2>&1" || return [ $? -lt 0 ]

Sorry, I'm not the greatest bash scripter.

Comment by Andreas Dilger [ 15/Jan/16 ]

No, because bash always returns positive error numbers and not negative ones. You could check [ $? -ne 0 ], but that might as well just be return $?, which is also the default behaviour when returning from a function: the exit code of the last command is returned. The other question is whether "lctl set_param" actually returns an error code on errors, or just prints a message.
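
A quick shell illustration of the exit status point; fail_stub here is a hypothetical stand-in for a failing command:

        fail_stub() { return 2; }       # pretend command that fails
        fail_stub
        rc=$?
        echo $rc                        # prints 2; exit statuses are 0-255, never negative
        [ $rc -lt 0 ] && echo "never reached"
        [ $rc -ne 0 ] && echo "this is the check that actually fires"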

In this case, you might be better off using | egrep -v 'Found no match|no such file or directory' or similar, to ensure it works for both old and new lctl, since this will also run in interop mode with servers that do not have your patches. Is there a reason you got rid of globerrstr() and went to strerror()?
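
A sketch of what that could look like in set_osd_param(); the function skeleton is an assumption based on the diff quoted above, not the landed patch:

        set_osd_param() {
                local nodes=$1
                local device=${2:-$FSNAME-OST*}
                local name=$3
                local value=$4

                # filter both the old and the new "parameter not found" messages
                # so interop runs against old and new lctl behave the same way
                do_nodes $nodes "$LCTL set_param -n obdfilter.$device.$name=$value \
                        osd-*.$device.$name=$value 2>&1" |
                        egrep -v 'Found no match|No such file or directory'
        }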

Comment by James A Simmons [ 15/Jan/16 ]

globerrstr() only handled three error cases. The move to cfs_get_paths() expanded the possible errors. I have a working solution now and have just pushed the patch.

Comment by Saurabh Tandan (Inactive) [ 19/Jan/16 ]

Another instance found for interop: EL6.7 Server/2.5.5 Client
Server: master, build# 3303, RHEL 6.7
Client: 2.5.5, b2_5_fe/62
https://testing.hpdd.intel.com/test_sets/24b4b54e-bad6-11e5-9137-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ]

Another instance found for interop tag 2.7.66 - EL6.7 Server/2.5.5 Client, build# 3316
https://testing.hpdd.intel.com/test_sets/9ed7c1d8-cc9f-11e5-963e-5254006e85c2

Another instance found for interop tag 2.7.66 - EL7 Server/2.5.5 Client, build# 3316
https://testing.hpdd.intel.com/test_sets/5ea975e2-cc46-11e5-901d-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 24/Feb/16 ]

Another instance found for interop - EL6.7 Server/2.5.5 Client, tag 2.7.90.
https://testing.hpdd.intel.com/test_sessions/f99a2d60-d567-11e5-bc47-5254006e85c2
Another instance found for interop - EL7 Server/2.5.5 Client, tag 2.7.90.
https://testing.hpdd.intel.com/test_sessions/93baffee-d2ae-11e5-8697-5254006e85c2

Comment by James A Simmons [ 10/Sep/18 ]

Can we close this?

Comment by Andreas Dilger [ 12/Sep/18 ]

Recent failures reported against this ticket were caused by the LU-11347 patch; otherwise this hasn't been hit in a long time.

Comment by James A Simmons [ 25/Jan/19 ]

I'm seeing this bug again in 2.12.50 testing.

Comment by Andreas Dilger [ 27/Jan/19 ]

Recent failures were triggered by the LU-11607 patch landing, but it turns out the problem was in the original LU-2261 patch. See LU-11889 for details.
