
Failure on test suite sanity test_151 test_156: roc_hit is not safe to use

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Affects Version/s: Lustre 2.6.0, Lustre 2.8.0
    • Environment: client and server: lustre-b2_6-rc2 ldiskfs;
      client is SLES11 SP3

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/97672104-0dca-11e4-b3f5-5254006e85c2.

      The sub-test test_151 failed with the following error:

      roc_hit is not safe to use

      == sanity test 151: test cache on oss and controls ================================= 19:31:03 (1405477863)
      CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
      CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
      CMD: onyx-40vm8 /usr/sbin/lctl set_param -n obdfilter.lustre-OST*.writethrough_cache_enable=1 		osd-*.lustre-OST*.writethrough_cache_enable=1 2>&1
      CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.writethrough_cache_enable 		osd-*.lustre-OST*.writethrough_cache_enable 2>&1
      4+0 records in
      4+0 records out
      16384 bytes (16 kB) copied, 0.00947514 s, 1.7 MB/s
      CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.*OST*0000.stats 		osd-*.*OST*0000.stats 2>&1
      CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.*OST*0000.stats 		osd-*.*OST*0000.stats 2>&1
      BEFORE:11 AFTER:12
       sanity test_151: @@@@@@ FAIL: roc_hit is not safe to use 
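
      For context, the BEFORE/AFTER values above come from a roc_hit-style helper
      that reads the cache_hit counter out of the obdfilter stats shown in the log.
      An illustrative way to inspect that counter by hand on the OSS (not the
      test's exact code):

        # sum the cache_hit sample counts reported by the OST stats files
        lctl get_param -n obdfilter.*OST*0000.stats 2>&1 |
                awk '/^cache_hit/ { sum += $2 } END { print sum }'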
      

    Activity

            [LU-5375] Failure on test suite sanity test_151 test_156: roc_hit is not safe to use
            simmonsja James A Simmons added a comment - - edited

            I'm thinking the "grep -v 'Found no match'" test might not always work. I'm exploring testing the return value "$?" of the command instead; I'd like to test whether "$?" is less than zero. Would something like this work?

            do_nodes $nodes "$LCTL set_param -n obdfilter.$device.$name=$value \
            -              osd-*.$device.$name=$value 2>&1" | grep -v 'Found no match'
            +              osd-*.$device.$name=$value 2>&1" || return [ $? -lt 0 ]
            

            Sorry, I'm not the greatest bash scripter.
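
            For what it's worth, shell exit statuses are unsigned (0-255), so a
            "$? -lt 0" test can never be true, and in the original line the pipe
            into grep means "$?" would reflect grep rather than do_nodes. A minimal
            sketch of how the do_nodes failure status could be propagated instead
            (illustrative only, not the committed fix):

                local rc
                # keep the original filtering, but remember do_nodes's own status
                do_nodes $nodes "$LCTL set_param -n obdfilter.$device.$name=$value \
                                osd-*.$device.$name=$value 2>&1" |
                        grep -v 'Found no match'
                rc=${PIPESTATUS[0]}
                (( rc != 0 )) && return $rc
                return 0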


            simmonsja James A Simmons added a comment -

            The source of the failure is get_osd_param(). It should be reporting that those parameters don't exist. It's doing a grep -v 'Found no match', but that string is no longer what lctl get_param prints. Running the command manually gives:

            [root@ninja11 lustre-OST0000]# lctl get_param -n obdfilter.lustre-OST0000.read_cache_enable
            error: list_param: obdfilter/lustre-OST0000/read_cache_enable: No such file or directory

            Ah yes, I moved from the custom globerrstr() to the standard strerror(...). The failure in this case is due to the LU-5030 changes.
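
            For reference, a version of get_osd_param() that tolerates both error
            strings (the old custom "Found no match" text and the strerror()-based
            "No such file or directory") could look roughly like this. The
            parameter handling is paraphrased from the snippet quoted above; this
            is a sketch, not the committed fix:

                get_osd_param() {
                        local nodes=$1
                        local device=${2:-$FSNAME-OST*}
                        local name=$3

                        # filter out both "parameter does not exist" messages so
                        # callers see only real values
                        do_nodes $nodes "$LCTL get_param -n obdfilter.$device.$name \
                                osd-*.$device.$name 2>&1" |
                                grep -v -e 'Found no match' -e 'No such file or directory'
                }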


            adilger Andreas Dilger added a comment -

            James, wouldn't that cause review-zfs to fail all the time? It appears that test_151 has checks for read_cache_enable and writethrough_cache_enable, though it does "return 0" instead of "skip" as it probably should.
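
            As an illustration of the change Andreas describes, and assuming
            get_osd_param() filters out both error strings as in the sketch
            further above, the cache-capability check at the top of test_151
            could report a skip instead of silently passing. Helper names
            (comma_list, osts_nodes, skip) follow the test framework; the
            existing check in test_151 may differ in detail:

                local list=$(comma_list $(osts_nodes))
                # empty output means the OSDs expose no read_cache_enable tunable
                if [ -z "$(get_osd_param $list '' read_cache_enable | head -n1)" ]; then
                        skip "not cache-capable obdfilter"
                        return 0
                fi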


            simmonsja James A Simmons added a comment -

            I know why this test is failing. Symlinks are being created by the obdfilter into the osd-* layer for writethrough_cache_enable, readcache_max_filesize, read_cache_enable, and brw_stats. This works for ldiskfs but not for ZFS: ZFS only has brw_stats and lacks the rest. This is why sanity test 151 fails for ZFS.
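
            Illustrative only: one way to check which of these tunables the OSD
            layer actually exposes is to list them directly on the OSS (on ldiskfs
            all four should show up; on zfs, only brw_stats):

                lctl list_param osd-*.*OST*.writethrough_cache_enable \
                        osd-*.*OST*.readcache_max_filesize \
                        osd-*.*OST*.read_cache_enable \
                        osd-*.*OST*.brw_stats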


            standan Saurabh Tandan (Inactive) added a comment -

            Server: Master, Build# 3266, Tag 2.7.64, RHEL 7
            Client: 2.5.5, b2_5_fe/62
            https://testing.hpdd.intel.com/test_sets/e4f27f18-9fff-11e5-a33d-5254006e85c2


            standan Saurabh Tandan (Inactive) added a comment -

            Encountered another instance for Interop master<->2.5.5
            Server: Master, Build# 3266, Tag 2.7.64
            Client: 2.5.5, b2_5_fe/62
            https://testing.hpdd.intel.com/test_sets/ac332386-9fcc-11e5-a33d-5254006e85c2

            sarah Sarah Liu added a comment -

            Also hit this issue after a rolling downgrade from master/3264 RHEL6.7 to 2.5.5 RHEL6.6.

            adilger Andreas Dilger added a comment - - edited

            This is failing between 0-5 times per day, maybe twice per day on average. It looks like most of these recent failures (excluding those attributable to LU-5030 breaking /proc access completely) are of the form:

            BEFORE:18720 AFTER:18721
             sanity test_151: @@@@@@ FAIL: roc_hit is not safe to use 
            

            so the before/after values are only off by one. I suspect this is just a problem with the test script: the roc_hit_init() function uses cat $DIR/$tfile to read the file, and with proper readahead of files smaller than max_readahead_whole it should issue only a single read. So roc_hit_init() should be changed to use something like:

                            if (( AFTER - BEFORE == 0 || AFTER - BEFORE > 4)); then
                                    rm -rf $dir
                                    error "roc_hit is not safe to use: BEFORE=$BEFORE, AFTER=$AFTER"
                            fi
            

            The rm -rf $dir at the end of roc_hit_init() should also be changed to just use rmdir $dir since this directory should be empty at this point because $file is deleted for each loop.
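
            Putting those suggestions together, a revised roc_hit_init() might look
            roughly like the following. This is a sketch only: helper names such as
            roc_hit, cancel_lru_locks, test_mkdir, and error follow the existing
            test framework, but the body paraphrases rather than quotes the current
            function.

                roc_hit_init() {
                        local dir=$DIR/$tdir
                        local file=$dir/$tfile
                        local BEFORE AFTER idx

                        test_mkdir $dir

                        for idx in $(seq 0 $((OSTCOUNT - 1))); do
                                # one object per OST so each OST's cache is exercised
                                $LFS setstripe -c 1 -i $idx $file

                                # write 16KB, drop client locks, then re-read so the
                                # read comes back through the OSS read cache
                                dd if=/dev/urandom of=$file bs=4k count=4 2>/dev/null
                                cancel_lru_locks osc
                                BEFORE=$(roc_hit)
                                cat $file > /dev/null
                                AFTER=$(roc_hit)

                                # readahead of a file smaller than max_readahead_whole
                                # should give between 1 and 4 cache hits: 0 means the
                                # counter is not moving, more than 4 means something
                                # else is touching it
                                if (( AFTER - BEFORE == 0 || AFTER - BEFORE > 4 )); then
                                        rm -rf $dir
                                        error "roc_hit is not safe to use: BEFORE=$BEFORE, AFTER=$AFTER"
                                fi

                                rm -f $file
                        done

                        # the directory is empty here since $file is removed each loop
                        rmdir $dir
                }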

            sarah Sarah Liu added a comment -

            Hit this again in interop testing with a lustre-master server (EL7) and a 2.5.3 client:

            https://testing.hpdd.intel.com/test_sets/e006681c-1250-11e5-bec9-5254006e85c2


            adilger Andreas Dilger added a comment -

            Haven't seen this in a long time.

            sarah Sarah Liu added a comment - - edited

            Also seen this during a rolling upgrade from 2.5 ldiskfs to 2.6.
            After the MDS and OSS were upgraded to 2.6, both clients stayed on 2.5, and sanity test_151 then failed with the same error.

            before upgrade: 2.5.2
            after upgrade: b2_6-rc2


            People

              Assignee: WC Triage
              Reporter: Maloo
              Votes: 0
              Watchers: 9