Lustre / LU-2735

sanity.sh test_151: NOT IN CACHE: before: 337, after: 337

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.4.0
    • Fix Version/s: Lustre 2.4.0
    • None
    • 3
    • 6639

    Description

      This issue was created by maloo for Li Wei <liwei@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/b8f32682-6c5f-11e2-91d6-52540035b04c.

      The sub-test test_151 failed with the following error:

      NOT IN CACHE: before: 337, after: 337

      Info required for matching: sanity 151

      == sanity test 151: test cache on oss and controls ================================= 21:28:47 (1359696527)
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
      CMD: client-21-ib /usr/sbin/lctl set_param -n obdfilter.lustre-OST*.writethrough_cache_enable=1 		osd-*.lustre-OST*.writethrough_cache_enable=1 2>&1
      3+0 records in
      3+0 records out
      12288 bytes (12 kB) copied, 0.00445821 s, 2.8 MB/s
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
       sanity test_151: @@@@@@ FAIL: NOT IN CACHE: before: 337, after: 337 
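      For context, the check behind this message boils down to: sample the OSS cache-hit counters, re-read a small file that should still be in the OSS read cache, and sample again; if the counter did not grow, the test fails with the "before: N, after: N" message above. A minimal sketch of that logic (not the actual sanity.sh/roc_hit code; the file path is a placeholder and the real test queries the OSS node remotely rather than running lctl locally):

          # Sum the cache_hit sample counts across all OST stats files.
          sum_cache_hits() {
              lctl get_param -n obdfilter.lustre-OST*.stats osd-*.lustre-OST*.stats 2>/dev/null |
                  awk '$1 == "cache_hit" { sum += $2 } END { print sum + 0 }'
          }

          before=$(sum_cache_hits)
          dd if=/mnt/lustre/f151 of=/dev/null bs=4096 count=3   # small read that should hit the OSS cache
          after=$(sum_cache_hits)

          [ "$after" -gt "$before" ] ||
              echo "FAIL: NOT IN CACHE: before: $before, after: $after"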
      

          Activity

            Not sure whether this is a spurious observation or not, but it appears that most of these recent failures are on client-23-ib (i.e. running IB networking) instead of TCP networking. At least it appears that there was a decrease in the number of this style of failure when running on IB.

            adilger Andreas Dilger added a comment
            green Oleg Drokin added a comment - happened again https://maloo.whamcloud.com/test_sets/f0607d16-a285-11e2-81ba-52540035b04c

            OK, new update:

            https://maloo.whamcloud.com/test_sets/c11cebd6-9cde-11e2-8b27-52540035b04c
            test_151

            Error: 'NOT IN CACHE: before: 349, after: 349'
            Failure Rate: 7.00% of last 100 executions [all branches]

            There are still "null" errors (empty before/after values) out there as well.

            This is, or at least should be, with the fail_loc fix in place.
            From the test log:

            fail_loc=0x609
            3+0 records in
            3+0 records out
            12288 bytes (12 kB) copied, 0.00473753 s, 2.6 MB/s
            CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
            snapshot_time 1365007767.342119 secs.usecs read_bytes 4 samples [bytes] 4096 1048576 1380352 write_bytes 3 samples [bytes] 327680 1048576 1900544 get_info 81 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 35 samples [reqs] create 2 samples [reqs] destroy 76 samples [reqs] punch 1 samples [reqs] sync 3 samples [reqs] preprw 7 samples [reqs] commitrw 7 samples [reqs] ping 10 samples [reqs] snapshot_time 1365007767.342180 secs.usecs write_bytes 1 samples [bytes] 50400 50400 50400 get_info 76 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] sync 1 samples [reqs] preprw 1 samples [reqs] commitrw 1 samples [reqs] ping 11 samples [reqs] snapshot_time 1365007767.342212 secs.usecs read_bytes 7 samples [bytes] 4096 8192 49152 write_bytes 3 samples [bytes] 1910 6096 9922 get_info 85 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] punch 1 samples [reqs] sync 4 samples [reqs] preprw 10 samples [reqs] commitrw 10 samples [reqs] ping 12 samples [reqs] snapshot_time 1365007767.342242 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs] snapshot_time 1365007767.342268 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 17 samples [reqs] snapshot_time 1365007767.342293 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs] snapshot_time 1365007767.342324 secs.usecs get_info 72 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs] snapshot_time 1365007767.342552 secs.usecs get_page 7 samples [usec] 1 11 24 162 cache_access 337 samples [pages] 1 1 337 cache_hit 337 samples [pages] 1 1 337 snapshot_time 1365007767.342582 secs.usecs get_page 1 samples [usec] 1 1 1 1 snapshot_time 1365007767.342599 secs.usecs get_page 10 samples [usec] 0 1 3 3 cache_access 12 samples [pages] 1 1 12 cache_hit 12 samples [pages] 1 1 12
            CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
            CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
            snapshot_time 1365007768.89231 secs.usecs read_bytes 4 samples [bytes] 4096 1048576 1380352 write_bytes 3 samples [bytes] 327680 1048576 1900544 get_info 81 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 36 samples [reqs] create 2 samples [reqs] destroy 76 samples [reqs] punch 1 samples [reqs] sync 3 samples [reqs] preprw 7 samples [reqs] commitrw 7 samples [reqs] ping 10 samples [reqs] snapshot_time 1365007768.89292 secs.usecs write_bytes 1 samples [bytes] 50400 50400 50400 get_info 76 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 35 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] sync 1 samples [reqs] preprw 1 samples [reqs] commitrw 1 samples [reqs] ping 11 samples [reqs] snapshot_time 1365007768.89323 secs.usecs read_bytes 7 samples [bytes] 4096 8192 49152 write_bytes 3 samples [bytes] 1910 6096 9922 get_info 86 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 35 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] punch 1 samples [reqs] sync 4 samples [reqs] preprw 10 samples [reqs] commitrw 10 samples [reqs] ping 12 samples [reqs] snapshot_time 1365007768.89353 secs.usecs read_bytes 2 samples [bytes] 4096 12288 16384 write_bytes 1 samples [bytes] 12288 12288 12288 get_info 74 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 35 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] sync 1 samples [reqs] preprw 3 samples [reqs] commitrw 3 samples [reqs] ping 12 samples [reqs] snapshot_time 1365007768.89382 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 17 samples [reqs] snapshot_time 1365007768.89413 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs] snapshot_time 1365007768.89447 secs.usecs get_info 72 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs] snapshot_time 1365007768.89653 secs.usecs get_page 7 samples [usec] 1 11 24 162 cache_access 337 samples [pages] 1 1 337 cache_hit 337 samples [pages] 1 1 337 snapshot_time 1365007768.89681 secs.usecs get_page 1 samples [usec] 1 1 1 1 snapshot_time 1365007768.89697 secs.usecs get_page 10 samples [usec] 0 1 3 3 cache_access 12 samples [pages] 1 1 12 cache_hit 12 samples [pages] 1 1 12
            CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
            CMD: client-21-ib /usr/sbin/lctl set_param fail_loc=0
            fail_loc=0
             sanity test_151: @@@@@@ FAIL: NOT IN CACHE: before: 349, after: 349 
            

            It seems we are back to square one.

            keith Keith Mannthey (Inactive) added a comment
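            (For anyone reading these dumps: only the cache_access/cache_hit counters matter for this test. A quick way to pull them out for a before/after comparison, assuming it is run on the OSS — this is just a reading aid, not part of the test:

                lctl get_param -n obdfilter.lustre-OST*.stats osd-*.lustre-OST*.stats 2>&1 |
                    awk '$1 == "cache_access" || $1 == "cache_hit" { print $1, $2 }'

            In both dumps above this yields cache_hit 337 for the first target and cache_hit 12 for the last one, i.e. no new hits were recorded between the two samples.)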

            LU-2848 is the " NOT IN CACHE before: , after: " behavior. I don't think that issue is caused by items being dropped from the cache (that is what the landed fail_loc patch protects against). In my testing for LU-2902 I was unable to reproduce the issue outside of the OOM killer killing ssh between the nodes. There is still another issue out there in the roc_hit family of tests. I have submitted a patch (as part of LU-2902) to show what roc_hit is getting from /proc, in order to see what it sees.

            keith Keith Mannthey (Inactive) added a comment

            The reason I ask is that this test is still failing several times a week.

            adilger Andreas Dilger added a comment

            This seems closely related to, if not a duplicate of, LU-2848. Can this be closed as a duplicate, or are they separate issues?

            adilger Andreas Dilger added a comment
            pjones Peter Jones added a comment -

            OK, then I am dropping the priority and reassigning to Keith for the extra work he is proposing. Alternatively, this could have been covered under a new enhancement ticket.


            I agree this is a touchy problem, but it is good to test this behavior when conditions are right.

            A memhog allocation might OOM the box if we are under enough memory pressure to evict the last small write from the page cache in short order.

            I am going to propose a patch that attempts to detect when we are dropped from the cache under memory pressure. It can't be 100% reliable (without a hard cache_drop /proc stat), but it could be a guide as to when to skip the test.

            keith Keith Mannthey (Inactive) added a comment
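            (As an illustration of the kind of guard being proposed — not the actual patch — the test could check free memory on the OSS before trusting the cache, using the usual test-framework.sh helpers; the 64 MB threshold is an arbitrary example value:

                # Skip the cache check if the OSS looks short on free memory,
                # since the freshly written pages may already have been reclaimed.
                oss_free_kb=$(do_facet ost1 "awk '/^MemFree:/ { print \$2 }' /proc/meminfo")
                if [ "${oss_free_kb:-0}" -lt 65536 ]; then
                    skip "OSS under memory pressure (MemFree=${oss_free_kb} kB)"
                fi

            This can only be a heuristic, as the comment above notes, since there is no hard cache_drop statistic to check.)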
            green Oleg Drokin added a comment -

            Well, I agree it's sort of a hack to disable cache purging for the test.
            But what other options do we have? Not many.

            Another option I can think of is to somehow ensure there's a lot of free RAM, so there is no memory pressure (e.g. tail /dev/zero or some other such memhog that then terminates and leaves a lot of free RAM behind).


            Sorry to have to reopen this issue but I think we need to be really clear about what the situation is.

            The patch creates special Lustre behavior (accessed by setting a fail_loc value) to conform Lustre to the desired test behavior.

            I don't expect fail_loc to be used outside of testing.

            I argue this is not a fix, and that either the test needs to be changed, with the understanding that cache behavior can be non-deterministic, or the cache needs to conform to the expected behavior under all conditions.

            This patch is a great debug point, but I don't think it is a solution to the core issue. I happen to be working on LU-2902, which is basically this issue.

            Also, perhaps this is not a blocking issue, and I will not argue if it is closed again.

            keith Keith Mannthey (Inactive) added a comment
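            (For reference, the fail_loc in question is applied around the cache check roughly like this — a sketch using the usual test-framework.sh helpers, where check_cache_hits is a hypothetical stand-in for the before/after comparison and 0x609 is simply the value visible in the logs above:

                # Enable the test-only behavior on all OSSes, run the check, then clear it.
                do_nodes $(comma_list $(osts_nodes)) "lctl set_param fail_loc=0x609"
                check_cache_hits
                do_nodes $(comma_list $(osts_nodes)) "lctl set_param fail_loc=0"

            As noted above, fail_loc is only meant to be set by the test scripts, never in production.)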

            Patch landed, and confirmed with Yu Jian, this ticket can be closed.

            jlevi Jodi Levi (Inactive) added a comment

            People

              Assignee: keith Keith Mannthey (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 10
