[LU-2735] sanity.sh test_151: NOT IN CACHE: before: 337, after: 337

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version: Lustre 2.4.0
    • Fix Version: Lustre 2.4.0
    • None
    • 3
    • 6639

    Description

      This issue was created by maloo for Li Wei <liwei@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/b8f32682-6c5f-11e2-91d6-52540035b04c.

      The sub-test test_151 failed with the following error:

      NOT IN CACHE: before: 337, after: 337

      Info required for matching: sanity 151

      == sanity test 151: test cache on oss and controls ================================= 21:28:47 (1359696527)
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
      CMD: client-21-ib /usr/sbin/lctl set_param -n obdfilter.lustre-OST*.writethrough_cache_enable=1 		osd-*.lustre-OST*.writethrough_cache_enable=1 2>&1
      3+0 records in
      3+0 records out
      12288 bytes (12 kB) copied, 0.00445821 s, 2.8 MB/s
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
       sanity test_151: @@@@@@ FAIL: NOT IN CACHE: before: 337, after: 337 
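
      For reference, the failing assertion is just a before/after comparison of the OSS read-cache counters visible in the stats output above. Below is a minimal sketch of that style of check, not the exact sanity.sh test_151 code; it assumes the usual test-framework helpers (do_facet, cancel_lru_locks, error) and the $DIR/$tfile conventions, and sums the cache_hit sample counts from the OST stats.

      # Hedged sketch, not the actual test_151 implementation; helper names,
      # facet name, and parameter globs are assumptions based on the log above.
      oss_cache_hits() {
              # Sum the "cache_hit" sample counts across all OST stats files on the OSS.
              do_facet ost1 "lctl get_param -n obdfilter.lustre-OST*.stats osd-*.lustre-OST*.stats 2>/dev/null" |
                      awk '/cache_hit/ { sum += $2 } END { print sum + 0 }'
      }

      before=$(oss_cache_hits)
      cancel_lru_locks osc                 # force the client to re-read from the OSS
      cat $DIR/$tfile > /dev/null          # re-read the 12kB file written by dd above
      after=$(oss_cache_hits)
      [ "$after" -gt "$before" ] ||
              error "NOT IN CACHE: before: $before, after: $after"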
      

    Activity

            adilger Andreas Dilger added a comment -

            To compare these numbers: IB has had 7/11 runs fail since 2013-03-31, while TCP has had about 350 passes without hitting this specific failure mode (for "review", though "review-dne" is failing, potentially for another reason).

            This might just be a race condition influenced by the speed of the networking, or it might relate to the fact that IB testing is run on separate physical nodes, while TCP testing is run on VMs on a single node.

            adilger Andreas Dilger added a comment -

            There have been no more LU-2735 failures reported for TCP since 2013-03-31 (https://maloo.whamcloud.com/sub_tests/4b048dda-9bd2-11e2-be58-52540035b04c), but there have been 7 failures on client-23-ib since that time.

            adilger Andreas Dilger added a comment -

            Not sure if this is a spurious observation or not, but it appears that most of these recent failures are on client-23-ib (i.e. running IB networking) instead of TCP networking. At least it appears that there was a decrease in the number of this style of failure when running on IB.

            green Oleg Drokin added a comment -

            Happened again: https://maloo.whamcloud.com/test_sets/f0607d16-a285-11e2-81ba-52540035b04c

            keith Keith Mannthey (Inactive) added a comment -

            OK, new update:

            https://maloo.whamcloud.com/test_sets/c11cebd6-9cde-11e2-8b27-52540035b04c
            test_151

            Error: 'NOT IN CACHE: before: 349, after: 349'
            Failure Rate: 7.00% of last 100 executions [all branches]

            There are still null errors out there.

            This is (or at least should be) a run with the fail_loc fix in place.
            From the test log:

            fail_loc=0x609
            3+0 records in
            3+0 records out
            12288 bytes (12 kB) copied, 0.00473753 s, 2.6 MB/s
            CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
            snapshot_time 1365007767.342119 secs.usecs read_bytes 4 samples [bytes] 4096 1048576 1380352 write_bytes 3 samples [bytes] 327680 1048576 1900544 get_info 81 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 35 samples [reqs] create 2 samples [reqs] destroy 76 samples [reqs] punch 1 samples [reqs] sync 3 samples [reqs] preprw 7 samples [reqs] commitrw 7 samples [reqs] ping 10 samples [reqs] snapshot_time 1365007767.342180 secs.usecs write_bytes 1 samples [bytes] 50400 50400 50400 get_info 76 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] sync 1 samples [reqs] preprw 1 samples [reqs] commitrw 1 samples [reqs] ping 11 samples [reqs] snapshot_time 1365007767.342212 secs.usecs read_bytes 7 samples [bytes] 4096 8192 49152 write_bytes 3 samples [bytes] 1910 6096 9922 get_info 85 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] punch 1 samples [reqs] sync 4 samples [reqs] preprw 10 samples [reqs] commitrw 10 samples [reqs] ping 12 samples [reqs] snapshot_time 1365007767.342242 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs] snapshot_time 1365007767.342268 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 17 samples [reqs] snapshot_time 1365007767.342293 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs] snapshot_time 1365007767.342324 secs.usecs get_info 72 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs] snapshot_time 1365007767.342552 secs.usecs get_page 7 samples [usec] 1 11 24 162 cache_access 337 samples [pages] 1 1 337 cache_hit 337 samples [pages] 1 1 337 snapshot_time 1365007767.342582 secs.usecs get_page 1 samples [usec] 1 1 1 1 snapshot_time 1365007767.342599 secs.usecs get_page 10 samples [usec] 0 1 3 3 cache_access 12 samples [pages] 1 1 12 cache_hit 12 samples [pages] 1 1 12
            CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
            CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
            snapshot_time 1365007768.89231 secs.usecs read_bytes 4 samples [bytes] 4096 1048576 1380352 write_bytes 3 samples [bytes] 327680 1048576 1900544 get_info 81 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 36 samples [reqs] create 2 samples [reqs] destroy 76 samples [reqs] punch 1 samples [reqs] sync 3 samples [reqs] preprw 7 samples [reqs] commitrw 7 samples [reqs] ping 10 samples [reqs] snapshot_time 1365007768.89292 secs.usecs write_bytes 1 samples [bytes] 50400 50400 50400 get_info 76 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 35 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] sync 1 samples [reqs] preprw 1 samples [reqs] commitrw 1 samples [reqs] ping 11 samples [reqs] snapshot_time 1365007768.89323 secs.usecs read_bytes 7 samples [bytes] 4096 8192 49152 write_bytes 3 samples [bytes] 1910 6096 9922 get_info 86 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 35 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] punch 1 samples [reqs] sync 4 samples [reqs] preprw 10 samples [reqs] commitrw 10 samples [reqs] ping 12 samples [reqs] snapshot_time 1365007768.89353 secs.usecs read_bytes 2 samples [bytes] 4096 12288 16384 write_bytes 1 samples [bytes] 12288 12288 12288 get_info 74 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 35 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] sync 1 samples [reqs] preprw 3 samples [reqs] commitrw 3 samples [reqs] ping 12 samples [reqs] snapshot_time 1365007768.89382 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 17 samples [reqs] snapshot_time 1365007768.89413 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs] snapshot_time 1365007768.89447 secs.usecs get_info 72 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs] snapshot_time 1365007768.89653 secs.usecs get_page 7 samples [usec] 1 11 24 162 cache_access 337 samples [pages] 1 1 337 cache_hit 337 samples [pages] 1 1 337 snapshot_time 1365007768.89681 secs.usecs get_page 1 samples [usec] 1 1 1 1 snapshot_time 1365007768.89697 secs.usecs get_page 10 samples [usec] 0 1 3 3 cache_access 12 samples [pages] 1 1 12 cache_hit 12 samples [pages] 1 1 12
            CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
            CMD: client-21-ib /usr/sbin/lctl set_param fail_loc=0
            fail_loc=0
             sanity test_151: @@@@@@ FAIL: NOT IN CACHE: before: 349, after: 349 
            

            It seems we are back to square one.

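            Both stats dumps above show identical cache counters (e.g. cache_access 337 ... cache_hit 337 and cache_access 12 ... cache_hit 12, which together account for the 349 in the error message), i.e. the re-read never registered as a cache access at all. For anyone digging through captured dumps like these, a quick way to pull out just the cache counters for a before/after comparison (the log file names below are placeholders, not files from this run):

            # Extract only the cache counters from two captured stats dumps so the
            # before/after values can be compared at a glance; file names are
            # placeholders for wherever the dumps were saved.
            for f in stats.before.log stats.after.log; do
                    printf '%s: ' "$f"
                    grep -o 'cache_[a-z]* [0-9]*' "$f" | tr '\n' ' '
                    echo
            done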

            keith Keith Mannthey (Inactive) added a comment -

            LU-2848 is the "NOT IN CACHE before: , after: " behavior. I don't think that issue is caused by items being dropped from the cache (that is what the landed fail_loc patch protects against). In my testing for LU-2902 I was unable to reproduce the issue outside of the OOM killer killing ssh between the nodes. There is still another issue out there in the roc_hit family of tests. I have submitted a patch (as part of LU-2902) to report what roc_hit is getting from /proc, in order to see what it sees.
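
            A minimal sketch of the kind of debug output described above, assuming the do_facet helper; this is only an illustration of dumping the raw stats lines a roc_hit-style helper would parse, not the patch that was actually submitted:

            # Hedged illustration only: dump the raw cache counter lines from the OSS
            # so a failing run records exactly what roc_hit would have parsed.
            dump_cache_stats() {
                    do_facet ost1 "lctl get_param obdfilter.*.stats osd-*.*.stats" 2>&1 |
                            grep -E 'cache_(access|hit|miss)' ||
                            echo "no cache counters found on ost1"
            }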

            adilger Andreas Dilger added a comment -

            The reason I ask is that this test is still failing several times a week.

            adilger Andreas Dilger added a comment -

            This seems closely related to, if not a duplicate of, LU-2848. Can this be closed as a duplicate, or are they separate issues?

            pjones Peter Jones added a comment -

            OK, then I am dropping the priority and reassigning to Keith for the extra work he is proposing. Alternatively, this could have been covered under a new enhancement ticket.


            keith Keith Mannthey (Inactive) added a comment -

            I agree this is a touchy problem, but it is good to test this behavior when conditions are right.

            A memhog allocation might OOM the box if we are under enough memory pressure to evict the last small write from the page cache in short order.

            I am going to propose a patch that attempts to detect when we are dropped from the cache under memory pressure. It can't be a 100% reliable check (without a hard cache_drop proc stat), but it could be a guide as to when to skip the test.
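
            For illustration, a crude sketch of the kind of guard being proposed; the threshold, the meminfo heuristic, and the use of the skip/do_facet helpers are all assumptions here, not the actual patch:

            # Hedged sketch of a memory-pressure guard, not the proposed patch itself:
            # if the OSS has little free/cached memory, assume the freshly written
            # pages may already have been reclaimed and skip the cache check.
            oss_free_kb=$(do_facet ost1 cat /proc/meminfo |
                    awk '/^(MemFree|Cached):/ { kb += $2 } END { print kb }')
            if [ "${oss_free_kb:-0}" -lt 262144 ]; then    # arbitrary 256MB free+cached threshold
                    skip "OSS under memory pressure (${oss_free_kb}kB free+cached), cache may have been dropped"
            fi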

            People

              Assignee: keith Keith Mannthey (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 10
