[LU-2735] sanity.sh test_151: NOT IN CACHE: before: 337, after: 337 Created: 01/Feb/13  Updated: 21/Jun/13  Resolved: 21/Jun/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Keith Mannthey (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-2848 Failure on test suite sanity test_151... Resolved
is related to LU-2902 sanity test_156: NOT IN CACHE: before... Resolved
Severity: 3
Rank (Obsolete): 6639

 Description   

This issue was created by maloo for Li Wei <liwei@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/b8f32682-6c5f-11e2-91d6-52540035b04c.

The sub-test test_151 failed with the following error:

NOT IN CACHE: before: 337, after: 337

Info required for matching: sanity 151

== sanity test 151: test cache on oss and controls ================================= 21:28:47 (1359696527)
CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
CMD: client-21-ib /usr/sbin/lctl set_param -n obdfilter.lustre-OST*.writethrough_cache_enable=1 		osd-*.lustre-OST*.writethrough_cache_enable=1 2>&1
3+0 records in
3+0 records out
12288 bytes (12 kB) copied, 0.00445821 s, 2.8 MB/s
CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
 sanity test_151: @@@@@@ FAIL: NOT IN CACHE: before: 337, after: 337 
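
For reference, the check that fails here boils down to: write a small file, record the obdfilter cache_hit counters, force the next read back to the OSS, re-read the file, and expect the counters to grow. A minimal sketch of that sequence in shell (illustration only; the real test_151 in sanity.sh uses its DIR/do_facet/roc_hit/cancel_lru_locks helpers, and the mount point below is an assumption):

# Write a small file; with writethrough_cache_enable=1 its pages should land in the OSS read cache.
FILE=/mnt/lustre/f151                                  # assumed client mount point
dd if=/dev/zero of=$FILE bs=4k count=3 conv=fsync
# Sum the cache_hit counters on the OSS (the real test runs this remotely via do_facet).
before=$(lctl get_param -n "obdfilter.*OST*.stats" "osd-*.*OST*.stats" 2>/dev/null |
         awk '/cache_hit/ { sum += $2 } END { print sum + 0 }')
# Drop the client-side DLM locks (and hence cached pages) so the re-read hits the OSS;
# this is roughly what the cancel_lru_locks helper does.
lctl set_param -n "ldlm.namespaces.*osc*.lru_size=clear"
cat $FILE > /dev/null
after=$(lctl get_param -n "obdfilter.*OST*.stats" "osd-*.*OST*.stats" 2>/dev/null |
        awk '/cache_hit/ { sum += $2 } END { print sum + 0 }')
# The test expects the re-read to have been served from the OSS cache.
[ "$after" -gt "$before" ] || echo "NOT IN CACHE: before: $before, after: $after"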


 Comments   
Comment by Hongchao Zhang [ 18/Feb/13 ]

In the debug logs on the OST, the read requests are indeed sent from the client to the OST, so the possible cause of this issue is that the pages are dropped from the cache between the write and the read operations. I will create a debug patch to test this.

Comment by Hongchao Zhang [ 20/Feb/13 ]

The patch is tracked at http://review.whamcloud.com/#change,5475

Comment by nasf (Inactive) [ 20/Feb/13 ]

Another failure instance:

https://maloo.whamcloud.com/test_sets/91aa7cb6-7ad7-11e2-b916-52540035b04c

Comment by Keith Mannthey (Inactive) [ 06/Mar/13 ]

Another instance: Maloo reports that 5 of the last 100 runs failed.

https://maloo.whamcloud.com/test_sets/1a87288c-85f4-11e2-9f8d-52540035b04c

Comment by Jian Yu [ 09/Mar/13 ]

Another one: https://maloo.whamcloud.com/test_sets/ec2cde8e-8885-11e2-b643-52540035b04c

Comment by Jodi Levi (Inactive) [ 14/Mar/13 ]

Patch landed to master and confirmed with Yu Jian, this ticket can be closed.


Comment by Keith Mannthey (Inactive) [ 14/Mar/13 ]

Sorry to have to reopen this issue but I think we need to be really clear about what the situation is.

The patch adds special behavior (enabled by setting a fail_loc value) to make Lustre conform to the behavior the test expects.

I don't expect fail_loc to be used outside of testing.

I argue that this is not a fix: either the test needs to be changed with the understanding that cache behavior can be non-deterministic, or the cache needs to conform to the expected behavior under all conditions.

This patch is a great debug point, but I don't think it is a solution to the core issue. I happen to be working on LU-2902, which is basically this same issue.

Also, perhaps this is not a blocking issue, and I will not argue if it is closed again.
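
For reference, the special behavior described above is gated on a fail_loc that the test sets on the OSS before the cache check and clears afterwards; the 0x609 value visible in the later logs presumably corresponds to the OBD_FAIL_OBD_NO_LRU check added by http://review.whamcloud.com/5475. A rough sketch (constant meaning hedged):

# On the OSS, ask the filter/osd code not to prune its read cache during the check
# (0x609 is the fail_loc seen in the test logs below; presumably OBD_FAIL_OBD_NO_LRU).
lctl set_param fail_loc=0x609
# ... write the file, sample cache_hit, re-read, sample cache_hit again ...
# Restore normal cache/LRU behavior once the check is done.
lctl set_param fail_loc=0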

Comment by Oleg Drokin [ 15/Mar/13 ]

Well, I agree it's sort of a hack to disable cache pruning for the test.
But what other options do we have? Not many.

The other option I can think of is to somehow ensure there is a lot of free RAM, so there is no memory pressure (e.g. tail /dev/zero or some other memhog that then terminates and leaves a lot of free RAM behind).
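
A minimal sketch of the kind of memhog described here (purely hypothetical, not part of any patch): dirty a large chunk of memory to push other pages out, then release it so the test starts with plenty of free RAM:

# Hypothetical pre-test memhog: consume about half of RAM in tmpfs, then drop it.
# Note that /dev/shm is normally capped at half of RAM, and pushing this far can
# itself risk an OOM on a tight box (see the next comment).
mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
dd if=/dev/zero of=/dev/shm/memhog bs=1M count=$((mem_kb / 2048)) 2>/dev/null
rm -f /dev/shm/memhog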

Comment by Keith Mannthey (Inactive) [ 15/Mar/13 ]

I agree this is a touchy problem, but it is good to test this behavior when conditions are right.

A memhog allocation might OOM the box if we are already under enough memory pressure to evict the last small write from the page cache in short order.

I am going to propose a patch that attempts to detect when we have been dropped from the cache under memory pressure. It can't be 100% reliable (without a hard cache_drop proc stat), but it could serve as a guide for when to skip the test.
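
A rough sketch of that heuristic (hypothetical; the real patch would live in sanity.sh and use its skip helpers): check how much free memory the OSS has before trusting the cache contents, and skip rather than fail when reclaim is likely:

# Hypothetical guard: if the OSS is low on free memory, the freshly written pages
# may already have been reclaimed, so skip the cache check instead of failing.
free_kb=$(awk '/^MemFree:/ { print $2 }' /proc/meminfo)
if [ "$free_kb" -lt 65536 ]; then      # arbitrary 64 MB threshold, for illustration only
        echo "SKIP: OSS under memory pressure; cache contents are not deterministic"
        exit 0
fi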

Comment by Peter Jones [ 15/Mar/13 ]

OK, then I am dropping the priority and reassigning to Keith for the extra work he is proposing. Alternatively, this could have been covered under a new enhancement ticket.

Comment by Andreas Dilger [ 18/Mar/13 ]

This seems closely related to, if not a duplicate of, LU-2848. Can this be closed as a duplicate, or are they separate issues?

Comment by Andreas Dilger [ 18/Mar/13 ]

The reason I ask is that this test is still failing several times a week.

Comment by Keith Mannthey (Inactive) [ 18/Mar/13 ]

LU-2848 is the "NOT IN CACHE before: , after: " behavior. I don't think that issue is caused by items being dropped from the cache (which is what the landed fail_loc patch protects against). In my testing for LU-2902 I was unable to reproduce the issue except when the OOM killer killed ssh between the nodes. There is still another issue out there in the roc_hit family of tests. I have submitted a patch (as part of LU-2902) to report what roc_hit is actually reading from /proc.
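
For context, "what roc_hit is getting from /proc" boils down to the cache_hit lines in the OST stats files; an illustrative debug dump along these lines (not the actual LU-2902 patch) shows the raw counters being summed:

# Dump the raw cache counters that the roc_hit helper sums up.
lctl get_param -n "obdfilter.*OST*.stats" "osd-*.*OST*.stats" 2>/dev/null |
        grep -E 'cache_(access|hit|miss)'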

Comment by Keith Mannthey (Inactive) [ 04/Apr/13 ]

OK, new update:

https://maloo.whamcloud.com/test_sets/c11cebd6-9cde-11e2-8b27-52540035b04c
test_151

Error: 'NOT IN CACHE: before: 349, after: 349'
Failure Rate: 7.00% of last 100 executions [all branches]

There are still null errors out there.

This run should be with the fail_loc fix in place.
From the test log:

fail_loc=0x609
3+0 records in
3+0 records out
12288 bytes (12 kB) copied, 0.00473753 s, 2.6 MB/s
CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
snapshot_time 1365007767.342119 secs.usecs read_bytes 4 samples [bytes] 4096 1048576 1380352 write_bytes 3 samples [bytes] 327680 1048576 1900544 get_info 81 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 35 samples [reqs] create 2 samples [reqs] destroy 76 samples [reqs] punch 1 samples [reqs] sync 3 samples [reqs] preprw 7 samples [reqs] commitrw 7 samples [reqs] ping 10 samples [reqs]
snapshot_time 1365007767.342180 secs.usecs write_bytes 1 samples [bytes] 50400 50400 50400 get_info 76 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] sync 1 samples [reqs] preprw 1 samples [reqs] commitrw 1 samples [reqs] ping 11 samples [reqs]
snapshot_time 1365007767.342212 secs.usecs read_bytes 7 samples [bytes] 4096 8192 49152 write_bytes 3 samples [bytes] 1910 6096 9922 get_info 85 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] punch 1 samples [reqs] sync 4 samples [reqs] preprw 10 samples [reqs] commitrw 10 samples [reqs] ping 12 samples [reqs]
snapshot_time 1365007767.342242 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs]
snapshot_time 1365007767.342268 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 17 samples [reqs]
snapshot_time 1365007767.342293 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs]
snapshot_time 1365007767.342324 secs.usecs get_info 72 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs]
snapshot_time 1365007767.342552 secs.usecs get_page 7 samples [usec] 1 11 24 162 cache_access 337 samples [pages] 1 1 337 cache_hit 337 samples [pages] 1 1 337
snapshot_time 1365007767.342582 secs.usecs get_page 1 samples [usec] 1 1 1 1
snapshot_time 1365007767.342599 secs.usecs get_page 10 samples [usec] 0 1 3 3 cache_access 12 samples [pages] 1 1 12 cache_hit 12 samples [pages] 1 1 12
CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
snapshot_time 1365007768.89231 secs.usecs read_bytes 4 samples [bytes] 4096 1048576 1380352 write_bytes 3 samples [bytes] 327680 1048576 1900544 get_info 81 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 36 samples [reqs] create 2 samples [reqs] destroy 76 samples [reqs] punch 1 samples [reqs] sync 3 samples [reqs] preprw 7 samples [reqs] commitrw 7 samples [reqs] ping 10 samples [reqs]
snapshot_time 1365007768.89292 secs.usecs write_bytes 1 samples [bytes] 50400 50400 50400 get_info 76 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 35 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] sync 1 samples [reqs] preprw 1 samples [reqs] commitrw 1 samples [reqs] ping 11 samples [reqs]
snapshot_time 1365007768.89323 secs.usecs read_bytes 7 samples [bytes] 4096 8192 49152 write_bytes 3 samples [bytes] 1910 6096 9922 get_info 86 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 35 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] punch 1 samples [reqs] sync 4 samples [reqs] preprw 10 samples [reqs] commitrw 10 samples [reqs] ping 12 samples [reqs]
snapshot_time 1365007768.89353 secs.usecs read_bytes 2 samples [bytes] 4096 12288 16384 write_bytes 1 samples [bytes] 12288 12288 12288 get_info 74 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 35 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] sync 1 samples [reqs] preprw 3 samples [reqs] commitrw 3 samples [reqs] ping 12 samples [reqs]
snapshot_time 1365007768.89382 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 17 samples [reqs]
snapshot_time 1365007768.89413 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs]
snapshot_time 1365007768.89447 secs.usecs get_info 72 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs]
snapshot_time 1365007768.89653 secs.usecs get_page 7 samples [usec] 1 11 24 162 cache_access 337 samples [pages] 1 1 337 cache_hit 337 samples [pages] 1 1 337
snapshot_time 1365007768.89681 secs.usecs get_page 1 samples [usec] 1 1 1 1
snapshot_time 1365007768.89697 secs.usecs get_page 10 samples [usec] 0 1 3 3 cache_access 12 samples [pages] 1 1 12 cache_hit 12 samples [pages] 1 1 12
CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
CMD: client-21-ib /usr/sbin/lctl set_param fail_loc=0
fail_loc=0
 sanity test_151: @@@@@@ FAIL: NOT IN CACHE: before: 349, after: 349 

It seems we are back to square one.

Comment by Oleg Drokin [ 11/Apr/13 ]

Happened again: https://maloo.whamcloud.com/test_sets/f0607d16-a285-11e2-81ba-52540035b04c

Comment by Andreas Dilger [ 15/Apr/13 ]

Not sure if this is a spurious observation or not, but it appears that most of these recent failures are on client-23-ib (i.e. running IB networking) instead of TCP networking. At least it appears that there has been a decrease in the number of this style of failure when running over TCP.


Comment by Andreas Dilger [ 15/Apr/13 ]

There are no more LU-2735 failures reported for TCP since 2013-03-31 (https://maloo.whamcloud.com/sub_tests/4b048dda-9bd2-11e2-be58-52540035b04c), but there have been 7 failures on client-23-ib since that time.

Comment by Andreas Dilger [ 15/Apr/13 ]

To compare these numbers, IB has had 7/11 runs fail since 2013-03-31, while TCP has had about 350 passes without hitting this specific failure mode (for "review", though "review-dne" is failing for potentially another reason).

This might just relate to a race condition based on the speed of the networking, or it might relate to the fact that IB testing is run on separate physical nodes, while TCP testing is run on VMs on a single node.

Comment by Keith Mannthey (Inactive) [ 15/Apr/13 ]

It is known that the NOT IN CACHE issues can be caused by memory pressure. The IB systems could be under more memory pressure. Thanks for spotting the IB pattern; I have only really been tracking the "review" queue with "master" builds.

Comment by Keith Mannthey (Inactive) [ 25/Apr/13 ]

From a client IB test failure, the "before" debug output: https://maloo.whamcloud.com/test_logs/c0ac67cc-ad68-11e2-b72d-52540035b04c

snapshot_time 1366819124.688743 secs.usecs read_bytes 6 samples [bytes] 4096 1048576 1396736 write_bytes 6 samples [bytes] 777 1048576 1909333 get_info 87 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 59 samples [reqs] create 2 samples [reqs] destroy 77 samples [reqs] setattr 2 samples [reqs] punch 4 samples [reqs] sync 6 samples [reqs] preprw 12 samples [reqs] commitrw 12 samples [reqs] ping 41 samples [reqs]

snapshot_time 1366819124.688804 secs.usecs read_bytes 130 samples [bytes] 8192 1048576 134234112 write_bytes 130 samples [bytes] 1916 1048576 134225740 get_info 96 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 79 samples [reqs] setattr 1 samples [reqs] punch 5 samples [reqs] sync 6 samples [reqs] preprw 260 samples [reqs] commitrw 260 samples [reqs] ping 38 samples [reqs]

snapshot_time 1366819124.688838 secs.usecs write_bytes 1 samples [bytes] 50400 50400 50400 get_info 76 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] setattr 1 samples [reqs] sync 1 samples [reqs] preprw 1 samples [reqs] commitrw 1 samples [reqs] ping 44 samples [reqs]

snapshot_time 1366819124.688867 secs.usecs read_bytes 7 samples [bytes] 4096 8192 49152 write_bytes 3 samples [bytes] 1910 6096 9922 get_info 85 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] setattr 1 samples [reqs] punch 1 samples [reqs] sync 4 samples [reqs] preprw 10 samples [reqs] commitrw 10 samples [reqs] ping 44 samples [reqs]

snapshot_time 1366819124.688919 secs.usecs read_bytes 6 samples [bytes] 4096 12288 49152 write_bytes 2 samples [bytes] 3019 12288 15307 get_info 78 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 75 samples [reqs] punch 1 samples [reqs] sync 2 samples [reqs] preprw 8 samples [reqs] commitrw 8 samples [reqs] ping 42 samples [reqs]

snapshot_time 1366819124.688957 secs.usecs read_bytes 3 samples [bytes] 8192 8192 24576 write_bytes 2 samples [bytes] 6096 6096 12192 get_info 83 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 74 samples [reqs] punch 1 samples [reqs] sync 3 samples [reqs] preprw 5 samples [reqs] commitrw 5 samples [reqs] ping 50 samples [reqs]

snapshot_time 1366819124.688991 secs.usecs read_bytes 2 samples [bytes] 8192 8192 16384 write_bytes 3 samples [bytes] 1916 6096 12108 get_info 81 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 75 samples [reqs] punch 1 samples [reqs] sync 4 samples [reqs] preprw 5 samples [reqs] commitrw 5 samples [reqs] ping 49 samples [reqs]

snapshot_time 1366819124.689203 secs.usecs get_page 12 samples [usec] 0 10 37 217 cache_access 341 samples [pages] 1 1 341 cache_hit 339 samples [pages] 1 1 339 cache_miss 2 samples [pages] 1 1 2

snapshot_time 1366819124.689231 secs.usecs get_page 260 samples [usec] 1 234 10228 728602 cache_access 32772 samples [pages] 1 1 32772 cache_hit 16384 samples [pages] 1 1 16384 cache_miss 16388 samples [pages] 1 1 16388

snapshot_time 1366819124.689262 secs.usecs get_page 10 samples [usec] 0 1 4 4 cache_access 12 samples [pages] 1 1 12 cache_hit 12 samples [pages] 1 1 12

snapshot_time 1366819124.689280 secs.usecs get_page 8 samples [usec] 0 3 6 14 cache_access 9 samples [pages] 1 1 9 cache_hit 6 samples [pages] 1 1 6 cache_miss 3 samples [pages] 1 1 3

snapshot_time 1366819124.689311 secs.usecs get_page 5 samples [usec] 0 5 11 43 cache_access 4 samples [pages] 1 1 4 cache_hit 2 samples [pages] 1 1 2 cache_miss 2 samples [pages] 1 1 2

What is important to note is that there are only 5 proc entries reporting cache_access; 2 of the OST /proc stats are missing. Also, compared with the "after" results, there is no OST that claims the cache_access. This may just be a different form of the 'no values read from /proc' problem.

I will pursue further debug work in LU-2902. The IB clients do seem to help trigger the issue.
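
One way to see which stats files are missing the cache counters (illustrative only, not the LU-2902 debug patch) is to walk the parameters by name rather than using -n:

# Print the cache_access count per OST stats entry; devices with no counter show "none".
for p in $(lctl list_param "obdfilter.*OST*.stats" "osd-*.*OST*.stats" 2>/dev/null); do
        hits=$(lctl get_param -n "$p" | awk '/cache_access/ { print $2 }')
        echo "$p: cache_access=${hits:-none}"
done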

Comment by Keith Mannthey (Inactive) [ 30/May/13 ]

It appears that LU-2979 is the root cause of all of these issues. Since the patch for it landed, we have seen no errors.

Perhaps http://review.whamcloud.com/5475 should be reverted?

Comment by Andreas Dilger [ 30/May/13 ]

It looks like 5475 is just adding a test case, and it is never a bad idea to keep the test case after the bug has been fixed.

Comment by Keith Mannthey (Inactive) [ 30/May/13 ]

I just wanted to be sure we are OK with OBD_FAIL_OBD_NO_LRU and the fail_loc mode change being used to make the test pass.

Comment by Keith Mannthey (Inactive) [ 21/Jun/13 ]

There has been no sign of this issue in over a month. The proc changes fixed any remaining errors.
