LU-2735: sanity.sh test_151: NOT IN CACHE: before: 337, after: 337

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.4.0
    • Fix Version/s: Lustre 2.4.0
    • Labels: None
    • Severity: 3
    • Rank: 6639

    Description

      This issue was created by maloo for Li Wei <liwei@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/b8f32682-6c5f-11e2-91d6-52540035b04c.

      The sub-test test_151 failed with the following error:

      NOT IN CACHE: before: 337, after: 337

      Info required for matching: sanity 151

      == sanity test 151: test cache on oss and controls ================================= 21:28:47 (1359696527)
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
      CMD: client-21-ib /usr/sbin/lctl set_param -n obdfilter.lustre-OST*.writethrough_cache_enable=1 		osd-*.lustre-OST*.writethrough_cache_enable=1 2>&1
      3+0 records in
      3+0 records out
      12288 bytes (12 kB) copied, 0.00445821 s, 2.8 MB/s
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
       sanity test_151: @@@@@@ FAIL: NOT IN CACHE: before: 337, after: 337 
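
      For context, the check that produces this message works roughly like the sketch below: write a few pages, drop the client-side locks so a read has to go back to the OSS, and expect the OSS read-cache hit counter to advance by the number of pages read. The helper name, stat name, mount point, and file name here are assumptions for illustration, not a verbatim copy of sanity.sh; as the CMD lines above show, the real test gathers the stats remotely on the OSS node rather than locally.

        # Hedged sketch of a test_151-style check (names and paths are illustrative).
        CPAGES=3                       # pages written (matches the 12288-byte dd above)
        FILE=/mnt/lustre/f151.sanity   # assumed client mount point and file name

        # Sum OSS read-cache hits across all OSTs (assumes a 'cache_hit' stats line).
        roc_hit() {
            lctl get_param -n "obdfilter.*.stats" "osd-*.*.stats" 2>/dev/null |
                awk '/cache_hit/ { sum += $2 } END { print sum + 0 }'
        }

        dd if=/dev/urandom of=$FILE bs=4k count=$CPAGES
        # Drop client-side OSC locks/pages so the next read is served by the OSS.
        lctl set_param -n "ldlm.namespaces.*osc*.lru_size=clear"

        before=$(roc_hit)
        cat $FILE > /dev/null
        after=$(roc_hit)

        # The read is expected to hit the OSS read cache; if the written pages were
        # already evicted (e.g. under memory pressure), the counter does not move.
        if (( after - before != CPAGES )); then
            echo "NOT IN CACHE: before: $before, after: $after"
        fi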
      

          Activity


            adilger Andreas Dilger added a comment -

            This seems closely related to, if not a duplicate of, LU-2848. Can this be closed as a duplicate, or are they separate issues?
            pjones Peter Jones added a comment -

            OK, then I am dropping the priority and reassigning to Keith for the extra work he is proposing. Alternatively, this could have been covered under a new enhancement ticket.

            keith Keith Mannthey (Inactive) added a comment -

            I agree this is a touchy problem, but it is good to test this behavior when the conditions are right.

            A memhog allocation might OOM the box if we are already under enough memory pressure to evict the last small write from the page cache in short order.

            I am going to propose a patch that attempts to detect when we were dropped from the cache under memory pressure. It can't be 100% reliable (without a hard cache_drop proc stat), but it could serve as a guide for when to skip the test.
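
            A guard like the one described above could look roughly like the sketch below (purely illustrative; the actual patch may use different signals): sample the kernel's reclaim counters around the write, and skip the cache check if reclaim ran, since the freshly written pages may already have been evicted. In a real multi-node run the counters would be read on the OSS (e.g. via do_facet), not locally.

              # Illustrative only: counter names come from /proc/vmstat.
              reclaim_events() {
                  awk '/^pgscan_(kswapd|direct)/ { sum += $2 } END { print sum + 0 }' /proc/vmstat
              }

              scans_before=$(reclaim_events)
              # ... run the dd + read + stats comparison from the test here ...
              scans_after=$(reclaim_events)

              if (( scans_after > scans_before )); then
                  echo "memory pressure detected, skipping OSS read-cache check"
              fi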
            green Oleg Drokin added a comment -

            Well, I agree it's sort of a hack to disable cache purging for the test. But what other options do we have? Not many.

            Another one I can think of is to somehow ensure there is a lot of free RAM, so there is no memory pressure (e.g. tail /dev/zero or some other memhog that then terminates and leaves a lot of free RAM behind).
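
            A minimal sketch of that idea, assuming the memhog utility from the numactl package is available on the OSS (the fraction of RAM is illustrative): touch a large allocation once and exit, so the kernel reclaims other cached pages and the test starts with plenty of free memory.

              # Illustrative only: pre-reclaim memory so the test starts with free RAM.
              total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)

              # Touch roughly 75% of RAM once, then exit and leave it free.
              memhog -r1 "$(( total_kb * 3 / 4 ))k" > /dev/null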

            keith Keith Mannthey (Inactive) added a comment -

            Sorry to have to reopen this issue, but I think we need to be really clear about what the situation is.

            The patch creates special Lustre behavior (enabled by setting a fail_loc value) to conform Lustre to the behavior the test expects. I don't expect fail_loc to be used outside of testing.

            I would argue this is not a fix: either the test needs to be changed with the understanding that cache behavior can be non-deterministic, or the cache needs to conform to the expected behavior under all conditions.

            The patch is a great debug point, but I don't think it is a solution to the core issue. I happen to be working on LU-2902, which is basically this issue.

            Also, perhaps this is not a blocking issue, and I will not argue if it is closed again.
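
            For reference, fail_loc is Lustre's standard fault-injection hook and is toggled from a test script with lctl; the value below is a placeholder, the real one being whichever OBD_FAIL code the landed patch defines for this purpose.

              # Illustrative: gate the cache check behind a test-only fail_loc on the OSS.
              lctl set_param fail_loc=0x609   # placeholder value; keeps cached pages from being purged
              # ... run the dd / read / stats comparison here ...
              lctl set_param fail_loc=0       # restore normal cache behavior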

            jlevi Jodi Levi (Inactive) added a comment - Patch landed and confirmed with Yu Jian; this ticket can be closed.

            jlevi Jodi Levi (Inactive) added a comment - Patch landed to master and confirmed with Yu Jian; this ticket can be closed.
            yujian Jian Yu added a comment - Another one: https://maloo.whamcloud.com/test_sets/ec2cde8e-8885-11e2-b643-52540035b04c

            keith Keith Mannthey (Inactive) added a comment - Another instance: 5 out of the last 100 runs are reported as failed. https://maloo.whamcloud.com/test_sets/1a87288c-85f4-11e2-9f8d-52540035b04c
            yong.fan nasf (Inactive) added a comment - Another failure instance: https://maloo.whamcloud.com/test_sets/91aa7cb6-7ad7-11e2-b916-52540035b04c
            hongchao.zhang Hongchao Zhang added a comment - the patch is tracked at http://review.whamcloud.com/#change,5475

            People

              Assignee: keith Keith Mannthey (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 10
