Details
Bug
Resolution: Fixed
Minor
Lustre 2.4.0
None
3
6639
Description
This issue was created by maloo for Li Wei <liwei@whamcloud.com>
This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/b8f32682-6c5f-11e2-91d6-52540035b04c.
The sub-test test_151 failed with the following error:
NOT IN CACHE: before: 337, after: 337
Info required for matching: sanity 151
== sanity test 151: test cache on oss and controls ================================= 21:28:47 (1359696527)
CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable osd-*.lustre-OST*.read_cache_enable 2>&1
CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable osd-*.lustre-OST*.read_cache_enable 2>&1
CMD: client-21-ib /usr/sbin/lctl set_param -n obdfilter.lustre-OST*.writethrough_cache_enable=1 osd-*.lustre-OST*.writethrough_cache_enable=1 2>&1
3+0 records in
3+0 records out
12288 bytes (12 kB) copied, 0.00445821 s, 2.8 MB/s
CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats osd-*.lustre-OST*.stats 2>&1
CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats osd-*.lustre-OST*.stats 2>&1
sanity test_151: @@@@@@ FAIL: NOT IN CACHE: before: 337, after: 337
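For reference, the check the test performs can be reduced to a before/after sample of the OSS cache_hit counter. The following is only a minimal sketch under assumed names (a single target lustre-OST0000, an OSS reachable as oss-node, and a test file /mnt/lustre/f151), not the actual sanity.sh code:

# Sketch of the test_151 check: write a small file through the OSS writethrough
# cache, re-read it, and expect the OSS cache_hit counter to increase.
before=$(ssh oss-node lctl get_param -n obdfilter.lustre-OST0000.stats |
         awk '/^cache_hit/ {print $2}')
before=${before:-0}

# Cancel the client-side locks/cache so the re-read actually goes to the OSS
# (assumed approach; the real test uses its own framework helpers for this).
lctl set_param ldlm.namespaces.'*osc*'.lru_size=clear
dd if=/mnt/lustre/f151 of=/dev/null bs=4k count=3

after=$(ssh oss-node lctl get_param -n obdfilter.lustre-OST0000.stats |
        awk '/^cache_hit/ {print $2}')
after=${after:-0}

[ "$after" -gt "$before" ] || echo "NOT IN CACHE: before: $before, after: $after"

The failure above means the second sample was identical to the first, i.e. the re-read did not register as a cache hit on any OST.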
Activity
From a client IB test failure, the "before" debug output: https://maloo.whamcloud.com/test_logs/c0ac67cc-ad68-11e2-b72d-52540035b04c
snapshot_time 1366819124.688743 secs.usecs read_bytes 6 samples [bytes] 4096 1048576 1396736 write_bytes 6 samples [bytes] 777 1048576 1909333 get_info 87 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 59 samples [reqs] create 2 samples [reqs] destroy 77 samples [reqs] setattr 2 samples [reqs] punch 4 samples [reqs] sync 6 samples [reqs] preprw 12 samples [reqs] commitrw 12 samples [reqs] ping 41 samples [reqs]
snapshot_time 1366819124.688804 secs.usecs read_bytes 130 samples [bytes] 8192 1048576 134234112 write_bytes 130 samples [bytes] 1916 1048576 134225740 get_info 96 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 79 samples [reqs] setattr 1 samples [reqs] punch 5 samples [reqs] sync 6 samples [reqs] preprw 260 samples [reqs] commitrw 260 samples [reqs] ping 38 samples [reqs]
snapshot_time 1366819124.688838 secs.usecs write_bytes 1 samples [bytes] 50400 50400 50400 get_info 76 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] setattr 1 samples [reqs] sync 1 samples [reqs] preprw 1 samples [reqs] commitrw 1 samples [reqs] ping 44 samples [reqs]
snapshot_time 1366819124.688867 secs.usecs read_bytes 7 samples [bytes] 4096 8192 49152 write_bytes 3 samples [bytes] 1910 6096 9922 get_info 85 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] setattr 1 samples [reqs] punch 1 samples [reqs] sync 4 samples [reqs] preprw 10 samples [reqs] commitrw 10 samples [reqs] ping 44 samples [reqs]
snapshot_time 1366819124.688919 secs.usecs read_bytes 6 samples [bytes] 4096 12288 49152 write_bytes 2 samples [bytes] 3019 12288 15307 get_info 78 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 75 samples [reqs] punch 1 samples [reqs] sync 2 samples [reqs] preprw 8 samples [reqs] commitrw 8 samples [reqs] ping 42 samples [reqs]
snapshot_time 1366819124.688957 secs.usecs read_bytes 3 samples [bytes] 8192 8192 24576 write_bytes 2 samples [bytes] 6096 6096 12192 get_info 83 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 74 samples [reqs] punch 1 samples [reqs] sync 3 samples [reqs] preprw 5 samples [reqs] commitrw 5 samples [reqs] ping 50 samples [reqs]
snapshot_time 1366819124.688991 secs.usecs read_bytes 2 samples [bytes] 8192 8192 16384 write_bytes 3 samples [bytes] 1916 6096 12108 get_info 81 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 75 samples [reqs] punch 1 samples [reqs] sync 4 samples [reqs] preprw 5 samples [reqs] commitrw 5 samples [reqs] ping 49 samples [reqs]
snapshot_time 1366819124.689203 secs.usecs get_page 12 samples [usec] 0 10 37 217 cache_access 341 samples [pages] 1 1 341 cache_hit 339 samples [pages] 1 1 339 cache_miss 2 samples [pages] 1 1 2
snapshot_time 1366819124.689231 secs.usecs get_page 260 samples [usec] 1 234 10228 728602 cache_access 32772 samples [pages] 1 1 32772 cache_hit 16384 samples [pages] 1 1 16384 cache_miss 16388 samples [pages] 1 1 16388
snapshot_time 1366819124.689262 secs.usecs get_page 10 samples [usec] 0 1 4 4 cache_access 12 samples [pages] 1 1 12 cache_hit 12 samples [pages] 1 1 12
snapshot_time 1366819124.689280 secs.usecs get_page 8 samples [usec] 0 3 6 14 cache_access 9 samples [pages] 1 1 9 cache_hit 6 samples [pages] 1 1 6 cache_miss 3 samples [pages] 1 1 3
snapshot_time 1366819124.689311 secs.usecs get_page 5 samples [usec] 0 5 11 43 cache_access 4 samples [pages] 1 1 4 cache_hit 2 samples [pages] 1 1 2 cache_miss 2 samples [pages] 1 1 2
What is important to note is that there are only 5 /proc entries reporting cache_access; 2 of the OST /proc stats are missing. Also, when compared to the "after" results, there is no OST that claims the cache_access. This may just be another form of no values being read from /proc.
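One quick way to confirm which targets actually export the cache counters (a hedged one-liner, not part of the test; the OSS hostname is a placeholder) is to print each parameter name next to its cache_* lines so missing entries stand out:

# List every OST stats parameter name followed by its cache_* counters.
ssh oss-node 'lctl get_param obdfilter.lustre-OST*.stats osd-*.lustre-OST*.stats' |
    grep -E '^obdfilter\.|^osd-|cache_(access|hit|miss)'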
I will pursue further debug work in LU-2902. The IB clients do seem to help trigger the issue.
It is known that the NOT IN CACHE issues can be caused by memory pressure, and the IB systems could be under more memory pressure. Thanks for spotting the IB issue; I have only really been tracking the "review" queue with "master" builds.
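If memory pressure is the suspect, a simple check (the hostname is a placeholder; this is not part of the test suite) is to snapshot the OSS memory state around the failure window and scan for OOM activity:

# Low MemFree / large Slab plus OOM-killer messages would support the memory-pressure theory.
ssh oss-node 'grep -E "MemFree|Cached|Slab" /proc/meminfo; dmesg | grep -iE "oom|out of memory"'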
To compare these numbers, IB has had 7/11 runs fail since 2013-03-31, while TCP has had about 350 passes without hitting this specific failure mode (for "review", though "review-dne" is failing for potentially another reason).
This might just relate to a race condition based on the speed of the networking, or it might relate to the fact that IB testing is run on separate physical nodes, while TCP testing is run on VMs on a single node.
There are no more LU-2735 failures reported for TCP since 2013-03-31 (https://maloo.whamcloud.com/sub_tests/4b048dda-9bd2-11e2-be58-52540035b04c), but there have been 7 failures on client-23-ib since that time.
Not sure if this is a spurious observation or not, but it appears that most of these recent failures are on client-23-ib (i.e. running IB networking) instead of TCP networking. At least it appears that there has been a decrease in the number of this style of failure when running on TCP.
ok new update:
https://maloo.whamcloud.com/test_sets/c11cebd6-9cde-11e2-8b27-52540035b04c
test_151
Error: 'NOT IN CACHE: before: 349, after: 349'
Failure Rate: 7.00% of last 100 executions [all branches]
There are still null errors out there.
This is (or should be) with the fail_loc fix in place.
From the test log:
fail_loc=0x609
3+0 records in
3+0 records out
12288 bytes (12 kB) copied, 0.00473753 s, 2.6 MB/s
CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats osd-*.lustre-OST*.stats 2>&1
snapshot_time 1365007767.342119 secs.usecs read_bytes 4 samples [bytes] 4096 1048576 1380352 write_bytes 3 samples [bytes] 327680 1048576 1900544 get_info 81 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 35 samples [reqs] create 2 samples [reqs] destroy 76 samples [reqs] punch 1 samples [reqs] sync 3 samples [reqs] preprw 7 samples [reqs] commitrw 7 samples [reqs] ping 10 samples [reqs]
snapshot_time 1365007767.342180 secs.usecs write_bytes 1 samples [bytes] 50400 50400 50400 get_info 76 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] sync 1 samples [reqs] preprw 1 samples [reqs] commitrw 1 samples [reqs] ping 11 samples [reqs]
snapshot_time 1365007767.342212 secs.usecs read_bytes 7 samples [bytes] 4096 8192 49152 write_bytes 3 samples [bytes] 1910 6096 9922 get_info 85 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] punch 1 samples [reqs] sync 4 samples [reqs] preprw 10 samples [reqs] commitrw 10 samples [reqs] ping 12 samples [reqs]
snapshot_time 1365007767.342242 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs]
snapshot_time 1365007767.342268 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 17 samples [reqs]
snapshot_time 1365007767.342293 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs]
snapshot_time 1365007767.342324 secs.usecs get_info 72 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs]
snapshot_time 1365007767.342552 secs.usecs get_page 7 samples [usec] 1 11 24 162 cache_access 337 samples [pages] 1 1 337 cache_hit 337 samples [pages] 1 1 337
snapshot_time 1365007767.342582 secs.usecs get_page 1 samples [usec] 1 1 1 1
snapshot_time 1365007767.342599 secs.usecs get_page 10 samples [usec] 0 1 3 3 cache_access 12 samples [pages] 1 1 12 cache_hit 12 samples [pages] 1 1 12
CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats osd-*.lustre-OST*.stats 2>&1
CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats osd-*.lustre-OST*.stats 2>&1
snapshot_time 1365007768.89231 secs.usecs read_bytes 4 samples [bytes] 4096 1048576 1380352 write_bytes 3 samples [bytes] 327680 1048576 1900544 get_info 81 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 36 samples [reqs] create 2 samples [reqs] destroy 76 samples [reqs] punch 1 samples [reqs] sync 3 samples [reqs] preprw 7 samples [reqs] commitrw 7 samples [reqs] ping 10 samples [reqs]
snapshot_time 1365007768.89292 secs.usecs write_bytes 1 samples [bytes] 50400 50400 50400 get_info 76 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 35 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] sync 1 samples [reqs] preprw 1 samples [reqs] commitrw 1 samples [reqs] ping 11 samples [reqs]
snapshot_time 1365007768.89323 secs.usecs read_bytes 7 samples [bytes] 4096 8192 49152 write_bytes 3 samples [bytes] 1910 6096 9922 get_info 86 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 35 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] punch 1 samples [reqs] sync 4 samples [reqs] preprw 10 samples [reqs] commitrw 10 samples [reqs] ping 12 samples [reqs]
snapshot_time 1365007768.89353 secs.usecs read_bytes 2 samples [bytes] 4096 12288 16384 write_bytes 1 samples [bytes] 12288 12288 12288 get_info 74 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 35 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] sync 1 samples [reqs] preprw 3 samples [reqs] commitrw 3 samples [reqs] ping 12 samples [reqs]
snapshot_time 1365007768.89382 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 17 samples [reqs]
snapshot_time 1365007768.89413 secs.usecs get_info 73 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs]
snapshot_time 1365007768.89447 secs.usecs get_info 72 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 34 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] ping 12 samples [reqs]
snapshot_time 1365007768.89653 secs.usecs get_page 7 samples [usec] 1 11 24 162 cache_access 337 samples [pages] 1 1 337 cache_hit 337 samples [pages] 1 1 337
snapshot_time 1365007768.89681 secs.usecs get_page 1 samples [usec] 1 1 1 1
snapshot_time 1365007768.89697 secs.usecs get_page 10 samples [usec] 0 1 3 3 cache_access 12 samples [pages] 1 1 12 cache_hit 12 samples [pages] 1 1 12
CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats osd-*.lustre-OST*.stats 2>&1
CMD: client-21-ib /usr/sbin/lctl set_param fail_loc=0
fail_loc=0
sanity test_151: @@@@@@ FAIL: NOT IN CACHE: before: 349, after: 349
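For context, the fail_loc lines in this log correspond to the landed protection: the fail point is armed before the read that is expected to hit the cache and cleared again afterwards. A rough sketch of that sequence, assuming 0x609 is the fail point the fix uses to keep the freshly written pages from being evicted mid-test (an interpretation of the log, not verified against the source), with oss-node and /mnt/lustre/f151 as placeholder names:

ssh oss-node lctl set_param fail_loc=0x609      # arm the fail point before the cached re-read
dd if=/mnt/lustre/f151 of=/dev/null bs=4k count=3
ssh oss-node 'lctl get_param -n obdfilter.lustre-OST*.stats osd-*.lustre-OST*.stats'   # "after" sample
ssh oss-node lctl set_param fail_loc=0          # restore normal behaviour afterwards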
It seems we are back to square one.
LU-2848 is the "NOT IN CACHE before: , after: " behavior. I don't think that issue is caused by items being dropped from the cache (that is what the landed fail_loc patch protects against). In my testing for LU-2902 I was unable to reproduce the issue outside of the OOM killer killing ssh between the nodes. There is still another issue out there in the roc_hit family of tests. I have submitted a patch (as part of LU-2902) to log what roc_hit is actually getting from /proc.
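For reference, roc_hit() in the test framework essentially sums the cache_hit counters reported by the OSTs. A rough approximation (not the exact test-framework.sh code; oss-node is a placeholder hostname) of what it gathers:

# Sum cache_hit samples across all OST targets; an empty or partial /proc read
# would silently lower this total, which is what the debug patch aims to expose.
roc_hit_approx() {
    ssh oss-node 'lctl get_param -n obdfilter.lustre-OST*.stats osd-*.lustre-OST*.stats' 2>/dev/null |
        awk '/^cache_hit/ {sum += $2} END {print sum + 0}'
}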
The reason I ask is that this test is still failing several times a week.
It appears that LU-2979 is the root cause of all of these issues. Since the LU-2979 patch landed we have seen no errors. Perhaps http://review.whamcloud.com/5475 should be reverted?