LU-2735: sanity.sh test_151: NOT IN CACHE: before: 337, after: 337

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version: Lustre 2.4.0
    • Fix Version: Lustre 2.4.0
    • Labels: None
    • Severity: 3
    • 6639

    Description

      This issue was created by maloo for Li Wei <liwei@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/b8f32682-6c5f-11e2-91d6-52540035b04c.

      The sub-test test_151 failed with the following error:

      NOT IN CACHE: before: 337, after: 337

      Info required for matching: sanity 151

      == sanity test 151: test cache on oss and controls ================================= 21:28:47 (1359696527)
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
      CMD: client-21-ib /usr/sbin/lctl set_param -n obdfilter.lustre-OST*.writethrough_cache_enable=1 		osd-*.lustre-OST*.writethrough_cache_enable=1 2>&1
      3+0 records in
      3+0 records out
      12288 bytes (12 kB) copied, 0.00445821 s, 2.8 MB/s
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
       sanity test_151: @@@@@@ FAIL: NOT IN CACHE: before: 337, after: 337 
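      For context, the before/after values in the failure message come from the cache_access counters in the OST stats files queried above. Below is a minimal sketch of the check, using the parameter globs from the log; the helper name and file path are illustrative, not the actual sanity.sh code:

      # Stats are read on the OSS; sanity.sh runs these remotely via do_facet.
      cache_access_count() {
          lctl get_param -n obdfilter.*-OST*.stats osd-*.*-OST*.stats 2>/dev/null |
              awk '/^cache_access/ { total += $2 } END { print total + 0 }'
      }

      lctl set_param -n obdfilter.*-OST*.writethrough_cache_enable=1

      # On a client: write a small file through the writethrough cache
      # (the 3x4k dd matches the "12288 bytes" in the log), then read it
      # back; the read should be served from the OSS cache, so the
      # cache_access total must increase.
      dd if=/dev/zero of=/mnt/lustre/f151 bs=4k count=3 conv=fsync
      before=$(cache_access_count)
      cat /mnt/lustre/f151 > /dev/null
      after=$(cache_access_count)

      [ "$after" -gt "$before" ] ||
          echo "FAIL: NOT IN CACHE: before: $before, after: $after"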
      

    Activity

            Keith Mannthey (Inactive) added a comment - There has been no sign of this issue in over a month. The proc changes fixed any remaining errors.

            Keith Mannthey (Inactive) added a comment - I just wanted to be sure we are OK with OBD_FAIL_OBD_NO_LRU and the fail_loc mode change to pass the test.
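            For reference, fail_loc failure-injection points are toggled with lctl. A minimal sketch of the kind of change being discussed, assuming 0x609 is the OBD_FAIL_OBD_NO_LRU value from lustre/include/obd_support.h:

            # On the OSS: keep cached pages off the LRU so they are not
            # reclaimed mid-test and the cache_access counters stay stable.
            lctl set_param fail_loc=0x609    # OBD_FAIL_OBD_NO_LRU (assumed value)

            # ... run the cache check ...

            lctl set_param fail_loc=0        # clear the injection afterwards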

            Andreas Dilger added a comment - It looks like 5475 is just adding a test case, and it is never a bad idea to keep the test case after the bug has been fixed.

            Keith Mannthey (Inactive) added a comment - It appears LU-2979 is the root cause of all of these issues. Since the LU-2979 patch landed we have seen no errors.

            Perhaps http://review.whamcloud.com/5475 should be reverted?

            Keith Mannthey (Inactive) added a comment - From a client IB test failure, a "before" debug output: https://maloo.whamcloud.com/test_logs/c0ac67cc-ad68-11e2-b72d-52540035b04c

            snapshot_time 1366819124.688743 secs.usecs read_bytes 6 samples [bytes] 4096 1048576 1396736 write_bytes 6 samples [bytes] 777 1048576 1909333 get_info 87 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 59 samples [reqs] create 2 samples [reqs] destroy 77 samples [reqs] setattr 2 samples [reqs] punch 4 samples [reqs] sync 6 samples [reqs] preprw 12 samples [reqs] commitrw 12 samples [reqs] ping 41 samples [reqs]
            
            snapshot_time 1366819124.688804 secs.usecs read_bytes 130 samples [bytes] 8192 1048576 134234112 write_bytes 130 samples [bytes] 1916 1048576 134225740 get_info 96 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 79 samples [reqs] setattr 1 samples [reqs] punch 5 samples [reqs] sync 6 samples [reqs] preprw 260 samples [reqs] commitrw 260 samples [reqs] ping 38 samples [reqs]
            
            snapshot_time 1366819124.688838 secs.usecs write_bytes 1 samples [bytes] 50400 50400 50400 get_info 76 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] setattr 1 samples [reqs] sync 1 samples [reqs] preprw 1 samples [reqs] commitrw 1 samples [reqs] ping 44 samples [reqs]
            
            snapshot_time 1366819124.688867 secs.usecs read_bytes 7 samples [bytes] 4096 8192 49152 write_bytes 3 samples [bytes] 1910 6096 9922 get_info 85 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] setattr 1 samples [reqs] punch 1 samples [reqs] sync 4 samples [reqs] preprw 10 samples [reqs] commitrw 10 samples [reqs] ping 44 samples [reqs]
            
            snapshot_time 1366819124.688919 secs.usecs read_bytes 6 samples [bytes] 4096 12288 49152 write_bytes 2 samples [bytes] 3019 12288 15307 get_info 78 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 75 samples [reqs] punch 1 samples [reqs] sync 2 samples [reqs] preprw 8 samples [reqs] commitrw 8 samples [reqs] ping 42 samples [reqs]
            
            snapshot_time 1366819124.688957 secs.usecs read_bytes 3 samples [bytes] 8192 8192 24576 write_bytes 2 samples [bytes] 6096 6096 12192 get_info 83 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 74 samples [reqs] punch 1 samples [reqs] sync 3 samples [reqs] preprw 5 samples [reqs] commitrw 5 samples [reqs] ping 50 samples [reqs]
            
            snapshot_time 1366819124.688991 secs.usecs read_bytes 2 samples [bytes] 8192 8192 16384 write_bytes 3 samples [bytes] 1916 6096 12108 get_info 81 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 58 samples [reqs] create 2 samples [reqs] destroy 75 samples [reqs] punch 1 samples [reqs] sync 4 samples [reqs] preprw 5 samples [reqs] commitrw 5 samples [reqs] ping 49 samples [reqs]
            
            snapshot_time 1366819124.689203 secs.usecs get_page 12 samples [usec] 0 10 37 217 cache_access 341 samples [pages] 1 1 341 cache_hit 339 samples [pages] 1 1 339 cache_miss 2 samples [pages] 1 1 2
            
            snapshot_time 1366819124.689231 secs.usecs get_page 260 samples [usec] 1 234 10228 728602 cache_access 32772 samples [pages] 1 1 32772 cache_hit 16384 samples [pages] 1 1 16384 cache_miss 16388 samples [pages] 1 1 16388
            
            snapshot_time 1366819124.689262 secs.usecs get_page 10 samples [usec] 0 1 4 4 cache_access 12 samples [pages] 1 1 12 cache_hit 12 samples [pages] 1 1 12
            
            snapshot_time 1366819124.689280 secs.usecs get_page 8 samples [usec] 0 3 6 14 cache_access 9 samples [pages] 1 1 9 cache_hit 6 samples [pages] 1 1 6 cache_miss 3 samples [pages] 1 1 3
            
            snapshot_time 1366819124.689311 secs.usecs get_page 5 samples [usec] 0 5 11 43 cache_access 4 samples [pages] 1 1 4 cache_hit 2 samples [pages] 1 1 2 cache_miss 2 samples [pages] 1 1 2
            

            What is important to note is that there are only 5 proc entries reporting cache_access; 2 of the OST /proc stats are missing. Also, when compared to the "after" results, there is no OST that claims the cache_access. This may just be a different form of the "no values read from /proc" problem.

            I will pursue further debug work in LU-2902. The IB clients do seem to help trigger the issue.
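            A quick way to verify the "only 5 entries" observation against output like the above is to count the stats blocks that report cache_access at all; a sketch, assuming the same parameter globs as the test log:

            # The dump above shows 7 obdfilter stats blocks but only 5
            # blocks with cache_access lines, so this should print 5.
            lctl get_param -n obdfilter.*-OST*.stats osd-*.*-OST*.stats 2>/dev/null |
                grep -c '^cache_access'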

            Keith Mannthey (Inactive) added a comment - It is known that the NOT IN CACHE issues can be caused by memory pressure, and the IB systems could be under more memory pressure. Thanks for spotting the IB issue; I have only really been tracking the "review" queue with "master" builds.
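            One way to check the memory-pressure theory on the OSS is to watch the read-cache tunables alongside memory state while the test runs; a sketch, assuming standard obdfilter parameter names:

            # On the OSS: cache tunables that gate what gets cached...
            lctl get_param obdfilter.*.read_cache_enable \
                           obdfilter.*.writethrough_cache_enable \
                           obdfilter.*.readcache_max_filesize
            # ...and the memory state that drives page reclaim.
            grep -E '^(MemFree|Cached)' /proc/meminfo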

            Andreas Dilger added a comment - To compare these numbers, IB has had 7/11 runs fail since 2013-03-31, while TCP has had about 350 passes without hitting this specific failure mode (for "review", though "review-dne" is failing for potentially another reason).

            This might just relate to a race condition based on the speed of the networking, or it might relate to the fact that IB testing is run on separate physical nodes, while TCP testing is run on VMs on a single node.

            Andreas Dilger added a comment - There are no more LU-2735 failures reported for TCP since 2013-03-31 (https://maloo.whamcloud.com/sub_tests/4b048dda-9bd2-11e2-be58-52540035b04c), but there have been 7 failures on client-23-ib since that time.

            Andreas Dilger added a comment - Not sure if this is a spurious observation or not, but it appears that most of these recent failures are on client-23-ib (i.e. running IB networking) instead of TCP networking. At least it appears that there was a decrease in the number of this style of failure when running on IB.

    People

      Assignee: Keith Mannthey (Inactive)
      Reporter: Maloo
      Votes: 0
      Watchers: 10