LU-2735: sanity.sh test_151: NOT IN CACHE: before: 337, after: 337

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.4.0
    • Fix Version/s: Lustre 2.4.0
    • Labels: None
    • Severity: 3
    • Rank: 6639

    Description

      This issue was created by maloo for Li Wei <liwei@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/b8f32682-6c5f-11e2-91d6-52540035b04c.

      The sub-test test_151 failed with the following error:

      NOT IN CACHE: before: 337, after: 337

      Info required for matching: sanity 151

      == sanity test 151: test cache on oss and controls ================================= 21:28:47 (1359696527)
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
      CMD: client-21-ib /usr/sbin/lctl set_param -n obdfilter.lustre-OST*.writethrough_cache_enable=1 		osd-*.lustre-OST*.writethrough_cache_enable=1 2>&1
      3+0 records in
      3+0 records out
      12288 bytes (12 kB) copied, 0.00445821 s, 2.8 MB/s
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
      CMD: client-21-ib /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.stats 		osd-*.lustre-OST*.stats 2>&1
       sanity test_151: @@@@@@ FAIL: NOT IN CACHE: before: 337, after: 337 
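
      For context, the check that produces this message works roughly like the sketch below: write a few pages, drop the client-side locks so a read has to go back to the OSS, and expect the OSS read-cache hit counter to advance by the number of pages read. The helper name, stat name, mount point, and file name here are assumptions for illustration, not a verbatim copy of sanity.sh; as the CMD lines above show, the real test gathers the stats remotely on the OSS node rather than locally.

        # Hedged sketch of a test_151-style check (names and paths are illustrative).
        CPAGES=3                       # pages written (matches the 12288-byte dd above)
        FILE=/mnt/lustre/f151.sanity   # assumed client mount point and file name

        # Sum OSS read-cache hits across all OSTs (assumes a 'cache_hit' stats line).
        roc_hit() {
            lctl get_param -n "obdfilter.*.stats" "osd-*.*.stats" 2>/dev/null |
                awk '/cache_hit/ { sum += $2 } END { print sum + 0 }'
        }

        dd if=/dev/urandom of=$FILE bs=4k count=$CPAGES
        # Drop client-side OSC locks/pages so the next read is served by the OSS.
        lctl set_param -n "ldlm.namespaces.*osc*.lru_size=clear"

        before=$(roc_hit)
        cat $FILE > /dev/null
        after=$(roc_hit)

        # The read is expected to hit the OSS read cache; if the written pages were
        # already evicted (e.g. under memory pressure), the counter does not move.
        if (( after - before != CPAGES )); then
            echo "NOT IN CACHE: before: $before, after: $after"
        fi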
      

          Activity


            adilger Andreas Dilger added a comment -

            This seems closely related to, if not a duplicate of, LU-2848. Can this be closed as a duplicate, or are they separate issues?
            pjones Peter Jones added a comment -

            OK, then I am dropping the priority and reassigning to Keith for the extra work he is proposing. Alternatively, this could have been covered under a new enhancement ticket.

            keith Keith Mannthey (Inactive) added a comment -

            I agree this is a touchy problem, but it is good to test this behavior when the conditions are right.

            A memhog allocation might OOM the box if we are already under enough memory pressure to evict the last small write from the page cache in short order.

            I am going to propose a patch that attempts to detect when we were dropped from the cache under memory pressure. It can't be 100% reliable (without a hard cache_drop proc stat), but it could serve as a guide for when to skip the test.
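
            A guard like the one described above could look roughly like the sketch below (purely illustrative; the actual patch may use different signals): sample the kernel's reclaim counters around the write, and skip the cache check if reclaim ran, since the freshly written pages may already have been evicted. In a real multi-node run the counters would be read on the OSS (e.g. via do_facet), not locally.

              # Illustrative only: counter names come from /proc/vmstat.
              reclaim_events() {
                  awk '/^pgscan_(kswapd|direct)/ { sum += $2 } END { print sum + 0 }' /proc/vmstat
              }

              scans_before=$(reclaim_events)
              # ... run the dd + read + stats comparison from the test here ...
              scans_after=$(reclaim_events)

              if (( scans_after > scans_before )); then
                  echo "memory pressure detected, skipping OSS read-cache check"
              fi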
            green Oleg Drokin added a comment -

            Well, I agree it's sort of a hack to disable cache purging for the test. But what other options do we have? Not many.

            Another one I can think of is to somehow ensure there is a lot of free RAM, so there is no memory pressure (e.g. tail /dev/zero or some other memhog that then terminates and leaves a lot of free RAM behind).
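
            A minimal sketch of that idea, assuming the memhog utility from the numactl package is available on the OSS (the fraction of RAM is illustrative): touch a large allocation once and exit, so the kernel reclaims other cached pages and the test starts with plenty of free memory.

              # Illustrative only: pre-reclaim memory so the test starts with free RAM.
              total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)

              # Touch roughly 75% of RAM once, then exit and leave it free.
              memhog -r1 "$(( total_kb * 3 / 4 ))k" > /dev/null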

            keith Keith Mannthey (Inactive) added a comment -

            Sorry to have to reopen this issue, but I think we need to be really clear about what the situation is.

            The patch creates special Lustre behavior (enabled by setting a fail_loc value) to conform Lustre to the behavior the test expects. I don't expect fail_loc to be used outside of testing.

            I would argue this is not a fix: either the test needs to be changed with the understanding that cache behavior can be non-deterministic, or the cache needs to conform to the expected behavior under all conditions.

            The patch is a great debug point, but I don't think it is a solution to the core issue. I happen to be working on LU-2902, which is basically this issue.

            Also, perhaps this is not a blocking issue, and I will not argue if it is closed again.
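
            For reference, fail_loc is Lustre's standard fault-injection hook and is toggled from a test script with lctl; the value below is a placeholder, the real one being whichever OBD_FAIL code the landed patch defines for this purpose.

              # Illustrative: gate the cache check behind a test-only fail_loc on the OSS.
              lctl set_param fail_loc=0x609   # placeholder value; keeps cached pages from being purged
              # ... run the dd / read / stats comparison here ...
              lctl set_param fail_loc=0       # restore normal cache behavior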

            jlevi Jodi Levi (Inactive) added a comment - Patch landed and confirmed with Yu Jian; this ticket can be closed.

            jlevi Jodi Levi (Inactive) added a comment - Patch landed to master and confirmed with Yu Jian; this ticket can be closed.
            yujian Jian Yu added a comment - Another one: https://maloo.whamcloud.com/test_sets/ec2cde8e-8885-11e2-b643-52540035b04c

            keith Keith Mannthey (Inactive) added a comment - Another instance: 5 out of the last 100 runs are reported as failed. https://maloo.whamcloud.com/test_sets/1a87288c-85f4-11e2-9f8d-52540035b04c
            yong.fan nasf (Inactive) added a comment - Another failure instance: https://maloo.whamcloud.com/test_sets/91aa7cb6-7ad7-11e2-b916-52540035b04c
            hongchao.zhang Hongchao Zhang added a comment - the patch is tracked at http://review.whamcloud.com/#change,5475

            People

              Assignee: keith Keith Mannthey (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 10
