sanity test_156: NOT IN CACHE: before: , after:

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.4.1, Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.0
    • Severity: 3
    • Rank (Obsolete): 6990

    Description

      This issue was created by maloo for Oleg Drokin <green@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/406900a6-84d3-11e2-9ab1-52540035b04c.

      The sub-test test_156 failed with the following error:

      NOT IN CACHE: before: 16741, after: 16741

      This seems to have an astounding 21% failure rate, and nobody has filed a ticket for it yet.
      It might be related to the older LU-2009, which apparently was never really investigated.
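
To make the two failure signatures in this ticket concrete (identical counters in the error above, empty values in the ticket title), here is a minimal Python sketch of a before/after counter check. It is not the code from sanity.sh: the function name, the `expected_new_hits` parameter, and the "MISSING STATS" label are hypothetical, and the exact comparison the real test performs is an assumption.

```python
def check_cache_hits(before: str, after: str, expected_new_hits: int) -> str:
    """Classify one before/after pair of cache-hit counter readings.

    Readings are passed as strings because a counter that is not exposed
    would come back as an empty reading rather than as a number.
    """
    # Hypothetical label: an empty reading means the counter was not visible.
    if not before.strip() or not after.strip():
        return "MISSING STATS"
    gained = int(after) - int(before)
    # Assumption: a cached read should advance the counter by a known amount.
    if gained != expected_new_hits:
        return f"NOT IN CACHE: before: {before}, after: {after}"
    return "OK"


if __name__ == "__main__":
    # Values taken from the error reported in this ticket.
    print(check_cache_hits("16741", "16741", 1))
```

Separating the empty-reading case from the unchanged-counter case is a design choice of this sketch; it keeps a stats visibility problem from being reported as a cache problem.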

      Info required for matching: sanity 156

          Activity

            [LU-2902] sanity test_156: NOT IN CACHE: before: , after:

            keith Keith Mannthey (Inactive) added a comment -

            The improved test patch has landed. There is now a check to make sure all the lproc stats are present and working as expected.

            Please reopen if more work is needed.

            keith Keith Mannthey (Inactive) added a comment -

            OK, I decided it would be best to have a decent check in roc_hit to make sure this does not happen again.

            Please see:
            http://review.whamcloud.com/6564

            It is a patch that checks that we are pulling information from the correct number of OSTs. It also reverts the v1 debug patch, as it is not needed.
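
The idea of checking that stats come from the correct number of OSTs can be sketched as follows. This is an illustration in Python, not the contents of the patch above: the function names, the dictionary input, and the assumed stats line layout (a `cache_hit` label followed by an integer count) are all hypothetical.

```python
from typing import Dict, Optional


def parse_cache_hit(stats_text: str) -> Optional[int]:
    """Return the integer following a 'cache_hit' label, or None if absent."""
    for line in stats_text.splitlines():
        fields = line.split()
        # Assumed layout: "cache_hit <count> ..." on a line of its own.
        if len(fields) >= 2 and fields[0] == "cache_hit" and fields[1].isdigit():
            return int(fields[1])
    return None


def total_cache_hits(stats_by_ost: Dict[str, str], expected_osts: int) -> int:
    """Sum cache_hit over all OSTs, refusing to answer if any OST is silent."""
    values = {ost: parse_cache_hit(text) for ost, text in stats_by_ost.items()}
    reporting = [ost for ost, value in values.items() if value is not None]
    if len(reporting) != expected_osts:
        # Fail loudly instead of returning a sum built from partial data.
        raise RuntimeError(
            f"cache_hit found on {len(reporting)} OSTs, expected {expected_osts}"
        )
    return sum(values[ost] for ost in reporting)
```

Raising on a count mismatch, rather than summing whatever is available, means a missing stats entry shows up as its own error instead of as a misleading cache-hit total.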

            keith Keith Mannthey (Inactive) added a comment -

            There are still no errors to report with the "roc_hit" tests (sanity 151 and 156). It is pretty clear to me that LU-2979 is the root cause of the errors that were seen. The lproc entries were just not being exposed to userspace, and there is no sign of any cache issues in autotest.

            I would like to move to close these issues.

            Perhaps http://review.whamcloud.com/5648 should be revoked? It is not hugely invasive, but it does not need to be permanent debug code in the tree.

            keith Keith Mannthey (Inactive) added a comment -

            OK, one week after the LU-2979 patch landed, there are no failures of sanity 151 or 156. This is good news.

            keith Keith Mannthey (Inactive) added a comment -

            I did a quick search in Maloo today. I didn't see any failures in the last 2 days. I will check again next week.

            sarah Sarah Liu added a comment -

            Hit this issue when running interop between a 2.3.0 client and 2.4-tag-2.3.65:
            https://maloo.whamcloud.com/test_sets/e89c3aac-bbee-11e2-b013-52540035b04c

            jhammond John Hammond added a comment -

            The cases with missing values are probably occurrences of LU-2979.

            keith Keith Mannthey (Inactive) added a comment -

            I have started a debug run of sorts here: http://review.whamcloud.com/6006

            keith Keith Mannthey (Inactive) added a comment -

            LU-3094 does not seem to be the root issue for this problem.

            In summary, there is no sign of the /proc entries we are looking for. The kernel is not reporting values in this area.

            I have not been able to reproduce this outside of autotest. Other Lustre things in /proc seem to be working, so I am guessing that for some reason this chunk fails to initialize, rather than lproc collapsing as a whole?

            I am starting to work on a debug patch for Lustre to help identify what might be happening.

            sarah Sarah Liu added a comment -

            Also seen after upgrading from 1.8.9 to 2.4 and then adding one new MDT:

            https://maloo.whamcloud.com/test_sets/a02cc9b2-9ec5-11e2-975f-52540035b04c

            People

              Assignee: keith Keith Mannthey (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 12
