Test failure sanity-lfsck test_14: ls should success

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.7.0
    • Lustre 2.6.0
    • None
    • 3
    • 13760

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run:
      http://maloo.whamcloud.com/test_sets/33c0eb0c-cd6b-11e3-b548-52540035b04c
      https://maloo.whamcloud.com/test_sets/1ca115d8-bb68-11e3-8ec1-52540035b04c
      https://maloo.whamcloud.com/test_sets/beed7496-c0a2-11e3-b5ea-52540035b04c

      The sub-test test_14 failed with the following error:

      'ls' should success after layout LFSCK repairing
      ls: cannot access /mnt/lustre/d14.sanity-lfsck/f46: Cannot allocate memory
      ls: cannot access /mnt/lustre/d14.sanity-lfsck/f32: Cannot allocate memory
      sanity-lfsck test_14: @@@@@@ FAIL: (5) ls should success.

      Info required for matching: sanity-lfsck 14

          Activity

            [LU-4970] Test failure sanity-lfsck test_14: ls should success
            utopiabound Nathaniel Clark added a comment - EXCEPT test for zfs: http://review.whamcloud.com/10473
            utopiabound Nathaniel Clark added a comment - Turn up debugging: http://review.whamcloud.com/10472

            adilger Andreas Dilger added a comment - I see there is still the "ls" failure in the test logs:

            ls: cannot access /mnt/lustre/d14.sanity-lfsck/f46: Cannot allocate memory

            but nothing in the Lustre debug logs, since they are only running with the default debug level.

            It probably makes sense to bump the debug level to -1 for test_14, and make the ls (9) failure an error_ignore() for the short term. That would allow debugging the problem more easily than just disabling the test.
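
The suggestion above (full debugging plus a non-fatal failure) could look roughly like the following. This is a minimal sketch, not the patch that was landed: `check_with_full_debug` is a hypothetical helper, the `error_ignore` stand-in assumes the test-framework.sh form of "bug id, then message", and `LCTL` falls back to a dry-run `echo` so the snippet can run outside a Lustre test node.

```bash
#!/bin/bash
# Sketch only: run a check with all Lustre debug flags enabled and turn a
# failure into an ignored error instead of aborting the test.

# Dry-run stand-in for lctl (assumption); the test framework normally sets LCTL.
LCTL=${LCTL:-"echo lctl"}

# Stand-in for test-framework.sh's error_ignore, used only when it is not
# already defined. Assumed form: bug id first, then the message.
command -v error_ignore >/dev/null 2>&1 || error_ignore() {
    bug=$1
    shift
    echo "IGNORE ($bug): $*"
}

# Hypothetical helper: check_with_full_debug <bug id> <message> <command...>
check_with_full_debug() {
    bug_id=$1
    fail_msg=$2
    shift 2
    $LCTL set_param debug=-1              # -1 enables every debug flag
    if ! "$@"; then
        error_ignore "$bug_id" "$fail_msg"   # report the failure, keep going
    fi
}
```

Inside test_14 this might be called as `check_with_full_debug LU-4970 "ls should success" ls -l $DIR/$tdir`, so that the debug logs of a failing run contain more than the default level provides.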
            utopiabound Nathaniel Clark added a comment - edited

            This bug doesn't date back much before Apr 1, 2014.

            The earliest commit I can associate a failure with is:
            588a29b Wed Mar 26 00:14:46 2014 +0000 LU-4462 mdt: don't apply mdt_object_fid() to ERR_PTRs

            but this is the root of long patch chains. The earliest patch I can associate a direct failure with is:
            a38ac90 Tue Apr 1 16:11:43 2014 +0000 LU-4805 lmv: lookup remote migrating object in LMV

            I don't have a suspect patch, but it appears an issue was introduced somewhere in this range.

            utopiabound Nathaniel Clark added a comment - I don't see this on LFSCK

            OST debug_log:

            2306:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue0()) ### server-side enqueue handler START
            2305:0:(ldlm_resource.c:1154:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x171:0x0: rc = -2
            2306:0:(ldlm_resource.c:1154:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x180:0x0: rc = -2
            2305:0:(ldlm_lockd.c:1431:ldlm_handle_enqueue0()) ### server-side enqueue handler END (lock (null), rc -12)
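
For reading the return codes in the OST log above: on Linux, errno 2 is ENOENT and errno 12 is ENOMEM, so the `rc -12` at the end of the enqueue handler matches the "Cannot allocate memory" text that `ls` prints on the client, while the `lvbo_init` failures report a missing object. The sketch below extracts and names those codes; `rc_name` and `decode_rcs` are hypothetical helper names, and the lookup covers only the two values seen in this ticket.

```bash
#!/bin/bash
# Sketch: pull "rc = N" / "rc N" values out of Lustre debug-log lines and
# name the Linux errno values that appear in this ticket.

# Map a negative return code to its Linux errno name (only the codes seen here).
rc_name() {
    case "$1" in
    -2)  echo "ENOENT (No such file or directory)" ;;
    -12) echo "ENOMEM (Cannot allocate memory)" ;;
    *)   echo "unknown" ;;
    esac
}

# Read log lines on stdin and print "<rc> <name>" for each return code found.
decode_rcs() {
    grep -oE 'rc( =)? -?[0-9]+' | grep -oE -- '-?[0-9]+$' |
    while read -r rc; do
        echo "$rc $(rc_name "$rc")"
    done
}
```

Saving the log excerpt to a file and running `decode_rcs < ost_debug.log` (the file name is just an example) lists each code next to its name, which makes the ENOENT followed by ENOMEM sequence easy to spot.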

            adilger Andreas Dilger added a comment - Nathaniel, it looks like there is a memory allocation problem during the running of test_14 on all three failures you reported.

            Could you please take a look to see if this is something specific to ZFS, or is it related to LFSCK?

            People

              yong.fan nasf (Inactive)
              maloo Maloo