Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4970

Test failure sanity-lfsck test_14: ls should success

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.7.0
    • Lustre 2.6.0
    • None
    • 3
    • 13760

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run:
      http://maloo.whamcloud.com/test_sets/33c0eb0c-cd6b-11e3-b548-52540035b04c
      https://maloo.whamcloud.com/test_sets/1ca115d8-bb68-11e3-8ec1-52540035b04c
      https://maloo.whamcloud.com/test_sets/beed7496-c0a2-11e3-b5ea-52540035b04c

      The sub-test test_14 failed with the following error:

      'ls' should success after layout LFSCK repairing
      ls: cannot access /mnt/lustre/d14.sanity-lfsck/f46: Cannot allocate memory
      ls: cannot access /mnt/lustre/d14.sanity-lfsck/f32: Cannot allocate memory
      sanity-lfsck test_14: @@@@@@ FAIL: (5) ls should success.

      Info required for matching: sanity-lfsck 14

      Attachments

        Issue Links

          Activity

            [LU-4970] Test failure sanity-lfsck test_14: ls should success
            utopiabound Nathaniel Clark added a comment - This seems to have shown up on review-dne-part-2: https://testing.hpdd.intel.com/test_sets/3db646c0-1433-11e4-88ed-5254006e85c2

            This has been EXCEPT'd for 2.6

            utopiabound Nathaniel Clark added a comment - This has been EXCEPT'd for 2.6

            OST debug_log:

            (ofd_objects.c:634:ofd_attr_get()) Process entered
            (ofd_objects.c:659:ofd_attr_get()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
            (service.c:1068:ptlrpc_update_export_timer()) Process entered
            (service.c:1081:ptlrpc_update_export_timer()) Process leaving
            (ofd_lvb.c:98:ofd_lvbo_init()) Process leaving via out_obj (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)
            (lustre_fid.h:712:fid_flatten32()) Process leaving (rc=4278190179 : 4278190179 : ff000063)
            (lu_object.c:1643:keys_init()) kmalloced 'ctx->lc_value': 320 at ffff88006f9e9000.
            (lu_object.c:429:lu_global_key_fini()) kfreed 'info': 512 at ffff88006f9e2e00.
            (capa.c:416:lu_capainfo_key_init()) kmalloced 'value': 128 at ffff880071527ac0.
            (dt_object.c:56:dt_global_key_fini()) kfreed 'info': 1216 at ffff88006f4ef800.
            (lu_ucred.c:53:lu_ucred_key_init()) kmalloced 'value': 72 at ffff880072641cc0.
            (tgt_main.c:163:tgt_key_fini()) kfreed 'info': 1152 at ffff88006bace000.
            (fld_handler.c:60:fld_key_fini()) kfreed 'info': 72 at ffff88006fdac7c0.
            (tgt_main.c:187:tgt_ses_key_init()) kmalloced 'value': 144 at ffff880071461bc0.
            (fid_handler.c:406:seq_key_fini()) kfreed 'info': 48 at ffff88006e3b4800.
            (service.c:1678:ptlrpc_server_request_add()) Process entered
            (service.c:1579:ptlrpc_server_hpreq_init()) Process entered
            (lquota_lib.c:53:lquota_key_fini()) kfreed 'info': 240 at ffff880072539cc0.
            (qsd_lib.c:71:qsd_key_fini()) kfreed 'info': 376 at ffff88006e688a00.
            (service.c:1630:ptlrpc_server_hpreq_init()) Process leaving (rc=0 : 0 : 0)
            (service.c:1693:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
            (lfsck_lib.c:55:lfsck_key_fini()) kfreed 'info': 1144 at ffff8800723f7000.
            (ofd_dev.c:2332:ofd_key_fini()) kfreed 'info': 656 at ffff880071e1e000.
            (service.c:2006:ptlrpc_server_handle_req_in()) Process leaving (rc=1 : 1 : 1)
            (lu_object.c:1593:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff880071a58c00.
            (ldlm_resource.c:1161:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x163:0x0: rc = -2
            

            So ENOENT is the correct error that should be being passed up.

            utopiabound Nathaniel Clark added a comment - OST debug_log: (ofd_objects.c:634:ofd_attr_get()) Process entered (ofd_objects.c:659:ofd_attr_get()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe) (service.c:1068:ptlrpc_update_export_timer()) Process entered (service.c:1081:ptlrpc_update_export_timer()) Process leaving (ofd_lvb.c:98:ofd_lvbo_init()) Process leaving via out_obj (rc=18446744073709551614 : -2 : 0xfffffffffffffffe) (lustre_fid.h:712:fid_flatten32()) Process leaving (rc=4278190179 : 4278190179 : ff000063) (lu_object.c:1643:keys_init()) kmalloced 'ctx->lc_value': 320 at ffff88006f9e9000. (lu_object.c:429:lu_global_key_fini()) kfreed 'info': 512 at ffff88006f9e2e00. (capa.c:416:lu_capainfo_key_init()) kmalloced 'value': 128 at ffff880071527ac0. (dt_object.c:56:dt_global_key_fini()) kfreed 'info': 1216 at ffff88006f4ef800. (lu_ucred.c:53:lu_ucred_key_init()) kmalloced 'value': 72 at ffff880072641cc0. (tgt_main.c:163:tgt_key_fini()) kfreed 'info': 1152 at ffff88006bace000. (fld_handler.c:60:fld_key_fini()) kfreed 'info': 72 at ffff88006fdac7c0. (tgt_main.c:187:tgt_ses_key_init()) kmalloced 'value': 144 at ffff880071461bc0. (fid_handler.c:406:seq_key_fini()) kfreed 'info': 48 at ffff88006e3b4800. (service.c:1678:ptlrpc_server_request_add()) Process entered (service.c:1579:ptlrpc_server_hpreq_init()) Process entered (lquota_lib.c:53:lquota_key_fini()) kfreed 'info': 240 at ffff880072539cc0. (qsd_lib.c:71:qsd_key_fini()) kfreed 'info': 376 at ffff88006e688a00. (service.c:1630:ptlrpc_server_hpreq_init()) Process leaving (rc=0 : 0 : 0) (service.c:1693:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0) (lfsck_lib.c:55:lfsck_key_fini()) kfreed 'info': 1144 at ffff8800723f7000. (ofd_dev.c:2332:ofd_key_fini()) kfreed 'info': 656 at ffff880071e1e000. (service.c:2006:ptlrpc_server_handle_req_in()) Process leaving (rc=1 : 1 : 1) (lu_object.c:1593:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff880071a58c00. (ldlm_resource.c:1161:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x163:0x0: rc = -2 So ENOENT is the correct error that should be being passed up.
            utopiabound Nathaniel Clark added a comment - Failures with full debugging turned on: https://maloo.whamcloud.com/test_sets/8d617942-cffa-11e3-9aa7-52540035b04c https://maloo.whamcloud.com/test_sets/c14112a2-e7d5-11e3-8bff-52540035b04c

            It appears that ls: cannot allocate memory is a bogus message caused by lvbo_init returning NULL if the object cannot be found, and NULL being interpreted by the higher layers as -ENOMEM. This would be fixed by the patch http://review.whamcloud.com/9004 that returns the proper error in this case (-ENOENT). There would still a test failure, but at least it would be more understandable.

            adilger Andreas Dilger added a comment - It appears that ls: cannot allocate memory is a bogus message caused by lvbo_init returning NULL if the object cannot be found, and NULL being interpreted by the higher layers as -ENOMEM. This would be fixed by the patch http://review.whamcloud.com/9004 that returns the proper error in this case (-ENOENT). There would still a test failure, but at least it would be more understandable.
            utopiabound Nathaniel Clark added a comment - EXCEPT test for zfs: http://review.whamcloud.com/10473
            utopiabound Nathaniel Clark added a comment - Turn up debugging: http://review.whamcloud.com/10472

            I see there is still the "ls" failure in the test logs:

            ls: cannot access /mnt/lustre/d14.sanity-lfsck/f46: Cannot allocate memory
            

            but nothing in the Lustre debug logs, since they are only running with the default debug level.

            It probably makes sense to bump the debug level to -1 for test_14, and make the ls (9) failure an error_ignore() for the short term. That would allow debugging the problem more easily than just disabling the test.

            adilger Andreas Dilger added a comment - I see there is still the "ls" failure in the test logs: ls: cannot access /mnt/lustre/d14.sanity-lfsck/f46: Cannot allocate memory but nothing in the Lustre debug logs, since they are only running with the default debug level. It probably makes sense to bump the debug level to -1 for test_14, and make the ls (9) failure an error_ignore() for the short term. That would allow debugging the problem more easily than just disabling the test.
            utopiabound Nathaniel Clark added a comment - - edited

            This bug doesn't date back much prior to Apr 1, 2014

            The earliest commit I can associate a failure to is:
            588a29b Wed Mar 26 00:14:46 2014 +0000 LU-4462 mdt: don't apply mdt_object_fid() to ERR_PTRs

            but this is the root of long patch chains. The earliest patch I can associate a direct failure to:
            a38ac90 Tue Apr 1 16:11:43 2014 +0000 LU-4805 lmv: lookup remote migrating object in LMV

            I don't have a suspect patch but it appears somewhere in here an issue was introduced.

            utopiabound Nathaniel Clark added a comment - - edited This bug doesn't date back much prior to Apr 1, 2014 The earliest commit I can associate a failure to is: 588a29b Wed Mar 26 00:14:46 2014 +0000 LU-4462 mdt: don't apply mdt_object_fid() to ERR_PTRs but this is the root of long patch chains. The earliest patch I can associate a direct failure to: a38ac90 Tue Apr 1 16:11:43 2014 +0000 LU-4805 lmv: lookup remote migrating object in LMV I don't have a suspect patch but it appears somewhere in here an issue was introduced.

            I don't see this on LFSCK

            OST debug_log:

            2306:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue0()) ### server-side enqueue handler START
            2305:0:(ldlm_resource.c:1154:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x171:0x0: rc = -2
            2306:0:(ldlm_resource.c:1154:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x180:0x0: rc = -2
            2305:0:(ldlm_lockd.c:1431:ldlm_handle_enqueue0()) ### server-side enqueue handler END (lock (null), rc -12)
            
            utopiabound Nathaniel Clark added a comment - I don't see this on LFSCK OST debug_log: 2306:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue0()) ### server-side enqueue handler START 2305:0:(ldlm_resource.c:1154:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x171:0x0: rc = -2 2306:0:(ldlm_resource.c:1154:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x180:0x0: rc = -2 2305:0:(ldlm_lockd.c:1431:ldlm_handle_enqueue0()) ### server-side enqueue handler END (lock (null), rc -12)

            Nathaniel, it looks like there is a memory allocation problem during the running of test_14 on all three failures you reported.

            Could you please take a look to see if this is something specific to ZFS, or is it related to LFSCK?

            adilger Andreas Dilger added a comment - Nathaniel, it looks like there is a memory allocation problem during the running of test_14 on all three failures you reported. Could you please take a look to see if this is something specific to ZFS, or is it related to LFSCK?

            People

              yong.fan nasf (Inactive)
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: