Lustre / LU-7546

conf-sanity conf-sanity: lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.8.0
    • Fix Version/s: Lustre 2.8.0
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for wangdi <di.wang@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/b123def0-a00b-11e5-ae0a-5254006e85c2.

      The sub-test conf-sanity failed with the following error:

      03:27:23:Lustre: DEBUG MARKER: umount -d -f /mnt/mds1
      03:27:23:LustreError: 10651:0:(client.c:1130:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88006ed629c0 x1520238940666796/t0(0) o13->lustre-OST0001-osc-MDT0000@10.1.4.239@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      03:27:23:LustreError: 10651:0:(client.c:1130:ptlrpc_import_delay_req()) Skipped 7 previous similar messages
      03:27:23:LustreError: 30453:0:(osp_object.c:587:osp_attr_get()) lustre-MDT0001-osp-MDT0000:osp_attr_get update error [0x240000402:0x1:0x0]: rc = -5
      03:27:23:LustreError: 30453:0:(osp_object.c:587:osp_attr_get()) Skipped 5 previous similar messages
      03:27:23:Lustre: lustre-MDT0000: Not available for connect from 10.1.4.239@tcp (stopping)
      03:27:23:Lustre: Skipped 7 previous similar messages
      03:27:23:LustreError: 10651:0:(client.c:1130:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88006ed629c0 x1520238940666828/t0(0) o13->lustre-OST0005-osc-MDT0000@10.1.4.239@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      03:27:23:LustreError: 10651:0:(client.c:1130:ptlrpc_import_delay_req()) Skipped 4 previous similar messages
      03:27:23:Lustre: lustre-MDT0000: Not available for connect from 10.1.4.244@tcp (stopping)
      03:27:23:Lustre: Skipped 2 previous similar messages
      03:27:23:LustreError: 10652:0:(client.c:1130:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880044a39680 x1520238940666836/t0(0) o13->lustre-OST0007-osc-MDT0000@10.1.4.239@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      03:27:23:LustreError: 10636:0:(lod_dev.c:1578:lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed: lu is ffff880070784000
      03:27:23:LustreError: 10636:0:(lod_dev.c:1578:lod_device_free()) LBUG
      

      Please provide additional information about the failure here.

      Info required for matching: conf-sanity conf-sanity
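
      For context, lod_device_free() runs when the LOD device is torn down (here, during the MDT umount) and asserts that the lu_device reference count has already dropped to zero; an object reference taken during request processing and never dropped keeps ld_ref elevated and turns the unmount into the LBUG shown above. The following is a minimal userspace sketch of that pattern, using simplified, hypothetical names rather than the actual lod_dev.c code:

      #include <assert.h>
      #include <stdio.h>

      /* Simplified stand-in for struct lu_device: only the refcount matters here. */
      struct dev {
              int ld_ref;
      };

      static void dev_get(struct dev *d) { d->ld_ref++; }
      static void dev_put(struct dev *d) { d->ld_ref--; }

      /* Mimics a request handler that takes a reference but forgets to drop
       * it on the error path. */
      static int handle_request(struct dev *d, int fail)
      {
              dev_get(d);
              if (fail)
                      return -5;      /* bug: missing dev_put(d) before returning */
              dev_put(d);
              return 0;
      }

      /* Mimics lod_device_free(): the device may only be freed once every
       * reference has been dropped, hence the assertion on the refcount. */
      static void device_free(struct dev *d)
      {
              assert(d->ld_ref == 0); /* trips if a reference leaked, like the LBUG above */
              printf("device freed cleanly\n");
      }

      int main(void)
      {
              struct dev d = { .ld_ref = 0 };

              handle_request(&d, 1);  /* a failing request leaks one reference */
              device_free(&d);        /* assertion fails at "umount" time */
              return 0;
      }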

      Attachments

      Issue Links

      Activity

            [LU-7546] conf-sanity conf-sanity: lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 )
            di.wang Di Wang (Inactive) added a comment - - edited

            Since the cause has been found, let's close this ticket for now; the fix will be included in LU-3538: http://review.whamcloud.com/#/c/12530/

            laisiyao Lai Siyao added a comment -

            Di, thanks, I'll update the patch later.


            di.wang Di Wang (Inactive) added a comment -

            It seems a few mdt_reint_xxx changes did not release the object on the error-handling path, which causes this assertion.

            For example:

            @@ -1926,20 +2038,20 @@ static int mdt_reint_rename_internal(struct mdt_thread_info *info,
            
                            lh_newp = &info->mti_lh[MDT_LH_NEW];
                            mdt_lock_reg_init(lh_newp, LCK_EX);
            -               rc = mdt_object_lock(info, mnew, lh_newp,
            -                                    MDS_INODELOCK_LOOKUP |
            -                                    MDS_INODELOCK_UPDATE);
            +               rc = mdt_reint_object_lock(info, mnew, lh_newp,
            +                                          MDS_INODELOCK_LOOKUP |
            +                                          MDS_INODELOCK_UPDATE,
            +                                          cos_incompat);
                            if (rc != 0)
                                    GOTO(out_unlock_old, rc);
            
                            /* get and save version after locking */
                            mdt_version_get_save(info, mnew, 3);
            -       } else if (rc != -EREMOTE && rc != -ENOENT) {
            -               GOTO(out_put_old, rc);
            +       } else if (rc2 != -EREMOTE && rc2 != -ENOENT) {
            +               GOTO(out_unlock_parents, rc = rc2);   --> this should be out_put_old, instead of out_unlock_parents
                    } else {
                            lh_oldp = &info->mti_lh[MDT_LH_OLD];
                            mdt_lock_reg_init(lh_oldp, LCK_EX);
            -
                            lock_ibits = MDS_INODELOCK_LOOKUP | MDS_INODELOCK_XATTR;
                            if (mdt_object_remote(msrcdir)) {
                                    /* Enqueue lookup lock from the parent MDT */
            

            I will update the patch soon.
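
            As a rough illustration of the cleanup-ordering problem described above: each out_* label is meant to undo only what has already been acquired, unwinding in reverse order, so jumping past the final put leaves the object reference held until umount, where the ld_ref assertion fires. The sketch below is self-contained C with stand-in names (obj_get/obj_put, lock_parents, rename_internal); it is not the actual mdt_reint_rename_internal() code:

            /* Build with: cc -Wall sketch.c */
            #include <stdio.h>

            static int refs;                         /* plays the role of lu->ld_ref */

            static void obj_get(void)        { refs++; }
            static void obj_put(void)        { refs--; }
            static int  lock_parents(void)   { return 0; }
            static void unlock_parents(void) { }
            static int  do_rename(void)      { return 0; }

            static int rename_internal(int fail_lookup)
            {
                    int rc;

                    rc = lock_parents();             /* acquired first */
                    if (rc != 0)
                            return rc;

                    obj_get();                       /* acquired second, dropped at out_put_old */

                    if (fail_lookup) {
                            /* Bug: the correct target is out_put_old; jumping to
                             * out_unlock_parents skips obj_put() and leaks the
                             * reference, which later trips the ld_ref assertion. */
                            rc = -2;
                            goto out_unlock_parents;
                    }

                    rc = do_rename();
                    if (rc != 0)
                            goto out_put_old;        /* correct: unwind in reverse order */

            out_put_old:
                    obj_put();
            out_unlock_parents:
                    unlock_parents();
                    return rc;
            }

            int main(void)
            {
                    rename_internal(1);
                    printf("leaked references: %d\n", refs);   /* prints 1 */
                    return 0;
            }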

            jamesanunez James Nunez (Inactive) added a comment - - edited

            Some of these failures have part of the error message cut off; this has only been seen with LU-3538 patch 12530:

            09:01:15:LustreError: 10624:0:(client.c:1130:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88005b5300c0 x1520711352535976/t0(0) o13->lustre-OST0007-osc-MDT0000@10.1.4.239@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
            09:01:15:LustreError: 10624:0:(client.c:1130:ptlrpc_import_delay_req()) Skipped 1 previous similar message
            09:01:15:LustreError: 10609:0:(lod_dev.c:1578:lod_device_free()) ASSERTIONInitializing cgroup subsys cpuset
            09:01:15:Initializing cgroup subsys cpu
            

            2015-12-16 14:42:37 - sanity-quota test_7c - https://testing.hpdd.intel.com/test_sets/07a4d22e-a427-11e5-8701-5254006e85c2
            2015-12-16 16:00:56 - https://testing.hpdd.intel.com/test_sets/a25754d8-a425-11e5-8701-5254006e85c2

            jamesanunez James Nunez (Inactive) added a comment - - edited

            More instances on master for LU-3538 patch #12530:

            2015-12-15 13:15:27 - https://testing.hpdd.intel.com/test_sets/c683ddec-a360-11e5-9b3d-5254006e85c2
            2015-12-15 16:25:11 - https://testing.hpdd.intel.com/test_sets/3b4c4efc-a36b-11e5-a3ed-5254006e85c2
            2015-12-15 17:28:53 - https://testing.hpdd.intel.com/test_sets/40cf7012-a379-11e5-b94e-5254006e85c2
            2015-12-15 19:13:49 - https://testing.hpdd.intel.com/test_sets/df907784-a381-11e5-867e-5254006e85c2
            2015-12-16 01:45:20 - https://testing.hpdd.intel.com/test_sets/9fcfe704-a3a1-11e5-b94e-5254006e85c2
            laisiyao Lai Siyao added a comment -

            https://testing.hpdd.intel.com/test_sets/7f0570ee-a242-11e5-afd0-5254006e85c2 shows it's a single-MDS test, so I doubt this is CoS related (because on such a system CoS is not enabled/active).


            di.wang Di Wang (Inactive) added a comment -

            Hmm, this assertion happened again in testing of the COS patch:

            https://testing.hpdd.intel.com/test_sets/7f0570ee-a242-11e5-afd0-5254006e85c2

            So this failure might be related to the COS patch.


            People

              Assignee: laisiyao Lai Siyao
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 5
