[LU-7546] conf-sanity conf-sanity: lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) Created: 11/Dec/15  Updated: 27/Oct/17  Resolved: 18/Dec/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7184 (lod_dev.c:1493:lod_device_free()) AS... Resolved
is related to LU-3538 commit on share for cross-MDT operation. Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/b123def0-a00b-11e5-ae0a-5254006e85c2.

The sub-test conf-sanity failed with the following error:

03:27:23:Lustre: DEBUG MARKER: umount -d -f /mnt/mds1
03:27:23:LustreError: 10651:0:(client.c:1130:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88006ed629c0 x1520238940666796/t0(0) o13->lustre-OST0001-osc-MDT0000@10.1.4.239@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
03:27:23:LustreError: 10651:0:(client.c:1130:ptlrpc_import_delay_req()) Skipped 7 previous similar messages
03:27:23:LustreError: 30453:0:(osp_object.c:587:osp_attr_get()) lustre-MDT0001-osp-MDT0000:osp_attr_get update error [0x240000402:0x1:0x0]: rc = -5
03:27:23:LustreError: 30453:0:(osp_object.c:587:osp_attr_get()) Skipped 5 previous similar messages
03:27:23:Lustre: lustre-MDT0000: Not available for connect from 10.1.4.239@tcp (stopping)
03:27:23:Lustre: Skipped 7 previous similar messages
03:27:23:LustreError: 10651:0:(client.c:1130:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88006ed629c0 x1520238940666828/t0(0) o13->lustre-OST0005-osc-MDT0000@10.1.4.239@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
03:27:23:LustreError: 10651:0:(client.c:1130:ptlrpc_import_delay_req()) Skipped 4 previous similar messages
03:27:23:Lustre: lustre-MDT0000: Not available for connect from 10.1.4.244@tcp (stopping)
03:27:23:Lustre: Skipped 2 previous similar messages
03:27:23:LustreError: 10652:0:(client.c:1130:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880044a39680 x1520238940666836/t0(0) o13->lustre-OST0007-osc-MDT0000@10.1.4.239@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
03:27:23:LustreError: 10636:0:(lod_dev.c:1578:lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed: lu is ffff880070784000
03:27:23:LustreError: 10636:0:(lod_dev.c:1578:lod_device_free()) LBUG


Info required for matching: conf-sanity conf-sanity



 Comments   
Comment by Di Wang [ 14/Dec/15 ]

Hmm, this assertion happened again while testing the CoS patch:

https://testing.hpdd.intel.com/test_sets/7f0570ee-a242-11e5-afd0-5254006e85c2

So this failure might be related to the CoS patch.

Comment by Lai Siyao [ 16/Dec/15 ]

https://testing.hpdd.intel.com/test_sets/7f0570ee-a242-11e5-afd0-5254006e85c2 shows it's a single-MDS test, so I doubt this is CoS related (because CoS is not enabled/active on such a system).

Comment by James Nunez (Inactive) [ 16/Dec/15 ]

More instances on master for LU-3538 patch #12530:
2015-12-15 13:15:27 - https://testing.hpdd.intel.com/test_sets/c683ddec-a360-11e5-9b3d-5254006e85c2
2015-12-15 16:25:11 - https://testing.hpdd.intel.com/test_sets/3b4c4efc-a36b-11e5-a3ed-5254006e85c2
2015-12-15 17:28:53 - https://testing.hpdd.intel.com/test_sets/40cf7012-a379-11e5-b94e-5254006e85c2
2015-12-15 19:13:49 - https://testing.hpdd.intel.com/test_sets/df907784-a381-11e5-867e-5254006e85c2
2015-12-16 01:45:20 - https://testing.hpdd.intel.com/test_sets/9fcfe704-a3a1-11e5-b94e-5254006e85c2

Comment by James Nunez (Inactive) [ 17/Dec/15 ]

Some of these failures have part of the error message cut off. Only seen for LU-3538 patch 12530:

09:01:15:LustreError: 10624:0:(client.c:1130:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88005b5300c0 x1520711352535976/t0(0) o13->lustre-OST0007-osc-MDT0000@10.1.4.239@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
09:01:15:LustreError: 10624:0:(client.c:1130:ptlrpc_import_delay_req()) Skipped 1 previous similar message
09:01:15:LustreError: 10609:0:(lod_dev.c:1578:lod_device_free()) ASSERTIONInitializing cgroup subsys cpuset
09:01:15:Initializing cgroup subsys cpu

2015-12-16 14:42:37 - sanity-quota test_7c - https://testing.hpdd.intel.com/test_sets/07a4d22e-a427-11e5-8701-5254006e85c2
2015-12-16 16:00:56 - https://testing.hpdd.intel.com/test_sets/a25754d8-a425-11e5-8701-5254006e85c2

Comment by Di Wang [ 17/Dec/15 ]

It seems a few of the mdt_reint_xxx changes do not release the object in the error handling path, which causes this assertion.

For example:

@@ -1926,20 +2038,20 @@ static int mdt_reint_rename_internal(struct mdt_thread_info *info,

                lh_newp = &info->mti_lh[MDT_LH_NEW];
                mdt_lock_reg_init(lh_newp, LCK_EX);
-               rc = mdt_object_lock(info, mnew, lh_newp,
-                                    MDS_INODELOCK_LOOKUP |
-                                    MDS_INODELOCK_UPDATE);
+               rc = mdt_reint_object_lock(info, mnew, lh_newp,
+                                          MDS_INODELOCK_LOOKUP |
+                                          MDS_INODELOCK_UPDATE,
+                                          cos_incompat);
                if (rc != 0)
                        GOTO(out_unlock_old, rc);

                /* get and save version after locking */
                mdt_version_get_save(info, mnew, 3);
-       } else if (rc != -EREMOTE && rc != -ENOENT) {
-               GOTO(out_put_old, rc);
+       } else if (rc2 != -EREMOTE && rc2 != -ENOENT) {
+               GOTO(out_unlock_parents, rc = rc2);   --> this should be out_put_old, instead of out_unlock_parents
        } else {
                lh_oldp = &info->mti_lh[MDT_LH_OLD];
                mdt_lock_reg_init(lh_oldp, LCK_EX);
-
                lock_ibits = MDS_INODELOCK_LOOKUP | MDS_INODELOCK_XATTR;
                if (mdt_object_remote(msrcdir)) {
                        /* Enqueue lookup lock from the parent MDT */

I will update the patch soon.
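
To illustrate the class of bug described above, here is a minimal standalone sketch of a goto-based cleanup ladder in which an error path jumps to the wrong label and leaves a reference held. It is not Lustre code: `struct toy_obj`, `toy_get()`, `toy_put()`, `toy_rename()` and the `use_buggy_label` switch are all hypothetical names invented for the illustration, and plain integer counters stand in for the real object and lock references.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a refcounted object; not Lustre's lu_object. */
struct toy_obj {
	int refs;
};

static void toy_get(struct toy_obj *o)
{
	o->refs++;
}

static void toy_put(struct toy_obj *o)
{
	assert(o->refs > 0);	/* dropping a reference we never took is a bug */
	o->refs--;
}

/*
 * Toy rename path. lookup_rc simulates the result of looking up the target
 * name. -EREMOTE and -ENOENT are treated as non-fatal, as in the diff above;
 * any other non-zero value is returned after cleanup. use_buggy_label
 * selects the wrong cleanup label on that fatal error path.
 */
static int toy_rename(struct toy_obj *parent, struct toy_obj *old,
		      int lookup_rc, int use_buggy_label)
{
	int rc = 0;

	toy_get(parent);	/* stands in for locking the parents */
	toy_get(old);		/* reference held on the source object */

	if (lookup_rc != 0 && lookup_rc != -EREMOTE && lookup_rc != -ENOENT) {
		rc = lookup_rc;
		if (use_buggy_label)
			goto out_unlock_parents;	/* skips toy_put(old) */
		goto out_put_old;
	}

	/* ... the actual rename work would happen here ... */

out_put_old:
	toy_put(old);
out_unlock_parents:
	toy_put(parent);
	return rc;
}
```

With the correct label, both counters return to zero after a failed call. With the wrong label, `old` keeps one reference that nothing will ever drop, which is the kind of leftover reference that a final "refcount must be zero" check at teardown would catch.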

Comment by Lai Siyao [ 18/Dec/15 ]

Di, thanks, I'll update the patch later.

Comment by Di Wang [ 18/Dec/15 ]

Since the cause has been found, let's close this ticket for now; the fix will be included in the LU-3538 patch: http://review.whamcloud.com/#/c/12530/
