[LU-3765] 2.5.0<->2.1.5 interop: sanity test 24u: (mdt_handler.c:224:mdt_lock_pdo_init()) ASSERTION( namelen > 0 ) failed

    Description

      sanity test 24u hit the following failure on MDS:

      11:06:31:Lustre: DEBUG MARKER: == sanity test 24u: create stripe file == 11:06:31 (1376417191)
      11:06:31:LustreError: 13255:0:(mdt_handler.c:224:mdt_lock_pdo_init()) ASSERTION( namelen > 0 ) failed: 
      11:06:31:LustreError: 13255:0:(mdt_handler.c:224:mdt_lock_pdo_init()) LBUG
      11:06:31:Pid: 13255, comm: mdt_01
      11:06:31:
      11:06:31:Call Trace:
      11:06:31: [<ffffffffa04d0785>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      11:06:31: [<ffffffffa04d0d97>] lbug_with_loc+0x47/0xb0 [libcfs]
      11:06:31: [<ffffffffa0bdea65>] mdt_lock_pdo_init+0xe5/0xf0 [mdt]
      11:06:31: [<ffffffffa0c127c6>] mdt_reint_open+0x1f6/0x2940 [mdt]
      11:06:31: [<ffffffffa077b764>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
      11:06:32: [<ffffffffa0ba256e>] ? md_ucred+0x1e/0x60 [mdd]
      11:06:32: [<ffffffffa0be15d5>] ? mdt_ucred+0x15/0x20 [mdt]
      11:06:32: [<ffffffffa0bf84ec>] ? mdt_root_squash+0x2c/0x3e0 [mdt]
      11:06:32: [<ffffffffa0bfcc51>] mdt_reint_rec+0x41/0xe0 [mdt]
      11:06:32: [<ffffffffa0bf3ed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
      11:06:32: [<ffffffffa0bf453d>] mdt_intent_reint+0x1ed/0x500 [mdt]
      11:06:32: [<ffffffffa0bf2c09>] mdt_intent_policy+0x379/0x690 [mdt]
      11:06:32: [<ffffffffa0737391>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
      11:06:32: [<ffffffffa075d1ed>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
      11:06:32: [<ffffffffa0bf3586>] mdt_enqueue+0x46/0x130 [mdt]
      11:06:32: [<ffffffffa0be8772>] mdt_handle_common+0x932/0x1750 [mdt]
      11:06:32: [<ffffffffa0be9665>] mdt_regular_handle+0x15/0x20 [mdt]
      11:06:32: [<ffffffffa078bbae>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
      11:06:32: [<ffffffffa078af60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
      11:06:32: [<ffffffff8100c0ca>] child_rip+0xa/0x20
      11:06:32: [<ffffffffa078af60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
      11:06:32: [<ffffffffa078af60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
      11:06:32: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      11:06:32:
      11:06:32:Kernel panic - not syncing: LBUG
      

      Maloo report: https://maloo.whamcloud.com/test_sets/e3f3b3d8-0525-11e3-8d88-52540035b04c

      More instances:
      https://maloo.whamcloud.com/test_sets/369e054c-0059-11e3-bb00-52540035b04c
      https://maloo.whamcloud.com/test_sets/0bf3fdbc-f8f5-11e2-8917-52540035b04c
      https://maloo.whamcloud.com/test_sets/59b2a818-f504-11e2-a8f6-52540035b04c


          Activity


            paf Patrick Farrell (Inactive) added a comment -

            Andreas: Thanks for the reminder.

            Reading it again, I see my comment above is flawed. The change in newer kernels affects all anonymous dentries; I wasn't thinking clearly when I wrote that. We don't need to identify only the root dentries, because this issue applies to all anonymous dentries.

            The problem is that when we change the names as we did, we fail that namelen-related assertion in 2.1. In retrospect, I think we have succeeded in our goal of isolating anonymous dentries, but perhaps there is a difference between passing NULL to ll_prep_md_op_data for the name and passing the anonymous dentry names from 2.6.32 kernels. Those names appear (I haven't tested, but this is my reading of the code) to be pointers to a string containing nothing but the null terminator, but they aren't actually NULL.

            I wonder whether this difference is significant.
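The NULL-versus-empty-string distinction raised in this comment can be illustrated in plain userspace C. This is only a minimal sketch, not Lustre code: `name_kind`, `classify_name`, `name_length`, and `check_null_vs_empty` are hypothetical names introduced here. It shows that a NULL pointer and a pointer to an empty string both yield a length of zero, yet remain distinguishable by a pointer test.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of a name argument; not a Lustre type. */
enum name_kind {
	NAME_NULL,	/* no pointer at all */
	NAME_EMPTY,	/* valid pointer to "" (terminator only) */
	NAME_PRESENT	/* valid pointer to a non-empty string */
};

static enum name_kind classify_name(const char *name)
{
	if (name == NULL)
		return NAME_NULL;
	if (name[0] == '\0')
		return NAME_EMPTY;
	return NAME_PRESENT;
}

/* Length as a caller might compute it: zero for both NULL and "". */
static size_t name_length(const char *name)
{
	return name == NULL ? 0 : strlen(name);
}

/* Self-check: same length, different pointer state. */
static void check_null_vs_empty(void)
{
	assert(name_length(NULL) == name_length(""));
	assert(classify_name(NULL) != classify_name(""));
}
```

Code that branches only on the length treats the two cases identically, while code that branches on the pointer does not. Whether that matters for the MDS assertion is exactly the open question in the comment.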

            adilger Andreas Dilger added a comment -

            FYI, DCACHE_DISCONNECTED is used on any NFS inode that is not connected to the namespace (i.e. if the NFS client does its own getattr-by-handle operation). To determine the root dentry, you should check whether dentry == sb->s_root.
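The root check suggested here can be sketched with mock structures in userspace C. The `mock_` types are hypothetical stand-ins that mirror only the `d_flags`, `d_sb`, and `s_root` fields of the kernel structures, and `MOCK_DCACHE_DISCONNECTED` is an arbitrary bit, not the kernel's value. The sketch shows why the flag cannot identify the root: a dentry can be disconnected without being the root.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Arbitrary illustrative bit; NOT the kernel's DCACHE_DISCONNECTED value. */
#define MOCK_DCACHE_DISCONNECTED 0x1u

struct mock_dentry;

struct mock_super_block {
	struct mock_dentry *s_root;	/* root dentry of this filesystem */
};

struct mock_dentry {
	unsigned int d_flags;
	struct mock_super_block *d_sb;
};

/* Root test as suggested in the comment: dentry == sb->s_root. */
static bool mock_is_root(const struct mock_dentry *dentry)
{
	assert(dentry != NULL && dentry->d_sb != NULL);
	return dentry == dentry->d_sb->s_root;
}

/* Flag test: true for any disconnected dentry, root or not. */
static bool mock_is_disconnected(const struct mock_dentry *dentry)
{
	assert(dentry != NULL);
	return (dentry->d_flags & MOCK_DCACHE_DISCONNECTED) != 0;
}
```

With one superblock, a root dentry, and a second dentry carrying the disconnected bit, `mock_is_root` is true only for the root, while `mock_is_disconnected` is true only for the second dentry.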

            paf Patrick Farrell (Inactive) added a comment - edited

            While I agree with Vitaly in the abstract, I don't see how we can fix this issue on the 2.4 clients.

            The open_by_fid code on 2.1 is activated when the name length is 0. SLES11SP2 changes anonymous dentries to have a non-null name with a non-zero length, so they no longer hit this check. We tried to fix this by having ll_intent_file_open not pass the name, but hit issues. The problem, as I see it, is that we need to omit the name only in the case of a root dentry.

            We tried that by using that code only when the DCACHE_DISCONNECTED flag was observed, but that still triggered the assertion on the MDS. It's not clear to me why. I thought DCACHE_DISCONNECTED was unique to the root dentry, but the observed behavior suggests not.

            The other possibility is that DCACHE_DISCONNECTED does uniquely identify the root dentry, but the zero-length root dentry from CentOS/SLES11SP1 comes across in ll_intent_file_open differently than just a 0 namelen and a null pointer for the name, and that this difference is essential in avoiding the MDS crashes.

            [THIS INFORMATION IS INCORRECT. See my reply to Andreas below.]
            So the problem remains: in order to achieve a 2.4/2.5-only fix, we must somehow force open_by_fid on 2.1 servers for root dentries, and only for root dentries. And we need to do it in a way that does not crash the 2.1 server or cause other problems on 2.4 because of the nulled file name.
            [^-- This is not only for root dentries, it is for all anonymous dentries.]

            I'm not sure that's possible. That's why we chose to add MDS_OPEN_BY_FID to 2.1. This is a fairly minor patch to 2.1, as all it adds is a way to force 2.1 to do open_by_fid.
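The two ways of selecting open-by-FID discussed in this comment can be contrasted in a small userspace sketch. Everything here is hypothetical: `mock_open_req`, `MOCK_OPEN_BY_FID` (an arbitrary bit, not the real `MDS_OPEN_BY_FID` value), and both functions are illustrations rather than MDT code. The first function models the implicit rule described for 2.1, where a zero name length selects open-by-FID. The second models that rule plus an explicit flag, which would work regardless of the name carried by the client's dentry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary illustrative bit; NOT the real MDS_OPEN_BY_FID value. */
#define MOCK_OPEN_BY_FID 0x1u

/* Hypothetical stand-in for the fields of an open request. */
struct mock_open_req {
	const char *name;
	size_t namelen;
	uint64_t flags;
};

/* Implicit rule: only a zero-length name selects open-by-FID. */
static bool implicit_open_by_fid(const struct mock_open_req *req)
{
	assert(req != NULL);
	return req->namelen == 0;
}

/* Implicit rule kept, with an explicit flag added on top of it. */
static bool flagged_open_by_fid(const struct mock_open_req *req)
{
	assert(req != NULL);
	return (req->flags & MOCK_OPEN_BY_FID) != 0 || req->namelen == 0;
}
```

Under the implicit rule, a request whose anonymous dentry has a non-empty name (the tests use "/" purely as an assumed example) falls through to the name-based path. With the flag set, the same request is routed to open-by-FID without the client having to blank the name.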

            cheng_shao Cheng Shao (Inactive) added a comment -

            Lai, regarding your comment below and the related patch at http://review.whamcloud.com/#/c/7627/:

            "Once these patches are accepted, they need to be backported to 2.4, and also fix 2.1 server code to maintain 2.5 <-> 2.1 interop. I'll continue working on this."

            Vitaly has a different view: we shouldn't change the old-version server code to accommodate features introduced in the newer version. The necessary changes should be made solely in the newer version to fix any interop issues with old versions. I think that is a legitimate point, although fixing the server side is much simpler in this case. Thoughts?
            jhammond John Hammond added a comment -

            The MDT is still generally at the mercy of the client to send valid names. Please see http://review.whamcloud.com/#/c/7961/ from LU-2875.


            adilger Andreas Dilger added a comment -

            This problem was introduced by the patches for LU-3544, and is no longer an issue now that the patch has been reverted.
            green Oleg Drokin added a comment -

            LU-3544 was reverted from master as well

            laisiyao Lai Siyao added a comment -

            Patch for b2_1 is on: http://review.whamcloud.com/#/c/7627/

            With these three patches, interop testing between 2.5 and 2.1 is ready to continue.

            pjones Peter Jones added a comment -

            We have reverted LU-3544 from b2_4 for now but are continuing to work on a more complete fix on master

            laisiyao Lai Siyao added a comment -

            Patch for master is on:
            http://review.whamcloud.com/#/c/7475/
            http://review.whamcloud.com/#/c/7476/

            These patches enable getattr/open-by-fid by default, so either the FID or the name is packed in these requests, and the server can handle op-by-fid correctly.

            Once these patches are accepted, they need to be backported to 2.4, and the 2.1 server code also needs to be fixed to maintain 2.5 <-> 2.1 interop. I'll continue working on this.


            People

              laisiyao Lai Siyao
              yujian Jian Yu