
2.5.0<->2.1.5 interop: sanity test 24u: (mdt_handler.c:224:mdt_lock_pdo_init()) ASSERTION( namelen > 0 ) failed


    Description

      sanity test 24u hit the following failure on MDS:

      11:06:31:Lustre: DEBUG MARKER: == sanity test 24u: create stripe file == 11:06:31 (1376417191)
      11:06:31:LustreError: 13255:0:(mdt_handler.c:224:mdt_lock_pdo_init()) ASSERTION( namelen > 0 ) failed: 
      11:06:31:LustreError: 13255:0:(mdt_handler.c:224:mdt_lock_pdo_init()) LBUG
      11:06:31:Pid: 13255, comm: mdt_01
      11:06:31:
      11:06:31:Call Trace:
      11:06:31: [<ffffffffa04d0785>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      11:06:31: [<ffffffffa04d0d97>] lbug_with_loc+0x47/0xb0 [libcfs]
      11:06:31: [<ffffffffa0bdea65>] mdt_lock_pdo_init+0xe5/0xf0 [mdt]
      11:06:31: [<ffffffffa0c127c6>] mdt_reint_open+0x1f6/0x2940 [mdt]
      11:06:31: [<ffffffffa077b764>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
      11:06:32: [<ffffffffa0ba256e>] ? md_ucred+0x1e/0x60 [mdd]
      11:06:32: [<ffffffffa0be15d5>] ? mdt_ucred+0x15/0x20 [mdt]
      11:06:32: [<ffffffffa0bf84ec>] ? mdt_root_squash+0x2c/0x3e0 [mdt]
      11:06:32: [<ffffffffa0bfcc51>] mdt_reint_rec+0x41/0xe0 [mdt]
      11:06:32: [<ffffffffa0bf3ed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
      11:06:32: [<ffffffffa0bf453d>] mdt_intent_reint+0x1ed/0x500 [mdt]
      11:06:32: [<ffffffffa0bf2c09>] mdt_intent_policy+0x379/0x690 [mdt]
      11:06:32: [<ffffffffa0737391>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
      11:06:32: [<ffffffffa075d1ed>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
      11:06:32: [<ffffffffa0bf3586>] mdt_enqueue+0x46/0x130 [mdt]
      11:06:32: [<ffffffffa0be8772>] mdt_handle_common+0x932/0x1750 [mdt]
      11:06:32: [<ffffffffa0be9665>] mdt_regular_handle+0x15/0x20 [mdt]
      11:06:32: [<ffffffffa078bbae>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
      11:06:32: [<ffffffffa078af60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
      11:06:32: [<ffffffff8100c0ca>] child_rip+0xa/0x20
      11:06:32: [<ffffffffa078af60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
      11:06:32: [<ffffffffa078af60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
      11:06:32: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      11:06:32:
      11:06:32:Kernel panic - not syncing: LBUG
      

      Maloo report: https://maloo.whamcloud.com/test_sets/e3f3b3d8-0525-11e3-8d88-52540035b04c

      More instances:
      https://maloo.whamcloud.com/test_sets/369e054c-0059-11e3-bb00-52540035b04c
      https://maloo.whamcloud.com/test_sets/0bf3fdbc-f8f5-11e2-8917-52540035b04c
      https://maloo.whamcloud.com/test_sets/59b2a818-f504-11e2-a8f6-52540035b04c

          Activity

            [LU-3765] 2.5.0<->2.1.5 interop: sanity test 24u: (mdt_handler.c:224:mdt_lock_pdo_init()) ASSERTION( namelen > 0 ) failed

            paf Patrick Farrell (Inactive) added a comment - edited

            While I agree with Vitaly in the abstract, I don't see how we can fix this issue on the 2.4 clients.

            The open_by_fid code on 2.1 is activated when the name length is 0. SLES11SP2 changes anonymous dentries to have a non-null name of non-zero length, so they no longer hit this check. We tried to fix this by having ll_intent_file_open not pass the name, but hit issues. The problem, as I see it, is that we need to omit the name only in the case of a root dentry.

            We tried that by using that code only when the DCACHE_DISCONNECTED flag was observed, but that still triggered the assertion on the MDS. It's not clear to me why - I thought DCACHE_DISCONNECTED was unique to the root dentry, but the observed behavior suggests not.

            The other possibility is that DCACHE_DISCONNECTED does uniquely identify the root dentry, but the zero-length root dentry from CentOS/SLES11SP1 comes across in ll_intent_file_open differently than just a 0 namelen and a null pointer for the name, and that this difference is essential in avoiding the MDS crashes.

            [THIS INFORMATION IS INCORRECT. See my reply to Andreas below.]
            So the problem remains: in order to achieve a 2.4/2.5-only fix, we must somehow force open_by_fid on 2.1 servers for root dentries, and only for root dentries. And we need to do it in a way that does not crash the 2.1 server or cause other problems on 2.4 because of the nulled file name.
            [^-- This is not only for root dentries, it is for all anonymous dentries.]

            I'm not sure that's possible. That's why we chose to add MDS_OPEN_BY_FID to 2.1. This is a fairly minor patch to 2.1, as all it adds is a way to force 2.1 to do open_by_fid.
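            The selection rule described in the comment above can be modeled roughly as follows. This is only an illustrative sketch, not Lustre source: the names sketch_old_server, sketch_patched_server, and SKETCH_OPEN_BY_FID are hypothetical, and the flag bit is a made-up value rather than the real MDS_OPEN_BY_FID constant. It assumes the old server picks open-by-FID only on a zero-length name, and that the patched server additionally honors an explicit flag.

```c
#include <stddef.h>

/* Hypothetical flag bit for this sketch only; NOT the real MDS_OPEN_BY_FID value. */
#define SKETCH_OPEN_BY_FID 0x1u

enum sketch_open_path { SKETCH_BY_NAME, SKETCH_BY_FID };

/*
 * Assumed old-server rule, as described in the comment: open-by-FID is
 * chosen only when the client sends a zero-length name.
 */
static enum sketch_open_path sketch_old_server(size_t namelen)
{
    return namelen == 0 ? SKETCH_BY_FID : SKETCH_BY_NAME;
}

/*
 * Assumed patched-server rule: an explicit flag forces open-by-FID,
 * regardless of whether a name was sent; otherwise fall back to the
 * old rule.
 */
static enum sketch_open_path sketch_patched_server(unsigned int flags,
                                                   size_t namelen)
{
    if (flags & SKETCH_OPEN_BY_FID)
        return SKETCH_BY_FID;
    return sketch_old_server(namelen);
}
```

            Under these assumptions, a client that sends a non-empty name for an anonymous dentry falls onto the by-name path on the old server, while the patched server takes the by-FID path whenever the flag is set.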

            cheng_shao Cheng Shao (Inactive) added a comment -

            Lai, regarding your comment below and the related patch at http://review.whamcloud.com/#/c/7627/:

            "Once these patches are accepted, they need to be backported to 2.4, and also fix 2.1 server code to maintain 2.5 <-> 2.1 interop. I'll continue working on this."

            Vitaly has a different thought: we shouldn't make changes in the old-version server code to accommodate features introduced in the newer version. The necessary changes should be made solely in the newer version to fix any interop issues with old versions. I think that is a legitimate comment, although fixing the server side is much simpler in this case. Thoughts?
            jhammond John Hammond added a comment -

            The MDT is still generally at the mercy of the client to send valid names. Please see http://review.whamcloud.com/#/c/7961/ from LU-2875.


            adilger Andreas Dilger added a comment -

            This problem was introduced by the patches for LU-3544, and is no longer an issue now that the patch has been reverted.
            green Oleg Drokin added a comment -

            LU-3544 was reverted from master as well

            laisiyao Lai Siyao added a comment -

            Patch for b2_1 is on: http://review.whamcloud.com/#/c/7627/

            Now it's ready to continue interop test between 2.5 and 2.1 with these three patches.

            pjones Peter Jones added a comment -

            We have reverted LU-3544 from b2_4 for now but are continuing to work on a more complete fix on master

            laisiyao Lai Siyao added a comment -

            Patch for master is on:
            http://review.whamcloud.com/#/c/7475/
            http://review.whamcloud.com/#/c/7476/

            These patches enable getattr/open-by-fid by default, so either the FID or the name is packed in these requests, and the server can handle operations by FID correctly.

            Once these patches are accepted, they need to be backported to 2.4, and also fix 2.1 server code to maintain 2.5 <-> 2.1 interop. I'll continue working on this.


            paf Patrick Farrell (Inactive) added a comment -

            Lai,

            Take a look at my latest comment in LU-3544. I think it belongs there and not here, but it could go in either. It's about the problems with the proposed patch for LU-3765.
            laisiyao Lai Siyao added a comment -

            IMO once the client has specified MDS_OPEN_BY_FID, the MDS should never open by name, because the name may be invalid and using it could cause inconsistency. If this is true, the MDS open-by-fid code can be simplified a lot.

            Patch is on http://review.whamcloud.com/#/c/7358/
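            As a rough illustration of the rule proposed in this comment (a sketch only, not the code in the linked patch): the request structure, the function sketch_resolve, and the SKETCH_OPEN_BY_FID bit are all hypothetical names invented for this example. It assumes that when the flag is set the server ignores any packed name, and that on the by-name path an empty name is rejected rather than asserted on, which is one defensive option and not necessarily what any Lustre patch does.

```c
#include <stddef.h>

/* Hypothetical flag bit for this sketch only; NOT the real MDS_OPEN_BY_FID value. */
#define SKETCH_OPEN_BY_FID 0x1u

enum sketch_result { SKETCH_USE_FID, SKETCH_USE_NAME, SKETCH_BAD_NAME };

/* Hypothetical, simplified view of an open request. */
struct sketch_open_req {
    unsigned int flags;
    const char  *name;     /* may be NULL */
    size_t       namelen;
};

static enum sketch_result sketch_resolve(const struct sketch_open_req *req)
{
    /* Flag set: never look at the name, even if one was packed. */
    if (req->flags & SKETCH_OPEN_BY_FID)
        return SKETCH_USE_FID;

    /* By-name path: report an empty name as an error instead of asserting. */
    if (req->name == NULL || req->namelen == 0)
        return SKETCH_BAD_NAME;

    return SKETCH_USE_NAME;
}
```

            With this shape, a stale or bogus name sent together with the flag has no effect on the result, which is the simplification the comment argues for.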

            yujian Jian Yu added a comment -

            This is blocking the whole test session on Lustre b2_4 client with 2.1.6 server:
            https://maloo.whamcloud.com/test_sessions/ac905704-0569-11e3-b127-52540035b04c


            People

              laisiyao Lai Siyao
              yujian Jian Yu