Lustre / LU-6602

ASSERTION( rec->lrh_len <= 8192 ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Lustre 2.8.0
    • None
    • 3

    Description

        Testing this build: https://build.hpdd.intel.com/job/lustre-reviews/32021/

      In an AWS environment with 64 MDTs (8 MDSs with 8 MDTs each):

      1. cd /mnt/lustre
      2. lfs mkdir -c 8 8stripedir
      3. lfs mkdir -c 64 64stripedir
        <hang>
        On MDS0
        LustreError: 1291:0:(llog_cat.c:319:llog_cat_add_rec()) ASSERTION( rec->lrh_len <= 8192 ) failed: 
        LustreError: 1291:0:(llog_cat.c:319:llog_cat_add_rec()) LBUG
        Pid: 1291, comm: mdt00_002
        
        Call Trace:
         [<ffffffffa00f2875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
         [<ffffffffa00f2e77>] lbug_with_loc+0x47/0xb0 [libcfs]
         [<ffffffffa0207848>] llog_cat_add_rec+0x3e8/0x450 [obdclass]
         [<ffffffffa01ff039>] llog_add+0x89/0x1c0 [obdclass]
         [<ffffffffa187b6f4>] sub_updates_write+0x154/0x600 [ptlrpc]
         [<ffffffffa187c247>] top_trans_stop+0x6a7/0xb40 [ptlrpc]
         [<ffffffffa1d8cd21>] lod_trans_stop+0x61/0x70 [lod]
         [<ffffffffa1e3149a>] mdd_trans_stop+0x1a/0xac [mdd]
         [<ffffffffa1e20909>] mdd_create+0x13a9/0x1750 [mdd]
         [<ffffffffa1cdb65c>] ? mdt_version_save+0x8c/0x1a0 [mdt]
         [<ffffffffa1cdf9ec>] mdt_reint_create+0xbbc/0xcc0 [mdt]
         [<ffffffffa1cdab1d>] mdt_reint_rec+0x5d/0x200 [mdt]
         [<ffffffffa1cbffcb>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
         [<ffffffffa1cc073b>] mdt_reint+0x6b/0x120 [mdt]
         [<ffffffffa1868e8e>] tgt_request_handle+0x8be/0xfe0 [ptlrpc]
         [<ffffffffa1818aa1>] ptlrpc_main+0xe41/0x1970 [ptlrpc]
         [<ffffffff81060c3f>] ? finish_task_switch+0x4f/0xf0
         [<ffffffffa1817c60>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
         [<ffffffff8109e71e>] kthread+0x9e/0xc0
         [<ffffffff8100c20a>] child_rip+0xa/0x20
         [<ffffffff8100b294>] ? int_ret_from_sys_call+0x7/0x1b
         [<ffffffff8100ba1d>] ? retint_restore_args+0x5/0x6
         [<ffffffff8100c200>] ? child_rip+0x0/0x20
        
        Kernel panic - not syncing: LBUG
        Pid: 1291, comm: mdt00_002 Not tainted 2.6.32-504.16.2.el6_lustre.gd805a88.x86_64 #1
        Call Trace:
         [<ffffffff81529fbc>] ? panic+0xa7/0x16f
         [<ffffffffa00f2ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
         [<ffffffffa0207848>] ? llog_cat_add_rec+0x3e8/0x450 [obdclass]
         [<ffffffffa01ff039>] ? llog_add+0x89/0x1c0 [obdclass]
         [<ffffffffa187b6f4>] ? sub_updates_write+0x154/0x600 [ptlrpc]
         [<ffffffffa187c247>] ? top_trans_stop+0x6a7/0xb40 [ptlrpc]
         [<ffffffffa1d8cd21>] ? lod_trans_stop+0x61/0x70 [lod]
         [<ffffffffa1e3149a>] ? mdd_trans_stop+0x1a/0xac [mdd]
         [<ffffffffa1e20909>] ? mdd_create+0x13a9/0x1750 [mdd]
         [<ffffffffa1cdb65c>] ? mdt_version_save+0x8c/0x1a0 [mdt]
         [<ffffffffa1cdf9ec>] ? mdt_reint_create+0xbbc/0xcc0 [mdt]
         [<ffffffffa1cdab1d>] ? mdt_reint_rec+0x5d/0x200 [mdt]
         [<ffffffffa1cbffcb>] ? mdt_reint_internal+0x4cb/0x7a0 [mdt]
         [<ffffffffa1cc073b>] ? mdt_reint+0x6b/0x120 [mdt]
         [<ffffffffa1868e8e>] ? tgt_request_handle+0x8be/0xfe0 [ptlrpc]
         [<ffffffffa1818aa1>] ? ptlrpc_main+0xe41/0x1970 [ptlrpc]
         [<ffffffff81060c3f>] ? finish_task_switch+0x4f/0xf0
         [<ffffffffa1817c60>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
         [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
         [<ffffffff8100c20a>] ? child_rip+0xa/0x20
         [<ffffffff8100b294>] ? int_ret_from_sys_call+0x7/0x1b
         [<ffffffff8100ba1d>] ? retint_restore_args+0x5/0x6
         [<ffffffff8100c200>] ? child_rip+0x0/0x20
        

      After each reboot/recovery cycle, the MDS would LBUG again with the same error right after recovery completed. Presumably the client was resending the mkdir. Once I killed lfs, the crashes stopped.
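
      For illustration only: the assertion caps a single llog record at 8192 bytes, so a transaction whose updates add up to more than that cannot be written as one record. The sketch below shows one way to pack a list of update sizes into records that stay under the cap. It is not the Lustre implementation. The 8192 limit comes from the assertion; the per-record overhead, the per-update size of 200 bytes, and the names split_updates and record_len are hypothetical and chosen only to make the example concrete.

```python
from typing import List

# Limit taken from the failed assertion: rec->lrh_len <= 8192.
MAX_LLOG_REC_LEN = 8192

# HYPOTHETICAL fixed per-record overhead (header plus tail). The real value
# depends on the llog record layout and is not stated in this ticket.
HYPOTHETICAL_REC_OVERHEAD = 32


def record_len(record: List[int],
               overhead: int = HYPOTHETICAL_REC_OVERHEAD) -> int:
    """Total length of one record: overhead plus the sizes of its updates."""
    return overhead + sum(record)


def split_updates(update_sizes: List[int],
                  max_len: int = MAX_LLOG_REC_LEN,
                  overhead: int = HYPOTHETICAL_REC_OVERHEAD) -> List[List[int]]:
    """Greedily pack update sizes, in order, into records of at most max_len bytes."""
    budget = max_len - overhead
    records: List[List[int]] = []
    current: List[int] = []
    used = 0
    for size in update_sizes:
        if size > budget:
            # A single update larger than the budget cannot be packed at all.
            raise ValueError(f"update of {size} bytes does not fit in one record")
        if used + size > budget:
            # Current record is full: close it and start a new one.
            records.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        records.append(current)
    return records


if __name__ == "__main__":
    # HYPOTHETICAL workload: one 200-byte update per stripe.
    for stripes in (8, 64):
        recs = split_updates([200] * stripes)
        print(stripes, "stripes:", [record_len(r) for r in recs])
```

      With these made-up sizes, the 8-stripe case fits in one record while the 64-stripe case needs two, which mirrors the pattern in the reproduction steps without claiming anything about the real record sizes.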

          Activity

            simmonsja James A Simmons added a comment - sorry, typo. Meant to be LU-6202
            simmonsja James A Simmons made changes -
            Description: formatting-only edit; the build URL was turned into a link, and the report text and stack traces are unchanged.
            adilger Andreas Dilger made changes -
            Link New: This issue is related to DDN-651 [ DDN-651 ]
            adilger Andreas Dilger made changes -
            Labels Original: DNE2 New: dne2
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-7666 [ LU-7666 ]
            di.wang Di Wang made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: In Progress [ 3 ] New: Resolved [ 5 ]

            gerrit Gerrit Updater added a comment -
            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15274/
            Subject: LU-6602 osp: change lgh_hdr_lock to mutex
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fffe8ac7e42b6638bff9fe19c4bfeb6635023c92
            di.wang Di Wang made changes -
            Link New: This issue is related to LU-6831 [ LU-6831 ]

            simmonsja James A Simmons added a comment - One patch left!!

            gerrit Gerrit Updater added a comment -
            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15162/
            Subject: LU-6602 update: split update llog record
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fb80ae7c7601a03c1181de381f067f553e7b8c6f

            People

              Assignee: di.wang Di Wang
              Reporter: rread Robert Read