Lustre / LU-6602

ASSERTION( rec->lrh_len <= 8192 ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Labels: None
    • Severity: 3

    Description

        Testing this build: https://build.hpdd.intel.com/job/lustre-reviews/32021/

      In an AWS environment with 64 MDTs (8 MDS × 8 MDTs each):

      1. cd /mnt/lustre
      2. lfs mkdir -c 8 8stripedir
      3. lfs mkdir -c 64 64stripedir
        <hang>
        On MDS0
        LustreError: 1291:0:(llog_cat.c:319:llog_cat_add_rec()) ASSERTION( rec->lrh_len <= 8192 ) failed: 
        LustreError: 1291:0:(llog_cat.c:319:llog_cat_add_rec()) LBUG
        Pid: 1291, comm: mdt00_002
        
        Call Trace:
         [<ffffffffa00f2875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
         [<ffffffffa00f2e77>] lbug_with_loc+0x47/0xb0 [libcfs]
         [<ffffffffa0207848>] llog_cat_add_rec+0x3e8/0x450 [obdclass]
         [<ffffffffa01ff039>] llog_add+0x89/0x1c0 [obdclass]
         [<ffffffffa187b6f4>] sub_updates_write+0x154/0x600 [ptlrpc]
         [<ffffffffa187c247>] top_trans_stop+0x6a7/0xb40 [ptlrpc]
         [<ffffffffa1d8cd21>] lod_trans_stop+0x61/0x70 [lod]
         [<ffffffffa1e3149a>] mdd_trans_stop+0x1a/0xac [mdd]
         [<ffffffffa1e20909>] mdd_create+0x13a9/0x1750 [mdd]
         [<ffffffffa1cdb65c>] ? mdt_version_save+0x8c/0x1a0 [mdt]
         [<ffffffffa1cdf9ec>] mdt_reint_create+0xbbc/0xcc0 [mdt]
         [<ffffffffa1cdab1d>] mdt_reint_rec+0x5d/0x200 [mdt]
         [<ffffffffa1cbffcb>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
         [<ffffffffa1cc073b>] mdt_reint+0x6b/0x120 [mdt]
         [<ffffffffa1868e8e>] tgt_request_handle+0x8be/0xfe0 [ptlrpc]
         [<ffffffffa1818aa1>] ptlrpc_main+0xe41/0x1970 [ptlrpc]
         [<ffffffff81060c3f>] ? finish_task_switch+0x4f/0xf0
         [<ffffffffa1817c60>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
         [<ffffffff8109e71e>] kthread+0x9e/0xc0
         [<ffffffff8100c20a>] child_rip+0xa/0x20
         [<ffffffff8100b294>] ? int_ret_from_sys_call+0x7/0x1b
         [<ffffffff8100ba1d>] ? retint_restore_args+0x5/0x6
         [<ffffffff8100c200>] ? child_rip+0x0/0x20
        
        Kernel panic - not syncing: LBUG
        Pid: 1291, comm: mdt00_002 Not tainted 2.6.32-504.16.2.el6_lustre.gd805a88.x86_64 #1
        Call Trace:
         [<ffffffff81529fbc>] ? panic+0xa7/0x16f
         [<ffffffffa00f2ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
         [<ffffffffa0207848>] ? llog_cat_add_rec+0x3e8/0x450 [obdclass]
         [<ffffffffa01ff039>] ? llog_add+0x89/0x1c0 [obdclass]
         [<ffffffffa187b6f4>] ? sub_updates_write+0x154/0x600 [ptlrpc]
         [<ffffffffa187c247>] ? top_trans_stop+0x6a7/0xb40 [ptlrpc]
         [<ffffffffa1d8cd21>] ? lod_trans_stop+0x61/0x70 [lod]
         [<ffffffffa1e3149a>] ? mdd_trans_stop+0x1a/0xac [mdd]
         [<ffffffffa1e20909>] ? mdd_create+0x13a9/0x1750 [mdd]
         [<ffffffffa1cdb65c>] ? mdt_version_save+0x8c/0x1a0 [mdt]
         [<ffffffffa1cdf9ec>] ? mdt_reint_create+0xbbc/0xcc0 [mdt]
         [<ffffffffa1cdab1d>] ? mdt_reint_rec+0x5d/0x200 [mdt]
         [<ffffffffa1cbffcb>] ? mdt_reint_internal+0x4cb/0x7a0 [mdt]
         [<ffffffffa1cc073b>] ? mdt_reint+0x6b/0x120 [mdt]
         [<ffffffffa1868e8e>] ? tgt_request_handle+0x8be/0xfe0 [ptlrpc]
         [<ffffffffa1818aa1>] ? ptlrpc_main+0xe41/0x1970 [ptlrpc]
         [<ffffffff81060c3f>] ? finish_task_switch+0x4f/0xf0
         [<ffffffffa1817c60>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
         [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
         [<ffffffff8100c20a>] ? child_rip+0xa/0x20
         [<ffffffff8100b294>] ? int_ret_from_sys_call+0x7/0x1b
         [<ffffffff8100ba1d>] ? retint_restore_args+0x5/0x6
         [<ffffffff8100c200>] ? child_rip+0x0/0x20
        

      After each reboot/recovery cycle the MDS would LBUG again with the same error right after recovery completed, presumably because the client was resending the mkdir. Once I killed lfs, the crashes stopped.


          Activity

            [LU-6602] ASSERTION( rec->lrh_len <= 8192 ) failed

            simmonsja James A Simmons added a comment -

            sorry, typo. Meant to be LU-6202

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15274/
            Subject: LU-6602 osp: change lgh_hdr_lock to mutex
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fffe8ac7e42b6638bff9fe19c4bfeb6635023c92

            simmonsja James A Simmons added a comment -

            One patch left!!

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15162/
            Subject: LU-6602 update: split update llog record
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fb80ae7c7601a03c1181de381f067f553e7b8c6f

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15161/
            Subject: LU-6602 llog: increase update llog chunk size
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b45e500e5a996d8529ab3d85d542908c93b1e1ce
            di.wang Di Wang added a comment -

            Robert: could you please try https://build.hpdd.intel.com/job/lustre-reviews/32865/ Thanks.


            gerrit Gerrit Updater added a comment -

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15274
            Subject: LU-6602 osp: change lgh_hdr_lock to mutex
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6e46cf6e89a578b2e8a236ac4e00433d0bed0bba

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14883/
            Subject: LU-6602 obdclass: variable llog chunk size
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: dc689955366c895f9cdcc86d78f4221866fe0926
            di.wang Di Wang added a comment - edited

            Robert: Thanks for testing. It indeed looks like DNE issue. I will check the code. Thanks.

            rread Robert Read added a comment -

            Not sure if this is related to DNE or not, but during setup one of the MDS nodes panicked with this trace right after mounting an MDT:

            LDISKFS-fs (xvdg1): mounted filesystem with ordered data mode. quota=on. Opts: 
            LDISKFS-fs (xvdg1): mounted filesystem with ordered data mode. quota=on. Opts: 
            LDISKFS-fs (xvdg1): mounted filesystem with ordered data mode. quota=on. Opts: 
            BUG: unable to handle kernel NULL pointer dereference at 0000000000000024
            IP: [<ffffffffa0232eb6>] llog_cat_process_or_fork+0x46/0x300 [obdclass]
            PGD 0 
            Oops: 0000 [#1] SMP 
            last sysfs file: /sys/devices/vbd-2145/block/xvdg1/dev
            CPU 7 
            Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ipv6 xen_netfront ext4 jbd2 mbcache xen_blkfront dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
            
            Pid: 7552, comm: lod0061_rec0063 Not tainted 2.6.32-504.16.2.el6_lustre.g2f99b7f.x86_64 #1  
            RIP: e030:[<ffffffffa0232eb6>]  [<ffffffffa0232eb6>] llog_cat_process_or_fork+0x46/0x300 [obdclass]
            RSP: e02b:ffff8806694b7da0  EFLAGS: 00010246
            RAX: ffff88067e6aa378 RBX: ffff88068d020380 RCX: ffff88069c9bc240
            RDX: ffffffffa0c48be0 RSI: ffff88068d020380 RDI: ffff8806694b7e70
            RBP: ffff8806694b7e20 R08: 0000000000000000 R09: 0000000000000000
            R10: ffff88067232eec0 R11: 1000000000000000 R12: 0000000000000000
            R13: 0000000000000000 R14: ffff8806694b7e70 R15: ffff8806694b7e70
            FS:  00007fe3a4090700(0000) GS:ffff880028122000(0000) knlGS:0000000000000000
            CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
            CR2: 0000000000000024 CR3: 0000000001a85000 CR4: 0000000000002660
            DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            Process lod0061_rec0063 (pid: 7552, threadinfo ffff8806694b6000, task ffff8806694b5520)
            Stack:
             ffff8806694b7e70 ffff88067232a800 ffff8806694b7e40 ffffffffa0c72dc0
            <d> 0000000000000000 ffff8806694b7e70 ffff88066a66db40 ffff8806694b7e08
            <d> ffff88067e6aa078 ffff88067fb62030 ffff88067fb622b8 ffff88069c9bc240
            Call Trace:
             [<ffffffffa0c72dc0>] ? lod_sub_prep_llog+0x4f0/0x7b0 [lod]
             [<ffffffffa0233189>] llog_cat_process+0x19/0x20 [obdclass]
             [<ffffffffa0c4870a>] lod_sub_recovery_thread+0x4ba/0x990 [lod]
             [<ffffffff81007d82>] ? check_events+0x12/0x20
             [<ffffffff8152dbbc>] ? _spin_unlock_irqrestore+0x1c/0x20
             [<ffffffffa0c48250>] ? lod_sub_recovery_thread+0x0/0x990 [lod]
             [<ffffffff8109e71e>] kthread+0x9e/0xc0
             [<ffffffff8100c20a>] child_rip+0xa/0x20
             [<ffffffff8100b294>] ? int_ret_from_sys_call+0x7/0x1b
             [<ffffffff8100ba1d>] ? retint_restore_args+0x5/0x6
             [<ffffffff8100c200>] ? child_rip+0x0/0x20
            Code: f8 0f 1f 44 00 00 f6 05 8c ae f3 ff 01 44 0f b6 6d 10 49 89 fe 48 89 f3 4c 8b 66 38 74 0d f6 05 70 ae f3 ff 40 0f 85 9a 01 00 00 <41> f6 44 24 24 02 0f 84 6b 02 00 00 48 89 4d a0 48 89 55 a8 44 
            RIP  [<ffffffffa0232eb6>] llog_cat_process_or_fork+0x46/0x300 [obdclass]
             RSP <ffff8806694b7da0>
            CR2: 0000000000000024
            ---[ end trace 2a9e4e41d6fdd5e2 ]---
            Kernel panic - not syncing: Fatal exception
            Pid: 7552, comm: lod0061_rec0063 Tainted: G      D    ---------------    2.6.32-504.16.2.el6_lustre.g2f99b7f.x86_64 #1
            Call Trace:
             [<ffffffff81529fbc>] ? panic+0xa7/0x16f
             [<ffffffff8152dbbc>] ? _spin_unlock_irqrestore+0x1c/0x20
             [<ffffffff8152ed94>] ? oops_end+0xe4/0x100
             [<ffffffff8104c80b>] ? no_context+0xfb/0x260
             [<ffffffff8104ca95>] ? __bad_area_nosemaphore+0x125/0x1e0
             [<ffffffff8104cb63>] ? bad_area_nosemaphore+0x13/0x20
             [<ffffffff8104d25c>] ? __do_page_fault+0x30c/0x500
             [<ffffffff81007d82>] ? check_events+0x12/0x20
             [<ffffffff810075dd>] ? xen_force_evtchn_callback+0xd/0x10
             [<ffffffffa0511869>] ? out_update_pack+0xc9/0x190 [ptlrpc]
             [<ffffffff810075dd>] ? xen_force_evtchn_callback+0xd/0x10
             [<ffffffff81007d82>] ? check_events+0x12/0x20
             [<ffffffff81530cbe>] ? do_page_fault+0x3e/0xa0
             [<ffffffff8152e075>] ? page_fault+0x25/0x30
             [<ffffffffa0c48be0>] ? lod_process_recovery_updates+0x0/0x420 [lod]
             [<ffffffffa0232eb6>] ? llog_cat_process_or_fork+0x46/0x300 [obdclass]
             [<ffffffffa0c72dc0>] ? lod_sub_prep_llog+0x4f0/0x7b0 [lod]
             [<ffffffffa0233189>] ? llog_cat_process+0x19/0x20 [obdclass]
             [<ffffffffa0c4870a>] ? lod_sub_recovery_thread+0x4ba/0x990 [lod]
             [<ffffffff81007d82>] ? check_events+0x12/0x20
             [<ffffffff8152dbbc>] ? _spin_unlock_irqrestore+0x1c/0x20
             [<ffffffffa0c48250>] ? lod_sub_recovery_thread+0x0/0x990 [lod]
             [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
             [<ffffffff8100c20a>] ? child_rip+0xa/0x20
             [<ffffffff8100b294>] ? int_ret_from_sys_call+0x7/0x1b
             [<ffffffff8100ba1d>] ? retint_restore_args+0x5/0x6
             [<ffffffff8100c200>] ? child_rip+0x0/0x20
            

            People

              Assignee: Di Wang
              Reporter: Robert Read
              Votes: 0
              Watchers: 11
