Lustre / LU-6602

ASSERTION( rec->lrh_len <= 8192 ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Labels: None
    • Severity: 3

    Description

        Testing this build: https://build.hpdd.intel.com/job/lustre-reviews/32021/

      In an AWS environment with 64 MDTs (8 MDS × 8 MDTs each):

      1. cd /mnt/lustre
      2. lfs mkdir -c 8 8stripedir
      3. lfs mkdir -c 64 64stripedir
        <hang>
        On MDS0
        LustreError: 1291:0:(llog_cat.c:319:llog_cat_add_rec()) ASSERTION( rec->lrh_len <= 8192 ) failed: 
        LustreError: 1291:0:(llog_cat.c:319:llog_cat_add_rec()) LBUG
        Pid: 1291, comm: mdt00_002
        
        Call Trace:
         [<ffffffffa00f2875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
         [<ffffffffa00f2e77>] lbug_with_loc+0x47/0xb0 [libcfs]
         [<ffffffffa0207848>] llog_cat_add_rec+0x3e8/0x450 [obdclass]
         [<ffffffffa01ff039>] llog_add+0x89/0x1c0 [obdclass]
         [<ffffffffa187b6f4>] sub_updates_write+0x154/0x600 [ptlrpc]
         [<ffffffffa187c247>] top_trans_stop+0x6a7/0xb40 [ptlrpc]
         [<ffffffffa1d8cd21>] lod_trans_stop+0x61/0x70 [lod]
         [<ffffffffa1e3149a>] mdd_trans_stop+0x1a/0xac [mdd]
         [<ffffffffa1e20909>] mdd_create+0x13a9/0x1750 [mdd]
         [<ffffffffa1cdb65c>] ? mdt_version_save+0x8c/0x1a0 [mdt]
         [<ffffffffa1cdf9ec>] mdt_reint_create+0xbbc/0xcc0 [mdt]
         [<ffffffffa1cdab1d>] mdt_reint_rec+0x5d/0x200 [mdt]
         [<ffffffffa1cbffcb>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
         [<ffffffffa1cc073b>] mdt_reint+0x6b/0x120 [mdt]
         [<ffffffffa1868e8e>] tgt_request_handle+0x8be/0xfe0 [ptlrpc]
         [<ffffffffa1818aa1>] ptlrpc_main+0xe41/0x1970 [ptlrpc]
         [<ffffffff81060c3f>] ? finish_task_switch+0x4f/0xf0
         [<ffffffffa1817c60>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
         [<ffffffff8109e71e>] kthread+0x9e/0xc0
         [<ffffffff8100c20a>] child_rip+0xa/0x20
         [<ffffffff8100b294>] ? int_ret_from_sys_call+0x7/0x1b
         [<ffffffff8100ba1d>] ? retint_restore_args+0x5/0x6
         [<ffffffff8100c200>] ? child_rip+0x0/0x20
        
        Kernel panic - not syncing: LBUG
        Pid: 1291, comm: mdt00_002 Not tainted 2.6.32-504.16.2.el6_lustre.gd805a88.x86_64 #1
        Call Trace:
         [<ffffffff81529fbc>] ? panic+0xa7/0x16f
         [<ffffffffa00f2ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
         [<ffffffffa0207848>] ? llog_cat_add_rec+0x3e8/0x450 [obdclass]
         [<ffffffffa01ff039>] ? llog_add+0x89/0x1c0 [obdclass]
         [<ffffffffa187b6f4>] ? sub_updates_write+0x154/0x600 [ptlrpc]
         [<ffffffffa187c247>] ? top_trans_stop+0x6a7/0xb40 [ptlrpc]
         [<ffffffffa1d8cd21>] ? lod_trans_stop+0x61/0x70 [lod]
         [<ffffffffa1e3149a>] ? mdd_trans_stop+0x1a/0xac [mdd]
         [<ffffffffa1e20909>] ? mdd_create+0x13a9/0x1750 [mdd]
         [<ffffffffa1cdb65c>] ? mdt_version_save+0x8c/0x1a0 [mdt]
         [<ffffffffa1cdf9ec>] ? mdt_reint_create+0xbbc/0xcc0 [mdt]
         [<ffffffffa1cdab1d>] ? mdt_reint_rec+0x5d/0x200 [mdt]
         [<ffffffffa1cbffcb>] ? mdt_reint_internal+0x4cb/0x7a0 [mdt]
         [<ffffffffa1cc073b>] ? mdt_reint+0x6b/0x120 [mdt]
         [<ffffffffa1868e8e>] ? tgt_request_handle+0x8be/0xfe0 [ptlrpc]
         [<ffffffffa1818aa1>] ? ptlrpc_main+0xe41/0x1970 [ptlrpc]
         [<ffffffff81060c3f>] ? finish_task_switch+0x4f/0xf0
         [<ffffffffa1817c60>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
         [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
         [<ffffffff8100c20a>] ? child_rip+0xa/0x20
         [<ffffffff8100b294>] ? int_ret_from_sys_call+0x7/0x1b
         [<ffffffff8100ba1d>] ? retint_restore_args+0x5/0x6
         [<ffffffff8100c200>] ? child_rip+0x0/0x20
        

      After each reboot/recovery cycle the MDS would LBUG again with the same error right after recovery completed, presumably because the client was resending the mkdir. Once I killed lfs, the crashes stopped.


          Activity

            [LU-6602] ASSERTION( rec->lrh_len <= 8192 ) failed

            simmonsja James A Simmons added a comment -

            sorry, typo. Meant to be LU-6202

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15274/
            Subject: LU-6602 osp: change lgh_hdr_lock to mutex
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fffe8ac7e42b6638bff9fe19c4bfeb6635023c92

            simmonsja James A Simmons added a comment -

            One patch left!!

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15162/
            Subject: LU-6602 update: split update llog record
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fb80ae7c7601a03c1181de381f067f553e7b8c6f

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15161/
            Subject: LU-6602 llog: increase update llog chunk size
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b45e500e5a996d8529ab3d85d542908c93b1e1ce
            di.wang Di Wang added a comment -

            Robert: could you please try https://build.hpdd.intel.com/job/lustre-reviews/32865/ Thanks.


            gerrit Gerrit Updater added a comment -

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15274
            Subject: LU-6602 osp: change lgh_hdr_lock to mutex
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6e46cf6e89a578b2e8a236ac4e00433d0bed0bba

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14883/
            Subject: LU-6602 obdclass: variable llog chunk size
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: dc689955366c895f9cdcc86d78f4221866fe0926
            di.wang Di Wang added a comment - edited

            Robert: Thanks for testing. It indeed looks like DNE issue. I will check the code. Thanks.

            rread Robert Read added a comment -

            Not sure if this is related to DNE or not, but during setup one of the MDS nodes panicked with this trace right after mounting an MDT:

            LDISKFS-fs (xvdg1): mounted filesystem with ordered data mode. quota=on. Opts: 
            LDISKFS-fs (xvdg1): mounted filesystem with ordered data mode. quota=on. Opts: 
            LDISKFS-fs (xvdg1): mounted filesystem with ordered data mode. quota=on. Opts: 
            BUG: unable to handle kernel NULL pointer dereference at 0000000000000024
            IP: [<ffffffffa0232eb6>] llog_cat_process_or_fork+0x46/0x300 [obdclass]
            PGD 0 
            Oops: 0000 [#1] SMP 
            last sysfs file: /sys/devices/vbd-2145/block/xvdg1/dev
            CPU 7 
            Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ipv6 xen_netfront ext4 jbd2 mbcache xen_blkfront dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
            
            Pid: 7552, comm: lod0061_rec0063 Not tainted 2.6.32-504.16.2.el6_lustre.g2f99b7f.x86_64 #1  
            RIP: e030:[<ffffffffa0232eb6>]  [<ffffffffa0232eb6>] llog_cat_process_or_fork+0x46/0x300 [obdclass]
            RSP: e02b:ffff8806694b7da0  EFLAGS: 00010246
            RAX: ffff88067e6aa378 RBX: ffff88068d020380 RCX: ffff88069c9bc240
            RDX: ffffffffa0c48be0 RSI: ffff88068d020380 RDI: ffff8806694b7e70
            RBP: ffff8806694b7e20 R08: 0000000000000000 R09: 0000000000000000
            R10: ffff88067232eec0 R11: 1000000000000000 R12: 0000000000000000
            R13: 0000000000000000 R14: ffff8806694b7e70 R15: ffff8806694b7e70
            FS:  00007fe3a4090700(0000) GS:ffff880028122000(0000) knlGS:0000000000000000
            CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
            CR2: 0000000000000024 CR3: 0000000001a85000 CR4: 0000000000002660
            DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            Process lod0061_rec0063 (pid: 7552, threadinfo ffff8806694b6000, task ffff8806694b5520)
            Stack:
             ffff8806694b7e70 ffff88067232a800 ffff8806694b7e40 ffffffffa0c72dc0
            <d> 0000000000000000 ffff8806694b7e70 ffff88066a66db40 ffff8806694b7e08
            <d> ffff88067e6aa078 ffff88067fb62030 ffff88067fb622b8 ffff88069c9bc240
            Call Trace:
             [<ffffffffa0c72dc0>] ? lod_sub_prep_llog+0x4f0/0x7b0 [lod]
             [<ffffffffa0233189>] llog_cat_process+0x19/0x20 [obdclass]
             [<ffffffffa0c4870a>] lod_sub_recovery_thread+0x4ba/0x990 [lod]
             [<ffffffff81007d82>] ? check_events+0x12/0x20
             [<ffffffff8152dbbc>] ? _spin_unlock_irqrestore+0x1c/0x20
             [<ffffffffa0c48250>] ? lod_sub_recovery_thread+0x0/0x990 [lod]
             [<ffffffff8109e71e>] kthread+0x9e/0xc0
             [<ffffffff8100c20a>] child_rip+0xa/0x20
             [<ffffffff8100b294>] ? int_ret_from_sys_call+0x7/0x1b
             [<ffffffff8100ba1d>] ? retint_restore_args+0x5/0x6
             [<ffffffff8100c200>] ? child_rip+0x0/0x20
            Code: f8 0f 1f 44 00 00 f6 05 8c ae f3 ff 01 44 0f b6 6d 10 49 89 fe 48 89 f3 4c 8b 66 38 74 0d f6 05 70 ae f3 ff 40 0f 85 9a 01 00 00 <41> f6 44 24 24 02 0f 84 6b 02 00 00 48 89 4d a0 48 89 55 a8 44 
            RIP  [<ffffffffa0232eb6>] llog_cat_process_or_fork+0x46/0x300 [obdclass]
             RSP <ffff8806694b7da0>
            CR2: 0000000000000024
            ---[ end trace 2a9e4e41d6fdd5e2 ]---
            Kernel panic - not syncing: Fatal exception
            Pid: 7552, comm: lod0061_rec0063 Tainted: G      D    ---------------    2.6.32-504.16.2.el6_lustre.g2f99b7f.x86_64 #1
            Call Trace:
             [<ffffffff81529fbc>] ? panic+0xa7/0x16f
             [<ffffffff8152dbbc>] ? _spin_unlock_irqrestore+0x1c/0x20
             [<ffffffff8152ed94>] ? oops_end+0xe4/0x100
             [<ffffffff8104c80b>] ? no_context+0xfb/0x260
             [<ffffffff8104ca95>] ? __bad_area_nosemaphore+0x125/0x1e0
             [<ffffffff8104cb63>] ? bad_area_nosemaphore+0x13/0x20
             [<ffffffff8104d25c>] ? __do_page_fault+0x30c/0x500
             [<ffffffff81007d82>] ? check_events+0x12/0x20
             [<ffffffff810075dd>] ? xen_force_evtchn_callback+0xd/0x10
             [<ffffffffa0511869>] ? out_update_pack+0xc9/0x190 [ptlrpc]
             [<ffffffff810075dd>] ? xen_force_evtchn_callback+0xd/0x10
             [<ffffffff81007d82>] ? check_events+0x12/0x20
             [<ffffffff81530cbe>] ? do_page_fault+0x3e/0xa0
             [<ffffffff8152e075>] ? page_fault+0x25/0x30
             [<ffffffffa0c48be0>] ? lod_process_recovery_updates+0x0/0x420 [lod]
             [<ffffffffa0232eb6>] ? llog_cat_process_or_fork+0x46/0x300 [obdclass]
             [<ffffffffa0c72dc0>] ? lod_sub_prep_llog+0x4f0/0x7b0 [lod]
             [<ffffffffa0233189>] ? llog_cat_process+0x19/0x20 [obdclass]
             [<ffffffffa0c4870a>] ? lod_sub_recovery_thread+0x4ba/0x990 [lod]
             [<ffffffff81007d82>] ? check_events+0x12/0x20
             [<ffffffff8152dbbc>] ? _spin_unlock_irqrestore+0x1c/0x20
             [<ffffffffa0c48250>] ? lod_sub_recovery_thread+0x0/0x990 [lod]
             [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
             [<ffffffff8100c20a>] ? child_rip+0xa/0x20
             [<ffffffff8100b294>] ? int_ret_from_sys_call+0x7/0x1b
             [<ffffffff8100ba1d>] ? retint_restore_args+0x5/0x6
             [<ffffffff8100c200>] ? child_rip+0x0/0x20
            

            People

              Assignee: Di Wang
              Reporter: Robert Read
              Votes: 0
              Watchers: 11
