[LU-8422] llog_osd.c:165:llog_osd_pad()) ASSERTION( len >= (24) && (len & 0x7) == 0 ) Created: 20/Jul/16  Updated: 18/Aug/17  Resolved: 18/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Christopher Morrone Assignee: Mikhail Pershin
Resolution: Duplicate Votes: 0
Labels: llnl

Issue Links:
Related
is related to LU-9848 LBUG: ASSERTION( len >= (24) && (len ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

During 2.8.0 DNE testing (in particular tag 2.8.0_0.0.llnlpreview.25), we hit the following assertion on of the MDS nodes:

2016-07-20 09:39:28 [87775.729342] LustreError: 28402:0:(llog_osd.c:165:llog_osd_pad()) ASSERTION( len >= (24) && (len & 0x7) == 0 ) failed:
2016-07-20 09:39:28 [87775.741933] LustreError: 28402:0:(llog_osd.c:165:llog_osd_pad()) LBUG

This is nothing else much happening in the console log except for regular occurrences of LU-6447. The most recent hit of LU-6447 was about two minutes earlier.

There is no available crash dump. No additional logs since the node LBUGs on assertion.



 Comments   
Comment by Christopher Morrone [ 21/Jul/16 ]

This is now hitting frequently, sometimes within minutes of recovery completing on the MDS. This particular test system is hosed until we have a fix for this.

Comment by Christopher Morrone [ 21/Jul/16 ]

And on the plus side, if you have any debug patches that you would like me to try, I can probably get you the results pretty quickly.

Comment by Christopher Morrone [ 22/Jul/16 ]

We got crash dumps hacked into a basically working state. The full backtrace of the latest hit of this assertion looks like this:

PID: 14429  TASK: ffff887f23d90b80  CPU: 12  COMMAND: "mdt03_004"
 #0 [ffff887ee35c35d0] machine_kexec at ffffffff810545db
 #1 [ffff887ee35c3630] crash_kexec at ffffffff810f7ec2
 #2 [ffff887ee35c3700] panic at ffffffff816456b6
 #3 [ffff887ee35c3780] lbug_with_loc at ffffffffa0858dfb [libcfs]
 #4 [ffff887ee35c37a0] llog_osd_pad at ffffffffa093b171 [obdclass]
 #5 [ffff887ee35c37f8] llog_osd_write_rec at ffffffffa09407d0 [obdclass]
 #6 [ffff887ee35c3898] llog_write_rec at ffffffffa092f97a [obdclass]
 #7 [ffff887ee35c38e0] llog_cat_add_rec at ffffffffa0934108 [obdclass]
 #8 [ffff887ee35c3930] llog_add at ffffffffa092ca8a [obdclass]
 #9 [ffff887ee35c3970] sub_updates_write at ffffffffa0c508cf [ptlrpc]
#10 [ffff887ee35c3a08] top_trans_stop at ffffffffa0c3eb96 [ptlrpc]
#11 [ffff887ee35c3a60] lod_trans_stop at ffffffffa0f27439 [lod]
#12 [ffff887ee35c3ad8] mdd_trans_stop at ffffffffa0fb70ca [mdd]
#13 [ffff887ee35c3ae8] mdd_unlink at ffffffffa0fa6b7a [mdd]
#14 [ffff887ee35c3bb0] mdt_reint_unlink at ffffffffa0e9a470 [mdt]
#15 [ffff887ee35c3c48] mdt_reint_rec at ffffffffa0e9d970 [mdt]
#16 [ffff887ee35c3c70] mdt_reint_internal at ffffffffa0e80b01 [mdt]
#17 [ffff887ee35c3ca8] mdt_reint at ffffffffa0e8a3c7 [mdt]
#18 [ffff887ee35c3cd8] tgt_request_handle at ffffffffa0c2aa65 [ptlrpc]
#19 [ffff887ee35c3d20] ptlrpc_server_handle_request at ffffffffa0bd61ab [ptlrpc]
#20 [ffff887ee35c3de8] ptlrpc_main at ffffffffa0bda2f0 [ptlrpc]
#21 [ffff887ee35c3ec8] kthread at ffffffff810a997f
#22 [ffff887ee35c3f50] ret_from_fork at ffffffff8165d658

This time there was also some additional console traffic right before the assertion:

2016-07-21 16:45:44 [ 2587.821304] LustreError: 14086:0:(llog_cat.c:712:llog_cat_cancel_records()) lquake-MDT000f-osp-MDT0005: fail to cancel 1 of 1 llog-records: rc = -116
2016-07-21 16:45:49 [ 2592.856881] LustreError: 14086:0:(llog_cat.c:712:llog_cat_cancel_records()) lquake-MDT000a-osp-MDT0005: fail to cancel 1 of 1 llog-records: rc = -116
2016-07-21 16:45:54 [ 2597.877414] LustreError: 14086:0:(llog_cat.c:712:llog_cat_cancel_records()) lquake-MDT0001-osp-MDT0005: fail to cancel 1 of 1 llog-records: rc = -116
2016-07-21 16:45:54 [ 2597.893545] LustreError: 14086:0:(llog_cat.c:712:llog_cat_cancel_records()) Skipped 3 previous similar messages
2016-07-21 16:45:59 [ 2602.962942] LustreError: 14086:0:(llog_cat.c:712:llog_cat_cancel_records()) lquake-MDT000c-osp-MDT0005: fail to cancel 1 of 1 llog-records: rc = -116
2016-07-21 16:45:59 [ 2602.979124] LustreError: 14086:0:(llog_cat.c:712:llog_cat_cancel_records()) Skipped 6 previous similar messages
2016-07-21 16:46:04 [ 2607.966714] LustreError: 14086:0:(llog_cat.c:712:llog_cat_cancel_records()) lquake-MDT000c-osp-MDT0005: fail to cancel 1 of 1 llog-records: rc = -116
2016-07-21 16:46:04 [ 2607.982948] LustreError: 14086:0:(llog_cat.c:712:llog_cat_cancel_recor
Comment by Mikhail Pershin [ 25/Jul/16 ]

I think I've found the reason, but there is no simple fix for that. I am thinking about proper solution.

Comment by Gerrit Updater [ 26/Jul/16 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/21509
Subject: LU-8422 llog: extended debug info
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 955096bddc204c5f2a34143e351ccf4cbbd9da8d

Comment by Mikhail Pershin [ 26/Jul/16 ]

this is debug patch to prove the issue reason.

Comment by Christopher Morrone [ 26/Jul/16 ]

Thanks, we reproduced the assertion with the 21509 debug patch applied. We hit the expanded LASSERTF, of course:

[ 1465.793319] LustreError: 13881:0:(llog_osd.c:167:llog_osd_pad()) ASSERTION( len >= LLOG_MIN_REC_SIZE && (len & 0x7) == 0 ) failed: wrong pad len 8, off 65528, index 3
[ 1465.793320] LustreError: 13881:0:(llog_osd.c:167:llog_osd_pad()) LBUG

I do not see the new "Failed write llog index" CERROR message.

Comment by Christopher Morrone [ 28/Jul/16 ]

Was that what you expected to see?

Comment by Mikhail Pershin [ 29/Jul/16 ]

Chris, yes, extended info in assertion was helpful. I've just updated this patch with more debug and few workarounds. Could you try it?

Comment by Christopher Morrone [ 29/Jul/16 ]

Yes, I will get that on the testbed today. Thanks.

Comment by Christopher Morrone [ 29/Jul/16 ]

It looks like there is a fair bit of conflict between your patch and the latest in LU-8370. Do you know that you two are working in the same area?

I'll need direction on which set of patches to try.

Comment by Gerrit Updater [ 30/Jul/16 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/21609
Subject: LU-8422 update: add more debug info for the ticket
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0b3881152f6b95f1fb04e559f6905c6f5636f230

Comment by Mikhail Pershin [ 30/Jul/16 ]

Chris, I've just updated patch to be applied over the one from the LU-8370 ticket. Try the patch in LU-8370 first, the apply mine or apply both at once.

Comment by Christopher Morrone [ 04/Aug/16 ]

Building power and cooling issues have been delaying work on the testbed, but I think those problems are worked out now.

We hit a NULL pointer dereference when running with just the updated LU-8370 patch (patch set 5). More info in that issue.

I can try with both patches applied if that seems worth while. What do you think?

Comment by Mikhail Pershin [ 05/Aug/16 ]

yes, please, let's try both patches. If LU-8370 will cause issues than it will be better to run just patch http://review.whamcloud.com/21509 that will be helpful too.

Comment by Christopher Morrone [ 18/Aug/16 ]

We haven't seen this in a while, since running the LU-7800 and LU-8370 patches starting at 2.8.0_0.0.llnlpreview.32.

Comment by Peter Jones [ 18/Aug/16 ]

Good news! Thanks for updating Chris

Generated at Sat Feb 10 02:17:24 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.