[LU-8422] llog_osd.c:165:llog_osd_pad()) ASSERTION( len >= (24) && (len & 0x7) == 0 ) Created: 20/Jul/16 Updated: 18/Aug/17 Resolved: 18/Aug/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Christopher Morrone | Assignee: | Mikhail Pershin |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | llnl | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
During 2.8.0 DNE testing (in particular tag 2.8.0_0.0.llnlpreview.25), we hit the following assertion on of the MDS nodes: 2016-07-20 09:39:28 [87775.729342] LustreError: 28402:0:(llog_osd.c:165:llog_osd_pad()) ASSERTION( len >= (24) && (len & 0x7) == 0 ) failed: 2016-07-20 09:39:28 [87775.741933] LustreError: 28402:0:(llog_osd.c:165:llog_osd_pad()) LBUG This is nothing else much happening in the console log except for regular occurrences of There is no available crash dump. No additional logs since the node LBUGs on assertion. |
| Comments |
| Comment by Christopher Morrone [ 21/Jul/16 ] |
|
This is now hitting frequently, sometimes within minutes of recovery completing on the MDS. This particular test system is hosed until we have a fix for this. |
| Comment by Christopher Morrone [ 21/Jul/16 ] |
|
And on the plus side, if you have any debug patches that you would like me to try, I can probably get you the results pretty quickly. |
| Comment by Christopher Morrone [ 22/Jul/16 ] |
|
We got crash dumps hacked into a basically working state. The full backtrace of the latest hit of this assertion looks like this: PID: 14429 TASK: ffff887f23d90b80 CPU: 12 COMMAND: "mdt03_004" #0 [ffff887ee35c35d0] machine_kexec at ffffffff810545db #1 [ffff887ee35c3630] crash_kexec at ffffffff810f7ec2 #2 [ffff887ee35c3700] panic at ffffffff816456b6 #3 [ffff887ee35c3780] lbug_with_loc at ffffffffa0858dfb [libcfs] #4 [ffff887ee35c37a0] llog_osd_pad at ffffffffa093b171 [obdclass] #5 [ffff887ee35c37f8] llog_osd_write_rec at ffffffffa09407d0 [obdclass] #6 [ffff887ee35c3898] llog_write_rec at ffffffffa092f97a [obdclass] #7 [ffff887ee35c38e0] llog_cat_add_rec at ffffffffa0934108 [obdclass] #8 [ffff887ee35c3930] llog_add at ffffffffa092ca8a [obdclass] #9 [ffff887ee35c3970] sub_updates_write at ffffffffa0c508cf [ptlrpc] #10 [ffff887ee35c3a08] top_trans_stop at ffffffffa0c3eb96 [ptlrpc] #11 [ffff887ee35c3a60] lod_trans_stop at ffffffffa0f27439 [lod] #12 [ffff887ee35c3ad8] mdd_trans_stop at ffffffffa0fb70ca [mdd] #13 [ffff887ee35c3ae8] mdd_unlink at ffffffffa0fa6b7a [mdd] #14 [ffff887ee35c3bb0] mdt_reint_unlink at ffffffffa0e9a470 [mdt] #15 [ffff887ee35c3c48] mdt_reint_rec at ffffffffa0e9d970 [mdt] #16 [ffff887ee35c3c70] mdt_reint_internal at ffffffffa0e80b01 [mdt] #17 [ffff887ee35c3ca8] mdt_reint at ffffffffa0e8a3c7 [mdt] #18 [ffff887ee35c3cd8] tgt_request_handle at ffffffffa0c2aa65 [ptlrpc] #19 [ffff887ee35c3d20] ptlrpc_server_handle_request at ffffffffa0bd61ab [ptlrpc] #20 [ffff887ee35c3de8] ptlrpc_main at ffffffffa0bda2f0 [ptlrpc] #21 [ffff887ee35c3ec8] kthread at ffffffff810a997f #22 [ffff887ee35c3f50] ret_from_fork at ffffffff8165d658 This time there was also some additional console traffic right before the assertion: 2016-07-21 16:45:44 [ 2587.821304] LustreError: 14086:0:(llog_cat.c:712:llog_cat_cancel_records()) lquake-MDT000f-osp-MDT0005: fail to cancel 1 of 1 llog-records: rc = -116 2016-07-21 16:45:49 [ 2592.856881] LustreError: 14086:0:(llog_cat.c:712:llog_cat_cancel_records()) lquake-MDT000a-osp-MDT0005: fail to cancel 1 of 1 llog-records: rc = -116 2016-07-21 16:45:54 [ 2597.877414] LustreError: 14086:0:(llog_cat.c:712:llog_cat_cancel_records()) lquake-MDT0001-osp-MDT0005: fail to cancel 1 of 1 llog-records: rc = -116 2016-07-21 16:45:54 [ 2597.893545] LustreError: 14086:0:(llog_cat.c:712:llog_cat_cancel_records()) Skipped 3 previous similar messages 2016-07-21 16:45:59 [ 2602.962942] LustreError: 14086:0:(llog_cat.c:712:llog_cat_cancel_records()) lquake-MDT000c-osp-MDT0005: fail to cancel 1 of 1 llog-records: rc = -116 2016-07-21 16:45:59 [ 2602.979124] LustreError: 14086:0:(llog_cat.c:712:llog_cat_cancel_records()) Skipped 6 previous similar messages 2016-07-21 16:46:04 [ 2607.966714] LustreError: 14086:0:(llog_cat.c:712:llog_cat_cancel_records()) lquake-MDT000c-osp-MDT0005: fail to cancel 1 of 1 llog-records: rc = -116 2016-07-21 16:46:04 [ 2607.982948] LustreError: 14086:0:(llog_cat.c:712:llog_cat_cancel_recor |
| Comment by Mikhail Pershin [ 25/Jul/16 ] |
|
I think I've found the reason, but there is no simple fix for that. I am thinking about proper solution. |
| Comment by Gerrit Updater [ 26/Jul/16 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/21509 |
| Comment by Mikhail Pershin [ 26/Jul/16 ] |
|
this is debug patch to prove the issue reason. |
| Comment by Christopher Morrone [ 26/Jul/16 ] |
|
Thanks, we reproduced the assertion with the 21509 debug patch applied. We hit the expanded LASSERTF, of course: [ 1465.793319] LustreError: 13881:0:(llog_osd.c:167:llog_osd_pad()) ASSERTION( len >= LLOG_MIN_REC_SIZE && (len & 0x7) == 0 ) failed: wrong pad len 8, off 65528, index 3 [ 1465.793320] LustreError: 13881:0:(llog_osd.c:167:llog_osd_pad()) LBUG I do not see the new "Failed write llog index" CERROR message. |
| Comment by Christopher Morrone [ 28/Jul/16 ] |
|
Was that what you expected to see? |
| Comment by Mikhail Pershin [ 29/Jul/16 ] |
|
Chris, yes, extended info in assertion was helpful. I've just updated this patch with more debug and few workarounds. Could you try it? |
| Comment by Christopher Morrone [ 29/Jul/16 ] |
|
Yes, I will get that on the testbed today. Thanks. |
| Comment by Christopher Morrone [ 29/Jul/16 ] |
|
It looks like there is a fair bit of conflict between your patch and the latest in I'll need direction on which set of patches to try. |
| Comment by Gerrit Updater [ 30/Jul/16 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/21609 |
| Comment by Mikhail Pershin [ 30/Jul/16 ] |
|
Chris, I've just updated patch to be applied over the one from the |
| Comment by Christopher Morrone [ 04/Aug/16 ] |
|
Building power and cooling issues have been delaying work on the testbed, but I think those problems are worked out now. We hit a NULL pointer dereference when running with just the updated I can try with both patches applied if that seems worth while. What do you think? |
| Comment by Mikhail Pershin [ 05/Aug/16 ] |
|
yes, please, let's try both patches. If |
| Comment by Christopher Morrone [ 18/Aug/16 ] |
|
We haven't seen this in a while, since running the |
| Comment by Peter Jones [ 18/Aug/16 ] |
|
Good news! Thanks for updating Chris |