[LU-9848] LBUG: ASSERTION( len >= (24) && (len & 0x7) == 0 ) failed Created: 08/Aug/17 Updated: 18/Sep/17 Resolved: 28/Aug/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.10.1, Lustre 2.11.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Olaf Faaland | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Environment: | Linux version 3.10.0-514.26.2.1chaos.ch6_1.x86_64 |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
LustreError: 43470:0:(llog_osd.c:165:llog_osd_pad()) ASSERTION( len >= (24) && (len & 0x7) == 0 ) failed:
LustreError: 43470:0:(llog_osd.c:165:llog_osd_pad()) LBUG
Pid: 43470, comm: mdt00_081
Kernel panic - not syncing: LBUG
CPU: 2 PID: 43470 Comm: mdt00_081 Tainted: P OE ------------ 3.10.0-514.26.2.1chaos.ch6_1.x86_64 #1
Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
 ffffffffa0c50e0f 000000005a5d14f4 ffff887dd2457700 ffffffff8169d4bc
 ffff887dd2457780 ffffffff816966ff ffffffff00000008 ffff887dd2457790
 ffff887dd2457730 000000005a5d14f4 ffffffffa0d861e7 0000000000000246
Call Trace:
 [<ffffffff8169d4bc>] dump_stack+0x19/0x1b
 [<ffffffff816966ff>] panic+0xe3/0x1f2
 [<ffffffff810a6ac0>] ? call_usermodehelper_freeinfo+0x20/0x30
 [<ffffffffa0c34deb>] lbug_with_loc+0xab/0xc0 [libcfs]
 [<ffffffffa0d1727a>] llog_osd_pad+0x3ca/0x440 [obdclass]
 [<ffffffffa0d19967>] llog_osd_write_rec+0xe87/0x14d0 [obdclass]
 [<ffffffffa0d0b8da>] llog_write_rec+0xaa/0x280 [obdclass]
 [<ffffffffa0d100c0>] llog_cat_add_rec+0x210/0x8e0 [obdclass]
 [<ffffffffa0d08a3a>] llog_add+0x7a/0x1a0 [obdclass]
 [<ffffffffa1029f7c>] ? sub_updates_write+0x7f6/0xef8 [ptlrpc]
 [<ffffffffa102a373>] sub_updates_write+0xbed/0xef8 [ptlrpc]
 [<ffffffffa101899f>] top_trans_stop+0x62f/0x970 [ptlrpc]
 [<ffffffffa134c399>] lod_trans_stop+0x259/0x340 [lod]
 [<ffffffffa13c0c32>] ? mdd_links_rename+0x312/0x5d0 [mdd]
 [<ffffffffa13daafd>] mdd_trans_stop+0x1d/0x25 [mdd]
 [<ffffffffa13c5c18>] mdd_link+0x2e8/0x930 [mdd]
 [<ffffffffa0fa42d2>] ? lustre_msg_get_versions+0x22/0xf0 [ptlrpc]
 [<ffffffffa1296b6e>] mdt_reint_link+0xade/0xc30 [mdt]
 [<ffffffff8132f4d2>] ? strlcpy+0x42/0x60
 [<ffffffffa1298ef0>] mdt_reint_rec+0x80/0x210 [mdt]
 [<ffffffffa127bdf1>] mdt_reint_internal+0x5e1/0x990 [mdt]
 [<ffffffffa1285a07>] mdt_reint+0x67/0x140 [mdt]
 [<ffffffffa1004425>] tgt_request_handle+0x915/0x1320 [ptlrpc]
 [<ffffffffa0fb0e7b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
 [<ffffffffa0c41748>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
 [<ffffffffa0faea4b>] ? ptlrpc_wait_event+0xab/0x350 [ptlrpc]
 [<ffffffff810c8aa2>] ? default_wake_function+0x12/0x20
 [<ffffffff810bdb18>] ? __wake_up_common+0x58/0x90
 [<ffffffffa0fb4f20>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
 [<ffffffff8102a569>] ? __switch_to+0xd9/0x4e0
 [<ffffffffa0fb4490>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
 [<ffffffff810b3a0f>] kthread+0xcf/0xe0
Appears to be regression of |
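For readers unfamiliar with the failed check, the condition in the assertion text can be restated as a small standalone predicate. This is only a sketch of what the logged expression tests, not the code in llog_osd.c: the names MIN_PAD_LEN and pad_len_is_valid are hypothetical, and the value 24 is taken directly from the assertion message. The length must be at least 24 bytes and a multiple of 8 (its low three bits clear).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical name; the value 24 is copied from the assertion message. */
#define MIN_PAD_LEN 24

/*
 * Restates the logged condition: len >= (24) && (len & 0x7) == 0.
 * True only when len is at least 24 and 8-byte aligned.
 * Illustrative helper, not a function from the Lustre source.
 */
static bool pad_len_is_valid(int len)
{
    return len >= MIN_PAD_LEN && (len & 0x7) == 0;
}
```

Under this predicate 24 and 32 pass, while 16 (too short) and 28 (not a multiple of 8) fail. The LBUG above means llog_osd_pad() was called with a length that fails this check.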
| Comments |
| Comment by Olaf Faaland [ 09/Aug/17 ] |
|
Intel has access to the patch stack we are running. See the lustre-release-fe-llnl project in gerrit. |
| Comment by Giuseppe Di Natale (Inactive) [ 10/Aug/17 ] |
|
I'm able to reproduce this with a sizable simul run that only runs the symlink tests. Something similar to the following hits the issue within minutes on our testbed:
srun -N 72 -n $((72*36)) ./simul -d /p/lquake/dinatale/SIMUL -V 1 -n 20 -i "16,36,38,39,12,18,19,32" |
| Comment by Peter Jones [ 10/Aug/17 ] |
|
Lai,
Can you please advise?
Thanks,
Peter |
| Comment by Olaf Faaland [ 10/Aug/17 ] |
|
Lai, we found that the directory used by the test, /p/lquake/dinatale/SIMUL/, is sharded (DNE2). We only run DNE1 in production, which may explain why we suddenly started encountering the error. We have no plans at this time to use DNE2 in production, so this may not need attention at all. We will re-test with DNE1 directories and update this ticket. |
| Comment by Giuseppe Di Natale (Inactive) [ 11/Aug/17 ] |
|
I ran the test on DNE1 directories and we don't hit the assertion. We can leave this ticket open since I believe the assert is still a problem, but we won't be hitting this in production. |
| Comment by Peter Jones [ 11/Aug/17 ] |
|
What would be interesting to know is whether this issue still hits on the latest master (or at least 2.10.x) |
| Comment by Giuseppe Di Natale (Inactive) [ 14/Aug/17 ] |
|
Peter, are you able to run those tests on your test hardware? |
| Comment by Gerrit Updater [ 15/Aug/17 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/28554 |
| Comment by Lai Siyao [ 15/Aug/17 ] |
|
This is a duplicate of |
| Comment by Gerrit Updater [ 28/Aug/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28554/ |
| Comment by Peter Jones [ 28/Aug/17 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 28/Aug/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28762 |
| Comment by Gerrit Updater [ 06/Sep/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28762/ |