Details
Bug
Resolution: Cannot Reproduce
Minor
None
Lustre 2.4.0
3
8082
Description
On a SPARC machine, "./llmount.sh" hit this assertion failure:
LustreError: 20452:0:(llog_osd.c:630:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:
LustreError: 20452:0:(llog_osd.c:630:llog_osd_next_block()) LBUG
Pid: 20452, comm: ll_mgs_0002
Call Trace:
Kernel panic - not syncing: LBUG
Call Trace:
 [00000000103a3194] lbug_with_loc+0x94/0xc0 [libcfs]
 [0000000010535fbc] llog_osd_next_block+0xb5c/0x1000 [obdclass]
 [00000000104f39d0] llog_process_thread+0x2b0/0x13a0 [obdclass]
 [00000000104f4cdc] llog_process_or_fork+0x21c/0x980 [obdclass]
 [000000001090a140] mgs_steal_llog_for_mdt_from_client+0x5e0/0xae0 [mgs]
 [000000001090b120] mgs_write_log_mdt+0xae0/0x3a60 [mgs]
 [00000000109262f8] mgs_write_log_target+0x798/0x20a0 [mgs]
 [00000000108ea624] mgs_handle_target_reg+0xd44/0x17c0 [mgs]
 [00000000108edab8] mgs_handle+0xd18/0x22a0 [mgs]
 [00000000106f5f60] ptlrpc_server_handle_request+0x980/0x16c0 [ptlrpc]
 [00000000106fc130] ptlrpc_main+0xa10/0x1680 [ptlrpc]
 [000000000042ad88] kernel_thread+0x30/0x48
 [00000000103aea44] cfs_create_thread+0x24/0x60 [libcfs]
Press Stop-A (L1-A) to return to the boot prom
The llog_osd_next_block() lines in question are:

    tail = (struct llog_rec_tail *)((char *)buf + rc -
                                    sizeof(struct llog_rec_tail));
    /* get the last record in block */
    last_rec = (struct llog_rec_hdr *)((char *)buf + rc -
                                       le32_to_cpu(tail->lrt_len));

    if (LLOG_REC_HDR_NEEDS_SWABBING(last_rec))
            lustre_swab_llog_rec(last_rec);

    LASSERT(last_rec->lrh_index == tail->lrt_index);
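The le32_to_cpu() call above assumes that the tail is stored in little-endian byte order. That is not true, however: configuration logs (and at least the OSP logs as well) are written in host byte order, which is big-endian on SPARC Linux. On such a host, le32_to_cpu() byte-swaps a value that is already in host order, so the decoded lrt_len, and therefore last_rec, is wrong.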
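To make the failure mode concrete, here is a minimal user-space sketch. It assumes a simplified record layout modeled on llog_rec_hdr and llog_rec_tail (a 16-byte header holding length, index, type and id, and an 8-byte tail repeating the length and index); all ex_* names and the byte helpers are hypothetical. It writes records in big-endian order, as a SPARC host does in host order, and then locates the last record the way llog_osd_next_block() does, decoding the tail length either as little-endian (what le32_to_cpu() assumes) or as big-endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified layout modeled on llog_rec_hdr/llog_rec_tail (an assumption,
 * not the real definitions): each record is [header][body][tail]. */
#define EX_HDR_SIZE  16u /* len, index, type, id */
#define EX_TAIL_SIZE 8u  /* len, index */

static void put_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

static uint32_t get_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint32_t get_le32(const unsigned char *p)
{
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8) | (uint32_t)p[0];
}

/* Append a record at offset off, encoded big-endian the way a SPARC host
 * writes it in host order; returns the offset just past the record. */
static size_t ex_append_be_record(unsigned char *buf, size_t off,
                                  uint32_t len, uint32_t index, uint32_t type)
{
    assert(len >= EX_HDR_SIZE + EX_TAIL_SIZE);
    memset(buf + off, 0, len);
    put_be32(buf + off, len);
    put_be32(buf + off + 4, index);
    put_be32(buf + off + 8, type);
    put_be32(buf + off + len - EX_TAIL_SIZE, len);
    put_be32(buf + off + len - EX_TAIL_SIZE + 4, index);
    return off + len;
}

/* Locate the last record from the tail at the end of the rc valid bytes,
 * mirroring the pointer arithmetic in llog_osd_next_block(). Returns
 * SIZE_MAX when the decoded length cannot be a record in this buffer. */
static size_t ex_last_rec_offset(const unsigned char *buf, size_t rc,
                                 int tail_is_le)
{
    const unsigned char *tail;
    uint32_t len;

    assert(rc >= EX_TAIL_SIZE);
    tail = buf + rc - EX_TAIL_SIZE;
    len = tail_is_le ? get_le32(tail) : get_be32(tail);
    if (len < EX_HDR_SIZE + EX_TAIL_SIZE || len > rc)
        return SIZE_MAX;
    return rc - len;
}
```

For example, after appending a 64-byte and a 48-byte record (rc = 112), decoding the last tail as big-endian yields offset 64, the start of the second record, whereas decoding it as little-endian turns 48 into 0x30000000, which cannot fit in the buffer. The sketch rejects that value; the kernel code has no such check and goes on to read a header at the wrong address, which is consistent with the LASSERT firing.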
It is not clear what the endianness rule should be. The comment above the definition of llog_rec_hdr requires little-endian byte order, while the LLOG_REC_HDR_NEEDS_SWABBING() calls and the log-writing code suggest host byte order (or adaptive byte order, where the reader detects the writer's order and swaps as needed). Enforcing little-endian order would require more extensive changes, while host byte order makes it impossible to find the index of the last record in a chunk in O(1) time, since a record header must be read first to determine the byte order.
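The adaptive scheme implied by LLOG_REC_HDR_NEEDS_SWABBING() can be sketched as below, again with hypothetical ex_* names and with constants modeled on (not copied from) Lustre's LLOG_OP_MAGIC and LLOG_OP_MASK. A header can be recognized in either byte order because its type field carries a magic value; the tail has no such marker, so its byte order has to be borrowed from a header that was decoded first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative constants modeled on LLOG_OP_MAGIC/LLOG_OP_MASK: the top
 * 12 bits of every record type carry a fixed magic. Verify the real
 * values in the tree before relying on them. */
#define EX_OP_MAGIC 0x10600000u
#define EX_OP_MASK  0xfff00000u

struct ex_rec_hdr {
    uint32_t len, index, type, id;
};

static uint32_t ex_swab32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Intended to mirror LLOG_REC_HDR_NEEDS_SWABBING(): the magic matches its
 * byte-swapped form only if the writer had the opposite byte order. */
static bool ex_needs_swabbing(uint32_t raw_type)
{
    return (raw_type & ex_swab32(EX_OP_MASK)) == ex_swab32(EX_OP_MAGIC);
}

/* Decode a 16-byte header written in either byte order into host order.
 * Returns true if the fields had to be swapped. */
static bool ex_decode_hdr(const unsigned char *p, struct ex_rec_hdr *h)
{
    assert(p != NULL && h != NULL);
    memcpy(h, p, sizeof(*h));
    if (!ex_needs_swabbing(h->type))
        return false;
    h->len = ex_swab32(h->len);
    h->index = ex_swab32(h->index);
    h->type = ex_swab32(h->type);
    h->id = ex_swab32(h->id);
    return true;
}

/* The tail carries no magic, so the caller must pass the byte order that
 * was detected from a header. */
static uint32_t ex_decode_tail_len(const unsigned char *tail, bool swab)
{
    uint32_t v;

    memcpy(&v, tail, sizeof(v));
    return swab ? ex_swab32(v) : v;
}
```

With this scheme the reader can handle logs written in either byte order, at the cost described above. Enforcing little-endian order instead would let the tail be decoded on its own with a fixed conversion such as le32_to_cpu().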
Issue Links
is related to: LU-6968 Update the whole header in llog_cancel_rec() (Resolved)
If we have anything like what's described in the gcc manual, that's our bug.
We should NOT mix packed and non-packed structures in the same structure.
Overall, packed structures are for things on the wire/on disk. Even there we can do without them if we carefully lay out the structures ourselves.
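Doing our own layout instead of relying on packed can be sketched as follows. The structure and field names are hypothetical; the idea is to use only fixed-width types, ordered so that no implicit padding appears, to make any hole an explicit field, and to let the compiler check the resulting offsets.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Hypothetical on-disk/wire record laid out by hand instead of with
 * __attribute__((packed)): fixed-width fields ordered from the widest
 * to the narrowest, with the trailing hole spelled out as padding. */
struct ex_disk_rec {
    uint64_t edr_offset;  /* bytes 0..7 */
    uint32_t edr_len;     /* bytes 8..11 */
    uint16_t edr_flags;   /* bytes 12..13 */
    uint16_t edr_padding; /* bytes 14..15, always written as zero */
};

/* Break the build if the compiler's layout differs from the intended one
 * (kernel code would typically use BUILD_BUG_ON() for the same purpose). */
static_assert(sizeof(struct ex_disk_rec) == 16, "unexpected size");
static_assert(offsetof(struct ex_disk_rec, edr_len) == 8, "edr_len moved");
static_assert(offsetof(struct ex_disk_rec, edr_flags) == 12, "edr_flags moved");
static_assert(offsetof(struct ex_disk_rec, edr_padding) == 14, "edr_padding moved");
```

An explicit layout only pins down sizes and offsets; byte order still has to be handled separately, for example with explicit little-endian conversions on every field.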
Any bad structures like that which we have right now (esp. in OSD) need to be fixed by yesterday, so that 2.4.0 has all of those changes and there is no protocol breakage going forward.
So I am expecting a patch real soon.