[LU-16245] __osd_init_iobuf()) ASSERTION( iobuf->dr_elapsed_valid == 0 ) Created: 18/Oct/22 Updated: 29/Nov/22 Due: 18/Oct/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jason Feng | Assignee: | Jason Feng |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
lustre servers: lustre clients: IO500 tag:io500-sc21 |
||
| Issue Links: |
|
||||||||
| Epic/Theme: | Performance | ||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
[ 8753.247529] LustreError: 37772:0:(osd_io.c:79:__osd_init_iobuf()) ASSERTION( iobuf->dr_elapsed_valid == 0 ) failed: iobuf 000000006eba9531, reqs 0, rw 1, line 1633
[ 8753.262771] LustreError: 37772:0:(osd_io.c:79:__osd_init_iobuf()) LBUG
[ 8753.269970] Pid: 37772, comm: mdt_io05_022 5.10.0-60.18.0.50.aarch64 #1 SMP Wed Oct 5 10:58:08 CST 2022
[ 8753.280021] Call Trace TBD:
[ 8753.283505] Kernel panic - not syncing: LBUG
[ 8753.288454] CPU: 59 PID: 37772 Comm: mdt_io05_022 Kdump: loaded Tainted: P OE 5.10.0-60.18.0.50.aarch64 #1
[ 8753.299963] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDDA, BIOS 1.38 07/04/2020
[ 8753.308881] Call trace:
[ 8753.312014] dump_backtrace+0x0/0x1e0
[ 8753.316352] show_stack+0x20/0x30
[ 8753.320347] dump_stack+0xe0/0x148
[ 8753.324426] panic+0x170/0x398
[ 8753.328188] param_set_delay_minmax.isra.1+0x0/0xd0 [libcfs]
[ 8753.334552] __osd_init_iobuf+0x2e8/0x408 [osd_ldiskfs]
[ 8753.340454] osd_write_prep+0xec/0x330 [osd_ldiskfs]
[ 8753.346149] mdt_obd_preprw+0xaa0/0xc38 [mdt]
[ 8753.351294] tgt_brw_write+0x1208/0x2f30 [ptlrpc]
[ 8753.351367] tgt_handle_request0+0xd4/0x9b0 [ptlrpc]
[ 8753.362369] tgt_request_handle+0x7cc/0x1a30 [ptlrpc]
[ 8753.368148] ptlrpc_server_handle_request+0x3bc/0x1218 [ptlrpc]
[ 8753.374791] ptlrpc_main+0xdfc/0x16c8 [ptlrpc]
[ 8753.379910] kthread+0x130/0x138
[ 8753.383818] ret_from_fork+0x10/0x18
[ 8753.388121] SMP: stopping secondary CPUs
[ 8753.395179] Starting crashdump kernel...
[ 8753.399781] Bye! |
| Comments |
| Comment by Jason Feng [ 18/Oct/22 ] |
|
Fix: do not modify dr_elapsed_valid once osd_fini_iobuf has been invoked. The initial value of dr_elapsed_valid is 0. When the I/O completes, dio_complete_routine sets dr_elapsed_valid to 1, and osd_fini_iobuf finally clears it. In the write path, however, wait_event is not called, so there is no guarantee that dio_complete_routine runs before osd_fini_iobuf. If osd_fini_iobuf runs first, dio_complete_routine sets dr_elapsed_valid to 1 afterwards, the flag is never cleared, and the assertion in __osd_init_iobuf fails when the iobuf is reused. |
| Comment by Gerrit Updater [ 18/Oct/22 ] |
|
"fengchunsong <fengchunsong@huawei.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48905 |
| Comment by Xinliang Liu [ 29/Nov/22 ] |
|
I suspect this issue is similar to … (see the problem of nested sleeping primitives described here: https://lwn.net/Articles/628628/). We might need to fix this issue like … |