[LU-16245] __osd_init_iobuf()) ASSERTION( iobuf->dr_elapsed_valid == 0 ) Created: 18/Oct/22  Updated: 29/Nov/22  Due: 18/Oct/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.1
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Jason Feng Assignee: Jason Feng
Resolution: Unresolved Votes: 0
Labels: None
Environment:

lustre servers:
10 nodes, each node has kunpeng920 96core *2, memory 512GB, nvme 3.2T*4
centos 8.4.2105
kernel 5.10.0-60.18.0.50.aarch64 (openeuler 22.03 kernel)
lustre 0c68b13a5eeb408862bad795aaf9a24a11a14b6a

lustre clients:
10 nodes intel 6266C*2, memory 372GB
centos 8.4.2105
kernel 4.18.0-372.9.1.el8.x86_64

IO500 tag:io500-sc21


Issue Links:
Related
is related to LU-16246 NULL pointer at lod_lookup+0x24/0x38 Open
Epic/Theme: Performance
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
[ 8753.247529] LustreError: 37772:0:(osd_io.c:79:__osd_init_iobuf()) ASSERTION( iobuf->dr_elapsed_valid == 0 ) failed: iobuf 000000006eba9531, reqs 0, rw 1, line 1633
[ 8753.262771] LustreError: 37772:0:(osd_io.c:79:__osd_init_iobuf()) LBUG
[ 8753.269970] Pid: 37772, comm: mdt_io05_022 5.10.0-60.18.0.50.aarch64 #1 SMP Wed Oct 5 10:58:08 CST 2022
[ 8753.280021] Call Trace TBD:
[ 8753.283505] Kernel panic - not syncing: LBUG
[ 8753.288454] CPU: 59 PID: 37772 Comm: mdt_io05_022 Kdump: loaded Tainted: P           OE     5.10.0-60.18.0.50.aarch64 #1
[ 8753.299963] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDDA, BIOS 1.38 07/04/2020
[ 8753.308881] Call trace:
[ 8753.312014]  dump_backtrace+0x0/0x1e0
[ 8753.316352]  show_stack+0x20/0x30
[ 8753.320347]  dump_stack+0xe0/0x148
[ 8753.324426]  panic+0x170/0x398
[ 8753.328188]  param_set_delay_minmax.isra.1+0x0/0xd0 [libcfs]
[ 8753.334552]  __osd_init_iobuf+0x2e8/0x408 [osd_ldiskfs]
[ 8753.340454]  osd_write_prep+0xec/0x330 [osd_ldiskfs]
[ 8753.346149]  mdt_obd_preprw+0xaa0/0xc38 [mdt]
[ 8753.351294]  tgt_brw_write+0x1208/0x2f30 [ptlrpc]
[ 8753.351367]  tgt_handle_request0+0xd4/0x9b0 [ptlrpc]
[ 8753.362369]  tgt_request_handle+0x7cc/0x1a30 [ptlrpc]
[ 8753.368148]  ptlrpc_server_handle_request+0x3bc/0x1218 [ptlrpc]
[ 8753.374791]  ptlrpc_main+0xdfc/0x16c8 [ptlrpc]
[ 8753.379910]  kthread+0x130/0x138
[ 8753.383818]  ret_from_fork+0x10/0x18
[ 8753.388121] SMP: stopping secondary CPUs
[ 8753.395179] Starting crashdump kernel...
[ 8753.399781] Bye!


 Comments   
Comment by Jason Feng [ 18/Oct/22 ]

Do not modify dr_elapsed_valid if osd_fini_iobuf has been invoked.

The initial value of dr_elapsed_valid is 0. When the I/O completes, dio_complete_routine sets dr_elapsed_valid to 1, and osd_fini_iobuf finally clears it. In the I/O write path, however, wait_event is not called, so nothing guarantees that dio_complete_routine runs before osd_fini_iobuf. If it runs afterwards, dr_elapsed_valid is set to 1 after the clear, is never cleared again, and trips the assertion when the iobuf is reused.
With the proposed fix, dr_elapsed_valid still starts at 0, but osd_fini_iobuf sets it to 2. dio_complete_routine changes it to 1 only if it is still 0, so the flag can no longer be modified after osd_fini_iobuf has run.
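
A minimal user-space sketch of the three-state flag described above, for illustration only. It is not the code from the patch. The sketch_* functions, the ELAPSED_* names, and the use of C11 atomics with a compare-and-swap are assumptions; the ticket only gives the values 0, 1 and 2. How the patched __osd_init_iobuf treats the value 2 is also an assumption here (it is accepted and reset to 0).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names; the ticket only gives the numeric values. */
enum { ELAPSED_UNSET = 0, ELAPSED_VALID = 1, ELAPSED_FINISHED = 2 };

/* Stand-in for the single field of the real iobuf that matters here. */
struct sketch_iobuf {
	atomic_int dr_elapsed_valid;
};

/*
 * Models the check in __osd_init_iobuf(). Returns false where the real
 * code would hit the assertion. Assumption: a finished iobuf (value 2)
 * may be reused and is reset to 0.
 */
static bool sketch_init_iobuf(struct sketch_iobuf *iobuf)
{
	if (atomic_load(&iobuf->dr_elapsed_valid) == ELAPSED_VALID)
		return false; /* stale flag left by a late completion */
	atomic_store(&iobuf->dr_elapsed_valid, ELAPSED_UNSET);
	return true;
}

/*
 * Models dio_complete_routine(): mark the elapsed time valid only if
 * the flag is still 0. Returns true if the flag was changed.
 */
static bool sketch_dio_complete(struct sketch_iobuf *iobuf)
{
	int expected = ELAPSED_UNSET;

	return atomic_compare_exchange_strong(&iobuf->dr_elapsed_valid,
					      &expected, ELAPSED_VALID);
}

/*
 * Models osd_fini_iobuf(): move the flag to 2. Returns true if a valid
 * elapsed time was recorded before the iobuf was finished.
 */
static bool sketch_fini_iobuf(struct sketch_iobuf *iobuf)
{
	return atomic_exchange(&iobuf->dr_elapsed_valid,
			       ELAPSED_FINISHED) == ELAPSED_VALID;
}

/* Replays the ordering from the crash: fini first, completion late. */
static void sketch_demo(void)
{
	struct sketch_iobuf iobuf;

	atomic_init(&iobuf.dr_elapsed_valid, ELAPSED_UNSET);
	assert(sketch_init_iobuf(&iobuf));
	assert(!sketch_fini_iobuf(&iobuf));   /* no completion seen yet */
	assert(!sketch_dio_complete(&iobuf)); /* late completion ignored */
	assert(sketch_init_iobuf(&iobuf));    /* reuse no longer trips */
}
```

Calling sketch_demo() from a main() replays the failing order. With the original two-state flag, the late completion would have stored 1 after the clear and the last check would fail.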

Comment by Gerrit Updater [ 18/Oct/22 ]

"fengchunsong <fengchunsong@huawei.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48905
Subject: LU-16245 osd-ldiskfs: prevent dr_elapsed_valid assertion
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5bc624a9c930f5dfd38b62eb661b706c418682e0

Comment by Xinliang Liu [ 29/Nov/22 ]

I suspect this issue is similar to LU-12362. Nested sleeping primitives might lead to an infinite wait, so that osd_fini_iobuf() is never called, which causes this crash.

For the problem of nested sleeping primitives, see https://lwn.net/Articles/628628/. We might need to fix this issue the same way as LU-12362.

Generated at Sat Feb 10 03:25:18 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.