[LU-17314] ASSERTION( !list_empty(&req->rq_srv.sr_timed_list) ) failed:  Created: 26/Nov/23  Updated: 13/Jan/24  Resolved: 13/Jan/24

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.3
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Mahmoud Hanafi
Assignee: Hongchao Zhang
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16430 ASSERTION( !list_empty(&req->rq_srv.s... Resolved
Severity: 3

 Description   

 

We hit an LBUG in 2.15.3 that looks like a duplicate of LU-16430. If it is a duplicate, can we get a backport to 2.15.3?

 

[4508205.470593] LustreError: 185433:0:(service.c:1330:ptlrpc_at_remove_timed()) ASSERTION( !list_empty(&req->rq_srv.sr_timed_list) ) failed: 
[4508205.483191] LustreError: 185433:0:(service.c:1330:ptlrpc_at_remove_timed()) LBUG
[4508205.491175] Kernel panic - not syncing: LBUG
[4508205.495642] CPU: 33 PID: 185433 Comm: ldlm_cn07_020 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-477.10.1.el8_lustre.x86_64 #1
[4508205.508843] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 04/20/2023
[4508205.517587] Call Trace:
[4508205.520219]  dump_stack+0x41/0x60
[4508205.523726]  panic+0xe7/0x2ac
[4508205.526884]  ? ret_from_fork+0x1f/0x40
[4508205.530828]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]
[4508205.536004]  ptlrpc_at_remove_timed+0xc7/0xd0 [ptlrpc]
[4508205.541421]  ptlrpc_server_drop_request+0x11b/0x750 [ptlrpc]
[4508205.547361]  ? _raw_spin_lock+0x1e/0x30
[4508205.551391]  ptlrpc_server_handle_req_in+0x3c1/0x8d0 [ptlrpc]
[4508205.557419]  ptlrpc_main+0xbb9/0x1570 [ptlrpc]
[4508205.562134]  ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
[4508205.567460]  kthread+0x134/0x150
[4508205.570878]  ? set_kthread_struct+0x50/0x50
[4508205.575257]  ret_from_fork+0x1f/0x40
 


 Comments   
Comment by Peter Jones [ 26/Nov/23 ]

Hongchao

It looks like there is already a port of the LU-16430 fix - https://review.whamcloud.com/#/c/fs/lustre-release/+/52338/ . Does this issue match that one or is it something else?

Please advise

Peter

Comment by Hongchao Zhang [ 27/Nov/23 ]

It should be the same issue as LU-16430, which is caused by a racy modification of the "req.rq_obsolete" bit.
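
For readers unfamiliar with this class of bug, a racy bit modification can be sketched as a lost update: setting one bit is a load, modify, store sequence on the whole storage word, so two unsynchronized writers can overwrite each other. The example below is a generic illustration only, not the Lustre code and not the LU-16430 patch. The flag names and both functions are hypothetical, and the interleaving is replayed by hand in a single thread so the result is deterministic.

```c
#include <stdint.h>

/* Hypothetical flag bits sharing one storage word. They stand in for
 * neighbouring one-bit fields in a request structure; the names are
 * assumptions made for this illustration. */
#define FLAG_OBSOLETE 0x1u
#define FLAG_OTHER    0x2u

/* Replays an unsynchronized interleaving by hand: both "threads" load
 * the word before either one stores it, so the second store erases the
 * bit set by the first. */
uint8_t racy_interleaving(void)
{
    uint8_t word = 0;

    uint8_t a = word;       /* thread A loads the word */
    uint8_t b = word;       /* thread B loads the same stale value */

    a |= FLAG_OBSOLETE;     /* A sets its bit in a private copy */
    b |= FLAG_OTHER;        /* B sets its bit in a private copy */

    word = a;               /* A stores */
    word = b;               /* B stores and drops FLAG_OBSOLETE */

    return word;
}

/* Serialized version: each read-modify-write finishes before the next
 * one starts, as it would if both writers held the same lock. */
uint8_t serialized_interleaving(void)
{
    uint8_t word = 0;

    word |= FLAG_OBSOLETE;  /* writer A, complete update */
    word |= FLAG_OTHER;     /* writer B sees A's bit and keeps it */

    return word;
}
```

In the racy replay only FLAG_OTHER survives, while the serialized replay keeps both bits. A flag that silently disappears or reappears in this way can leave later code with a wrong picture of the request's state.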

Comment by Peter Jones [ 02/Dec/23 ]

Thanks, Hongchao. Mahmoud, have you been able to test whether the patch is effective?

Comment by Peter Jones [ 13/Jan/24 ]

The fix is included in 2.15.4.
