[LU-15681] crash in lnet_process_id_hash() Created: 23/Mar/22  Updated: 24/Mar/22  Resolved: 24/Mar/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Lukasz Flis Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

CentOS Linux release 8.5.2111
Kernel: 4.18.0-348.7.1.el8_5.x86_64


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Dear Devs,

During heavy workloads we are experiencing a kernel crash caused by a page fault in lnet_process_id_hash():

<pre>

[  520.767199] BUG: unable to handle kernel paging request at 00000000deadbf1f
[  520.775831] PGD 0 P4D 0 
[  520.779875] Oops: 0000 [#1] SMP NOPTI
[  520.785037] CPU: 10 PID: 492691 Comm: ll_ost00_016 Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-348.7.1.el8_5.x86_64 #1
[  520.800422] Hardware name: HPE ProLiant DL325 Gen10 Plus/ProLiant DL325 Gen10 Plus, BIOS A43 12/03/2021
[  520.812168] RIP: 0010:lnet_process_id_hash+0x5/0x50 [ptlrpc]
[  520.820123] Code: 7e 28 39 7a 0c 75 d4 8b 7e 2c 39 7a 10 75 cc 8b 46 30 39 42 14 0f 94 c0 0f b6 c0 8d 44 40 fd c3 0f 1f 44 00 00 0f 1f 44 00 00 <33> 57 14 be ff ff ff ff 69 ca 47 86 c8 61 48 85 ff 74 18 0f b6 47
[  520.842104] RSP: 0018:ffffaa79b40c3be0 EFLAGS: 00010202
[  520.848959] RAX: ffffffffc1a36690 RBX: 5a5a5a5a5a5a5a5a RCX: 00000000deadbeef
[  520.857883] RDX: 000000000cdd1d51 RSI: 0000000000000001 RDI: 00000000deadbf0b
[  520.866635] RBP: ffffaa79b40c3c70 R08: ffff8ac3fecaabf8 R09: 00000000000003e8
[  520.875303] R10: 0000000000000000 R11: ffff8ac3feca8ec4 R12: ffff8a4e59d5c000
[  520.884015] R13: ffffffffc1baa580 R14: fffffffffffffff0 R15: 00000000deadbeef
[  520.893240] FS:  0000000000000000(0000) GS:ffff8ac3fec80000(0000) knlGS:0000000000000000
[  520.903294] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  520.911071] CR2: 00000000deadbf1f CR3: 0000000c407c6000 CR4: 0000000000350ee0
[  520.920020] Call Trace:
[  520.924182]  ptlrpc_connection_get+0x27f/0x920 [ptlrpc]
[  520.931034]  target_handle_connect+0x6de/0x29d0 [ptlrpc]
[  520.937816]  ? internal_add_timer+0x42/0x60
[  520.943593]  tgt_request_handle+0x565/0x1a40 [ptlrpc]
[  520.950382]  ? ptlrpc_nrs_req_get_nolock0+0xfb/0x1f0 [ptlrpc]
[  520.957780]  ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
[  520.965373]  ptlrpc_main+0xc06/0x1560 [ptlrpc]
[  520.971430]  ? __schedule+0x2c5/0x760
[  520.976758]  ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
[  520.983264]  kthread+0x116/0x130
[  520.987811]  ? kthread_flush_work_fn+0x10/0x10
[  520.993636]  ret_from_fork+0x22/0x40

</pre>

 Comments   
Comment by Etienne Aujames [ 23/Mar/22 ]

Hi,
This is probably a duplicate of LU-15634.
You could try the patch at https://review.whamcloud.com/46763/, "LU-15634 ptlrpc: Use after free of 'conn' in rhashtable retry" (landed on master).

Comment by Lukasz Flis [ 24/Mar/22 ]

@Etienne Aujames - thank you very much for pointing this one out. This fixed the problem in 2.15.0_RC2.
Tested and confirmed.

Comment by Peter Jones [ 24/Mar/22 ]

Yes, thanks eaujames. lflis, please be sure to include the exact build being used in any future issues reported while running pre-release code; otherwise it is not clear which code is being run.
