[LU-17076] ptlrpc_nrs_req_stop_nolock() use after free with ORR NRS policy Created: 01/Sep/23 Updated: 13/Nov/23 Resolved: 16/Oct/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Mikhail Pershin | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
sanityn.sh test 77c crashed as below: [13898.279218] Lustre: DEBUG MARKER: == sanityn test 77c: check ORR NRS policy ================ 11:42:47 (1693482167) [13898.512652] Lustre: DEBUG MARKER: lctl set_param ost.OSS.ost_io.nrs_policies=orr ost.OSS.*.nrs_orr_quantum=1 ost.OSS.*.nrs_orr_offset_type=physical ost.OSS.*.nrs_orr_supported=reads [13900.571898] Lustre: DEBUG MARKER: lctl set_param ost.OSS.*.nrs_orr_supported=writes ost.OSS.*.nrs_orr_quantum=64 [13903.067814] Lustre: DEBUG MARKER: lctl set_param ost.OSS.*.nrs_orr_supported=reads_and_writes ost.OSS.*.nrs_orr_offset_type=logical [13905.203004] general protection fault, probably for non-canonical address 0x5a5a5a5a5a5a5a5a: 0000 [#1] SMP PTI [13905.205028] CPU: 1 PID: 250894 Comm: ll_ost_io00_027 Kdump: loaded Tainted: P OE --------- - - 4.18.0-477.15.1.el8_lustre.x86_64 #1 [13905.207489] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [13905.208620] RIP: 0010:ptlrpc_nrs_req_stop_nolock+0x34/0x140 [ptlrpc] [13905.210443] Code: 00 00 a8 04 75 05 e9 0b 08 f0 c1 53 a8 01 74 65 a8 08 0f 85 8d 00 00 00 8b 87 08 01 00 00 48 8b 84 c7 f8 00 00 00 48 8b 58 08 <48> 8b 43 68 48 8b 40 20 48 8b 40 50 48 85 c0 74 0f 48 8d b7 f8 00 [13905.213924] RSP: 0018:ffff9c0ec3513da8 EFLAGS: 00010246 [13905.214961] RAX: ffff8ecb454357d0 RBX: 5a5a5a5a5a5a5a5a RCX: 0000000000000000 [13905.216321] RDX: 0000000000000001 RSI: ffff8ecb309ca400 RDI: ffff8ecb309ca400 [13905.217685] RBP: ffff8ecb05c4d8e8 R08: 00000000000000b8 R09: abcc77118461cefd [13905.219044] R10: ffff9c0ec3513d88 R11: ffff8ecb34982646 R12: ffff8ecb05c4d800 [13905.220422] R13: ffff8ecb43bfb200 R14: 00000000000089b6 R15: ffff8ecb309ca400 [13905.221866] FS: 0000000000000000(0000) GS:ffff8ecbbfd00000(0000) knlGS:0000000000000000 [13905.223781] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [13905.224902] CR2: 00007ffc748ef221 CR3: 0000000076810004 CR4: 00000000001706e0 [13905.226268] Call Trace: [13905.226846] ptlrpc_server_finish_active_request+0x26/0x120 [ptlrpc] [13905.228184] ptlrpc_server_handle_request+0x4a2/0xbc0 [ptlrpc] [13905.229415] ptlrpc_main+0xc91/0x15a0 [ptlrpc] [13905.230399] ? __schedule+0x2d9/0x870 [13905.231251] ? ptlrpc_wait_event+0x590/0x590 [ptlrpc] [13905.232347] kthread+0x134/0x150 [13905.233064] ? set_kthread_struct+0x50/0x50 [13905.233920] ret_from_fork+0x35/0x40 report link: https://testing.whamcloud.com/test_sets/79713f31-a8a9-4056-8478-4fb529a14077 It can be related to recent changes in ORR/TRR code in 'LU-8130 nrs: convert NRS ORR/TRR to rhashtable' landed two weeks ago |
| Comments |
| Comment by Mikhail Pershin [ 01/Sep/23 ] |
|
On further check it seems that all recent occurrences are from |
| Comment by James A Simmons [ 01/Sep/23 ] |
|
Strange bug since nrs_orr_req_stop() is the under lying function being called and the debug data is not being handled by RCU so its not a rhashtable issue. I was thinking it was due to the lquota memory corruption bug but if you can show its |
| Comment by Mikhail Pershin [ 02/Sep/23 ] |
|
https://testing.whamcloud.com/test_sets/49dd0ad5-bff7-4dba-a84e-7d8dcae4402a Another occurrence without relation to
|
| Comment by Mikhail Pershin [ 02/Sep/23 ] |
|
It happened prior related policy operation ..._req_stop call. I did objdump for related code and it looks like that is due to freed resource: static inline void nrs_request_stop(struct ptlrpc_nrs_request *nrq)
{
struct ptlrpc_nrs_policy *policy = nrs_request_policy(nrq);
which is in turn:
struct ptlrpc_nrs_policy *nrs_request_policy(struct ptlrpc_nrs_request *nrq)
{
return nrs_request_resource(nrq)->res_policy; <--- here res_policy is filled from resource in %rbx
}
so it looks like nrs_request_resource returned by nrs_request_policy() is freed and contains that 0x5a5a5a5a5a |
| Comment by Etienne Aujames [ 05/Sep/23 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/49dd0ad5-bff7-4dba-a84e-7d8dcae4402a |
| Comment by Etienne Aujames [ 06/Sep/23 ] |
|
Hi, I tried to reproduce this bug because I suspected the "c098c09 I reproduced this crash with debug=all on zfs (it took arround 5-10min): ONLY_REPEAT=1000 FSTYPE=zfs auster -rv sanityn --only "77c" From the vmcore-dmesg: [ 1187.404425] RIP: 0010:[<ffffffffc0c33ddb>] [<ffffffffc0c33ddb>] ptlrpc_nrs_req_stop_nolock+0x3b/0x150 [ptlrpc] [ 1187.404509] RSP: 0018:ffff904365ab7ce0 EFLAGS: 00010246 [ 1187.404528] RAX: ffff9043bbae4b40 RBX: 5a5a5a5a5a5a5a5a RCX: 0000000000000000 [ 1187.404552] RDX: 0000000000000001 RSI: ffff9043639ca880 RDI: ffff9043639ca880 [ 1187.404576] RBP: ffff904365ab7ce8 R08: 000000000000000a R09: 0000000000000038 [ 1187.404600] R10: 0000000000000000 R11: ffff904365ab7af6 R12: ffff9043797b8800 [ 1187.404624] R13: ffff9043797b88f0 R14: ffff9043baad0e00 R15: 00064f83754e9e80 [ 1187.404648] FS: 0000000000000000(0000) GS:ffff9043bfc00000(0000) knlGS:0000000000000000 [ 1187.404676] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1187.404695] CR2: 00007fc038264d00 CR3: 0000000078f9e000 CR4: 00000000000606f0 [ 1187.404731] Call Trace: [ 1187.404780] [<ffffffffc0bfcb8b>] ptlrpc_server_finish_active_request+0x2b/0x140 [ptlrpc] [ 1187.404846] [<ffffffffc0c0080e>] ptlrpc_server_handle_request+0x40e/0xc00 [ptlrpc] [ 1187.404903] [<ffffffffc0c0232c>] ptlrpc_main+0xc8c/0x1680 [ptlrpc] [ 1187.404929] [<ffffffffb5e2b59e>] ? __switch_to+0xce/0x580 [ 1187.404990] [<ffffffffc0c016a0>] ? ptlrpc_wait_event+0x5c0/0x5c0 [ptlrpc] [ 1187.405789] [<ffffffffb5ec5e61>] kthread+0xd1/0xe0 [ 1187.406600] [<ffffffffb5ec5d90>] ? insert_kthread_work+0x40/0x40 [ 1187.407483] [<ffffffffb65aadf7>] ret_from_fork_nospec_begin+0x21/0x21 [ 1187.408317] [<ffffffffb5ec5d90>] ? insert_kthread_work+0x40/0x40 From the crash: crash> ptlrpc_request ffff9043639ca880
....
sr_nrq = {
nr_res_ptrs = {0xffff904398815680, 0xffff9043bbae4b40},
....
crash> kmem 0xffff9043bbae4b40
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
ffff9043bd16c900 Acpi-State 80 31 306 6 4k
SLAB MEMORY NODE TOTAL ALLOCATED FREE
ffffe2c481eeb900 ffff9043bbae4000 0 51 10 41
FREE / [ALLOCATED]
ffff9043bbae4b40 (cpu 0 cache)
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffe2c481eeb900 7bae4000 0 ffff9043bbae4b40 1 1fffff00000080 slab
crash> struct nrs_orr_data -x -l nrs_orr_data.od_res 0xffff9043bbae4b40 | head -n10
struct nrs_orr_data {
od_res = {
res_parent = 0xffff9043bbae4cd0,
res_policy = 0x5a5a5a5a5a5a5a5a
},
od_binheap = 0x5a5a5a5a5a5a5a5a,
od_obj_hash = {
tbl = 0x5a5a5a5a5a5a5a5a,
nelems = {
counter = 0x5a5a5a5a
The orr object was freed. The debug log confirmed this: 00000100:00000010:1.1:1693989787.815785:0:10036:0:(nrs_orr.c:832:nrs_orr_object_free()) slab-freed '(orro)': 80 at ffff9043bbae4b40. |
| Comment by Etienne Aujames [ 06/Sep/23 ] |
|
I think the culprit is the following code: static int nrs_orr_res_get(struct ptlrpc_nrs_policy *policy, struct ptlrpc_nrs_request *nrq, const struct ptlrpc_nrs_resource *parent, struct ptlrpc_nrs_resource **resp, bool moving_req) .... orro2 = rhashtable_lookup_get_insert_fast(&orrd->od_obj_hash, &orro->oo_rhead, nrs_orr_hash_params); /* A returned non-error orro2 means it already exist */ if (!IS_ERR_OR_NULL(orro2)) { <------ no ref taken if rhashtable_lookup_get_insert_fast returns an existing object OBD_SLAB_FREE_PTR(orro, orrd->od_cache); orro = orro2; } /* insertion failed */ if (IS_ERR(orro2)) { A ref needs to be taken if rhashtable_lookup_get_insert_fast() returns an object, otherwise the object could be freed before nrs_orr_res_put()/ptlrpc_nrs_req_stop_nolock(). |
| Comment by Gerrit Updater [ 06/Sep/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52295 |
| Comment by Gerrit Updater [ 26/Sep/23 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52515 |
| Comment by Gerrit Updater [ 28/Sep/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52295/ |
| Comment by Gerrit Updater [ 16/Oct/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52515/ |
| Comment by Peter Jones [ 16/Oct/23 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 13/Nov/23 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53113 |