[LU-16132] NULL pointer dereference lu_object_put Created: 01/Sep/22  Updated: 07/Sep/22  Resolved: 07/Sep/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.1
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Brian Barbisch
Assignee: WC Triage
Resolution: Duplicate
Votes: 0
Labels: None
Environment:

5.4.0-1089-azure #94~18.04.1-Ubuntu


Issue Links:
Related
is related to LU-15811 simplify lower/upper AIO/DIO code Resolved
Epic/Theme: client
Severity: 3

 Description   

NULL pointer dereference in lu_object_put() while running ior with O_DIRECT random writes.

 

Here is the dmesg output:

 

[39590.638366] BUG: kernel NULL pointer dereference, address: 0000000000000000
[39590.642325] #PF: supervisor read access in kernel mode
[39590.642325] #PF: error_code(0x0000) - not-present page
[39590.647646] PGD 0 P4D 0 
[39590.647646] Oops: 0000 [#1] SMP PTI
[39590.647646] CPU: 3 PID: 4130 Comm: ptlrpcd_00_01 Kdump: loaded Tainted: G           OE     5.4.0-1089-azure #94~18.04.1-Ubuntu
[39590.647646] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008  12/07/2018
[39590.647646] RIP: 0010:lu_object_put+0x1c/0x4a0 [obdclass]
[39590.647646] Code: 92 c6 d1 e9 72 ff ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 89 f3 48 83 ec 18 <4c> 8b 36 49 8b 36 41 8b 56 08 48 85 f6 75 08 85 d2 0f 84 9a 00 00
[39590.647646] RSP: 0018:ffffb0ae83ccfa60 EFLAGS: 00010286
[39590.647646] RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffff9f5b4d64e670
[39590.647646] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb0ae83ccfe98
[39590.647646] RBP: ffffb0ae83ccfaa0 R08: 0000000000000100 R09: 0000000000000001
[39590.686119] R10: 0000000000100000 R11: 0000000000000000 R12: ffffb0ae83ccfe98
[39590.686119] R13: ffffb0ae83ccfe98 R14: ffffffffc0e50aa0 R15: 0000000000000000
[39590.686119] FS:  0000000000000000(0000) GS:ffff9f5bdf8c0000(0000) knlGS:0000000000000000
[39590.686119] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[39590.686119] CR2: 0000000000000000 CR3: 00000007f7108004 CR4: 00000000003706e0
[39590.686119] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[39590.686119] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[39590.686119] Call Trace:
[39590.686119]  ? cl_sync_io_note+0x1c0/0x360 [obdclass]
[39590.686119]  cl_object_put+0xe/0x10 [obdclass]
[39590.686119]  cl_aio_free+0x1b/0xe0 [obdclass]
[39590.686119]  cl_sync_io_note+0x17a/0x360 [obdclass]
[39590.686119]  cl_sync_io_note+0x27e/0x360 [obdclass]
[39590.686119]  ? cl_sync_io_note+0x1c0/0x360 [obdclass]
[39590.686119]  cl_sync_io_note+0x14e/0x360 [obdclass]
[39590.686119]  cl_page_completion+0x2ef/0x450 [obdclass]
[39590.735565]  osc_prep_async_page+0x831/0x19d0 [osc]
[39590.735565]  osc_extent_finish+0x160/0xa70 [osc]
[39590.735565]  ? kmem_cache_free+0x294/0x2b0
[39590.735565]  osc_set_info_async+0x2869/0x5380 [osc]
[39590.735565]  ? ptlrpc_retain_replayable_request+0xc33/0xff0 [ptlrpc]
[39590.735565]  ptlrpc_check_set+0x248/0x1f60 [ptlrpc]
[39590.735565]  ptlrpcd_add_req+0xd03/0xef0 [ptlrpc]
[39590.735565]  ? do_wait_intr_irq+0x90/0x90
[39590.735565]  kthread+0x121/0x140
[39590.735565]  ? ptlrpcd_add_req+0x490/0xef0 [ptlrpc]
[39590.735565]  ? kthread_park+0x90/0x90
[39590.735565]  ret_from_fork+0x35/0x40
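
The Code: line and the register dump are enough to see which load faulted. Below is a minimal sketch in Python, assuming the byte marked <4c> is the first byte of the instruction at RIP (the usual oops convention). It handles only the one encoding that appears there, and the helper names are illustrative rather than part of any Lustre or kernel tool.

```python
# Values copied from the oops above.
CODE_LINE = (
    "92 c6 d1 e9 72 ff ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 55 "
    "48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 89 f3 48 83 ec "
    "18 <4c> 8b 36 49 8b 36 41 8b 56 08 48 85 f6 75 08 85 d2 0f 84 9a 00 00"
)
REGS = {"rsi": 0x0, "cr2": 0x0}

REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
         "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def faulting_bytes(code_line):
    """Return the bytes starting at the one marked <xx>, i.e. at RIP."""
    tokens = code_line.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return [int(t.strip("<>"), 16) for t in tokens[start:]]


def decode_mov_load(b):
    """Decode 'REX.W 8B /r' with a plain [reg] memory operand (mod == 00).

    Deliberately minimal: it only covers the encoding seen at RIP here.
    """
    rex, opcode, modrm = b[0], b[1], b[2]
    if rex & 0xF8 != 0x48 or opcode != 0x8B:
        raise ValueError("not a 64-bit MOV load")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0 or rm in (4, 5):
        raise ValueError("addressing form not handled by this sketch")
    dst = REG64[reg + 8 * ((rex >> 2) & 1)]  # REX.R extends the reg field
    base = REG64[rm + 8 * (rex & 1)]         # REX.B extends the r/m field
    return dst, base


dst, base = decode_mov_load(faulting_bytes(CODE_LINE))
print(f"faulting instruction: mov ({base}), {dst}")
print(f"{base} = {REGS[base]:#x}, CR2 = {REGS['cr2']:#x}")
```

The decoded instruction is a load through RSI, which holds 0 and matches CR2. Under the x86-64 calling convention RSI carries the second argument, so the pointer handed down from cl_object_put appears to have been NULL on entry to lu_object_put.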

 

 

Backtrace from crash dump:

 

crash> bt -sx
PID: 4130   TASK: ffff9f5b69f15b80  CPU: 3   COMMAND: "ptlrpcd_00_01"
 #0 [ffffb0ae83ccf670] machine_kexec+0x180 at ffffffff92a5e5b0
 #1 [ffffb0ae83ccf6c8] __crash_kexec+0x72 at ffffffff92b43ab2
 #2 [ffffb0ae83ccf798] panic+0x158 at ffffffff93434b05
 #3 [ffffb0ae83ccf820] oops_end+0xcc at ffffffff92a2512c
 #4 [ffffb0ae83ccf848] no_context+0x1db at ffffffff92a6d55b
 #5 [ffffb0ae83ccf8b8] __bad_area_nosemaphore+0x50 at ffffffff92a6d950
 #6 [ffffb0ae83ccf900] bad_area_nosemaphore+0x16 at ffffffff92a6daf6
 #7 [ffffb0ae83ccf910] __do_page_fault+0x21a at ffffffff92a6e4ba
 #8 [ffffb0ae83ccf978] do_page_fault+0x35 at ffffffff92a6e795
 #9 [ffffb0ae83ccf9b0] page_fault+0x39 at ffffffff93601129
    [exception RIP: lu_object_put+28]
    RIP: ffffffffc0e3e3ec  RSP: ffffb0ae83ccfa60  RFLAGS: 00010286
    RAX: 0000000000000001  RBX: 0000000000000000  RCX: ffff9f5b4d64e670
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: ffffb0ae83ccfe98
    RBP: ffffb0ae83ccfaa0   R8: 0000000000000100   R9: 0000000000000001
    R10: 0000000000100000  R11: 0000000000000000  R12: ffffb0ae83ccfe98
    R13: ffffb0ae83ccfe98  R14: ffffffffc0e50aa0  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffffb0ae83ccfaa8] cl_object_put+0xe at ffffffffc0e45cbe [obdclass]
#11 [ffffb0ae83ccfab8] cl_aio_free+0x1b at ffffffffc0e4fc9b [obdclass]
#12 [ffffb0ae83ccfad0] cl_sync_io_note+0x17a at ffffffffc0e50a5a [obdclass]
#13 [ffffb0ae83ccfb08] cl_sync_io_note+0x27e at ffffffffc0e50b5e [obdclass]
#14 [ffffb0ae83ccfb40] cl_sync_io_note+0x14e at ffffffffc0e50a2e [obdclass]
#15 [ffffb0ae83ccfb78] cl_page_completion+0x2ef at ffffffffc0e4bf8f [obdclass]
#16 [ffffb0ae83ccfbb8] osc_prep_async_page+0x831 at ffffffffc0c9f371 [osc]
#17 [ffffb0ae83ccfc08] osc_extent_finish+0x160 at ffffffffc0ca4b80 [osc]
#18 [ffffb0ae83ccfca0] osc_set_info_async+0x2869 at ffffffffc0c87e69 [osc]
#19 [ffffb0ae83ccfd70] ptlrpc_check_set+0x248 at ffffffffc11508a8 [ptlrpc]
#20 [ffffb0ae83ccfe00] ptlrpcd_add_req+0xd03 at ffffffffc117f903 [ptlrpc]
#21 [ffffb0ae83ccff08] kthread+0x121 at ffffffff92aaf8d1
#22 [ffffb0ae83ccff50] ret_from_fork+0x35 at ffffffff93600215
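
Each frame in the backtrace gives a symbol, an offset, and an absolute address, so the start address of each function can be recovered by subtraction. That is handy when disassembling the same module build. A short sketch in Python using three values copied from the backtrace above; the regular expression and function name are illustrative assumptions, not a crash utility feature.

```python
import re

# Frames and exception RIP copied from the crash backtrace above.
FRAMES = """\
#10 [ffffb0ae83ccfaa8] cl_object_put+0xe at ffffffffc0e45cbe [obdclass]
#11 [ffffb0ae83ccfab8] cl_aio_free+0x1b at ffffffffc0e4fc9b [obdclass]
"""
EXCEPTION_RIP = 0xFFFFFFFFC0E3E3EC  # lu_object_put+28 (offset shown in decimal)

FRAME_RE = re.compile(r"#\d+ \[[0-9a-f]+\] (\w+)\+0x([0-9a-f]+) at ([0-9a-f]+)")


def function_starts(text):
    """Map each symbol to (frame address - offset), i.e. its start address."""
    starts = {}
    for m in FRAME_RE.finditer(text):
        name = m.group(1)
        offset = int(m.group(2), 16)
        addr = int(m.group(3), 16)
        starts[name] = addr - offset
    return starts


starts = function_starts(FRAMES)
starts["lu_object_put"] = EXCEPTION_RIP - 28
for name, addr in starts.items():
    print(f"{name}: {addr:#x}")
```

The exception RIP line prints its offset in decimal (28), which is the same location as the +0x1c reported in dmesg.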

 

 

 

 

 



 Comments   
Comment by Brian Barbisch [ 06/Sep/22 ]

FWIW, the same test passes on 2.12.x and 2.14.0, so this is possibly a regression in 2.15.x.

Comment by Andreas Dilger [ 06/Sep/22 ]

A few patches with fixes to the AIO/DIO code from LU-15811 just landed on master (not in any tag yet).

Could you run this same test with master (at least commit v2_15_51-20-gf1c8ac1156 or later), and/or cherry-pick the three LU-15811 patches to b2_15 to see if this fixes the problem?
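
The commit identifier above looks like `git describe` output, which has the form tag, number of commits since that tag, then `g` plus an abbreviated hash. A small sketch in Python that splits it apart, assuming that format; the function name is hypothetical.

```python
import re


def parse_describe(text):
    """Split '<tag>-<count>-g<abbrev>' into its three parts."""
    m = re.fullmatch(r"(?P<tag>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)", text)
    if m is None:
        raise ValueError(f"not in git-describe form: {text!r}")
    return m.group("tag"), int(m.group("count")), m.group("sha")


tag, count, sha = parse_describe("v2_15_51-20-gf1c8ac1156")
print(f"tag={tag} commits_after_tag={count} abbrev_hash={sha}")
```

With the abbreviated hash in hand, `git merge-base --is-ancestor f1c8ac1156 HEAD` exits with status 0 when a checkout already contains that commit.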

Comment by Brian Barbisch [ 07/Sep/22 ]

Thank you, Andreas! I cherry-picked the 3 commits from LU-15811 into my personal upstream/b2_15 branch, and my directio random writes test has now passed 6 times in a row (it used to fail about 2 out of every 3 runs). I'm fairly confident that this fixes the issue, and I would recommend cherry-picking those commits into the 2.15.x LTS release in the future.
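
As a rough check on that confidence, the chance of six consecutive passes can be worked out from the old failure rate. This is a sketch that assumes each run is independent and that the unpatched failure rate was exactly 2 in 3, both simplifications of the figures quoted above.

```python
from fractions import Fraction

fail_rate = Fraction(2, 3)  # assumed unpatched failure rate per run
runs = 6                    # consecutive passes observed after the cherry-pick

# Probability that an unfixed build would still pass every run by luck.
p_all_pass = (1 - fail_rate) ** runs
print(f"P(6 passes without a fix) = {p_all_pass} ~ {float(p_all_pass):.4%}")
```

Under those assumptions an unfixed build would pass six times in a row about once in 729 attempts, so the result is unlikely to be a coincidence.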

Generated at Sat Feb 10 03:24:17 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.