[LU-16593]  BUG: unable to handle kernel NULL pointer in mdd_changelog_recalc_mask Created: 24/Feb/23  Updated: 24/Feb/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Alex Zhuravlev Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
BUG: unable to handle kernel NULL pointer dereference at 0000000000000160 in conf-sanity / 76c

Trace:

PID: 317366  TASK: ffffa2bfa2c219c0  CPU: 0   COMMAND: "lctl"
 #0 [ffffa2bfa2c27c68] panic at ffffffffaf0b9786
    /tmp/kernel/kernel/panic.c: 299
 #1 [ffffa2bfa2c27d00] no_context at ffffffffaf0a9563
    /tmp/kernel/arch/x86/mm/fault.c: 799
 #2 [ffffa2bfa2c27d50] page_fault at ffffffffaf600f0e
    /tmp/kernel/arch/x86/entry/entry_64.S: 1220
    [exception RIP: mdd_changelog_recalc_mask+212]
    RIP: ffffffffc0ce40a4  RSP: ffffa2bfa2c27e00  RFLAGS: 00010286
    RAX: ffffffffffffffff  RBX: ffffa2bf73145e00  RCX: 0000000000000000
    RDX: 0000000000000a2f  RSI: 0000000000000000  RDI: ffffa2bf765ec950
    RBP: ffffa2bf6a67ec00   R8: ffffa2bfa2c27c28   R9: 0000000000000a6c
    R10: 0000000000000000  R11: ffffa2bf8cd05a6b  R12: ffffa2bf765ec950
    R13: ffffa2bfa2c27e40  R14: ffffa2bf8d508000  R15: ffffa2bfa2c27f10
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    /home/lustre/master-mine/lustre/mdd/mdd_device.c: 1889
 #3 [ffffa2bfa2c27e30] mdd_changelog_mask_seq_write at ffffffffc0d098e2 [mdd]
    /home/lustre/master-mine/lustre/mdd/mdd_lproc.c: 174
 #4 [ffffa2bfa2c27ea0] full_proxy_write at ffffffffaf2e6a7b
    /tmp/kernel/fs/debugfs/file.c: 230
 #5 [ffffa2bfa2c27ed8] vfs_write at ffffffffaf1cffc9
    /tmp/kernel/fs/read_write.c: 550
 #6 [ffffa2bfa2c27f08] ksys_write at ffffffffaf1d021d
    /tmp/kernel/fs/read_write.c: 599
 #7 [ffffa2bfa2c27f38] do_syscall_64 at ffffffffaf001893
    /tmp/kernel/arch/x86/entry/common.c: 302
 #8 [ffffa2bfa2c27f50] entry_SYSCALL_64_after_hwframe at ffffffffaf600099
    /tmp/kernel/arch/x86/entry/entry_64.S: 151
    RIP: 00007f8dd57e5915  RSP: 00007ffd59725268  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 00000000006822f0  RCX: 00007f8dd57e5915
    RDX: 0000000000000006  RSI: 00007ffd59728f32  RDI: 0000000000000003
    RBP: 0000000000000001   R8: 000000000000fc00   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000003
    R13: 000000000068a301  R14: 0000000000682440  R15: 00007ffd5972740c
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
/home/lustre/master-mine/lustre/mdd/mdd_device.c: 1889
0xffffffffc0ce40a0 <mdd_changelog_recalc_mask+208>:	mov    0x30(%rbx),%rsi
0xffffffffc0ce40a4 <mdd_changelog_recalc_mask+212>:	mov    0x160(%rsi),%rax
crash> p *(struct llog_ctxt *)0xffffa2bf73145e00
$1 = {
  loc_idx = 14,
  loc_obd = 0xffffa2bf765ec188,
  loc_olg = 0xffffa2bf765ec868,
  loc_exp = 0xffffa2bf6308f800,
  loc_imp = 0x0,
  loc_logops = 0xffffffffc0d470c0 <changelog_orig_logops>,
  loc_handle = 0x0,
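
The fault address can be read off the two instructions and the dump above. RBX holds the ctxt address (ffffa2bf73145e00), the first mov loads the pointer at offset 0x30 into RSI, and the second mov dereferences RSI + 0x160. With RSI = 0, that is the reported address 0x160. The sketch below is a userspace illustration only: it assumes the fields sit in the order crash printed them on x86_64, the struct and function names are hypothetical, and it is not the Lustre definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical mirror of the fields printed by crash. Pointer types are
 * reduced to void * because only the layout matters for this illustration.
 */
struct ctxt_layout_sketch {
	int   loc_idx;
	void *loc_obd;
	void *loc_olg;
	void *loc_exp;
	void *loc_imp;
	void *loc_logops;
	void *loc_handle;
};

/*
 * Address the second mov would touch: the pointer loaded from loc_handle
 * plus the displacement encoded in the instruction.
 */
static uintptr_t faulting_address(const struct ctxt_layout_sketch *ctxt,
				  uintptr_t disp)
{
	return (uintptr_t)ctxt->loc_handle + disp;
}
```

Under that layout assumption, loc_handle lands at offset 0x30 on a 64-bit build, which is consistent with `mov 0x30(%rbx),%rsi` reading loc_handle = 0x0.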

Just before the panic, another process (mount) started to initialize the changelog:

00000004:00000080:1.0:1677245239.833125:0:317339:0:(mdd_device.c:548:mdd_changelog_llog_init()) changelog starting index=0
00000040:00000001:1.0:1677245239.833125:0:317339:0:(llog_obd.c:150:llog_setup()) Process entered
00000040:00000010:1.0:1677245239.833127:0:317339:0:(llog_obd.c:44:llog_new_ctxt()) kmalloced '(ctxt)': 264 at 00000000ccbe3063.
00000020:00000040:1.0:1677245239.833128:0:317339:0:(genops.c:970:class_export_get()) GET export 00000000c0fe929f refcount=3
00000040:00000001:1.0:1677245239.833128:0:317339:0:(llog_osd.c:1919:llog_osd_setup()) Process entered
00000040:00000040:1.0:1677245239.833129:0:317339:0:(lustre_log.h:395:llog_ctxt_get()) GETting ctxt 00000000ccbe3063 : new refcount 2
00000020:00000001:1.0:1677245239.833129:0:317339:0:(local_storage.c:842:local_oid_storage_init()) Process entered
00000020:00000001:1.0:1677245239.833130:0:317339:0:(local_storage.c:147:ls_device_get()) Process entered
00000020:00000001:1.0:1677245239.833130:0:317339:0:(local_storage.c:152:ls_device_get()) Process leaving via out_ls (rc=18446641542280932352 : -102531428619264 : 0xffffa2bf8a9e6c00)

So this is obviously a race, though I don't quite understand how lctl was able to start before the MDS mount completed.
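
For illustration only, here is a minimal userspace sketch of the kind of guard that would turn this dereference into an error return. All names (`ctxt_sketch`, `handle_sketch`, `recalc_mask_sketch`) and the choice of -ENODEV are assumptions, not Lustre code and not a proposed patch. A bare NULL check only narrows the window; closing the race would still need the write path to be ordered against changelog setup.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for whatever loc_handle points to. */
struct handle_sketch {
	unsigned int mask;
};

/* Hypothetical stand-in for the context; only the relevant field is kept. */
struct ctxt_sketch {
	struct handle_sketch *loc_handle;
};

/*
 * Returns 0 and stores the mask in *mask_out when the handle is set up,
 * or -ENODEV when the context or its handle is not yet initialized.
 */
static int recalc_mask_sketch(const struct ctxt_sketch *ctxt,
			      unsigned int *mask_out)
{
	if (ctxt == NULL || ctxt->loc_handle == NULL)
		return -ENODEV;

	*mask_out = ctxt->loc_handle->mask;
	return 0;
}
```

With loc_handle still NULL, as in the dump above, the caller gets -ENODEV instead of a page fault.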

