Details
Type: Bug
Resolution: Fixed
Priority: Blocker
Affects Version/s: Lustre 2.4.0, Lustre 2.1.4
Severity: 3
5830
Description
We repeatedly hit this problem on our Grove-Production MDS today:
BUG: unable to handle kernel NULL pointer dereference at 000000000000001c IP: [<ffffffffa08bcdb7>] lustre_swab_lov_user_md_common+0x27/0x4e0 [ptlrpc]
crash> bt
PID: 738 TASK: ffff881778c9caa0 CPU: 14 COMMAND: "mdt00_006"
#0 [ffff88175b907370] machine_kexec at ffffffff8103216b
#1 [ffff88175b9073d0] crash_kexec at ffffffff810b8d12
#2 [ffff88175b9074a0] oops_end at ffffffff814f2c00
#3 [ffff88175b9074d0] no_context at ffffffff810423fb
#4 [ffff88175b907520] __bad_area_nosemaphore at ffffffff81042685
#5 [ffff88175b907570] bad_area_nosemaphore at ffffffff81042753
#6 [ffff88175b907580] __do_page_fault at ffffffff81042e0d
#7 [ffff88175b9076a0] do_page_fault at ffffffff814f4bde
#8 [ffff88175b9076d0] page_fault at ffffffff814f1f95
[exception RIP: lustre_swab_lov_user_md_common+39]
RIP: ffffffffa08bcdb7 RSP: ffff88175b907780 RFLAGS: 00010246
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffffffffa090961a RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88175b907790 R8: ffff88175b937000 R9: ffff88175b8910d0
R10: 0000000000000001 R11: 00000000fffffff3 R12: ffff8817ec176000
R13: ffff88175c222468 R14: ffffc9013311e208 R15: ffff8817ec176000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff88175b907798] lustre_swab_lov_user_md_v3 at ffffffffa08bd2ad [ptlrpc]
#10 [ffff88175b9077b8] lod_qos_prep_create at ffffffffa0b6bf77 [lod]
#11 [ffff88175b907858] lod_declare_striped_object at ffffffffa0b66c7b [lod]
#12 [ffff88175b9078b8] lod_declare_xattr_set at ffffffffa0b67b9d [lod]
#13 [ffff88175b907918] mdd_create_data at ffffffffa0bf4c00 [mdd]
#14 [ffff88175b907978] mdt_finish_open at ffffffffa0c794f8 [mdt]
#15 [ffff88175b907a08] mdt_open_by_fid_lock at ffffffffa0c7a5a7 [mdt]
#16 [ffff88175b907a78] mdt_reint_open at ffffffffa0c7ac5f [mdt]
#17 [ffff88175b907b58] mdt_reint_rec at ffffffffa0c66a21 [mdt]
#18 [ffff88175b907b78] mdt_reint_internal at ffffffffa0c601b3 [mdt]
#19 [ffff88175b907bb8] mdt_intent_reint at ffffffffa0c6077d [mdt]
#20 [ffff88175b907c08] mdt_intent_policy at ffffffffa0c5c38e [mdt]
#21 [ffff88175b907c48] ldlm_lock_enqueue at ffffffffa0872b91 [ptlrpc]
#22 [ffff88175b907ca8] ldlm_handle_enqueue0 at ffffffffa089a837 [ptlrpc]
#23 [ffff88175b907d18] mdt_enqueue at ffffffffa0c5bf16 [mdt]
#24 [ffff88175b907d38] mdt_handle_common at ffffffffa0c4fdd2 [mdt]
#25 [ffff88175b907d88] mdt_regular_handle at ffffffffa0c50cd5 [mdt]
#26 [ffff88175b907d98] ptlrpc_server_handle_request at ffffffffa08ca8fc [ptlrpc]
#27 [ffff88175b907e98] ptlrpc_main at ffffffffa08cbeec [ptlrpc]
#28 [ffff88175b907f48] kernel_thread at ffffffff8100c14a
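The faulting address is consistent with the swab routine being handed a NULL lov_user_md pointer: dereferencing a member of a NULL struct pointer faults at that member's offset, and 0x1c is where lmm_stripe_count lands in the usual v1/v3 layout. A minimal user-space sketch of that arithmetic (the struct below is an illustrative mock, not the actual Lustre definition):

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative mock of the leading lov_user_md fields, not Lustre's header. */
struct mock_lov_user_md {
        uint32_t lmm_magic;         /* offset 0x00 */
        uint32_t lmm_pattern;       /* offset 0x04 */
        uint64_t lmm_object_id;     /* offset 0x08 */
        uint64_t lmm_object_seq;    /* offset 0x10 */
        uint32_t lmm_stripe_size;   /* offset 0x18 */
        uint16_t lmm_stripe_count;  /* offset 0x1c -- matches the fault address */
        uint16_t lmm_stripe_offset; /* offset 0x1e */
};

int main(void)
{
        /*
         * A load through a NULL struct pointer hits address
         * 0 + offsetof(member), so touching lmm_stripe_count on a NULL
         * pointer faults at 0x1c -- the address reported in the oops above.
         */
        printf("offsetof(lmm_stripe_count) = %#zx\n",
               offsetof(struct mock_lov_user_md, lmm_stripe_count));
        return 0;
}

Compiled and run on x86_64, this prints 0x1c, i.e. the same address shown in the BUG line.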
Recovery was manually aborted, which cleared up the issue:
lctl --device 5 abort_recovery
Prior to the manual intervention, the node had been crashing repeatedly after recovery for about 12 hours.