Details
- Type: Bug
- Resolution: Fixed
- Priority: Blocker
- Lustre 2.4.0, Lustre 2.1.4
- 3
- 5830
Description
We repeatedly hit this problem on our Grove-Production MDS today:
BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
IP: [<ffffffffa08bcdb7>] lustre_swab_lov_user_md_common+0x27/0x4e0 [ptlrpc]
crash> bt
PID: 738   TASK: ffff881778c9caa0   CPU: 14   COMMAND: "mdt00_006"
 #0 [ffff88175b907370] machine_kexec at ffffffff8103216b
 #1 [ffff88175b9073d0] crash_kexec at ffffffff810b8d12
 #2 [ffff88175b9074a0] oops_end at ffffffff814f2c00
 #3 [ffff88175b9074d0] no_context at ffffffff810423fb
 #4 [ffff88175b907520] __bad_area_nosemaphore at ffffffff81042685
 #5 [ffff88175b907570] bad_area_nosemaphore at ffffffff81042753
 #6 [ffff88175b907580] __do_page_fault at ffffffff81042e0d
 #7 [ffff88175b9076a0] do_page_fault at ffffffff814f4bde
 #8 [ffff88175b9076d0] page_fault at ffffffff814f1f95
    [exception RIP: lustre_swab_lov_user_md_common+39]
    RIP: ffffffffa08bcdb7  RSP: ffff88175b907780  RFLAGS: 00010246
    RAX: 0000000000000001  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: ffffffffa090961a  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ffff88175b907790   R8: ffff88175b937000   R9: ffff88175b8910d0
    R10: 0000000000000001  R11: 00000000fffffff3  R12: ffff8817ec176000
    R13: ffff88175c222468  R14: ffffc9013311e208  R15: ffff8817ec176000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff88175b907798] lustre_swab_lov_user_md_v3 at ffffffffa08bd2ad [ptlrpc]
#10 [ffff88175b9077b8] lod_qos_prep_create at ffffffffa0b6bf77 [lod]
#11 [ffff88175b907858] lod_declare_striped_object at ffffffffa0b66c7b [lod]
#12 [ffff88175b9078b8] lod_declare_xattr_set at ffffffffa0b67b9d [lod]
#13 [ffff88175b907918] mdd_create_data at ffffffffa0bf4c00 [mdd]
#14 [ffff88175b907978] mdt_finish_open at ffffffffa0c794f8 [mdt]
#15 [ffff88175b907a08] mdt_open_by_fid_lock at ffffffffa0c7a5a7 [mdt]
#16 [ffff88175b907a78] mdt_reint_open at ffffffffa0c7ac5f [mdt]
#17 [ffff88175b907b58] mdt_reint_rec at ffffffffa0c66a21 [mdt]
#18 [ffff88175b907b78] mdt_reint_internal at ffffffffa0c601b3 [mdt]
#19 [ffff88175b907bb8] mdt_intent_reint at ffffffffa0c6077d [mdt]
#20 [ffff88175b907c08] mdt_intent_policy at ffffffffa0c5c38e [mdt]
#21 [ffff88175b907c48] ldlm_lock_enqueue at ffffffffa0872b91 [ptlrpc]
#22 [ffff88175b907ca8] ldlm_handle_enqueue0 at ffffffffa089a837 [ptlrpc]
#23 [ffff88175b907d18] mdt_enqueue at ffffffffa0c5bf16 [mdt]
#24 [ffff88175b907d38] mdt_handle_common at ffffffffa0c4fdd2 [mdt]
#25 [ffff88175b907d88] mdt_regular_handle at ffffffffa0c50cd5 [mdt]
#26 [ffff88175b907d98] ptlrpc_server_handle_request at ffffffffa08ca8fc [ptlrpc]
#27 [ffff88175b907e98] ptlrpc_main at ffffffffa08cbeec [ptlrpc]
#28 [ffff88175b907f48] kernel_thread at ffffffff8100c14a
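For reference, the fault address 000000000000001c lines up with the offset of lmm_stripe_count in struct lov_user_md_v1, and RDI is 0 in the register dump, so lustre_swab_lov_user_md_common() appears to have been handed a NULL lum pointer rather than a corrupt one. A minimal stand-alone check of that offset, with the field layout copied from struct lov_user_md_v1 in lustre/include/lustre/lustre_user.h (the trailing lmm_objects[] flexible array omitted):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Field layout copied from struct lov_user_md_v1 in
 * lustre/include/lustre/lustre_user.h (Lustre 2.x); the trailing
 * lmm_objects[] flexible array is left out of this sketch. */
struct lov_user_md_v1 {
        uint32_t lmm_magic;
        uint32_t lmm_pattern;
        uint64_t lmm_object_id;
        uint64_t lmm_object_seq;
        uint32_t lmm_stripe_size;
        uint16_t lmm_stripe_count;
        uint16_t lmm_stripe_offset;
} __attribute__((packed));

int main(void)
{
        /* Prints 0x1c, the address the MDS faulted on with lum == NULL. */
        printf("lmm_stripe_count offset = 0x%zx\n",
               offsetof(struct lov_user_md_v1, lmm_stripe_count));
        return 0;
}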
Recovery was manually aborted, which cleared up the issue:
lctl --device 5 abort_recovery
Prior to the manual intervention, the node had been crashing repeatedly after recovery for about 12 hours.
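Judging from the stack, lod_qos_prep_create() appears to pass a LOV EA buffer to lustre_swab_lov_user_md_v3() without checking it, so a NULL buffer reaches the swab helpers. A guard of roughly the following shape in that path would avoid the oops; this is only a hypothetical sketch (the lum_buf local is an assumed name, and this is not necessarily the fix that landed):

/* Hypothetical guard in the lod_qos_prep_create() path (sketch only):
 * skip byte-swapping when the caller supplied no LOV EA buffer, instead
 * of dereferencing a NULL lov_user_md pointer. */
struct lov_user_md_v3 *v3 = (lum_buf != NULL) ? lum_buf->lb_buf : NULL;

if (v3 != NULL && v3->lmm_magic == __swab32(LOV_USER_MAGIC_V3))
        lustre_swab_lov_user_md_v3(v3);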