Details
- Bug
- Resolution: Duplicate
- Critical
Description
When running a very recent copy of master (a few days old) under heavy load on real hardware, we hit MDS panics relatively easily.
Here's the basic crash signature:
[16840.816521] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[16840.827221] IP: [<ffffffffc0bae433>] lu_context_key_get+0x13/0x30 [obdclass]
[16840.837127] PGD 0
[16840.841095] Oops: 0000 [#1] SMP
[.....]
[16841.012065] CPU: 18 PID: 145031 Comm: ldlm_cn01_053 Tainted: G OE ------------ 3.10.0-693.21.1.x3.1.9.x86_64 #1
[16841.026546] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0021.032120170601 03/21/2017
[16841.040373] task: ffff880e3b296eb0 ti: ffff881ee0e44000 task.ti: ffff881ee0e44000
[16841.050648] RIP: 0010:[<ffffffffc0bae433>] [<ffffffffc0bae433>] lu_context_key_get+0x13/0x30 [obdclass]
[16841.063235] RSP: 0018:ffff881ee0e47a28 EFLAGS: 00010246
[16841.071023] RAX: 0000000000000014 RBX: ffff881dc15a2f40 RCX: ffff881ee0e47aac
[16841.080843] RDX: ffff881e50b05930 RSI: ffffffffc1325c40 RDI: 0000000000000000
[16841.090656] RBP: ffff881ee0e47a70 R08: ffff880fef20c000 R09: 0000000000000130
[16841.100468] R10: 0000000000000000 R11: ffff881e50b05800 R12: ffff881ee0e47aac
[16841.110290] R13: ffff881e50b05930 R14: ffff881dc15a2f40 R15: 0000000000000000
[16841.120113] FS: 0000000000000000(0000) GS:ffff88203df80000(0000) knlGS:0000000000000000
[16841.130986] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16841.139190] CR2: 0000000000000010 CR3: 0000000fdc790000 CR4: 00000000003607e0
[16841.148961] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[16841.158694] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[16841.168389] Call Trace:
[16841.172820] [<ffffffffc12e7ad4>] ? mdt_lvbo_fill+0x74/0xa80 [mdt]
[16841.181467] [<ffffffffc0dc6852>] ldlm_server_completion_ast+0x242/0x9e0 [ptlrpc]
[16841.191573] [<ffffffffc0dc6610>] ? ldlm_server_blocking_ast+0xa40/0xa40 [ptlrpc]
[16841.201642] [<ffffffffc0d98748>] ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
[16841.211100] [<ffffffffc0de062a>] ptlrpc_set_wait+0x7a/0x8d0 [ptlrpc]
[16841.219944] [<ffffffffc09ba2b8>] ? cfs_hash_bd_from_key+0x38/0xb0 [libcfs]
[16841.229338] [<ffffffff811e4d1d>] ? kmem_cache_alloc_node_trace+0x11d/0x210
[16841.238708] [<ffffffffc0b90e19>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[16841.248345] [<ffffffffc0d986a0>] ? ldlm_work_gl_ast_lock+0x3a0/0x3a0 [ptlrpc]
[16841.258047] [<ffffffffc0dd6e80>] ? ptlrpc_prep_set+0xc0/0x260 [ptlrpc]
[16841.267028] [<ffffffffc0d9e245>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[16841.275975] [<ffffffffc0d9f7a9>] __ldlm_reprocess_all+0x129/0x380 [ptlrpc]
[16841.285292] [<ffffffffc0d9fa10>] ldlm_reprocess_all+0x10/0x20 [ptlrpc]
[16841.294199] [<ffffffffc0dc3d3e>] ldlm_request_cancel+0x14e/0x740 [ptlrpc]
[16841.303307] [<ffffffffc0dc8ada>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
[16841.312233] [<ffffffffc0dc8dc8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
[16841.321356] [<ffffffffc0df9ccb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[16841.331372] [<ffffffffc0df6b55>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[16841.340401] [<ffffffffc0dfd5c4>] ptlrpc_main+0xaf4/0x1fa0 [ptlrpc]
[16841.348851] [<ffffffffc0dfcad0>] ? ptlrpc_register_service+0xf70/0xf70 [ptlrpc]
[16841.358516] [<ffffffff810b4031>] kthread+0xd1/0xe0
I think there's a good chance this is related to https://jira.whamcloud.com/browse/LU-11483, but I haven't done detailed triage yet.
If someone from WC wants to take a look, I can make the vmcore available. (Someone from Cray will take a detailed look eventually, but we haven't had the chance yet.)
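For anyone comparing this signature against other reports before the vmcore is examined, here is a rough triage sketch in Python. It pulls the fault address, the faulting function, the general-purpose registers, and the call-trace frames out of a few lines copied from the oops above. The function name `summarize_oops` and the shape of its result are made up for illustration and are not part of any Lustre or kernel tooling. Frames printed with a leading `?` are ones the kernel's stack unwinder was not sure about, so they are kept in a separate list. A zero RDI (the first integer argument on x86-64) together with a fault at 0x10 is consistent with a NULL pointer being passed to `lu_context_key_get` and read at offset 0x10, but that is only a reading of the registers, not a confirmed root cause.

```python
import re

# A few lines copied verbatim from the oops in this report.
OOPS_EXCERPT = """\
[16840.816521] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[16840.827221] IP: [<ffffffffc0bae433>] lu_context_key_get+0x13/0x30 [obdclass]
[16841.080843] RDX: ffff881e50b05930 RSI: ffffffffc1325c40 RDI: 0000000000000000
[16841.172820] [<ffffffffc12e7ad4>] ? mdt_lvbo_fill+0x74/0xa80 [mdt]
[16841.181467] [<ffffffffc0dc6852>] ldlm_server_completion_ast+0x242/0x9e0 [ptlrpc]
[16841.201642] [<ffffffffc0d98748>] ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
"""

# One call-trace entry: address, optional "?" marker, symbol+offset/size.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_oops(text):
    """Hypothetical helper: extract the key triage fields from an oops log."""
    fault = re.search(r"NULL pointer dereference at ([0-9a-f]+)", text)
    ip = re.search(r"IP: \[<[0-9a-f]+>\] (\w+)\+", text)
    # Only plain "REG: <16 hex digits>" pairs; RSP/RIP carry a segment prefix
    # and are deliberately not matched here.
    registers = {
        name: int(value, 16)
        for name, value in re.findall(r"\b(R[A-Z0-9]{2}): ([0-9a-f]{16})", text)
    }
    reliable, uncertain = [], []
    for line in text.splitlines():
        if "IP:" in line:  # the IP/RIP lines are not call-trace frames
            continue
        match = FRAME_RE.search(line)
        if match:
            target = uncertain if match.group(1) else reliable
            target.append(match.group(2))
    return {
        "fault_address": int(fault.group(1), 16) if fault else None,
        "faulting_function": ip.group(1) if ip else None,
        "registers": registers,
        "reliable_frames": reliable,
        "uncertain_frames": uncertain,
    }


if __name__ == "__main__":
    summary = summarize_oops(OOPS_EXCERPT)
    print(summary["faulting_function"], hex(summary["fault_address"]))
    print("RDI =", hex(summary["registers"]["RDI"]))
    print("reliable:", summary["reliable_frames"])
    print("uncertain:", summary["uncertain_frames"])
```

Running the same helper over the full trace from another ticket would give a quick side-by-side of faulting function and caller chain, which is the comparison needed when deciding whether two reports share a cause.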
Issue Links
- duplicates
- LU-11483 replay-dual test_25: ofd_lvbo_init()) ASSERTION( env ) failed
- Resolved
I'm not skeptical of what Oleg did; I think there's some confusion here, and it would be good to get Oleg to weigh in.
Essentially, exactly the same problem exists in two places in the code. One crashes with the signature given in LU-11483; the other crashes with the signature reported here in LU-11629. The same fix simply has to be applied in both places. I believe that's what Oleg was indicating by marking this as a duplicate of LU-11483.