[LU-11629] MDS panic under load - lu_context_key_get - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Duplicate
Priority: Critical
Fix Version/s: None
Affects Version/s: None
Labels: None
Severity: 3

Description

When running a very recent copy of master (only a few days old) under heavy load on real hardware, we hit MDS panics relatively easily.

Here's the basic crash signature:

[16840.816521] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[16840.827221] IP: [<ffffffffc0bae433>] lu_context_key_get+0x13/0x30 [obdclass]
[16840.837127] PGD 0
[16840.841095] Oops: 0000 [#1] SMP
[.....]
[16841.012065] CPU: 18 PID: 145031 Comm: ldlm_cn01_053 Tainted: G OE ------------ 3.10.0-693.21.1.x3.1.9.x86_64 #1
[16841.026546] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0021.032120170601 03/21/2017
[16841.040373] task: ffff880e3b296eb0 ti: ffff881ee0e44000 task.ti: ffff881ee0e44000
[16841.050648] RIP: 0010:[<ffffffffc0bae433>] [<ffffffffc0bae433>] lu_context_key_get+0x13/0x30 [obdclass]
[16841.063235] RSP: 0018:ffff881ee0e47a28 EFLAGS: 00010246
[16841.071023] RAX: 0000000000000014 RBX: ffff881dc15a2f40 RCX: ffff881ee0e47aac
[16841.080843] RDX: ffff881e50b05930 RSI: ffffffffc1325c40 RDI: 0000000000000000
[16841.090656] RBP: ffff881ee0e47a70 R08: ffff880fef20c000 R09: 0000000000000130
[16841.100468] R10: 0000000000000000 R11: ffff881e50b05800 R12: ffff881ee0e47aac
[16841.110290] R13: ffff881e50b05930 R14: ffff881dc15a2f40 R15: 0000000000000000
[16841.120113] FS: 0000000000000000(0000) GS:ffff88203df80000(0000) knlGS:0000000000000000
[16841.130986] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16841.139190] CR2: 0000000000000010 CR3: 0000000fdc790000 CR4: 00000000003607e0
[16841.148961] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[16841.158694] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[16841.168389] Call Trace:
[16841.172820] [<ffffffffc12e7ad4>] ? mdt_lvbo_fill+0x74/0xa80 [mdt]
[16841.181467] [<ffffffffc0dc6852>] ldlm_server_completion_ast+0x242/0x9e0 [ptlrpc]
[16841.191573] [<ffffffffc0dc6610>] ? ldlm_server_blocking_ast+0xa40/0xa40 [ptlrpc]
[16841.201642] [<ffffffffc0d98748>] ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
[16841.211100] [<ffffffffc0de062a>] ptlrpc_set_wait+0x7a/0x8d0 [ptlrpc]
[16841.219944] [<ffffffffc09ba2b8>] ? cfs_hash_bd_from_key+0x38/0xb0 [libcfs]
[16841.229338] [<ffffffff811e4d1d>] ? kmem_cache_alloc_node_trace+0x11d/0x210
[16841.238708] [<ffffffffc0b90e19>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[16841.248345] [<ffffffffc0d986a0>] ? ldlm_work_gl_ast_lock+0x3a0/0x3a0 [ptlrpc]
[16841.258047] [<ffffffffc0dd6e80>] ? ptlrpc_prep_set+0xc0/0x260 [ptlrpc]
[16841.267028] [<ffffffffc0d9e245>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[16841.275975] [<ffffffffc0d9f7a9>] __ldlm_reprocess_all+0x129/0x380 [ptlrpc]
[16841.285292] [<ffffffffc0d9fa10>] ldlm_reprocess_all+0x10/0x20 [ptlrpc]
[16841.294199] [<ffffffffc0dc3d3e>] ldlm_request_cancel+0x14e/0x740 [ptlrpc]
[16841.303307] [<ffffffffc0dc8ada>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
[16841.312233] [<ffffffffc0dc8dc8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
[16841.321356] [<ffffffffc0df9ccb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[16841.331372] [<ffffffffc0df6b55>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[16841.340401] [<ffffffffc0dfd5c4>] ptlrpc_main+0xaf4/0x1fa0 [ptlrpc]
[16841.348851] [<ffffffffc0dfcad0>] ? ptlrpc_register_service+0xf70/0xf70 [ptlrpc]
[16841.358516] [<ffffffff810b4031>] kthread+0xd1/0xe0

I think there's a good chance this is related to LU-11483 (https://jira.whamcloud.com/browse/LU-11483), but I haven't done detailed triage yet.

If someone from WC wants to take a look, I can make the vmcore available. (Someone from Cray will take a detailed look eventually, but we haven't had the chance yet.)
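For anyone triaging from the log alone, the register dump above already narrows things down a little. The following is a minimal sketch, not part of the original report: it parses a few lines copied from the oops and checks whether the faulting address (CR2) can be read as a small field offset from a NULL first argument. It assumes the x86-64 System V calling convention (first integer argument in RDI) and that RDI still holds that argument at the faulting instruction; the helper names `parse_registers` and `null_deref_offset` are made up for illustration.

```python
import re

# Excerpt of the register dump from the oops above (timestamps dropped).
OOPS = """\
RIP: 0010:[<ffffffffc0bae433>] [<ffffffffc0bae433>] lu_context_key_get+0x13/0x30 [obdclass]
RAX: 0000000000000014 RBX: ffff881dc15a2f40 RCX: ffff881ee0e47aac
RDX: ffff881e50b05930 RSI: ffffffffc1325c40 RDI: 0000000000000000
CR2: 0000000000000010 CR3: 0000000fdc790000 CR4: 00000000003607e0
"""


def parse_registers(text):
    """Return {register_name: int_value} for every 'NAME: <16 hex digits>' pair."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def null_deref_offset(regs, arg_reg="RDI"):
    """If arg_reg is NULL, return CR2 interpreted as a field offset, else None.

    Assumption: arg_reg still holds the function's first argument at the
    faulting instruction. This is a heuristic, not a proof.
    """
    if regs.get(arg_reg) == 0 and "CR2" in regs:
        return regs["CR2"]
    return None


regs = parse_registers(OOPS)
offset = null_deref_offset(regs)
print(f"RDI={regs['RDI']:#x} CR2={regs['CR2']:#x} offset={offset:#x}")
```

Under those assumptions, the dump is consistent with lu_context_key_get being handed a NULL pointer and reading a field 0x10 bytes into it. Which caller passed the NULL still has to be confirmed from the vmcore, since the mdt_lvbo_fill frame in the call trace is marked with "?" and may not be reliable.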

Issue Links

duplicates

LU-11483 replay-dual test_25: ofd_lvbo_init()) ASSERTION( env ) failed

Resolved

Activity

Peter Jones added a comment - 07/Nov/18 4:40 AM

Ah sorry, having now seen that extra context I understand. Thanks for pointing that out!

Patrick Farrell (Inactive) added a comment - 07/Nov/18 3:19 AM

I'm not skeptical of what Oleg did. I think there's some confusion here, and it would be good to get Oleg to weigh in.

Essentially, exactly the same problem exists in two places in the code. One crashes with the signature given in LU-11483; the other crashes with the signature reported here in LU-11629. It's simply that the same fix has to be applied in two places. I believe that's what Oleg was indicating by marking this as a duplicate of LU-11483.

Peter Jones added a comment - 07/Nov/18 12:39 AM

Patrick is skeptical that this is a duplicate of LU-11483, so I am reopening this until some testing has been run to prove or disprove that theory.

Justin Miller (Inactive) added a comment - 06/Nov/18 8:43 PM

Testing mail delivery for @paf

Patrick Farrell (Inactive) added a comment - 06/Nov/18 8:37 PM

OK, thanks!

Andreas Dilger added a comment - 06/Nov/18 6:23 PM

Please follow up in LU-11483.

Oleg Drokin added a comment - 06/Nov/18 6:08 PM

I believe they are the same.

People

Assignee: WC Triage

Reporter: Patrick Farrell (Inactive)

Votes: 0

Watchers: 5

Dates

Created: 06/Nov/18 4:18 PM

Updated: 07/Nov/18 4:40 AM

Resolved: 07/Nov/18 4:40 AM