Details
Bug
Resolution: Fixed
Minor
Description
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>
This issue relates to the following review-dne-selinux-ssk run: https://testing.whamcloud.com/test_sets/c4cb7246-fc38-11e9-9487-52540065bddc
The test failed when both onyx-66vm1 and onyx-66vm2 crashed during recovery-small test_136 with the same stack trace. It looks like the clients were trying to refresh the key after losing their connection to the server, and a kernel timer callback (ctx_upcall_timeout_kr() in the trace below) accessed invalid memory.
[ 7122.083618] Lustre: DEBUG MARKER: == recovery-small test 136: changelog_deregister leaving pending records ============================= 20:27:12 (1572553632)
[ 7165.453719] LNetError: 13554:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.2.5.166@tcp added to recovery queue. Health = 900
[ 7170.455640] LNetError: 13554:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.2.5.163@tcp added to recovery queue. Health = 900
[ 7193.457642] LNetError: 13554:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.2.5.163@tcp added to recovery queue. Health = 0
:
:
[ 7229.210628] Lustre: 13560:0:(sec_gss.c:688:gss_cli_ctx_handle_err_notify()) req x1648934820665600/t0, ctx ffff984b21bb9c00 idx 0xec1e62374fed889(0->lustre-MDT0002_UUID): server respond (00080000/00000000)
[ 7229.213867] Lustre: 13560:0:(sec_gss.c:720:gss_cli_ctx_handle_err_notify()) NO_CONTEXT: server might lost the context, retrying
[ 7229.216008] Lustre: 13560:0:(sec_gss.c:315:cli_ctx_expire()) ctx ffff984b21bb9c00(0->lustre-MDT0002_UUID) get expired: 1573158292(+604552s)
:
:
[ 7253.102691] Lustre: 30140:0:(sec_gss.c:315:cli_ctx_expire()) Skipped 1 previous similar message
[ 7253.205949] Lustre: DEBUG MARKER: keyctl show | grep lustre | cut -c1-11 | sed -e 's/ //g;' | xargs -IX keyctl setperm X 0x3f3f3f3f
[ 7269.340618] Lustre: 0:0:(gss_keyring.c:150:ctx_upcall_timeout_kr()) ctx ffff984b3fd03da0, key ffffffff91aaaac7
[ 7269.342435] general protection fault: 0000 [#1] SMP
[ 7269.358981] CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Tainted: G 3.10.0-957.27.2.el7.x86_64 #1
[ 7269.363007] RIP: 0010:[<ffffffffc0ba0605>] [<ffffffffc0ba0605>] sec2target_str+0x15/0xb0 [ptlrpc]
[ 7269.375062] Call Trace:
[ 7269.375510] <IRQ>
[ 7269.375906] [<ffffffffc0ce3a96>] cli_ctx_expire+0x96/0x120 [ptlrpc_gss]
[ 7269.377128] [<ffffffff91aaaac7>] ? __internal_add_timer+0xc7/0x130
[ 7269.378174] [<ffffffff91aaaac7>] ? __internal_add_timer+0xc7/0x130
[ 7269.379245] [<ffffffffc0cfe8d0>] ? ctx_unlist_kr+0xc0/0xc0 [ptlrpc_gss]
[ 7269.380365] [<ffffffffc0cfe955>] ctx_upcall_timeout_kr+0x85/0xd0 [ptlrpc_gss]
[ 7269.381580] [<ffffffff91aa91a8>] call_timer_fn+0x38/0x110
[ 7269.382504] [<ffffffffc0cfe8d0>] ? ctx_unlist_kr+0xc0/0xc0 [ptlrpc_gss]
[ 7269.383615] [<ffffffff91aab60d>] run_timer_softirq+0x24d/0x300
[ 7269.384609] [<ffffffff91aa2155>] __do_softirq+0xf5/0x280
[ 7269.385550] [<ffffffff9217a32c>] call_softirq+0x1c/0x30
[ 7269.386458] [<ffffffff91a2e675>] do_softirq+0x65/0xa0
[ 7269.387319] [<ffffffff91aa24d5>] irq_exit+0x105/0x110
[ 7269.388183] [<ffffffff9217b6e8>] smp_apic_timer_interrupt+0x48/0x60
[ 7269.389246] [<ffffffff92177df2>] apic_timer_interrupt+0x162/0x170
[ 7269.390273] <EOI>
[ 7269.390626] [<ffffffff9216bd70>] ? __cpuidle_text_start+0x8/0x8
[ 7269.391688] [<ffffffff9216bf9b>] ? native_safe_halt+0xb/0x20
[ 7269.392652] [<ffffffff9216bd8e>] default_idle+0x1e/0xc0
[ 7269.393550] [<ffffffff91a366f0>] arch_cpu_idle+0x20/0xc0
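One detail in the console log above may be worth checking when triaging: the "key" value printed by ctx_upcall_timeout_kr() is identical to the address of the __internal_add_timer frames in the call trace. That would be consistent with the timer callback reading a context whose contents are no longer valid, though it is only an observation from the log, not a confirmed diagnosis. The following is a minimal Python sketch of that comparison. The function name and both regular expressions are hypothetical helpers written for this note; they are not part of Lustre or Maloo.

```python
import re

# Two lines copied verbatim from the console log in this ticket.
LOG = """\
[ 7269.340618] Lustre: 0:0:(gss_keyring.c:150:ctx_upcall_timeout_kr()) ctx ffff984b3fd03da0, key ffffffff91aaaac7
[ 7269.377128] [<ffffffff91aaaac7>] ? __internal_add_timer+0xc7/0x130
"""

# Hypothetical patterns: the ctx/key debug message and a stack frame line.
KEY_RE = re.compile(r"ctx_upcall_timeout_kr\(\)\) ctx ([0-9a-f]+), key ([0-9a-f]+)")
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\] (\? )?(\S+)")


def key_matches_text_address(log):
    """Return the symbols of stack frames whose address equals the logged key."""
    key_msg = KEY_RE.search(log)
    if key_msg is None:
        return []
    key = key_msg.group(2)
    return [frame.group(3) for frame in FRAME_RE.finditer(log)
            if frame.group(1) == key]


if __name__ == "__main__":
    for symbol in key_matches_text_address(LOG):
        print("logged key equals the address of", symbol)
```

A non-empty result means the logged key pointer equals an address that also appears as a stack frame, which a valid key pointer would not be expected to do.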
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
recovery-small test_136 - onyx-66vm1, onyx-66vm2 crashed during recovery-small test_136