[LU-12928] recovery-small test_136: crash in sec2target_str() with review-dne-selinux-ssk
Created: 01/Nov/19
Updated: 08/Apr/20
Resolved: 18/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0, Lustre 2.12.4

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: Yang Sheng
Resolution: Fixed
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following review-dne-selinux-ssk run: https://testing.whamcloud.com/test_sets/c4cb7246-fc38-11e9-9487-52540065bddc

The test failed when both onyx-66vm1 and onyx-66vm2 crashed during recovery-small test_136 with the same stack trace. It looks like the clients were trying to refresh the key after losing their connection to the server, and the upcall timeout timer (ctx_upcall_timeout_kr()) then accessed invalid memory via cli_ctx_expire() and sec2target_str().
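
The timestamps in the console log that follows can be cross-checked against this reading. The sketch below is a rough illustration and not part of the original report. It assumes that the value printed by cli_ctx_expire() is the context's absolute expiry time in epoch seconds, and that the parenthesised figure is the time remaining when the message was printed. The constant and function names are made up for this example; the numbers are copied from the log.

```python
# Values copied from the console log of this ticket.
TEST_START_EPOCH = 1572553632    # epoch in the "test 136" DEBUG MARKER line
TEST_START_UPTIME = 7122.083618  # dmesg timestamp of that marker
EXPIRE_MSG_UPTIME = 7229.216008  # dmesg timestamp of the cli_ctx_expire() line
CTX_EXPIRY_EPOCH = 1573158292    # "get expired: 1573158292"
REMAINING_SECONDS = 604552       # "(+604552s)"


def wall_clock_at_expire_message():
    """Epoch time at which the expire message was printed (assumed meaning)."""
    return CTX_EXPIRY_EPOCH - REMAINING_SECONDS


def seconds_into_test_by_epoch():
    """Seconds between the test start marker and the expire message."""
    return wall_clock_at_expire_message() - TEST_START_EPOCH


def seconds_into_test_by_uptime():
    """The same interval, measured with the dmesg timestamps instead."""
    return EXPIRE_MSG_UPTIME - TEST_START_UPTIME


def remaining_days():
    """Lifetime the context still had left, in days."""
    return REMAINING_SECONDS / 86400


if __name__ == "__main__":
    print("seconds into test (epoch): ", seconds_into_test_by_epoch())
    print("seconds into test (uptime):", round(seconds_into_test_by_uptime(), 1))
    print("remaining lifetime (days): ", round(remaining_days(), 2))
```

Under those assumptions the two clocks agree to within a second, and the context still had roughly seven days of lifetime left when it was expired about 108 seconds into the test. That fits a context dropped because of the lost connection rather than one that reached its natural end of life.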

[ 7122.083618] Lustre: DEBUG MARKER: == recovery-small test 136: changelog_deregister leaving pending records ============================= 20:27:12 (1572553632)
[ 7165.453719] LNetError: 13554:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.2.5.166@tcp added to recovery queue. Health = 900
[ 7170.455640] LNetError: 13554:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.2.5.163@tcp added to recovery queue. Health = 900
[ 7193.457642] LNetError: 13554:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.2.5.163@tcp added to recovery queue. Health = 0
:
:
[ 7229.210628] Lustre: 13560:0:(sec_gss.c:688:gss_cli_ctx_handle_err_notify()) req x1648934820665600/t0, ctx ffff984b21bb9c00 idx 0xec1e62374fed889(0->lustre-MDT0002_UUID): server respond (00080000/00000000)
[ 7229.213867] Lustre: 13560:0:(sec_gss.c:720:gss_cli_ctx_handle_err_notify()) NO_CONTEXT: server might lost the context, retrying
[ 7229.216008] Lustre: 13560:0:(sec_gss.c:315:cli_ctx_expire()) ctx ffff984b21bb9c00(0->lustre-MDT0002_UUID) get expired: 1573158292(+604552s)
:
:
[ 7253.102691] Lustre: 30140:0:(sec_gss.c:315:cli_ctx_expire()) Skipped 1 previous similar message
[ 7253.205949] Lustre: DEBUG MARKER: keyctl show | grep lustre | cut -c1-11 |
				sed -e 's/ //g;' |
				xargs -IX keyctl setperm X 0x3f3f3f3f
[ 7269.340618] Lustre: 0:0:(gss_keyring.c:150:ctx_upcall_timeout_kr()) ctx ffff984b3fd03da0, key ffffffff91aaaac7
[ 7269.342435] general protection fault: 0000 [#1] SMP 
[ 7269.358981] CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Tainted: G   3.10.0-957.27.2.el7.x86_64 #1
[ 7269.363007] RIP: 0010:[<ffffffffc0ba0605>]  [<ffffffffc0ba0605>] sec2target_str+0x15/0xb0 [ptlrpc]
[ 7269.375062] Call Trace:
[ 7269.375510]  <IRQ> 
[ 7269.375906]  [<ffffffffc0ce3a96>] cli_ctx_expire+0x96/0x120 [ptlrpc_gss]
[ 7269.377128]  [<ffffffff91aaaac7>] ? __internal_add_timer+0xc7/0x130
[ 7269.378174]  [<ffffffff91aaaac7>] ? __internal_add_timer+0xc7/0x130
[ 7269.379245]  [<ffffffffc0cfe8d0>] ? ctx_unlist_kr+0xc0/0xc0 [ptlrpc_gss]
[ 7269.380365]  [<ffffffffc0cfe955>] ctx_upcall_timeout_kr+0x85/0xd0 [ptlrpc_gss]
[ 7269.381580]  [<ffffffff91aa91a8>] call_timer_fn+0x38/0x110
[ 7269.382504]  [<ffffffffc0cfe8d0>] ? ctx_unlist_kr+0xc0/0xc0 [ptlrpc_gss]
[ 7269.383615]  [<ffffffff91aab60d>] run_timer_softirq+0x24d/0x300
[ 7269.384609]  [<ffffffff91aa2155>] __do_softirq+0xf5/0x280
[ 7269.385550]  [<ffffffff9217a32c>] call_softirq+0x1c/0x30
[ 7269.386458]  [<ffffffff91a2e675>] do_softirq+0x65/0xa0
[ 7269.387319]  [<ffffffff91aa24d5>] irq_exit+0x105/0x110
[ 7269.388183]  [<ffffffff9217b6e8>] smp_apic_timer_interrupt+0x48/0x60
[ 7269.389246]  [<ffffffff92177df2>] apic_timer_interrupt+0x162/0x170
[ 7269.390273]  <EOI> 
[ 7269.390626]  [<ffffffff9216bd70>] ? __cpuidle_text_start+0x8/0x8
[ 7269.391688]  [<ffffffff9216bf9b>] ? native_safe_halt+0xb/0x20
[ 7269.392652]  [<ffffffff9216bd8e>] default_idle+0x1e/0xc0
[ 7269.393550]  [<ffffffff91a366f0>] arch_cpu_idle+0x20/0xc0
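
To read the chain that led to the fault, it helps to separate the frames the unwinder is sure about from the ones marked with "?", which are addresses found on the stack that may not belong to the active call chain. The following Python sketch is illustrative only: FRAME_RE and parse_frames() are hypothetical helpers written for this example, and SAMPLE is a subset of the lines above.

```python
import re

# Matches one call-trace entry such as
#   [<ffffffffc0ce3a96>] cli_ctx_expire+0x96/0x120 [ptlrpc_gss]
# A leading "?" marks an entry the unwinder could not confirm.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<guess>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)

SAMPLE = """\
[ 7269.363007] RIP: 0010:[<ffffffffc0ba0605>]  [<ffffffffc0ba0605>] sec2target_str+0x15/0xb0 [ptlrpc]
[ 7269.375906]  [<ffffffffc0ce3a96>] cli_ctx_expire+0x96/0x120 [ptlrpc_gss]
[ 7269.377128]  [<ffffffff91aaaac7>] ? __internal_add_timer+0xc7/0x130
[ 7269.380365]  [<ffffffffc0cfe955>] ctx_upcall_timeout_kr+0x85/0xd0 [ptlrpc_gss]
[ 7269.381580]  [<ffffffff91aa91a8>] call_timer_fn+0x38/0x110
"""


def parse_frames(trace_text):
    """Return one dict per trace entry, in the order the entries appear."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a stack entry (e.g. "Call Trace:" or "<IRQ>")
        frames.append({
            "addr": match.group("addr"),
            "symbol": match.group("symbol"),
            "module": match.group("module"),  # None for core kernel symbols
            "reliable": match.group("guess") is None,
        })
    return frames


if __name__ == "__main__":
    for frame in parse_frames(SAMPLE):
        if frame["reliable"]:
            print(frame["symbol"], frame["module"] or "(kernel)")
```

Read from bottom to top, the reliable frames show the timer core (call_timer_fn) invoking ctx_upcall_timeout_kr(), which calls cli_ctx_expire(), which faults in sec2target_str(). The message printed just before the fault also reports key ffffffff91aaaac7, the same address as the __internal_add_timer+0xc7 entries in the trace, which may indicate that the pointer no longer referred to a valid key.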

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
recovery-small test_136 - onyx-66vm1, onyx-66vm2 crashed during recovery-small test_136



 Comments   
Comment by Andreas Dilger [ 05/Nov/19 ]

+1 https://testing.whamcloud.com/test_sets/37042b6a-fddc-11e9-8e77-52540065bddc review-dne-selinux-ssk

Comment by Yang Sheng [ 07/Nov/19 ]

Patch submitted: https://review.whamcloud.com/#/c/36708/

Comment by Yang Sheng [ 07/Nov/19 ]

Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36708
Subject: LU-12928 gss: crash in sec2target_str()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 45d51db74eee6873f6e368d9581ac9a57fe44a62

Comment by Gerrit Updater [ 06/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36708/
Subject: LU-12928 gss: crash in sec2target_str()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5b40c9b90b44ddd0b042c12c10c65c9965a9856f

Comment by Peter Jones [ 06/Dec/19 ]

Landed for 2.14

Comment by Gerrit Updater [ 12/Dec/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36999
Subject: LU-12928 gss: crash in sec2target_str()
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: a170ee322d75e7a19998437adf441361b52d5b25

Comment by Gerrit Updater [ 13/Dec/19 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37009
Subject: LU-12928 tests: start running recovery-small 136
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8c46660ccd69a165cd1fe3919806b3ba46df8b5d

Comment by Gerrit Updater [ 03/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36999/
Subject: LU-12928 gss: crash in sec2target_str()
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: f46971cef33818dc1d91ac6ff511823b7091587d

Comment by Andreas Dilger [ 09/Jan/20 ]

The patch that enables the test to actually run has not landed yet.

Comment by Gerrit Updater [ 18/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37009/
Subject: LU-12928 tests: start running recovery-small 136
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b2b3afd34925efd4067031e1a18d63d7b4daa3ff
