[LU-13116] sanity-sec test_30b: touch skn Created: 07/Jan/20  Updated: 21/Apr/20  Resolved: 17/Mar/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Sebastien Buisson
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/37ce626a-22a6-11ea-bb75-52540065bddc

test_30b failed with the following error:

touch skn

After correctly setting skn flavor for all components, the client cannot create a simple file:

touch: setting times of '/mnt/lustre/d30b.sanity-sec/f0': Input/output error

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-sec test_30b - touch skn
sanity-sec test_30b - mkdir skn



 Comments   
Comment by Sebastien Buisson [ 07/Jan/20 ]

Maybe this is due to an issue in the previous test, test_30. I will push a patch to test this idea.

Comment by Gerrit Updater [ 07/Jan/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/37151
Subject: LU-13116 tests: properly clean keyring in sanity-sec test_30
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 900c06808c6286283d3921e987767ee32742eb98

Comment by James Nunez (Inactive) [ 08/Jan/20 ]

I'm not sure if this is the same issue or not, but, we see sanity-sec test 30b fail with 'mkdir skn' whne SSK enabled. This test started to fail with the mkdir error starting on 16 DEC 2019. Please see the following for more information
https://testing.whamcloud.com/test_sets/5e22a5a0-2041-11ea-b0f4-52540065bddc
https://testing.whamcloud.com/test_sets/43fcab28-31dd-11ea-971c-52540065bddc

If these are separate issues, I can open a new ticket for the mkdir error.

Comment by Sebastien Buisson [ 09/Jan/20 ]

The patch https://review.whamcloud.com/37151 is certainly helpful to cleanup sanity-sec test suite, but after more investigations the reason for test_30b failures is the following: on server side, the new srpc flavor is not taken into account. So clients send requests with the newly enforced flavor, but these requests get dropped on server side:

Jan 08 18:34:12 onyx-65vm9.onyx.whamcloud.com kernel: Lustre: 6597:0:(sec.c:2105:sptlrpc_target_export_check()) exp ffff8d9e622ce000(lustre-MDT0000): req ffff8d9e614af600 (1|0|0|1|0|0) with unauthorized flavor 22, expect 20222|0(+0)|0(+0)
Jan 08 18:34:12 onyx-65vm9.onyx.whamcloud.com kernel: LustreError: 6597:0:(service.c:2154:ptlrpc_server_handle_req_in()) @@@ DROPPING req with illegal security flavor  req@ffff8d9e614af600 x1655160908866112/t0(0) o400->aefb6a7b-4cfd-4@10.2.5.160@tcp:0/0 lens 224/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1 job:'kworker/0:3.0'

(above, flavor 22 is skn, 20222 is ski)

And, after a bunch of these, clients finally get evicted, and cannot reconnect because of incompatible srpc flavor:

Jan 08 18:34:58 onyx-65vm9.onyx.whamcloud.com kernel: Lustre: lustre-MDT0000: haven't heard from client b7fdc252-f2b2-4 (at 10.2.5.159@tcp) in 47 seconds. I think it's dead, and I am evicting it. exp ffff8d9e5ab80c00, cur 1578508498 expire 1578508468 last 1578508451
Jan 08 18:35:01 onyx-65vm9.onyx.whamcloud.com kernel: LustreError: 2578:0:(tgt_handler.c:929:tgt_connect_check_sptlrpc()) lustre-MDT0000: unauthorized rpc flavor 22 from 10.2.5.159@tcp, expect 20222

test_30b can fail on either mkdir or touch, it should be the same root cause.

I did not manage to reproduce the problem on my own test system, so I have triggered a number of test sessions in Maloo with various parameters, in the hope one of them would hit the bug. I still did not figure out how a server could possibly not take the new srpc flavor into account.

Comment by Sebastien Buisson [ 15/Jan/20 ]

Sometimes sanity-sec test_30 triggers an assertion failed:

[ 2768.772950] LustreError: 2528:0:(gss_keyring.c:838:gss_sec_lookup_ctx_kr()) failed request key: -126
[ 2768.773485] LustreError: 2528:0:(sec.c:451:sptlrpc_req_get_ctx()) req ffff96a3cdc90900: fail to get context
[ 2768.773984] LustreError: 2528:0:(lmv_obd.c:308:lmv_connect_mdc()) target seb-MDT0000_UUID connect error -111
[ 2768.774048] LustreError: 2563:0:(import.c:597:import_select_connection()) ASSERTION( dlmexp != ((void *)0) ) failed:
[ 2768.774050] LustreError: 2563:0:(import.c:597:import_select_connection()) LBUG
[ 2768.774052] Pid: 2563, comm: lctl 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018
[ 2768.774052] Call Trace:
[ 2768.774071]  [<ffffffffc06498ac>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 2768.774086]  [<ffffffffc064995c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 2768.774146]  [<ffffffffc0ac161c>] ptlrpc_connect_import_locked+0x128c/0x1330 [ptlrpc]
[ 2768.774168]  [<ffffffffc0ac16e1>] ptlrpc_connect_import+0x21/0x30 [ptlrpc]
[ 2768.774188]  [<ffffffffc0a9a498>] ptlrpc_recover_import+0x318/0x790 [ptlrpc]
[ 2768.774211]  [<ffffffffc0ac902c>] lprocfs_import_seq_write+0x3ec/0x520 [ptlrpc]
[ 2768.774217]  [<ffffffffc0d78919>] mdc_import_seq_write+0x19/0x20 [mdc]
[ 2768.774222]  [<ffffffff86a94d40>] proc_reg_write+0x40/0x80
[ 2768.774226]  [<ffffffff86a1f180>] vfs_write+0xc0/0x1f0
[ 2768.774227]  [<ffffffff86a1ffaf>] SyS_write+0x7f/0xf0
[ 2768.774232]  [<ffffffff86f2579b>] system_call_fastpath+0x22/0x27
[ 2768.774239]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 2768.774240] Kernel panic - not syncing: LBUG
[ 2768.774242] CPU: 4 PID: 2563 Comm: lctl Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.11.6.el7.x86_64 #1
[ 2768.774243] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 2768.774243] Call Trace:
[ 2768.774247]  [<ffffffff86f135d4>] dump_stack+0x19/0x1b
[ 2768.774250]  [<ffffffff86f0d11f>] panic+0xe8/0x21f
[ 2768.774254]  [<ffffffffc06499ab>] lbug_with_loc+0x9b/0xa0 [libcfs]
[ 2768.774274]  [<ffffffffc0ac161c>] ptlrpc_connect_import_locked+0x128c/0x1330 [ptlrpc]
[ 2768.774292]  [<ffffffffc0a610bf>] ? import_set_conn+0x31f/0x7a0 [ptlrpc]
[ 2768.774313]  [<ffffffffc0ac16e1>] ptlrpc_connect_import+0x21/0x30 [ptlrpc]
[ 2768.774332]  [<ffffffffc0a9a498>] ptlrpc_recover_import+0x318/0x790 [ptlrpc]
[ 2768.774353]  [<ffffffffc0ac902c>] lprocfs_import_seq_write+0x3ec/0x520 [ptlrpc]
[ 2768.774356]  [<ffffffffc0d78919>] mdc_import_seq_write+0x19/0x20 [mdc]
[ 2768.774358]  [<ffffffff86a94d40>] proc_reg_write+0x40/0x80
[ 2768.774359]  [<ffffffff86a1f180>] vfs_write+0xc0/0x1f0
[ 2768.774361]  [<ffffffff86f256e1>] ? system_call_after_swapgs+0xae/0x146
[ 2768.774363]  [<ffffffff86a1ffaf>] SyS_write+0x7f/0xf0
[ 2768.774364]  [<ffffffff86f256e1>] ? system_call_after_swapgs+0xae/0x146
[ 2768.774366]  [<ffffffff86f2579b>] system_call_fastpath+0x22/0x27
[ 2768.774368]  [<ffffffff86f256e1>] ? system_call_after_swapgs+0xae/0x146

This assertion should be replaced with test on dlmexp, and simply return an error in case it is NULL.

Comment by Gerrit Updater [ 15/Jan/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/37251
Subject: LU-13116 ptlrpc: return error for conn with NULL export
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 983af71915a0652c4eb4f615db0f6d4ae7e64a67

Comment by Sebastien Buisson [ 28/Jan/20 ]

I think I finally managed to find the reason why sanity-sec test_30b is failing. In fact it happens when recovery-small is launched before sanity-sec, and the problem is due in particular to recovery-small test_23 which restarts the combined MGS/MDS.

There is a bug in the Lustre code that makes the MGC that is used by the MDS server, lose the reference on the sptlrpc config log when the colocated MGS is stopped. As a consequence, even once the MGS is restarted, the MDS does not hold any reference on the sptlrpc config log anymore, and does not get any subsequent update of this llog. It explains why the newly set srpc flavor in sanity-sec test_30b is not taken into account by the MDS server, which refuses requests from the clients that do apply the new flavor.

It seems this bug was introduced by patch 0ad54d5977 ("LU-11185 mgc: config lock leak").

Comment by Gerrit Updater [ 28/Jan/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/37344
Subject: LU-13116 mgc: do not lose sptlrpc config lock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e6b9eac6a8ad0ad26e628e66b20cbdef34acf637

Comment by Gerrit Updater [ 17/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37344/
Subject: LU-13116 mgc: do not lose sptlrpc config lock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 26e8f1137b82b26dca0219f58543ea6edd9a30ae

Comment by Gerrit Updater [ 17/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37251/
Subject: LU-13116 ptlrpc: return error for conn with NULL export
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c54dc02d51e76146f3f273196f47204a2fb48345

Comment by Gerrit Updater [ 17/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37151/
Subject: LU-13116 tests: properly clean keyring in sanity-sec test_30
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e628b7cd06ce0092b3003d27a2f45e6a0b7574d9

Comment by Peter Jones [ 17/Mar/20 ]

Looks like everything has landed for 2.14

Generated at Sat Feb 10 02:58:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.