Details
-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
None
-
3
-
9223372036854775807
Description
This issue was created by maloo for S Buisson <sbuisson@ddn.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/37ce626a-22a6-11ea-bb75-52540065bddc
test_30b failed with the following error:
touch skn
After correctly setting skn flavor for all components, the client cannot create a simple file:
touch: setting times of '/mnt/lustre/d30b.sanity-sec/f0': Input/output error
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-sec test_30b - touch skn
sanity-sec test_30b - mkdir skn
Attachments
Activity
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37151/
Subject: LU-13116 tests: properly clean keyring in sanity-sec test_30
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e628b7cd06ce0092b3003d27a2f45e6a0b7574d9
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37251/
Subject: LU-13116 ptlrpc: return error for conn with NULL export
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c54dc02d51e76146f3f273196f47204a2fb48345
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37344/
Subject: LU-13116 mgc: do not lose sptlrpc config lock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 26e8f1137b82b26dca0219f58543ea6edd9a30ae
Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/37344
Subject: LU-13116 mgc: do not lose sptlrpc config lock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e6b9eac6a8ad0ad26e628e66b20cbdef34acf637
I think I finally managed to find the reason why sanity-sec test_30b is failing. In fact it happens when recovery-small is launched before sanity-sec, and the problem is due in particular to recovery-small test_23 which restarts the combined MGS/MDS.
There is a bug in the Lustre code that makes the MGC that is used by the MDS server, lose the reference on the sptlrpc config log when the colocated MGS is stopped. As a consequence, even once the MGS is restarted, the MDS does not hold any reference on the sptlrpc config log anymore, and does not get any subsequent update of this llog. It explains why the newly set srpc flavor in sanity-sec test_30b is not taken into account by the MDS server, which refuses requests from the clients that do apply the new flavor.
It seems this bug was introduced by patch 0ad54d5977 ("LU-11185 mgc: config lock leak").
Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/37251
Subject: LU-13116 ptlrpc: return error for conn with NULL export
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 983af71915a0652c4eb4f615db0f6d4ae7e64a67
Sometimes sanity-sec test_30 triggers an assertion failed:
[ 2768.772950] LustreError: 2528:0:(gss_keyring.c:838:gss_sec_lookup_ctx_kr()) failed request key: -126 [ 2768.773485] LustreError: 2528:0:(sec.c:451:sptlrpc_req_get_ctx()) req ffff96a3cdc90900: fail to get context [ 2768.773984] LustreError: 2528:0:(lmv_obd.c:308:lmv_connect_mdc()) target seb-MDT0000_UUID connect error -111 [ 2768.774048] LustreError: 2563:0:(import.c:597:import_select_connection()) ASSERTION( dlmexp != ((void *)0) ) failed: [ 2768.774050] LustreError: 2563:0:(import.c:597:import_select_connection()) LBUG [ 2768.774052] Pid: 2563, comm: lctl 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018 [ 2768.774052] Call Trace: [ 2768.774071] [<ffffffffc06498ac>] libcfs_call_trace+0x8c/0xc0 [libcfs] [ 2768.774086] [<ffffffffc064995c>] lbug_with_loc+0x4c/0xa0 [libcfs] [ 2768.774146] [<ffffffffc0ac161c>] ptlrpc_connect_import_locked+0x128c/0x1330 [ptlrpc] [ 2768.774168] [<ffffffffc0ac16e1>] ptlrpc_connect_import+0x21/0x30 [ptlrpc] [ 2768.774188] [<ffffffffc0a9a498>] ptlrpc_recover_import+0x318/0x790 [ptlrpc] [ 2768.774211] [<ffffffffc0ac902c>] lprocfs_import_seq_write+0x3ec/0x520 [ptlrpc] [ 2768.774217] [<ffffffffc0d78919>] mdc_import_seq_write+0x19/0x20 [mdc] [ 2768.774222] [<ffffffff86a94d40>] proc_reg_write+0x40/0x80 [ 2768.774226] [<ffffffff86a1f180>] vfs_write+0xc0/0x1f0 [ 2768.774227] [<ffffffff86a1ffaf>] SyS_write+0x7f/0xf0 [ 2768.774232] [<ffffffff86f2579b>] system_call_fastpath+0x22/0x27 [ 2768.774239] [<ffffffffffffffff>] 0xffffffffffffffff [ 2768.774240] Kernel panic - not syncing: LBUG [ 2768.774242] CPU: 4 PID: 2563 Comm: lctl Kdump: loaded Tainted: G OE ------------ 3.10.0-862.11.6.el7.x86_64 #1 [ 2768.774243] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 [ 2768.774243] Call Trace: [ 2768.774247] [<ffffffff86f135d4>] dump_stack+0x19/0x1b [ 2768.774250] [<ffffffff86f0d11f>] panic+0xe8/0x21f [ 2768.774254] [<ffffffffc06499ab>] lbug_with_loc+0x9b/0xa0 [libcfs] [ 2768.774274] [<ffffffffc0ac161c>] ptlrpc_connect_import_locked+0x128c/0x1330 [ptlrpc] [ 2768.774292] [<ffffffffc0a610bf>] ? import_set_conn+0x31f/0x7a0 [ptlrpc] [ 2768.774313] [<ffffffffc0ac16e1>] ptlrpc_connect_import+0x21/0x30 [ptlrpc] [ 2768.774332] [<ffffffffc0a9a498>] ptlrpc_recover_import+0x318/0x790 [ptlrpc] [ 2768.774353] [<ffffffffc0ac902c>] lprocfs_import_seq_write+0x3ec/0x520 [ptlrpc] [ 2768.774356] [<ffffffffc0d78919>] mdc_import_seq_write+0x19/0x20 [mdc] [ 2768.774358] [<ffffffff86a94d40>] proc_reg_write+0x40/0x80 [ 2768.774359] [<ffffffff86a1f180>] vfs_write+0xc0/0x1f0 [ 2768.774361] [<ffffffff86f256e1>] ? system_call_after_swapgs+0xae/0x146 [ 2768.774363] [<ffffffff86a1ffaf>] SyS_write+0x7f/0xf0 [ 2768.774364] [<ffffffff86f256e1>] ? system_call_after_swapgs+0xae/0x146 [ 2768.774366] [<ffffffff86f2579b>] system_call_fastpath+0x22/0x27 [ 2768.774368] [<ffffffff86f256e1>] ? system_call_after_swapgs+0xae/0x146
This assertion should be replaced with test on dlmexp, and simply return an error in case it is NULL.
The patch https://review.whamcloud.com/37151 is certainly helpful to cleanup sanity-sec test suite, but after more investigations the reason for test_30b failures is the following: on server side, the new srpc flavor is not taken into account. So clients send requests with the newly enforced flavor, but these requests get dropped on server side:
Jan 08 18:34:12 onyx-65vm9.onyx.whamcloud.com kernel: Lustre: 6597:0:(sec.c:2105:sptlrpc_target_export_check()) exp ffff8d9e622ce000(lustre-MDT0000): req ffff8d9e614af600 (1|0|0|1|0|0) with unauthorized flavor 22, expect 20222|0(+0)|0(+0) Jan 08 18:34:12 onyx-65vm9.onyx.whamcloud.com kernel: LustreError: 6597:0:(service.c:2154:ptlrpc_server_handle_req_in()) @@@ DROPPING req with illegal security flavor req@ffff8d9e614af600 x1655160908866112/t0(0) o400->aefb6a7b-4cfd-4@10.2.5.160@tcp:0/0 lens 224/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1 job:'kworker/0:3.0'
(above, flavor 22 is skn, 20222 is ski)
And, after a bunch of these, clients finally get evicted, and cannot reconnect because of incompatible srpc flavor:
Jan 08 18:34:58 onyx-65vm9.onyx.whamcloud.com kernel: Lustre: lustre-MDT0000: haven't heard from client b7fdc252-f2b2-4 (at 10.2.5.159@tcp) in 47 seconds. I think it's dead, and I am evicting it. exp ffff8d9e5ab80c00, cur 1578508498 expire 1578508468 last 1578508451 Jan 08 18:35:01 onyx-65vm9.onyx.whamcloud.com kernel: LustreError: 2578:0:(tgt_handler.c:929:tgt_connect_check_sptlrpc()) lustre-MDT0000: unauthorized rpc flavor 22 from 10.2.5.159@tcp, expect 20222
test_30b can fail on either mkdir or touch, it should be the same root cause.
I did not manage to reproduce the problem on my own test system, so I have triggered a number of test sessions in Maloo with various parameters, in the hope one of them would hit the bug. I still did not figure out how a server could possibly not take the new srpc flavor into account.
I'm not sure if this is the same issue or not, but, we see sanity-sec test 30b fail with 'mkdir skn' whne SSK enabled. This test started to fail with the mkdir error starting on 16 DEC 2019. Please see the following for more information
https://testing.whamcloud.com/test_sets/5e22a5a0-2041-11ea-b0f4-52540065bddc
https://testing.whamcloud.com/test_sets/43fcab28-31dd-11ea-971c-52540065bddc
If these are separate issues, I can open a new ticket for the mkdir error.
Looks like everything has landed for 2.14