[LU-13116] sanity-sec test_30b: touch skn Created: 07/Jan/20 Updated: 21/Apr/20 Resolved: 17/Mar/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Sebastien Buisson |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by Maloo for S Buisson <sbuisson@ddn.com>.

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/37ce626a-22a6-11ea-bb75-52540065bddc

test_30b failed with the following error: touch skn

After correctly setting the skn flavor for all components, the client cannot create a simple file:

touch: setting times of '/mnt/lustre/d30b.sanity-sec/f0': Input/output error |
| Comments |
| Comment by Sebastien Buisson [ 07/Jan/20 ] |
|
Maybe this is due to an issue in the previous test, test_30. I will push a patch to test this idea. |
| Comment by Gerrit Updater [ 07/Jan/20 ] |
|
Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/37151 |
| Comment by James Nunez (Inactive) [ 08/Jan/20 ] |
|
I'm not sure if this is the same issue or not, but we see sanity-sec test 30b fail with 'mkdir skn' when SSK is enabled. This test started failing with the mkdir error on 16 Dec 2019. Please see the following for more information. If these are separate issues, I can open a new ticket for the mkdir error. |
| Comment by Sebastien Buisson [ 09/Jan/20 ] |
|
The patch https://review.whamcloud.com/37151 is certainly helpful to clean up the sanity-sec test suite, but after more investigation, the reason for the test_30b failures is the following: on the server side, the new srpc flavor is not taken into account. So clients send requests with the newly enforced flavor, but these requests get dropped on the server side:

Jan 08 18:34:12 onyx-65vm9.onyx.whamcloud.com kernel: Lustre: 6597:0:(sec.c:2105:sptlrpc_target_export_check()) exp ffff8d9e622ce000(lustre-MDT0000): req ffff8d9e614af600 (1|0|0|1|0|0) with unauthorized flavor 22, expect 20222|0(+0)|0(+0)
Jan 08 18:34:12 onyx-65vm9.onyx.whamcloud.com kernel: LustreError: 6597:0:(service.c:2154:ptlrpc_server_handle_req_in()) @@@ DROPPING req with illegal security flavor  req@ffff8d9e614af600 x1655160908866112/t0(0) o400->aefb6a7b-4cfd-4@10.2.5.160@tcp:0/0 lens 224/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1 job:'kworker/0:3.0'

(Above, flavor 22 is skn, and 20222 is ski.)

After a bunch of these, clients finally get evicted, and cannot reconnect because of the incompatible srpc flavor:

Jan 08 18:34:58 onyx-65vm9.onyx.whamcloud.com kernel: Lustre: lustre-MDT0000: haven't heard from client b7fdc252-f2b2-4 (at 10.2.5.159@tcp) in 47 seconds. I think it's dead, and I am evicting it. exp ffff8d9e5ab80c00, cur 1578508498 expire 1578508468 last 1578508451
Jan 08 18:35:01 onyx-65vm9.onyx.whamcloud.com kernel: LustreError: 2578:0:(tgt_handler.c:929:tgt_connect_check_sptlrpc()) lustre-MDT0000: unauthorized rpc flavor 22 from 10.2.5.159@tcp, expect 20222

test_30b can fail on either mkdir or touch; it should be the same root cause. I did not manage to reproduce the problem on my own test system, so I have triggered a number of test sessions in Maloo with various parameters, in the hope that one of them would hit the bug. I still have not figured out how a server could possibly fail to take the new srpc flavor into account. |
| Comment by Sebastien Buisson [ 15/Jan/20 ] |
|
Sometimes sanity-sec test_30 triggers an assertion failure:

[ 2768.772950] LustreError: 2528:0:(gss_keyring.c:838:gss_sec_lookup_ctx_kr()) failed request key: -126
[ 2768.773485] LustreError: 2528:0:(sec.c:451:sptlrpc_req_get_ctx()) req ffff96a3cdc90900: fail to get context
[ 2768.773984] LustreError: 2528:0:(lmv_obd.c:308:lmv_connect_mdc()) target seb-MDT0000_UUID connect error -111
[ 2768.774048] LustreError: 2563:0:(import.c:597:import_select_connection()) ASSERTION( dlmexp != ((void *)0) ) failed:
[ 2768.774050] LustreError: 2563:0:(import.c:597:import_select_connection()) LBUG
[ 2768.774052] Pid: 2563, comm: lctl 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018
[ 2768.774052] Call Trace:
[ 2768.774071] [<ffffffffc06498ac>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 2768.774086] [<ffffffffc064995c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 2768.774146] [<ffffffffc0ac161c>] ptlrpc_connect_import_locked+0x128c/0x1330 [ptlrpc]
[ 2768.774168] [<ffffffffc0ac16e1>] ptlrpc_connect_import+0x21/0x30 [ptlrpc]
[ 2768.774188] [<ffffffffc0a9a498>] ptlrpc_recover_import+0x318/0x790 [ptlrpc]
[ 2768.774211] [<ffffffffc0ac902c>] lprocfs_import_seq_write+0x3ec/0x520 [ptlrpc]
[ 2768.774217] [<ffffffffc0d78919>] mdc_import_seq_write+0x19/0x20 [mdc]
[ 2768.774222] [<ffffffff86a94d40>] proc_reg_write+0x40/0x80
[ 2768.774226] [<ffffffff86a1f180>] vfs_write+0xc0/0x1f0
[ 2768.774227] [<ffffffff86a1ffaf>] SyS_write+0x7f/0xf0
[ 2768.774232] [<ffffffff86f2579b>] system_call_fastpath+0x22/0x27
[ 2768.774239] [<ffffffffffffffff>] 0xffffffffffffffff
[ 2768.774240] Kernel panic - not syncing: LBUG
[ 2768.774242] CPU: 4 PID: 2563 Comm: lctl Kdump: loaded Tainted: G OE ------------ 3.10.0-862.11.6.el7.x86_64 #1
[ 2768.774243] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 2768.774243] Call Trace:
[ 2768.774247] [<ffffffff86f135d4>] dump_stack+0x19/0x1b
[ 2768.774250] [<ffffffff86f0d11f>] panic+0xe8/0x21f
[ 2768.774254] [<ffffffffc06499ab>] lbug_with_loc+0x9b/0xa0 [libcfs]
[ 2768.774274] [<ffffffffc0ac161c>] ptlrpc_connect_import_locked+0x128c/0x1330 [ptlrpc]
[ 2768.774292] [<ffffffffc0a610bf>] ? import_set_conn+0x31f/0x7a0 [ptlrpc]
[ 2768.774313] [<ffffffffc0ac16e1>] ptlrpc_connect_import+0x21/0x30 [ptlrpc]
[ 2768.774332] [<ffffffffc0a9a498>] ptlrpc_recover_import+0x318/0x790 [ptlrpc]
[ 2768.774353] [<ffffffffc0ac902c>] lprocfs_import_seq_write+0x3ec/0x520 [ptlrpc]
[ 2768.774356] [<ffffffffc0d78919>] mdc_import_seq_write+0x19/0x20 [mdc]
[ 2768.774358] [<ffffffff86a94d40>] proc_reg_write+0x40/0x80
[ 2768.774359] [<ffffffff86a1f180>] vfs_write+0xc0/0x1f0
[ 2768.774361] [<ffffffff86f256e1>] ? system_call_after_swapgs+0xae/0x146
[ 2768.774363] [<ffffffff86a1ffaf>] SyS_write+0x7f/0xf0
[ 2768.774364] [<ffffffff86f256e1>] ? system_call_after_swapgs+0xae/0x146
[ 2768.774366] [<ffffffff86f2579b>] system_call_fastpath+0x22/0x27
[ 2768.774368] [<ffffffff86f256e1>] ? system_call_after_swapgs+0xae/0x146

This assertion should be replaced with a test on dlmexp, simply returning an error in case it is NULL. |
| Comment by Gerrit Updater [ 15/Jan/20 ] |
|
Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/37251 |
| Comment by Sebastien Buisson [ 28/Jan/20 ] |
|
I think I finally managed to find the reason why sanity-sec test_30b is failing. In fact it happens when recovery-small is launched before sanity-sec, and the problem is due in particular to recovery-small test_23, which restarts the combined MGS/MDS. There is a bug in the Lustre code that makes the MGC used by the MDS server lose its reference on the sptlrpc config log when the co-located MGS is stopped. As a consequence, even once the MGS is restarted, the MDS no longer holds any reference on the sptlrpc config log, and does not get any subsequent update of this llog. This explains why the newly set srpc flavor in sanity-sec test_30b is not taken into account by the MDS server, which refuses requests from the clients that do apply the new flavor. It seems this bug was introduced by patch 0ad54d5977 (" |
| Comment by Gerrit Updater [ 28/Jan/20 ] |
|
Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/37344 |
| Comment by Gerrit Updater [ 17/Mar/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37344/ |
| Comment by Gerrit Updater [ 17/Mar/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37251/ |
| Comment by Gerrit Updater [ 17/Mar/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37151/ |
| Comment by Peter Jones [ 17/Mar/20 ] |
|
Looks like everything has landed for 2.14 |