[LU-13343] recovery-small test_140a: unable to mount /mnt/lustre2 on MDS
Created: 09/Mar/20  Updated: 01/Feb/24  Resolved: 09/Jul/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: Sebastien Buisson
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-13342 recovery-small test_140a and test_140... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/1b932583-bcd8-4933-9b6a-2d70a0b7aecc

test_140a failed with the following error on the client console log:

[  245.535452] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock,skpath=/tmp/test-framework-keys trevis-3vm9@tcp:/lustre /mnt/lustre2
[  245.868341] LustreError: 15982:0:(gss_keyring.c:862:gss_sec_lookup_ctx_kr()) failed request key: -126
[  245.869838] LustreError: 15982:0:(sec.c:449:sptlrpc_req_get_ctx()) req ffff9fb5ed3c3600: fail to get context
[  245.871460] LustreError: 15982:0:(lmv_obd.c:306:lmv_connect_mdc()) target lustre-MDT0000_UUID connect error -111
[  245.872973] LustreError: 15982:0:(llite_lib.c:320:client_common_fill_super()) cannot connect to lustre-clilmv-ffff9fb5f8a2d000: rc = -111
[  245.882513] LustreError: 15982:0:(lov_obd.c:824:lov_cleanup()) lustre-clilov-ffff9fb5f8a2d000: lov tgt 0 not cleaned! deathrow=0, lovrc=1
[  245.891188] Lustre: Unmounted lustre-client
[  245.892122] LustreError: 15982:0:(obd_mount.c:1681:lustre_fill_super()) Unable to mount  (-111)
[  246.144872] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  recovery-small test_140a: @@@@@@ FAIL: unable to mount \/mnt\/lustre2 on MDS 
[  246.335604] Lustre: DEBUG MARKER: recovery-small test_140a: @@@@@@ FAIL: unable to mount /mnt/lustre2 on MDS

It looks like the mount2 setup is missing the security setup.
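For reading the return codes in the log above: on Linux, -126 is ENOKEY and -111 is ECONNREFUSED, so the log appears to show the key lookup failing first and the connect error following from it. A minimal sketch that decodes these codes; the table is hardcoded with Linux values (an assumption, since errno numbering is platform-specific), and `describe_rc` is a hypothetical helper, not part of Lustre:

```python
# Linux errno values for the two return codes seen in the console log.
# Hardcoded on purpose: errno numbering differs between platforms.
LINUX_ERRNO = {
    111: ("ECONNREFUSED", "Connection refused"),
    126: ("ENOKEY", "Required key not available"),
}


def describe_rc(rc: int) -> str:
    """Turn a negative kernel-style return code into a readable string."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in table"))
    return f"rc = {rc}: {name} ({text})"


if __name__ == "__main__":
    # The two codes reported by gss_sec_lookup_ctx_kr() and lmv_connect_mdc().
    for rc in (-126, -111):
        print(describe_rc(rc))
```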

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
recovery-small test_140a - unable to mount /mnt/lustre2 on MDS



 Comments   
Comment by Andreas Dilger [ 09/Mar/20 ]

This test appears to have been failing since 2020-01-23: about 160 failures in total, as many as 8 per day.

Comment by Andreas Dilger [ 09/Mar/20 ]

Per LU-13342, Sebastien writes:

recovery-small test_140a and test_140b use a 'local client', i.e. a client mounted on a server node, which is not compatible with the SSK keys installed by the test framework.

So just skip these tests when running with SSK.

Comment by Gerrit Updater [ 09/Mar/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/37832
Subject: LU-13343 tests: skip recovery-small test_140 with SSK
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4c37186c40eee52e8f25aec66c73c4ba5d39432f

Comment by Gerrit Updater [ 10/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37832/
Subject: LU-13343 tests: skip recovery-small test_140 with SSK
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: df5b71a5e8031929d816b19a2f045bfa40388b5b

Comment by Andreas Dilger [ 11/Mar/20 ]

I was going to resolve this issue as "fixed" rather than marking it "always_except" and keeping it open, because my understanding is that local mounts would/should never use SSK/Kerberos as there is no transport security needed for memcpy() within the node.

That said, should there be an internal check for connections from the local NID to skip the gss_sec_lookup_ctx_kr() code/check so that this doesn't generate an error at all? Otherwise, it seems like if SSK is configured on a server it would prevent local mounts from working at all? Am I misunderstanding the issue here?

Comment by Sebastien Buisson [ 12/Mar/20 ]

The problem is only in the test framework. There is no restriction on using SSK with local clients (i.e. clients mounted on server nodes); it just requires installing the client key on these nodes as well.
At the moment the test framework does not handle this situation: each node has a dedicated role, and is either an MDS, an OSS, or a client.

This is why I pushed this patch to skip the few tests that use local clients when SSK is enabled. If the number of such cases increases, we would want to have SSK keys installed properly wherever they are needed.
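The skip condition described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `KEYS_BY_ROLE`, `has_client_key` and `should_skip` are hypothetical names, the key types are simplified to "server" and "client", and none of this is the actual test-framework code.

```python
# Hypothetical model of which SSK key types are installed for each node role.
# Each node has exactly one role in the current test framework.
KEYS_BY_ROLE = {
    "mds": {"server"},
    "oss": {"server"},
    "client": {"client"},
}


def has_client_key(roles):
    """True if a node holding these roles ends up with a client key installed."""
    return any("client" in KEYS_BY_ROLE[role] for role in roles)


def should_skip(uses_local_client, ssk_enabled, server_roles=("mds",)):
    """Skip a test that mounts a client on a server node lacking the client key."""
    return uses_local_client and ssk_enabled and not has_client_key(server_roles)
```

With the default single-role MDS node, `should_skip(True, True)` is true; giving the node a client key as well (`server_roles=("mds", "client")`) would make the skip unnecessary.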

Comment by Andreas Dilger [ 12/Mar/20 ]

What I'm asking about is the reverse - if SSK is enabled on the servers, it seems to me that it should be possible to mount a local client without the need to configure SSK for that client. I can't see any benefit to SSK/KRB for a local client mount, since the data doesn't go over the network, and the server itself can verify that the NID of the client is local.

Comment by Sebastien Buisson [ 12/Mar/20 ]

On the one hand, you are raising an interesting point. There is no real added value of checking SSK/KRB credentials for a local node.

On the other hand, one of the purposes of strong authentication is to define roles for nodes. When you install credentials on nodes (MDS, OSS, client), you explicitly assign them a role, meaning you want to prevent nodes from being re-purposed. So if we implement what you suggest, we would weaken this principle, with the initial intention of making local mounts easier.

I am not against what you suggest, but we have to be aware of the implications.

Comment by Andreas Dilger [ 12/Mar/20 ]

For clients mounting a local filesystem on the server for data movement or protocol re-export there is a need for the local client mount. I agree that the admin could configure the "client" for this local mount, but even if that was not inconvenient for the admin, it would still hurt performance due to encryption overhead for both the "client" and the "server" running on the same node for no real benefit, so we would likely recommend against using KRB/SSK for local connections.

Comment by Sebastien Buisson [ 12/Mar/20 ]

Understood. So I will work on a patch to add an internal check for connections from the local NID to skip the gss_sec_lookup_ctx_kr() code/check. Please leave this ticket open so that we keep in mind we already have recovery-small test_140a and test_140b that are making use of local clients.
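The proposed check could be sketched roughly as follows. This is an illustrative model only, not the patch itself: `select_flavor` is a hypothetical function, the NID strings are made-up examples, and the real decision would live in the kernel's sptlrpc/GSS code rather than in Python.

```python
def select_flavor(peer_nid, local_nids, configured_flavor):
    """Pick the security flavor for a connection to peer_nid.

    If the peer is one of this node's own NIDs, the connection is a
    loopback one, so no GSS context lookup is needed and the "null"
    flavor is returned. Otherwise the configured flavor applies.
    """
    if peer_nid in local_nids:
        return "null"
    return configured_flavor


if __name__ == "__main__":
    # Hypothetical NIDs for a server that also mounts a local client.
    local = {"10.0.0.5@tcp"}
    print(select_flavor("10.0.0.5@tcp", local, "skpi"))  # loopback peer
    print(select_flavor("10.0.0.9@tcp", local, "skpi"))  # remote peer
```

The design point from the discussion is that the server can verify locally that the peer NID is its own, so skipping the flavor avoids both the key setup and the encryption overhead for traffic that never leaves the node.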

Comment by Gerrit Updater [ 04/Mar/22 ]

"Sebastien Buisson <sbuisson@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46704
Subject: LU-13343 gss: no sec flavor on loopback connection
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 635d0545b57e19c4139c48789d0d1cb009ab09a6

Comment by Gerrit Updater [ 08/Mar/22 ]

"Sebastien Buisson <sbuisson@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46736
Subject: LU-13343 debug: debug recovery-small 140
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0332c6a72d770eabb136774b4da512ae2d75c231

Comment by Gerrit Updater [ 11/May/23 ]

"Sebastien Buisson <sbuisson@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50941
Subject: LU-13343 dbg: print debug traces
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 77f3505e20acec948fccf24c4ee8ce9aae1a78b8

Comment by Gerrit Updater [ 08/Jul/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/46704/
Subject: LU-13343 gss: no sec flavor on loopback connection
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e3e91ea95fd96a5eafc598e3812390b4cbac05c3

Comment by Peter Jones [ 09/Jul/23 ]

Landed for 2.16

Comment by Gerrit Updater [ 10/Jul/23 ]

"Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51616
Subject: LU-13343 gss: no sec flavor on loopback connection
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 3947a3b87798dae5b225b8df2d8effaed0e9c933
