[LU-13739] mount fails with SSK keys Created: 01/Jul/20  Updated: 08/Jul/20  Resolved: 08/Jul/20

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Joe Frith Assignee: Jeremy Filizetti
Resolution: Duplicate Votes: 0
Labels: None
Environment:

RHEL7


Issue Links:
Related
is related to LU-13754 GSS-based authentication fails on Cen... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Client mount fails with the error messages below. This is with "lustre02.srpc.flavor.tcp.cli2mdt=skpi" and "lustre02.srpc.flavor.tcp.cli2ost=skpi" set.

The server and clients have the correct SSK keys, and the keyring shows them as loaded.

Jul 1 17:11:54 zabbix01 kernel: LustreError: 22972:0:(lmv_obd.c:315:lmv_connect_mdc()) target lustre02-MDT0000_UUID connect error -1
Jul 1 17:11:54 zabbix01 kernel: LustreError: 22972:0:(lmv_obd.c:315:lmv_connect_mdc()) Skipped 1 previous similar message
Jul 1 17:11:54 zabbix01 kernel: LustreError: 22972:0:(llite_lib.c:292:client_common_fill_super()) cannot connect to lustre02-clilmv-ffff8964b4dea800: rc = -1
Jul 1 17:11:54 zabbix01 kernel: LustreError: 22972:0:(llite_lib.c:292:client_common_fill_super()) Skipped 1 previous similar message
Jul 1 17:11:54 zabbix01 kernel: LustreError: 22972:0:(lov_obd.c:839:lov_cleanup()) lustre02-clilov-ffff8964b4dea800: lov tgt 0 not cleaned! deathrow=0, lovrc=1
Jul 1 17:11:54 zabbix01 kernel: LustreError: 22972:0:(lov_obd.c:839:lov_cleanup()) Skipped 13 previous similar messages
Jul 1 17:11:54 zabbix01 kernel: LustreError: 22972:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount (-1)
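
For reference, flavor rules like those quoted above are normally pushed from the MGS; a minimal sketch of how they would typically be set (the exact conf_param invocations are an assumption, not taken from this report):

[root@mgs ~]# lctl conf_param lustre02.srpc.flavor.tcp.cli2mdt=skpi
[root@mgs ~]# lctl conf_param lustre02.srpc.flavor.tcp.cli2ost=skpi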



 Comments   
Comment by Joe Frith [ 01/Jul/20 ]

On the client  

 

[root@ ~]# keyctl show
Session Keyring
422481900 --alswrv 0 0 keyring: _ses
51709085 --alswrv 0 65534 _ keyring: _uid.0
180533941 --alswrv 0 0 _ user: lustre:lustre02
[root@ ~]# keyctl pipe 180533941 | lgss_sk -r -
Version: 1
Type: client
HMAC alg: sha256
Crypto alg: ctr(aes)
Ctx Expiration: 604800 seconds
Shared keylen: 256 bits
Prime length: 2048 bits
File system: lustre02
MGS NIDs:
Nodemap name: default
Shared key:
0000: 5900 0856 8af8 5fbc 2549 168f 8a29 56a8 Y..V.._.%I...)V.
0010: 858e 3bd9 5fd1 23af 780f b92b 3bcc c406 ..;._.#.x..+;...
Prime (p):
0000: f470 ae37 4a92 24bd d442 9456 2cf8 9cbd .p.7J.$..B.V,...
0010: 4dc6 400c 76b8 9edc c823 18e5 86f9 a0ba M.@.v....#......
0020: 70e2 72ba f5bf 4320 0386 a047 e772 2567 p.r...C ...G.r%g
0030: 8b65 10ef 758d e8a3 0441 bc0c 8b36 2be9 .e..u....A...6+.
0040: c38f cbdc 20ea 7461 890e c59b c948 f964 .... .ta.....H.d
0050: 8dd8 3891 4947 cb34 93d9 4150 1f4a 7eae ..8.IG.4..AP.J~.
0060: 65c1 d7b4 6e3b 274a 753f d0af 242a 8e10 e...n;'Ju?..$*..
0070: 9055 3ad5 b195 856c c7b8 b9f0 2b34 666a .U:....l....+4fj
0080: e8fc 2988 5f77 ced9 cbc0 2911 179b c1d9 ..)._w....).....
0090: 0717 e8d7 a14c 14f6 4907 fa0c 3de9 fffa .....L..I...=...
00a0: f524 623c 6664 fa20 4246 c1f3 1c06 27cc .$b<fd. BF....'.
00b0: ea75 7d10 8804 3489 88fd 392f 5c89 284b .u}...4...9/\.(K
00c0: 0aee 6df5 5471 95a7 6e1d 669c 658f e848 ..m.Tq..n.f.e..H
00d0: f74b 15be 4a1e fbc1 8fcc 78ed 87c5 1abe .K..J.....x.....
00e0: 028a 66cd e230 d6f9 8e8b f9e9 9cfe 6013 ..f..0........`.
00f0: c6d5 fcae b2be d59f 9375 8beb 8564 ad63 .........u...d.c

 

On the server side -

 [root@ ~]# keyctl show
Session Keyring
13605391 --alswrv 0 0 keyring: _ses
861388744 --alswrv 0 65534 _ keyring: _uid.0
77605995 --alswrv 0 0 _ user: lustre:lustre02:default
[root@ ~]# keyctl pipe 77605995 | lgss_sk -r -
Version: 1
Type: server
HMAC alg: sha256
Crypto alg: ctr(aes)
Ctx Expiration: 604800 seconds
Shared keylen: 256 bits
Prime length: 2048 bits
File system: lustre02
MGS NIDs:
Nodemap name: default
Shared key:
0000: 5900 0856 8af8 5fbc 2549 168f 8a29 56a8 Y..V.._.%I...)V.
0010: 858e 3bd9 5fd1 23af 780f b92b 3bcc c406 ..;._.#.x..+;...
[root@ ~]#
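
For context, user keys named like the ones above are what lgss_sk -l installs into the session keyring; a minimal sketch, with hypothetical key file paths:

[root@client ~]# lgss_sk -l /etc/lustre/lustre02.client.key   # should create the "lustre:lustre02" user key shown above
[root@mds ~]# lgss_sk -l /etc/lustre/lustre02.server.key      # should create the "lustre:lustre02:default" user key shown above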

Comment by Jeremy Filizetti [ 06/Jul/20 ]

Do you have the MDS server logs as well?

Comment by Joe Frith [ 06/Jul/20 ]

I see this on the MDS server. 

 

kernel: LustreError: 233128:0:(tgt_handler.c:921:tgt_connect_check_sptlrpc()) lustre02-MDT0000: unauthorized rpc flavor 0 from x.x.x.x@tcp, expect 22

Comment by Jeremy Filizetti [ 06/Jul/20 ]

This seems to indicate the client is sending the wrong RPC flavor, SPTLRPC_POLICY_NULL. For some reason your client isn't pulling the correct info out of the MGS llog. What does "lctl get_param mgs.MGS.live.*" say?
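
On the client side, the flavor actually negotiated for each import can usually be checked as well; a sketch, assuming the srpc_info parameter name (it may differ between Lustre versions):

[root@client ~]# lctl get_param mdc.lustre02-MDT0000-mdc-*.srpc_info
[root@client ~]# lctl get_param osc.lustre02-OST*.srpc_info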

Comment by Joe Frith [ 06/Jul/20 ]

[root@mds ~]# lctl get_param mgs.MGS.live.*
mgs.MGS.live.lustre02=
fsname: lustre02
flags: 0x20 gen: 58
lustre02-MDT0000
lustre02-OST0000
lustre02-OST0001
lustre02-OST0002
lustre02-OST0003
lustre02-OST0004
lustre02-OST0005
lustre02-OST0006

Secure RPC Config Rules:
lustre02.srpc.flavor.tcp.cli2mdt=skpi

imperative_recovery_state:
state: full
nonir_clients: 0
nidtbl_version: 37
notify_duration_total: 0.000238135
notify_duation_max: 0.000238135
notify_count: 1
mgs.MGS.live.params=
fsname: params
flags: 0x20 gen: 1

Secure RPC Config Rules:

imperative_recovery_state:
state: full
nonir_clients: 0
nidtbl_version: 2
notify_duration_total: 0.000000000
notify_duation_max: 0.000000000
notify_count: 0
[root@mds ~]#

Comment by Joe Frith [ 07/Jul/20 ]

The ptlrpc_gss kernel module was not loaded on the client. After fixing this, the mount still hangs. I am seeing the messages below in the client logs, and nothing on the server side.

 

Jul 6 22:06:31 zabbix01 kernel: Lustre: 22730:0:(gss_svc_upcall.c:1149:gss_init_svc_upcall()) Init channel is not opened by lsvcgssd, following request might be dropped until lsvcgssd is active
Jul 6 22:06:31 zabbix01 kernel: Key type lgssc registered
Jul 6 22:06:42 zabbix01 kernel: Lustre: 22734:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1594087591/real 1594087591] req@ffff8af7b3270000 x1671521988773568/t0(0) o801->lustre02-MDT0000-mdc-ffff8af635af1000@10.42.41.33@tcp:12/10 lens 224/224 e 0 to 1 dl 1594087602 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Jul 6 22:06:42 zabbix01 kernel: LustreError: 22734:0:(gss_keyring.c:1409:gss_kt_update()) negotiation: rpc err -85, gss err 0
Jul 6 22:06:42 zabbix01 lgss_keyring: [22734]:ERROR:do_nego_rpc(): status: -110 (Connection timed out)
Jul 6 22:06:42 zabbix01 lgss_keyring: [22734]:ERROR:lgssc_negotiation_manual(): negotiation rpc error -85
Jul 6 22:06:42 zabbix01 lgss_keyring: [22734]:ERROR:lgssc_kr_negotiate_manual(): key 18d782c4: failed to negotiate
Jul 6 22:06:42 zabbix01 kernel: Lustre: 22734:0:(sec_gss.c:315:cli_ctx_expire()) ctx ffff8af7b4d19d00(0->lustre02-MDT0000_UUID) get expired: 1594087791(+189s)
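
For reference, the missing module mentioned above can be checked for and loaded manually; a trivial sketch:

[root@client ~]# lsmod | grep ptlrpc_gss   # verify whether the GSS module is loaded
[root@client ~]# modprobe ptlrpc_gss       # load it if it is missing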

Comment by Jeremy Filizetti [ 07/Jul/20 ]

Can you confirm that the lsvcgss service is running on the server side and that you have "-s" included in /etc/sysconfig/lsvcgss?

Comment by Joe Frith [ 07/Jul/20 ]

[root@starmds01 ~]# cat /etc/sysconfig/lsvcgss

# Optional arguments passed to lsvcgssd.
LSVCGSSDARGS='-s -m'

[root@starmds01 ~]# systemctl status lsvcgss.service
● lsvcgss.service - SYSV: start and stop the lsvcgssd daemon
Loaded: loaded (/etc/rc.d/init.d/lsvcgss; bad; vendor preset: disabled)
Active: active (running) since Mon 2020-07-06 22:53:25 EDT; 18h ago
Docs: man:systemd-sysv-generator(8)
Process: 204274 ExecStart=/etc/rc.d/init.d/lsvcgss start (code=exited, status=0/SUCCESS)
CGroup: /system.slice/lsvcgss.service
└─204278 /usr/sbin/lsvcgssd -s -m

Jul 06 22:53:25 starmds01 systemd[1]: Starting SYSV: start and stop the lsvcgssd daemon...
Jul 06 22:53:25 starmds01 lsvcgss[204274]: Starting lsvcgssd
Jul 06 22:53:25 starmds01 systemd[1]: Started SYSV: start and stop the lsvcgssd daemon.
[root@starmds01 ~]#

Comment by Jeremy Filizetti [ 07/Jul/20 ]

Can you add "-vv" to LSVCGSSDARGS as well and then send the output from the syslog on the MDS? If possible it'd be nice to have the output of "lctl dk" as well.
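
A sketch of the requested steps, assuming the sysconfig file and service name shown earlier in this thread (the output path is hypothetical):

# /etc/sysconfig/lsvcgss
LSVCGSSDARGS='-s -m -vv'

[root@mds ~]# systemctl restart lsvcgss
[root@mds ~]# lctl dk > /tmp/mds-lustre-debug.log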

Comment by Joe Frith [ 08/Jul/20 ]

lctl dk shows the following when the client is connecting.

 

00010000:00080000:2.0F:1594173014.717485:0:235886:0:(ldlm_lib.c:1227:target_handle_connect()) MGS: connection from 17390dcc-e9ea-2781-83dc-708eda5b1b60@130.199.148.189@tcp t0 exp (null) cur 539645 last 0
00000020:00000080:2.0:1594173014.717501:0:235886:0:(genops.c:1417:class_connect()) connect: client 17390dcc-e9ea-2781-83dc-708eda5b1b60, cookie 0x9c7d9c60f6ed33e9
00000020:01000000:2.0:1594173014.717507:0:235886:0:(lprocfs_status_server.c:491:lprocfs_exp_setup()) using hash ffff99b7cfe94c00
00000100:00080000:2.0:1594173014.717523:0:235886:0:(import.c:86:import_set_state_nolock()) ffff99e7d97b6000 : changing import state from RECOVER to FULL
00000100:02000000:2.0:1594173014.717527:0:235886:0:(import.c:1597:ptlrpc_import_recovery_state_machine()) MGS: Connection restored to e3959aaf-45a8-6dda-c117-ced5724d4e0b (at 130.199.148.189@tcp)
20000000:01000000:2.0:1594173014.776757:0:235886:0:(mgs_nids.c:632:mgs_get_ir_logs()) Reading IR log lustre02-cliir bufsize 1048576.
20000000:01000000:2.0:1594173014.776763:0:235886:0:(mgs_nids.c:192:mgs_nidtbl_read()) fsname lustre02, entry size 32, pages 4064/1/256/255.
20000000:01000000:2.0:1594173014.776766:0:235886:0:(mgs_nids.c:204:mgs_nidtbl_read()) Read IR logs lustre02 return with 32, version 64

 

Even with the -vv option added, I see only the lines below in the MDS syslog.

 

Jul 7 21:49:20 starmds01 kernel: Lustre: MGS: Connection restored to e3959aaf-45a8-6dda-c117-ced5724d4e0b (at 130.199.148.189@tcp)
Jul 7 21:49:20 starmds01 kernel: Lustre: Skipped 14 previous similar messages

Comment by Jeremy Filizetti [ 08/Jul/20 ]

It looks like I can reproduce this with CentOS 7.8; is that the version you are using? If you add "sec" to the debug mask on the MDS, do you see something like the following?

[root@tr-mds-1 ~]# echo '+sec' > /sys/kernel/debug/lnet/debug
[root@tr-mds-1 ~]# lctl dk | grep gss
02000000:08000000:5.0:1594175878.715795:0:31012:0:(sec_gss.c:1991:gss_svc_handle_init()) processing gss init(1) request from 172.26.0.250@tcp
02000000:08000000:5.0:1594175878.715810:0:31012:0:(gss_svc_upcall.c:960:gss_svc_upcall_handle_init()) cache_check return ENOENT, drop
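
As an aside (an assumption about equivalent tooling, not something stated in this thread), lctl should be able to set the same debug mask and dump the buffer:

[root@mds ~]# lctl set_param debug=+sec   # same effect as the echo above
[root@mds ~]# lctl clear                  # optionally empty the buffer before reproducing
[root@mds ~]# lctl dk | grep gss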

Comment by Joe Frith [ 08/Jul/20 ]

Yes, see below. I am using RHEL 7.8. Is there a workaround or quick fix?

 

[root@starmds01 ~]# lctl dk | grep gss
02000000:08000000:22.0F:1594176204.563936:0:236049:0:(sec_gss.c:1991:gss_svc_handle_init()) processing gss init(1) request from 130.199.148.189@tcp
02000000:08000000:22.0:1594176204.563953:0:236049:0:(gss_svc_upcall.c:960:gss_svc_upcall_handle_init()) cache_check return ENOENT, drop
02000000:08000000:8.0F:1594176215.608251:0:235969:0:(sec_gss.c:1991:gss_svc_handle_init()) processing gss init(1) request from 130.199.148.189@tcp
02000000:08000000:8.0:1594176215.608265:0:235969:0:(gss_svc_upcall.c:960:gss_svc_upcall_handle_init()) cache_check return ENOENT, drop
[root@starmds01 ~]#

Comment by Jeremy Filizetti [ 08/Jul/20 ]

There were some additional changes to the sunrpc code incorporated into RHEL 7.8 from upstream that assume the cache files get closed repeatedly. Since Lustre's lsvcgssd just opens the file once and keeps it open, the last_close variable in the cache_detail struct (rsi_cache) is not being updated, which is why we are seeing the issue. I'll see if at some point today I can take a look at how the NFS code uses the cache, to make sure my assumptions aren't wrong, and then create a patch for Lustre.

Comment by Jeremy Filizetti [ 08/Jul/20 ]

There is a patch that Sebastien has already posted for LU-13754 that should work. I thought additional patches were needed, but it was just something messed up in my test environment. If you have any issues after testing the patch, please file them under the other bug.
