Details
Bug
Resolution: Fixed
Major
Description
When I mount Lustre with GSS (SSK) enabled on a multi-rail client, I get checksum errors; with only a single interface configured, I do not. My guess is that the NID is encoded in the checksum, though I haven't dug into the cause yet. I also saw lots of errors when using GSS on multi-rail servers, although those errors were different.
[154311.786639] LustreError: 194908:0:(gss_sk_mech.c:388:sk_verify_hmac()) checksum mismatch
[154311.798154] LustreError: 194908:0:(sec_gss.c:242:gss_verify_msg()) mic verify error: 00060000
[154311.810015] LustreError: 194908:0:(sec_gss.c:2125:gss_svc_verify_request()) failed to verify request: 60000
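To illustrate the guess above, here is a minimal Python sketch of how a keyed checksum that covers the sender's NID would fail verification when a message arrives from a different interface. This is not Lustre's sk_verify_hmac() implementation: the message layout, the key, the NIDs, and the function names sign() and verify() are all hypothetical, chosen only to show the failure mode under the assumption that the NID is part of the HMAC input.

```python
import hashlib
import hmac


def sign(key: bytes, sender_nid: str, payload: bytes) -> bytes:
    """Compute an HMAC over the sender NID and the payload.

    Hypothetical layout: NID, a NUL separator, then the payload.
    """
    message = sender_nid.encode() + b"\0" + payload
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, expected_nid: str, payload: bytes, mic: bytes) -> bool:
    """Recompute the HMAC with the NID the receiver expects and compare."""
    return hmac.compare_digest(sign(key, expected_nid, payload), mic)


# Hypothetical values, for illustration only.
key = b"shared-secret-key"
payload = b"mount request"
primary_nid = "192.168.1.10@tcp"
secondary_nid = "192.168.2.10@tcp"

# Single interface: sender and receiver agree on the NID.
single_rail_ok = verify(key, primary_nid, payload,
                        sign(key, primary_nid, payload))

# Multi-rail: the message is signed with the second interface's NID,
# but the receiver verifies against the primary NID.
multi_rail_ok = verify(key, primary_nid, payload,
                       sign(key, secondary_nid, payload))

print("single rail verified:", single_rail_ok)
print("multi-rail verified:", multi_rail_ok)
```

If the real checksum did cover a per-interface NID in this way, any request sent over a non-primary rail would be rejected exactly as in the log above, while a single-interface client would be unaffected.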
I managed to reproduce a similar issue on my test cluster. After tuning Linux routing as explained on the wiki page at https://wiki.whamcloud.com/display/LNet/MR+Cluster+Setup, I formatted a simple Lustre file system made of 3 servers (1 MGS, 1 MDS, 1 OSS) and 1 client. All nodes use Ethernet and have the same network configuration:
With this configuration, we enable LNet Multirail on tcp0.
This file system works fine without SSK enabled. When I enable SSK (skpi flavor for cli2ost connections), the client fails to mount, and the following messages appear on the OSS side:
After running git bisect I identified the commit that introduces this problem:
This commit is part of the merge of the origin/multi-rail branch, which landed just after the 2.14.50 tag was created. So the master branch has had this behavior since very shortly after 2.14.0 was released. The good news is that 2.14.0 itself is not impacted.
ashehata, ssmirnov: do you see how this patch could affect the way peers present themselves to others? My understanding was that the primary NID was always used as the unique identifier of the connection. Do you think this commit could change that paradigm? Or could it make the multi-rail implementation more effective, for instance by switching between rails more often?
Thanks,
Sebastien.