Details
Description
It appears that in some cases, when the Gerrit Janitor is starting up a large number of test sessions at one time, there can be occasional errors with the client processing the MGS config log. Rather than retrying the llog read, the client returns an error and the mount fails completely:
[ 54.413858] Lustre: Lustre: Build Version: 2.15.63_2_g37bb2f4
[ 55.114449] LNet: Added LNI 192.168.201.60@tcp [8/256/0/180]
[ 55.119888] LNet: Accept secure, port 988
[ 56.831735] Key type lgssc registered
[ 57.948899] Lustre: Echo OBD driver; http://www.lustre.org/
[ 148.089328] Lustre: 5403:0:(client.c:2343:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1716253832/real 1716253832] req@ffff8800ab486680 x1799622480626496/t0(0) o503->MGC192.168.201.160@tcp@192.168.201.160@tcp:26/25 lens 272/8416 e 0 to 1 dl 1716253848 ref 3 fl Rpc:ReXQU/200/ffffffff rc 0/-1 job:'' uid:0 gid:0
[ 148.099734] LustreError: 166-1: MGC192.168.201.160@tcp: Connection to MGS (at 192.168.201.160@tcp) was lost; in progress operations using this service will fail
[ 148.116413] LustreError: 5403:0:(mgc_request.c:1953:mgc_process_config()) MGC192.168.201.160@tcp: can't process params llog: rc = -5
[ 148.120094] Lustre: MGC192.168.201.160@tcp: Connection restored to (at 192.168.201.160@tcp)
[ 148.128188] LustreError: 15c-8: MGC192.168.201.160@tcp: Confguration from log lustre-client failed from MGS -5. Communication error between node & MGS, a bad configuration, or other errors. See syslog for more info
[ 148.160555] Lustre: Unmounted lustre-client
[ 148.179518] LustreError: 5403:0:(super25.c:189:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -5
This can be seen several times with patch https://review.whamcloud.com/54840 "LU-16692 tests: force_new_seq_all interop version checking", which modifies several test scripts and triggered 575 separate test jobs to run at the same time:
https://testing.whamcloud.com/gerrit-janitor/42749/results.html
There were 8 similar failures among those jobs, one of them in sanityn where mounting the second mount point failed, so its status looks slightly different and it has more useful logs. However, I suspect something similar could happen on a large production cluster mounting thousands of clients at the same time.
Rather than giving up on mounting completely, it makes sense to have a limited retry mechanism, either at the PtlRPC level and/or at the mount.lustre level. It may be that the llog RPCs intentionally do not use RPC retry to avoid blocking recovery, but I haven't looked at the code, and it would make sense to at least retry such an important RPC a few times before giving up. The gerrit-janitor sanityn test results should have full client debug logs for review (it isn't clear whether the server even sees the LLOG_ORIGIN_HANDLE_READ_HEADER = 503 RPC that the client sent).
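To make the kernel-side retry idea concrete, here is a rough sketch of a bounded retry around the MGC config llog processing. It is only illustrative: the helper name, retry limit, and retry policy are made up, and it assumes mgc_process_log() keeps its current (obd_device, config_llog_data) arguments; the real change would live near mgc_process_config() in mgc_request.c.

#include <linux/errno.h>

/* Hypothetical helper, not in the current code: retry the config llog
 * read a few times on transient communication errors instead of failing
 * the whole client mount with -EIO. */
#define MGC_LLOG_MAX_RETRIES	3

static int mgc_process_log_retry(struct obd_device *mgc,
				 struct config_llog_data *cld)
{
	int retries = 0;
	int rc;

	do {
		rc = mgc_process_log(mgc, cld);
		/* Only retry errors that look transient; real configuration
		 * problems should still fail immediately, and the bounded
		 * count keeps a dead MGS from hanging the mount forever. */
		if (rc != -EIO && rc != -ETIMEDOUT)
			break;
	} while (++retries < MGC_LLOG_MAX_RETRIES);

	return rc;
}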
The "lightweight" option would be to retry the failed RPC in the kernel without tearing down the whole mountpoint. The "heavyweight" option (but easier to implement) is to set mount_lustre.c::parse_options() to set mop->mo_retry = 5 or similar for client mounts (before the "retry=N" option is parsed!) but don't retry multiple times during server failover and slow down the recovery (which will be retried externally by HA anyway).