Here is the timeline of events according to the Lustre debug log. The number at the start of each entry is the offset in seconds from the start of the mount operation.
+0 Client sent MGS_CONNECT req to primary MGS node with timeout set to (obd_timeout/20 + adaptive_timeout), which was 20 seconds in our test case.
+0 Client sent LDLM_ENQUEUE req to MGS node with rq_delay_limit set to 5 seconds. This request is for sptlrpc. The send is delayed because the import is still in the connecting state.
+5 The above req failed once its delay limit expired (see the rq_delay_limit sketch after this timeline), but this is not fatal.
+5 Client sent another LDLM_ENQUEUE req to MGS node with rq_delay_limit set to MGC_ENQUEUE_LIMIT, which is hard-coded to 50 seconds.
+20 MGS_CONNECT timed out.
+55 The second LDLM_ENQUEUE req failed once its delay limit expired. This fails the whole client mount with error -5.
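For context on rq_delay_limit: our reading is that a request which cannot be sent because its import is not yet fully connected is parked on the delayed list, and once rq_delay_limit seconds pass it is failed rather than resent. The sketch below only paraphrases that behaviour; the struct and function names are made up for illustration and are not the real ptlrpc code.

    #include <time.h>

    /* Hypothetical, simplified model of rq_delay_limit; NOT the real ptlrpc code. */
    struct fake_delayed_req {
            time_t queued_time;   /* when the req was parked on the delayed list         */
            int    delay_limit;   /* 0 = wait forever, otherwise give up after N seconds */
    };

    static int fake_delayed_req_expired(const struct fake_delayed_req *req, time_t now)
    {
            /* A limit of 5 parked at +0 expires at +5; a limit of 50 parked at +5
             * expires at +55, matching the timeline above. */
            return req->delay_limit != 0 &&
                   now >= req->queued_time + req->delay_limit;
    }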
The problem here is that, after MGS_CONNECT failed to reach the primary MGS, the client never got a chance to connect to the secondary before the mount failed. We know that selecting a different MGS node is triggered by the pinger, which runs at an (obd_timeout/4) interval. Since we increased obd_timeout to 300, that interval is now 75 seconds, so the connection to the secondary cannot happen before the second LDLM_ENQUEUE req fails at +55.
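Putting numbers on it (illustrative arithmetic only, assuming obd_timeout = 300 and roughly 5 seconds of adaptive timeout as seen in our test):

    #include <stdio.h>

    int main(void)
    {
            int obd_timeout = 300;                         /* our test setting                     */
            int adaptive    = 5;                           /* approximate adaptive-timeout portion */

            int connect_to  = obd_timeout / 20 + adaptive; /* MGS_CONNECT timeout                  */
            int enqueue_to  = 5 + 50;                      /* 2nd enqueue sent at +5, 50 s limit   */
            int pinger_ivl  = obd_timeout / 4;             /* pinger failover interval             */

            printf("MGS_CONNECT times out at      +%d s\n", connect_to);  /* +20 */
            printf("2nd LDLM_ENQUEUE gives up at  +%d s\n", enqueue_to);  /* +55 */
            printf("pinger tries secondary MGS at +%d s\n", pinger_ivl);  /* +75 */
            /* 75 > 55, so the mount fails with -5 before the secondary MGS is tried. */
            return 0;
    }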
The solution we propose is to define MGC_ENQUEUE_LIMIT relative to obd_timeout instead of as a hard-coded value. With that change, the second LDLM_ENQUEUE will wait long enough for the connection to the secondary MGS node to be established and then go through.
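A minimal sketch of what we have in mind, assuming the limit stays where MGC_ENQUEUE_LIMIT is currently defined in the mgc code; the exact formula and slack below are placeholders, not a final patch:

    /* current value: hard-coded and too short once obd_timeout is raised */
    /* #define MGC_ENQUEUE_LIMIT 50 */

    /* proposed (placeholder formula): give the pinger time to fail over to the
     * secondary MGS (it retries every obd_timeout / 4 seconds) plus some slack
     * for the resent enqueue itself. */
    #define MGC_ENQUEUE_LIMIT (obd_timeout / 4 + obd_timeout / 20 + 10)

With obd_timeout = 300 this works out to 75 + 15 + 10 = 100 seconds, comfortably past the +75 point where the pinger first tries the secondary MGS.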
Very sorry about the delay, will investigate