I rebased NASA Ames's nas-2.7.2 branch to b2_7_fe on Aug 20, which picked up 58 patches up to this one:
LU-8019 llite: Restore proper opencache operations
The resulting client code hung while mounting a Lustre filesystem, with this error:
LustreError: 15c-8: MGC10.151.26.117@o2ib: The configuration from log 'nbp1-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
The culprit has been identified as this commit (change 17527):
LU-7210 o2iblnd: take extra refcount in kiblnd_connreq_done
We have carried that patch since May 4th (nas-2.7.1-2nas) to address our lnet disconnect/reconnect problem. It showed no problem until we rebased to b2_7_fe (9f42af2).
I noticed that b2_7_fe does not include change 17527. On the other hand, b2_8_fe includes two patches from LU-7210: change 17527 and change 17004. So I cherry-picked #17004 into our nas-2.7.2, but the mount problem persisted. When I reverted 17527 instead, the rebased code mounted fine.
However, I am concerned that the disconnect/reconnect problem we hit before will come back if I back out 17527. Please advise. Thanks!
Thank you, Doug, for your investigation. We will revert 17527 in our nas-2.7.2.