I have seen this symptom on CentOS 6.3 with Lustre 2.1.4. (I don't have a newer configuration installed to try.) Lustre is set up using o2ib. The clients and all Lustre nodes have IPoIB enabled plus an Ethernet connection. The clients are generally busy full-time, which is to say that when a client shutdown is initiated it is likely that at least some processes have a Lustre directory or file open (the current working directory of a process, if nothing else).
When the client hangs, there is no overt explanation - nothing in the syslog, etc. However, using IPMI to watch the client's virtual console showed that "/etc/init.d/rdma stop" was where the shutdown would hang. It would print that it was "Unloading OpenIB kernel modules" but it could not succeed because one (or more? I forget) of the OpenIB modules was in use. It would hang at that point.
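To check which modules are still holding an OpenIB module before shutdown, something like the following may help. It is a minimal sketch: module_users is a hypothetical helper name, and it only filters lsmod-format text for one module and prints that module's use count and "Used by" list. Which module to inspect (rdma_cm in the usage line) is an assumption, not something confirmed on this configuration.

```shell
#!/bin/sh
# module_users MODULE
# Hypothetical helper: reads lsmod-format text on stdin and prints the
# use count and the "Used by" list for MODULE (nothing if it is not loaded).
module_users() {
    # NR > 1 skips the "Module Size Used by" header line
    awk -v mod="$1" 'NR > 1 && $1 == mod { print $3, $4 }'
}
```

Usage, while Lustre is still mounted: lsmod | module_users rdma_cm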
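As a hack, changing /etc/init.d/rdma so that it calls lustre_rmmod first usually prevents the hang; specifically, the original line "stop()" becomes
stop()
{
    # Unload the Lustre modules first so they no longer keep the OpenIB modules in use
    [ -x /usr/sbin/lustre_rmmod ] && /usr/sbin/lustre_rmmod
    real_stop
}
# The original body of stop() follows unchanged under the new name
real_stop()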
This may or may not be related, but since it may be the symmetric issue at startup I'll mention it. On these clients, mounting the filesystem from /etc/fstab using
<IPoIB address>@o2ib0:/lustre /mnt/lustre lustre defaults,_netdev 0 0
also generally fails: the system considers the conditions for "_netdev" satisfied before ib0 is active, so the mount fails. The mount has to be delayed one way or another. (Do it explicitly later in the boot, or modify /etc/init.d/netfs so that the check for _netdev waits for ib0, etc.)
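One way to do the "explicitly later in the boot" variant is to poll until ib0 reports that it is up and only then mount. The following is a minimal sketch, not a tested fix for this setup: wait_for_iface and the SYSFS_NET override are hypothetical names, and it assumes that the kernel's operstate file for ib0 reads "up" once the IPoIB link is usable.

```shell
#!/bin/sh
# wait_for_iface IFACE [TIMEOUT_SECONDS]
# Hypothetical helper: polls IFACE's operstate once per second.
# Returns 0 as soon as it reads "up", 1 if the timeout (default 60) expires.
# SYSFS_NET is an assumed override for testing; it defaults to /sys/class/net.
wait_for_iface() {
    iface=$1
    timeout=${2:-60}
    state_file="${SYSFS_NET:-/sys/class/net}/$iface/operstate"
    waited=0
    while :; do
        # The file is absent until the interface has been created
        if [ -r "$state_file" ] && [ "$(cat "$state_file")" = "up" ]; then
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

Usage, for example from /etc/rc.local: wait_for_iface ib0 60 && mount /mnt/lustre
This assumes the fstab entry is also marked noauto so that the earlier automatic attempt is skipped.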
This issue is resolved with patches for LU-8293.