[LU-5715] Reboot hangs due to lustre modules Created: 07/Oct/14 Updated: 29/Jan/22 Resolved: 29/Jan/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Chakravarthy N | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | None | ||
| Environment: |
RHEL 6.5, MLNX_OFED_LINUX-2.2-1.0.1, CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz; Memory: 64GB; Kernel: 2.6.32-431.17.1.el6_lustre.x86_64; Lustre: 2.5.2-2.6.32_431.17.1 |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Epic: | server | ||||||||
| Rank (Obsolete): | 16029 | ||||||||
| Description |
|
Hi, the reboot hangs with Lustre 2.5.2 on RHEL 6.5. If I unload the Lustre modules using lustre_rmmod before rebooting, the reboot works. I would appreciate your help here. My lustre.conf is as follows:
options lnet networks="o2ib0(ib0)" |
| Comments |
| Comment by Oleg Drokin [ 10/Oct/14 ] |
|
Are there any messages in kernel logs? |
| Comment by Oleg Drokin [ 10/Oct/14 ] |
|
Also, is this something you started to experience recently, and was everything fine with older versions? |
| Comment by Chakravarthy N [ 10/Oct/14 ] |
|
There are no messages in the syslog, and everything was fine with older versions. I did not face this issue with RHEL 6.4 + Lustre 2.5 or RHEL 6.4 + Lustre 2.4. |
| Comment by Lana Deere [ 11/Jun/15 ] |
|
I have seen this symptom in CentOS 6.3 with Lustre 2.1.4. (I don't have a newer configuration installed to try.) Lustre is set up using o2ib. The clients and all Lustre nodes have IPoIB enabled plus an Ethernet connection. The clients are generally busy full-time, which is to say that when a client shutdown is initiated it is likely that at least some processes have a Lustre directory or file opened (current working directory of a process, if nothing else).

When the client hangs, there is no overt explanation - nothing in the syslog, etc. However, using IPMI to watch the client's virtual console showed that "/etc/init.d/rdma stop" was where the shutdown would hang. It would print that it was "Unloading OpenIB kernel modules" but it could not succeed because one (or more? I forget) of the OpenIB modules was in use. It would hang at that point.

As a hack, it usually prevents the hanging if we change /etc/init.d/rdma so it calls lustre_rmmod; specifically, so that the original line "stop()" becomes

/etc/init.d/rdma hack:
stop()
{
    [ -x /usr/sbin/lustre_rmmod ] && /usr/sbin/lustre_rmmod;
    real_stop
}
real_stop()

This may or may not be related, but since it may be the symmetric issue at startup I'll mention it. On these clients, mounting the filesystem inside /etc/fstab using

/etc/fstab:
<IPoIB address>@o2ib0:/lustre /mnt/lustre lustre defaults,_netdev 0 0

also generally fails: the system thinks the conditions for "_netdev" have been satisfied before ib0 is active so the mount fails. Stalling the mount one way or another is needed. (Do it explicitly later in the boot, or modify /etc/init.d/netfs so the check for _netdev waits for ib0, etc.) |
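One way to stall the mount until ib0 is active could be sketched as below. This is only an illustration, not something taken from the ticket: the function name `wait_for_iface`, the `SYSFS_NET` override, and the 60-second default are all assumptions, and it relies on the kernel exposing the link state in `/sys/class/net/<iface>/operstate`.

```shell
#!/bin/sh
# Hypothetical helper: poll an interface's operstate until it reads "up".
# Returns 0 once the interface is up, 1 if the timeout expires first.
# SYSFS_NET is an assumed override so the function can be tried against
# a fake directory tree instead of the real /sys/class/net.
wait_for_iface() {
    iface=$1
    timeout=${2:-60}                      # seconds to wait (assumed default)
    sysfs=${SYSFS_NET:-/sys/class/net}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # A missing interface yields an empty string, which is not "up".
        if [ "$(cat "$sysfs/$iface/operstate" 2>/dev/null)" = "up" ]; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

A late boot script might then run something like `wait_for_iface ib0 60 && mount /mnt/lustre`, with the fstab entry marked `noauto` so the earlier _netdev pass does not attempt it. Whether operstate going "up" is sufficient for LNet to be reachable on a given fabric is untested here.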
| Comment by Kevin J Moran (Inactive) [ 24/Nov/15 ] |
|
I can confirm this is still an issue using RedHat kernel with Lustre client:

Linux 2.6.32-573.7.1.el6.x86_64 #1 SMP Thu Sep 10 13:42:16 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

Using standard RedHat Infiniband Support package. Hangs at Unloading IB Modules even though /usr/sbin/lustre_rmmod is being called prior. |
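When the unload hangs even after lustre_rmmod has run, it may help to see which InfiniBand modules still have a nonzero reference count. The following is a rough diagnostic sketch, not part of any Lustre or RDMA package: the function name `ib_module_users` and the module-name prefixes (`ib_`, `rdma_`, `mlx`) are assumptions chosen for illustration. It reads the standard `/proc/modules` layout (name, size, refcount, users, state, address).

```shell
#!/bin/sh
# Hypothetical diagnostic: print IB-related modules that are still in use.
# Output columns: module name, reference count, comma-separated users.
# The optional argument lets the function read a saved copy of /proc/modules.
ib_module_users() {
    modules_file=${1:-/proc/modules}
    # Field 3 is the refcount; field 4 lists the dependent modules ("-" if none).
    awk '$1 ~ /^(ib_|rdma_|mlx)/ && $3 > 0 { print $1, $3, $4 }' "$modules_file"
}
```

Calling `ib_module_users` just before the rdma stop step would show whether an LNet module such as ko2iblnd is still listed as a user. A nonzero count with "-" as the user list would instead point to a holder outside the module dependency chain.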
| Comment by Wolfgang Baudler [ 13/Sep/16 ] |
|
I can confirm the same issue here with RHEL6.8 and lustre 2.5.3, also using the RedHat Infiniband packages. |
| Comment by Nathaniel Clark [ 08/Jun/18 ] |
|
This issue is resolved with patches for |