[LU-5715] Reboot hangs due to lustre modules Created: 07/Oct/14  Updated: 29/Jan/22  Resolved: 29/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.2
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Chakravarthy N
Assignee: WC Triage
Resolution: Duplicate
Votes: 1
Labels: None
Environment:

RHEL 6.5, MLNX_OFED_LINUX-2.2-1.0.1, CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz; Memory: 64GB; Kernel: 2.6.32-431.17.1.el6_lustre.x86_64; Lustre: 2.5.2-2.6.32_431.17.1


Issue Links:
Duplicate
duplicates LU-8293 lnet init.d script missing insserv he... Resolved
Severity: 3
Epic: server
Rank (Obsolete): 16029

 Description   

Hi,

Reboot hangs with Lustre 2.5.2 on RHEL 6.5. If I unload the Lustre modules using lustre_rmmod before rebooting, the reboot works. I would appreciate your help with this.

My lustre.conf is as follows:

options lnet networks="o2ib0(ib0)"



 Comments   
Comment by Oleg Drokin [ 10/Oct/14 ]

Are there any messages in kernel logs?

Comment by Oleg Drokin [ 10/Oct/14 ]

Also, is this something you started to experience recently, and was everything fine with older versions?

Comment by Chakravarthy N [ 10/Oct/14 ]

There are no messages in the syslog, and everything was fine with older versions. I did not face this issue with RHEL 6.4 + Lustre 2.5 or RHEL 6.4 + Lustre 2.4.

Comment by Lana Deere [ 11/Jun/15 ]

I have seen this symptom in CentOS 6.3 with Lustre 2.1.4. (I don't have a newer configuration installed to try.) Lustre is set up using o2ib. The clients and all Lustre nodes have IPoIB enabled plus an Ethernet connection. The clients are generally busy full-time, which is to say that when a client shutdown is initiated it is likely that at least some processes have a Lustre directory or file opened (current working directory of a process, if nothing else).

When the client hangs, there is no overt explanation - nothing in the syslog, etc. However, using IPMI to watch the client's virtual console showed that "/etc/init.d/rdma stop" was where the shutdown would hang. It would print that it was "Unloading OpenIB kernel modules" but it could not succeed because one (or more? I forget) of the OpenIB modules was in use. It would hang at that point.
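To see which module is still holding the OpenIB stack in place, the use counts in /proc/modules can be inspected before the rdma service stops. The following is only a minimal sketch, assuming the usual /proc/modules column layout (name, size, use count, dependents). The function name ib_modules_in_use and the module name pattern are illustrative and not part of Lustre or OFED.

```shell
#!/bin/sh
# List IB-related modules that still have a nonzero use count.
# $1: path to a modules table (defaults to /proc/modules; overridable for testing).
ib_modules_in_use() {
    modules_file="${1:-/proc/modules}"
    # Columns: name size use-count dependents state address
    # The name pattern below is an assumption; adjust it to the modules on your system.
    awk '$3 > 0 && $1 ~ /^(ib_|rdma_|mlx)/ { print $1, $3, $4 }' "$modules_file"
}
```

Calling ib_modules_in_use with no arguments prints one line per busy module with its use count and the modules that depend on it. A Lustre module such as ko2iblnd in the last column would suggest that Lustre is what keeps the IB modules loaded.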

As a hack, changing /etc/init.d/rdma so that it calls lustre_rmmod usually prevents the hang; specifically, the original line "stop()" becomes

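/etc/init.d/rdma hack
stop()
{
    # Unload the Lustre modules first so they no longer hold the IB modules.
    [ -x /usr/sbin/lustre_rmmod ] && /usr/sbin/lustre_rmmod
    real_stop
}
# The original stop() is renamed to real_stop(); its body follows unchanged.
real_stop()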

This may or may not be related, but since it may be the symmetric issue at startup I'll mention it. On these clients, mounting the filesystem inside /etc/fstab using

/etc/fstab
<IPoIB address>@o2ib0:/lustre /mnt/lustre lustre defaults,_netdev 0 0

also generally fails: the system thinks the conditions for "_netdev" have been satisfied before ib0 is active so the mount fails. Stalling the mount one way or another is needed. (Do it explicitly later in the boot, or modify /etc/init.d/netfs so the check for _netdev waits for ib0, etc.)
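One way to stall the mount until ib0 is actually up is to poll the interface state before mounting. This is a rough sketch rather than a tested fix, assuming the kernel reports the link state in /sys/class/net/<interface>/operstate and that IPoIB reports "up" there once the link is active. The function name wait_for_iface, the 60-second default, and the optional sysfs directory argument are all illustrative.

```shell
#!/bin/sh
# Poll an interface's operstate until it reads "up" or the timeout expires.
# $1: interface name (default ib0)
# $2: timeout in seconds (default 60)
# $3: sysfs net directory (default /sys/class/net; overridable for testing)
wait_for_iface() {
    iface="${1:-ib0}"
    timeout="${2:-60}"
    sysfs="${3:-/sys/class/net}"
    elapsed=0
    while :; do
        # A missing or unreadable operstate file is treated as "not up yet".
        state=$(cat "$sysfs/$iface/operstate" 2>/dev/null) || state=""
        [ "$state" = "up" ] && return 0
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
}
```

A late boot script could then run something like "wait_for_iface ib0 120 && mount /mnt/lustre", so the mount is only attempted once the interface reports up.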

Comment by Kevin J Moran (Inactive) [ 24/Nov/15 ]

I can confirm this is still an issue using RedHat kernel with Lustre client:

Linux 2.6.32-573.7.1.el6.x86_64 #1 SMP Thu Sep 10 13:42:16 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

Using standard RedHat Infiniband Support package.

The shutdown hangs while unloading the IB modules even though /usr/sbin/lustre_rmmod is called beforehand.

Comment by Wolfgang Baudler [ 13/Sep/16 ]

I can confirm the same issue here with RHEL6.8 and lustre 2.5.3, also using the RedHat Infiniband packages.

Comment by Nathaniel Clark [ 08/Jun/18 ]

This issue is resolved with patches for LU-8293

Generated at Sat Feb 10 01:53:52 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.