Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.5.2
    • None
    • RHEL 6.5, MLNX_OFED_LINUX-2.2-1.0.1, CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz; Memory: 64GB; Kernel: 2.6.32-431.17.1.el6_lustre.x86_64; Lustre: 2.5.2-2.6.32_431.17.1
    • 3
    • 16029

    Description

      Hi,

      Reboot hangs with Lustre 2.5.2 on RHEL 6.5. If I unload the Lustre modules with lustre_rmmod before rebooting, the reboot completes. I would appreciate your help with this.

      My lustre.conf is as follows:

      options lnet networks="o2ib0(ib0)"

          Activity

            [LU-5715] Reboot hangs due to lustre modules

            utopiabound Nathaniel Clark added a comment -

            This issue is resolved with patches for LU-8293.

            wbaudler Wolfgang Baudler added a comment -

            I can confirm the same issue here with RHEL6.8 and lustre 2.5.3, also using the RedHat Infiniband packages.

            kmoran Kevin J Moran (Inactive) added a comment -

            I can confirm this is still an issue when using the RedHat kernel with the Lustre client:

            Linux 2.6.32-573.7.1.el6.x86_64 #1 SMP Thu Sep 10 13:42:16 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

            Using the standard RedHat Infiniband Support package.

            The shutdown hangs at "Unloading IB Modules" even though /usr/sbin/lustre_rmmod is called beforehand.
            lana.deere@gmail.com Lana Deere added a comment - edited

            I have seen this symptom in CentOS 6.3 with Lustre 2.1.4. (I don't have a newer configuration installed to try.) Lustre is set up using o2ib. The clients and all Lustre nodes have IPoIB enabled plus an Ethernet connection. The clients are generally busy full-time, which is to say that when a client shutdown is initiated it is likely that at least some processes have a Lustre directory or file opened (current working directory of a process, if nothing else).

            When the client hangs, there is no overt explanation - nothing in the syslog, etc. However, using IPMI to watch the client's virtual console showed that "/etc/init.d/rdma stop" was where the shutdown would hang. It would print that it was "Unloading OpenIB kernel modules" but it could not succeed because one (or more? I forget) of the OpenIB modules was in use. It would hang at that point.
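To see which modules are holding an OpenIB module in use, the "used by" column of /proc/modules can be read before the rdma script runs. The following is only a minimal sketch: the function name module_users and its optional second argument (a file to read instead of /proc/modules, added so the function can be tried on sample data) are assumptions for illustration, not part of any Lustre or rdma tooling.

```shell
#!/bin/sh
# Print, one per line, the modules that currently hold a reference on the
# module named in $1, using the fourth ("used by") column of /proc/modules.
# $2 is a hypothetical override of the file to read, for testing only.
module_users() {
    mu_module=$1
    mu_file=${2:-/proc/modules}
    awk -v m="$mu_module" '$1 == m && $4 != "-" {
        # The column is a comma-separated list with a trailing comma.
        n = split($4, users, ",")
        for (i = 1; i <= n; i++)
            if (users[i] != "") print users[i]
    }' "$mu_file"
}
```

Running `module_users ib_core` just before shutdown would show whether an LNet or Lustre module (for example the o2ib LND) is still listed as a user at that point.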

            As a hack, changing /etc/init.d/rdma so that it calls lustre_rmmod usually prevents the hang; specifically, the original line "stop()" becomes

            /etc/init.d/rdma hack
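            # New wrapper: unload the Lustre modules first (if the helper is installed),
            # then run the original stop logic.
            stop()
            {
                [ -x /usr/sbin/lustre_rmmod ] && /usr/sbin/lustre_rmmod
                real_stop
            }
            # The original body of stop() follows unchanged, under the new name: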
            real_stop()
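The Lustre modules cannot be unloaded while a Lustre filesystem is still mounted, so a slightly more defensive variant of the same idea is to unmount first and only then call lustre_rmmod. The sketch below assumes that ordering is acceptable on the client; the function names lustre_mounts and unload_lustre and the optional file argument are hypothetical, and whether this avoids the hang reported in this ticket has not been verified.

```shell
#!/bin/sh
# Print the mount point of every mounted Lustre filesystem.
# $1 is a hypothetical override of the mounts file, for testing only.
# Note: /proc/mounts escapes spaces in mount points as \040.
lustre_mounts() {
    lm_file=${1:-/proc/mounts}
    awk '$3 == "lustre" { print $2 }' "$lm_file"
}

# Unmount all Lustre filesystems, then unload the Lustre modules.
# Returns non-zero if the unmount fails (for example, because a process
# still has its working directory on Lustre), so the caller can log it.
unload_lustre() {
    if [ -n "$(lustre_mounts)" ]; then
        umount -a -t lustre || return 1
    fi
    [ -x /usr/sbin/lustre_rmmod ] || return 0
    /usr/sbin/lustre_rmmod
}
```

In the wrapper above, the lustre_rmmod line could be replaced by a call to `unload_lustre`, with any failure written to the console before real_stop runs.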
            

            This may or may not be related, but since it may be the symmetric issue at startup I'll mention it. On these clients, mounting the filesystem inside /etc/fstab using

            /etc/fstab
            <IPoIB address>@o2ib0:/lustre /mnt/lustre lustre defaults,_netdev 0 0
            

            also generally fails: the system thinks the conditions for "_netdev" have been satisfied before ib0 is active so the mount fails. Stalling the mount one way or another is needed. (Do it explicitly later in the boot, or modify /etc/init.d/netfs so the check for _netdev waits for ib0, etc.)
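One way to stall the mount is to poll the interface state and mount only once ib0 reports that it is up. This is a rough sketch under stated assumptions: the function name wait_for_iface, the 60-second default timeout, and the optional third argument (an alternative directory to read instead of /sys/class/net, for testing) are all illustrative choices and not taken from the netfs script.

```shell
#!/bin/sh
# Wait until a network interface reports operstate "up", or give up.
# Usage: wait_for_iface IFACE [TIMEOUT_SECONDS] [SYSFS_NET_DIR]
# Returns 0 as soon as the interface is up, 1 if the timeout expires.
wait_for_iface() {
    wi_iface=$1
    wi_timeout=${2:-60}
    wi_dir=${3:-/sys/class/net}
    wi_elapsed=0
    while [ "$wi_elapsed" -lt "$wi_timeout" ]; do
        # A missing file (interface not created yet) counts as "not up".
        wi_state=$(cat "$wi_dir/$wi_iface/operstate" 2>/dev/null || true)
        [ "$wi_state" = "up" ] && return 0
        sleep 1
        wi_elapsed=$((wi_elapsed + 1))
    done
    return 1
}
```

Called later in the boot, for instance as `wait_for_iface ib0 60 && mount /mnt/lustre`, this performs the mount explicitly once the interface is active, using the existing /etc/fstab entry.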


            prabhu.chakra Chakravarthy N added a comment -

            There are no messages in the syslog, and everything was fine with older versions. I did not face this issue with RHEL 6.4 + Lustre 2.5 or RHEL 6.4 + Lustre 2.4.
            green Oleg Drokin added a comment -

            Also, is this something you started to experience recently, and was everything fine with older versions?
            green Oleg Drokin added a comment -

            Are there any messages in kernel logs?


            People

              wc-triage WC Triage
              prabhu.chakra Chakravarthy N
              Votes: 1
              Watchers: 9