Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.7.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.4.1
    • Environment: client: lustre-client-modules-2.4.1-6nasC, OFED 3.5
      server: Lustre 2.4.1 and 2.1.5, OFED 1.5.4
    • 3
    • 13682

    Description

      After upgrading to OFED 3.5, we have started to get random mount failures during client boot. The filesystem that fails to mount varies from boot to boot. Here is the client-side debug output.

      0000000:00000001:1.0:1398271322.986806:0:7677:0:(mgc_request.c:947:mgc_enqueue()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      10000000:01000000:1.0:1398271322.986808:0:7677:0:(mgc_request.c:1852:mgc_process_log()) Can't get cfg lock: -5
      10000000:00000001:1.0:1398271322.986810:0:7677:0:(mgc_request.c:125:config_log_get()) Process entered
      10000000:00000001:1.0:1398271322.986811:0:7677:0:(mgc_request.c:129:config_log_get()) Process leaving (rc=0 : 0 : 0)
      10000000:00000001:1.0:1398271322.986813:0:7677:0:(mgc_request.c:1713:mgc_process_cfg_log()) Process entered
      10000000:00000001:1.0:1398271322.986815:0:7677:0:(mgc_request.c:1774:mgc_process_cfg_log()) Process leaving via out_pop (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)
      10000000:00000001:1.0:1398271322.986818:0:7677:0:(mgc_request.c:1811:mgc_process_cfg_log()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      10000000:01000000:1.0:1398271322.986819:0:7677:0:(mgc_request.c:1871:mgc_process_log()) MGC10.151.25.171@o2ib: configuration from log 'nbp3-client' failed (-5).
      10000000:00000001:1.0:1398271322.986822:0:7677:0:(mgc_request.c:1883:mgc_process_log()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      10000000:00000001:1.0:1398271322.986824:0:7677:0:(mgc_request.c:136:config_log_put()) Process entered
      10000000:00000001:1.0:1398271322.986825:0:7677:0:(mgc_request.c:160:config_log_put()) Process leaving
      10000000:00000001:1.0:1398271322.986826:0:7677:0:(mgc_request.c:1982:mgc_process_config()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      00000020:00000001:1.0:1398271322.986829:0:7677:0:(obd_class.h:714:obd_process_config()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      00000020:00000001:1.0:1398271322.986830:0:7677:0:(lustre_cfg.h:214:lustre_cfg_len()) Process entered
      00000020:00000001:1.0:1398271322.986831:0:7677:0:(lustre_cfg.h:220:lustre_cfg_len()) Process leaving (rc=176 : 176 : b0)
      00000020:00000001:1.0:1398271322.986833:0:7677:0:(lustre_cfg.h:259:lustre_cfg_free()) Process leaving
      00000020:02020000:1.0:1398271322.986834:0:7677:0:(obd_mount.c:119:lustre_process_log()) 15c-8: MGC10.151.25.171@o2ib: The configuration from log 'nbp3-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      00000020:00000001:1.0:1398271323.010020:0:7677:0:(obd_mount.c:122:lustre_process_log()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      

      Complete Debug output is attached
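
      Each "Process leaving" line in the trace above prints the return code three ways: as an unsigned 64-bit value, as a signed value, and in hex. The sketch below shows how the three relate and why 18446744073709551611 means -5. It is a minimal illustration, not part of Lustre; the helper name decode_rc and the regular expression are assumptions made for this example, and the errno lookup relies on the standard Linux numbering, where 5 is EIO.

```python
import errno
import re

# Matches the "(rc=<unsigned> : <signed> : <hex>)" triple printed in the trace.
# The hex field sometimes carries a "0x" prefix, so that part is optional.
RC_PATTERN = re.compile(r"rc=(\d+) : (-?\d+) : (?:0x)?([0-9a-f]+)")


def decode_rc(line):
    """Return (signed_rc, errno_name) for a trace line, or None if no rc is present.

    Hypothetical helper for illustration only.
    """
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    unsigned = int(match.group(1))
    # Reinterpret the unsigned 64-bit value as two's-complement signed.
    signed = unsigned - (1 << 64) if unsigned >= (1 << 63) else unsigned
    # Negative return codes are negated errno values; others have no errno name.
    name = errno.errorcode.get(-signed) if signed < 0 else None
    return signed, name


sample = (
    "(mgc_request.c:947:mgc_enqueue()) Process leaving "
    "(rc=18446744073709551611 : -5 : fffffffffffffffb)"
)
print(decode_rc(sample))
```

      Under these assumptions, every failing line in the trace decodes to -5 (EIO), which matches the "communication errors" hint in the 15c-8 message, while the lustre_cfg_len() line (rc=176) is a plain length and not an error.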

          Activity

            [LU-4943] Client Fails to mount filesystem

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/11765/
            Subject: LU-4943 obdclass: detach MGC dev on error
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set:
            Commit: 8d1e9394d3a984e257e1e4b0f46f16b7ff2183cd
            haasken Ryan Haasken added a comment -

            I didn't notice that there was already a b2_5 version of this fix, so http://review.whamcloud.com/#/c/12303/ has been abandoned in favor of http://review.whamcloud.com/#/c/11765

            pjones Peter Jones added a comment -

            Landed for 2.7

            haasken Ryan Haasken added a comment -

            It looks like we may have gotten the same spurious Maloo failures on the b2_5 patch as we did on other branches. Can somebody restart Maloo?

            haasken Ryan Haasken added a comment -

            The patch for master has landed.

            This issue also exists in 2.5. Here is a port for b2_5: http://review.whamcloud.com/#/c/12303

            haasken Ryan Haasken added a comment - - edited

            Is the test failure in replay-ost-single on http://review.whamcloud.com/#/c/10129/ related to the patch? It doesn't seem like it to me, but I don't see a bug matching that failure.

            bobijam Zhenyu Xu added a comment -

            Updated http://review.whamcloud.com/#/c/10127/ (b2_4) to be in sync with master.


            haasken Ryan Haasken added a comment -

            Thanks Jay. I have been testing on master, so that may explain why PS7 didn't fix the problem for me. PS10 of #10129 takes a different approach than the earlier patch-sets. It now takes a similar approach to #10569. PS10 of #10129 is broken, but when I fixed it locally and rebuilt, it resolved the problem in the same way that #10569 did.

            It seems to me that we only need either #10129 or #10569. Can anybody confirm this?

            jaylan Jay Lan (Inactive) added a comment -

            Ryan, we use http://review.whamcloud.com/10127 (for b2_4 branch) instead.

            The latest PS for #10129 (for master branch) is PS10 as you pointed out, but the latest PS for #10127 is still PS7. We use PS7 of #10127.

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 12