Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.7.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.4.1
    • Environment: lustre-client-modules-2.4.1-6nasC OFED3.5
      server lustre2.4.1 and 2.1.5 OFED1.5.4
    • Severity: 3
    • 13682

    Description

      After upgrading to OFED 3.5, we have started to see random mount failures during client boot. Which filesystem fails to mount is random. Here is the client-side debug output:

      0000000:00000001:1.0:1398271322.986806:0:7677:0:(mgc_request.c:947:mgc_enqueue()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      10000000:01000000:1.0:1398271322.986808:0:7677:0:(mgc_request.c:1852:mgc_process_log()) Can't get cfg lock: -5
      10000000:00000001:1.0:1398271322.986810:0:7677:0:(mgc_request.c:125:config_log_get()) Process entered
      10000000:00000001:1.0:1398271322.986811:0:7677:0:(mgc_request.c:129:config_log_get()) Process leaving (rc=0 : 0 : 0)
      10000000:00000001:1.0:1398271322.986813:0:7677:0:(mgc_request.c:1713:mgc_process_cfg_log()) Process entered
      10000000:00000001:1.0:1398271322.986815:0:7677:0:(mgc_request.c:1774:mgc_process_cfg_log()) Process leaving via out_pop (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)
      10000000:00000001:1.0:1398271322.986818:0:7677:0:(mgc_request.c:1811:mgc_process_cfg_log()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      10000000:01000000:1.0:1398271322.986819:0:7677:0:(mgc_request.c:1871:mgc_process_log()) MGC10.151.25.171@o2ib: configuration from log 'nbp3-client' failed (-5).
      10000000:00000001:1.0:1398271322.986822:0:7677:0:(mgc_request.c:1883:mgc_process_log()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      10000000:00000001:1.0:1398271322.986824:0:7677:0:(mgc_request.c:136:config_log_put()) Process entered
      10000000:00000001:1.0:1398271322.986825:0:7677:0:(mgc_request.c:160:config_log_put()) Process leaving
      10000000:00000001:1.0:1398271322.986826:0:7677:0:(mgc_request.c:1982:mgc_process_config()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      00000020:00000001:1.0:1398271322.986829:0:7677:0:(obd_class.h:714:obd_process_config()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      00000020:00000001:1.0:1398271322.986830:0:7677:0:(lustre_cfg.h:214:lustre_cfg_len()) Process entered
      00000020:00000001:1.0:1398271322.986831:0:7677:0:(lustre_cfg.h:220:lustre_cfg_len()) Process leaving (rc=176 : 176 : b0)
      00000020:00000001:1.0:1398271322.986833:0:7677:0:(lustre_cfg.h:259:lustre_cfg_free()) Process leaving
      00000020:02020000:1.0:1398271322.986834:0:7677:0:(obd_mount.c:119:lustre_process_log()) 15c-8: MGC10.151.25.171@o2ib: The configuration from log 'nbp3-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      00000020:00000001:1.0:1398271323.010020:0:7677:0:(obd_mount.c:122:lustre_process_log()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      

      Complete debug output is attached.

      Attachments

          Activity

            [LU-4943] Client fails to mount filesystem

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/11765/
            Subject: LU-4943 obdclass: detach MGC dev on error
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set:
            Commit: 8d1e9394d3a984e257e1e4b0f46f16b7ff2183cd
            haasken Ryan Haasken added a comment -

            I didn't notice that there was already a b2_5 version of this fix, so http://review.whamcloud.com/#/c/12303/ has been abandoned in favor of http://review.whamcloud.com/#/c/11765

            pjones Peter Jones added a comment -

            Landed for 2.7

            haasken Ryan Haasken added a comment -

            It looks like we may have gotten the same spurious Maloo failures on the b2_5 patch as we did on other branches. Can somebody restart Maloo?

            haasken Ryan Haasken added a comment -

            The patch for master has landed.

            This issue also exists in 2.5. Here is a port for b2_5: http://review.whamcloud.com/#/c/12303

            haasken Ryan Haasken added a comment - - edited

            Is the test failure in replay-ost-single on http://review.whamcloud.com/#/c/10129/ related to the patch? It doesn't seem like it to me, but I don't see a bug matching that failure.

            bobijam Zhenyu Xu added a comment -

            Updated http://review.whamcloud.com/#/c/10127/ (b2_4) to be in sync with master.


            haasken Ryan Haasken added a comment -

            Thanks Jay. I have been testing on master, so that may explain why PS7 didn't fix the problem for me. PS10 of #10129 takes a different approach than the earlier patch-sets. It now takes a similar approach to #10569. PS10 of #10129 is broken, but when I fixed it locally and rebuilt, it resolved the problem in the same way that #10569 did.

            It seems to me that we only need either #10129 or #10569. Can anybody confirm this?

            jaylan Jay Lan (Inactive) added a comment -

            Ryan, we use http://review.whamcloud.com/10127 (for b2_4 branch) instead.

            The latest PS for #10129 (for master branch) is PS10 as you pointed out, but the latest PS for #10127 is still PS7. We use PS7 of #10127.
            haasken Ryan Haasken added a comment -

            Mahmoud, do you know which patch-set of the change http://review.whamcloud.com/#/c/10129/ you used? The newest version of the patch looks like it addresses the problem in a way similar to http://review.whamcloud.com/#/c/10569/ .

            Also, there is a problem with the current patch-set (PS10) of http://review.whamcloud.com/#/c/10129/ . With this patch-set applied, mount.lustre hangs with the following trace in dmesg:

            INFO: task mount.lustre:12025 blocked for more than 120 seconds.
                  Not tainted 2.6.32.431.23.3.el6_lustre #2
            "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            mount.lustre  D 0000000000000001     0 12025  12024 0x00000080
             ffff88013bb15828 0000000000000086 ffff88013bb157b8 ffffffff81069f15
             ffff88013bb15798 ffff88013dc08ad8 ffff8800283168e8 ffff880028316880
             ffff88013a56fab8 ffff88013bb15fd8 000000000000fbc8 ffff88013a56fab8
            Call Trace:
             [<ffffffff81069f15>] ? enqueue_entity+0x125/0x450
             [<ffffffff8152a365>] schedule_timeout+0x215/0x2e0
             [<ffffffff81069f15>] ? enqueue_entity+0x125/0x450
             [<ffffffff81529fe3>] wait_for_common+0x123/0x180
             [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
             [<ffffffff8152a0fd>] wait_for_completion+0x1d/0x20
             [<ffffffffa11fd0a8>] mgc_setup+0x4c8/0x5a0 [mgc]
             [<ffffffffa0c4de8b>] obd_setup+0x19b/0x290 [obdclass]
             [<ffffffffa0afc181>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
             [<ffffffffa0af63a8>] ? libcfs_log_return+0x28/0x40 [libcfs]
             [<ffffffffa0c4e188>] class_setup+0x208/0x870 [obdclass]
             [<ffffffffa0c56a6c>] class_process_config+0xc6c/0x1ad0 [obdclass]
             [<ffffffffa0af63a8>] ? libcfs_log_return+0x28/0x40 [libcfs]
             [<ffffffffa0c5baab>] ? lustre_cfg_new+0x40b/0x6f0 [obdclass]
             [<ffffffffa0c5bee8>] do_lcfg+0x158/0x450 [obdclass]
             [<ffffffff8128daa0>] ? sprintf+0x40/0x50
             [<ffffffffa0c5c274>] lustre_start_simple+0x94/0x200 [obdclass]
             [<ffffffffa0c60993>] lustre_start_mgc+0xbd3/0x1e00 [obdclass]
             [<ffffffffa0af63a8>] ? libcfs_log_return+0x28/0x40 [libcfs]
             [<ffffffffa0afc181>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
             [<ffffffffa0c61ccc>] lustre_fill_super+0x10c/0x550 [obdclass]
             [<ffffffffa0c61bc0>] ? lustre_fill_super+0x0/0x550 [obdclass]
             [<ffffffff8118c5df>] get_sb_nodev+0x5f/0xa0
             [<ffffffffa0c59995>] lustre_get_sb+0x25/0x30 [obdclass]
             [<ffffffff8118bc3b>] vfs_kern_mount+0x7b/0x1b0
             [<ffffffff8118bde2>] do_kern_mount+0x52/0x130
             [<ffffffff811ad7bb>] do_mount+0x2fb/0x930
             [<ffffffff811ade80>] sys_mount+0x90/0xe0
             [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            

            I think this is because rq_start is never completed in mgc_requeue_thread().


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: