Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.7.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.4.1
    • Environment: lustre-client-modules-2.4.1-6nasC OFED3.5
      server lustre2.4.1 and 2.1.5 OFED1.5.4
    • 3
    • 13682

    Description

      After upgrading to OFED 3.5 we have started to get random mount failures during client boot. Which filesystem fails to mount is random. Here is the client-side debug output.

      0000000:00000001:1.0:1398271322.986806:0:7677:0:(mgc_request.c:947:mgc_enqueue()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      10000000:01000000:1.0:1398271322.986808:0:7677:0:(mgc_request.c:1852:mgc_process_log()) Can't get cfg lock: -5
      10000000:00000001:1.0:1398271322.986810:0:7677:0:(mgc_request.c:125:config_log_get()) Process entered
      10000000:00000001:1.0:1398271322.986811:0:7677:0:(mgc_request.c:129:config_log_get()) Process leaving (rc=0 : 0 : 0)
      10000000:00000001:1.0:1398271322.986813:0:7677:0:(mgc_request.c:1713:mgc_process_cfg_log()) Process entered
      10000000:00000001:1.0:1398271322.986815:0:7677:0:(mgc_request.c:1774:mgc_process_cfg_log()) Process leaving via out_pop (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)
      10000000:00000001:1.0:1398271322.986818:0:7677:0:(mgc_request.c:1811:mgc_process_cfg_log()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      10000000:01000000:1.0:1398271322.986819:0:7677:0:(mgc_request.c:1871:mgc_process_log()) MGC10.151.25.171@o2ib: configuration from log 'nbp3-client' failed (-5).
      10000000:00000001:1.0:1398271322.986822:0:7677:0:(mgc_request.c:1883:mgc_process_log()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      10000000:00000001:1.0:1398271322.986824:0:7677:0:(mgc_request.c:136:config_log_put()) Process entered
      10000000:00000001:1.0:1398271322.986825:0:7677:0:(mgc_request.c:160:config_log_put()) Process leaving
      10000000:00000001:1.0:1398271322.986826:0:7677:0:(mgc_request.c:1982:mgc_process_config()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      00000020:00000001:1.0:1398271322.986829:0:7677:0:(obd_class.h:714:obd_process_config()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      00000020:00000001:1.0:1398271322.986830:0:7677:0:(lustre_cfg.h:214:lustre_cfg_len()) Process entered
      00000020:00000001:1.0:1398271322.986831:0:7677:0:(lustre_cfg.h:220:lustre_cfg_len()) Process leaving (rc=176 : 176 : b0)
      00000020:00000001:1.0:1398271322.986833:0:7677:0:(lustre_cfg.h:259:lustre_cfg_free()) Process leaving
      00000020:02020000:1.0:1398271322.986834:0:7677:0:(obd_mount.c:119:lustre_process_log()) 15c-8: MGC10.151.25.171@o2ib: The configuration from log 'nbp3-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      00000020:00000001:1.0:1398271323.010020:0:7677:0:(obd_mount.c:122:lustre_process_log()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
      

      Complete debug output is attached.
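      Note that the rc=18446744073709551611 values in the trace are just the -5 (-EIO) return code reinterpreted as an unsigned 64-bit integer; each debug line prints the same value in unsigned decimal, signed decimal, and hex. A minimal C sketch of the reinterpretation (illustration only, not Lustre source):

      ```c
      #include <stdio.h>
      #include <stdint.h>
      #include <inttypes.h>

      int main(void)
      {
          int rc = -5; /* -EIO, as returned by mgc_enqueue() in the trace */

          /* Sign-extend to 64 bits, then view the bits as unsigned --
           * this is how the debug log ends up printing the huge number. */
          uint64_t u = (uint64_t)(int64_t)rc;

          printf("(rc=%" PRIu64 " : %d : %" PRIx64 ")\n", u, rc, u);
          return 0;
      }
      /* prints: (rc=18446744073709551611 : -5 : fffffffffffffffb) */
      ```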

      Attachments

        Issue Links

          Activity

            [LU-4943] Client fails to mount filesystem
            haasken Ryan Haasken added a comment - - edited

            Is the test failure in replay-ost-single on http://review.whamcloud.com/#/c/10129/ related to the patch? It doesn't seem like it to me, but I don't see a bug matching that failure.

            bobijam Zhenyu Xu added a comment -

            Updated http://review.whamcloud.com/#/c/10127/ (b2_4) to be in sync with master.


            haasken Ryan Haasken added a comment -

            Thanks Jay. I have been testing on master, so that may explain why PS7 didn't fix the problem for me. PS10 of #10129 takes a different approach than the earlier patch-sets; it is now similar to #10569. PS10 of #10129 is broken, but when I fixed it locally and rebuilt, it resolved the problem in the same way that #10569 did.

            It seems to me that we only need either #10129 or #10569. Can anybody confirm this?

            jaylan Jay Lan (Inactive) added a comment -

            Ryan, we use http://review.whamcloud.com/10127 (for the b2_4 branch) instead.

            The latest PS for #10129 (for the master branch) is PS10 as you pointed out, but the latest PS for #10127 is still PS7. We use PS7 of #10127.
            haasken Ryan Haasken added a comment -

            Mahmoud, do you know which patch-set of the change http://review.whamcloud.com/#/c/10129/ you used? The newest version of the patch looks like it addresses the problem similar to the way http://review.whamcloud.com/#/c/10569/ addresses the problem.

            Also, there is a problem with the current patch-set (PS10) of http://review.whamcloud.com/#/c/10129/ . With this patch-set applied mount.lustre hangs with the following trace in dmesg:

            INFO: task mount.lustre:12025 blocked for more than 120 seconds.
                  Not tainted 2.6.32.431.23.3.el6_lustre #2
            "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            mount.lustre  D 0000000000000001     0 12025  12024 0x00000080
             ffff88013bb15828 0000000000000086 ffff88013bb157b8 ffffffff81069f15
             ffff88013bb15798 ffff88013dc08ad8 ffff8800283168e8 ffff880028316880
             ffff88013a56fab8 ffff88013bb15fd8 000000000000fbc8 ffff88013a56fab8
            Call Trace:
             [<ffffffff81069f15>] ? enqueue_entity+0x125/0x450
             [<ffffffff8152a365>] schedule_timeout+0x215/0x2e0
             [<ffffffff81069f15>] ? enqueue_entity+0x125/0x450
             [<ffffffff81529fe3>] wait_for_common+0x123/0x180
             [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
             [<ffffffff8152a0fd>] wait_for_completion+0x1d/0x20
             [<ffffffffa11fd0a8>] mgc_setup+0x4c8/0x5a0 [mgc]
             [<ffffffffa0c4de8b>] obd_setup+0x19b/0x290 [obdclass]
             [<ffffffffa0afc181>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
             [<ffffffffa0af63a8>] ? libcfs_log_return+0x28/0x40 [libcfs]
             [<ffffffffa0c4e188>] class_setup+0x208/0x870 [obdclass]
             [<ffffffffa0c56a6c>] class_process_config+0xc6c/0x1ad0 [obdclass]
             [<ffffffffa0af63a8>] ? libcfs_log_return+0x28/0x40 [libcfs]
             [<ffffffffa0c5baab>] ? lustre_cfg_new+0x40b/0x6f0 [obdclass]
             [<ffffffffa0c5bee8>] do_lcfg+0x158/0x450 [obdclass]
             [<ffffffff8128daa0>] ? sprintf+0x40/0x50
             [<ffffffffa0c5c274>] lustre_start_simple+0x94/0x200 [obdclass]
             [<ffffffffa0c60993>] lustre_start_mgc+0xbd3/0x1e00 [obdclass]
             [<ffffffffa0af63a8>] ? libcfs_log_return+0x28/0x40 [libcfs]
             [<ffffffffa0afc181>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
             [<ffffffffa0c61ccc>] lustre_fill_super+0x10c/0x550 [obdclass]
             [<ffffffffa0c61bc0>] ? lustre_fill_super+0x0/0x550 [obdclass]
             [<ffffffff8118c5df>] get_sb_nodev+0x5f/0xa0
             [<ffffffffa0c59995>] lustre_get_sb+0x25/0x30 [obdclass]
             [<ffffffff8118bc3b>] vfs_kern_mount+0x7b/0x1b0
             [<ffffffff8118bde2>] do_kern_mount+0x52/0x130
             [<ffffffff811ad7bb>] do_mount+0x2fb/0x930
             [<ffffffff811ade80>] sys_mount+0x90/0xe0
             [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            

            I think this is because rq_start is never completed in mgc_requeue_thread().
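            The hang pattern above is the classic missed-completion deadlock: mgc_setup() blocks in wait_for_completion() waiting for the requeue thread to signal that it has started, and if that signal never arrives, mount.lustre sleeps forever. A minimal userspace analogue using pthreads (names like requeue_thread and rq_started mirror the kernel code, but this is an illustrative sketch, not Lustre source):

            ```c
            #include <pthread.h>
            #include <stdio.h>

            static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
            static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
            static int rq_started;          /* analogue of the rq_start completion */

            static void *requeue_thread(void *arg)
            {
                (void)arg;
                pthread_mutex_lock(&lock);
                rq_started = 1;             /* analogue of complete(&rq_start);
                                             * if this is never reached, the
                                             * waiter below blocks forever */
                pthread_cond_signal(&cond);
                pthread_mutex_unlock(&lock);
                return NULL;
            }

            int main(void)
            {
                pthread_t t;
                pthread_create(&t, NULL, requeue_thread, NULL);

                /* analogue of wait_for_completion(&rq_start) in mgc_setup() --
                 * the spot where mount.lustre is stuck in the trace above */
                pthread_mutex_lock(&lock);
                while (!rq_started)
                    pthread_cond_wait(&cond, &lock);
                pthread_mutex_unlock(&lock);

                pthread_join(t, NULL);
                puts("rq_start completed; mgc_setup() would proceed");
                return 0;
            }
            ```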

            bobijam Zhenyu Xu added a comment -

            Yes, I think they are compatible.

            haasken Ryan Haasken added a comment -

            Are the two changes compatible with each other?

            pjones Peter Jones added a comment -

            Thanks Mahmoud. I suggest that we keep this ticket open until at least the master patch - http://review.whamcloud.com/#/c/10129/ - lands.

            Ryan it would probably be best to decouple the upstreaming of http://review.whamcloud.com/#/c/10569/ from this NASA support issue and use a unique JIRA ticket reference on the next push.

            mhanafi Mahmoud Hanafi added a comment -

            We used http://review.whamcloud.com/10127
            haasken Ryan Haasken added a comment -

            From our end, we need http://review.whamcloud.com/#/c/10569/ to land, but there are some style issues with it right now.

            pjones Peter Jones added a comment -

            Mahmoud

            To be clear - which patches (if any) did you end up using to meet your requirements?

            Thanks

            Peter


            People

              Assignee:
              bobijam Zhenyu Xu
              Reporter:
              mhanafi Mahmoud Hanafi
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: