Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.11.0
    • Fix Version: Lustre 2.11.0
    • Environment: Soak stress cluster, lustre-master-next-ib build 1

    Description

      Mounts frequently hang on clients.

      Mar  9 18:14:41 soak-36 kernel: INFO: task mount.lustre:2807 blocked for more than 120 seconds.
      Mar  9 18:14:41 soak-36 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Mar  9 18:14:41 soak-36 kernel: mount.lustre    D ffff88085b7a0000     0  2807   2806 0x00000080
      Mar  9 18:14:41 soak-36 kernel: Call Trace:
      Mar  9 18:14:41 soak-36 kernel: [<ffffffff816ab8a9>] schedule+0x29/0x70
      Mar  9 18:14:41 soak-36 kernel: [<ffffffff816a92b9>] schedule_timeout+0x239/0x2c0
      Mar  9 18:14:41 soak-36 kernel: [<ffffffff81050b4c>] ? native_smp_send_reschedule+0x4c/0x70
      Mar  9 18:14:41 soak-36 kernel: [<ffffffff810c2358>] ? resched_curr+0xa8/0xc0
      Mar  9 18:14:41 soak-36 kernel: [<ffffffff810c30d8>] ? check_preempt_curr+0x78/0xa0
      Mar  9 18:14:41 soak-36 kernel: [<ffffffff810c3119>] ? ttwu_do_wakeup+0x19/0xd0
      Mar  9 18:14:41 soak-36 kernel: [<ffffffff816abc5d>] wait_for_completion+0xfd/0x140
      Mar  9 18:14:42 soak-36 kernel: [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
      Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0b28854>] llog_process_or_fork+0x244/0x450 [obdclass]
      Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0b28a74>] llog_process+0x14/0x20 [obdclass]
      Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0b5b1c5>] class_config_parse_llog+0x125/0x350 [obdclass]
      Mar  9 18:14:42 soak-36 kernel: [<ffffffffc06501c8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
      Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0652243>] mgc_process_log+0x3d3/0x8b0 [mgc]
      Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0b63240>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0652968>] ? do_config_log_add+0x248/0x580 [mgc]
      Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0653840>] mgc_process_config+0x890/0x13f0 [mgc]
      Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0b66c85>] lustre_process_log+0x2d5/0xae0 [obdclass]
      Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0855e27>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      Mar  9 18:14:43 soak-36 kernel: [<ffffffffc0f3e3bb>] ll_fill_super+0x45b/0x1100 [lustre]
      Mar  9 18:14:43 soak-36 kernel: [<ffffffffc0b6caa6>] lustre_fill_super+0x286/0x910 [obdclass]
      Mar  9 18:14:43 soak-36 kernel: [<ffffffffc0b6c820>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      Mar  9 18:14:43 soak-36 kernel: [<ffffffff81206abd>] mount_nodev+0x4d/0xb0
      Mar  9 18:14:43 soak-36 kernel: [<ffffffffc0b64ab8>] lustre_mount+0x38/0x60 [obdclass]
      Mar  9 18:14:43 soak-36 kernel: [<ffffffff81207549>] mount_fs+0x39/0x1b0
      Mar  9 18:14:43 soak-36 kernel: [<ffffffff81224177>] vfs_kern_mount+0x67/0x110
      Mar  9 18:14:43 soak-36 kernel: [<ffffffff81226683>] do_mount+0x233/0xaf0
      Mar  9 18:14:43 soak-36 kernel: [<ffffffff811894ee>] ? __get_free_pages+0xe/0x40
      Mar  9 18:14:43 soak-36 kernel: [<ffffffff812272c6>] SyS_mount+0x96/0xf0
      Mar  9 18:14:43 soak-36 kernel: [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
      Mar  9 18:16:43 soak-36 kernel: INFO: task mount.lustre:2807 blocked for more than 120 seconds.
      Mar  9 18:16:43 soak-36 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      

      I dumped the Lustre debug log during the hang (attached). I also crash-dumped the client; the files are available on soak.

      Attachments

        Issue Links

          Activity

            [LU-10800] Mount hangs on clients.
            pjones Peter Jones added a comment -

            Landed for 2.11


            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31675/
            Subject: LU-10800 lnet: Revert "LU-10270 lnet: remove an early rx code"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6224bb44d8d10894f1c21921a0224dd8baf0ded0


            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31663/
            Subject: LU-10800 lnet: reduce discovery timeout
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1cf929df259a9aaa5446a4cda637930ca5b27d7a

            ashehata Amir Shehata (Inactive) added a comment - - edited

            John found the reason the ping is being dropped: a patch was introduced that removed early-rx handling from o2iblnd. I believe we are hitting this case, which triggers the discovery ping timeout. We tested the revert on soak and it works properly.

            I believe we should land both the revert and the timeout change.

            The revert will eliminate the drop ping issue.

            The timeout change will deal with cases where there is a legitimate communication failure, like trying to mount while some of the OSTs are down.


            John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/31675
            Subject: LU-10800 lnet: Revert "LU-10270 lnet: remove an early rx code"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 46eb7f97c029a998ca384ddd208ff8444d4cb126


            Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/31663
            Subject: LU-10800 lnet: reduce discovery timeout
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3daf02b0245d57b4250e5404985b88eb72af3588

            ashehata Amir Shehata (Inactive) added a comment - - edited

            It appears that, with peer discovery enabled, some nodes are not responding to the discovery ping (the REPLY to the GET). Discovery is initiated in:

            ptlrpc_connection_get() 

            by calling

            LNetPrimaryNID() 

            This function blocks until discovery completes. However, discovery is not getting a reply, and the reply timeout is 3 minutes. In some of the logs I have seen, two OSTs are timing out, which adds 6 minutes to the mount time. What we see is that after this long delay the mount completes.

            So there appear to be two problems:

            1. Why are some of the nodes not responding to the ping?
            2. The 3-minute timeout for the ping REPLY or push ACK seems to be overkill.

            For 1, I ran a simple test where I modprobe lustre on all the nodes and then run lctl ping <ost nid> on all the nodes. What I found is that some nodes would fail the ping; if I run the ping again, the nodes that failed succeed the second time around. This behavior is unrelated to discovery. I'll need to look at it in more detail tomorrow.

            To solve 2, there are two options to consider:

            1. Reduce the timeout to 5 or 10 seconds. This would cut the added mount time from 3 minutes per failed server down to 10 or 20 seconds in total for the two timed-out OSTs seen here. I think this solution would be sufficient for 2.11.
            2. Change the behavior to do discovery in the background: create a non-MR peer, initiate discovery on it, and at the same time continue sending messages as normal. When discovery completes, the peer is updated according to the returned reply. If discovery fails, it can be initiated again; in the case of discovery failure the system operates no worse than if discovery is disabled. This can be done for 2.12.
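            A possible stopgap on an affected client, separate from either fix above, is to switch peer discovery off entirely so that LNetPrimaryNID() does not block waiting for a discovery REPLY. This is a hedged sketch, assuming the discovery knobs that came in with the 2.11 Multi-Rail work; verify the exact option names against your build:

```shell
# Workaround sketch (not from this ticket): disable LNet peer discovery
# so LNetPrimaryNID() returns without waiting for a discovery REPLY.

# At runtime, via lnetctl:
lnetctl set discovery 0

# Persistently, via a module option read when the lnet module loads:
cat > /etc/modprobe.d/lnet-discovery.conf <<'EOF'
options lnet lnet_peer_discovery_disabled=1
EOF
```

            Note this trades away Multi-Rail peer discovery, so it is a diagnostic setting rather than a fix.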

             

            jhammond John Hammond added a comment - - edited

            I got onto soak-34 while it was stuck in mount and found the following stack trace for the llog processing thread:

            13498 llog_process_th
            [<ffffffffc099d61b>] lnet_discover_peer_locked+0x10b/0x380 [lnet]
            [<ffffffffc099d900>] LNetPrimaryNID+0x70/0x1a0 [lnet]
            [<ffffffffc0c1937e>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
            [<ffffffffc0c0da1c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
            [<ffffffffc0bf21b2>] import_set_conn+0xb2/0x7a0 [ptlrpc]
            [<ffffffffc0bf35f9>] client_obd_setup+0xd39/0x1450 [ptlrpc]
            [<ffffffffc0d83c0b>] osc_setup_common+0x5b/0x330 [osc]
            [<ffffffffc0d83f13>] osc_setup+0x33/0x2f0 [osc]
            [<ffffffffc0d8f7c5>] osc_device_alloc+0xa5/0x240 [osc]
            [<ffffffffc0a2be2a>] obd_setup+0x11a/0x2b0 [obdclass]
            [<ffffffffc0a2d188>] class_setup+0x2a8/0x840 [obdclass]
            [<ffffffffc0a30662>] class_process_config+0x1ae2/0x27b0 [obdclass]
            [<ffffffffc0a32e94>] class_config_llog_handler+0x914/0x1330 [obdclass]
            [<ffffffffc09f69b9>] llog_process_thread+0x839/0x1560 [obdclass]
            [<ffffffffc09f80ef>] llog_process_thread_daemonize+0x9f/0xe0 [obdclass]
            [<ffffffff810b4031>] kthread+0xd1/0xe0
            [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
            [<ffffffffffffffff>] 0xffffffffffffffff
            
            13473 mount.lustre
            [<ffffffffc09f7924>] llog_process_or_fork+0x244/0x450 [obdclass]
            [<ffffffffc09f7b44>] llog_process+0x14/0x20 [obdclass]
            [<ffffffffc0a2a495>] class_config_parse_llog+0x125/0x350 [obdclass]
            [<ffffffffc0652038>] mgc_process_cfg_log+0x788/0xc40 [mgc]
            [<ffffffffc0655233>] mgc_process_log+0x3d3/0x8b0 [mgc]
            [<ffffffffc0656830>] mgc_process_config+0x890/0x13f0 [mgc]
            [<ffffffffc0a35fd8>] lustre_process_log+0x2d8/0xaf0 [obdclass]
            [<ffffffffc0e835eb>] ll_fill_super+0x45b/0x1100 [lustre]
            [<ffffffffc0a3be3c>] lustre_fill_super+0x28c/0x920 [obdclass]
            [<ffffffff8120948f>] mount_nodev+0x4f/0xb0
            [<ffffffffc0a33df8>] lustre_mount+0x38/0x60 [obdclass]
            [<ffffffff81209f1e>] mount_fs+0x3e/0x1b0
            [<ffffffff81226d57>] vfs_kern_mount+0x67/0x110
            [<ffffffff81229263>] do_mount+0x233/0xaf0
            [<ffffffff81229ea6>] SyS_mount+0x96/0xf0
            [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
            [<ffffffffffffffff>] 0xffffffffffffffff
            

            This is at the schedule() call in the for (;;) loop of lnet_discover_peer_locked():

            [root@soak-34 ~]# nm /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/lustre-client/net/lnet.ko | grep lnet_discover_peer_locked
            0000000000035510 T lnet_discover_peer_locked
            [root@soak-34 ~]# printf '%#x\n' $((0x0000000000035510 + 0x10b))
            0x3561b
            [root@soak-34 ~]# addr2line -e /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/lustre-client/net/lnet.ko --functions --inlines 0x3561b
            lnet_discover_peer_locked
            /usr/src/debug/lustre-2.10.58_139_g630cd49/lnet/lnet/peer.c:2064
            
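            The nm + printf + addr2line sequence above can be wrapped in one small helper; a sketch (sym2line is my name for it, not a Lustre or binutils tool):

```shell
# Sketch: resolve a "symbol+offset" frame from a kernel stack trace to a
# source line. Wraps the nm/printf/addr2line steps shown above.
sym2line() {
    local ko=$1 sym=$2 off=$3
    # Start address of the symbol inside the module, from its symbol table.
    local base
    base=0x$(nm "$ko" | awk -v s="$sym" '$3 == s { print $1; exit }')
    # Symbol base plus the offset from the trace is the address to resolve.
    addr2line -e "$ko" --functions --inlines "$(printf '%#x' $((base + off)))"
}

# Usage, matching the session above:
#   sym2line /lib/modules/$(uname -r)/extra/lustre-client/net/lnet.ko \
#            lnet_discover_peer_locked 0x10b
```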

            The last lines in the debug buffer for the processing thread were:

            00000020:01000000:17.0:1520891497.612056:0:13498:0:(obd_config.c:1599:class_config_llog_handler()) Marker, inst_flg=0x2 mark_flg=0x2
            00000020:00000080:17.0:1520891497.612058:0:13498:0:(obd_config.c:1135:class_process_config()) processing cmd: cf010
            00000020:00000080:17.0:1520891497.612059:0:13498:0:(obd_config.c:1205:class_process_config()) marker 79 (0x2) soaked-OST0017 add osc
            
            cliffw Cliff White (Inactive) added a comment - - edited

            Tried lustre-master-ib build 58, per Peter. Same issue with version=2.10.58_118_gcf6b8f5:

            [  842.767852] INFO: task mount.lustre:2760 blocked for more than 120 seconds.
            [  842.791128] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [  842.817070] mount.lustre    D ffff88105acb0000     0  2760   2759 0x00000080
            [  842.840464] Call Trace:
            [  842.848565]  [<ffffffff810cae08>] ? __enqueue_entity+0x78/0x80
            [  842.867884]  [<ffffffff810d17fc>] ? enqueue_entity+0x26c/0xb60
            [  842.887205]  [<ffffffff816ab8a9>] schedule+0x29/0x70
            [  842.903644]  [<ffffffff816a92b9>] schedule_timeout+0x239/0x2c0
            [  842.922960]  [<ffffffff810c95d5>] ? sched_clock_cpu+0x85/0xc0
            [  842.941987]  [<ffffffff810c30d8>] ? check_preempt_curr+0x78/0xa0
            [  842.961880]  [<ffffffff810c3119>] ? ttwu_do_wakeup+0x19/0xd0
            [  842.980620]  [<ffffffff816abc5d>] wait_for_completion+0xfd/0x140
            [  843.000516]  [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
            [  843.019150]  [<ffffffffc0a24854>] llog_process_or_fork+0x244/0x450 [obdclass]
            [  843.042950]  [<ffffffffc0a24a74>] llog_process+0x14/0x20 [obdclass]
            [  843.063850]  [<ffffffffc0a56da5>] class_config_parse_llog+0x125/0x350 [obdclass]
            [  843.088364]  [<ffffffffc05a51c8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
            [  843.110283]  [<ffffffffc05a7243>] mgc_process_log+0x3d3/0x8b0 [mgc]
            [  843.131181]  [<ffffffffc0a5ee20>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
            [  843.156774]  [<ffffffffc05a7968>] ? do_config_log_add+0x248/0x580 [mgc]
            [  843.178634]  [<ffffffffc05a8840>] mgc_process_config+0x890/0x13f0 [mgc]
            [  843.200574]  [<ffffffffc0a62865>] lustre_process_log+0x2d5/0xae0 [obdclass]
            [  843.223581]  [<ffffffffc0939e27>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
            [  843.245497]  [<ffffffffc0ea994b>] ll_fill_super+0x45b/0x1100 [lustre]
            [  843.266861]  [<ffffffffc0a68686>] lustre_fill_super+0x286/0x910 [obdclass]
            [  843.289656]  [<ffffffffc0a68400>] ? lustre_common_put_super+0x270/0x270 [obdclass]
            [  843.314667]  [<ffffffff81206abd>] mount_nodev+0x4d/0xb0
            [  843.331977]  [<ffffffffc0a60698>] lustre_mount+0x38/0x60 [obdclass]
            [  843.352676]  [<ffffffff81207549>] mount_fs+0x39/0x1b0
            [  843.369355]  [<ffffffff81224177>] vfs_kern_mount+0x67/0x110
            [  843.387756]  [<ffffffff81226683>] do_mount+0x233/0xaf0
            [  843.404720]  [<ffffffff811894ee>] ? __get_free_pages+0xe/0x40
            [  843.423693]  [<ffffffff812272c6>] SyS_mount+0x96/0xf0
            [  843.440374]  [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
            
            jhammond John Hammond added a comment -

            Lustre: Build Version: 2.10.58_140_ge103776


            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: