[LU-10800] Mount hangs on clients. Created: 09/Mar/18  Updated: 19/Mar/18  Resolved: 19/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: soak
Environment:

Soak stress cluster, lustre-master-next-ib build 1


Attachments: File soak-36.mount.hang.txt.gz    
Issue Links:
Related
is related to LU-9984 Client mounts hang Open
is related to LU-10270 remove an early rx code from o2ib lnd Reopened

 Description   

Mounts frequently hang on clients.

Mar  9 18:14:41 soak-36 kernel: INFO: task mount.lustre:2807 blocked for more than 120 seconds.
Mar  9 18:14:41 soak-36 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  9 18:14:41 soak-36 kernel: mount.lustre    D ffff88085b7a0000     0  2807   2806 0x00000080
Mar  9 18:14:41 soak-36 kernel: Call Trace:
Mar  9 18:14:41 soak-36 kernel: [<ffffffff816ab8a9>] schedule+0x29/0x70
Mar  9 18:14:41 soak-36 kernel: [<ffffffff816a92b9>] schedule_timeout+0x239/0x2c0
Mar  9 18:14:41 soak-36 kernel: [<ffffffff81050b4c>] ? native_smp_send_reschedule+0x4c/0x70
Mar  9 18:14:41 soak-36 kernel: [<ffffffff810c2358>] ? resched_curr+0xa8/0xc0
Mar  9 18:14:41 soak-36 kernel: [<ffffffff810c30d8>] ? check_preempt_curr+0x78/0xa0
Mar  9 18:14:41 soak-36 kernel: [<ffffffff810c3119>] ? ttwu_do_wakeup+0x19/0xd0
Mar  9 18:14:41 soak-36 kernel: [<ffffffff816abc5d>] wait_for_completion+0xfd/0x140
Mar  9 18:14:42 soak-36 kernel: [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0b28854>] llog_process_or_fork+0x244/0x450 [obdclass]
Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0b28a74>] llog_process+0x14/0x20 [obdclass]
Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0b5b1c5>] class_config_parse_llog+0x125/0x350 [obdclass]
Mar  9 18:14:42 soak-36 kernel: [<ffffffffc06501c8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0652243>] mgc_process_log+0x3d3/0x8b0 [mgc]
Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0b63240>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0652968>] ? do_config_log_add+0x248/0x580 [mgc]
Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0653840>] mgc_process_config+0x890/0x13f0 [mgc]
Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0b66c85>] lustre_process_log+0x2d5/0xae0 [obdclass]
Mar  9 18:14:42 soak-36 kernel: [<ffffffffc0855e27>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Mar  9 18:14:43 soak-36 kernel: [<ffffffffc0f3e3bb>] ll_fill_super+0x45b/0x1100 [lustre]
Mar  9 18:14:43 soak-36 kernel: [<ffffffffc0b6caa6>] lustre_fill_super+0x286/0x910 [obdclass]
Mar  9 18:14:43 soak-36 kernel: [<ffffffffc0b6c820>] ? lustre_common_put_super+0x270/0x270 [obdclass]
Mar  9 18:14:43 soak-36 kernel: [<ffffffff81206abd>] mount_nodev+0x4d/0xb0
Mar  9 18:14:43 soak-36 kernel: [<ffffffffc0b64ab8>] lustre_mount+0x38/0x60 [obdclass]
Mar  9 18:14:43 soak-36 kernel: [<ffffffff81207549>] mount_fs+0x39/0x1b0
Mar  9 18:14:43 soak-36 kernel: [<ffffffff81224177>] vfs_kern_mount+0x67/0x110
Mar  9 18:14:43 soak-36 kernel: [<ffffffff81226683>] do_mount+0x233/0xaf0
Mar  9 18:14:43 soak-36 kernel: [<ffffffff811894ee>] ? __get_free_pages+0xe/0x40
Mar  9 18:14:43 soak-36 kernel: [<ffffffff812272c6>] SyS_mount+0x96/0xf0
Mar  9 18:14:43 soak-36 kernel: [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
Mar  9 18:16:43 soak-36 kernel: INFO: task mount.lustre:2807 blocked for more than 120 seconds.
Mar  9 18:16:43 soak-36 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

I dumped the Lustre debug log during the hang (attached). I also crash-dumped the client; the files are available on soak.



 Comments   
Comment by John Hammond [ 09/Mar/18 ]

Lustre: Build Version: 2.10.58_140_ge103776

Comment by Cliff White (Inactive) [ 09/Mar/18 ]

Tried lustre-master-ib build 58, per Peter - same issue (version=2.10.58_118_gcf6b8f5).

[  842.767852] INFO: task mount.lustre:2760 blocked for more than 120 seconds.
[  842.791128] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  842.817070] mount.lustre    D ffff88105acb0000     0  2760   2759 0x00000080
[  842.840464] Call Trace:
[  842.848565]  [<ffffffff810cae08>] ? __enqueue_entity+0x78/0x80
[  842.867884]  [<ffffffff810d17fc>] ? enqueue_entity+0x26c/0xb60
[  842.887205]  [<ffffffff816ab8a9>] schedule+0x29/0x70
[  842.903644]  [<ffffffff816a92b9>] schedule_timeout+0x239/0x2c0
[  842.922960]  [<ffffffff810c95d5>] ? sched_clock_cpu+0x85/0xc0
[  842.941987]  [<ffffffff810c30d8>] ? check_preempt_curr+0x78/0xa0
[  842.961880]  [<ffffffff810c3119>] ? ttwu_do_wakeup+0x19/0xd0
[  842.980620]  [<ffffffff816abc5d>] wait_for_completion+0xfd/0x140
[  843.000516]  [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
[  843.019150]  [<ffffffffc0a24854>] llog_process_or_fork+0x244/0x450 [obdclass]
[  843.042950]  [<ffffffffc0a24a74>] llog_process+0x14/0x20 [obdclass]
[  843.063850]  [<ffffffffc0a56da5>] class_config_parse_llog+0x125/0x350 [obdclass]
[  843.088364]  [<ffffffffc05a51c8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
[  843.110283]  [<ffffffffc05a7243>] mgc_process_log+0x3d3/0x8b0 [mgc]
[  843.131181]  [<ffffffffc0a5ee20>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[  843.156774]  [<ffffffffc05a7968>] ? do_config_log_add+0x248/0x580 [mgc]
[  843.178634]  [<ffffffffc05a8840>] mgc_process_config+0x890/0x13f0 [mgc]
[  843.200574]  [<ffffffffc0a62865>] lustre_process_log+0x2d5/0xae0 [obdclass]
[  843.223581]  [<ffffffffc0939e27>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[  843.245497]  [<ffffffffc0ea994b>] ll_fill_super+0x45b/0x1100 [lustre]
[  843.266861]  [<ffffffffc0a68686>] lustre_fill_super+0x286/0x910 [obdclass]
[  843.289656]  [<ffffffffc0a68400>] ? lustre_common_put_super+0x270/0x270 [obdclass]
[  843.314667]  [<ffffffff81206abd>] mount_nodev+0x4d/0xb0
[  843.331977]  [<ffffffffc0a60698>] lustre_mount+0x38/0x60 [obdclass]
[  843.352676]  [<ffffffff81207549>] mount_fs+0x39/0x1b0
[  843.369355]  [<ffffffff81224177>] vfs_kern_mount+0x67/0x110
[  843.387756]  [<ffffffff81226683>] do_mount+0x233/0xaf0
[  843.404720]  [<ffffffff811894ee>] ? __get_free_pages+0xe/0x40
[  843.423693]  [<ffffffff812272c6>] SyS_mount+0x96/0xf0
[  843.440374]  [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
Comment by John Hammond [ 12/Mar/18 ]

I got onto soak-34 while it was stuck in mount and found the following stack traces for the llog processing thread and the mount thread:

13498 llog_process_th
[<ffffffffc099d61b>] lnet_discover_peer_locked+0x10b/0x380 [lnet]
[<ffffffffc099d900>] LNetPrimaryNID+0x70/0x1a0 [lnet]
[<ffffffffc0c1937e>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
[<ffffffffc0c0da1c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
[<ffffffffc0bf21b2>] import_set_conn+0xb2/0x7a0 [ptlrpc]
[<ffffffffc0bf35f9>] client_obd_setup+0xd39/0x1450 [ptlrpc]
[<ffffffffc0d83c0b>] osc_setup_common+0x5b/0x330 [osc]
[<ffffffffc0d83f13>] osc_setup+0x33/0x2f0 [osc]
[<ffffffffc0d8f7c5>] osc_device_alloc+0xa5/0x240 [osc]
[<ffffffffc0a2be2a>] obd_setup+0x11a/0x2b0 [obdclass]
[<ffffffffc0a2d188>] class_setup+0x2a8/0x840 [obdclass]
[<ffffffffc0a30662>] class_process_config+0x1ae2/0x27b0 [obdclass]
[<ffffffffc0a32e94>] class_config_llog_handler+0x914/0x1330 [obdclass]
[<ffffffffc09f69b9>] llog_process_thread+0x839/0x1560 [obdclass]
[<ffffffffc09f80ef>] llog_process_thread_daemonize+0x9f/0xe0 [obdclass]
[<ffffffff810b4031>] kthread+0xd1/0xe0
[<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

13473 mount.lustre
[<ffffffffc09f7924>] llog_process_or_fork+0x244/0x450 [obdclass]
[<ffffffffc09f7b44>] llog_process+0x14/0x20 [obdclass]
[<ffffffffc0a2a495>] class_config_parse_llog+0x125/0x350 [obdclass]
[<ffffffffc0652038>] mgc_process_cfg_log+0x788/0xc40 [mgc]
[<ffffffffc0655233>] mgc_process_log+0x3d3/0x8b0 [mgc]
[<ffffffffc0656830>] mgc_process_config+0x890/0x13f0 [mgc]
[<ffffffffc0a35fd8>] lustre_process_log+0x2d8/0xaf0 [obdclass]
[<ffffffffc0e835eb>] ll_fill_super+0x45b/0x1100 [lustre]
[<ffffffffc0a3be3c>] lustre_fill_super+0x28c/0x920 [obdclass]
[<ffffffff8120948f>] mount_nodev+0x4f/0xb0
[<ffffffffc0a33df8>] lustre_mount+0x38/0x60 [obdclass]
[<ffffffff81209f1e>] mount_fs+0x3e/0x1b0
[<ffffffff81226d57>] vfs_kern_mount+0x67/0x110
[<ffffffff81229263>] do_mount+0x233/0xaf0
[<ffffffff81229ea6>] SyS_mount+0x96/0xf0
[<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
[<ffffffffffffffff>] 0xffffffffffffffff

This is at the schedule() call in the for (;;) loop of lnet_discover_peer_locked():

[root@soak-34 ~]# nm /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/lustre-client/net/lnet.ko | grep lnet_discover_peer_locked
0000000000035510 T lnet_discover_peer_locked
[root@soak-34 ~]# printf '%#x\n' $((0x0000000000035510 + 0x10b))
0x3561b
[root@soak-34 ~]# addr2line -e /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/lustre-client/net/lnet.ko --functions --inlines 0x3561b
lnet_discover_peer_locked
/usr/src/debug/lustre-2.10.58_139_g630cd49/lnet/lnet/peer.c:2064
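
For reference, the wait that peer.c:2064 resolves to looks roughly like the sketch below. This is an illustrative simplification of the discovery wait loop, not the exact code from this build; the helper and field names (lnet_peer_queue_for_discovery, lnet_peer_is_uptodate, lp_dc_waitq, lp_dc_error) only approximate those in the Lustre tree.

/* Illustrative sketch of the discovery wait in lnet_discover_peer_locked().
 * Simplified; helper and field names approximate the Lustre tree. */
static int lnet_discover_peer_locked_sketch(struct lnet_peer *lp, int cpt)
{
        DEFINE_WAIT(wait);
        int rc = 0;

        lnet_peer_queue_for_discovery(lp);       /* hand the peer to the discovery thread */

        for (;;) {
                if (lnet_peer_is_uptodate(lp))   /* discovery finished successfully */
                        break;
                if (lp->lp_dc_error) {           /* discovery reported a failure */
                        rc = lp->lp_dc_error;
                        break;
                }
                prepare_to_wait(&lp->lp_dc_waitq, &wait, TASK_INTERRUPTIBLE);
                lnet_net_unlock(cpt);
                schedule();                      /* the hung thread is parked here until the
                                                  * discovery REPLY arrives or times out */
                finish_wait(&lp->lp_dc_waitq, &wait);
                lnet_net_lock(cpt);
        }

        return rc;
}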

The last lines in the debug buffer for the processing thread were:

00000020:01000000:17.0:1520891497.612056:0:13498:0:(obd_config.c:1599:class_config_llog_handler()) Marker, inst_flg=0x2 mark_flg=0x2
00000020:00000080:17.0:1520891497.612058:0:13498:0:(obd_config.c:1135:class_process_config()) processing cmd: cf010
00000020:00000080:17.0:1520891497.612059:0:13498:0:(obd_config.c:1205:class_process_config()) marker 79 (0x2) soaked-OST0017 add osc
Comment by Amir Shehata (Inactive) [ 15/Mar/18 ]

It appears that the issue is that, with peer discovery enabled, some nodes are not responding to the discovery ping (the REPLY to the GET). Discovery is initiated in:

ptlrpc_connection_get() 

by calling

LNetPrimaryNID() 

This function blocks until discovery completes. However, when discovery does not get a reply, it has to wait out the reply timeout, which is 3 minutes. In some of the logs I have seen, two OSTs are timing out, which adds 6 minutes to the mount time. What we see is that after this long period the mount completes.

So there appear to be two problems:

  1. Why are some of the nodes not responding to the ping?
  2. The 3-minute timeout for the ping REPLY or push ACK seems to be overkill

For 1, I have run a simple test where I modprobe lustre on all the nodes and then run lctl ping <ost nid> on all the nodes. What I found is that some would fail the ping. If I run the ping again, the nodes that failed the ping succeed the second time around. This behavior is unrelated to discovery. I'll need to look at it in more detail tomorrow.

To solve 2, there are two options to consider:

  1. Reduce the timeout to 5 or 10 seconds. This will reduce the time to mount the FS from 3 minutes to 10 or 20 seconds per failed server. I think this solution would be sufficient for 2.11.
  2. Change the behavior to do discovery in the background: create a non-MR peer, initiate discovery on it, and at the same time continue sending messages as normal. When discovery completes, the peer will be updated according to the returned reply. If discovery fails, it can be initiated again; in that case the system operates no worse than with discovery disabled. This can be done for 2.12 (see the sketch below).
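
A rough sketch of what option 2 could look like is shown below. This is only an illustration of the proposed non-blocking behavior (kick off discovery and return the best currently known NID instead of waiting), not the implemented patch; lnet_peer_get_or_create and lnet_peer_queue_for_discovery are hypothetical/approximate helper names.

/* Hypothetical sketch of option 2: non-blocking primary-NID lookup.
 * Instead of waiting for discovery, queue it and return the best known
 * primary NID; the discovery thread updates the peer later when (if)
 * the REPLY arrives. Helper names are illustrative, not the real API. */
lnet_nid_t LNetPrimaryNID_background_sketch(lnet_nid_t nid)
{
        struct lnet_peer *lp;
        lnet_nid_t primary = nid;
        int cpt = lnet_net_lock_current();

        lp = lnet_peer_get_or_create(nid, cpt);            /* hypothetical helper */
        if (lp) {
                if (!lnet_peer_is_uptodate(lp))
                        lnet_peer_queue_for_discovery(lp); /* fire and forget */
                primary = lp->lp_primary_nid;              /* best known so far */
                lnet_peer_decref_locked(lp);
        }
        lnet_net_unlock(cpt);
        return primary;
}

With this shape, a discovery failure leaves the peer looking like a plain non-MR peer, which matches the "no worse than discovery disabled" behavior described in option 2.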

 

Comment by Gerrit Updater [ 15/Mar/18 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/31663
Subject: LU-10800 lnet: reduce discovery timeout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3daf02b0245d57b4250e5404985b88eb72af3588

Comment by Gerrit Updater [ 16/Mar/18 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/31675
Subject: LU-10800 lnet: Revert "LU-10270 lnet: remove an early rx code"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 46eb7f97c029a998ca384ddd208ff8444d4cb126

Comment by Amir Shehata (Inactive) [ 16/Mar/18 ]

John found the reason the ping is being dropped. A patch was introduced that removed handling of early rxs from the o2iblnd, and I believe we are hitting this case, which triggers the discovery ping timeout. We tested the revert on soak and it works properly.

I believe we should land both the revert and the timeout change.

The revert will eliminate the dropped-ping issue.

The timeout change will deal with cases where there is a legitimate communication failure, like trying to mount while some of the OSTs are down.

Comment by Gerrit Updater [ 17/Mar/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31663/
Subject: LU-10800 lnet: reduce discovery timeout
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1cf929df259a9aaa5446a4cda637930ca5b27d7a

Comment by Gerrit Updater [ 19/Mar/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31675/
Subject: LU-10800 lnet: Revert "LU-10270 lnet: remove an early rx code"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6224bb44d8d10894f1c21921a0224dd8baf0ded0

Comment by Peter Jones [ 19/Mar/18 ]

Landed for 2.11
