[LU-10123] Repeated timeouts during mount() - MDS and client. Created: 16/Oct/17  Updated: 15/May/20  Resolved: 06/Feb/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: soak
Environment:

Soak cluster, lustre-master build 3654 version=2.10.54_13_g84f690e


Attachments: File soak-40.lustre.log.gz    
Issue Links:
Duplicate
is duplicated by LU-9954 conf-sanity test_53b: TIMEOUT on clie... Closed
Related
Severity: 3

 Description   

Attempting to mount the filesystem with DNE; the MGT/MDS node hangs on the initial mount.

Oct 16 16:38:01 soak-8 kernel: LDISKFS-fs warning (device dm-0): ldiskfs_multi_mount_protect:322: MMP interval 42 higher than expected, please wait.
Oct 16 16:38:44 soak-8 kernel: LDISKFS-fs (dm-0): recovery complete
Oct 16 16:38:44 soak-8 kernel: LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
Oct 16 16:38:45 soak-8 kernel: Lustre: MGS: Connection restored to e7ad9453-d165-e9e1-75ba-304460ab678f (at 0@lo)
Oct 16 16:38:50 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1508171925/real 1508171925]  req@ffff88081c348c00 x1581432791105776/t0(0) o8->soaked-OST0001-osc-MDT0000@192.168.1.103@o2ib:28/4 lens 520/544 e 0 to 1 dl 1508171930 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 16 16:39:10 soak-8 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
Oct 16 16:39:15 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1508171950/real 1508171950]  req@ffff880814820c00 x1581432791105856/t0(0) o8->soaked-OST0000-osc-MDT0000@192.168.1.103@o2ib:28/4 lens 520/544 e 0 to 1 dl 1508171955 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 16 16:39:15 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Oct 16 16:39:24 soak-8 sssd_be: GSSAPI client step 1
Oct 16 16:39:24 soak-8 sssd_be: GSSAPI client step 1
Oct 16 16:39:24 soak-8 sssd_be: GSSAPI client step 1
Oct 16 16:39:24 soak-8 sssd_be: GSSAPI client step 2
Oct 16 16:39:45 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1508171975/real 1508171975]  req@ffff880814821800 x1581432791105936/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib:24/4 lens 520/544 e 0 to 1 dl 1508171985 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 16 16:39:45 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Oct 16 16:40:00 soak-8 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
Oct 16 16:40:01 soak-8 systemd: Started Session 7 of user root.
Oct 16 16:40:01 soak-8 systemd: Starting Session 7 of user root.
Oct 16 16:40:01 soak-8 CROND[13027]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Oct 16 16:40:10 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1508172000/real 1508172000]  req@ffff880814822d00 x1581432791106048/t0(0) o8->soaked-OST0000-osc-MDT0000@192.168.1.103@o2ib:28/4 lens 520/544 e 0 to 1 dl 1508172010 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 16 16:40:10 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Oct 16 16:40:40 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1508172025/real 1508172025]  req@ffff880814823c00 x1581432791106128/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib:24/4 lens 520/544 e 0 to 1 dl 1508172040 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 16 16:40:40 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Oct 16 16:40:50 soak-8 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
Oct 16 16:40:53 soak-8 kernel: INFO: task mount.lustre:12918 blocked for more than 120 seconds.
Oct 16 16:40:54 soak-8 kernel: mount.lustre    D 0000000000000001     0 12918  12917 0x00000082
Oct 16 16:40:54 soak-8 kernel: ffff8804b9bbb7b0 0000000000000082 ffff88083f9e0000 ffff8804b9bbbfd8
Oct 16 16:40:54 soak-8 kernel: ffff8804b9bbbfd8 ffff8804b9bbbfd8 ffff88083f9e0000 ffff8808203a8150
Oct 16 16:40:54 soak-8 kernel: 7fffffffffffffff ffff8808203a8148 ffff88083f9e0000 0000000000000001
Oct 16 16:40:54 soak-8 kernel: Call Trace:
Oct 16 16:40:54 soak-8 kernel: [<ffffffff816a94c9>] schedule+0x29/0x70
Oct 16 16:40:54 soak-8 kernel: [<ffffffff816a6fd9>] schedule_timeout+0x239/0x2c0
Oct 16 16:40:54 soak-8 kernel: [<ffffffff810c9ef9>] ? select_task_rq_fair+0x549/0x700
Oct 16 16:40:54 soak-8 kernel: [<ffffffff816a987d>] wait_for_completion+0xfd/0x140
Oct 16 16:40:54 soak-8 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a2f6b4>] llog_process_or_fork+0x244/0x450 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a2f8d4>] llog_process+0x14/0x20 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a5fe45>] class_config_parse_llog+0x125/0x350 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc10235a8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc1026413>] mgc_process_log+0x3d3/0x890 [mgc]
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a68550>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc1026b18>] ? do_config_log_add+0x248/0x580 [mgc]
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc10279f0>] mgc_process_config+0x890/0x13f0 [mgc]
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a6bfee>] lustre_process_log+0x2de/0xaf0 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc090dba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a57339>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a98c32>] server_start_targets+0x1352/0x2a70 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a6ef5d>] ? lustre_start_mgc+0x20d/0x2800 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a68550>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a9b3dd>] server_fill_super+0x108d/0x184c [obdclass]
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a71ff8>] lustre_fill_super+0x328/0x950 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a71cd0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0
Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a69e18>] lustre_mount+0x38/0x60 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [<ffffffff81205599>] mount_fs+0x39/0x1b0
Oct 16 16:40:54 soak-8 kernel: [<ffffffff81222077>] vfs_kern_mount+0x67/0x110
Oct 16 16:40:54 soak-8 kernel: [<ffffffff81224583>] do_mount+0x233/0xaf0
Oct 16 16:40:54 soak-8 kernel: [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40
Oct 16 16:40:54 soak-8 kernel: [<ffffffff812251c6>] SyS_mount+0x96/0xf0
Oct 16 16:40:54 soak-8 kernel: [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b

Power-cycled soak-8; the mount completed after the power cycle. Continued bringing up the filesystem.
soak-10 (MDT0003) hung during mount.

Oct 16 16:57:37 soak-10 kernel: INFO: task mount.lustre:13350 blocked for more than 120 seconds.
Oct 16 16:57:37 soak-10 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 16 16:57:38 soak-10 kernel: mount.lustre    D 0000000000000001     0 13350  13349 0x00000082
Oct 16 16:57:38 soak-10 kernel: ffff880817bdf7b0 0000000000000086 ffff880817783f40 ffff880817bdffd8
Oct 16 16:57:38 soak-10 kernel: ffff880817bdffd8 ffff880817bdffd8 ffff880817783f40 ffff88082ce87c30
Oct 16 16:57:38 soak-10 kernel: 7fffffffffffffff ffff88082ce87c28 ffff880817783f40 0000000000000001
Oct 16 16:57:38 soak-10 kernel: Call Trace:
Oct 16 16:57:38 soak-10 kernel: [<ffffffff816a94c9>] schedule+0x29/0x70
Oct 16 16:57:38 soak-10 kernel: [<ffffffff816a6fd9>] schedule_timeout+0x239/0x2c0
Oct 16 16:57:38 soak-10 kernel: [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0
Oct 16 16:57:38 soak-10 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0
Oct 16 16:57:38 soak-10 kernel: [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0
Oct 16 16:57:38 soak-10 kernel: [<ffffffff816a987d>] wait_for_completion+0xfd/0x140
Oct 16 16:57:38 soak-10 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a406b4>] llog_process_or_fork+0x244/0x450 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a408d4>] llog_process+0x14/0x20 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a70e45>] class_config_parse_llog+0x125/0x350 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [<ffffffffc14505a8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
Oct 16 16:57:38 soak-10 kernel: [<ffffffffc1453413>] mgc_process_log+0x3d3/0x890 [mgc]
Oct 16 16:57:38 soak-10 kernel: [<ffffffffc1453b77>] ? do_config_log_add+0x2a7/0x580 [mgc]
Oct 16 16:57:38 soak-10 kernel: [<ffffffffc14549f0>] mgc_process_config+0x890/0x13f0 [mgc]
Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a7cfee>] lustre_process_log+0x2de/0xaf0 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [<ffffffffc091b4e4>] ? libcfs_log_return+0x24/0x30 [libcfs]
Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0aa9c32>] server_start_targets+0x1352/0x2a70 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a80259>] ? lustre_start_mgc+0x509/0x2800 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a79550>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0aac3dd>] server_fill_super+0x108d/0x184c [obdclass]
Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a82ff8>] lustre_fill_super+0x328/0x950 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a82cd0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0
Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a7ae18>] lustre_mount+0x38/0x60 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [<ffffffff81205599>] mount_fs+0x39/0x1b0
Oct 16 16:57:38 soak-10 kernel: [<ffffffff81222077>] vfs_kern_mount+0x67/0x110
Oct 16 16:57:38 soak-10 kernel: [<ffffffff81224583>] do_mount+0x233/0xaf0
Oct 16 16:57:38 soak-10 kernel: [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40
Oct 16 16:57:38 soak-10 kernel: [<ffffffff812251c6>] SyS_mount+0x96/0xf0
Oct 16 16:57:38 soak-10 kernel: [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b

Power-cycled that node and continued with the mount; the second mount attempt on soak-10 worked.
While mounting the clients, soak-40 hung.

Oct 16 17:17:56 soak-40 kernel: INFO: task mount.lustre:2933 blocked for more than 120 seconds.
Oct 16 17:17:56 soak-40 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 16 17:17:56 soak-40 kernel: mount.lustre    D 0000000000000001     0  2933   2932 0x00000080
Oct 16 17:17:56 soak-40 kernel: ffff88107f827940 0000000000000086 ffff88085c2e5ee0 ffff88107f827fd8
Oct 16 17:17:56 soak-40 kernel: ffff88107f827fd8 ffff88107f827fd8 ffff88085c2e5ee0 ffff88085cb5c5d0
Oct 16 17:17:57 soak-40 kernel: 7fffffffffffffff ffff88085cb5c5c8 ffff88085c2e5ee0 0000000000000001
Oct 16 17:17:57 soak-40 kernel: Call Trace:
Oct 16 17:17:57 soak-40 kernel: [<ffffffff816a94e9>] schedule+0x29/0x70
Oct 16 17:17:57 soak-40 kernel: [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
Oct 16 17:17:57 soak-40 kernel: [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0
Oct 16 17:17:57 soak-40 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0
Oct 16 17:17:57 soak-40 kernel: [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0
Oct 16 17:17:57 soak-40 kernel: [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
Oct 16 17:17:57 soak-40 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
Oct 16 17:17:57 soak-40 kernel: [<ffffffffc0ba66b4>] llog_process_or_fork+0x244/0x450 [obdclass]
Oct 16 17:17:57 soak-40 kernel: [<ffffffffc0ba68d4>] llog_process+0x14/0x20 [obdclass]
Oct 16 17:17:57 soak-40 kernel: [<ffffffffc0bd6d65>] class_config_parse_llog+0x125/0x350 [obdclass]
Oct 16 17:17:57 soak-40 kernel: [<ffffffffc0844368>] mgc_process_cfg_log+0x788/0xc40 [mgc]
Oct 16 17:17:57 soak-40 kernel: [<ffffffffc08471e3>] mgc_process_log+0x3d3/0x890 [mgc]
Oct 16 17:17:57 soak-40 kernel: [<ffffffffc0847947>] ? do_config_log_add+0x2a7/0x580 [mgc]
Oct 16 17:17:57 soak-40 kernel: [<ffffffffc08487c0>] mgc_process_config+0x890/0x13f0 [mgc]
Oct 16 17:17:58 soak-40 kernel: [<ffffffffc0be2d7e>] lustre_process_log+0x2de/0xaf0 [obdclass]
Oct 16 17:17:58 soak-40 kernel: [<ffffffff816aba0e>] ? _raw_spin_unlock_bh+0x1e/0x20
Oct 16 17:17:58 soak-40 kernel: [<ffffffff8132318b>] ? fprop_local_init_percpu+0x1b/0x30
Oct 16 17:17:58 soak-40 kernel: [<ffffffffc0d7b608>] ll_fill_super+0xaf8/0x1220 [lustre]
Oct 16 17:17:58 soak-40 kernel: [<ffffffffc0be8ce6>] lustre_fill_super+0x286/0x910 [obdclass]
Oct 16 17:17:58 soak-40 kernel: [<ffffffffc0be8a60>] ? lustre_common_put_super+0x270/0x270 [obdclass]
Oct 16 17:17:58 soak-40 kernel: [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0
Oct 16 17:17:58 soak-40 kernel: [<ffffffffc0be0ba8>] lustre_mount+0x38/0x60 [obdclass]
Oct 16 17:17:58 soak-40 kernel: [<ffffffff81205599>] mount_fs+0x39/0x1b0
Oct 16 17:17:58 soak-40 kernel: [<ffffffff81222067>] vfs_kern_mount+0x67/0x110
Oct 16 17:17:58 soak-40 kernel: [<ffffffff81224573>] do_mount+0x233/0xaf0
Oct 16 17:17:58 soak-40 kernel: [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40
Oct 16 17:17:58 soak-40 kernel: [<ffffffff812251b6>] SyS_mount+0x96/0xf0
Oct 16 17:17:58 soak-40 kernel: [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b
Oct 16 17:19:59 soak-40 kernel: INFO: task mount.lustre:2933 blocked for more than 120 seconds.
Oct 16 17:19:59 soak-40 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Dumped a full (debug=-1) lustre-log from the client; attached.



 Comments   
Comment by Peter Jones [ 16/Oct/17 ]

Bruno

Could you please investigate this one?

Thanks

Peter

Comment by Cliff White (Inactive) [ 02/Nov/17 ]

These hangs are appearing again with the latest lustre-master build:

soak-24 login: [ 7800.834964] INFO: task mount.lustre:2613 blocked for more than 120 seconds.
[ 7800.844715] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7800.855390] mount.lustre    D 0000000000000001     0  2613   2612 0x00000080
[ 7800.865073]  ffff881014bdf940 0000000000000086 ffff88101c1cdee0 ffff881014bdffd8
[ 7800.875067]  ffff881014bdffd8 ffff881014bdffd8 ffff88101c1cdee0 ffff88101c13d5d0
[ 7800.884968]  7fffffffffffffff ffff88101c13d5c8 ffff88101c1cdee0 0000000000000001
[ 7800.894763] Call Trace:
[ 7800.898926]  [<ffffffff816a94e9>] schedule+0x29/0x70
[ 7800.905812]  [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
[ 7800.913671]  [<ffffffff810c9ef9>] ? select_task_rq_fair+0x549/0x700
[ 7800.921952]  [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
[ 7800.929957]  [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
[ 7800.937435]  [<ffffffffc09f3724>] llog_process_or_fork+0x244/0x450 [obdclass]
[ 7800.946642]  [<ffffffffc09f3944>] llog_process+0x14/0x20 [obdclass]
[ 7800.954833]  [<ffffffffc0a24a05>] class_config_parse_llog+0x125/0x350 [obdclass]
[ 7800.964256]  [<ffffffffc08cd368>] mgc_process_cfg_log+0x788/0xc40 [mgc]
[ 7800.972714]  [<ffffffffc08d01e3>] mgc_process_log+0x3d3/0x890 [mgc]
[ 7800.980801]  [<ffffffffc0a2c770>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 7800.990481]  [<ffffffffc08d08e8>] ? do_config_log_add+0x248/0x580 [mgc]
[ 7800.998843]  [<ffffffffc08d17c0>] mgc_process_config+0x890/0x13f0 [mgc]
[ 7801.007224]  [<ffffffffc0a3006e>] lustre_process_log+0x2de/0xaf0 [obdclass]
[ 7801.015956]  [<ffffffffc0908c47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 7801.024307]  [<ffffffffc0ddc736>] ll_fill_super+0xae6/0x11e0 [lustre]
[ 7801.032419]  [<ffffffffc0a35d86>] lustre_fill_super+0x286/0x910 [obdclass]
[ 7801.041016]  [<ffffffffc0a35b00>] ? lustre_common_put_super+0x270/0x270 [obdclass]
[ 7801.050348]  [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0
[ 7801.057065]  [<ffffffffc0a2de98>] lustre_mount+0x38/0x60 [obdclass]
[ 7801.064931]  [<ffffffff81205599>] mount_fs+0x39/0x1b0
[ 7801.071426]  [<ffffffff81222067>] vfs_kern_mount+0x67/0x110
[ 7801.078498]  [<ffffffff81224573>] do_mount+0x233/0xaf0
[ 7801.085079]  [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40
[ 7801.092332]  [<ffffffff812251b6>] SyS_mount+0x96/0xf0
[ 7801.098807]  [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b
Comment by Sarah Liu [ 10/Nov/17 ]

Seeing this again on lustre-master-ib build #16 with DNE:

[ 5643.580301] INFO: task mount.lustre:2915 blocked for more than 120 seconds.
[ 5643.589964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5643.600403] mount.lustre    D 0000000000000001     0  2915   2914 0x00000082
[ 5643.609908]  ffff88042e27b7b0 0000000000000082 ffff880409d71fa0 ffff88042e27bfd8
[ 5643.619780]  ffff88042e27bfd8 ffff88042e27bfd8 ffff880409d71fa0 ffff88042b9ad7b0
[ 5643.629533]  7fffffffffffffff ffff88042b9ad7a8 ffff880409d71fa0 0000000000000001
[ 5643.639137] Call Trace:
[ 5643.643121]  [<ffffffff816a9589>] schedule+0x29/0x70
[ 5643.649842]  [<ffffffff816a7099>] schedule_timeout+0x239/0x2c0
[ 5643.657539]  [<ffffffff81050b5c>] ? native_smp_send_reschedule+0x4c/0x70
[ 5643.666145]  [<ffffffff810c0558>] ? resched_curr+0xa8/0xc0
[ 5643.673388]  [<ffffffff810c12d8>] ? check_preempt_curr+0x78/0xa0
[ 5643.681144]  [<ffffffff810c1319>] ? ttwu_do_wakeup+0x19/0xd0
[ 5643.688510]  [<ffffffff816a993d>] wait_for_completion+0xfd/0x140
[ 5643.696212]  [<ffffffff810c4820>] ? wake_up_state+0x20/0x20
[ 5643.703465]  [<ffffffffc0a53724>] llog_process_or_fork+0x244/0x450 [obdclass]
[ 5643.712438]  [<ffffffffc0a53944>] llog_process+0x14/0x20 [obdclass]
[ 5643.720410]  [<ffffffffc0a84ac5>] class_config_parse_llog+0x125/0x350 [obdclass]
[ 5643.729617]  [<ffffffffc14215a8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
[ 5643.737880]  [<ffffffffc1424413>] mgc_process_log+0x3d3/0x890 [mgc]
[ 5643.745784]  [<ffffffffc0a8c810>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 5643.755294]  [<ffffffffc1424b18>] ? do_config_log_add+0x248/0x580 [mgc]
[ 5643.763560]  [<ffffffffc14259f0>] mgc_process_config+0x890/0x13f0 [mgc]
[ 5643.771778]  [<ffffffffc0a902ae>] lustre_process_log+0x2de/0xaf0 [obdclass]
[ 5643.780388]  [<ffffffffc0968c47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 5643.788598]  [<ffffffffc0a7bbd9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 5643.797374]  [<ffffffffc0abca12>] server_start_targets+0x1352/0x2a70 [obdclass]
[ 5643.806348]  [<ffffffffc0a93270>] ? lustre_start_mgc+0x260/0x25b0 [obdclass]
[ 5643.815030]  [<ffffffffc0a8c810>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 5643.824484]  [<ffffffffc0abf1bd>] server_fill_super+0x108d/0x184c [obdclass]
[ 5643.833157]  [<ffffffffc0a96068>] lustre_fill_super+0x328/0x950 [obdclass]
[ 5643.841638]  [<ffffffffc0a95d40>] ? lustre_common_put_super+0x270/0x270 [obdclass]
[ 5643.850881]  [<ffffffff81204a7d>] mount_nodev+0x4d/0xb0
[ 5643.857516]  [<ffffffffc0a8e0d8>] lustre_mount+0x38/0x60 [obdclass]
[ 5643.865298]  [<ffffffff81205509>] mount_fs+0x39/0x1b0
[ 5643.871716]  [<ffffffff81221fe7>] vfs_kern_mount+0x67/0x110
[ 5643.878713]  [<ffffffff812244f3>] do_mount+0x233/0xaf0
[ 5643.885221]  [<ffffffff8118761e>] ? __get_free_pages+0xe/0x40
[ 5643.892384]  [<ffffffff81225136>] SyS_mount+0x96/0xf0
[ 5643.898795]  [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
[root@soak-9 ~]# 

Comment by Jian Yu [ 15/Nov/17 ]

conf-sanity test 53b also hit this issue on master branch, which is affecting patch review testing:
https://testing.hpdd.intel.com/test_sets/e63e81e6-c94c-11e7-a066-52540065bddc
https://testing.hpdd.intel.com/test_sets/683390e6-c570-11e7-a066-52540065bddc

Comment by Rus Taliss [ 16/Nov/17 ]

I realize you are working on the future 2.11 release. This post is not related to Soak, but looking at the log it appears the striped sets are giving the timeout problem. May I ask why not use IB RDMA? I plan an 8-node cluster running Lustre 2.10 with IB connectivity, and I want to hold the DB in memory, distributed across the nodes. What is the best installation using Mellanox ConnectX-2 VPI 40Gb per node in pure IB through an SX6036 switch? These methods need integration within Lustre. At what stage?

Is the WQE encapsulated inside the PCIe doorbell and written via Programmed IO, or is the WQE fetched in a separate DMA read by the HCA?
It can be either. The usual way is writing the WQE and ringing the doorbell, in which case the HCA fetches the WQE through DMA. The less common way is what is called "Blue Flame": part of the PCIe BAR is used for WQEs, and the WQE is written to the HCA the same way the doorbell is written.

The main issue: how does Lustre map to an in-memory distributed store? Even on a 2-node split using only ConnectX-2 VPI (dual port) or ConnectX-3 (though I hear IB performance is slower on the X3). The new features of Lustre 10 make local routing a breeze, direct from the header.

I don't wish for this MPI issue to dominate (even though I am not using MPI directly): can you say whether MPI libraries are used at all in Lustre, or whether UDP binaries are affected, in case I use UDP?

Is IB datagram mode available as a DMA (or other layer) transport within Lustre 10?

MPI IP performance is affected by the socket-buffer sizes for sending and receiving UDP data. These are defined by two network tuning parameters, udp_sendspace and udp_recvspace. When the buffer for sending data is too small and quickly becomes full, UDP data transfer can be delayed. When the buffer for receiving data is too small, incoming UDP data can be dropped due to insufficient buffer space, resulting in send-side retransmission and very poor performance.

Here is v2 of the Mellanox IB utils:
http://www.mellanox.com/page/products_dyn?product_family=110&mtag=monitoring_debug

Worth asking whether Sarah or others might be interested in a remote install/dev engagement on a NY cluster, on a casual basis.

Comment by Bruno Faccini (Inactive) [ 16/Nov/17 ]

The attached Lustre debug log indicates that during the mount process the client is fetching the soaked-client config llog from the MGS and running the different config steps according to its content.

It appears to be stuck during LNet peer discovery of the 192.168.1.110@o2ib NID while creating the soaked-MDT0002-mdc import:

00000040:00001000:67.0:1508174041.158589:0:2946:0:(llog.c:513:llog_process_thread()) processing rec 0xffff8808f39f5698 type 0x10620000
00000040:00001000:67.0:1508174041.158590:0:2946:0:(llog.c:519:llog_process_thread()) after swabbing, type=0x10620000 idx=38
00000040:00001000:67.0:1508174041.158591:0:2946:0:(llog.c:595:llog_process_thread()) lrh_index: 38 lrh_len: 144 (2408 remains)
00000020:00000001:67.0:1508174041.158592:0:2946:0:(obd_config.c:1496:class_config_llog_handler()) Process entered
00000020:00000010:67.0:1508174041.158592:0:2946:0:(obd_config.c:1629:class_config_llog_handler()) kmalloced 'inst_name': 39 at ffff88083c75d040.
00000020:01000000:67.0:1508174041.158594:0:2946:0:(obd_config.c:1637:class_config_llog_handler()) cmd cf003, instance name: soaked-MDT0002-mdc-ffff88105dcb8800
00000020:00000010:67.0:1508174041.158595:0:2946:0:(obd_config.c:1707:class_config_llog_handler()) kmalloced 'lcfg_new': 136 at ffff88103f70e900.
00000020:00000080:67.0:1508174041.158596:0:2946:0:(obd_config.c:1180:class_process_config()) processing cmd: cf003
00000020:00000001:67.0:1508174041.158597:0:2946:0:(obd_config.c:479:class_setup()) Process entered
00000001:00000001:67.0:1508174041.158598:0:2946:0:(hash.c:1005:cfs_hash_create()) Process entered
00000001:00000010:67.0:1508174041.158599:0:2946:0:(hash.c:1032:cfs_hash_create()) alloc '(hs)': 144 at ffff88103f70e840 (tot 21621812).
00000001:00000010:67.0:1508174041.158600:0:2946:0:(hash.c:893:cfs_hash_buckets_realloc()) alloc '(new_bkts)': 32 at ffff88104f3f1580 (tot 21621844).
00000001:00000010:67.0:1508174041.158602:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4c600 (tot 21622124).
00000001:00000010:67.0:1508174041.158604:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4c800 (tot 21622404).
00000001:00000010:67.0:1508174041.158605:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4ca00 (tot 21622684).
00000001:00000010:67.0:1508174041.158607:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4cc00 (tot 21622964).
00000001:00000001:67.0:1508174041.158608:0:2946:0:(hash.c:1005:cfs_hash_create()) Process entered
00000001:00000010:67.0:1508174041.158609:0:2946:0:(hash.c:1032:cfs_hash_create()) alloc '(hs)': 144 at ffff88103f70e780 (tot 21623108).
00000001:00000010:67.0:1508174041.158610:0:2946:0:(hash.c:893:cfs_hash_buckets_realloc()) alloc '(new_bkts)': 32 at ffff88104f3f15a0 (tot 21623140).
00000001:00000010:67.0:1508174041.158611:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4ce00 (tot 21623420).
00000001:00000010:67.0:1508174041.158613:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4d000 (tot 21623700).
00000001:00000010:67.0:1508174041.158614:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4d200 (tot 21623980).
00000001:00000010:67.0:1508174041.158616:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4d400 (tot 21624260).
00000001:00000001:67.0:1508174041.158617:0:2946:0:(hash.c:1005:cfs_hash_create()) Process entered
00000001:00000010:67.0:1508174041.158617:0:2946:0:(hash.c:1032:cfs_hash_create()) alloc '(hs)': 144 at ffff88103f70e6c0 (tot 21624404).
00000001:00000010:67.0:1508174041.158619:0:2946:0:(hash.c:893:cfs_hash_buckets_realloc()) alloc '(new_bkts)': 32 at ffff88104f3f15e0 (tot 21624436).
00000001:00000010:67.0:1508174041.158620:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4d600 (tot 21624716).
00000001:00000010:67.0:1508174041.158622:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4d800 (tot 21624996).
00000001:00000010:67.0:1508174041.158623:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4da00 (tot 21625276).
00000001:00000010:67.0:1508174041.158625:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4dc00 (tot 21625556).
00000001:00000001:67.0:1508174041.158626:0:2946:0:(hash.c:1005:cfs_hash_create()) Process entered
00000001:00000010:67.0:1508174041.158626:0:2946:0:(hash.c:1032:cfs_hash_create()) alloc '(hs)': 144 at ffff88103f70e600 (tot 21625700).
00000001:00000010:67.0:1508174041.158627:0:2946:0:(hash.c:893:cfs_hash_buckets_realloc()) alloc '(new_bkts)': 32 at ffff88104f3f1600 (tot 21625732).
00000001:00000010:67.0:1508174041.158629:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4de00 (tot 21626012).
00000001:00000010:67.0:1508174041.158630:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4e000 (tot 21626292).
00000001:00000010:67.0:1508174041.158631:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4e200 (tot 21626572).
00000001:00000010:67.0:1508174041.158633:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4e400 (tot 21626852).
00000020:00000001:67.0:1508174041.158634:0:2946:0:(genops.c:853:class_new_export()) Process entered
00000020:00000010:67.0:1508174041.158636:0:2946:0:(genops.c:855:class_new_export()) kmalloced 'export': 968 at ffff88107f8f0400.
00000020:00000001:67.0:1508174041.158637:0:2946:0:(lustre_handles.c:64:class_handle_hash()) Process entered
00000020:00000040:67.0:1508174041.158638:0:2946:0:(lustre_handles.c:98:class_handle_hash()) added object ffff88107f8f0400 with handle 0x36c91a8e859b7cc8 to hash
00000020:00000001:67.0:1508174041.158639:0:2946:0:(lustre_handles.c:99:class_handle_hash()) Process leaving
00000020:00000001:67.0:1508174041.158640:0:2946:0:(obd_class.h:991:obd_init_export()) Process entered
00000020:00000001:67.0:1508174041.158641:0:2946:0:(obd_class.h:995:obd_init_export()) Process leaving (rc=0 : 0 : 0)
00000020:00000040:67.0:1508174041.158642:0:2946:0:(obd_config.c:753:class_incref()) incref soaked-MDT0002-mdc-ffff88105dcb8800 (ffff88107fdacf60) now 2
00000020:00000001:67.0:1508174041.158643:0:2946:0:(genops.c:928:class_new_export()) Process leaving (rc=18446612203173774336 : -131870535777280 : ffff88107f8f0400)
00000020:00000040:67.0:1508174041.158644:0:2946:0:(genops.c:828:class_export_put()) PUTting export ffff88107f8f0400 : new refcount 1
00000020:00000001:67.0:1508174041.158645:0:2946:0:(obd_class.h:582:obd_setup()) Process entered
00000002:00000001:67.0:1508174041.158646:0:2946:0:(mdc_request.c:2483:mdc_setup()) Process entered
00000100:00000001:67.0:1508174041.158647:0:2946:0:(ptlrpcd.c:944:ptlrpcd_addref()) Process entered
00000100:00000001:67.0:1508174041.158648:0:2946:0:(ptlrpcd.c:953:ptlrpcd_addref()) Process leaving (rc=0 : 0 : 0)
00010000:00000001:67.0:1508174041.158649:0:2946:0:(ldlm_lib.c:277:client_obd_setup()) Process entered
00010000:00000010:67.0:1508174041.158651:0:2946:0:(ldlm_lib.c:444:client_obd_setup()) kmalloced 'cli->cl_mod_tag_bitmap': 64 at ffff88083c75df80.
00010000:00000001:67.0:1508174041.158652:0:2946:0:(ldlm_lockd.c:2788:ldlm_get_ref()) Process entered
00010000:00000001:67.0:1508174041.158653:0:2946:0:(ldlm_lockd.c:2797:ldlm_get_ref()) Process leaving (rc=0 : 0 : 0)
00000020:00000010:67.0:1508174041.158654:0:2946:0:(genops.c:1067:class_new_import()) kmalloced 'imp': 1480 at ffff88083e668000.
00000020:00000040:67.0:1508174041.158655:0:2946:0:(obd_config.c:753:class_incref()) incref soaked-MDT0002-mdc-ffff88105dcb8800 (ffff88107fdacf60) now 3
00000020:00000001:67.0:1508174041.158657:0:2946:0:(lustre_handles.c:64:class_handle_hash()) Process entered
00000020:00000040:67.0:1508174041.158658:0:2946:0:(lustre_handles.c:98:class_handle_hash()) added object ffff88083e668000 with handle 0x36c91a8e859b7ccf to hash
00000020:00000001:67.0:1508174041.158659:0:2946:0:(lustre_handles.c:99:class_handle_hash()) Process leaving
00000020:00000001:67.0:1508174041.158661:0:2946:0:(genops.c:1030:class_import_put()) Process entered
00000020:00000040:67.0:1508174041.158661:0:2946:0:(genops.c:1037:class_import_put()) import ffff88083e668000 refcount=1 obd=soaked-MDT0002-mdc-ffff88105dcb8800
00000020:00000001:67.0:1508174041.158663:0:2946:0:(genops.c:1046:class_import_put()) Process leaving
00010000:00000001:67.0:1508174041.158664:0:2946:0:(ldlm_lib.c:61:import_set_conn()) Process entered
00000100:00000200:67.0:1508174041.158666:0:2946:0:(events.c:542:ptlrpc_uuid_to_peer()) 192.168.1.110@o2ib->12345-192.168.1.110@o2ib
00000100:00000001:67.0:1508174041.158667:0:2946:0:(connection.c:48:ptlrpc_connection_get()) Process entered
00000400:00000010:67.0:1508174041.158671:0:2946:0:(peer.c:239:lnet_peer_alloc()) alloc '(lp)': 184 at ffff88083e75c900 (tot 21627036).
00000400:00000200:67.0:1508174041.158673:0:2946:0:(peer.c:261:lnet_peer_alloc()) ffff88083e75c900 nid 192.168.1.110@o2ib
00000400:00000010:67.0:1508174041.158675:0:2946:0:(peer.c:204:lnet_peer_net_alloc()) alloc '(lpn)': 48 at ffff88105b155080 (tot 21627084).
00000400:00000200:67.0:1508174041.158676:0:2946:0:(peer.c:212:lnet_peer_net_alloc()) ffff88105b155080 net o2ib
00000400:00000010:67.0:1508174041.158680:0:2946:0:(peer.c:154:lnet_peer_ni_alloc()) alloc '(lpni)': 336 at ffff88085d7a8200 (tot 21627420).
00000400:00000200:67.0:1508174041.158682:0:2946:0:(peer.c:194:lnet_peer_ni_alloc()) ffff88085d7a8200 nid 192.168.1.110@o2ib
00000400:00000200:67.0:1508174041.158685:0:2946:0:(peer.c:1186:lnet_peer_attach_peer_ni()) peer 192.168.1.110@o2ib NID 192.168.1.110@o2ib flags 0x0
00000400:00000200:67.0:1508174041.158691:0:2946:0:(peer.c:1751:lnet_peer_queue_for_discovery()) Queue peer 192.168.1.110@o2ib: 0
00000400:00000200:68.0:1508174041.158724:0:2519:0:(peer.c:3030:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:68.0:1508174041.158728:0:2519:0:(peer.c:3136:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x20
00000400:00000010:68.0:1508174041.158730:0:2519:0:(api-ni.c:1076:lnet_ping_buffer_alloc()) alloc '(pbuf)': 280 at ffff88105c1e3600 (tot 21627700).
00000400:00000010:68.0:1508174041.158732:0:2519:0:(lib-lnet.h:228:lnet_md_alloc()) slab-alloced 'md' of size 128 at ffff8810537faf00.
00000400:00000010:68.0:1508174041.158736:0:2519:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc '(msg)': 416 at ffff88105c1e3400 (tot 21628116).
00000400:00000200:68.0:1508174041.158738:0:2519:0:(lib-move.c:3187:LNetGet()) LNetGet -> 12345-192.168.1.110@o2ib
00000400:00000200:68.0:1508174041.158742:0:2519:0:(peer.c:856:lnet_peer_ni_set_non_mr_pref_nid()) peer 192.168.1.110@o2ib nid 192.168.1.140@o2ib: 0
00000400:00000200:68.0:1508174041.158746:0:2519:0:(lib-move.c:2114:lnet_select_pathway()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib:<?>) -> 192.168.1.110@o2ib(192.168.1.110@o2ib:192.168.1.110@o2ib) : GET
00000800:00000200:68.0:1508174041.158749:0:2519:0:(o2iblnd_cb.c:1511:kiblnd_send()) sending 0 bytes in 0 frags to 12345-192.168.1.110@o2ib
00000800:00000010:68.0:1508174041.158757:0:2519:0:(o2iblnd.c:326:kiblnd_create_peer()) alloc '(peer_ni)': 112 at ffff88085c21d900 (tot 21628228).
00000800:00000200:68.0:1508174041.158759:0:2519:0:(o2iblnd_cb.c:1479:kiblnd_launch_tx()) peer_ni[ffff88085c21d900] -> 192.168.1.110@o2ib (1)++
00000800:00000200:68.0:1508174041.158763:0:2519:0:(o2iblnd_cb.c:1281:kiblnd_connect_peer()) peer_ni[ffff88085c21d900] -> 192.168.1.110@o2ib (2)++
00000800:00000200:68.0:1508174041.158782:0:2519:0:(o2iblnd_cb.c:1234:kiblnd_resolve_addr()) bound to port 1023
00000800:00000200:68.0:1508174041.158783:0:2519:0:(o2iblnd_cb.c:1302:kiblnd_connect_peer()) 192.168.1.110@o2ib: connection bound to ib0:192.168.1.140:mlx4_0
00000800:00000200:68.0:1508174041.158785:0:2519:0:(o2iblnd_cb.c:1486:kiblnd_launch_tx()) peer_ni[ffff88085c21d900] -> 192.168.1.110@o2ib (3)--
00000400:00000200:68.0:1508174041.158787:0:2519:0:(peer.c:2763:lnet_peer_send_ping()) peer 192.168.1.110@o2ib
00000400:00000200:68.0:1508174041.158788:0:2519:0:(peer.c:3157:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x120 rc 0
00000800:00000200:15.0:1508174041.159054:0:691:0:(o2iblnd_cb.c:2991:kiblnd_cm_callback()) 192.168.1.110@o2ib Addr resolved: 0
00000800:00000200:14.0:1508174041.159155:0:6:0:(o2iblnd_cb.c:3021:kiblnd_cm_callback()) 192.168.1.110@o2ib Route resolved: 0
00000800:00000010:14.0:1508174041.159164:0:6:0:(o2iblnd.c:777:kiblnd_create_conn()) alloc '(init_qp_attr)': 96 at ffff8808592e7ba0 (tot 21628324).
00000800:00000010:14.0:1508174041.159179:0:6:0:(o2iblnd.c:784:kiblnd_create_conn()) alloc '(conn)': 280 at ffff8800748ee000 (tot 21628604).
00000800:00000010:14.0:1508174041.159188:0:6:0:(o2iblnd.c:808:kiblnd_create_conn()) alloc '(conn->ibc_connvars)': 136 at ffff8808467e26c0 (tot 21628740).
00000800:00000010:14.0:1508174041.159204:0:6:0:(o2iblnd.c:843:kiblnd_create_conn()) alloc '(conn->ibc_rxs)': 1872 at ffff88085c4f2000 (tot 21630612).
00000800:00000010:14.0:1508174041.159206:0:6:0:(o2iblnd.c:1242:kiblnd_alloc_pages()) alloc '(p)': 152 at ffff8808467e2480 (tot 21630764).
00000800:00000200:14.0:1508174041.159214:0:6:0:(o2iblnd.c:1316:kiblnd_map_rx_descs()) rx 0: ffff880841b63000 0x841b63000(0x841b63000)
00000800:00000200:14.0:1508174041.159217:0:6:0:(o2iblnd.c:1316:kiblnd_map_rx_descs()) rx 1: ffff880859708000 0x859708000(0x859708000)
……………………
00000800:00000200:14.0:1508174041.159237:0:6:0:(o2iblnd.c:1316:kiblnd_map_rx_descs()) rx 16: ffff88083a615000 0x83a615000(0x83a615000)
00000800:00000200:14.0:1508174041.159238:0:6:0:(o2iblnd.c:1316:kiblnd_map_rx_descs()) rx 17: ffff88083b2d8000 0x83b2d8000(0x83b2d8000)
00000800:00000200:14.0:1508174041.160476:0:6:0:(o2iblnd.c:915:kiblnd_create_conn()) original send wr 2056, created with 4079
00000800:00000010:14.0:1508174041.160478:0:6:0:(o2iblnd.c:917:kiblnd_create_conn()) kfreed 'init_qp_attr': 96 at ffff8808592e7ba0 (tot 21630668).
00000800:00000200:14.0:1508174041.160480:0:6:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (19)++
00000800:00000200:14.0:1508174041.160482:0:6:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (20)--
…………………..
00000800:00000200:14.0:1508174041.160513:0:6:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (19)++
00000800:00000200:14.0:1508174041.160514:0:6:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (20)--
00000800:00000200:15.0:1508174041.162919:0:691:0:(o2iblnd_cb.c:3088:kiblnd_cm_callback()) ESTABLISHED(active): 192.168.1.110@o2ib
00000800:00000200:15.0:1508174041.162923:0:691:0:(o2iblnd_cb.c:2123:kiblnd_connreq_done()) 192.168.1.110@o2ib: active(1), version(12), status(0)
00000800:00000010:15.0:1508174041.162925:0:691:0:(o2iblnd_cb.c:2131:kiblnd_connreq_done()) kfreed 'conn->ibc_connvars': 136 at ffff8808467e26c0 (tot 21630532).
00000800:00000200:15.0:1508174041.162941:0:691:0:(o2iblnd_cb.c:2150:kiblnd_connreq_done()) conn[ffff8800748ee000] (19)++
00000800:00000200:15.0:1508174041.162944:0:691:0:(o2iblnd_cb.c:2192:kiblnd_connreq_done()) conn[ffff8800748ee000] (20)++
00000800:00000200:15.0:1508174041.162945:0:691:0:(o2iblnd_cb.c:1162:kiblnd_queue_tx_locked()) conn[ffff8800748ee000] (21)++
00000800:00000200:15.0:1508174041.162950:0:691:0:(o2iblnd_cb.c:2212:kiblnd_connreq_done()) conn[ffff8800748ee000] (22)--
00000800:00000200:20.2F:1508174041.162997:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0F:1508174041.163022:0:2510:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163028:0:2510:0:(o2iblnd_cb.c:74:kiblnd_tx_done()) conn[ffff8800748ee000] (23)--
00000400:00000200:20.0:1508174041.163032:0:2510:0:(peer.c:2326:lnet_discovery_event_handler()) Received event: 5
00000400:00000200:20.0:1508174041.163036:0:2510:0:(peer.c:2278:lnet_discovery_event_send()) Ping Send to 192.168.1.110@o2ib: 0
00000400:00000010:20.0:1508174041.163040:0:2510:0:(lib-lnet.h:489:lnet_msg_free()) kfreed 'msg': 416 at ffff88105c1e3400 (tot 21630116).
00000800:00000200:20.0:1508174041.163043:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
00000800:00000200:20.0:1508174041.163045:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (21)--
00000800:00000200:20.2:1508174041.163169:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (20)++
00000800:00000200:20.0:1508174041.163176:0:2511:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163182:0:2511:0:(o2iblnd_cb.c:326:kiblnd_handle_rx()) Received d1[1] from 192.168.1.110@o2ib
00000400:00000200:20.0:1508174041.163187:0:2511:0:(lib-move.c:2599:lnet_parse()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib) <- 192.168.1.110@o2ib : REPLY - for me
00000400:00000010:20.0:1508174041.163191:0:2511:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc '(msg)': 416 at ffff88085cab6400 (tot 21630532).
00000400:00000200:20.0:1508174041.163198:0:2511:0:(lib-move.c:2366:lnet_parse_reply()) 192.168.1.140@o2ib: Reply from 12345-192.168.1.110@o2ib of length 48/48 into md 0xad
00000400:00000200:20.0:1508174041.163202:0:2511:0:(peer.c:2326:lnet_discovery_event_handler()) Received event: 3
00000400:00000200:20.0:1508174041.163205:0:2511:0:(peer.c:881:lnet_peer_ni_clr_non_mr_pref_nid()) peer 192.168.1.110@o2ib: 0
00000400:00000200:20.0:1508174041.163209:0:2511:0:(peer.c:2237:lnet_discovery_event_reply()) peer 192.168.1.110@o2ib data present 1
00000400:00000200:20.0:1508174041.163214:0:2511:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff8810537faf00
00000400:00000010:20.0:1508174041.163216:0:2511:0:(lib-lnet.h:259:lnet_md_free()) slab-freed 'md' at ffff8810537faf00.
00000800:00000200:47.0:1508174041.163218:0:2508:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
00000400:00000010:20.0:1508174041.163219:0:2511:0:(lib-lnet.h:489:lnet_msg_free()) kfreed 'msg': 416 at ffff88085cab6400 (tot 21630116).
00000800:00000200:20.0:1508174041.163221:0:2511:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163224:0:2511:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (22)--
00000800:00000200:20.0:1508174041.163225:0:2511:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (21)--
00000400:00000200:68.0:1508174041.163252:0:2519:0:(peer.c:3030:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:68.0:1508174041.163257:0:2519:0:(peer.c:3136:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x61
00000400:00000010:68.0:1508174041.163259:0:2519:0:(peer.c:2399:lnet_peer_merge_data()) alloc '(curnis)': 16 at ffff88105a776630 (tot 21630132).
00000400:00000010:68.0:1508174041.163261:0:2519:0:(peer.c:2400:lnet_peer_merge_data()) alloc '(addnis)': 16 at ffff88105a776640 (tot 21630148).
00000400:00000010:68.0:1508174041.163262:0:2519:0:(peer.c:2401:lnet_peer_merge_data()) alloc '(delnis)': 16 at ffff88105a776650 (tot 21630164).
00000400:00000010:68.0:1508174041.163263:0:2519:0:(peer.c:2470:lnet_peer_merge_data()) kfreed 'curnis': 16 at ffff88105a776630 (tot 21630148).
00000400:00000010:68.0:1508174041.163264:0:2519:0:(peer.c:2471:lnet_peer_merge_data()) kfreed 'addnis': 16 at ffff88105a776640 (tot 21630132).
00000400:00000010:68.0:1508174041.163265:0:2519:0:(peer.c:2472:lnet_peer_merge_data()) kfreed 'delnis': 16 at ffff88105a776650 (tot 21630116).
00000400:00000010:68.0:1508174041.163266:0:2519:0:(api-ni.c:1089:lnet_ping_buffer_free()) kfreed 'pbuf': 280 at ffff88105c1e3600 (tot 21629836).
00000400:00000200:68.0:1508174041.163268:0:2519:0:(peer.c:2474:lnet_peer_merge_data()) peer 192.168.1.110@o2ib: 0
00000400:00000200:68.0:1508174041.163269:0:2519:0:(peer.c:2634:lnet_peer_data_present()) peer 192.168.1.110@o2ib: 0
00000400:00000200:68.0:1508174041.163271:0:2519:0:(peer.c:3157:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0xa1 rc 1
00000400:00000200:68.0:1508174041.163274:0:2519:0:(peer.c:3136:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0xa1
00000400:00000010:68.0:1508174041.163275:0:2519:0:(lib-lnet.h:228:lnet_md_alloc()) slab-alloced 'md' of size 128 at ffff88085c21d500.
00000400:00000010:68.0:1508174041.163279:0:2519:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc '(msg)': 416 at ffff88105c1e3600 (tot 21630252).
00000400:00000200:68.0:1508174041.163281:0:2519:0:(lib-move.c:2984:LNetPut()) LNetPut -> 12345-192.168.1.110@o2ib
00000400:00000200:68.0:1508174041.163288:0:2519:0:(lib-move.c:2114:lnet_select_pathway()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib:<?>) -> 192.168.1.110@o2ib(192.168.1.110@o2ib:192.168.1.110@o2ib) : PUT
00000800:00000200:68.0:1508174041.163291:0:2519:0:(o2iblnd_cb.c:1511:kiblnd_send()) sending 48 bytes in 1 frags to 12345-192.168.1.110@o2ib
00000800:00000200:68.0:1508174041.163296:0:2519:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff88085c21d900] -> 192.168.1.110@o2ib (2) version: 12
00000800:00000200:68.0:1508174041.163298:0:2519:0:(o2iblnd_cb.c:1392:kiblnd_launch_tx()) conn[ffff8800748ee000] (20)++
00000800:00000200:68.0:1508174041.163299:0:2519:0:(o2iblnd_cb.c:1162:kiblnd_queue_tx_locked()) conn[ffff8800748ee000] (21)++
00000800:00000200:68.0:1508174041.163303:0:2519:0:(o2iblnd_cb.c:1398:kiblnd_launch_tx()) conn[ffff8800748ee000] (22)--
00000400:00000200:68.0:1508174041.163304:0:2519:0:(peer.c:2869:lnet_peer_send_push()) peer 192.168.1.110@o2ib
00000400:00000200:68.0:1508174041.163305:0:2519:0:(peer.c:3157:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x2a1 rc 0
00000800:00000200:20.2:1508174041.163309:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163315:0:2510:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163318:0:2510:0:(o2iblnd_cb.c:74:kiblnd_tx_done()) conn[ffff8800748ee000] (23)--
00000400:00000200:20.0:1508174041.163320:0:2510:0:(peer.c:2326:lnet_discovery_event_handler()) Received event: 5
00000400:00000200:20.0:1508174041.163323:0:2510:0:(peer.c:2278:lnet_discovery_event_send()) Push Send to 192.168.1.110@o2ib: 0
00000400:00000010:20.0:1508174041.163326:0:2510:0:(lib-lnet.h:489:lnet_msg_free()) kfreed 'msg': 416 at ffff88105c1e3600 (tot 21629836).
00000800:00000200:20.0:1508174041.163328:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
00000800:00000200:20.0:1508174041.163330:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (21)--
00000800:00000200:20.2:1508174041.163475:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (20)++
00000800:00000200:20.0:1508174041.163481:0:2508:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163485:0:2508:0:(o2iblnd_cb.c:326:kiblnd_handle_rx()) Received d1[0] from 192.168.1.110@o2ib
00000400:00000200:20.0:1508174041.163489:0:2508:0:(lib-move.c:2599:lnet_parse()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib) <- 192.168.1.110@o2ib : ACK - for me
00000400:00000010:20.0:1508174041.163491:0:2508:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc '(msg)': 416 at ffff88085cab6400 (tot 21630252).
00000400:00000200:20.0:1508174041.163496:0:2508:0:(lib-move.c:2419:lnet_parse_ack()) 192.168.1.140@o2ib: ACK from 12345-192.168.1.110@o2ib into md 0xb5
00000400:00000200:20.0:1508174041.163498:0:2508:0:(peer.c:2326:lnet_discovery_event_handler()) Received event: 4
00000400:00000200:20.0:1508174041.163500:0:2508:0:(peer.c:2111:lnet_discovery_event_ack()) peer 192.168.1.110@o2ib ev->status 0
00000400:00000200:20.0:1508174041.163505:0:2508:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff88085c21d500
00000400:00000010:20.0:1508174041.163506:0:2508:0:(lib-lnet.h:259:lnet_md_free()) slab-freed 'md' at ffff88085c21d500.
00000400:00000010:20.0:1508174041.163508:0:2508:0:(lib-lnet.h:489:lnet_msg_free()) kfreed 'msg': 416 at ffff88085cab6400 (tot 21629836).
00000800:00000200:20.0:1508174041.163510:0:2508:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163512:0:2508:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (23)--
00000800:00000200:20.0:1508174041.163513:0:2508:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
00000800:00000200:20.0:1508174041.163515:0:2508:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (21)--
00000800:00000200:20.2:1508174041.163551:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (20)++
00000800:00000200:20.0:1508174041.163556:0:2510:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163560:0:2510:0:(o2iblnd_cb.c:326:kiblnd_handle_rx()) Received d1[1] from 192.168.1.110@o2ib
00000400:00000200:20.0:1508174041.163563:0:2510:0:(lib-move.c:2599:lnet_parse()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib) <- 192.168.1.110@o2ib : PUT - for me
00000400:00000010:20.0:1508174041.163565:0:2510:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc '(msg)': 416 at ffff88085cab6400 (tot 21630252).
00000800:00000200:4.0:1508174041.163565:0:2509:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
00000400:00000200:20.0:1508174041.163569:0:2510:0:(lib-ptl.c:571:lnet_ptl_match_md()) Request from 12345-192.168.1.110@o2ib of length 48 into portal 0 MB=0x8000000000000000
00000400:00000200:20.0:1508174041.163575:0:2510:0:(lib-ptl.c:200:lnet_try_match_md()) Incoming put index 0 from 12345-192.168.1.110@o2ib of length 48/48 into md 0x9 [1] + 0
00000400:00000200:68.0:1508174041.163577:0:2519:0:(peer.c:3030:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:20.0:1508174041.163581:0:2510:0:(peer.c:1939:lnet_peer_push_event()) Stale Push from 192.168.1.110@o2ib: got 1 have 1
00000400:00000200:68.0:1508174041.163585:0:2519:0:(peer.c:3136:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0xa1
00000400:00000200:68.0:1508174041.163588:0:2519:0:(peer.c:2917:lnet_peer_discovered()) peer 192.168.1.110@o2ib
00000400:00000200:68.0:1508174041.163589:0:2519:0:(peer.c:3157:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x89 rc 0
00000400:00000200:20.0:1508174041.163592:0:2510:0:(peer.c:1751:lnet_peer_queue_for_discovery()) Queue peer 192.168.1.110@o2ib: -114
00000400:00000200:20.0:1508174041.163603:0:2510:0:(lib-move.c:2114:lnet_select_pathway()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib:192.168.1.140@o2ib) -> 192.168.1.110@o2ib(192.168.1.110@o2ib:192.168.1.110@o2ib) : ACK
00000800:00000200:20.0:1508174041.163608:0:2510:0:(o2iblnd_cb.c:1511:kiblnd_send()) sending 0 bytes in 1 frags to 12345-192.168.1.110@o2ib
00000800:00000200:20.0:1508174041.163613:0:2510:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff88085c21d900] -> 192.168.1.110@o2ib (2) version: 12
00000800:00000200:20.0:1508174041.163616:0:2510:0:(o2iblnd_cb.c:1392:kiblnd_launch_tx()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163618:0:2510:0:(o2iblnd_cb.c:1162:kiblnd_queue_tx_locked()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163621:0:2510:0:(o2iblnd_cb.c:1398:kiblnd_launch_tx()) conn[ffff8800748ee000] (23)--
00000800:00000200:20.0:1508174041.163622:0:2510:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.2:1508174041.163625:0:2510:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163630:0:2510:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (24)--
00000800:00000200:20.0:1508174041.163631:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (23)--
00000800:00000200:20.0:1508174041.163633:0:2510:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163638:0:2510:0:(o2iblnd_cb.c:74:kiblnd_tx_done()) conn[ffff8800748ee000] (23)--
00000400:00000010:20.0:1508174041.163640:0:2510:0:(lib-lnet.h:489:lnet_msg_free()) kfreed 'msg': 416 at ffff88085cab6400 (tot 21629836).
00000800:00000200:20.0:1508174041.163641:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--

In this trace, PID 2946 should be the forked LLOG parser thread for the soaked-client config, PID 2519 the LNet discovery thread, and PIDs 2508/2510 ko2iblnd scheduler threads.

It looks like there is a race between the LNet discovery thread's handling of the ACK for the previously sent PUT message to 192.168.1.110@o2ib and the 2510/ko2iblnd scheduler thread handling the subsequent "stale" PUT message received from 192.168.1.110@o2ib (unexpected, or due to Multi-Rail?). This causes the LNET_PEER_DISCOVERING flag to be set again and the peer to be kept in the discovery process/loop, but very likely on the ln_dc_working queue instead of the ln_dc_request one, so it ends up outside the discovery thread's control and would probably only have been timed out much later.
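
To illustrate the idea (a minimal sketch with simplified, hypothetical types and names; this is not the actual lnet/lnet/peer.c code), the fix needs to ensure that a peer which is re-marked as discovering while an event completes is moved back onto the request queue and the discovery thread is woken, so the peer cannot be stranded on the working queue as described above:

/*
 * Illustrative sketch only: the demo_* types are hypothetical stand-ins
 * for the real LNet structures, and PEER_DISCOVERING stands in for
 * LNET_PEER_DISCOVERING.
 */
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

#define PEER_DISCOVERING	0x01

struct demo_peer {
	struct list_head dp_dc_list;	/* linked on either the request or the working list */
	unsigned int	 dp_state;
};

struct demo_discovery {
	spinlock_t		dd_lock;
	struct list_head	dd_request;	/* peers waiting for the discovery thread */
	struct list_head	dd_working;	/* peers with discovery traffic in flight */
	wait_queue_head_t	dd_waitq;
};

/* Called when an event (e.g. an incoming push) re-marks a peer as discovering. */
static void demo_requeue_for_discovery(struct demo_discovery *dd,
				       struct demo_peer *lp)
{
	spin_lock(&dd->dd_lock);
	if (lp->dp_state & PEER_DISCOVERING) {
		/*
		 * Move the peer back to the request queue and wake the
		 * discovery thread; leaving it on the working list would
		 * keep it out of the thread's control until a timeout.
		 */
		list_move_tail(&lp->dp_dc_list, &dd->dd_request);
		wake_up(&dd->dd_waitq);
	}
	spin_unlock(&dd->dd_lock);
}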

Comment by Gerrit Updater [ 17/Nov/17 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/30147
Subject: LU-10123 lnet: ensure peer put back on dc request queue
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3bb495fcca17bdedd89a11cc9d9bf200604af0aa

Comment by Cliff White (Inactive) [ 17/Nov/17 ]

Mount performance is much better with this patch

Comment by Bruno Faccini (Inactive) [ 21/Nov/17 ]

Hmm, it seems that my patch has also introduced a regression and causes this new LBUG on the client side in different auto-test sub-tests/steps (lnet-selftest, lustre-initialization, upon mount, ...):

[ 5914.956548] LNetError: 23347:0:(peer.c:272:lnet_destroy_peer_locked()) ASSERTION( list_empty(&lp->lp_peer_nets) ) failed: 
[ 5914.960593] LNetError: 23347:0:(peer.c:272:lnet_destroy_peer_locked()) LBUG
[ 5914.964063] Pid: 23347, comm: lnet_discovery
[ 5914.967234] 
[ 5914.967234] Call Trace:
[ 5914.973027]  [<ffffffffc07337ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 5914.976440]  [<ffffffffc073383c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 5914.979816]  [<ffffffffc06bab42>] lnet_destroy_peer_locked+0x232/0x340 [lnet]
[ 5914.983278]  [<ffffffffc06bb175>] lnet_peer_discovery_complete+0x295/0x340 [lnet]
[ 5914.986724]  [<ffffffffc06c0c30>] lnet_peer_discovery+0x6f0/0x11b0 [lnet]
[ 5914.990141]  [<ffffffff816a904d>] ? __schedule+0x39d/0x8b0
[ 5914.993348]  [<ffffffff810b1920>] ? autoremove_wake_function+0x0/0x40
[ 5914.996599]  [<ffffffffc06c0540>] ? lnet_peer_discovery+0x0/0x11b0 [lnet]
[ 5914.999836]  [<ffffffff810b099f>] kthread+0xcf/0xe0
[ 5915.002827]  [<ffffffff810b08d0>] ? kthread+0x0/0xe0
[ 5915.005771]  [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
[ 5915.008696]  [<ffffffff810b08d0>] ? kthread+0x0/0xe0
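
To illustrate what the assertion enforces (a minimal sketch with simplified, hypothetical demo_* types; the real check is an LASSERT in lnet_destroy_peer_locked()): by the time the last reference on a peer is dropped and it is destroyed, every peer-net must already have been detached, so lp_peer_nets is expected to be empty. A discovery-completion path that drops the final reference while a peer-net is still linked will trip it.

/*
 * Illustrative sketch only: demo_peer is a hypothetical stand-in for
 * struct lnet_peer, and dp_peer_nets stands in for lp_peer_nets.
 */
#include <linux/bug.h>
#include <linux/list.h>
#include <linux/slab.h>

struct demo_peer {
	struct list_head dp_peer_nets;	/* peer-nets still attached to this peer */
	/* ... NIDs, state, reference count ... */
};

static void demo_destroy_peer(struct demo_peer *lp)
{
	/* All peer-nets must be detached before the peer is freed. */
	BUG_ON(!list_empty(&lp->dp_peer_nets));
	kfree(lp);
}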

Cliff, do you also trigger this?

Comment by Sarah Liu [ 21/Nov/17 ]

Hi Bruno,

Yesterday I tried to create a patch that includes both your patch and the one for LU-10134 for soak. I checked out your patch and rebased it on master, then checked out the LU-10134 patch and rebased it on your patch. The new patch set #2 (which hit the LBUG) should contain no changes other than the rebase onto master.

The test patch that includes both patches also hit the LBUG:
https://review.whamcloud.com/#/c/30187/

Comment by Bruno Faccini (Inactive) [ 04/Dec/17 ]

Sarah, Cliff,
Can you give a try to the more recent patch set #7 of my change #30147?

Comment by Cliff White (Inactive) [ 04/Dec/17 ]

We are doing b2-10 now; we will consider this when we move back to master.

Comment by Cliff White (Inactive) [ 20/Dec/17 ]

We are attempting to test master now; can you rebase your patch on current master?

Comment by Bruno Faccini (Inactive) [ 21/Dec/17 ]

Cliff, I have just rebased my change #30147 on latest master.

Comment by Cliff White (Inactive) [ 04/Jan/18 ]

We still see client mount hangs on the latest master; is this patch in the queue for landing?

Comment by Peter Jones [ 05/Jan/18 ]

It had not been, but I have just added it to the chaser list. If there is something you think is important to get landed for soak, please flag it either directly or via James so we can expedite it.

Comment by Gerrit Updater [ 31/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30147/
Subject: LU-10123 lnet: ensure peer put back on dc request queue
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d0185dd433943506f434eb5cf9707c2b6807b528

Comment by Peter Jones [ 06/Feb/18 ]

Landed for 2.11
