[LU-7808] Service threads hung at ptlrpc_abort_bulk Created: 24/Feb/16  Updated: 28/Aug/18  Resolved: 08/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: DELBARY Gael
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: Text File lustre-log.1456265750.28275.gz     File traceall.gz    
Severity: 4
Rank (Obsolete): 9223372036854775807

 Description   

Server threads are getting hung. Sometimes they clear by themselves, but not always.

<3>LustreError: 26311:0:(ldlm_lib.c:2715:target_bulk_io()) @@@ timeout on bulk PUT after 100+0s  req@ffff881cd7d35800 x1526640544175816/t0(0) o3->d13c09b7-cb83-8238-fde5-d86e0a048f3a@10.151.63.50@o2ib:0/0 lens 488/432 e 1 to 0 dl 1456265702 ref 1 fl Interpret:/0/0 rc 0/0
<3>LustreError: 26311:0:(ldlm_lib.c:2715:target_bulk_io()) Skipped 62 previous similar messages
<6>LNet: 2454:0:(o2iblnd_cb.c:1937:kiblnd_close_conn_locked()) Closing conn to 10.151.27.36@o2ib: error 0(waiting)
<6>LNet: 2454:0:(o2iblnd_cb.c:1937:kiblnd_close_conn_locked()) Skipped 3 previous similar messages
<3>LustreError: 0:0:(ldlm_lockd.c:346:waiting_locks_callback()) ### lock callback timer expired after 225s: evicting client at 10.151.55.224@o2ib  ns: filter-nbp8-OST00e0_UUID lock: ffff880b6d477100/0xc9935e7dda65f272 lrc: 4/0,0 mode: PW/PW res: [0x1fff69f:0x0:0x0].0 rrc: 32 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000000020 nid: 10.151.55.224@o2ib remote: 0x18f2b8c9e47dad96 expref: 4 pid: 2516 timeout: 4704175646 lvb_type: 0
<3>LustreError: 2513:0:(ldlm_lockd.c:664:ldlm_handle_ast_error()) ### client (nid 10.151.55.224@o2ib) returned 0 from completion AST ns: filter-nbp8-OST00e0_UUID lock: ffff880b6d477100/0xc9935e7dda65f272 lrc: 6/0,0 mode: PW/PW res: [0x1fff69f:0x0:0x0].0 rrc: 33 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000000020 nid: 10.151.55.224@o2ib remote: 0x18f2b8c9e47dad96 expref: 4 pid: 2516 timeout: 4704175646 lvb_type: 0
<3>LustreError: 0:0:(ldlm_lockd.c:346:waiting_locks_callback()) ### lock callback timer expired after 225s: evicting client at 10.151.17.216@o2ib  ns: filter-nbp8-OST012e_UUID lock: ffff8805c4f78180/0xc9935e7dda65e30d lrc: 4/0,0 mode: PW/PW res: [0x1ff164b:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000010020 nid: 10.151.17.216@o2ib remote: 0x179b912da7369126 expref: 4 pid: 2472 timeout: 4704191266 lvb_type: 0
<6>LNet: 2454:0:(o2iblnd_cb.c:1937:kiblnd_close_conn_locked()) Closing conn to 10.151.27.24@o2ib: error 0(waiting)
<6>LNet: 2454:0:(o2iblnd_cb.c:1937:kiblnd_close_conn_locked()) Skipped 5 previous similar messages
<4>LNet: Service thread pid 28275 was inactive for 300.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
<4>Pid: 28275, comm: ll_ost_io02_091
<4>
<4>Call Trace:
<4> [<ffffffff810867ec>] ? lock_timer_base+0x3c/0x70
<4> [<ffffffff81566572>] schedule_timeout+0x192/0x2e0
<4> [<ffffffff81086900>] ? process_timeout+0x0/0x10
<4> [<ffffffffa0798c58>] ptlrpc_abort_bulk+0x158/0x2e0 [ptlrpc]
<4> [<ffffffff81064f90>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0765694>] target_bulk_io+0x474/0x980 [ptlrpc]
<4> [<ffffffffa04ce200>] ? cfs_crypto_hash_final+0x30/0x80 [libcfs]
<4> [<ffffffff81064f90>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa150287c>] ost_brw_read+0x103c/0x1350 [ost]
<4> [<ffffffff812b9076>] ? vsnprintf+0x336/0x5e0
<4> [<ffffffffa075e500>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
<4> [<ffffffffa04d47c3>] ? libcfs_debug_vmsg2+0x5d3/0xbd0 [libcfs]
<4> [<ffffffffa079ff8c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
<4> [<ffffffffa07a00e8>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
<4> [<ffffffffa15096d8>] ost_handle+0x24a8/0x44d0 [ost]
<4> [<ffffffffa04d0a44>] ? libcfs_id2str+0x74/0xb0 [libcfs]
<4> [<ffffffffa07ae0c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
<4> [<ffffffffa04d68d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
<4> [<ffffffffa07a6a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
<4> [<ffffffffa07b089d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]
<4> [<ffffffff8100c28a>] child_rip+0xa/0x20
<4> [<ffffffffa07afda0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
<4>

I am attaching the backtrace and Lustre debug dump.



 Comments   
Comment by Peter Jones [ 25/Feb/16 ]

Doug is looking into this one

Comment by Doug Oucharek (Inactive) [ 26/Feb/16 ]

I can see in the logs that the system seems to have closed all the connections, most likely due to timeouts, and then tries to recover. Did a router or switch go down to trigger this?

Is the focus of this ticket, then, that the server threads are not recovering or is it about what took down all the connections in the first place?

Comment by Mahmoud Hanafi [ 01/Mar/16 ]

We had some congestion on the fabric, but nothing that should cause a large number of clients to disconnect. The focus of the ticket is:

1. What was the root cause of the timeouts?
2. Why were the service threads spinning in ptlrpc_abort_bulk?

Comment by Mahmoud Hanafi [ 01/Mar/16 ]

We had another server hit this issue today.

Comment by Mahmoud Hanafi [ 02/Mar/16 ]

From yesterday's crash dump we have ib_cm threads with 'exception RIP: __write_lock_failed+9', like this:

PID: 3288   TASK: ffff882067f1f520  CPU: 1   COMMAND: "ib_cm/1"
 #0 [ffff880028226e90] crash_nmi_callback at ffffffff81032256
 #1 [ffff880028226ea0] notifier_call_chain at ffffffff8156be55
 #2 [ffff880028226ee0] atomic_notifier_call_chain at ffffffff8156beba
 #3 [ffff880028226ef0] notify_die at ffffffff810a44fe
 #4 [ffff880028226f20] do_nmi at ffffffff81569acf
 #5 [ffff880028226f50] nmi at ffffffff81569330
    [exception RIP: run_timer_softirq+169]
    RIP: ffffffff810870d9  RSP: ffff880028223e50  RFLAGS: 00000006
    RAX: ffff8810694b02c0  RBX: ffff8810694b0000  RCX: ffff880028223e10
    RDX: ffff8810694b02c0  RSI: ffff880028223e10  RDI: ffff8810694b0000
    RBP: ffff880028223ed0   R8: ffff8810694b1450   R9: 0000000000000000
    R10: 0000000000010400  R11: 0000000000000001  R12: 00000000000002a0
    R13: ffff880028223e90  R14: ffff88104b004020  R15: ffffffff814ba340
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff880028223e50] run_timer_softirq at ffffffff810870d9
 #7 [ffff880028223ed8] __do_softirq at ffffffff8107cc4b
 #8 [ffff880028223f48] call_softirq at ffffffff8100c38c
 #9 [ffff880028223f60] do_softirq at ffffffff8100fca5
#10 [ffff880028223f80] irq_exit at ffffffff8107cb05
#11 [ffff880028223f90] smp_apic_timer_interrupt at ffffffff8156e76a
#12 [ffff880028223fb0] apic_timer_interrupt at ffffffff8100bc13
--- <IRQ stack> ---
#13 [ffff88204e0976e8] apic_timer_interrupt at ffffffff8100bc13
    [exception RIP: __write_lock_failed+9]
    RIP: ffffffff812bb489  RSP: ffff88204e097798  RFLAGS: 00000287
    RAX: ffff880917c9e640  RBX: ffff88204e0977a0  RCX: ffff880917c9e670
    RDX: ffffffffa00a7727  RSI: ffff880c08a271c0  RDI: ffffffff81a653f0
    RBP: ffffffff8100bc0e   R8: 00000000ffffffff   R9: 0000000000000000
    R10: 0000000000000068  R11: 000000000000000c  R12: 0000000000000001
    R13: ffffffff811583ea  R14: ffff88204e097710  R15: ffffc90000000000
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#14 [ffff88204e097798] _write_lock at ffffffff81568bc7
#15 [ffff88204e0977a8] __get_vm_area_node at ffffffff8115a5f8
#16 [ffff88204e097808] get_vm_area_caller at ffffffff8115a6e0
#17 [ffff88204e097828] vmap at ffffffff8115a7c4
#18 [ffff88204e097868] mlx4_buf_alloc at ffffffffa00a7727 [mlx4_core]
#19 [ffff88204e0978f8] mlx4_ib_alloc_cq_buf at ffffffffa01c09f9 [mlx4_ib]
#20 [ffff88204e097938] mlx4_ib_create_cq at ffffffffa01c183f [mlx4_ib]
#21 [ffff88204e0979f8] ib_create_cq at ffffffffa017cb06 [ib_core]
#22 [ffff88204e097a38] kiblnd_create_conn at ffffffffa096919b [ko2iblnd]
#23 [ffff88204e097ab8] kiblnd_passive_connect at ffffffffa0975688 [ko2iblnd]
#24 [ffff88204e097bb8] kiblnd_cm_callback at ffffffffa0976d4d [ko2iblnd]
#25 [ffff88204e097be8] cma_req_handler at ffffffffa03793b8 [rdma_cm]
#26 [ffff88204e097cf8] cm_process_work at ffffffffa029eab7 [ib_cm]
#27 [ffff88204e097d48] cm_req_handler at ffffffffa029f965 [ib_cm]
#28 [ffff88204e097de8] cm_work_handler at ffffffffa02a0315 [ib_cm]
#29 [ffff88204e097e38] worker_thread at ffffffff81097600
#30 [ffff88204e097ee8] kthread at ffffffff8109dc8e
#31 [ffff88204e097f48] kernel_thread at ffffffff8100c28a

Comment by Doug Oucharek (Inactive) [ 09/Mar/16 ]

With regards to the spinning service threads, does this system have the fixes for LU-7054 and LU-7569 applied?

Comment by Jay Lan (Inactive) [ 09/Mar/16 ]

Yes, we carry the fixes for LU-7054 and LU-7569.

Although 'git log' does not show LU-7569,
LU-7676 o2iblnd: avoid intensive reconnecting
is a 2.5.3 back port of LU-7569.

Comment by Doug Oucharek (Inactive) [ 11/Mar/16 ]

I'm wondering if problem 2 is addressed by LU-7434.

Comment by Mahmoud Hanafi [ 06/Apr/16 ]

We have hit this issue several times this week. It always starts with a large number of clients timing out and getting disconnected from one or more servers. We have several filesystems, and this is occurring on only one of them.

We really need to understand the root cause of the client timeouts.

Comment by Doug Oucharek (Inactive) [ 07/Apr/16 ]

Is there an LNet router between the clients and the servers? Somewhere, there has to be a single point of failure which is triggering the clients to time out and be disconnected. The logs don't give any indication of Lustre having any problems, except that communication just gets cut off.

Comment by Mahmoud Hanafi [ 07/Apr/16 ]

There are routers, but they go to the remote cluster. There are about 11K clients directly attached to the servers and about 2K clients connected through the routers.

The only thing I have found is that there is usually a spike in RPCs (ping or read/write). Is it possible that a large number of RPCs could block pings from clients?

Comment by Doug Oucharek (Inactive) [ 07/Apr/16 ]

If the Mellanox switch has congestion management deployed (I'm not sure how they do this sort of thing), that could cause traffic loss during spikes. Lustre is not very tolerant of traffic loss (we assume a reliable connection, RC). Thus you can get a flurry of timeouts in this situation.

Something which could help smooth out the spikes is to lower the credits value on the clients. This module parameter limits the number of outstanding messages (RPCs) a node can have on any given network. When a node uses up all its credits, it has to start queuing until credits become available. In the case of a spike, this can smooth things out. You don't need to change the peer_credits value, just credits.

If lowering the credits value and smoothing out the spikes eliminates the timeouts, then I would deduce that some form of traffic shaping (QoS) is going on in the switches.
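
For illustration, a minimal sketch of what lowering the client-side credits could look like in the clients' modprobe configuration (the value 32 is a placeholder for this example, not a recommendation from this ticket; peer_credits is deliberately left out because it does not need to change):

# client-side sketch: lower only the per-network credits (placeholder value)
options ko2iblnd credits=32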

Comment by Mahmoud Hanafi [ 07/Apr/16 ]

Our current at_min=150; what about increasing it?

If we run out of service threads, is that logged? Currently our threads_max=384 and we are running at the max. All our servers have 128GB of memory, so we have the memory to run more.

Comment by Doug Oucharek (Inactive) [ 07/Apr/16 ]

Thread max can be as high as 512. You could try that. When we run out of service threads, RPCs do get queued. But if things get bad enough, evictions could be possible. There is a ticket, https://jira.hpdd.intel.com/browse/LU-7434, which reports freezes with bulk operations. I'm wondering if that is related to what you are seeing.
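
As a hedged example, raising the OSS I/O thread limit to the 512 mentioned above would normally look something like the following (a sketch only; it assumes the standard ost.OSS.ost_io tunables are present on this setup):

# on each OSS: raise the I/O service thread cap and check how many threads are running
lctl set_param ost.OSS.ost_io.threads_max=512
lctl get_param ost.OSS.ost_io.threads_started ost.OSS.ost_io.threads_max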

Comment by Jay Lan (Inactive) [ 07/Apr/16 ]

There were two patches submitted in LU-7434. Seven out of eight files changed in the first patch caused conflicts against b2_5_fe. We are willing to try the patches, but is it possible for you to provide a back port to b2_5_fe?

Thanks!

Comment by Mahmoud Hanafi [ 07/Apr/16 ]

This issue always starts with a timeout on bulk PUT.

Comment by Doug Oucharek (Inactive) [ 08/Apr/16 ]

The patches of LU-7434 have not landed to master yet. I will accelerate their code reviews to make sure they are acceptable. Then we can port them to b2_5_fe.

Comment by Mahmoud Hanafi [ 13/Apr/16 ]

How can we verify whether this is the same issue as LU-7434? What level of debugging do we need for that confirmation? By the way, I enabled +net debugging; that changed the timing, and we haven't seen the issue since the change.
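
For reference, a sketch of how the +net debug mask mentioned above is typically enabled and checked (assuming the usual lctl debug interface):

# add 'net' to the active debug mask, then confirm it
lctl set_param debug=+net
lctl get_param debug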

Comment by Niu Yawei (Inactive) [ 14/Apr/16 ]

This is highly likely to be LU-6441 if 4MB RPC is enabled. Could you check whether 4MB RPC is enabled on your system? Thanks.

Comment by Mahmoud Hanafi [ 14/Apr/16 ]

4MB RPC is NOT enabled.
max_pages_per_rpc=256 across all nodes.
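
A quick cross-check for that, as a sketch assuming the standard osc tunable on the clients (256 pages at 4 KiB is a 1 MiB RPC; a value of 1024 would mean 4 MiB RPCs are enabled):

# on a client: 256 => 1MiB RPCs, 1024 => 4MiB RPCs
lctl get_param osc.*.max_pages_per_rpc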

Comment by Doug Oucharek (Inactive) [ 22/Apr/16 ]

There is a debug patch at: http://review.whamcloud.com/#/c/19643/. It does not attempt to address this issue, but rather just collects useful information to verify the theory that the freeze is happening when the MD unlink function is called in LNet.

If the problem happens with this patch, please report back with the contents of /proc/sys/lnet/stats and /proc/sys/lnet/peers for both the server and one of the clients. The logs should also indicate whether an MD unlink is taking more than 80 seconds; that would point at the MD unlink process as the cause of the freezes.
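
A minimal collection sketch for that data, using the /proc paths named above (the output file names are just placeholders):

# run on the affected server and on one client when the problem reproduces
cat /proc/sys/lnet/stats > /tmp/lnet-stats.$(hostname)
cat /proc/sys/lnet/peers > /tmp/lnet-peers.$(hostname)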

Comment by Jay Lan (Inactive) [ 27/Apr/16 ]

Hi Doug,

Could you kindly provide a b2_7_fe port of the debug patch you mentioned above? We discussed this patch in our system meeting yesterday and decided we need to run the debug patch on both b2_5_fe and b2_7_fe. Thanks in advance!

Comment by Doug Oucharek (Inactive) [ 02/May/16 ]

b2_7_fe port is at: http://review.whamcloud.com/#/c/19916/

Comment by Mahmoud Hanafi [ 12/May/16 ]

I think we need a review of our ko2iblnd settings. This is what we are running with:

#ko2ib
options ko2iblnd require_privileged_port=0 use_privileged_port=0
options ko2iblnd ntx=125536 credits=125536 fmr_pool_size=31385 pmr_pool_size=31385
options ko2iblnd timeout=150 retry_count=7 peer_timeout=0 map_on_demand=2 peer_credits=63 concurrent_sends=63 credits=64
#lnet
options lnet networks=o2ib(ib1) routes="o2ib233 10.151.26.[80,81,82,83,84,85,86,87,88,89,90,91,92,93,94]@o2ib" dead_router_check_interval=60 live_router_check_interval=60
options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=16384 large_router_buffers=1024
## ptlrpc parameters
options ptlrpc at_max=700 at_min=150

Can you recommend any changes? And with Lustre 2.7.1 and LU-3322, should these values change?

We have about 10K clients on the main fabric and 3K clients via routers.

Comment by Doug Oucharek (Inactive) [ 12/May/16 ]

The only value I question is using a map_on_demand of 2. This means that only 2 FMR fragments are available. With Mellanox, we have found there is little difference between having FMR turned off (map_on_demand of 0, which is the default) and having it on. Where map_on_demand does matter is when the latency of the network gets high (e.g. WAN conditions) or you are using an Intel fabric (TrueScale or OPA).

The initial value of map_on_demand we use in testing is 32. I've never tried something low like 2.

As for LU-3322, this won't affect you if you go with map_on_demand of 0.

Comment by Doug Oucharek (Inactive) [ 12/May/16 ]

Does someone there know the history of why map_on_demand is 2? Before recommending it be set to 0, I need to see if there is a good reason for it being 2.

Comment by Mahmoud Hanafi [ 13/May/16 ]

We had found that pool allocation can be very slow, so the idea was that FMR could make it faster.

This was also the reason we picked a large ntx value to start with.

By the way, with our large ntx values, module load takes more than 5 min (see https://jira.hpdd.intel.com/browse/LU-7224).

Comment by Mahmoud Hanafi [ 18/May/16 ]

Still waiting to get a recommendation on tuning values.

Comment by Doug Oucharek (Inactive) [ 18/May/16 ]

map_on_demand should be either 0 or 32. If you feel that FMR is really needed, then go with 32. However, note that Mellanox has deprecated FMR and replaced it with a new Fast Memory Registration scheme. We added support for this, but the patch is new and is working its way into builds now. If you go with map_on_demand of 0, you are immune to this FMR change. If you go with map_on_demand of 32, you could run into issues if some of your IB cards are refreshed to use the newer mlx5 driver/firmware.

Overall, I'd recommend going with map_on_demand of 0 and measuring the difference.
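
As a sketch of the two configurations being weighed here (the values come from the recommendation above; the rest of the ko2iblnd options would stay as already configured):

# option A: FMR effectively unused (default behaviour)
options ko2iblnd map_on_demand=0
# option B: FMR in use, per the recommendation above
options ko2iblnd map_on_demand=32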

Comment by Mahmoud Hanafi [ 19/May/16 ]

Doing timing tests on just loading the module, enabling FMR cut the time in half, from 6 min to 3 min.

Comment by Doug Oucharek (Inactive) [ 19/May/16 ]

Ah, then I would recommend going with map_on_demand of 32.

Comment by Mahmoud Hanafi [ 08/Sep/16 ]

Please close this case. No longer an issue.

Comment by Peter Jones [ 08/Sep/16 ]

ok - thanks Mahmoud
