[LU-11644] LNet: Service thread inactive for 300 causes client evictions Created: 08/Nov/18  Updated: 14/May/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.5, Lustre 2.12.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 1
Labels: None

Attachments: PDF File client_evictions_charts.pdf     File eviction_s611.06.05.19     PDF File lnet_metrics_during_eviction.pdf     File nasa_lu11644.patch     File s214_bt.20200108.18.21.23     File s618.out     File zero.io.top.20210130.19.31.37    
Issue Links:
Related
is related to LU-11768 sanity-quota test 6 fails with ‘LNet:... Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

After updating to 2.10.5 we are now seeing periods of mass evictions from the servers. On the server we see the following stack trace:

Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033253] Pid: 11080, comm: ll_ost01_220 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033260] Call Trace:
Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033274]  [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038258]  [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038285]  [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038294]  [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038318]  [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038346]  [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038390]  [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038425]  [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038455]  [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038484]  [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038489]  [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038492]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038512]  [<ffffffffffffffff>] 0xffffffffffffffff
Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038515] LustreError: dumping log to /tmp/lustre-log.1541619192.11080
Nov  7 11:33:14 nbp8-oss7 kernel: [531467.254898] LNet: Service thread pid 9724 was inactive for 303.19s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310852] Pid: 9724, comm: ll_ost01_019 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310854] Call Trace:
Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310866]  [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332869]  [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332902]  [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332912]  [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332936]  [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov  7 11:33:15 nbp8-oss7 kernel: [531467.332988]  [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333032]  [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333067]  [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333099]  [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333128]  [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333134]  [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333137]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333158]  [<ffffffffffffffff>] 0xffffffffffffffff

Will upload to ftp:/uploads/LU11613/lustre-log.1541619192.11080

We didn't have rpctrace or dlmtrace enabled, so it may not be very useful.

Could be related to https://jira.whamcloud.com/browse/LU-11613

 

 



 Comments   
Comment by Mahmoud Hanafi [ 08/Nov/18 ]

Better logs, from a different server. Uploaded to ftp:/uploads/LU11644_1/lustre-log.1541648790.87846

Nov  7 19:44:29 nbp2-oss3 kernel: [820650.532934] Lustre: Skipped 3581 previous similar messages
Nov  7 19:44:29 nbp2-oss3 kernel: [820650.853265] Lustre: nbp2-OST012e: Client cedc88fe-8812-7c23-9986-b50c84150b0e (at 10.151.17.70@o2ib) reconnecting
Nov  7 19:44:29 nbp2-oss3 kernel: [820650.853266] Lustre: nbp2-OST0106: Client cedc88fe-8812-7c23-9986-b50c84150b0e (at 10.151.17.70@o2ib) reconnecting
Nov  7 19:44:29 nbp2-oss3 kernel: [820650.915564] Lustre: Skipped 91 previous similar messages
Nov  7 19:44:29 nbp2-oss3 kernel: [820650.975975] LustreError: 92324:0:(ldlm_lockd.c:2365:ldlm_cancel_handler()) ldlm_cancel from 10.149.11.42@o2ib313 arrived at 1541648669 with bad export cookie 2655206210869360537
Nov  7 19:44:29 nbp2-oss3 kernel: [820651.023594] LustreError: 92324:0:(ldlm_lockd.c:2365:ldlm_cancel_handler()) Skipped 26315 previous similar messages
Nov  7 19:44:30 nbp2-oss3 kernel: [820651.424409] Lustre: nbp2-OST002a: haven't heard from client 7788fc55-4957-f8c6-7825-fd1fdbcba946 (at 10.149.2.70@o2ib313) in 206 seconds. I think it's dead, and I am evicting it. exp ffff88185cdcf400, cur 1541648670 expire 1541648520 last 1541648464
Nov  7 19:44:30 nbp2-oss3 kernel: [820651.490743] Lustre: Skipped 1602 previous similar messages
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.470035] LustreError: 92941:0:(ldlm_lib.c:3237:target_bulk_io()) @@@ timeout on bulk WRITE after 300+0s  req@ffff8816d7e53c50 x1615791089168704/t0(0) o4->fde038a9-ef4b-50d5-6625-3411db4c7f00@10.149.15.68@o2ib313:717/0 lens 608/448 e 5 to 0 dl 1541648807 ref 1 fl Interpret:/0/0 rc 0/0
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.546266] Lustre: nbp2-OST012e: Bulk IO write error with fde038a9-ef4b-50d5-6625-3411db4c7f00 (at 10.149.15.68@o2ib313), client will retry: rc = -110
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.587136] Lustre: Skipped 7 previous similar messages
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.607054] LNet: Service thread pid 87846 was inactive for 301.04s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.658059] Pid: 87846, comm: ll_ost01_184 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.658060] Call Trace:
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.658069]  [<ffffffffa10540e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.677999]  [<ffffffffa1011e43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.678029]  [<ffffffffa1032abb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.678046]  [<ffffffffa0d4d8a4>] ofd_intent_policy+0x444/0xa40 [ofd]
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.678074]  [<ffffffffa10112ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.678104]  [<ffffffffa103ab53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.678147]  [<ffffffffa10c0262>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.678184]  [<ffffffffa10c3eca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.678217]  [<ffffffffa106c4bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.678250]  [<ffffffffa10704a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.678255]  [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.678261]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.678279]  [<ffffffffffffffff>] 0xffffffffffffffff
Nov  7 19:46:30 nbp2-oss3 kernel: [820771.678282] LustreError: dumping log to /tmp/lustre-log.1541648790.87846
Comment by Peter Jones [ 08/Nov/18 ]

Oleg

Could you please advise?

Peter

Comment by Mahmoud Hanafi [ 10/Nov/18 ]

From server metrics there is a 3 to 4 minute window where the server is locked up. I think this is the window where all the LNet pings get dropped. I enabled +net debug and it seems to prevent us from hitting this bug. I disabled it and not long after that we saw mass evictions. I will enable it for the next few days and see if we still hit the bug.

 

Found this, which looks similar:

https://jira.whamcloud.com/browse/LU-10035

Comment by Oleg Drokin [ 19/Nov/18 ]

Well, if you have locked-up servers that don't even respond to pings, that does not look like LU-10035 or LU-11613.

If increasing lnet logging makes the problem go away, that sounds like some sort of a lockup inside LNet. It would be great if you could get a crashdump while in this state so every thread could be examined (sadly we cannot just do sysrq-t, since threads spinning on a CPU cannot be captured that way - so if you are in this state and the node is alive enough that you can still issue commands, check if there are a bunch of Lustre threads hogging CPU; if not, then sysrq-t is useful; if yes, or if you cannot get console access either, trigger a crashdump).

I guess it's great news that you have a workaround of enabling +net logging, so you can at least make sure this does not happen outside of when you actually want it for debug purposes.

Comment by Bruno Faccini (Inactive) [ 08/Dec/18 ]

+1 at https://testing.whamcloud.com/test_sessions/d7f0ec38-7b1d-4cba-8e5f-ca45754a694d

Comment by Lukasz Flis [ 17/Dec/18 ]

I can confirm seeing the same issue in Cyfronet with 2.10.6.RC3

Comment by Mahmoud Hanafi [ 18/Dec/18 ]

@Lukasz Do you have a reproducer?

Comment by Mahmoud Hanafi [ 21/Feb/19 ]

What does this mean?

 Feb 20 04:30:59 nbp8-oss18 kernel: [68015.261353] LustreError: 32857:0:(ldlm_lockd.c:2365:ldlm_cancel_handler()) ldlm_cancel from 10.149.14.197@o2ib313 arrived at 1550665859 with bad export cookie 2159411342194476890

Could this be related to
https://jira.whamcloud.com/browse/LU-11931
The evictions are preceded by kiblnd_check_txs_locked timeouts around the time that obd pings should be arriving from the clients.

Feb 20 04:24:30 nbp8-oss18 kernel: [67626.491267] LNetError: 16978:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 2 seconds
Feb 20 04:24:30 nbp8-oss18 kernel: [67626.524744] LNetError: 16978:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Skipped 12 previous similar messages
Feb 20 04:24:30 nbp8-oss18 kernel: [67626.558524] LNetError: 16978:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 10.151.15.142@o2ib (268): c: 62, oc: 0, rc: 63
Feb 20 04:24:30 nbp8-oss18 kernel: [67626.599434] LNetError: 16978:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Skipped 12 previous similar messages
Feb 20 04:24:51 nbp8-oss18 kernel: [67647.490633] LNetError: 16978:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 2 seconds
Feb 20 04:24:51 nbp8-oss18 kernel: [67647.524115] LNetError: 16978:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 10.151.14.36@o2ib (290): c: 62, oc: 0, rc: 63

I uploaded lustre-log.1550676361.85441.gz to ftp.whamcloud.com/uploads/LU-11644; it shows 'no credits' messages.

Comment by Mahmoud Hanafi [ 21/Feb/19 ]

Here is what the nis stats look like on the server during the eviction:

 ========== /proc/sys/lnet/nis ==========
nid                      status alive refs peer  rtr   max    tx   min
0@lo                         up     0    2    0    0     0     0     0
0@lo                         up     0    0    0    0     0     0     0
10.151.27.84@o2ib            up    -1 12874   63    0 31384 31384 -86290
10.151.27.84@o2ib            up    -1 13237   63    0 31384 31384 -84712

Our current ko2iblnd settings:

options ko2iblnd ntx=125536 credits=62768 fmr_pool_size=31385 
options ko2iblnd timeout=150 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=63 concurrent_sends=63

Comment by Amir Shehata (Inactive) [ 22/Feb/19 ]

When did this start happening? Was this triggered by some change in the network?

From the logs in the comment above, messages are timing out while on the active_txs queue. This means the message has been posted on the wire but hasn't completed yet. This usually indicates an underlying network issue. I'm thinking that when you enable net logging it slows down the flow just enough. You have set the timeout to 150s, which means messages remain in an incomplete state for that long before they expire.

Are there any IB statistics we can take a look at, to see if there are issues with the fabric? Drops? Failures?

Comment by Mahmoud Hanafi [ 22/Feb/19 ]

This started when we moved from 2.7.3 mofed3.3 to 2.10.5 mofed4.4.2. The IB logs look clean. We don't see any drops or failures.

What do you think about the large negative values for "min" in the nis file? Should we increase our credits? Our cluster has >12K clients and our OSSes have 13 to 19 OSTs per OSS.

Why is it timing out the active_txs after 2 seconds?

Comment by Amir Shehata (Inactive) [ 22/Feb/19 ]

The large negative values are a result of the o2iblnd not completing messages. Basically, messages are not completing. You run out of credits at the LNet level, then messages start to get queued. From the stats shown you have 86K and 84K messages queued up.

By the way, why are there two NIs with the same NID?

Would you be able to share your lnet configuration?

Also would you be able to provide the output from "lnetctl net show -v 3" and "lnetctl peer show -v 3". Maybe we can see if there is a particular peer which is causing the problem.

The 2 seconds is the time after the deadline. So if you set the msg deadline to 150s after posting, then the 2 means it actually expired 2 seconds after the deadline has passed.

I don't think increasing the credits will resolve the issue in this particular case.

Would you be able to try reducing concurrent_sends to 32? I'd like to see what happens if we throttle down the number of in-flight messages.
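For example, only the concurrent_sends value would change from the options quoted earlier in this ticket (a sketch; adjust to your local modprobe file and reload the module):

options ko2iblnd ntx=125536 credits=62768 fmr_pool_size=31385
options ko2iblnd timeout=150 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=63 concurrent_sends=32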

 

Comment by Mahmoud Hanafi [ 23/Feb/19 ]

@Amir Shehata: sent you the output for lnetctl.

What is 1 credit? 1 RDMA message? How does that related to an RPC?
How does peer_credits relate to map_on_demand and fmr?

Comment by Amir Shehata (Inactive) [ 23/Feb/19 ]

At the LNet level, 1 credit is consumed for every LNet message sent, i.e. PUT, GET, ACK, REPLY.

So in the above case, when you see large negative values, these represent LNet messages being queued because there are no credits available on the NI. You set the credits to 62768, which gets divided between the two CPTs; that's why you see 31384. An RPC can be formed of multiple LNet messages, depending on the RPC being sent.

Peer credits is primarily used in the o2iblnd, and it determines the queue depth of the QP created.

FMR is the memory registration type being used. You can control the number of these FMR pools allocated when the NI is brought up. These pools can grow dynamically.

map_on_demand was originally used to decide when the LND should use FMR or FastReg memory registration instead of global memory regions. So when it was set to 32 and the RDMA would need to be broken up into more than 32 fragments, then instead of using global memory regions, the RDMA data is mapped to device memory using FMR or FastReg. However, since we now use FMR (or FastReg) all the time (global memory regions are no longer supported in RHEL 7.4 and later, I believe), map_on_demand has no impact on how LNet works.

But to be absolutely sure, can you verify whether the kernel you're using has HAVE_IB_GET_DMA_MR defined? That's what we use to determine if we should try to use global memory regions.

You can also confirm which memory registration type you're using. You should see one of these two messages when you first configure the NI:

Using FMR for registration
#or
Using FastReg for registration
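For example, one way to check for those messages on a running server, and for the build macro mentioned above (the config.h path is only an example; use wherever your Lustre build was configured):

dmesg | grep -E 'Using (FMR|FastReg) for registration'
grep HAVE_IB_GET_DMA_MR /path/to/lustre-build/config.h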

As regards the lnetctl output, I don't see any peer consuming too many credits. There are just a lot of them.

Will you be able to try reducing concurrent_sends to see if it helps with the timeout?

Comment by Mahmoud Hanafi [ 25/Feb/19 ]

What should the ratio/relation be between the ko2iblnd timeout and the LNet at_min timeout setting?

Comment by Amir Shehata (Inactive) [ 25/Feb/19 ]

at_min is part of the adaptive timeout mechanism and defines the minimum timeout for RPC responses.

The timeout grows with slow connections and decreases with healthy connections. ptlrpc keeps track of the RTT plus handling time of the RPC messages and adjusts the adaptive timeout based on that.

The ko2iblnd timeout is really below all that. Each RPC message is composed of an LNet message (PUT, GET). A PUT or a GET in turn is composed of multiple ko2iblnd transmits, depending on the message type. The ko2iblnd timeout applies to the ko2iblnd transmits. If the ko2iblnd timeout is larger than the maximum RPC timeout, then it essentially means that RPCs will time out before the ko2iblnd tx times out. I think it makes more sense for the ko2iblnd tx to time out before the RPC times out.

However, that said, after looking at the logs, I only see 6 timeouts in the o2iblnd:

Closing conn to 10.151.7.78@o2ib: error -110(waiting)
Closing conn to 10.151.7.82@o2ib: error -110(waiting)
Closing conn to 10.151.6.43@o2ib: error -110(waiting)
Closing conn to 10.151.7.105@o2ib: error -110(waiting)
Closing conn to 10.151.7.39@o2ib: error -110(waiting)
Closing conn to 10.151.7.50@o2ib: error -110(waiting)

I don't think that would be responsible for the mass evictions you're seeing.

Usually if the network is an issue, you'll be seeing a lot of o2iblnd timeouts occurring. Based on the logs you uploaded this is not the case. (unless these are routers which many clients connect through?)

Comment by Mahmoud Hanafi [ 25/Feb/19 ]

We do have clients connected through routers. The clients don't see any timeouts until they are evicted.

Should the ko2iblnd timeout be less than at_min?
Does changing concurrent_sends require that all clients and servers be changed at the same time?
Should we consider lowering the ko2iblnd timeout to 100?
Currently we have ko2iblnd timeout=150 and at_min=150. I did change at_min to 200.

Comment by Amir Shehata (Inactive) [ 25/Feb/19 ]

Are all the clients experiencing evictions connected through routers? If so, are there any issues on the routers? Drops? Timeouts?

I think setting at_min to 200 and the ko2iblnd timeout to 150 should be sane.
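For example (a sketch only; the ko2iblnd timeout is a module option, so it takes effect on a module reload, and depending on the Lustre version at_min may be a module parameter rather than an lctl tunable):

# /etc/modprobe.d/ko2iblnd.conf
options ko2iblnd timeout=150 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=63 concurrent_sends=63
# adaptive-timeout minimum at runtime
lctl set_param at_min=200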

Unfortunately concurrent_sends is not negotiated, so I recommend making the change across the whole cluster.

Another question: have you applied https://review.whamcloud.com/#/c/33975/ on your build? We're currently investigating an issue at another site and we suspect that this patch might be causing the timeouts, although the other site is OPA.

Comment by Mahmoud Hanafi [ 26/Feb/19 ]

Our 2.10.5 didn't have this patch, but our most current 2.10.6-2nas does. However, we are seeing the evictions on both versions.

The client evictions are all over, not just clients behind the routers. I think this is a server-side issue. Before the eviction the RPC rates drop to almost zero. It seems like there is a deadlock on the server for 30 to 120 seconds.

Comment by Mahmoud Hanafi [ 26/Feb/19 ]

I wonder if LU-5368 is related.

Comment by Amir Shehata (Inactive) [ 26/Feb/19 ]

From the logs, these are the clients having RDMA time outs:

10.151.7.53@o2ib
10.151.7.39@o2ib
10.151.6.43@o2ib
10.151.7.82@o2ib
10.151.4.110@o2ib
10.151.7.50@o2ib
10.151.7.105@o2ib
10.151.7.78@o2ib 

Is it possible to take a look at the logs from the clients which get evicted?

The timeouts occurring on the active_tx queue lead me to believe that there was some sort of a connection breakdown between the server and these clients. Basically the server tries to send messages which never make it to the client. The server gets stuck on that connection waiting for a notification that the tx has completed, but doesn't get it within 150 seconds.

If we can get logs from the clients during that time, maybe we can see more details.

Comment by Mahmoud Hanafi [ 28/Feb/19 ]

@Amir I will try to get you the logs. What I have been able to determine is that before the eviction there is a period of over 2 minutes where the server doesn't receive any obd_ping RPCs. For example, the server I was looking at gets ~120K obd_ping RPCs over 60 seconds, but it didn't get any for over 2 minutes before the eviction. Although it gets zero obd_ping RPCs, it is still getting ost_io (read/write) RPCs.

Do the obd_ping RPCs have to do any IPoIB lookup? (We have been having some IPoIB issues on our fabric.)

How are obd_ping RPCs different from ost_io RPCs?

 

Comment by Amir Shehata (Inactive) [ 28/Feb/19 ]

I'll let Oleg answer the rpc/ost_io question.

Regarding IPoIB, it is used when establishing a connection, for address resolution and route resolution. If there are issues on the fabric, then some connections may fail. This will lead to reconnect attempts. If you have a storm of reconnects, it's possible that it will interfere with normal operations. Do you see reconnects happening during the lock-up time?

Comment by Oleg Drokin [ 02/Mar/19 ]

obd pings are only sent from clients that have not had recent communication with a server. There are two states: if we sent some data and it was not committed, that triggers an obd ping in 7 seconds; for clients that don't have any uncommitted data on a server, we ping 4 times per obd_timeout.

So it's totally normal not to get any pings when there's active I/O ongoing from a client.
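As a rough, assumption-laden illustration of the resulting ping load (taking the obd_timeout=100 value reported later in this ticket): an otherwise-idle client import pings each target every obd_timeout/4 = 25 s, so >12K clients times 13-19 OSTs per OSS works out to roughly 6,000-9,000 obd_ping RPCs/s per OSS in the worst case where every import is idle.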

Comment by Mahmoud Hanafi [ 09/Mar/19 ]

The stack traces look very similar to

https://jira.whamcloud.com/browse/LU-5368

And there is evidence that at the time the last ping is received there is a spike in load and in lustre.portals.ost.ldlm_extend_enqueue.

Comment by Mahmoud Hanafi [ 12/Mar/19 ]

We had additional mass evictions. These all correspond to a spike in ldlm_glimpse_enqueue RPCs.

Is it possible for ldlm_reclaim to block all RPCs until it is finished? I think ldlm_reclaim may be involved here.

Comment by Mahmoud Hanafi [ 14/Mar/19 ]

I uploaded debug and /var/log/messages for a server that experienced the mass eviction. Please take a look.

ftp.whamcloud.com:/uploads/LU-11644/

 s604.lctl.dump.gz

 s604.var.log.messages.gz

s608.lctl.dump.gz

s608.var.log.messages.gz

 

Comment by Amir Shehata (Inactive) [ 14/Mar/19 ]

I'm looking at a similar symptom reported in LU-12065. We also see RDMA timeouts on active_tx in 2.10.7-RC1. The last clean run was with 2.10.6. I'm suspecting a particular patch, which reduces the number of CQ entries. My current suspicion is that if there aren't enough available CQ entries, we could have a burst of completion events which might not be processed, leaving the corresponding txs to time out. This would explain why we're seeing RDMA timeouts for transmits on active_txs.

This change came in:
LU-9810 lnd: use less CQ entries for each connection
I'm going to push a patch to revert the patch. It's not a straight revert as some other code changes happened after.

Do you have this patch on your systems?

Would you be able to try with the fix (when available) to see if it resolves the issue?

Comment by Mahmoud Hanafi [ 14/Mar/19 ]

Our 2.10.5 and 2.10.6 builds do have LU-9810.

$ git lg |grep LU-9810

  • 3970a8c4f1 - LU-9810 lnet: fix build with M-OFED 4.1 (7 weeks ago)
  • 31e16f27cc - LU-9810 lnd: use less CQ entries for each connection (8 weeks ago)

 

Comment by Amir Shehata (Inactive) [ 15/Mar/19 ]

We're seeing similar symptoms (i.e. client evictions) on our soak cluster, as I mentioned above. What we're trying to do is apply https://review.whamcloud.com/34427 and see if this resolves the issue. This change simply reverts the functionality of LU-9810. If it doesn't resolve the issue then the next step for us would be to revert both:
LU-9810 lnd: use less CQ entries for each connection
LU-10213 lnd: calculate qp max_send_wrs properly
This way we can isolate the issue to one of these changes.

If you have a test cluster on which you're able to reproduce the problem, maybe you can try the same approach. Otherwise, I'll update the ticket when we have more details from the debugging on our end.

Comment by Peter Jones [ 18/Mar/19 ]

Mahmoud

Have you tried applying this revert into production?

Peter

Comment by Jay Lan (Inactive) [ 18/Mar/19 ]

I just did a build of nas-2.10.5 by backing out
LU-9810 lnd: use less CQ entries for each connection

Our nas-2.10.5 did not have LU-10213 patch installed, so no need to back that out. (LU-10213 is in nas-2.10.6.) Since we reported the problem on 2.10.5, LU-10213 should not be the culprit.

Comment by Amir Shehata (Inactive) [ 20/Mar/19 ]

Our testing on soak narrowed down the RDMA timeouts and subsequent evictions to

LU-9810 lnd: use less CQ entries for each connection

Please let us know if the situation is improved.

Comment by Peter Jones [ 30/Mar/19 ]

NASA

Any updates?

Peter

Comment by Mahmoud Hanafi [ 10/Apr/19 ]

Finally got to try it on 1 server. It didn't help. During your testing did you have to change both clients and servers?

Comment by Jay Lan (Inactive) [ 11/Apr/19 ]

I have reverted "LU-9810 lnd: use less CQ entries for each connection" in our nas-2.10.5 and nas-2.10.6 for Mahmoud to test.

 

The revert failed on 2.12.0 though. It would bring back
+#define IBLND_SEND_WRS(c)      \
+       ((c->ibc_max_frags + 1) * kiblnd_concurrent_sends(c->ibc_version, \
+                                                         c->ibc_peer->ibp_ni))
+#define IBLND_CQ_ENTRIES(c)         (IBLND_RECV_WRS(c) + IBLND_SEND_WRS(c))
and kiblnd_concurrent_sends is no longer defined in lnet/klnds/o2iblnd/o2iblnd.h.

Comment by Amir Shehata (Inactive) [ 11/Apr/19 ]

mhanafi, yes this will need to be done on clients and servers. There is also another patch which ORNL is running with, but not 100% sure if it's applicable in your case:

LU-11931: https://review.whamcloud.com/34396

Comment by Mahmoud Hanafi [ 11/Apr/19 ]

I think we will pick up LU-11931 also. Are there any other major changes in ko2iblnd from 2.7 to 2.10?

Comment by Amir Shehata (Inactive) [ 11/Apr/19 ]

I believe these two are the ones which were causing problems. Technically the CQE issue is due to a problem with the way OFED works. We're still investigating that.

Comment by Mahmoud Hanafi [ 15/Apr/19 ]

We got all the servers for one of our large filesystems rebooted into the version without LU-9810. We are going to reboot the clients next. But with just the servers done, I was surprised that this by itself didn't help the eviction issue. Do you still think that rebooting the clients will help?

Comment by Amir Shehata (Inactive) [ 17/Apr/19 ]

I believe that we will need both clients and servers to have these changes to ensure the problem is resolved on both sides of the connection.

Comment by Mahmoud Hanafi [ 18/Apr/19 ]

@Amir Shehata, I was able to capture LNet metrics during the eviction. See the attached chart.
This was captured with the server running without LU-9810, but the clients were still running with the patch.
At the start of the client message loss, lnet_msg_alloc starts to grow sharply while the send and receive counts drop to low levels.
When the evictions happen, msg_alloc drops to zero and there is a large spike in the send and receive counts, after which everything returns to normal.

lnet_metrics_during_eviction.pdf

Comment by Mahmoud Hanafi [ 22/Apr/19 ]

The revert of LU-9810 has NOT fixed this issue for us. If we turn off "net" debugging we hit the mass client evictions.

Comment by Amir Shehata (Inactive) [ 23/Apr/19 ]

I'll work on a debug patch to see if we can narrow down what's going on.

 

Comment by Mahmoud Hanafi [ 06/Jun/19 ]

I have been trying to see if I can reproduce some LNet failures using LNet selftest. In my test case I had 120 clients reading/writing to a single server.
I get this type of bandwidth:

[LNet Rates of servers]
[R] Avg: 13910    RPC/s Min: 13910    RPC/s Max: 13910    RPC/s
[W] Avg: 18334    RPC/s Min: 18334    RPC/s Max: 18334    RPC/s
[LNet Bandwidth of servers]
[R] Avg: 4755.38  MiB/s Min: 4755.38  MiB/s Max: 4755.38  MiB/s 
[W] Avg: 4410.69  MiB/s Min: 4410.69  MiB/s Max: 4410.69  MiB/s 

At the same time I tried to lctl ping the server from the same set of clients. At different times the lctl ping would fail like this:

r467i1n3: failed to ping 10.151.27.53@o2ib: Input/output error 

 

and in the debug log on the client I get this message:

00000400:00000100:36.0F:1559862327.071451:0:3733:0:(lib-move.c:2153:lnet_parse_reply()) 10.151.23.139@o2ib: Dropping REPLY from 12345-10.151.27.53@o2ib for invalid MD 0x159da02807b48584.0x10ab844d

Not sure if this is significant or relevant.

Comment by Amir Shehata (Inactive) [ 12/Jun/19 ]

I've been looking at the logs and I see:

 Feb 11 09:12:32 nbp2-oss3 kernel: [9077322.900796] LNetError: 14085:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 19 seconds
Feb 11 09:12:32 nbp2-oss3 kernel: [9077322.932031] LNetError: 14085:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 10.151.33.79@o2ib (169): c: 61, oc: 0, rc: 63
Feb 11 09:12:32 nbp2-oss3 kernel: [9077322.969509] LNet: 14085:0:(o2iblnd_cb.c:1914:kiblnd_close_conn_locked()) Closing conn to 10.151.33.79@o2ib: error -110(waiting)
Feb 11 09:12:32 nbp2-oss3 kernel: [9077322.969916] LustreError: 14089:0:(events.c:449:server_bulk_callback()) event type 5, status -5, desc ffff880aea612600
Feb 11 09:12:50 nbp2-oss3 kernel: [9077340.858729] LNet: 89480:0:(o2iblnd_cb.c:3013:kiblnd_cm_callback()) 10.151.33.79@o2ib: ROUTE ERROR -22
Feb 11 09:12:50 nbp2-oss3 kernel: [9077340.858735] LNet: 89480:0:(o2iblnd_cb.c:3013:kiblnd_cm_callback()) Skipped 66022 previous similar messages
Feb 11 09:12:50 nbp2-oss3 kernel: [9077340.858746] LNet: 89480:0:(o2iblnd_cb.c:2094:kiblnd_peer_connect_failed()) Deleting messages for 10.151.33.79@o2ib: connection failed
Feb 11 09:12:50 nbp2-oss3 kernel: [9077340.858749] LNet: 89480:0:(o2iblnd_cb.c:2094:kiblnd_peer_connect_failed()) Skipped 66022 previous similar messages

There appears to be a significant number of route errors with -22. There are multiple places on the verbs/infiniband driver path where an -EINVAL can occur. Is this an error you can explain?

If not, we should probably compare other logs from other instances where the evictions occurred and see if we consistently see these errors.

Can we look at the opensm log (usually located in /var/log/opensm.log)? Are there any errors related to routing?

Comment by Amir Shehata (Inactive) [ 14/Jun/19 ]

Summary of suggestions from today's call:

  1. Try 3.3 MOFED or a version before 4.2 which does not have the routing issue reported. (possibly with the selftest reproducer Mahmoud mentioned above)
    1. The idea is if we can prove that rolling back MOFED resolves the issue, then we can narrow down the problem to MOFED.
  2. Take a look at opensm log to see if there are any errors related to route resolution during the eviction time or before. Maybe we can compare with the time stamp where we start seeing the route resolution problem in the lustre log.
  3. Do a patch to keep track of the active_tx queue length. This can be monitored and if it starts increasing it could be an indication that a problem has occurred.
  4. Patch can be used to trigger MLNX debug procedure discussed
  5. Capture perf data as described here: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html (see the example capture commands after this list).
    1. Flamegraphs are a good way to visualize the call stack information. This could show us what's happening on the server when the problem is hit.
    2. Perf capture can be triggered periodically, or it can be triggered when the active_tx starts growing beyond the average.
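A minimal sketch of the perf capture for item 5, following the linked page (the 99 Hz frequency and 60 s duration are example values):

perf record -F 99 -a -g -- sleep 60
perf script > out.perf

out.perf is then folded and rendered with stackcollapse-perf.pl and flamegraph.pl from the FlameGraph repository referenced above.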

I also looked at the rdma_resolve_route call.

It basically queries the sm to get the path record:

struct sa_path_rec_ib {
        __be16       dlid;
        __be16       slid;
        u8           raw_traffic;
};

struct sa_path_rec {
        union ib_gid dgid;
        union ib_gid sgid;
        __be64       service_id;
        /* reserved */
        __be32       flow_label;
        u8           hop_limit;
        u8           traffic_class;
        u8           reversible;
        u8           numb_path;
        __be16       pkey;
        __be16       qos_class;
        u8           sl;
        u8           mtu_selector;
        u8           mtu;
        u8           rate_selector;
        u8           rate;
        u8           packet_life_time_selector;
        u8           packet_life_time;
        u8           preference;
        union {
                struct sa_path_rec_ib ib;
                struct sa_path_rec_roce roce;
                struct sa_path_rec_opa opa;
        };
        enum sa_path_rec_type rec_type;
};

This data is stored in the cmid and used in rdma_connect() to create the connection. Eventually this calls into cma_connect_ib().

I don't think this call can just be bypassed.

Comment by Mahmoud Hanafi [ 14/Jun/19 ]

Here is our git tree.

https://github.com/jlan/lustre-nas/tree/nas-2.10.8

I checked the SM logs; they were clean, with no errors during the last 4 client eviction events.

 

Comment by Mahmoud Hanafi [ 14/Jun/19 ]

eviction_s611.06.05.19

The evictions happened at 12:18:00, 12:27:30, and 12:42:44.

The attached file is very wide.

 

Re-ran my LNet tests with L2.10.5/mofed4.5/el74 and was able to get the -22 errors.
But when I tried L2.7/mofed3.3/el6.8 I couldn't reproduce the errors. We will try to do a build with mofed4.1 and re-run the tests.

Comment by Amir Shehata (Inactive) [ 18/Jun/19 ]

Would you be able to try L2.10.5/mofed3.3/el6.8?

Comment by Mahmoud Hanafi [ 18/Jun/19 ]

We will try to do mofed3.3 and mofed4.1 builds.

Comment by Amir Shehata (Inactive) [ 18/Jun/19 ]

I'm looking at the git tree you shared, if this is what's deployed on your nodes, then you still have the CQ problem.

On your tree

/* 2 = LNet msg + Transfer chain */
#define IBLND_CQ_ENTRIES(c)     \
        (IBLND_RECV_WRS(c) + 2 * kiblnd_concurrent_sends(c->ibc_version, \
                                                         c->ibc_peer->ibp_ni))

It should be

 #define IBLND_CQ_ENTRIES(c)         (IBLND_RECV_WRS(c) + IBLND_SEND_WRS(c))

If you can, revert:

37919179dc61ebb7a63b8ca1c9e1bf76dd3356e8 LU-9810 lnd: use less CQ entries for each connection
Comment by Mahmoud Hanafi [ 18/Jun/19 ]

You should look at 2.10.8. That version reverts LU-9810:

https://github.com/jlan/lustre-nas/commits/nas-2.10.8

Comment by Amir Shehata (Inactive) [ 18/Jun/19 ]

You're right. I was looking at 2.10.5.

There is one issue with 2.10.8 though. It does two things. It has:

 b856230fd2ac2a7b0d0b39a347fa981de424839a Revert "LU-9810 lnd: use less CQ entries for each connection"

and

6c12330a350d3e082c835fb61cfcc5c9c93e9bdd LU-12065 lnd: increase CQ entries

When I was suggesting reverting LU-9810, the understanding was that it would be reverted on 2.10.5 and 2.10.6.

2.10.8 has the actual fix which is the second commit above. I think to make the branch more closely resemble the official 2.10.8, you should only have:

6c12330a350d3e082c835fb61cfcc5c9c93e9bdd LU-12065 lnd: increase CQ entries

I'll have the debug patch we discussed available by tomorrow.

Comment by Amir Shehata (Inactive) [ 20/Jun/19 ]

It took me a bit longer to get through this.

Here is a debug patch which can be used to monitor the internal iblnd queues and trigger an action when the queues get too large. I tested it locally, but I don't have a large cluster. So it'll be a good idea to test it on a larger cluster before deploying it on a live system.

Patch is nasa_lu11644.patch.

you can run:

lnetctl peer show --lnd --net <net type: ex o2ib1> 

This will dump output like

 [root@trevis-407 ~]# lnetctl peer show --lnd --net o2ib
lnd_peer:
    - nid: 172.16.1.6@o2ib
      ni_nid: 172.16.1.7@o2ib
      num_conns: 1
      tx_queue: 0
      accepting: 0
      connecting: 0
      reconnecting: 0
      conn_races: 0
      reconnected: 0
      refcount: 2
      max_frags: 256
      queue_depth: 8
      conns:
        - refcount: 0
          credits: 20
          outstanding_credits: 8
          reserved_credits: 1
          early_rxs: 8
          tx_noops: 0
          tx_active: 0
          tx_queue_nocred: 0
          tx_queue_rsrvd: 0
          tx_queue: 0
          conn_state: 0
          sends_posted: 3
          queue_depth: 0
          max_frags: 0
    - nid: 172.16.1.7@o2ib
      ni_nid: 172.16.1.7@o2ib
      num_conns: 2
      tx_queue: 0
      accepting: 0
      connecting: 0
      reconnecting: 0
      conn_races: 0
      reconnected: 0
      refcount: 3
      max_frags: 256
      queue_depth: 8
      conns:
        - refcount: 0
          credits: 20
          outstanding_credits: 7
          reserved_credits: 0
          early_rxs: 8
          tx_noops: 0
          tx_active: 0
          tx_queue_nocred: 0
          tx_queue_rsrvd: 0
          tx_queue: 0
          conn_state: 0
          sends_posted: 3
          queue_depth: 0
          max_frags: 0
        - refcount: 16777224
          credits: 20
          outstanding_credits: 8
          reserved_credits: 1
          early_rxs: 8
          tx_noops: 0
          tx_active: 0
          tx_queue_nocred: 0
          tx_queue_rsrvd: 0
          tx_queue: 0
          conn_state: 0
          sends_posted: 3
          queue_depth: 0
          max_frags: 0

Under the conns section you can monitor the different queue sizes: tx_active is the one of interest at the moment. This will require a bit of experimentation on a heavily loaded system, to see the average size of these queues. Then you can have a python script (or similar) to trigger an action whenever the queues grow beyond the expected average. The discussed action was to initiate some MLNX debugging to capture more data.

Note the patch grabs the first 4096 peers only.

An example python script is below. I used o2ib for the network and 300 for the expected average queue size. For the action, I simply print an output.

import yaml
import subprocess
import time

while True:
    output = subprocess.check_output(['lnetctl', 'peer', 'show', '--lnd', '--net', 'o2ib'])
    y = yaml.safe_load(output)
    for i in range (0, len(y['lnd_peer'])):
        peer = y['lnd_peer'][i]
        for j in range(0, len(peer['conns'])):
           if peer['conns'][j]['tx_active'] > 300:
              print("peer: ", peer['nid'], "active_tx is growing too large: ", peer['conns'][j]['tx_active'])
    time.sleep(10)

Comment by Mahmoud Hanafi [ 21/Jun/19 ]

Thanks, will test and report back results.

Comment by Mahmoud Hanafi [ 26/Jun/19 ]

I tested the patch with 8000+ clients and no issues. It will take a few weeks to schedule dedicated time on our production filesystem. Will report back once I have more data.

 

Comment by Mahmoud Hanafi [ 26/Jun/19 ]

Found an issue with the debug patch. It is only reporting the first 4097 peers.

nbptest2-srv1 ~ # ls -l /proc/fs/lustre/mdt/nbptest2-MDT0000/exports/| wc -l
14260
nbptest2-srv1 ~ # lnetctl peer show --lnd --net o2ib |grep ' nid:' | wc -l
4097

We recompiled with a larger value for peers.lndprs_num_peers (16K).

Comment by Amir Shehata (Inactive) [ 27/Jun/19 ]

Hi Mahmoud, I noted this limitation in my initial comment about the patch. I wanted to get the patch out, and it would have taken longer to implement an iterative way of pulling up the peers in 4K chunks. But as long as the 16K works, it should be OK.

Comment by Peter Jones [ 13/Jul/19 ]

Mahmoud

Do you have the dedicated time scheduled to run this test yet?

Peter

Comment by Mahmoud Hanafi [ 13/Jul/19 ]

Because we had a crash with the larger peers.lndprs_num_peers value, we wanted to do more testing before installing it on a production filesystem. I haven't had time to get back to this yet. Since things are stable with +net debugging, it's a bit lower on the priority list.

Comment by Mahmoud Hanafi [ 02/Dec/19 ]

I was able to capture some RPC rates before a client eviction. They show that the server gets a large spike in ldlm_glimpse_enqueue RPCs that starves out the ping RPCs. I have two charts that show this. So somehow, when we enable +net debugging, it slows things down and the ping RPCs don't get blocked. client_evictions_charts.pdf
What can cause such a large spike in ldlm_glimpse_enqueue?

Comment by Mahmoud Hanafi [ 07/Jan/20 ]

I am trying to create a reproducer for the case where an OSS gets a large spike in ldlm_glimpse_enqueue RPCs. What is the best way to recreate this RPC workload?

Comment by Mahmoud Hanafi [ 09/Jan/20 ]

I was able to get a backtrace of all threads during the window when the server's ping RPCs drop to zero. It shows 508 out of 512 ll_ost threads in ldlm_run_ast_work. This must block receiving all other RPCs.

What options do we have to slow down the rate of ldlm_glimpse_enqueues?

s214_bt.20200108.18.21.23

Comment by Andreas Dilger [ 10/Jan/20 ]

Mahmoud, do you know what the client application is doing at this point in the run? Glimpse RPCs are generated when clients do stat() operations on files to get the size; the client sends an LDLM glimpse RPC for the OST object(s) in the file, which may in turn cause the OST to send RPCs to the client(s) holding locks on the file if it is actively being written. So if multiple clients were doing parallel directory tree traversal in the same directory where other clients are writing, that could generate a lot of glimpses, or if the application is malformed and calls stat() repeatedly on a shared file for some reason (e.g. to poll for updates/completion)?

Comment by Oleg Drokin [ 10/Jan/20 ]

Can you please describe the workload? Some big activity on shared files, but enough of them to generate plenty of glimpses? Something else?

You can try scaling up the number of OST threads and see if that helps, though obviously it's not ideal. There are probably a number of other workarounds we could use, such as elevating ping RPC priority if it's not already high.

Comment by Mahmoud Hanafi [ 10/Jan/20 ]

I haven't been able to track down the user or application creating the glimpse RPCs. Part of the issue is that enabling RPCTRACE slows things down just enough to prevent the evictions. I tried to limit RPCs using TBF, but all RPCs are issued as user root, so that didn't work.

I will try to set up some additional debugging triggers to determine the application/user.

Comment by Andreas Dilger [ 10/Jan/20 ]

It looks like 2.10.x has the patch https://review.whamcloud.com/17345 "LU-7470 nrs: extend TBF with NID/JobID/OPCode expression", which allows you to rate-limit RPCs by opcode. While I would not normally suggest limiting LDLM traffic, you could try this for opc 101 (LDLM_ENQUEUE) to see if the problem can be avoided. I'm not sure of the specific rate limits you want, but looking at the recently-attached graph you could try 4000 as a reasonable cap (i.e. above normal usage, but enough to slow down the huge spike of incoming RPCs). You may have a more fine-grained breakdown of RPCs by type to see how many of the RPCs in that spike are from LDLM_ENQUEUE.
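A sketch of the kind of TBF opcode rule being suggested (the service name, rule name, and the ldlm_enqueue opcode keyword are assumptions to verify against the LU-7470 syntax on your build):

lctl set_param ost.OSS.ost.nrs_policies="tbf opcode"
lctl set_param ost.OSS.ost.nrs_tbf_rule="start ldlm_cap opcode={ldlm_enqueue} rate=4000"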

If enabling +rpctrace debugging is too heavy-weight, you could also try to enable the RPC history functionality, via "lctl set_param ost.OSS.*.req_buffer_history_max=10000" and then grab it every second via "lctl get_param ost.OSS.*.req_history >> /var/log/lustre_oss_history-$(date +%Y%m%d%H%M)" (make sure you have a rule in /etc/logrotate.d/lustre.conf to clean up those logs if you cannot reproduce it quickly) and then post-process it after an event for a detailed RPC log (sort -u would be needed to discard duplicate records in a given logfile). The fields are:

timestamp:target_nid:client_nid:client_xid:req_length:req_state:arrival_time:expiry_time opc number
6780438299461025792:192.168.20.1@tcp:12345-192.168.20.159@tcp:x1654112309314912:328:Complete:1578693813.544069:-1.455930s(-43.0s) opc 101

Since the RPC history does not involve printing to the logs (just saving the request into a list), it may be less overhead than the kernel debug logging, and allow you to capture information when the problem happens.
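For instance, a rough way to post-process those captures and count RPCs per opcode (assuming the field layout shown above, where the opcode number is the last field on each line):

sort -u /var/log/lustre_oss_history-* | awk '{ print $NF }' | sort | uniq -c | sort -rn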

Comment by Gerrit Updater [ 10/Jan/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37193
Subject: LU-11644 ptlrpc: show target name in req_history
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3d7e754be0a0aec62cd74c977cab159796b4b8a8

Comment by Mahmoud Hanafi [ 14/Jan/20 ]

For now we have enabled rpctrace. Long term, I think elevating ping RPC priority may be a good option. For very large clusters it would also be nice to set the obd_ping rate and eviction timeout independently of at_min. At 10K nodes our OSSes average ~4000 obd_ping RPCs/sec.

 

 

 

Comment by Andreas Dilger [ 15/Jan/20 ]
For very large clusters it would also be nice to set obd_ping rate and eviction timeout independent of at_min. At 10k nodes our OSS average ~4000 obd_ping rpcs/sec

For newer Lustre versions (2.12 client+server), clients will disconnect from an OST in the background if they are not actively communicating with it. This will reduce the number of pings from an idle client, and will also reduce the load during recovery due to fewer clients being connected.
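If I recall correctly, on 2.12 clients this idle disconnect interval is controlled by a per-OSC tunable (value in seconds; treat the parameter name as an assumption to check on your build), e.g.:

lctl set_param osc.*.idle_timeout=20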

Comment by Gerrit Updater [ 28/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37193/
Subject: LU-11644 ptlrpc: show target name in req_history
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 83b6c6608e94c05766ed1eddad1b7f2fee5e79fa

Comment by Peter Jones [ 28/Jan/20 ]

Do I understand correctly that the patch that landed was not a fix for the issue but just made the issue easier to debug?

Comment by Andreas Dilger [ 30/Jan/20 ]

Correct, patch is only to improve diagnostics for similar situations.

Comment by Mahmoud Hanafi [ 27/Feb/20 ]

What about raising the priority of obd_ping RPCs? This would ensure that they are not dropped.

Comment by Andreas Dilger [ 23/Apr/20 ]

One issue about elevating OBD_PING priority is that these RPCs are mostly "dead weight" just to keep the servers informed that the clients are still alive. I think the idle_timeout feature would instead allow idle clients to disconnect and reduce the OBD_PING traffic on the servers.

Another approach is to prevent evicting clients if the service threads are not making progress on the RPC queue, or if the clients already have a pending OBD_PING RPC in the request queue, even if it hasn't been processed yet. That avoids interrupting useful work by prioritizing OBD_PING processing under normal usage, but also avoids mass evictions in the rare case that the server threads are overloaded.

As for the LDLM contention issue, it would be useful to get the +rpctrace (and +ldlmtrace if possible) debug log from one of these events, to see what the root of the problem is. Fixing OBD_PING only addresses the symptom.

Comment by Mahmoud Hanafi [ 23/Apr/20 ]

Enabling +rpctrace slows the server just enough for the issue not to occur.

Comment by Mahmoud Hanafi [ 28/May/20 ]

We need a way to debug this issue. We are still hitting it, and the only way to work around it is by enabling rpctrace.

Comment by Mahmoud Hanafi [ 01/Feb/21 ]

We have a large user job (1844 nodes / 73760 cores) triggering client evictions even with our workaround of enabling RPCTRACE. I was able to gather debug logs from the last event. We have mitigated the issue by enabling TBF QoS to rate-limit ost_io RPCs. The logs show that all ll_ost_io threads sit in 'D' state and all I/O rates drop to zero. This starts at 19:31:36 and ends at 19:35:10, at which time the server starts to evict clients running the job (see s618.out). See zero.io.top.20210130.19.31.37 for thread states. I uploaded the Lustre debug file to ftp.whamcloud.com/uploads/zero.io.lctl.dk.20210130.19.34.56.gz.new. The evictions only happen on the OSSes. This filesystem has 20 OSSes and 342 OSTs.
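For reference, the kind of ost_io TBF setup used for this mitigation looks roughly like the following (syntax as in the Lustre NRS TBF documentation; the rule name and rate value are examples to adapt locally):

lctl set_param ost.OSS.ost_io.nrs_policies="tbf opcode"
lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start io_cap opcode={ost_read ost_write} rate=5000"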

Comment by Oleg Drokin [ 02/Feb/21 ]

I checked the debug log and I don't think I see any traces of requests being stuck in the incoming queue for a long time. The various "Already past deadline" sorts of messages are not present in the logs.

I picked the very first evicted client, which is "10.149.3.186@o2ib313" talking to "nbp2-OST0025", and I see the last message from it as expected - getting a PR lock.

I do see a bunch of ping requests from this same nid/client across time, but there's no way to know what particular OST they were sent to (and there's more than one on this node). The important part is - there's no significant delay in request processing between its arrival and it being taken into actual processing.

E.g. the eviction happened at:

00000100:00080000:11.0:1612064007.003210:0:97446:0:(service.c:1128:ptlrpc_update_export_timer()) nbp2-OST0025: Think about evicting 10.149.3.186@o2ib313 from 1612063856
...
00000100:00080000:9.0:1612064083.110023:0:50801:0:(pinger.c:610:ping_evictor_main()) evicting all exports of obd nbp2-OST0025 older than 1612063933
00000100:00100000:12.0:1612064083.110024:0:21995:0:(nrs_tbf.c:3127:nrs_tbf_req_get()) TBF dequeues: class@ffffa08b32e6d200 rate 5000 gen 2 token 2, rule@ffffa095af158100 rate 5000 gen 2
00000100:02000400:9.0:1612064083.110025:0:50801:0:(pinger.c:636:ping_evictor_main()) nbp2-OST0025: haven't heard from client b58b2f27-aabc-f700-545a-8d4bc7d4c55c (at 10.149.3.186@o2ib313) in 227 seconds. I think it's dead, and I am evicting it. exp ffffa07a96360000, cur 1612064083 expire 1612063933 last 1612063856

now looking back from that looking for traffic to 10.149.3.186@o2ib313:

00000400:00000200:9.0:1612064056.837794:0:34195:0:(lib-move.c:4114:lnet_parse()) TRACE: 10.151.26.114@o2ib(10.151.26.114@o2ib) <- 10.149.3.186@o2ib313 : PUT - for me
00000400:00000200:9.0:1612064056.837798:0:34195:0:(lib-ptl.c:571:lnet_ptl_match_md()) Request from 12345-10.149.3.186@o2ib313 of length 224 into portal 28 MB=0x6011a1473a440
00000400:00000200:9.0:1612064056.837801:0:34195:0:(lib-ptl.c:200:lnet_try_match_md()) Incoming put index 1c from 12345-10.149.3.186@o2ib313 of length 224/224 into md 0x86246a0871 [1] + 0
00000400:00000200:9.0:1612064056.837803:0:34195:0:(lib-md.c:65:lnet_md_unlink()) Queueing unlink of md ffffa07b4745a2a8
00000100:00000200:9.0:1612064056.837804:0:34195:0:(events.c:305:request_in_callback()) event type 2, status 0, service ost
00000100:00100000:9.0:1612064056.837807:0:34195:0:(events.c:351:request_in_callback()) peer: 12345-10.149.3.186@o2ib313 (source: 12345-10.149.3.186@o2ib313)
00000400:00000200:9.0:1612064056.837809:0:34195:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffffa07b4745a2a8
00000400:00000200:9.0:1612064056.837810:0:34195:0:(lib-msg.c:816:lnet_is_health_check()) health check = 0, status = 0, hstatus = 0
00000800:00000200:9.0:1612064056.837811:0:34195:0:(o2iblnd_cb.c:205:kiblnd_post_rx()) conn[ffffa081f0c66000] (69)++
00000100:00100000:10.0:1612064056.837812:0:37831:0:(service.c:1989:ptlrpc_server_handle_req_in()) got req x1690061384164416
00000800:00000200:9.0:1612064056.837812:0:34195:0:(o2iblnd_cb.c:239:kiblnd_post_rx()) conn[ffffa081f0c66000] (70)--
00000800:00000200:9.0:1612064056.837813:0:34195:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffffa081f0c66000] (69)--
00000100:00100000:10.0:1612064056.837817:0:37831:0:(nrs_tbf.c:3210:nrs_tbf_req_add()) TBF enqueues: class@ffffa081e90ff600 rate 5000 gen 2 token 0, rule@ffffa07df979db00 rate 5000 gen 2
00000100:00100000:10.0:1612064056.837821:0:37831:0:(nrs_tbf.c:3127:nrs_tbf_req_get()) TBF dequeues: class@ffffa081e90ff600 rate 5000 gen 2 token 1, rule@ffffa07df979db00 rate 5000 gen 2
00000100:00100000:10.0:1612064056.837828:0:37831:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost00_250:b58b2f27-aabc-f700-545a-8d4bc7d4c55c+7:3580:x1690061384164416:12345-10.149.3.186@o2ib313:400

So as you can see, processing is instantaneous.

What does the dmesg on the client side say? Could the RPCs be dropped somewhere along the way like routers?

I don't see any obvious signs of anything being slow - no complaints about slow I/O, nor any threads being stuck for a long time like the original bug report shows (the "LNet: Service thread pid XXX was inactive for" message).

Comment by Mahmoud Hanafi [ 02/Feb/21 ]

I don't think this is due to dropped messages. If we had dropped messages it would affect multiple servers and OSTs. There are no errors on the clients until they get evicted. There are no errors on the routers and no LNet errors on the servers.

What is troubling is why all I/O halts while all ll_ost_io threads are sitting in 'D' state. I will try to get a stack dump of the threads.

If an I/O request takes longer to service (while the I/O threads are sitting in 'D' state) than the obd_ping timeout, will the client get evicted?

Comment by Oleg Drokin [ 02/Feb/21 ]

First of all, I do not think you have any outstanding I/O requests from this particular client. If you did, we'd see aborted bulk messages, and I don't think I saw any?

If the I/O thread is taking too long in the actual filesystem layer we'd see messages like this for writes: "transaction handle %p was open for too long: now %lld, alloced %lld, started %lld, closed %lld", and there are other kinds of slow-IO messages that I cannot readily find. In addition, the LNet "thread took too long" message would also appear, and I think you did not have any incidents of watchdogs triggering this time?

If you have requests sitting too long in the serving queue there'd be the "Already past deadline" messages.

If you have requests taking too long actually being serviced - the client would time out (you'd also have watchdogs trigger).

Looking at the code, we prolong the export timer several times: when a request comes in and is added to the incoming request queue, when we start serving it, and when we finish serving it.
So while it's possible for a too-long request to end up in a ping eviction in the absence of other requests, the fact that the export's "last served" time did not advance in all this time, and that we can see it matching the last confirmed request from this client, is pretty solid evidence that at least this particular client did not really send a ping request for this OST, or if it did, it did not arrive at the OSS node. Note that if this were the case you would see a timeout for opc 400 on the client, though.

A third option is that the client's and server's ideas of obd_timeout are different (with the client having a significantly longer timeout than the server); then the client might be sending pings a lot less frequently than the server expects.

Comment by Mahmoud Hanafi [ 02/Feb/21 ]

This filesystem never sees more than 1 second of zero reads and writes, while during this event it sees zero I/O for 2 minutes, so something must be locked up. I will try to get a stack trace of all threads.

The file I attached yesterday (s618.out) shows that all I/O drops to zero and the server load starts to climb from 19:31:36 to 19:34:56, when the evictions happen.

 

The watchdog timers (500 seconds) are much higher than the obd_ping timeout, so we wouldn't see those warnings.

 

Comment by Andreas Dilger [ 03/Feb/21 ]

Oleg, is it possible that something in the NRS TBF path is bypassing/blocking the update of the export and preventing the request from being processed by the server?

In your previous example, it shows that this client is active for other pings and yet it is being evicted shortly thereafter. At the time of eviction, it looks like a request is being dequeued for the export, but it probably doesn't make sense for the target to evict a client that still has a pending request in the RPC queue:

nrs_tbf_req_get()) TBF dequeues: class@ffffa08b32e6d200 gen 2 token 2, rule@ffffa095af158100 rate 5000 gen 2
ping_evictor_main()) nbp2-OST0025: haven't heard from client (at 10.149.3.186@o2ib313) in 227 seconds. I think it's dead, and I am evicting it. cur 1612064083 expire 1612063933 last 1612063856

so it might be useful to print a bit more information in nrs_tbf_req_get() about the request so that we can see what it is. Maybe change the CDEBUG(D_RPCTRACE, ...) there to a DEBUG_REQ(D_RPCTRACE, ...). Also adding the target name into the ptlrpc_server_handle_request() "Handling" and "Handled" lines (from req->rq_export->exp_obd->obd_name, if rq_export isn't NULL) would definitely be very useful for lots of reasons.

Comment by Mahmoud Hanafi [ 03/Feb/21 ]

I don't think TBF is involved, because we saw this with FIFO.

Comment by Andreas Dilger [ 03/Feb/21 ]

A third option is that the client's and server's ideas of obd_timeout are different (with the client having a significantly longer timeout than the server); then the client might be sending pings a lot less frequently than the server expects.

Mahmoud, do you have multiple different filesystems mounted on the same client, possibly with different timeout values configured by their MGS config logs? There is an old problem LU-9912 that relates to this, in the uncommon case where different filesystems have been tuned with different timeouts...

Comment by Mahmoud Hanafi [ 03/Feb/21 ]

We do mount multiple filesystems, but we don't save those timeout values on the MGS. I double-checked the MGS config logs; there were no timeout settings.

 

Comment by Oleg Drokin [ 03/Feb/21 ]

Andreas, other pings are apparently coming in to other OSTs on this node. We don't print the target information, so I don't know how to confirm that theory 100%, but it seems to be the most likely one.

Mahmoud - can you please directly check the timeout value on the client I looked at vs the server just to be sure?

The watchdog timeout is set based on AT values, so it varies; the bigger it gets, the more latency there was in the system beforehand. But doing sysrq-t as you see fit should show you where everything is stuck.

Comment by Mahmoud Hanafi [ 03/Feb/21 ]

I checked at_min, at_max, and timeout on all servers and clients; they are the same.

timeout=100

at_min=275

at_max=600

We know that the server receives obd_pings from other clients and sends replies. The clients that get evicted are sending pings to the other OSTs; otherwise we would see evictions on more than one OST.
