Details
Type: Bug
Resolution: Unresolved
Priority: Major
Affects Version/s: Lustre 2.10.5, Lustre 2.12.1
Description
After updating to 2.10.5 we are now seeing periods of mass evictions from the servers. On the server we have the following stack trace:
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.033253] Pid: 11080, comm: ll_ost01_220 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.033260] Call Trace:
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.033274] [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038258] [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038285] [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038294] [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038318] [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038346] [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038390] [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038425] [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038455] [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038484] [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038489] [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038492] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038512] [<ffffffffffffffff>] 0xffffffffffffffff
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038515] LustreError: dumping log to /tmp/lustre-log.1541619192.11080
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.254898] LNet: Service thread pid 9724 was inactive for 303.19s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.310852] Pid: 9724, comm: ll_ost01_019 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.310854] Call Trace:
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.310866] [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332869] [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332902] [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332912] [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332936] [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.332988] [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333032] [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333067] [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333099] [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333128] [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333134] [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333137] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333158] [<ffffffffffffffff>] 0xffffffffffffffff
Will upload to ftp:/uploads/LU11613/lustre-log.1541619192.11080.
We didn't have rpctrace or dlmtrace enabled, so it may not be very useful.
Could be related to https://jira.whamcloud.com/browse/LU-11613
Oleg, is it possible that something in the NRS TBF path is bypassing/blocking the update of the export and preventing the request from being processed by the server?
Your previous example shows that this client is active for other pings and yet is being evicted shortly thereafter. At the time of eviction it looks like a request is being dequeued for the export, but it probably doesn't make sense for the target to evict a client that still has a pending request in the RPC queue.
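To make that concrete, here is a minimal standalone sketch of the kind of guard being suggested (hypothetical structure and field names, not the actual Lustre eviction path): skip eviction while the export still has requests sitting in the server's RPC queue.

#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the real structures; this only
 * illustrates the suggested policy and is not Lustre code. */
struct export_state {
	int	queued_rpcs;	/* requests received from the client but not yet handled */
	long	last_request;	/* time the last request was seen from this client */
};

/* A client with a request still queued on the server is demonstrably alive,
 * so the eviction check could skip it even if its last handled request is old. */
static bool should_evict(const struct export_state *exp, long now, long timeout)
{
	if (exp->queued_rpcs > 0)
		return false;	/* pending work => client is not dead */
	return now - exp->last_request > timeout;
}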
So it might be useful to print a bit more information in nrs_tbf_req_get() about the request so that we can see what it is. Maybe change the CDEBUG(D_RPCTRACE, ...) there to a DEBUG_REQ(D_RPCTRACE, ...). Also, adding the target name to the ptlrpc_server_handle_request() "Handling" and "Handled" lines (from req->rq_export->exp_obd->obd_name, if rq_export isn't NULL) would definitely be very useful for lots of reasons.
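For illustration, a rough sketch of what those two tweaks might look like (simplified, not taken from the actual tree; the variable names and the exact message text in nrs_tbf_req_get() and ptlrpc_server_handle_request() are assumptions):

/* In nrs_tbf_req_get(): DEBUG_REQ() prints the standard request summary
 * (opcode, xid, peer NID, export) along with the message, which the plain
 * CDEBUG() line does not.  'req' and the message text are placeholders. */
DEBUG_REQ(D_RPCTRACE, req, "TBF dequeues request");

/* In ptlrpc_server_handle_request(): add the target name to the existing
 * "Handling"/"Handled" D_RPCTRACE lines, guarding against a NULL export. */
const char *obd_name = (request->rq_export != NULL &&
			request->rq_export->exp_obd != NULL) ?
			request->rq_export->exp_obd->obd_name : "?";
CDEBUG(D_RPCTRACE, "Handling RPC on target %s\n", obd_name);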