Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Fix Version/s: None
- Affects Version/s: Lustre 2.10.5, Lustre 2.12.1
- Labels: None
- Severity: 2
Description
After updating to 2.10.5 we are now seeing periods of mass evictions from the servers. On the server we have the following stack trace:
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.033253] Pid: 11080, comm: ll_ost01_220 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.033260] Call Trace:
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.033274] [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038258] [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038285] [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038294] [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038318] [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038346] [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038390] [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038425] [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038455] [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038484] [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038489] [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038492] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038512] [<ffffffffffffffff>] 0xffffffffffffffff
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038515] LustreError: dumping log to /tmp/lustre-log.1541619192.11080
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.254898] LNet: Service thread pid 9724 was inactive for 303.19s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.310852] Pid: 9724, comm: ll_ost01_019 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.310854] Call Trace:
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.310866] [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332869] [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332902] [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332912] [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332936] [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.332988] [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333032] [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333067] [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333099] [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333128] [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333134] [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333137] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333158] [<ffffffffffffffff>] 0xffffffffffffffff
Will upload to ftp:/uploads/LU11613/lustre-log.1541619192.11080.
We didn't have rpctrace or dlmtrace enabled, so it may not be very useful.
Could be related to https://jira.whamcloud.com/browse/LU-11613
First of all, I do not think you have any outstanding I/O requests from this particular client. If you did, we'd see aborted bulk messages, and I don't think I saw any?
If the I/O thread is taking too long in the actual filesystem layer, we'd see messages like this for writes: "transaction handle %p was open for too long: now %lld, alloced %lld, started %lld, closed %lld", and there are similar slow-I/O messages of other kinds that I cannot readily find. In addition, the LNet "thread took too long" watchdog would also fire, and I think you did not have any incidents of any watchdogs triggering this time?
If you have requests sitting too long in the serving queue, there'd be "Already past deadline" messages.
If you have requests taking too long to actually be serviced, the client would time out (you'd also have watchdogs trigger).
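To make the distinction concrete, here is a minimal, self-contained model of those two failure modes (not the actual ptlrpc code; the struct, field names, and numbers are made up for illustration): a request picked up after its deadline corresponds to the "Already past deadline" case, while one whose servicing drags on corresponds to the watchdog/client-timeout case.

```c
/*
 * Hedged sketch (illustrative model, not Lustre code): distinguishing a
 * request that sat too long in the queue from one that was serviced too
 * slowly.  All names and numbers are invented.
 */
#include <stdio.h>

struct fake_request {
    long arrival;        /* time the request was received */
    long deadline;       /* arrival + estimated service time */
    long service_start;  /* time a service thread picked it up */
    long service_end;    /* time the reply was sent */
};

static void check_request(const struct fake_request *req, long watchdog_limit)
{
    if (req->service_start > req->deadline)
        printf("Already past deadline: queued for %lds\n",
               req->service_start - req->arrival);

    if (req->service_end - req->service_start > watchdog_limit)
        printf("serviced for %lds -> watchdog fires, client times out\n",
               req->service_end - req->service_start);
}

int main(void)
{
    /* sat too long in the serving queue */
    struct fake_request backlog = { .arrival = 0, .deadline = 100,
                                    .service_start = 150, .service_end = 160 };
    /* took too long to actually be serviced */
    struct fake_request slow_io = { .arrival = 0, .deadline = 100,
                                    .service_start = 10, .service_end = 400 };

    check_request(&backlog, 300);
    check_request(&slow_io, 300);
    return 0;
}
```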
Looking at the code, we prolong the export several times: when a request comes in and is added to the incoming request queue, when we start serving it, and when we finish serving it.
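Roughly speaking (a simplified sketch with invented names, not the real export or ptlrpc structures), the effect is that the export's "last served" timestamp is refreshed at each of those three points, so it can only stand still if nothing from that client reaches the server at all:

```c
/*
 * Hedged sketch (simplified model, not the actual export handling code):
 * the "last served" timestamp advances at each of the three prolongation
 * points described above.
 */
#include <stdio.h>

struct fake_export {
    long last_request_time;  /* stand-in for the export's "last served" time */
};

static void touch_export(struct fake_export *exp, long now, const char *where)
{
    exp->last_request_time = now;
    printf("%-15s -> last_request_time = %ld\n", where, now);
}

int main(void)
{
    struct fake_export exp = { .last_request_time = 0 };

    touch_export(&exp, 100, "request queued"); /* added to the incoming queue */
    touch_export(&exp, 105, "service start");  /* a thread begins handling it */
    touch_export(&exp, 230, "service finish"); /* reply sent, even if slow */

    /* If nothing further arrives from this client (not even a ping), the
     * timestamp stays at 230 and the ping evictor eventually acts on it. */
    return 0;
}
```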
So while it is possible for a too-long request to end up in a ping eviction in the absence of other requests, the fact that the export's "last served" time did not advance in all this time, and that we can see it matching the last confirmed request from this client, is pretty solid evidence that at least this particular client did not really send a ping request for this OST, or if it did, it did not arrive at the OSS node. Note that if that were the case, you should see a timeout for opc 400 on the client, though.
A third option is that the client's and the server's ideas of obd_timeout differ (with the client having a significantly longer timeout than the server), in which case the client might be sending pings a lot less frequently than the server expects.
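As a back-of-the-envelope illustration of that third option (the helpers below assume the conventional relationship of a ping interval of roughly obd_timeout / 4 and an eviction window of about six ping intervals; treat the exact constants as assumptions, not authoritative values), a client with a much larger obd_timeout can end up pinging no more often than the server is willing to wait:

```c
/*
 * Hedged sketch: back-of-the-envelope arithmetic for an obd_timeout mismatch.
 * The formulas are assumed approximations, and the timeout values are
 * hypothetical.
 */
#include <stdio.h>

static long ping_interval(long obd_timeout)
{
    return obd_timeout / 4 > 1 ? obd_timeout / 4 : 1;
}

static long ping_evict_window(long obd_timeout)
{
    return ping_interval(obd_timeout) * 6;
}

int main(void)
{
    long client_timeout = 600;  /* hypothetical client obd_timeout */
    long server_timeout = 100;  /* hypothetical server obd_timeout */

    printf("client pings every  %lds\n", ping_interval(client_timeout));
    printf("server evicts after %lds of silence\n", ping_evict_window(server_timeout));

    if (ping_interval(client_timeout) >= ping_evict_window(server_timeout))
        printf("mismatch: the client can look dead to the server between pings\n");
    return 0;
}
```

With these hypothetical values the client pings every 150s while the server gives up after 150s of silence, so even a slightly delayed ping would be enough to trigger an eviction.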