Details
Type: Bug
Resolution: Unresolved
Priority: Major
Affects Version/s: Lustre 2.10.5, Lustre 2.12.1
Description
After updating to 2.10.5 we are now seeing periods of mass evictions from the servers. On the server we have the following stack trace:
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.033253] Pid: 11080, comm: ll_ost01_220 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.033260] Call Trace:
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.033274] [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038258] [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038285] [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038294] [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038318] [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038346] [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038390] [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038425] [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038455] [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038484] [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038489] [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038492] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038512] [<ffffffffffffffff>] 0xffffffffffffffff
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038515] LustreError: dumping log to /tmp/lustre-log.1541619192.11080
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.254898] LNet: Service thread pid 9724 was inactive for 303.19s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.310852] Pid: 9724, comm: ll_ost01_019 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.310854] Call Trace:
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.310866] [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332869] [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332902] [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332912] [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332936] [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.332988] [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333032] [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333067] [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333099] [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333128] [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333134] [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333137] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333158] [<ffffffffffffffff>] 0xffffffffffffffff
Will upload to ftp:/uploads/LU11613/lustre-log.1541619192.11080.
We didn't have rpctrace or dlmtrace enabled, so it may not be very useful.
Could be related to https://jira.whamcloud.com/browse/LU-11613
Oleg, is it possible that something in the NRS TBF path is bypassing/blocking the update of the export and preventing the request from being processed by the server?
Your previous example shows that this client is active for other pings and yet is being evicted shortly thereafter. At the time of eviction it looks like a request is being dequeued for the export, but it probably doesn't make sense for the target to evict a client that still has a pending request in the RPC queue.
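To make that concrete, here is a minimal standalone sketch of the kind of guard being suggested (hypothetical structure and field names, not the actual Lustre eviction path): skip eviction while the export still has requests sitting in the server's RPC queue.

#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the real structures; this only
 * illustrates the suggested policy and is not Lustre code. */
struct export_state {
	int	queued_rpcs;	/* requests received from the client but not yet handled */
	long	last_request;	/* time the last request was seen from this client */
};

/* A client with a request still queued on the server is demonstrably alive,
 * so the eviction check could skip it even if its last handled request is old. */
static bool should_evict(const struct export_state *exp, long now, long timeout)
{
	if (exp->queued_rpcs > 0)
		return false;	/* pending work => client is not dead */
	return now - exp->last_request > timeout;
}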
So it might be useful to print a bit more information in nrs_tbf_req_get() about the request so that we can see what it is. Maybe change the CDEBUG(D_RPCTRACE, ...) there to a DEBUG_REQ(D_RPCTRACE, ...). Also, adding the target name to the ptlrpc_server_handle_request() "Handling" and "Handled" lines (from req->rq_export->exp_obd->obd_name, if rq_export isn't NULL) would definitely be very useful for lots of reasons.
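For illustration, a rough sketch of what those two tweaks might look like (simplified, not taken from the actual tree; the variable names and the exact message text in nrs_tbf_req_get() and ptlrpc_server_handle_request() are assumptions):

/* In nrs_tbf_req_get(): DEBUG_REQ() prints the standard request summary
 * (opcode, xid, peer NID, export) along with the message, which the plain
 * CDEBUG() line does not.  'req' and the message text are placeholders. */
DEBUG_REQ(D_RPCTRACE, req, "TBF dequeues request");

/* In ptlrpc_server_handle_request(): add the target name to the existing
 * "Handling"/"Handled" D_RPCTRACE lines, guarding against a NULL export. */
const char *obd_name = (request->rq_export != NULL &&
			request->rq_export->exp_obd != NULL) ?
			request->rq_export->exp_obd->obd_name : "?";
CDEBUG(D_RPCTRACE, "Handling RPC on target %s\n", obd_name);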