Details
Type: Bug
Resolution: Unresolved
Priority: Major
Affects Version/s: Lustre 2.10.5, Lustre 2.12.1
Description
After updating to 2.10.5 we are now seeing periods of mass evictions from the servers. On the server we see the following stack trace:
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.033253] Pid: 11080, comm: ll_ost01_220 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.033260] Call Trace:
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.033274] [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038258] [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038285] [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038294] [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038318] [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038346] [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038390] [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038425] [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038455] [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038484] [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038489] [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038492] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038512] [<ffffffffffffffff>] 0xffffffffffffffff
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038515] LustreError: dumping log to /tmp/lustre-log.1541619192.11080
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.254898] LNet: Service thread pid 9724 was inactive for 303.19s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.310852] Pid: 9724, comm: ll_ost01_019 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.310854] Call Trace:
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.310866] [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332869] [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332902] [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332912] [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332936] [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.332988] [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333032] [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333067] [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333099] [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333128] [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333134] [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333137] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333158] [<ffffffffffffffff>] 0xffffffffffffffff
Will upload the log to ftp:/uploads/LU11613/lustre-log.1541619192.11080.
We didn't have rpctrace or dlmtrace enabled, so it may not be very useful.
Could be related to https://jira.whamcloud.com/browse/LU-11613
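In case it happens again, the missing traces could be enabled ahead of time on the servers with something along these lines (a sketch only; the debug buffer size and dump path are example values, not what was actually used here):

# enlarge the kernel debug buffer so the trace survives until it is dumped (1024 MB is just an example)
lctl set_param debug_mb=1024
# add RPC and DLM tracing to the existing debug mask
lctl set_param debug=+rpctrace
lctl set_param debug=+dlmtrace
# after the next eviction, dump the debug buffer to a file (example path)
lctl dk /tmp/lustre-debug-$(date +%s).log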
I checked at_min, at_max, and timeout on all servers and clients; they are the same everywhere (see the sketch after the values below):
timeout=100
at_min=275
at_max=600
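For anyone repeating the check, something like the following can be used (a sketch; pdsh is just one example of a cluster shell, and the host lists are placeholders):

# on a single node
lctl get_param timeout at_min at_max
# across all servers and clients with a cluster shell (host lists are placeholders)
pdsh -w nbp8-oss[1-8] 'lctl get_param timeout at_min at_max'
pdsh -w client[001-100] 'lctl get_param timeout at_min at_max'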
We know that the server receives obd_ping requests from other clients and sends replies. The clients that get evicted are still sending pings to the other OSTs; otherwise we would see evictions on more than one OST.