Details
Type: Bug
Resolution: Unresolved
Priority: Major
Affects Version/s: Lustre 2.10.5, Lustre 2.12.1
Description
After updating to 2.10.5 we are now seeing periods of mass evictions from the servers. On the server we have the following stack trace:
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.033253] Pid: 11080, comm: ll_ost01_220 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.033260] Call Trace:
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.033274] [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038258] [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038285] [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038294] [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038318] [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038346] [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038390] [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038425] [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038455] [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038484] [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038489] [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038492] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038512] [<ffffffffffffffff>] 0xffffffffffffffff
Nov 7 11:33:12 nbp8-oss7 kernel: [531465.038515] LustreError: dumping log to /tmp/lustre-log.1541619192.11080
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.254898] LNet: Service thread pid 9724 was inactive for 303.19s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.310852] Pid: 9724, comm: ll_ost01_019 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.310854] Call Trace:
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.310866] [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332869] [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332902] [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332912] [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
Nov 7 11:33:14 nbp8-oss7 kernel: [531467.332936] [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.332988] [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333032] [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333067] [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333099] [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333128] [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333134] [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333137] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov 7 11:33:15 nbp8-oss7 kernel: [531467.333158] [<ffffffffffffffff>] 0xffffffffffffffff
Will upload the log to ftp:/uploads/LU11613/lustre-log.1541619192.11080.
We didn't have rpctrace or dlmtrace enabled, so it may not be very useful.
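For a future occurrence, a sketch of how those traces could be captured ahead of time on the affected OSS (the debug mask additions and dump filename below are our suggestion, not something already configured on this system):

lctl set_param debug=+rpctrace
lctl set_param debug=+dlmtrace
# after the next eviction event, dump the kernel debug buffer to a file:
lctl dk /tmp/lustre-debug.$(date +%s).log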
Could be related to https://jira.whamcloud.com/browse/LU-11613
Mahmoud, do you know what the client application is doing at this point in the run? Glimpse RPCs are generated when clients do stat() operations on files to get the size: the client sends an LDLM glimpse RPC for the OST object(s) in the file, which may in turn cause the OST to send RPCs to the client(s) holding the locks on the file if it is actively being written. So a lot of glimpses could be generated if multiple clients were doing a parallel directory tree traversal in the same directory where other clients are writing, or if the application was misbehaving and calling stat() repeatedly on a shared file for some reason (e.g. to poll for updates/completion).
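For illustration only, a minimal sketch of the stat()-polling pattern described above, which would generate a steady stream of glimpse RPCs while another client holds the write locks (the mount point, file name, and interval are hypothetical, not taken from this ticket):

while :; do
    # each stat() on a file being actively written by another client
    # triggers a glimpse on its OST object(s) to fetch the current size
    stat /mnt/nbp8/shared_output_file > /dev/null
    sleep 1
done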