Details
Description
Our OSS servers running 2.12.0 on Fir had been running fine until this morning. We are now seeing network errors and call traces, and all servers seem overloaded. The filesystem is still responsive. I wanted to share the following logs with you in case you see anything wrong. This really looks like a network issue, but we spent some time investigating and didn't find any problems on our different IB fabrics, although LNet does show dropped packets.
Mar 21 09:44:43 fir-io1-s1 kernel: LNet: Skipped 2 previous similar messages
Mar 21 09:44:43 fir-io1-s1 kernel: Pid: 96368, comm: ll_ost01_045 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 21 09:44:44 fir-io1-s1 kernel: Call Trace:
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0dcd890>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0d8b185>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0dac86b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc166b10b>] ofd_intent_policy+0x69b/0x920 [ofd]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0d8bec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0db48a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0e3b302>] tgt_enqueue+0x62/0x210 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0e4235a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0de692b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0dea25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffff850c1c31>] kthread+0xd1/0xe0
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffff85774c24>] ret_from_fork_nospec_begin+0xe/0x21
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Mar 21 09:44:44 fir-io1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1553186684.96368
[root@fir-io1-s1 ~]# lnetctl stats show
statistics:
    msgs_alloc: 0
    msgs_max: 16371
    rst_alloc: 283239
    errors: 0
    send_count: 2387805172
    resend_count: 0
    response_timeout_count: 203807
    local_interrupt_count: 0
    local_dropped_count: 33
    local_aborted_count: 0
    local_no_route_count: 0
    local_timeout_count: 961
    local_error_count: 13
    remote_dropped_count: 3
    remote_error_count: 0
    remote_timeout_count: 12
    network_timeout_count: 0
    recv_count: 2387455644
    route_count: 0
    drop_count: 2971
    send_length: 871207195166809
    recv_length: 477340920770381
    route_length: 0
    drop_length: 1291240
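To put these counters in proportion, here is a minimal Python sketch that parses a few of the values from the output above and expresses them as rates per million messages. The helper names (`parse_stats`, `rate_per_million`) are hypothetical. Comparing `drop_count` against `recv_count` and `response_timeout_count` against `send_count` is an assumption about which totals are the relevant denominators, not something LNet itself reports.

```python
# Counter values copied from the `lnetctl stats show` output above.
STATS_TEXT = """\
statistics:
    send_count: 2387805172
    response_timeout_count: 203807
    local_timeout_count: 961
    recv_count: 2387455644
    drop_count: 2971
"""


def parse_stats(text):
    """Parse 'key: value' lines into a dict, keeping only integer values."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[key] = int(value)
    return stats


def rate_per_million(count, total):
    """Return count as a rate per million of total (0.0 if total is zero)."""
    return 1e6 * count / total if total else 0.0


stats = parse_stats(STATS_TEXT)

# Assumption: drops are compared to received messages,
# response timeouts to sent messages.
drop_ppm = rate_per_million(stats["drop_count"], stats["recv_count"])
timeout_ppm = rate_per_million(stats["response_timeout_count"], stats["send_count"])

print(f"dropped messages:  {drop_ppm:.2f} per million received")
print(f"response timeouts: {timeout_ppm:.2f} per million sent")
```

Since these counters are cumulative, sampling the same command twice a few minutes apart and comparing the deltas would likely say more about the current problem than the absolute totals do.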
I was able to dump the kernel tasks using sysrq and am attaching the output as fir-io1-s1-sysrq-t.log
Also attaching full kernel logs as fir-io1-s1-kern.log
We use DNE, PFL and DoM. The OST backend is ldiskfs on mdraid.
Meanwhile, we'll keep investigating a possible network issue.
Thanks!
Stephane