Description
Our OSS servers running 2.12.0 on Fir had been running fine until this morning. We are now seeing network errors and call traces, and all servers seem overloaded. The filesystem is still responsive. I wanted to share the following logs with you in case you see anything wrong. This really looks like a network issue, but after spending some time investigating we didn't find any problems on our different IB fabrics. lnet does show dropped packets, though.
Mar 21 09:44:43 fir-io1-s1 kernel: LNet: Skipped 2 previous similar messages
Mar 21 09:44:43 fir-io1-s1 kernel: Pid: 96368, comm: ll_ost01_045 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 21 09:44:44 fir-io1-s1 kernel: Call Trace:
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0dcd890>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0d8b185>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0dac86b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc166b10b>] ofd_intent_policy+0x69b/0x920 [ofd]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0d8bec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0db48a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0e3b302>] tgt_enqueue+0x62/0x210 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0e4235a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0de692b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffc0dea25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffff850c1c31>] kthread+0xd1/0xe0
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffff85774c24>] ret_from_fork_nospec_begin+0xe/0x21
Mar 21 09:44:44 fir-io1-s1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Mar 21 09:44:44 fir-io1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1553186684.96368
[root@fir-io1-s1 ~]# lnetctl stats show
statistics:
    msgs_alloc: 0
    msgs_max: 16371
    rst_alloc: 283239
    errors: 0
    send_count: 2387805172
    resend_count: 0
    response_timeout_count: 203807
    local_interrupt_count: 0
    local_dropped_count: 33
    local_aborted_count: 0
    local_no_route_count: 0
    local_timeout_count: 961
    local_error_count: 13
    remote_dropped_count: 3
    remote_error_count: 0
    remote_timeout_count: 12
    network_timeout_count: 0
    recv_count: 2387455644
    route_count: 0
    drop_count: 2971
    send_length: 871207195166809
    recv_length: 477340920770381
    route_length: 0
    drop_length: 1291240
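
To put the counters above in proportion, here is a minimal Python sketch that parses the `key: value` lines of the `lnetctl stats show` output and relates the drop and timeout counters to the total message counts. The helper names `parse_lnet_stats` and `ratio` are hypothetical and not part of any Lustre tooling. The sample embeds only a few values copied from the output above. These counters are assumed to be cumulative, so the ratios describe the whole period counted and say nothing about when the drops happened.

```python
def parse_lnet_stats(text):
    """Parse 'key: value' lines into a dict of integer counters.

    Lines without a numeric value (such as the 'statistics:' header)
    are skipped.
    """
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[key.strip()] = int(value)
    return stats


def ratio(stats, numerator, denominator):
    """Return numerator/denominator for two counters, or 0.0 if undefined."""
    den = stats.get(denominator, 0)
    return stats.get(numerator, 0) / den if den else 0.0


# Values copied from the output above (subset only).
SAMPLE = """statistics:
    send_count: 2387805172
    response_timeout_count: 203807
    local_timeout_count: 961
    recv_count: 2387455644
    drop_count: 2971
"""

stats = parse_lnet_stats(SAMPLE)
drop_ratio = ratio(stats, "drop_count", "recv_count")
timeout_ratio = ratio(stats, "response_timeout_count", "send_count")

print(f"drop_count / recv_count:             {drop_ratio:.2e}")
print(f"response_timeout_count / send_count: {timeout_ratio:.2e}")
```

Taking two snapshots a few minutes apart and comparing the counters would probably be more telling than a single ratio, since it would show whether the drops and timeouts are still increasing.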
I was able to dump the kernel tasks using sysrq; that output is attached as fir-io1-s1-sysrq-t.log.
The full kernel logs are also attached as fir-io1-s1-kern.log.
We use DNE, PFL and DoM. The OST backend is ldiskfs on mdraid.
Meanwhile, we'll keep investigating a possible network issue.
Thanks!
Stephane