Lustre / LU-12096

ldlm_run_ast_work call traces and network errors on overloaded OSS

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Severity: 3
    Description

      Our OSS servers running 2.12.0 on Fir had been running fine until this morning. We are now seeing network errors and call traces, and all servers seem overloaded. The filesystem is still responsive. I wanted to share the following logs with you just in case you see anything wrong. This really looks like a network issue, but we spent some time investigating and didn't find any problems on our different IB fabrics; lnet does show dropped packets, though.

      Mar 21 09:44:43 fir-io1-s1 kernel: LNet: Skipped 2 previous similar messages
      Mar 21 09:44:43 fir-io1-s1 kernel: Pid: 96368, comm: ll_ost01_045 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      Mar 21 09:44:44 fir-io1-s1 kernel: Call Trace:
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0dcd890>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0d8b185>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0dac86b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc166b10b>] ofd_intent_policy+0x69b/0x920 [ofd]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0d8bec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0db48a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0e3b302>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0e4235a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0de692b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0dea25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffff850c1c31>] kthread+0xd1/0xe0
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffff85774c24>] ret_from_fork_nospec_begin+0xe/0x21
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      Mar 21 09:44:44 fir-io1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1553186684.96368
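
      In case it helps with triage: a binary dump like the one referenced above can be converted to readable text with lctl (a sketch; the input path is taken from the LustreError line above):

      # Convert the binary Lustre debug dump to plain text for inspection
      lctl debug_file /tmp/lustre-log.1553186684.96368 /tmp/lustre-log.1553186684.96368.txt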
      
      [root@fir-io1-s1 ~]# lnetctl stats show
      statistics:
          msgs_alloc: 0
          msgs_max: 16371
          rst_alloc: 283239
          errors: 0
          send_count: 2387805172
          resend_count: 0
          response_timeout_count: 203807
          local_interrupt_count: 0
          local_dropped_count: 33
          local_aborted_count: 0
          local_no_route_count: 0
          local_timeout_count: 961
          local_error_count: 13
          remote_dropped_count: 3
          remote_error_count: 0
          remote_timeout_count: 12
          network_timeout_count: 0
          recv_count: 2387455644
          route_count: 0
          drop_count: 2971
          send_length: 871207195166809
          recv_length: 477340920770381
          route_length: 0
          drop_length: 1291240
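
      The counters above are cumulative, so a quick way to tell whether drops are still happening is to sample the relevant counters repeatedly (a sketch; run on the OSS):

      # Watch the LNet drop/timeout counters every 30s to see whether they keep growing
      while true; do
          date
          lnetctl stats show | grep -E 'drop_count|response_timeout_count|local_timeout_count'
          sleep 30
      done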
      

      I was able to dump the kernel tasks using sysrq; I'm attaching that as fir-io1-s1-sysrq-t.log.
      I'm also attaching the full kernel logs as fir-io1-s1-kern.log.
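
      For reference, such a task dump can be produced through the kernel's sysrq interface (assuming sysrq is enabled via the kernel.sysrq sysctl):

      # Ask the kernel to dump all task states (sysrq-t) into the ring buffer,
      # then capture the ring buffer to a file
      echo t > /proc/sysrq-trigger
      dmesg > fir-io1-s1-sysrq-t.log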

      We use DNE, PFL and DOM. The OST backend is ldiskfs on mdraid.

      Meanwhile, we'll keep investigating a possible network issue.

      Thanks!
      Stephane

      Attachments

        Activity

          [LU-12096] ldlm_run_ast_work call traces and network errors on overloaded OSS

          sthiell Stephane Thiell added a comment - Thanks!

          pfarrell Patrick Farrell (Inactive) added a comment (edited) - Yes, absolutely. Any Lustre node using an IB connection.

          sthiell Stephane Thiell added a comment - Are CQ entries also used on LNet routers? I assume they are. All of our routers (FDR/EDR and EDR/EDR) are running 2.12.0; maybe I need to update these too.

          sthiell Stephane Thiell added a comment (edited) -

          Hi Amir – You're absolutely right about the MGS; this is very likely another issue after all. We restarted this MGS today and I think things are better now. It's on our old 2.8 systems anyway, so we don't want to spend too much time on that right now.

          But OK for the recommendation re: the second patch, thanks much!

          ashehata Amir Shehata (Inactive) added a comment -

          The RDMA timeouts are error-level output. Regarding:

           Mar 21 13:50:57 sh-25-08.int kernel: LustreError: 128406:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.210.34.201@o2ib1: namespace resource [0x6c61676572:0x2:0x0].0x0 (ffff8b03bceb1980) refcount nonzer

          If you don't see RDMA timeouts, this could be an unrelated problem.

          Either way, I think it'll be good to try out both of those patches. The CQ entries one landed, but the other one I think makes sense to apply as well.

          sthiell Stephane Thiell added a comment - We have started an emergency rolling update of our 2.12 Lustre clients on Sherlock with the patch https://review.whamcloud.com/34474/ "LU-12065 lnd: increase CQ entries". I hope this will fix the bulk read timeouts that we see on both 2.8 and 2.12 servers.
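
          For context, the completion-queue sizing in the o2iblnd driver is derived from its module parameters, so a node's current settings can be checked from sysfs (a sketch; the exact parameter set varies by Lustre version):

          # Print the ko2iblnd tunables that factor into completion-queue sizing
          for p in peer_credits peer_credits_hiw concurrent_sends; do
              printf '%s = %s\n' "$p" "$(cat /sys/module/ko2iblnd/parameters/$p 2>/dev/null)"
          done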

          People

            Assignee: ashehata Amir Shehata (Inactive)
            Reporter: sthiell Stephane Thiell
            Votes: 0
            Watchers: 5
