[LU-6061] Adding more than nine routers produces errors Created: 19/Dec/14  Updated: 25/Feb/15  Resolved: 25/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Matt Ezell Assignee: Isaac Huang (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

ORNL clusters


Severity: 3
Rank (Obsolete): 16876

 Description   

This ticket is to collect information on the problem ORNL is experiencing when trying to add more than nine routers.

"The issue we have been seeing is where we cannot increase the number of routers beyond 9. It does not matter which routers of the 28 the 9 are in use, when we increase beyond 9 we get LNET messages."

Please fill in more details when available.



 Comments   
Comment by Matt Ezell [ 19/Dec/14 ]

I think we end up with a message stuck on this node: there is a message queued to a router that never drains.

[root@rhea516 ~]# cat /proc/sys/lnet/peers
nid                      refs state  last   max   rtr   min    tx   min queue
10.38.144.17@o2ib4          4    up    -1    63    63    63    63    59 0
10.38.144.229@o2ib4         4    up    -1    63    63    63    63    59 0
10.38.144.221@o2ib4         1    NA    -1    63    63    63    63    63 0
10.38.144.234@o2ib4         4    up    -1    63    63    63    63    60 0
10.38.144.14@o2ib4          4    up    -1    63    63    63    63    60 0
10.38.144.226@o2ib4         4    up    -1    63    63    63    63    60 0
10.38.144.218@o2ib4         1    NA    -1    63    63    63    63    63 0
10.38.144.19@o2ib4          4    up    -1    63    63    63    63    60 0
10.38.144.231@o2ib4         4    up    -1    63    63    63    63    60 0
10.38.144.223@o2ib4         1    NA    -1    63    63    63    63    63 0
10.38.144.16@o2ib4          4    up    -1    63    63    63    63    60 0
10.38.144.228@o2ib4         4    up    -1    63    63    63    63    60 0
10.38.144.220@o2ib4         1    NA    -1    63    63    63    63    63 0
10.38.144.233@o2ib4         4    up    -1    63    63    63    63    60 0
10.38.144.13@o2ib4          4    up    -1    63    63    63    63    60 0
10.38.144.225@o2ib4         4    up    -1    63    63    63    63    60 0
10.38.144.217@o2ib4         1    NA    -1    63    63    63    63    63 0
10.38.144.18@o2ib4          5    up    -1    63    63    63    63    60 0
10.38.144.230@o2ib4         5    up    -1    63    63    63    62    60 1048648
10.38.144.222@o2ib4         1    NA    -1    63    63    63    63    63 0
10.38.144.235@o2ib4         4    up    -1    63    63    63    63    60 0
10.38.144.15@o2ib4          4    up    -1    63    63    63    63    60 0
10.38.144.227@o2ib4         4    up    -1    63    63    63    63    60 0
10.38.144.219@o2ib4         1    NA    -1    63    63    63    63    63 0
10.38.144.232@o2ib4         4    up    -1    63    63    63    63    60 0
10.38.144.12@o2ib4          4    up    -1    63    63    63    63    60 0
10.38.144.224@o2ib4         1    NA    -1    63    63    63    63    63 0
10.38.144.216@o2ib4         1    NA    -1    63    63    63    63    63 0
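
A quick way to spot the stuck peer is to flag any entry that still has bytes queued; a minimal sketch, assuming the queue is the last column as in the listing above:

# print every peer with a non-zero queue (last column of /proc/sys/lnet/peers)
awk 'NR > 1 && $NF > 0 {print $1, "queued bytes:", $NF}' /proc/sys/lnet/peers

On the listing above that would report only 10.38.144.230@o2ib4 (1048648 bytes queued).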

Earlier, this node had some o2iblnd issues:

[root@rhea516 ~]# dmesg|grep o2iblnd
LNet: 5422:0:(o2iblnd_cb.c:3267:kiblnd_complete()) RDMA (tx: ffffc9001e6df240) failed: 2
LNet: 5423:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 10.38.144.230@o2ib4 failed: 5
LNet: 5423:0:(o2iblnd_cb.c:1895:kiblnd_close_conn_locked()) Closing conn to 10.38.144.230@o2ib4: error -5(waiting)

The IOR process that likely caused this is still hanging out, but it’s not stuck in the D state. Here’s its current stack:

[root@rhea516 ~]# cat /proc/32998/stack
[<ffffffffa091c770>] osc_extent_wait+0x420/0x670 [osc]
[<ffffffffa091cc8f>] osc_cache_wait_range+0x2cf/0x890 [osc]
[<ffffffffa090e2b9>] osc_io_fsync_end+0xa9/0xb0 [osc]
[<ffffffffa0606490>] cl_io_end+0x60/0x150 [obdclass]
[<ffffffffa099cc11>] lov_io_end_wrapper+0xf1/0x100 [lov]
[<ffffffffa099ec84>] lov_io_fsync_end+0x84/0x1e0 [lov]
[<ffffffffa0606490>] cl_io_end+0x60/0x150 [obdclass]
[<ffffffffa060b4c2>] cl_io_loop+0xc2/0x1b0 [obdclass]
[<ffffffffa0a1c54b>] cl_sync_file_range+0x31b/0x500 [lustre]
[<ffffffffa0a43fab>] ll_writepages+0x8b/0x1c0 [lustre]
[<ffffffff81134ad1>] do_writepages+0x21/0x40
[<ffffffff8111ff3b>] __filemap_fdatawrite_range+0x5b/0x60
[<ffffffff8111ff9a>] filemap_write_and_wait_range+0x5a/0x90
[<ffffffff811bb28e>] vfs_fsync_range+0x7e/0x100
[<ffffffff811bb37d>] vfs_fsync+0x1d/0x20
[<ffffffff811bb3be>] do_fsync+0x3e/0x60
[<ffffffff811bb410>] sys_fsync+0x10/0x20
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

There are lots of ldlm_bl threads stuck in the D state, though:

[root@rhea516 ~]# ps aux|awk '$8=="D" {print}'
root       5716  0.0  0.0      0     0 ?        D    16:23   0:08 [ldlm_bl_00]
root       5717  0.0  0.0      0     0 ?        D    16:23   0:07 [ldlm_bl_01]
root      33880  0.0  0.0      0     0 ?        D    20:13   0:07 [ldlm_bl_02]
root      33882  0.0  0.0      0     0 ?        D    20:13   0:00 [ldlm_bl_04]
root      33883  0.0  0.0      0     0 ?        D    20:13   0:00 [ldlm_bl_05]
root      33884  0.0  0.0      0     0 ?        D    20:13   0:00 [ldlm_bl_06]
root      33885  0.0  0.0      0     0 ?        D    20:13   0:00 [ldlm_bl_07]
root      33886  0.0  0.0      0     0 ?        D    20:13   0:00 [ldlm_bl_08]
root      35914  0.0  0.0      0     0 ?        D    20:50   0:00 [ldlm_bl_09]
root      35928  0.0  0.0      0     0 ?        D    20:50   0:00 [ldlm_bl_10]
root      36389  0.0  0.0      0     0 ?        D    21:00   0:00 [ldlm_bl_11]
root      36403  0.0  0.0      0     0 ?        D    21:00   0:00 [ldlm_bl_12]
root      36655  0.0  0.0   4068   480 ?        D    21:03   0:00 sync
root      36924  0.0  0.0      0     0 ?        D    21:10   0:00 [ldlm_bl_13]
root      36938  0.0  0.0      0     0 ?        D    21:10   0:00 [ldlm_bl_14]
root      37421  0.0  0.0      0     0 ?        D    21:20   0:00 [ldlm_bl_15]
root      37435  0.0  0.0      0     0 ?        D    21:20   0:00 [ldlm_bl_16]
root      37891  0.0  0.0      0     0 ?        D    21:30   0:00 [ldlm_bl_17]
root      37906  0.0  0.0      0     0 ?        D    21:30   0:00 [ldlm_bl_18]
root      38372  0.0  0.0      0     0 ?        D    21:40   0:00 [ldlm_bl_19]
root      38386  0.0  0.0      0     0 ?        D    21:40   0:00 [ldlm_bl_20]
root      38843  0.0  0.0      0     0 ?        D    21:50   0:00 [ldlm_bl_21]
root      38857  0.0  0.0      0     0 ?        D    21:50   0:00 [ldlm_bl_22]
root      39437  0.0  0.0      0     0 ?        D    22:00   0:00 [ldlm_bl_23]
root      39451  0.0  0.0      0     0 ?        D    22:00   0:00 [ldlm_bl_24]
root      39949  0.0  0.0      0     0 ?        D    22:10   0:00 [ldlm_bl_25]
root      39963  0.0  0.0      0     0 ?        D    22:10   0:00 [ldlm_bl_26]
root      40488  0.0  0.0      0     0 ?        D    22:20   0:00 [ldlm_bl_27]
root      40502  0.0  0.0      0     0 ?        D    22:20   0:00 [ldlm_bl_28]
root      41045  0.0  0.0      0     0 ?        D    22:30   0:00 [ldlm_bl_29]
root      41059  0.0  0.0      0     0 ?        D    22:30   0:00 [ldlm_bl_30]
root      41573  0.0  0.0      0     0 ?        D    22:40   0:00 [ldlm_bl_31]
root      41588  0.0  0.0      0     0 ?        D    22:40   0:00 [ldlm_bl_32]
root      42130  0.0  0.0      0     0 ?        D    22:50   0:00 [ldlm_bl_33]
root      42144  0.0  0.0      0     0 ?        D    22:50   0:00 [ldlm_bl_34]
root      42870  0.0  0.0      0     0 ?        D    23:00   0:00 [ldlm_bl_35]
root      42884  0.0  0.0      0     0 ?        D    23:00   0:00 [ldlm_bl_36]
root      43406  0.0  0.0      0     0 ?        D    23:10   0:00 [ldlm_bl_37]

All of these have a stack like:

[root@rhea516 ~]# cat /proc/43406/stack
[<ffffffffa0601d57>] cl_lock_mutex_get+0x77/0xd0 [obdclass]
[<ffffffffa090b7ba>] osc_ldlm_blocking_ast+0x7a/0x350 [osc]
[<ffffffffa071d3d0>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
[<ffffffffa071d911>] ldlm_bl_thread_main+0x271/0x3f0 [ptlrpc]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff
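
To confirm they are all blocked in the same place, a small loop over the D-state ldlm_bl threads can dump every stack at once; a minimal sketch using standard procps fields:

# dump the stack of each ldlm_bl thread currently in the D state
for pid in $(ps -eo pid,stat,comm | awk '$2 ~ /^D/ && $3 ~ /^ldlm_bl/ {print $1}'); do
    echo "== ${pid} ($(cat /proc/${pid}/comm))"
    cat /proc/${pid}/stack
done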

Not all the ldlm_bl threads are in the D state, though:

[root@rhea516 ~]# ps aux|awk '($11~/ldlm_bl/ && $8=="S") {print}'
root      33881  0.0  0.0      0     0 ?        S    20:13   0:00 [ldlm_bl_03]
root      43420  0.0  0.0      0     0 ?        S    23:10   0:00 [ldlm_bl_38]

The first one has the following stack:

[root@rhea516 ~]# cat /proc/33881/stack
[<ffffffffa091c770>] osc_extent_wait+0x420/0x670 [osc]
[<ffffffffa091cc8f>] osc_cache_wait_range+0x2cf/0x890 [osc]
[<ffffffffa091de61>] osc_cache_writeback_range+0xc11/0xfb0 [osc]
[<ffffffffa090a503>] osc_lock_flush+0x83/0x280 [osc]
[<ffffffffa090a7e7>] osc_lock_cancel+0xe7/0x1c0 [osc]
[<ffffffffa05ffe45>] cl_lock_cancel0+0x75/0x160 [obdclass]
[<ffffffffa06009fb>] cl_lock_cancel+0x13b/0x140 [obdclass]
[<ffffffffa090b87a>] osc_ldlm_blocking_ast+0x13a/0x350 [osc]
[<ffffffffa071d3d0>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
[<ffffffffa071d911>] ldlm_bl_thread_main+0x271/0x3f0 [ptlrpc]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

Access to the file system still seems to work (df completes), and I can still ping the router:

[root@rhea516 ~]# lctl ping 10.38.144.230@o2ib4
12345-0@lo
12345-10.36.145.230@o2ib
12345-10.38.144.230@o2ib4
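
To check the rest of the routers in one go, a loop over the peer table works; a minimal sketch that just pings every NID listed in /proc/sys/lnet/peers (which, on this client, are the routers):

# ping every peer NID, skipping the header line
for nid in $(awk 'NR > 1 {print $1}' /proc/sys/lnet/peers); do
    echo "== ${nid}"
    lctl ping ${nid} || echo "ping to ${nid} failed"
done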

1. What causes the initial IB errors (is it in LNET or OFED)? How do we debug that layer?
2. Why does the message get stuck - and why doesn't it get retried?
3. Why don't we see this behavior with a smaller number of routers?

Comment by James A Simmons [ 19/Dec/14 ]

Looks a lot like LU-4300.

Comment by Peter Jones [ 19/Dec/14 ]

Isaac

Could you please advise?

Thanks

Peter

Comment by Jinshan Xiong (Inactive) [ 08/Jan/15 ]

It's entirely expected to see those stack traces if the client cannot reach the OSTs. I don't think this issue is related to LU-4300.

Comment by Isaac Huang (Inactive) [ 08/Jan/15 ]

Regarding the initial IB errors

LNet: 5422:0:(o2iblnd_cb.c:3267:kiblnd_complete()) RDMA (tx: ffffc9001e6df240) failed: 2
LNet: 5423:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 10.38.144.230@o2ib4 failed: 5
LNet: 5423:0:(o2iblnd_cb.c:1895:kiblnd_close_conn_locked()) Closing conn to 10.38.144.230@o2ib4: error -5(waiting)

The 2nd and 3rd messages were likely the aftermath of the 1st error: the QP got closed down, and then all pending RXs got IB_WC_WR_FLUSH_ERR (5).

The 1st RDMA error was IB_WC_LOC_QP_OP_ERR (status 2), returned to LNET by the low-level IB HCA driver. In some cases it's caused by memory corruption such that the HCA couldn't read the WQE. It could be a bug in LNET or some totally unrelated driver corrupting LNET memory - hard to say at this point. My hunch is it may have something to do with LU-5271, which also indicated a corrupted WQE ID.
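
For reference, those numeric statuses come from enum ib_wc_status in the kernel's RDMA headers; a quick way to check the mapping on a node with kernel headers installed (the header path below is an assumption and may differ per distro):

# list the ib_wc_status values (2 = IB_WC_LOC_QP_OP_ERR, 5 = IB_WC_WR_FLUSH_ERR)
grep -n 'IB_WC_' "/usr/src/kernels/$(uname -r)/include/rdma/ib_verbs.h" | head -20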

To narrow it down, it'd be helpful to run a native IB test against the routers, e.g. a client doing RDMA writes to the routers (more than 9 of them). I'm not up to date with the native OFED bandwidth benchmark tools, but it's important to: 1. use RDMA writes; 2. use reliable connected QPs. That's how LNET uses the OFED APIs.
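
For example, something along these lines with the perftest tools should do; this is only a sketch - ib_write_bw defaults to RDMA writes over an RC QP, the exact flags vary by perftest version, and ROUTER_LIST is just a placeholder for the router hostnames/IPs:

# on each router (server side):
ib_write_bw

# on the client, RDMA-write to all the routers (more than 9) concurrently:
for r in ${ROUTER_LIST}; do
    ib_write_bw -c RC -s 1048576 -n 100000 ${r} &
done
wait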

Comment by James A Simmons [ 08/Jan/15 ]

Isaac, the low-level IB tests are being run on the routers today. We will report the results when we are done.

Comment by Matt Ezell [ 25/Feb/15 ]

We identified an issue with the IB setup on this cluster. Yesterday we ran a successful test shot using all of the routers. I think we can close this ticket - we can always request to have it reopened if the problems come back.

Thanks.

Comment by Peter Jones [ 25/Feb/15 ]

ok - thanks Matt!
