[LU-6061] Adding more than nine routers produces errors Created: 19/Dec/14 Updated: 25/Feb/15 Resolved: 25/Feb/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Matt Ezell | Assignee: | Isaac Huang (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: | ORNL clusters |
| Severity: | 3 |
| Rank (Obsolete): | 16876 |
| Description |
|
This ticket is to collect information on the problem ORNL is experiencing when trying to add more than nine routers. "The issue we have been seeing is that we cannot increase the number of routers beyond 9. It does not matter which 9 of the 28 routers are in use; when we increase beyond 9 we get LNET messages." Please fill in more details when available. |
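For context on what "adding routers" involves here: LNET routers are normally declared on the clients through the lnet "routes" module parameter, so the sketch below shows roughly what that looks like. It is a purely hypothetical illustration; the net names, interface, and NID range are placeholders, not the actual ORNL configuration.

# Hypothetical sketch only - net names, interface, and NID range are placeholders.
# /etc/modprobe.d/lustre.conf on a client:
options lnet networks="o2ib4(ib0)" routes="o2ib 10.38.144.[216-235]@o2ib4"
# After reloading the lnet module, the routing table can be listed with:
lctl route_list

The point is simply that the failure appears as soon as the declared router set grows past nine, regardless of which nine are in use. |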
| Comments |
| Comment by Matt Ezell [ 19/Dec/14 ] |
|
I think we end up with a message stuck on the node. It has a message queued to a router that isn't disappearing.

[root@rhea516 ~]# cat /proc/sys/lnet/peers
nid                   refs  state  last  max  rtr  min  tx  min  queue
10.38.144.17@o2ib4       4  up       -1   63   63   63  63   59  0
10.38.144.229@o2ib4      4  up       -1   63   63   63  63   59  0
10.38.144.221@o2ib4      1  NA       -1   63   63   63  63   63  0
10.38.144.234@o2ib4      4  up       -1   63   63   63  63   60  0
10.38.144.14@o2ib4       4  up       -1   63   63   63  63   60  0
10.38.144.226@o2ib4      4  up       -1   63   63   63  63   60  0
10.38.144.218@o2ib4      1  NA       -1   63   63   63  63   63  0
10.38.144.19@o2ib4       4  up       -1   63   63   63  63   60  0
10.38.144.231@o2ib4      4  up       -1   63   63   63  63   60  0
10.38.144.223@o2ib4      1  NA       -1   63   63   63  63   63  0
10.38.144.16@o2ib4       4  up       -1   63   63   63  63   60  0
10.38.144.228@o2ib4      4  up       -1   63   63   63  63   60  0
10.38.144.220@o2ib4      1  NA       -1   63   63   63  63   63  0
10.38.144.233@o2ib4      4  up       -1   63   63   63  63   60  0
10.38.144.13@o2ib4       4  up       -1   63   63   63  63   60  0
10.38.144.225@o2ib4      4  up       -1   63   63   63  63   60  0
10.38.144.217@o2ib4      1  NA       -1   63   63   63  63   63  0
10.38.144.18@o2ib4       5  up       -1   63   63   63  63   60  0
10.38.144.230@o2ib4      5  up       -1   63   63   63  62   60  1048648
10.38.144.222@o2ib4      1  NA       -1   63   63   63  63   63  0
10.38.144.235@o2ib4      4  up       -1   63   63   63  63   60  0
10.38.144.15@o2ib4       4  up       -1   63   63   63  63   60  0
10.38.144.227@o2ib4      4  up       -1   63   63   63  63   60  0
10.38.144.219@o2ib4      1  NA       -1   63   63   63  63   63  0
10.38.144.232@o2ib4      4  up       -1   63   63   63  63   60  0
10.38.144.12@o2ib4       4  up       -1   63   63   63  63   60  0
10.38.144.224@o2ib4      1  NA       -1   63   63   63  63   63  0
10.38.144.216@o2ib4      1  NA       -1   63   63   63  63   63  0

Earlier, this node had some o2iblnd issues:

[root@rhea516 ~]# dmesg|grep o2iblnd
LNet: 5422:0:(o2iblnd_cb.c:3267:kiblnd_complete()) RDMA (tx: ffffc9001e6df240) failed: 2
LNet: 5423:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 10.38.144.230@o2ib4 failed: 5
LNet: 5423:0:(o2iblnd_cb.c:1895:kiblnd_close_conn_locked()) Closing conn to 10.38.144.230@o2ib4: error -5(waiting)

The IOR process that likely caused this is still hanging out, but it's not stuck in the D state. Here's its current stack:

[root@rhea516 ~]# cat /proc/32998/stack
[<ffffffffa091c770>] osc_extent_wait+0x420/0x670 [osc]
[<ffffffffa091cc8f>] osc_cache_wait_range+0x2cf/0x890 [osc]
[<ffffffffa090e2b9>] osc_io_fsync_end+0xa9/0xb0 [osc]
[<ffffffffa0606490>] cl_io_end+0x60/0x150 [obdclass]
[<ffffffffa099cc11>] lov_io_end_wrapper+0xf1/0x100 [lov]
[<ffffffffa099ec84>] lov_io_fsync_end+0x84/0x1e0 [lov]
[<ffffffffa0606490>] cl_io_end+0x60/0x150 [obdclass]
[<ffffffffa060b4c2>] cl_io_loop+0xc2/0x1b0 [obdclass]
[<ffffffffa0a1c54b>] cl_sync_file_range+0x31b/0x500 [lustre]
[<ffffffffa0a43fab>] ll_writepages+0x8b/0x1c0 [lustre]
[<ffffffff81134ad1>] do_writepages+0x21/0x40
[<ffffffff8111ff3b>] __filemap_fdatawrite_range+0x5b/0x60
[<ffffffff8111ff9a>] filemap_write_and_wait_range+0x5a/0x90
[<ffffffff811bb28e>] vfs_fsync_range+0x7e/0x100
[<ffffffff811bb37d>] vfs_fsync+0x1d/0x20
[<ffffffff811bb3be>] do_fsync+0x3e/0x60
[<ffffffff811bb410>] sys_fsync+0x10/0x20
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

There are lots of ldlm threads stuck in D, though:

[root@rhea516 ~]# ps aux|awk '$8=="D" {print}'
root 5716 0.0 0.0 0 0 ? D 16:23 0:08 [ldlm_bl_00]
root 5717 0.0 0.0 0 0 ? D 16:23 0:07 [ldlm_bl_01]
root 33880 0.0 0.0 0 0 ? D 20:13 0:07 [ldlm_bl_02]
root 33882 0.0 0.0 0 0 ? D 20:13 0:00 [ldlm_bl_04]
root 33883 0.0 0.0 0 0 ? D 20:13 0:00 [ldlm_bl_05]
root 33884 0.0 0.0 0 0 ? D 20:13 0:00 [ldlm_bl_06]
root 33885 0.0 0.0 0 0 ? D 20:13 0:00 [ldlm_bl_07]
root 33886 0.0 0.0 0 0 ? D 20:13 0:00 [ldlm_bl_08]
root 35914 0.0 0.0 0 0 ? D 20:50 0:00 [ldlm_bl_09]
root 35928 0.0 0.0 0 0 ? D 20:50 0:00 [ldlm_bl_10]
root 36389 0.0 0.0 0 0 ? D 21:00 0:00 [ldlm_bl_11]
root 36403 0.0 0.0 0 0 ? D 21:00 0:00 [ldlm_bl_12]
root 36655 0.0 0.0 4068 480 ? D 21:03 0:00 sync
root 36924 0.0 0.0 0 0 ? D 21:10 0:00 [ldlm_bl_13]
root 36938 0.0 0.0 0 0 ? D 21:10 0:00 [ldlm_bl_14]
root 37421 0.0 0.0 0 0 ? D 21:20 0:00 [ldlm_bl_15]
root 37435 0.0 0.0 0 0 ? D 21:20 0:00 [ldlm_bl_16]
root 37891 0.0 0.0 0 0 ? D 21:30 0:00 [ldlm_bl_17]
root 37906 0.0 0.0 0 0 ? D 21:30 0:00 [ldlm_bl_18]
root 38372 0.0 0.0 0 0 ? D 21:40 0:00 [ldlm_bl_19]
root 38386 0.0 0.0 0 0 ? D 21:40 0:00 [ldlm_bl_20]
root 38843 0.0 0.0 0 0 ? D 21:50 0:00 [ldlm_bl_21]
root 38857 0.0 0.0 0 0 ? D 21:50 0:00 [ldlm_bl_22]
root 39437 0.0 0.0 0 0 ? D 22:00 0:00 [ldlm_bl_23]
root 39451 0.0 0.0 0 0 ? D 22:00 0:00 [ldlm_bl_24]
root 39949 0.0 0.0 0 0 ? D 22:10 0:00 [ldlm_bl_25]
root 39963 0.0 0.0 0 0 ? D 22:10 0:00 [ldlm_bl_26]
root 40488 0.0 0.0 0 0 ? D 22:20 0:00 [ldlm_bl_27]
root 40502 0.0 0.0 0 0 ? D 22:20 0:00 [ldlm_bl_28]
root 41045 0.0 0.0 0 0 ? D 22:30 0:00 [ldlm_bl_29]
root 41059 0.0 0.0 0 0 ? D 22:30 0:00 [ldlm_bl_30]
root 41573 0.0 0.0 0 0 ? D 22:40 0:00 [ldlm_bl_31]
root 41588 0.0 0.0 0 0 ? D 22:40 0:00 [ldlm_bl_32]
root 42130 0.0 0.0 0 0 ? D 22:50 0:00 [ldlm_bl_33]
root 42144 0.0 0.0 0 0 ? D 22:50 0:00 [ldlm_bl_34]
root 42870 0.0 0.0 0 0 ? D 23:00 0:00 [ldlm_bl_35]
root 42884 0.0 0.0 0 0 ? D 23:00 0:00 [ldlm_bl_36]
root 43406 0.0 0.0 0 0 ? D 23:10 0:00 [ldlm_bl_37]
All of these have a stack like:

[root@rhea516 ~]# cat /proc/43406/stack
[<ffffffffa0601d57>] cl_lock_mutex_get+0x77/0xd0 [obdclass]
[<ffffffffa090b7ba>] osc_ldlm_blocking_ast+0x7a/0x350 [osc]
[<ffffffffa071d3d0>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
[<ffffffffa071d911>] ldlm_bl_thread_main+0x271/0x3f0 [ptlrpc]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

But not all the ldlm_bl threads are in D:

[root@rhea516 ~]# ps aux|awk '($11~/ldlm_bl/ && $8=="S") {print}'
root 33881 0.0 0.0 0 0 ? S 20:13 0:00 [ldlm_bl_03]
root 43420 0.0 0.0 0 0 ? S 23:10 0:00 [ldlm_bl_38]
The first one has the following stack:

[root@rhea516 ~]# cat /proc/33881/stack
[<ffffffffa091c770>] osc_extent_wait+0x420/0x670 [osc]
[<ffffffffa091cc8f>] osc_cache_wait_range+0x2cf/0x890 [osc]
[<ffffffffa091de61>] osc_cache_writeback_range+0xc11/0xfb0 [osc]
[<ffffffffa090a503>] osc_lock_flush+0x83/0x280 [osc]
[<ffffffffa090a7e7>] osc_lock_cancel+0xe7/0x1c0 [osc]
[<ffffffffa05ffe45>] cl_lock_cancel0+0x75/0x160 [obdclass]
[<ffffffffa06009fb>] cl_lock_cancel+0x13b/0x140 [obdclass]
[<ffffffffa090b87a>] osc_ldlm_blocking_ast+0x13a/0x350 [osc]
[<ffffffffa071d3d0>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
[<ffffffffa071d911>] ldlm_bl_thread_main+0x271/0x3f0 [ptlrpc]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

Access to the file system still seems to work (df completes). And I can still ping the router:

[root@rhea516 ~]# lctl ping 10.38.144.230@o2ib4
12345-0@lo
12345-10.36.145.230@o2ib
12345-10.38.144.230@o2ib4

1. What causes the initial IB errors (is it in LNET or OFED)? How do we debug that layer? |
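For reference, the stuck peer can be picked out mechanically from the last column of /proc/sys/lnet/peers (the queued byte count, in the layout shown above). A minimal sketch:

# Print any peer with a nonzero "queue" value (the last column of
# /proc/sys/lnet/peers, as shown in the output above).
awk 'NR > 1 && $NF > 0 {print $1, "queued bytes:", $NF}' /proc/sys/lnet/peers

Against the output above, only 10.38.144.230@o2ib4 shows up, with 1048648 bytes queued - the same router named in the o2iblnd errors. |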
| Comment by James A Simmons [ 19/Dec/14 ] |
|
Looks a lot like |
| Comment by Peter Jones [ 19/Dec/14 ] |
|
Isaac, could you please advise? Thanks, Peter |
| Comment by Jinshan Xiong (Inactive) [ 08/Jan/15 ] |
|
It's totally understandable to see those stack traces if the client cannot reach the OSTs. I don't think this issue is related to |
| Comment by Isaac Huang (Inactive) [ 08/Jan/15 ] |
|
Regarding the initial IB errors:

LNet: 5422:0:(o2iblnd_cb.c:3267:kiblnd_complete()) RDMA (tx: ffffc9001e6df240) failed: 2
LNet: 5423:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 10.38.144.230@o2ib4 failed: 5
LNet: 5423:0:(o2iblnd_cb.c:1895:kiblnd_close_conn_locked()) Closing conn to 10.38.144.230@o2ib4: error -5(waiting)

The 2nd and 3rd errors were likely the aftermath of the 1st: the QP got closed down, and then all pending RXs got IB_WC_WR_FLUSH_ERR (5). The 1st RDMA error was IB_WC_LOC_QP_OP_ERR (2), which was returned to LNET from the low-level IB HCA driver. In some cases it's caused by memory corruption such that the HCA couldn't read the WQE. It could be a bug in LNET or some totally unrelated driver corrupting LNET memory - hard to say at this point. My hunch is it may have something to do with
To narrow it down, it would be helpful to run a native IB test to the routers, e.g. a client doing RDMA writes to more than 9 of them. I'm not up to date with the native OFED bandwidth benchmark tools, but it's important to: 1. use RDMA writes; 2. use reliable connected (RC) QPs. That is how LNET uses the OFED APIs. |
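A hedged sketch of such a native test, assuming the OFED perftest tools (ib_write_bw) are available on the client and the routers; the device name, message size, and iteration count below are illustrative assumptions, not a prescribed procedure:

# On each router being tested (server side); -d selects the HCA (placeholder name):
ib_write_bw -d mlx4_0
# On the client, once per router: RDMA writes over a reliable connected (RC) QP,
# which is also ib_write_bw's default connection type. <router-ib-addr> is a placeholder.
ib_write_bw -d mlx4_0 -c RC -s 1048576 -n 5000 <router-ib-addr>

Repeating this against more than 9 routers (ideally all 28) exercises the RDMA-write/RC pattern described above without Lustre or LNET in the path, which should help separate a fabric/HCA problem from an LNET one. |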
| Comment by James A Simmons [ 08/Jan/15 ] |
|
Isaac, the low-level IB tests are being run on the routers today. We will report the results after we are done. |
| Comment by Matt Ezell [ 25/Feb/15 ] |
|
We identified an issue with the IB setup on this cluster. Yesterday we ran a successful test shot using all of the routers. I think we can close this ticket - we can always request to have it reopened if the problems come back. Thanks. |
| Comment by Peter Jones [ 25/Feb/15 ] |
|
ok - thanks Matt! |