[LU-33] client can't recover on N-hop router configuration Created: 28/Dec/10 Updated: 28/Jun/11 Resolved: 28/Jan/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Liang Zhen (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 10177 |
| Description |
|
I'm testing Lustre N-hop routing (e.g. o2ib0 <> tcp <> o2ib1) with the layout below.

MDS/OSS <-- IB (o2ib0) --> Router1 <-- TCP (tcp0) --> Router2 <-- IB (o2ib1) --> Client

- Network configuration -
There are two IB fabrics, and 1GbE connects both fabrics through LNET routers.

MDS/OSS
IP address: 192.168.100.120@o2ib0
options lnet networks=o2ib0 routes="tcp0 192.168.100.121@o2ib0; o2ib1 192.168.100.121@o2ib0"

Router1
IP address: 192.168.100.121@o2ib0, 192.168.10.121@tcp0
options lnet ip2nets="tcp0 192.168.20.*; o2ib0(ib0) 192.168.100.*" routes="o2ib1 192.168.20.122@tcp0" forwarding="enabled"

Router2
IP address: 192.168.200.122@o2ib1, 192.168.10.122@tcp0
options lnet ip2nets="tcp0 192.168.20.*; o2ib1(ib0) 192.168.200.*" routes="o2ib0 192.168.20.121@tcp0" forwarding="enabled"

Client
IP address: 192.168.200.123@o2ib1
options lnet networks=o2ib1(ib0) routes="o2ib0 192.168.200.122@o2ib1"

It works with the above configuration, but there seems to be an issue if Router2 goes down (e.g. 'lctl net down') and is then restarted. The client cannot recover unless the filesystem is unmounted and remounted on it. |
| Comments |
| Comment by Liang Zhen (Inactive) [ 28/Dec/10 ] |
|
Ihara, could you help me get a few things after restarting the router in the test:
NB: also, could you tell me the detailed version of your Lustre? Thanks |
| Comment by Shuichi Ihara (Inactive) [ 28/Dec/10 ] |
|
Liang, here are the results you requested. I have just remounted Lustre on the clients. Do you need the following results from when the problem happens?
1. /proc/sys/lnet/peers on router2
2. /proc/sys/lnet/peers and /proc/sys/lnet/routes on client
3. lctl ping router2 on client
4. lctl ping client on router2
I'm using lustre-1.8.4.ddn2, which is based on lustre-1.8.4 with backported patches, mostly from 1.8.5. |
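The four items listed above can be gathered in one pass. The following is a minimal sketch, not Lustre tooling: the function name `collect_lnet_state` and the `LNET_PROC` override are illustrative assumptions, while the `/proc/sys/lnet` files and `lctl ping` are the ones named in this ticket.

```shell
#!/bin/sh
# Sketch: dump the LNET state requested in this ticket.
# LNET_PROC and collect_lnet_state are hypothetical names chosen for this example.
LNET_PROC="${LNET_PROC:-/proc/sys/lnet}"

collect_lnet_state() {
    # $1 (optional): NID to ping, e.g. 192.168.200.122@o2ib1
    nid="${1:-}"
    for f in peers routes; do
        echo "== $LNET_PROC/$f =="
        if [ -r "$LNET_PROC/$f" ]; then
            cat "$LNET_PROC/$f"
        else
            echo "(not readable)"
        fi
    done
    # Only ping when a NID was given and lctl is installed on this node.
    if [ -n "$nid" ] && command -v lctl >/dev/null 2>&1; then
        echo "== lctl ping $nid =="
        lctl ping "$nid" || echo "lctl ping failed with status $?"
    fi
}
```

On the client this would be called with the router2 NID (`collect_lnet_state 192.168.200.122@o2ib1`), and on router2 with the client NID (`collect_lnet_state 192.168.200.123@o2ib1`).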
| Comment by Liang Zhen (Inactive) [ 28/Dec/10 ] |
|
Ihara, yes, please collect this information when you hit the problem, and if possible, please attach a source tarball of lnet here. Thanks |
| Comment by Shuichi Ihara (Inactive) [ 30/Dec/10 ] |
|
Here are the results when the problem happened.
1. /proc/sys/lnet/peers on router2
2. /proc/sys/lnet/peers and /proc/sys/lnet/routes on client
[root@r04 ~]# cat /proc/sys/lnet/routes
3. lctl ping router2 on client
[root@r04 ~]# lctl ping 192.168.200.122@o2ib1
4. lctl ping client on router2
[root@r03 ~]# lctl ping 192.168.200.123@o2ib1
When I ran "lctl ping router2" on the client, I got an Input/output error, but when I tried once more the connection was restored; the client then connected to the MGS and recovered the filesystem correctly. Running "lctl ping client" once on router2 also restores the connection. |
| Comment by Shuichi Ihara (Inactive) [ 30/Dec/10 ] |
|
Attached is part of the lnet code in lustre-1.8.4.ddn2. |
| Comment by Liang Zhen (Inactive) [ 30/Dec/10 ] |
|
Ihara, I think the reason is:
So I think you can add this to your modprobe.conf (on clients and servers). You may also want to add "live_router_check_interval=60", so that clients and servers check live routers every minute. Liang |
| Comment by Shuichi Ihara (Inactive) [ 05/Jan/11 ] |
|
Liang, thanks. Your advice fixes it, but I needed to add the two parameters (dead_router_check_interval=5 and live_router_check_interval=60) not only on the servers and clients but also on both routers.
I just wonder whether we could see any notice on the routers when the connection comes back. When router1 fails, we see the following error message on router2.
Jan 5 22:23:14 r03 kernel: [127541.771483] Lustre: No route to 12345-192.168.100.120@o2ib via LNET_NID_ANY (all routers down)
However, even after router1 recovers from the failure and all connections are restored on the clients, there are no messages on router2.
Jan 5 22:22:34 r11 kernel: [ 3202.159898] Lustre: MGC192.168.100.120@o2ib: Connection restored to service MGS using nid 192.168.100.120@o2ib.
Ihara |
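The fix reported here amounts to appending the two router-check parameters to each node's existing `options lnet` line. A minimal sketch, assuming a hypothetical helper `lnet_options_line` (not part of Lustre) that only prints the resulting line for review and does not edit modprobe.conf; the default values are the ones quoted in this comment.

```shell
# Hypothetical helper: print an lnet options line with the router-check
# intervals from this ticket appended. It does not modify any file.
lnet_options_line() {
    # $1: the node's existing lnet options (networks/ip2nets/routes/...)
    # $2: dead_router_check_interval in seconds (defaults to 5)
    # $3: live_router_check_interval in seconds (defaults to 60)
    printf 'options lnet %s dead_router_check_interval=%s live_router_check_interval=%s\n' \
        "$1" "${2:-5}" "${3:-60}"
}

# Example: the client configuration from the description.
lnet_options_line 'networks=o2ib1(ib0) routes="o2ib0 192.168.200.122@o2ib1"'
```

Module options are read when lnet is loaded, so an edited line takes effect only after the module is reloaded on that node.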
| Comment by Liang Zhen (Inactive) [ 25/Jan/11 ] |
|
Ihara, Thanks |
| Comment by Shuichi Ihara (Inactive) [ 25/Jan/11 ] |
|
Liang, it's enough now. And yes, we do get as much information from /proc as possible. Please close this ticket. Many thanks! Ihara |
| Comment by Liang Zhen (Inactive) [ 28/Jan/11 ] |
|
mark it as resolved |