Details
Type: Bug
Resolution: Incomplete
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.2.0
Labels: None
----------------------------------------------------------------------------------------------------
## MDS HW ##
----------------------------------------------------------------------------------------------------
Linux XXXX.admin.cscs.ch 2.6.32-220.7.1.el6_lustre.g9c8f747.x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
Vendor ID: AuthenticAMD
CPU family: 16
64 GB RAM
Interconnect IB 40Gb/s
---
MDT LSI 5480 Pikes Peak
SSDs SLC
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
## OSS HW ##
----------------------------------------------------------------------------------------------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
Vendor ID: GenuineIntel
CPU family: 6
64 GB RAM
Interconnect IB 40Gb/s
---
OSTs ---> LSI 7900 SATA Disks
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
## Router nodes ##
----------------------------------------------------------------------------------------------------
12 Cray XE6 Service nodes as router nodes - IB 40Gb/s
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
## Clients ##
----------------------------------------------------------------------------------------------------
~ 1500 Cray XE6 nodes - Lustre 1.8.6
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
## LUSTRE Config ##
----------------------------------------------------------------------------------------------------
1 MDS + 1 failover (MDT on SSD array)
12 OSSs - 6 OSTs per OSS (72 OSTs)
Lustre Servers ---> 2.2.51.0
Lustre Clients ---> 1.8.6 (~1500 nodes) / 2.2.51.0 (~20 nodes)
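For completeness, the running server and client builds can be re-confirmed directly on the nodes (a sketch; lctl get_param version is the standard check on 2.x, and 1.8 clients expose the same value under /proc):
# On a Lustre 2.x server or client: report the running Lustre version
lctl get_param version
# On the 1.8.6 compute-node clients, the equivalent proc file
cat /proc/fs/lustre/version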
Description
During Lustre testing yesterday we observed the following behaviour:
- 4 nodes on a blade are halted
- IO-intensive jobs such as IOR or MPIIO start (the aprun log and a simplified form of the IOR command are shown below)
Jul 17 16:38:15 nid00475 aprun.x[13684]: apid=1177710, Starting, user=20859, batch_id=377847, cmd_line="/usr/bin/aprun.x -n 256 src/C/IOR -a MPIIO -B -b 4096m -t 4096K -k -r -w -e -g -s 1 -i 2 -F -C -o /scratch/weisshorn/fverzell/test5/IORtest-377847 ", num_nodes=64, node_list=64-65,126-129,190-191,702-705,766-769,830-833,894-897,958-961,1022-1025,1086-1089,1150-1153,1214-1217,1278-1281,1294-1295,1342-1345,1406-1409,1470-1473,1534-1535
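For reference, the same IOR workload can be approximated outside the batch system (a sketch only: the mpirun launcher, the IOR binary path and the <testdir> placeholder are assumptions; the flags are taken verbatim from the aprun command above):
# 256 MPI ranks, MPIIO API, 4 GiB block / 4 MiB transfer size, write then read,
# file-per-process (-F), task reordering on read-back (-C), 2 iterations
mpirun -np 256 ./IOR -a MPIIO -B -b 4096m -t 4096K -k -r -w -e -g -s 1 -i 2 -F -C -o /scratch/weisshorn/<testdir>/IORtest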
- Then, a few minutes later, Lustre starts acting up.
Lustre server log:
Jul 17 16:39:57 weisshorn03 kernel: LNetError: 4754:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 11 seconds
Jul 17 16:39:57 weisshorn03 kernel: LNetError: 4754:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 148.187.7.73@o2ib2 (0): c: 0, oc: 1, rc: 5
Jul 17 16:39:58 weisshorn08 kernel: LNetError: 5045:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 12 seconds
Jul 17 16:39:58 weisshorn08 kernel: LNetError: 5045:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 148.187.7.78@o2ib2 (0): c: 0, oc: 3, rc: 4
Jul 17 16:39:59 weisshorn13 kernel: LNet: 3394:0:(o2iblnd_cb.c:2340:kiblnd_passive_connect()) Conn race 148.187.7.81@o2ib2
Jul 17 16:39:59 weisshorn05 kernel: LNetError: 4875:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 12 seconds
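When these o2iblnd RDMA timeouts start appearing, a quick first check from the affected OSS is peer reachability and the relevant timeout settings (a sketch; the NID is taken from the messages above and the parameter locations assume stock ko2iblnd/ptlrpc naming):
# Is the peer that timed out still reachable over LNet?
lctl ping 148.187.7.73@o2ib2
# Global obd timeout and adaptive timeout bounds on this server
lctl get_param timeout at_min at_max
# LND-level timeout currently loaded for the IB network driver
cat /sys/module/ko2iblnd/parameters/timeout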
- Note the "Bulk IO write error" for nid 833, which is part of the job mentioned above, followed by an "inactive thread" warning and a dump of the stack trace:
Jul 17 16:40:05 weisshorn14 kernel: LustreError: 7929:0:(ldlm_lib.c:2717:target_bulk_io()) @@@ network error on bulk GET 0(1048576) req@ffff880f07f8b400 x1407748581382025/t0(0) o4->412fabdd-3b3a-df4b-bdc6-264145113d70@833@gni:0/0 lens 448/416 e 0 to 0 dl 1342536406 ref 1 fl Interpret:/0/0 rc 0/0
Jul 17 16:40:05 weisshorn14 kernel: Lustre: scratch-OST003f: Bulk IO write error with 412fabdd-3b3a-df4b-bdc6-264145113d70 (at 833@gni), client will retry: rc -110
Jul 17 16:43:19 weisshorn13 kernel: Lustre: 6182:0:(service.c:1034:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/1), not sending early reply
Jul 17 16:43:19 weisshorn13 kernel: req@ffff880ddeb81400 x1407748579298657/t0(0) o4->a34c7ab8-980f-db22-6596-e1db30724c4d@12@gni:0/0 lens 448/416 e 1 to 0 dl 1342536204 ref 2 fl Interpret:/0/0 rc 0/0
Jul 17 16:43:19 weisshorn13 kernel: Lustre: 6182:0:(service.c:1034:ptlrpc_at_send_early_reply()) Skipped 19 previous similar messages
Jul 17 16:43:20 weisshorn13 kernel: LNet: Service thread pid 8102 was inactive for 600.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jul 17 16:43:20 weisshorn13 kernel: Pid: 8102, comm: ll_ost_io_153
Jul 17 16:43:20 weisshorn13 kernel:
Jul 17 16:43:20 weisshorn13 kernel: Call Trace:
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8107bf8c>] ? lock_timer_base+0x3c/0x70
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff814edc52>] schedule_timeout+0x192/0x2e0
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8107c0a0>] ? process_timeout+0x0/0x10
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa03a65c1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa05df0ad>] target_bulk_io+0x38d/0x8b0 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8105e7f0>] ? default_wake_function+0x0/0x20
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0b4c792>] ost_brw_write+0x1172/0x1380 [ost]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa03a527b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa05d64a0>] ? target_bulk_timeout+0x0/0x80 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0b507c4>] ost_handle+0x2764/0x39e0 [ost]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0612c83>] ? ptlrpc_update_export_timer+0x1c3/0x360 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa06183c1>] ptlrpc_server_handle_request+0x3c1/0xcb0 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa03a64ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa03b0ef9>] ? lc_watchdog_touch+0x79/0x110 [libcfs]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0612462>] ? ptlrpc_wait_event+0xb2/0x2c0 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa06193cf>] ptlrpc_main+0x71f/0x1210 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0618cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0618cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffffa0618cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
Jul 17 16:43:20 weisshorn13 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Jul 17 16:43:20 weisshorn13 kernel:
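The 600 s watchdog above fires while ll_ost_io_153 is still parked in target_bulk_io() waiting for a bulk transfer that already timed out at the LNet level. The deadlines involved can be read back on the OSS as follows (a sketch; these are the standard Lustre 2.x tunables, and the watchdog interval follows the adaptive timeout ceiling):
# Adaptive timeout ceiling (default 600 s) that the service thread watchdog follows
lctl get_param at_max
# Base obd timeout used when computing bulk IO deadlines
lctl get_param timeout
# Current adaptive timeout estimates for the ost_io service on this OSS
lctl get_param ost.OSS.ost_io.timeouts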
- Nid 833 is then evicted:
Jul 17 16:47:44 weisshorn14 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 568s: evicting client at 833@gni ns: filter-scratch-OST003f_UUID lock: ffff8808aa9fc480/0x9cc518d034bea3c8 lrc: 3/0,0 mode: PW/PW res: 22186722/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x20 remote: 0x629daae2e6cf7351 expref: 5 pid: 5811 timeout 4299703474
On the SMW console log:
[2012-07-17 16:48:07][c7-0c1s0n1]Lustre: 9196:0:(client.c:1492:ptlrpc_expire_one_request()) @@@ Request x1407748581382025 sent from scratch-OST003f-osc-ffff88041e142400 to NID 148.187.7.114@o2ib2 471s ago has timed out (471s prior to deadline).
[2012-07-17 16:48:07][c7-0c1s0n1] req@ffff8801bea18800 x1407748581382025/t0 o4->scratch-OST003f_UUID@148.187.7.114@o2ib2:6/4 lens 448/608 e 0 to 1 dl 1342536485 ref 2 fl Rpc:/0/0 rc 0/0
[2012-07-17 16:48:07][c7-0c1s0n1]Lustre: scratch-OST003f-osc-ffff88041e142400: Connection to service scratch-OST003f via nid 148.187.7.114@o2ib2 was lost; in progress operations using this service will wait for recovery to complete.
[2012-07-17 16:48:07][c7-0c1s0n1]LustreError: 167-0: This client was evicted by scratch-OST003f; in progress operations using this service will fail.
[2012-07-17 16:48:07][c7-0c1s0n1]Lustre: Server scratch-OST003f_UUID version (2.2.51.0) is much newer than client version (1.8.6)
[2012-07-17 16:48:07][c7-0c1s0n1]Lustre: Skipped 72 previous similar messages
[2012-07-17 16:48:07][c7-0c1s0n1]Lustre: scratch-OST003f-osc-ffff88041e142400: Connection restored to service scratch-OST003f using nid 148.187.7.114@o2ib2.
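On the client side, the eviction followed by the immediate reconnect can be cross-checked against the import state of the OSC named in the log (a sketch; on the 1.8.6 compute nodes the same history is available under /proc/fs/lustre/osc/*/state):
# Connection state history for the OSC that was evicted
lctl get_param osc.scratch-OST003f-osc-*.state
# On 2.x clients, full import details including the last connect/evict events
lctl get_param osc.scratch-OST003f-osc-*.import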
The job stalls and is finally killed because the CPU time limit is exceeded:
slurmd[rosa12]: *** JOB 377847 CANCELLED AT 17:03:36 DUE TO TIME LIMIT ***
aprun.x: Apid 1177710: Caught signal Terminated, sending to application
Attached is the log file from the Cray XE machine for the specific time range.
I've added Fanyong to the CC list. I think that if the MDT size is growing faster than you expected, it's very likely because our OI files grow forever (LU-1512). It's a design defect of IAM; Fanyong has already worked out a patch, but I don't know whether it can be applied to an existing filesystem. Fanyong, could you comment on this?
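To check whether the OI files on this MDT are indeed growing without bound (the LU-1512 symptom mentioned above), their sizes can be inspected directly on the ldiskfs backend, for example from the failover MDS while the target is not mounted as Lustre (a sketch; /dev/<mdt_device> is a placeholder and oi.16.* is the OI naming used by 2.x ldiskfs MDTs):
# Without mounting: list the OI files from the ldiskfs root
debugfs -c -R 'ls -l /' /dev/<mdt_device> | grep 'oi.16'
# Or mount read-only as ldiskfs and check sizes directly
mount -t ldiskfs -o ro /dev/<mdt_device> /mnt/mdt
ls -lh /mnt/mdt/oi.16.*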