[LU-1642] Clients get disconnected and reconnected during heavy IO immediately after the halt of a blade. Created: 18/Jul/12 Updated: 29/May/17 Resolved: 29/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Fabio Verzelloni | Assignee: | Oleg Drokin |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre Servers ---> 2.2.51.0 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4006 |
| Description |
|
During Lustre testing yesterday we observed this behaviour:
Jul 17 16:38:15 nid00475 aprun.x[13684]: apid=1177710, Starting, user=20859, batch_id=377847, cmd_line="/usr/bin/apr
Lustre server log:
Jul 17 16:39:57 weisshorn03 kernel: LNetError: 4754:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 11 seconds
Jul 17 16:39:58 weisshorn08 kernel: LNetError: 5045:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 12 seconds
Jul 17 16:39:59 weisshorn13 kernel: LNet: 3394:0:(o2iblnd_cb.c:2340:kiblnd_passive_connect()) Conn race 148.187.7.81@o2ib2
Jul 17 16:40:05 weisshorn14 kernel: LustreError: 7929:0:(ldlm_lib.c:2717:target_bulk_io()) @@@ network error on bulk GET 0(1048576) req@f
Jul 17 16:43:19 weisshorn13 kernel: Lustre: 6182:0:(service.c:1034:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/1), not sending early reply
Jul 17 16:43:20 weisshorn13 kernel: LNet: Service thread pid 8102 was inactive for 600.00s. The thread might be hung, or it might only be
Jul 17 16:47:44 weisshorn14 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 568
On SMW console log:
[2012-07-17 16:48:07][c7-0c1s0n1]Lustre: 9196:0:(client.c:1492:ptlrpc_expire_one_request()) @@@ Request x1407748581382025
The job stalled and was finally killed because the cputime limit was exceeded:
slurmd[rosa12]: *** JOB 377847 CANCELLED AT 17:03:36 DUE TO TIME LIMIT ***
Attached is the log file of the Cray XE machine for the specific time range. |
| Comments |
| Comment by Peter Jones [ 18/Jul/12 ] |
|
Oleg, could you please look into this one? Thanks, Peter |
| Comment by Cliff White (Inactive) [ 18/Jul/12 ] |
|
I am still seeing these messages:
LNetError: 1480:0:(o2iblnd_cb.c:2273:kiblnd_passive_connect()) Can't accept 148.187.6.201@o2ib2: incompatible queue depth 8 (16 wanted)
This indicates that your parameters are not the same on all nodes, which could impact the issue; please make sure that the IB parameters match everywhere. |
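As a minimal sketch of how the relevant settings could be compared across nodes (the parameter names below are the usual ko2iblnd module options, and the advertised queue depth is normally derived from peer_credits; this is an assumed illustration, not a command taken from this ticket):
# run on every server/router with an o2ib NID and compare the output
for p in peer_credits peer_credits_hiw concurrent_sends map_on_demand; do
    printf '%s=' "$p"; cat /sys/module/ko2iblnd/parameters/$p 2>/dev/null || echo n/a
done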
| Comment by Fabio Verzelloni [ 18/Jul/12 ] |
|
Hi Cliff, Thanks |
| Comment by Liang Zhen (Inactive) [ 18/Jul/12 ] |
|
Hi Fabio, a few questions:
|
| Comment by Isaac Huang (Inactive) [ 18/Jul/12 ] |
|
Hi Fabio, can you please also run 'ibcheckerrors' to make sure the IB fabric is clean? Sometimes those RDMA timeout errors are caused by a faulty fabric. It makes sense to me to first make sure the network itself is healthy. |
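One possible way to run that check (ibcheckerrors is part of the infiniband-diags package; clearing the counters first is optional but makes fresh errors easier to spot -- treat this as a sketch and verify against your own fabric tooling):
ibclearerrors   # optional: zero the per-port error counters first
ibcheckerrors   # walk the fabric and report ports whose error counters exceed the threshold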
| Comment by Colin McMurtrie [ 19/Jul/12 ] |
|
The two log files (sdb.log and smw.log) are from the Cray XE6 so the events logged there relate to the clients running on the Cray (compute nodes and service nodes running the LNET routers). Nid833 and c7-0c1s0n1 refer to the same compute node (i.e. they are different names for the same thing). I have attached the file nid2CrayMapping.txt so that you can see this mapping for all nodes on our Cray XE6. |
| Comment by Liang Zhen (Inactive) [ 19/Jul/12 ] |
|
I'm still trying to understand the network topology here; could you give a description of it? I think the OSSs are on the o2ib network, correct? But after I checked sdb.log I saw this:
Jul 17 16:37:52 nid00394 kernel: LNet: 12047:0:(gnilnd_conn.c:1872:kgnilnd_reaper_dgram_check()) GNILND_DGRAM_REQ datagram to 12@gni timed out @ 128s dgram 0xffff8803ef766b48 state GNILND_DGRAM_POSTED conn 0xffff8803b4b07000
Jul 17 16:37:52 nid01530 kernel: LNet: 12033:0:(gnilnd_conn.c:1872:kgnilnd_reaper_dgram_check()) GNILND_DGRAM_REQ datagram to 50@gni timed out @ 128s dgram 0xffff880407221b08 state GNILND_DGRAM_POSTED conn 0xffff8803cedc0000
Jul 17 16:37:54 nid01530 kernel: LNet: could not send to 50@gni due to connection setup failure after 130 seconds
Jul 17 16:37:54 nid01530 kernel: LNet: 12028:0:(gnilnd_cb.c:1104:kgnilnd_tx_done()) $$ error -113 on tx 0xffff8803e6be9b68-><?> id 0/0 state GNILND_TX_ALLOCD age 130s msg@0xffff8803e6be9be8 m/v/ty/ck/pck/pl b00fbabe/8/2/0/22d/0 x0:GNILND_MSG_IMMEDIATE
Jul 17 16:37:58 nid00394 kernel: LNet: could not send to 12@gni due to connection setup failure after 134 seconds
Jul 17 16:37:58 nid00394 kernel: LNet: 12042:0:(gnilnd_cb.c:1104:kgnilnd_tx_done()) $$ error -113 on tx 0xffff8803fdcab248-><?> id 0/0 state GNILND_TX_ALLOCD age 134s msg@0xffff8803fdcab2c8 m/v/ty/ck/pck/pl b00fbabe/8/2/0/24d4/0 x0:GNILND_MSG_IMMEDIATE
I checked nid2CrayMapping.txt; these nodes (nid00394, nid01530) are marked as "service". Does that mean those OSSs are also acting as routers, or are they dedicated routers that are also marked as "service" nodes? Also, what are the o2ib NIDs of these two nodes (nid00394, nid01530)? What are the hostnames of 12@gni and 50@gni? We need to find out errors/logs on a message path (OSS< |
| Comment by Isaac Huang (Inactive) [ 19/Jul/12 ] |
|
Hi Colin and Fabio, I'm an Intel/Whamcloud engineer (even though my email address isn't from either) working with Liang on this bug. I'd appreciate a couple of things from you:
|
| Comment by Fabio Verzelloni [ 19/Jul/12 ] |
|
Dear Liang,
These are the router nodes "inside" the Cray XE system; these nodes have both IB & GNI.
The following lines are from weisshorn, which is basically where Lustre is built:
[root@weisshorn01 ~]# lctl show_route
These nodes have only IB. Weisshorn & the Cray are two completely separate things; the nodes nid00394 and nid01530 are routers:
nid00394:~ # lctl list_nids
In Cray terms, a node called 12@gni, or "any_number"@gni, if it is a compute node, can basically be referred to in several different ways; as an example, a node called 50@gni will be:
50@gni
What you see marked as "service" should be router nodes or frontend nodes. Please let me know if you need more details. |
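To make the topology unambiguous, one option is to capture the same two commands on each class of node and post the full output (the node names here are only placeholders taken from the comment above):
# on a router (e.g. nid00394): should list both the gni and the o2ib NID
lctl list_nids
# on a Lustre server (e.g. weisshorn01): shows which routers are used to reach each remote network
lctl show_route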
| Comment by Isaac Huang (Inactive) [ 19/Jul/12 ] |
|
Hi Fabio, thanks for the feedback. It's about 3AM for Liang now, so he's likely not going to respond soon. Please have a look at my previous comment where additional data was requested. |
| Comment by Cliff White (Inactive) [ 19/Jul/12 ] |
|
Tarball of /proc/sys/lnet on the MDS |
| Comment by Cliff White (Inactive) [ 19/Jul/12 ] |
|
Better tarball. Still need this from the routers |
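For the routers, a minimal collection sketch (plain tar tends to store /proc files with zero length, so copying the contents explicitly is safer; the paths are just examples):
mkdir -p /tmp/lnet-state
for f in /proc/sys/lnet/*; do
    [ -f "$f" ] && cat "$f" > /tmp/lnet-state/$(basename "$f") 2>/dev/null
done
tar czf /tmp/lnet-state-$(hostname).tgz -C /tmp lnet-state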
| Comment by Isaac Huang (Inactive) [ 19/Jul/12 ] |
|
First of all, Cliff did an ibcheckerrors, which said:
I'm not sure how serious the problem is, as the error counters could have been accumulating for months. But I think it's a good idea to have your IB admin double check that the IB fabric is running OK. Such problems can be hard to nail down when they begin manifesting themselves at upper layers. From the data available, and under the assumption that the nodes' system clocks are roughly synchronized, to seconds at least, here's my speculation of what happened.
To fix it, I'd suggest to:
|
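One common remediation in this area is to enable LNet router health checking on the clients and servers; a sketch with illustrative values (these are assumptions, not settings taken from this ticket):
# e.g. /etc/modprobe.d/lustre.conf on clients and servers (not on the routers themselves)
options lnet live_router_check_interval=60 dead_router_check_interval=60 router_ping_timeout=50 check_routers_before_use=1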
| Comment by Fabio Verzelloni [ 20/Jul/12 ] |
|
Dear Isaac, Regards |
| Comment by Isaac Huang (Inactive) [ 20/Jul/12 ] |
|
Hi Fabio, three more notes on collecting data on routers:
Thanks! |
| Comment by Liang Zhen (Inactive) [ 20/Jul/12 ] |
|
Isaac, I think you meant "options kgnilnd peer_health=1", correct? (peer_health is a boolean.) I remember that the ko2iblnd peer_buffer_credits on the routers is set to 128, but this needs to be verified; Fabio, could you check this? As Isaac said, all module parameters on the routers could be helpful. I saw various versions of the parameters posted on the other ticket, but I'm not sure which one is your final choice, so could you post them here? |
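To capture the current values on a router, the standard /sys/module interface can be read directly (parameter sets differ between LND versions, so missing files are expected; this is only a sketch):
# on each LNET router
grep -r . /sys/module/lnet/parameters/ 2>/dev/null
grep -r . /sys/module/ko2iblnd/parameters/ 2>/dev/null
grep -r . /sys/module/kgnilnd/parameters/ 2>/dev/null
# the two parameters mentioned above
cat /sys/module/ko2iblnd/parameters/peer_buffer_credits
cat /sys/module/kgnilnd/parameters/peer_health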
| Comment by Fabio Verzelloni [ 20/Jul/12 ] |
|
We are having a filesystem hang right now; can you please connect to weisshorn and have a look? I'm here in case you need any help with logs, details, etc. Thanks |
| Comment by Isaac Huang (Inactive) [ 20/Jul/12 ] |
|
| Comment by Fabio Verzelloni [ 20/Jul/12 ] |
|
> Please enable console logging of network errors: echo +neterror > /proc/sys/lnet/printk
> 1- Server IB network:
Today I'll have a look with the network administrator.
> 2- Client GNI network:
Yes, the gni network errors were from nodes that were halted.
> 3- On routers:
I'll do it at the first reboot of the cluster. |
| Comment by Fabio Verzelloni [ 20/Jul/12 ] |
|
I'm going to run an fsck on the MDT. Fabio |
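A typical safe sequence is a read-only pass first (the device path is a placeholder and the MDT must be stopped/unmounted before e2fsck runs; this is an assumed sketch, not the exact procedure used here):
umount /mnt/mdt                # or stop the MDT however it is normally managed
e2fsck -fn /dev/mdt_device     # -n: read-only check, only report problems
# only after reviewing the read-only output:
# e2fsck -fp /dev/mdt_device   # fix straightforward problems automatically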
| Comment by Fabio Verzelloni [ 20/Jul/12 ] |
|
Is it normal that the MDT is using 110 GB? I think I've never seen the MDT so full. Thanks |
| Comment by Liang Zhen (Inactive) [ 20/Jul/12 ] |
|
I've added Fanyong to the CC list. I think if the MDT size is growing faster than you expected, then it's very likely because our OI files are growing forever. Fanyong, could you comment on this? |
| Comment by Liang Zhen (Inactive) [ 20/Jul/12 ] |
|
Sorry, deleted comment that was posted on the wrong ticket. |
| Comment by nasf (Inactive) [ 23/Jul/12 ] |
|
Hi Fabio, can you please mount the MDT device as type "ldiskfs", then check which file(s) consumed such a large amount of space? |
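A minimal sketch of that inspection (the mount point and device name are placeholders; mounting read-only avoids touching the MDT while it is being examined):
mkdir -p /mnt/mdt_ldiskfs
mount -t ldiskfs -o ro /dev/mdt_device /mnt/mdt_ldiskfs
du -skx /mnt/mdt_ldiskfs/* | sort -rn | head -20   # largest top-level entries, in KB
ls -lhS /mnt/mdt_ldiskfs | head                    # largest files in the root, where the OI files live
umount /mnt/mdt_ldiskfs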