[LU-3714] Single client data copy from/to lfs hangs client. [server,client]bulk_callback errors Created: 06/Aug/13  Updated: 07/Aug/13  Resolved: 07/Aug/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.6
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Jeff Johnson (Inactive) Assignee: WC Triage
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

CentOS 6.4, 2.6.32-358.11.1.el6_lustre.x86_64, Intel Truescale IB/QDR, single-rail, in-kernel infiniband.


Severity: 3
Epic: client, hang, server
Rank (Obsolete): 9565

 Description   

Fresh boot/mount of lfs 2.1.6. Pre-existing ldiskfs OSTs, lfs upgraded from 2.1.5. Single client mount of lfs via o2ib. Copy of 2GB files from/to lfs causes client hang and loss of connection to two OSS nodes.

Datafile creation:

cd /lustre2 ; tar cf ./test.tar /usr

Simple copy test:

for i in `cat iter`; do cp test.tar test.tar.$i; done

After 40GB of data transfer (2GB, read & write to new file, 10 files) the client process hangs.

Logs from the MDS, OSS, and client show no IB LID or other hardware errors.

Output from /var/log/messages
MDS:

Aug  6 13:27:38 lustrefs-sys-mds0 kernel: Lustre: 7848:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request  sent has failed due to network error: [sent 1375813658/real 1375813658]  req@ffff88047b2bc800 x1442643895649363/t0(0) o8->lustrefssys-OST000a-osc-MDT0000@10.148.0.154@o2ib:28/4 lens 368/512 e 0 to 1 dl 1375813713 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Aug  6 13:27:38 lustrefs-sys-mds0 kernel: Lustre: 7848:0:(client.c:1817:ptlrpc_expire_one_request()) Skipped 23 previous similar messages
Aug  6 13:28:00 lustrefs-sys-mds0 kernel: Lustre: 7892:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection from d053eba6-b0f0-eafb-4a55-cb86e1c046fb@10.148.0.154@o2ib t0 exp (null) cur 1375813680 last 0
Aug  6 13:28:00 lustrefs-sys-mds0 kernel: Lustre: 7892:0:(ldlm_lib.c:952:target_handle_connect()) Skipped 4 previous similar messages
Aug  6 13:28:03 lustrefs-sys-mds0 kernel: Lustre: lustrefssys-OST000a-osc-MDT0000: Connection restored to lustrefssys-OST000a (at 10.148.0.154@o2ib)
Aug  6 13:28:03 lustrefs-sys-mds0 kernel: Lustre: MDS mdd_obd-lustrefssys-MDT0000: lustrefssys-OST000a_UUID now active, resetting orphans
Aug  6 13:28:03 lustrefs-sys-mds0 kernel: Lustre: Skipped 14 previous similar messages
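
The `sent`/`real` values and the `dl` field in the request dumps look like Unix epoch seconds. Reading `dl` as the request deadline is an assumption based on typical Lustre debug output, not something this ticket states. A minimal sketch that converts the values from the first MDS line above to UTC:

```python
from datetime import datetime, timezone

# Values copied from the first MDS log line above.
SENT = 1375813658      # "sent" field
DEADLINE = 1375813713  # "dl" field (assumed to be the request deadline)


def to_utc(epoch_seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


sent_utc = to_utc(SENT)
window = DEADLINE - SENT  # seconds between send time and assumed deadline

print(sent_utc.isoformat())
print(window)
```

The converted send time is 18:27:38 UTC, while syslog stamps the same line 13:27:38, which suggests the hosts log in a UTC-5 local zone. Under the deadline assumption, the request had a 55-second window.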

OSS10:

Aug  6 13:28:00 lustrefs-sys-oss10 kernel: Lustre: 7777:0:(ldlm_lib.c:952:target_handle_connect()) lustrefssys-OST000a: connection from 4ece0c04-00b5-aedd-f612-11cbcc7fb566@10.148.0.143@o2ib recovering/t0 exp (null) cur 1375813680 last 0
Aug  6 13:28:00 lustrefs-sys-oss10 kernel: Lustre: lustrefssys-OST000a: Denying connection for new client 10.148.0.143@o2ib (at 4ece0c04-00b5-aedd-f612-11cbcc7fb566), waiting for 0 clients in recovery for 5:00
Aug  6 13:28:00 lustrefs-sys-oss10 kernel: Lustre: MGC10.148.0.142@o2ib: Reactivating import
Aug  6 13:28:00 lustrefs-sys-oss10 kernel: Lustre: 7777:0:(ldlm_lib.c:952:target_handle_connect()) lustrefssys-OST000a: connection from 4ece0c04-00b5-aedd-f612-11cbcc7fb566@10.148.0.143@o2ib recovering/t0 exp (null) cur 1375813680 last 0
Aug  6 13:28:00 lustrefs-sys-oss10 kernel: Lustre: lustrefssys-OST000a: Denying connection for new client 10.148.0.143@o2ib (at 4ece0c04-00b5-aedd-f612-11cbcc7fb566), waiting for 0 clients in recovery for 4:59
Aug  6 13:28:03 lustrefs-sys-oss10 ntpd[7983]: Listening on interface #7 ib0, fe80::211:7500:77:dc5a#123 Enabled
Aug  6 13:28:03 lustrefs-sys-oss10 kernel: Lustre: 7777:0:(ldlm_lib.c:952:target_handle_connect()) lustrefssys-OST000a: connection from lustrefssys-MDT0000-mdtlov_UUID@10.148.0.142@o2ib recovering/t0 exp ffff880270301000 cur 1375813683 last 1375812940
Aug  6 13:28:03 lustrefs-sys-oss10 kernel: Lustre: lustrefssys-OST000a: sending delayed replies to recovered clients
Aug  6 13:28:03 lustrefs-sys-oss10 kernel: Lustre: lustrefssys-OST000a: received MDS connection from 10.148.0.142@o2ib
Aug  6 13:28:09 lustrefs-sys-oss10 ntpd[7983]: synchronized to 198.122.144.26, stratum 2
Aug  6 13:28:25 lustrefs-sys-oss10 kernel: Lustre: 7777:0:(ldlm_lib.c:952:target_handle_connect()) lustrefssys-OST000a: connection from 4ece0c04-00b5-aedd-f612-11cbcc7fb566@10.148.0.143@o2ib t0 exp (null) cur 1375813705 last 0

oss06:

Aug  6 13:31:47 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff8804795b6000
Aug  6 13:31:47 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff8804795b6000
Aug  6 13:31:47 lustrefs-sys-oss06 kernel: LustreError: 8060:0:(ldlm_lib.c:2685:target_bulk_io()) @@@ network error on bulk GET 0(1048576)  req@ffff88026672a850 x1442644628631725/t0(0) o4->4ece0c04-00b5-aedd-f612-11cbcc7fb566@10.148.0.143@o2ib:0/0 lens 456/416 e 1 to 0 dl 1375813926 ref 1 fl Interpret:/0/0 rc 0/0
Aug  6 13:31:47 lustrefs-sys-oss06 kernel: Lustre: lustrefssys-OST0006: Bulk IO write error with 4ece0c04-00b5-aedd-f612-11cbcc7fb566 (at 10.148.0.143@o2ib), client will retry: rc -110
Aug  6 13:32:06 lustrefs-sys-oss06 kernel: Lustre: lustrefssys-OST0006: Client 4ece0c04-00b5-aedd-f612-11cbcc7fb566 (at 10.148.0.143@o2ib) reconnecting
Aug  6 13:32:06 lustrefs-sys-oss06 kernel: Lustre: 7930:0:(ldlm_lib.c:952:target_handle_connect()) lustrefssys-OST0006: connection from 4ece0c04-00b5-aedd-f612-11cbcc7fb566@10.148.0.143@o2ib t9859661 exp ffff88025da7b000 cur 1375813926 last 1375813926
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff8802591e8000
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff8802591e8000
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 8056:0:(ldlm_lib.c:2685:target_bulk_io()) @@@ network error on bulk GET 0(1048576)  req@ffff880264a09800 x1442644628631910/t0(0) o4->4ece0c04-00b5-aedd-f612-11cbcc7fb566@10.148.0.143@o2ib:0/0 lens 456/416 e 0 to 0 dl 1375813969 ref 1 fl Interpret:/2/0 rc 0/0
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff8802592bc000
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff8802592bc000
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff8802592be000
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff8802592be000
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: Lustre: lustrefssys-OST0006: Bulk IO write error with 4ece0c04-00b5-aedd-f612-11cbcc7fb566 (at 10.148.0.143@o2ib), client will retry: rc -110
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff8802592c4000
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff8802592c4000
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff88045232c000
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff88045232c000
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff88045232e000
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff88045232e000
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff880452330000
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff880452330000
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff880452332000
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff880452332000
Aug  6 13:32:19 lustrefs-sys-oss06 kernel: LustreError: 8056:0:(ldlm_lib.c:2685:target_bulk_io()) Skipped 7 previous similar messages
Aug  6 13:32:43 lustrefs-sys-oss06 kernel: Lustre: lustrefssys-OST0006: Client 4ece0c04-00b5-aedd-f612-11cbcc7fb566 (at 10.148.0.143@o2ib) reconnecting
Aug  6 13:32:43 lustrefs-sys-oss06 kernel: Lustre: 7930:0:(ldlm_lib.c:952:target_handle_connect()) lustrefssys-OST0006: connection from 4ece0c04-00b5-aedd-f612-11cbcc7fb566@10.148.0.143@o2ib t9859661 exp ffff88025da7b000 cur 1375813963 last 1375813963
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff880259354000
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff880259354000
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 8056:0:(ldlm_lib.c:2685:target_bulk_io()) @@@ network error on bulk GET 0(1048576)  req@ffff88025a3f6800 x1442644628631943/t0(0) o4->4ece0c04-00b5-aedd-f612-11cbcc7fb566@10.148.0.143@o2ib:0/0 lens 456/416 e 0 to 0 dl 1375814006 ref 1 fl Interpret:/2/0 rc 0/0
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff880259356000
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff880259356000
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff880259366000
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: Lustre: lustrefssys-OST0006: Bulk IO write error with 4ece0c04-00b5-aedd-f612-11cbcc7fb566 (at 10.148.0.143@o2ib), client will retry: rc -110
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: Lustre: Skipped 7 previous similar messages
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff880259366000
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff8802592dc000
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff8802592dc000
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff8804523b2000
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff8804523b2000
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff88047a6b0000
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff88047a6b0000
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff880259360000
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff880259360000
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff880259358000
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff880259358000
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: LustreError: 8056:0:(ldlm_lib.c:2685:target_bulk_io()) Skipped 7 previous similar messages
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: Lustre: 2209:0:(o2iblnd_cb.c:2341:kiblnd_passive_connect()) Conn race 10.148.0.143@o2ib
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: Lustre: lustrefssys-OST0006: Client 4ece0c04-00b5-aedd-f612-11cbcc7fb566 (at 10.148.0.143@o2ib) reconnecting
Aug  6 13:32:56 lustrefs-sys-oss06 kernel: Lustre: 7930:0:(ldlm_lib.c:952:target_handle_connect()) lustrefssys-OST0006: connection from 4ece0c04-00b5-aedd-f612-11cbcc7fb566@10.148.0.143@o2ib t9859661 exp ffff88025da7b000 cur 1375813976 last 1375813976
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff880259366000
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff880259366000
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff880259354000
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: LustreError: 8053:0:(ldlm_lib.c:2685:target_bulk_io()) @@@ network error on bulk GET 0(1048576)  req@ffff880264abf400 x1442644628631952/t0(0) o4->4ece0c04-00b5-aedd-f612-11cbcc7fb566@10.148.0.143@o2ib:0/0 lens 456/416 e 0 to 0 dl 1375814019 ref 1 fl Interpret:/2/0 rc 0/0
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: Lustre: lustrefssys-OST0006: Bulk IO write error with 4ece0c04-00b5-aedd-f612-11cbcc7fb566 (at 10.148.0.143@o2ib), client will retry: rc -110
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: Lustre: Skipped 7 previous similar messages
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff880259354000
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff880259358000
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff880259358000
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff880259208000
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff880259208000
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff8804523b2000
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff8804523b2000
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff88047a6b0000
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff88047a6b0000
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff88045236c000
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff88045236c000
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff8804523e2000
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff8804523e2000
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: Lustre: 2208:0:(o2iblnd_cb.c:2341:kiblnd_passive_connect()) Conn race 10.148.0.143@o2ib
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: Lustre: lustrefssys-OST0006: Client 4ece0c04-00b5-aedd-f612-11cbcc7fb566 (at 10.148.0.143@o2ib) reconnecting
Aug  6 13:33:09 lustrefs-sys-oss06 kernel: Lustre: 7930:0:(ldlm_lib.c:952:target_handle_connect()) lustrefssys-OST0006: connection from 4ece0c04-00b5-aedd-f612-11cbcc7fb566@10.148.0.143@o2ib t9859661 exp ffff88025da7b000 cur 1375813989 last 1375813989
Aug  6 13:33:22 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 4, status -103, desc ffff8804523e2000
Aug  6 13:33:22 lustrefs-sys-oss06 kernel: LustreError: 7738:0:(events.c:396:server_bulk_callback()) event type 2, status -103, desc ffff8804523e2000
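
The `status` and `rc` values in this ticket's logs are negated Linux errno numbers. As a reading aid, the sketch below decodes the four codes that occur in the server and client excerpts. The table is hard-coded with the standard Linux values so the result does not depend on the platform running the script. The helper name `decode_status` is hypothetical.

```python
# Linux errno numbers (asm-generic values) for the codes seen in this ticket.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    16: ("EBUSY", "Device or resource busy"),
    103: ("ECONNABORTED", "Software caused connection abort"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def decode_status(status):
    """Map a negative kernel status code to its errno name and message."""
    name, message = LINUX_ERRNO.get(abs(status), ("UNKNOWN", "not in table"))
    return f"{status}: {name} ({message})"


for code in (-103, -110, -5, -16):
    print(decode_status(code))
```

Read this way, the OSS sees bulk transfers aborted (-103) and then reports a timed-out write (-110), while the client sees I/O errors (-5) on its bulk descriptors and one busy reply (-16) on reconnect.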

lustre-client:

Aug  6 13:31:41 lustrefs-sys-mds1 kernel: Lustre: 9547:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for sent delay: [sent 1375813894/real 0]  req@ffff8802cb514400 x1442644628631729/t0(0) o4->lustrefssys-OST0006-osc-ffff88027c008800@10.148.0.150@o2ib:6/4 lens 456/416 e 0 to 1 dl 1375813901 ref 3 fl Rpc:X/0/ffffffff rc 0/-1
Aug  6 13:31:41 lustrefs-sys-mds1 kernel: Lustre: lustrefssys-OST0006-osc-ffff88027c008800: Connection to lustrefssys-OST0006 (at 10.148.0.150@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Aug  6 13:31:47 lustrefs-sys-mds1 kernel: LustreError: 9527:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff8802a1370000
Aug  6 13:32:06 lustrefs-sys-mds1 kernel: Lustre: lustrefssys-OST0006-osc-ffff88027c008800: Connection restored to lustrefssys-OST0006 (at 10.148.0.150@o2ib)
Aug  6 13:32:18 lustrefs-sys-mds1 kernel: Lustre: 9547:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1375813926/real 1375813926]  req@ffff8802cb514400 x1442644628631911/t0(0) o4->lustrefssys-OST0006-osc-ffff88027c008800@10.148.0.150@o2ib:6/4 lens 456/416 e 0 to 1 dl 1375813938 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Aug  6 13:32:18 lustrefs-sys-mds1 kernel: Lustre: 9547:0:(client.c:1817:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Aug  6 13:32:18 lustrefs-sys-mds1 kernel: Lustre: lustrefssys-OST0006-osc-ffff88027c008800: Connection to lustrefssys-OST0006 (at 10.148.0.150@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Aug  6 13:32:19 lustrefs-sys-mds1 kernel: LustreError: 9528:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c3c6000
Aug  6 13:32:19 lustrefs-sys-mds1 kernel: LustreError: 9521:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c244000
Aug  6 13:32:19 lustrefs-sys-mds1 kernel: LustreError: 9526:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d2ae000
Aug  6 13:32:19 lustrefs-sys-mds1 kernel: LustreError: 9527:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c3cc000
Aug  6 13:32:19 lustrefs-sys-mds1 kernel: LustreError: 9523:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d272000
Aug  6 13:32:19 lustrefs-sys-mds1 kernel: LustreError: 9524:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d23c000
Aug  6 13:32:19 lustrefs-sys-mds1 kernel: LustreError: 9525:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff8802a1370000
Aug  6 13:32:19 lustrefs-sys-mds1 kernel: LustreError: 9522:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c362000
Aug  6 13:32:43 lustrefs-sys-mds1 kernel: Lustre: lustrefssys-OST0006-osc-ffff88027c008800: Connection restored to lustrefssys-OST0006 (at 10.148.0.150@o2ib)
Aug  6 13:32:56 lustrefs-sys-mds1 kernel: LustreError: 9523:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d23c000
Aug  6 13:32:56 lustrefs-sys-mds1 kernel: LustreError: 9521:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d272000
Aug  6 13:32:56 lustrefs-sys-mds1 kernel: LustreError: 9525:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c3cc000
Aug  6 13:32:56 lustrefs-sys-mds1 kernel: LustreError: 9524:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c244000
Aug  6 13:32:56 lustrefs-sys-mds1 kernel: LustreError: 9526:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c3c6000
Aug  6 13:32:56 lustrefs-sys-mds1 kernel: LustreError: 9527:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c362000
Aug  6 13:32:56 lustrefs-sys-mds1 kernel: LustreError: 9522:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d2ae000
Aug  6 13:32:56 lustrefs-sys-mds1 kernel: LustreError: 9528:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff8802a1370000
Aug  6 13:32:56 lustrefs-sys-mds1 kernel: Lustre: lustrefssys-OST0006-osc-ffff88027c008800: Connection to lustrefssys-OST0006 (at 10.148.0.150@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Aug  6 13:32:56 lustrefs-sys-mds1 kernel: Lustre: lustrefssys-OST0006-osc-ffff88027c008800: Connection restored to lustrefssys-OST0006 (at 10.148.0.150@o2ib)
Aug  6 13:33:09 lustrefs-sys-mds1 kernel: LustreError: 9528:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c244000
Aug  6 13:33:09 lustrefs-sys-mds1 kernel: LustreError: 9521:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff8802a1370000
Aug  6 13:33:09 lustrefs-sys-mds1 kernel: LustreError: 9522:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c3cc000
Aug  6 13:33:09 lustrefs-sys-mds1 kernel: LustreError: 9527:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d23c000
Aug  6 13:33:09 lustrefs-sys-mds1 kernel: LustreError: 9526:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c362000
Aug  6 13:33:09 lustrefs-sys-mds1 kernel: LustreError: 9525:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d272000
Aug  6 13:33:09 lustrefs-sys-mds1 kernel: LustreError: 9524:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d2ae000
Aug  6 13:33:09 lustrefs-sys-mds1 kernel: LustreError: 9523:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c3c6000
Aug  6 13:33:09 lustrefs-sys-mds1 kernel: Lustre: lustrefssys-OST0006-osc-ffff88027c008800: Connection to lustrefssys-OST0006 (at 10.148.0.150@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Aug  6 13:33:09 lustrefs-sys-mds1 kernel: Lustre: lustrefssys-OST0006-osc-ffff88027c008800: Connection restored to lustrefssys-OST0006 (at 10.148.0.150@o2ib)
Aug  6 13:33:22 lustrefs-sys-mds1 kernel: LustreError: 9526:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c362000
Aug  6 13:33:22 lustrefs-sys-mds1 kernel: LustreError: 9521:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d23c000
Aug  6 13:33:22 lustrefs-sys-mds1 kernel: LustreError: 9525:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c3cc000
Aug  6 13:33:22 lustrefs-sys-mds1 kernel: LustreError: 9528:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c3c6000
Aug  6 13:33:22 lustrefs-sys-mds1 kernel: LustreError: 9524:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c244000
Aug  6 13:33:22 lustrefs-sys-mds1 kernel: LustreError: 9527:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d272000
Aug  6 13:33:22 lustrefs-sys-mds1 kernel: LustreError: 9522:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d2ae000
Aug  6 13:33:22 lustrefs-sys-mds1 kernel: Lustre: lustrefssys-OST0006-osc-ffff88027c008800: Connection to lustrefssys-OST0006 (at 10.148.0.150@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Aug  6 13:33:22 lustrefs-sys-mds1 kernel: LustreError: 9523:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff8802a1370000
Aug  6 13:33:22 lustrefs-sys-mds1 kernel: Lustre: lustrefssys-OST0006-osc-ffff88027c008800: Connection restored to lustrefssys-OST0006 (at 10.148.0.150@o2ib)
Aug  6 13:33:35 lustrefs-sys-mds1 kernel: LustreError: 9525:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c3cc000
Aug  6 13:33:35 lustrefs-sys-mds1 kernel: LustreError: 9527:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d23c000
Aug  6 13:33:35 lustrefs-sys-mds1 kernel: Lustre: 9547:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request  sent has failed due to network error: [sent 1375814002/real 1375814002]  req@ffff8802cb514400 x1442644628631991/t0(0) o4->lustrefssys-OST0006-osc-ffff88027c008800@10.148.0.150@o2ib:6/4 lens 456/416 e 0 to 1 dl 1375814019 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Aug  6 13:33:35 lustrefs-sys-mds1 kernel: Lustre: 9547:0:(client.c:1817:ptlrpc_expire_one_request()) Skipped 32 previous similar messages
Aug  6 13:33:35 lustrefs-sys-mds1 kernel: LustreError: 9523:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c362000
Aug  6 13:33:35 lustrefs-sys-mds1 kernel: LustreError: 9528:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d272000
Aug  6 13:33:35 lustrefs-sys-mds1 kernel: LustreError: 9524:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff8802a1370000
Aug  6 13:33:35 lustrefs-sys-mds1 kernel: LustreError: 9526:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c244000
Aug  6 13:33:35 lustrefs-sys-mds1 kernel: LustreError: 9521:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c3c6000
Aug  6 13:33:35 lustrefs-sys-mds1 kernel: Lustre: lustrefssys-OST0006-osc-ffff88027c008800: Connection to lustrefssys-OST0006 (at 10.148.0.150@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Aug  6 13:33:35 lustrefs-sys-mds1 kernel: Lustre: lustrefssys-OST0006-osc-ffff88027c008800: Connection restored to lustrefssys-OST0006 (at 10.148.0.150@o2ib)
Aug  6 13:33:48 lustrefs-sys-mds1 kernel: LustreError: 9523:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff8802a1370000
Aug  6 13:33:48 lustrefs-sys-mds1 kernel: LustreError: 9527:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c244000
Aug  6 13:33:48 lustrefs-sys-mds1 kernel: LustreError: 9524:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d272000
Aug  6 13:33:48 lustrefs-sys-mds1 kernel: LustreError: 9525:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c3cc000
Aug  6 13:33:48 lustrefs-sys-mds1 kernel: LustreError: 9522:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d2ae000
Aug  6 13:33:48 lustrefs-sys-mds1 kernel: LustreError: 9528:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c362000
Aug  6 13:33:48 lustrefs-sys-mds1 kernel: LustreError: 9526:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d23c000
Aug  6 13:33:48 lustrefs-sys-mds1 kernel: LustreError: 9521:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c3c6000
Aug  6 13:33:48 lustrefs-sys-mds1 kernel: Lustre: lustrefssys-OST0006-osc-ffff88027c008800: Connection to lustrefssys-OST0006 (at 10.148.0.150@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Aug  6 13:33:48 lustrefs-sys-mds1 kernel: Lustre: lustrefssys-OST0006-osc-ffff88027c008800: Connection restored to lustrefssys-OST0006 (at 10.148.0.150@o2ib)
Aug  6 13:34:01 lustrefs-sys-mds1 kernel: LustreError: 9521:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c362000
Aug  6 13:34:01 lustrefs-sys-mds1 kernel: LustreError: 9527:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d272000
Aug  6 13:34:01 lustrefs-sys-mds1 kernel: LustreError: 9525:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c244000
Aug  6 13:34:01 lustrefs-sys-mds1 kernel: LustreError: 9528:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c3c6000
Aug  6 13:34:01 lustrefs-sys-mds1 kernel: LustreError: 9524:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d23c000
Aug  6 13:34:01 lustrefs-sys-mds1 kernel: LustreError: 9526:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c3cc000
Aug  6 13:34:01 lustrefs-sys-mds1 kernel: LustreError: 9523:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d2ae000
Aug  6 13:34:01 lustrefs-sys-mds1 kernel: LustreError: 9522:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff8802a1370000
Aug  6 13:34:01 lustrefs-sys-mds1 kernel: LustreError: 11-0: an error occurred while communicating with 10.148.0.150@o2ib. The ost_connect operation failed with -16
Aug  6 13:34:26 lustrefs-sys-mds1 kernel: Lustre: lustrefssys-OST0006-osc-ffff88027c008800: Connection restored to lustrefssys-OST0006 (at 10.148.0.150@o2ib)
Aug  6 13:34:39 lustrefs-sys-mds1 kernel: LustreError: 9521:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d2ae000
Aug  6 13:34:39 lustrefs-sys-mds1 kernel: LustreError: 9525:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff8802a1370000
Aug  6 13:34:39 lustrefs-sys-mds1 kernel: LustreError: 9528:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c3c6000
Aug  6 13:34:39 lustrefs-sys-mds1 kernel: LustreError: 9523:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d272000
Aug  6 13:34:39 lustrefs-sys-mds1 kernel: LustreError: 9524:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c3cc000
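
The client errors arrive in bursts rather than continuously. The sketch below measures the spacing between bursts from four lines copied verbatim from the client log above. The regular expression is an assumption tailored to the `client_bulk_callback` lines in this ticket and is not a general Lustre log parser. The helper name `burst_gaps` is hypothetical.

```python
import re
from datetime import datetime

# Four client lines copied verbatim from the log above, one per error burst.
SAMPLE = """\
Aug  6 13:32:56 lustrefs-sys-mds1 kernel: LustreError: 9523:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029d23c000
Aug  6 13:33:09 lustrefs-sys-mds1 kernel: LustreError: 9528:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c244000
Aug  6 13:33:22 lustrefs-sys-mds1 kernel: LustreError: 9526:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c362000
Aug  6 13:33:35 lustrefs-sys-mds1 kernel: LustreError: 9525:0:(events.c:203:client_bulk_callback()) event type 0, status -5, desc ffff88029c3cc000
""".splitlines()

PATTERN = re.compile(
    r"^\w{3}\s+\d+ (?P<time>\d{2}:\d{2}:\d{2}) \S+ kernel: LustreError: "
    r"\d+:\d+:\(\S+?:\d+:(?P<func>\w+)\(\)\) "
    r"event type (?P<type>\d+), status (?P<status>-?\d+), desc \w+$"
)


def burst_gaps(lines):
    """Return the gaps in seconds between distinct error timestamps."""
    times = []
    for line in lines:
        match = PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not bulk callback errors
        stamp = datetime.strptime(match.group("time"), "%H:%M:%S")
        if not times or stamp != times[-1]:
            times.append(stamp)
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


gaps = burst_gaps(SAMPLE)
print(gaps)
```

In the full client log the same 13-second spacing holds from 13:32:56 through 13:34:01, which matches the repeating "Connection ... was lost" and "Connection restored" pairs for lustrefssys-OST0006.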

Client mount options: -o nochecksum -o flock

Intel Truescale IB module opts: singleport=1 krcvqs=3 pcie_caps=0x51 rcvhdrcnt=4096
Lustre module options: ko2iblnd map_on_demand=32

Filesystem description:
2.1.6 server and client. 18 OSS, 1 OST per OSS. Intel Truescale QDR single rail.

Note: the machine lustrefs-sys-mds1 is used as the client. It is not currently configured as an MDS.



 Comments   
Comment by Jeff Johnson (Inactive) [ 06/Aug/13 ]

Addendum:
MDS and OSS hardware description:
Dual socket Intel Xeon E5-2603, 16GB DDR3-1600, Intel Truescale QDR/IB QLE7342, Adaptec 6445 (aacraid)

Comment by Jeff Johnson (Inactive) [ 07/Aug/13 ]

This appears to be caused by a flawed SGI enhanced-hypercube routing configuration. Node-to-edge-switch connections are stable and error free. LNet RPCs are being lost in flight at random because of bad IB DOR tables. The issue is unresolved but does not appear to be a Lustre issue. Please close.

Comment by Peter Jones [ 07/Aug/13 ]

ok. Thanks Jeff

Generated at Sat Feb 10 01:36:17 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.