Details
-
Bug
-
Resolution: Duplicate
-
Critical
-
None
-
Lustre 2.8.0, Lustre 2.9.0
-
None
-
Power8 running RHEL with a MOFED 3.3 stack.
-
3
Description
After moving to our production Power8 system running a MOFED stack, we are seeing a new IB error in ko2iblnd that we had not encountered before:
[ 170.597561] mlx5_warn:mlx5_0:dump_cqe:257:(pid 8738): dump error cqe
[ 170.597620] mlx5_warn:mlx5_0:dump_cqe:257:(pid 8714): dump error cqe
[ 170.597622] 00000000 00000000 00000000 00000000
[ 170.597623] 00000000 00000000 00000000 00000000
[ 170.597625] 00000000 00000000 00000000 00000000
[ 170.597626] 00000000 08007806 25000039 0642b3d2
[ 170.597651] LNet: 8714:0:(o2iblnd_cb.c:3433:kiblnd_complete()) FastReg failed: 6
[ 170.597728] LNet: 8713:0:(o2iblnd_cb.c:3444:kiblnd_complete()) RDMA (tx: c000003c6a78c5a8) failed: 5
[ 170.598355] 00000000 00000000 00000000 00000000
[ 170.598403] 00000000 00000000 00000000 00000000
[ 170.599245] powernv-cpufreq: CPU 104 on Chip 1 has Pmax restored to 0
[ 170.599647] LNet: 8714:0:(o2iblnd_cb.c:990:kiblnd_tx_complete()) Tx -> 10.39.232.11@o2ib6 cookie 0x63e sending 1 waiting 0: failed 5
[ 170.599651] LNet: 8714:0:(o2iblnd_cb.c:990:kiblnd_tx_complete()) Skipped 2 previous similar messages
[ 170.599654] LNet: 8713:0:(o2iblnd_cb.c:1934:kiblnd_close_conn_locked()) Closing conn to 10.39.232.11@o2ib6: error -5(waiting)
[ 170.599669] LustreError: 8714:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc c000003c62cf5c00
[ 170.599675] Lustre: 8896:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1476124274/real 1476124274] req@c000003c4e340000 x1547828424878916/t0(0) o4->atlastds-OST0035-osc-c000001fc5b75000@10.36.226.69@o2ib:6/4 lens 608/448 e 0 to 1 dl 1476124841 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[ 170.599681] Lustre: atlastds-OST0035-osc-c000001fc5b75000: Connection to atlastds-OST0035 (at 10.36.226.69@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[ 170.611219] 00000000 00000000 00000000 00000000
[ 170.612270] 00000000 08007806 2500003a 06789cd2
[ 170.613866] LustreError: 8737:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc c000001fb98c0400
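
For readers triaging the log above, the numeric codes on the kiblnd_complete() lines can be decoded with a small script. This is only a sketch and rests on an assumption: that the values printed after "failed:" are ib_wc_status work-completion codes as enumerated in the Linux kernel's include/rdma/ib_verbs.h. Under that assumption, 6 is IB_WC_MW_BIND_ERR and 5 is IB_WC_WR_FLUSH_ERR. The function name decode_failures and the regular expression are illustrative and not part of Lustre.

```python
import re

# First entries of enum ib_wc_status (Linux include/rdma/ib_verbs.h).
# Assumption: kiblnd_complete() prints the work-completion status directly.
IB_WC_STATUS = {
    0: "IB_WC_SUCCESS",
    1: "IB_WC_LOC_LEN_ERR",
    2: "IB_WC_LOC_QP_OP_ERR",
    3: "IB_WC_LOC_EEC_OP_ERR",
    4: "IB_WC_LOC_PROT_ERR",
    5: "IB_WC_WR_FLUSH_ERR",
    6: "IB_WC_MW_BIND_ERR",
}

# Matches only kiblnd_complete() lines, not kiblnd_tx_complete().
PATTERN = re.compile(r"kiblnd_complete\(\)\) (FastReg|RDMA)\b.*failed: (\d+)")


def decode_failures(log_text):
    """Return (operation, code, status name) for each kiblnd_complete() failure."""
    results = []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            code = int(match.group(2))
            name = IB_WC_STATUS.get(code, "unknown")
            results.append((match.group(1), code, name))
    return results


if __name__ == "__main__":
    sample = (
        "[ 170.597651] LNet: 8714:0:(o2iblnd_cb.c:3433:kiblnd_complete()) FastReg failed: 6\n"
        "[ 170.597728] LNet: 8713:0:(o2iblnd_cb.c:3444:kiblnd_complete()) RDMA (tx: c000003c6a78c5a8) failed: 5\n"
    )
    for op, code, name in decode_failures(sample):
        print(f"{op}: {code} ({name})")
```

If the assumption holds, the FastReg failure is the primary error, and the RDMA failure with status 5 is a work request flushed when the queue pair entered the error state. The -5 values on the later lines are a different kind of code: they are negative errno values (-EIO).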