Details
-
Bug
-
Resolution: Duplicate
-
Critical
-
None
-
Lustre 2.8.0, Lustre 2.9.0
-
None
-
Power8 running RHEL with a MOFED 3.3 stack.
-
3
Description
After moving to our production Power8 system running a MOFED stack, we are seeing a new IB error in ko2iblnd that we had not encountered before:
[ 170.597561] mlx5_warn:mlx5_0:dump_cqe:257:(pid 8738): dump error cqe
[ 170.597620] mlx5_warn:mlx5_0:dump_cqe:257:(pid 8714): dump error cqe
[ 170.597622] 00000000 00000000 00000000 00000000
[ 170.597623] 00000000 00000000 00000000 00000000
[ 170.597625] 00000000 00000000 00000000 00000000
[ 170.597626] 00000000 08007806 25000039 0642b3d2
[ 170.597651] LNet: 8714:0:(o2iblnd_cb.c:3433:kiblnd_complete()) FastReg failed: 6
[ 170.597728] LNet: 8713:0:(o2iblnd_cb.c:3444:kiblnd_complete()) RDMA (tx: c000003c6a78c5a8) failed: 5
[ 170.598355] 00000000 00000000 00000000 00000000
[ 170.598403] 00000000 00000000 00000000 00000000
[ 170.599245] powernv-cpufreq: CPU 104 on Chip 1 has Pmax restored to 0
[ 170.599647] LNet: 8714:0:(o2iblnd_cb.c:990:kiblnd_tx_complete()) Tx -> 10.39.232.11@o2ib6 cookie 0x63e sending 1 waiting 0: failed 5
[ 170.599651] LNet: 8714:0:(o2iblnd_cb.c:990:kiblnd_tx_complete()) Skipped 2 previous similar messages
[ 170.599654] LNet: 8713:0:(o2iblnd_cb.c:1934:kiblnd_close_conn_locked()) Closing conn to 10.39.232.11@o2ib6: error -5(waiting)
[ 170.599669] LustreError: 8714:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc c000003c62cf5c00
[ 170.599675] Lustre: 8896:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1476124274/real 1476124274] req@c000003c4e340000 x1547828424878916/t0(0) o4->atlastds-OST0035-osc-c000001fc5b75000@10.36.226.69@o2ib:6/4 lens 608/448 e 0 to 1 dl 1476124841 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[ 170.599681] Lustre: atlastds-OST0035-osc-c000001fc5b75000: Connection to atlastds-OST0035 (at 10.36.226.69@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[ 170.611219] 00000000 00000000 00000000 00000000
[ 170.612270] 00000000 08007806 2500003a 06789cd2
[ 170.613866] LustreError: 8737:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc c000001fb98c0400
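For readers of the log above, here is a small sketch that maps the numeric codes in the "FastReg failed: 6" and "RDMA ... failed: 5" messages to names. It assumes those numbers are raw `ib_wc_status` values as defined in the upstream kernel's `include/rdma/ib_verbs.h`; the MOFED headers in use have not been checked. The helper name `decode_failed_status` is made up for this example.

```python
import re

# Subset of the upstream kernel's enum ib_wc_status (include/rdma/ib_verbs.h).
# Assumption: MOFED 3.3 uses the same numbering.
IB_WC_STATUS = {
    0: "IB_WC_SUCCESS",
    1: "IB_WC_LOC_LEN_ERR",
    2: "IB_WC_LOC_QP_OP_ERR",
    3: "IB_WC_LOC_EEC_OP_ERR",
    4: "IB_WC_LOC_PROT_ERR",
    5: "IB_WC_WR_FLUSH_ERR",
    6: "IB_WC_MW_BIND_ERR",
}

# Matches the trailing "failed: <n>" in the o2iblnd completion messages.
FAILED_RE = re.compile(r"failed: (\d+)\s*$")


def decode_failed_status(line):
    """Return the ib_wc_status name for a 'failed: N' log line, or None."""
    match = FAILED_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return IB_WC_STATUS.get(code, "unknown status %d" % code)


if __name__ == "__main__":
    for msg in (
        "LNet: 8714:0:(o2iblnd_cb.c:3433:kiblnd_complete()) FastReg failed: 6",
        "LNet: 8713:0:(o2iblnd_cb.c:3444:kiblnd_complete()) RDMA (tx: c000003c6a78c5a8) failed: 5",
    ):
        print(decode_failed_status(msg))
```

Under that assumption, the fast-registration work request fails with a memory-window bind error, and the RDMA that follows it is flushed.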
It would be very useful to know under what conditions MOFED returns this error. Without access to the MOFED source or the firmware source (if the error is generated by firmware), I cannot determine that.
Do you have a support ticket open with Mellanox for this? If they can provide a list of the conditions that generate this error, we would have something to work with when debugging what o2iblnd is doing wrong.
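It may help to attach decoded fields from the error CQE dump to such a ticket. The sketch below pulls them out of the four dumped rows. It assumes the 64-byte layout of `struct mlx5_err_cqe` from the upstream mlx5 driver (vendor syndrome at byte 54, syndrome at byte 55, WQE opcode and QPN in bytes 56-59, WQE counter in bytes 60-61, CQE opcode in the high nibble of byte 63) and that each dumped word is printed in big-endian order. Neither assumption has been verified against the MOFED 3.3 source. The function name `parse_err_cqe` is made up for this example.

```python
def parse_err_cqe(rows):
    """Decode an mlx5 error CQE from four dumped rows of four hex words each.

    Assumes the upstream struct mlx5_err_cqe layout; MOFED may differ.
    """
    words = [int(word, 16) for row in rows for word in row.split()]
    if len(words) != 16:
        raise ValueError("expected 16 32-bit words, got %d" % len(words))
    raw = b"".join(word.to_bytes(4, "big") for word in words)
    return {
        "vendor_err_synd": raw[54],
        "syndrome": raw[55],
        "wqe_opcode": raw[56],  # top byte of s_wqe_opcode_qpn
        "qpn": int.from_bytes(raw[57:60], "big"),
        "wqe_counter": int.from_bytes(raw[60:62], "big"),
        "cqe_opcode": raw[63] >> 4,  # high nibble of op_own
    }


# First complete CQE dump from the log in the description.
ROWS = [
    "00000000 00000000 00000000 00000000",
    "00000000 00000000 00000000 00000000",
    "00000000 00000000 00000000 00000000",
    "00000000 08007806 25000039 0642b3d2",
]

if __name__ == "__main__":
    for name, value in parse_err_cqe(ROWS).items():
        print("%-16s 0x%x" % (name, value))
```

With that layout, the first dump decodes to syndrome 0x06 and vendor syndrome 0x78 on a WQE with opcode 0x25. In the upstream driver headers those values correspond to `MLX5_CQE_SYNDROME_MW_BIND_ERR` and `MLX5_OPCODE_UMR`, which is consistent with the "FastReg failed: 6" message. The vendor syndrome is the value Mellanox would need to interpret.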