Details
-
Bug
-
Resolution: Duplicate
-
Major
-
None
-
Lustre 2.11.0
-
Soak cluster - latest lustre-master build (3650) version=2.10.53_32_g20ffe21
-
3
Description
Reformatted and created a completely new filesystem.
The MDS/MGS mounts.
The OSS mount fails.
OSS errors:
[ 584.585335] Lustre: 13317:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507069853/real 1507069853] req@ffff88083f758300 x1580277278179360/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507069860 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
[ 584.680525] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
[ 584.727353] LustreError: 15f-b: soaked-OST0000: cannot register this server with the MGS: rc = -5. Is the MGS running?
[ 584.755152] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
[ 584.796681] LustreError: 13317:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5
[ 584.828634] LustreError: 13317:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-OST0000
[ 584.858501] LustreError: 13317:0:(obd_mount_server.c:132:server_deregister_mount()) soaked-OST0000 not registered
[ 585.118058] Lustre: server umount soaked-OST0000 complete
[ 585.135868] LustreError: 13317:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount (-5)
Errors on MDS:
[17524.112385] Lustre: soaked-MDT0000: new disk, initializing
[17524.138680] Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900
[17524.153739] Lustre: ctl-soaked-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400]:0:mdt
[17528.212947] Lustre: MGS: Connection restored to aa8414d2-f089-bf2e-b8c3-406370f048cc (at 192.168.1.102@o2ib)
[17528.273221] LustreError: 12811:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs
[17528.287356] LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check
[17528.305726] LustreError: 14558:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.102@o2ib x1580277278179360
[17528.418152] Lustre: MGS: Received new LWP connection from 192.168.1.102@o2ib, removing former export from same NID
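The numeric codes in the OSS and MDS logs above (rc = -5, status -103) are negated Linux errno values. As a reading aid, the sketch below decodes just those two codes. The table is hand-written from the standard Linux errno numbering and the helper name `decode_rc` is hypothetical; nothing here comes from the Lustre source.

```python
# Linux errno numbers that appear (negated) in the logs above.
# Hand-written table: only the two codes seen in this report are included.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    103: ("ECONNABORTED", "Software caused connection abort"),
}


def decode_rc(rc):
    """Return (name, description) for a negative kernel-style return code."""
    return LINUX_ERRNO.get(-rc, ("UNKNOWN", "not in this table"))


if __name__ == "__main__":
    for rc in (-5, -103):
        name, desc = decode_rc(rc)
        print(f"rc = {rc}: {name} ({desc})")
```

Read this way, the OSS sees a generic I/O error (-5) when registering with the MGS, while the MGS side reports an aborted connection (-103) on the incoming request.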
Not sure where to go from here.
There appear to be three separate issues:
1. Fastreg is not supported on OPA, so reverting LU-9810 works (as a side note, I also tried John's patch for LU-9983 and that resolves the problem we were seeing there).
2. Fastreg is broken on MLX-5. I was debugging this problem yesterday, and we already have a few bugs that are all related. For this particular problem, ib_map_mr_sg() is called to map 11 fragments but ends up mapping only 10. I'm looking at the mlx5 driver code to understand why it stops before mapping all fragments. The fragments look like:
So it looks like it stops on the fragment of length 73.
3. MLX-4 failure. I still need to investigate further, because it could be different from both of the above.
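For the second issue, one way a mapping call can return fewer fragments than requested is a page-alignment gap: a fragment that ends partway through a page, followed by a fragment that is not physically contiguous with it. The sketch below is a simplified Python model of that rule, loosely based on how the kernel's generic scatterlist-to-pages helper treats gaps. Whether the mlx5 path fails for this reason is not established in this report. The fragment list is entirely hypothetical (the real listing is not reproduced above), and `count_mapped` and the 4096-byte page size are assumptions for illustration.

```python
PAGE_SIZE = 4096  # assumed page size for this illustration


def count_mapped(frags, page_size=PAGE_SIZE):
    """Count how many (addr, length) fragments map before a gap is hit.

    Simplified rule: between two consecutive fragments, either the first
    ends on a page boundary and the second starts on one, or the two are
    physically contiguous. Otherwise mapping stops.
    """
    mapped = 0
    prev_end = None
    for addr, length in frags:
        if prev_end is not None:
            unaligned = prev_end % page_size != 0 or addr % page_size != 0
            if unaligned and prev_end != addr:
                break  # gap between fragments: stop mapping here
        mapped += 1
        prev_end = addr + length
    return mapped


# Hypothetical fragments: nine full pages, one 73-byte fragment, one more page.
frags = [(0x10000 + i * 0x2000, PAGE_SIZE) for i in range(9)]
frags += [(0x22000, 73), (0x24000, PAGE_SIZE)]

if __name__ == "__main__":
    print(f"requested {len(frags)}, mapped {count_mapped(frags)}")
```

In this model the 73-byte fragment itself is mapped, but because it ends mid-page the fragment after it cannot be, so 10 of the 11 hypothetical fragments are mapped.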