Details
- Bug
- Resolution: Fixed
- Minor
- None
- Lustre 1.8.9
- None
- RHEL5: 2.6.18-348.1.1.el5_lustre
- 3
- 8161
Description
Recently we have been hitting an LBUG during Lustre startup. It appears to have happened before the upgrade to 1.8.9, but it seems to be much more frequent since then. During startup, our HA software attempts to mount all the resources in parallel. It looks like an OST occasionally attempts to contact the MGS before the MGS is mounted, and we get messages like:
May 7 17:36:35 lfs-oss-2-16 kernel: LustreError: 8596:0:(obd_config.c:372:class_setup()) setup OSS failed (-22)
May 7 17:36:35 lfs-oss-2-16 kernel: LustreError: 8596:0:(obd_mount.c:483:lustre_start_simple()) OSS setup error -22
May 7 17:36:35 lfs-oss-2-16 kernel: LustreError: 8596:0:(obd_mount.c:1096:server_start_targets()) failed to start OSS: -22
May 7 17:36:35 lfs-oss-2-16 kernel: LustreError: 8596:0:(obd_mount.c:1672:server_fill_super()) Unable to start targets: -22
May 7 17:36:35 lfs-oss-2-16 kernel: LustreError: 8596:0:(obd_mount.c:1455:server_put_super()) no obd scratch2-OST00a7
May 7 17:36:35 lfs-oss-2-16 kernel: LustreError: 8596:0:(obd_mount.c:149:server_deregister_mount()) scratch2-OST00a7 not registered
May 7 17:36:40 lfs-oss-2-16 kernel: Lustre: 19173:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1431216012696932 sent from MGC10.175.31.242@o2ib4 to NID 10.175.31.242@o2ib4 5s ago has timed out (5s prior to deadline).
May 7 17:36:40 lfs-oss-2-16 kernel: req@ffff8105831f2c00 x1431216012696932/t0 o250->MGS@MGC10.175.31.242@o2ib4_0:26/25 lens 368/584 e 0 to 1 dl 1367948200 ref 1 fl Rpc:N/0/0 rc 0/0
May 7 17:36:41 lfs-oss-2-16 kernel: Lustre: server umount scratch2-OST00a7 complete
The MGS then starts, and the other OSTs on the OSS begin starting up. Then we hit:
May 7 17:37:38 lfs-oss-2-16 kernel: LustreError: 10235:0:(service.c:2041:ptlrpc_unregister_service()) ASSERTION(list_empty(&service->srv_active_rqbds)) failed
May 7 17:37:38 lfs-oss-2-16 kernel: LustreError: 10235:0:(service.c:2041:ptlrpc_unregister_service()) LBUG
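For anyone who wants to check their own syslog for the same signature, here is a minimal Python sketch that picks out LBUG lines. The regular expression is inferred only from the log lines quoted above, and the names `LUSTRE_ERROR` and `find_lbugs` are made up for this example, so other Lustre versions may need a different pattern.

```python
import re

# Pattern inferred from the log excerpts in this ticket (an assumption,
# not an official Lustre log format definition).
LUSTRE_ERROR = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)"
)


def find_lbugs(lines):
    """Return (pid, file, line, func) for each LustreError line whose message is LBUG."""
    hits = []
    for text in lines:
        match = LUSTRE_ERROR.search(text)
        if match and match.group("msg").strip() == "LBUG":
            hits.append((
                int(match.group("pid")),
                match.group("file"),
                int(match.group("line")),
                match.group("func"),
            ))
    return hits
```

Calling `find_lbugs(open("/var/log/messages"))` would list the process, source file, line, and function for each LBUG; the log path is only an example.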
My guess is that it is some kind of race between mount and umount. The obvious workaround is to make sure that the MGS is started before the OSTs, and we will do that in the future, but the HA setup isn't perfect and sometimes the MGS simply isn't there when an OST mounts.
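The workaround could look roughly like the Python sketch below, which waits for the MGS node to answer before mounting an OST. It is only an illustration under some assumptions: the helper names (`wait_until`, `mgs_reachable`, `mount_ost`) and the timeout values are invented here, and `lctl ping` only shows that LNet on the MGS node responds, not that the MGS target itself is mounted, so this narrows the window rather than closing it.

```python
import subprocess
import time


def wait_until(probe, timeout=120.0, interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def mgs_reachable(nid):
    # Assumption: LNet reachability is used as a stand-in for "MGS is up".
    # It does not prove that the MGS target has finished mounting.
    result = subprocess.run(
        ["lctl", "ping", nid],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def mount_ost(device, mountpoint, mgs_nid, timeout=120.0):
    """Mount an OST only after the MGS node answers, otherwise give up."""
    if not wait_until(lambda: mgs_reachable(mgs_nid), timeout=timeout):
        raise RuntimeError("MGS %s not reachable after %.0fs" % (mgs_nid, timeout))
    subprocess.run(["mount", "-t", "lustre", device, mountpoint], check=True)
```

An HA resource script could call something like `mount_ost(device, mountpoint, "10.175.31.242@o2ib4")` in place of a bare mount, so that a missing MGS shows up as a clean resource failure instead of a mount that starts and is then torn down.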