Lustre / LU-3295

ASSERTION(list_empty(&service->srv_active_rqbds)) failed


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 1.8.9
    • Labels: None
    • Environment: RHEL5: 2.6.18-348.1.1.el5_lustre
    • Severity: 3
    • Rank (Obsolete): 8161

    Description

      Recently we have been hitting an LBUG during Lustre startup. It appears to have happened before the upgrade to 1.8.9, but it has been much more frequent since. During startup our HA software attempts to mount all the resources in parallel. Occasionally an OST appears to try to contact the MGS before the MGS is mounted, and we get messages like:

      May 7 17:36:35 lfs-oss-2-16 kernel: LustreError: 8596:0:(obd_config.c:372:class_setup()) setup OSS failed (-22)
      May 7 17:36:35 lfs-oss-2-16 kernel: LustreError: 8596:0:(obd_mount.c:483:lustre_start_simple()) OSS setup error -22
      May 7 17:36:35 lfs-oss-2-16 kernel: LustreError: 8596:0:(obd_mount.c:1096:server_start_targets()) failed to start OSS: -22
      May 7 17:36:35 lfs-oss-2-16 kernel: LustreError: 8596:0:(obd_mount.c:1672:server_fill_super()) Unable to start targets: -22
      May 7 17:36:35 lfs-oss-2-16 kernel: LustreError: 8596:0:(obd_mount.c:1455:server_put_super()) no obd scratch2-OST00a7
      May 7 17:36:35 lfs-oss-2-16 kernel: LustreError: 8596:0:(obd_mount.c:149:server_deregister_mount()) scratch2-OST00a7 not registered
      May 7 17:36:40 lfs-oss-2-16 kernel: Lustre: 19173:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1431216012696932 sent from MGC10.175.31.242@o2ib4 to NID 10.175.31.242@o2ib4 5s ago has timed out (5s prior to deadline).
      May 7 17:36:40 lfs-oss-2-16 kernel: req@ffff8105831f2c00 x1431216012696932/t0 o250->MGS@MGC10.175.31.242@o2ib4_0:26/25 lens 368/584 e 0 to 1 dl 1367948200 ref 1 fl Rpc:N/0/0 rc 0/0
      May 7 17:36:41 lfs-oss-2-16 kernel: Lustre: server umount scratch2-OST00a7 complete
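
      For anyone triaging similar startup logs, here is a minimal sketch of pulling the source location, function name and return code out of the LustreError lines above. The function name parse_lustre_error and both regular expressions are illustrative assumptions based only on the line layout shown in this ticket, not a Lustre-provided tool. The -22 in these messages is EINVAL on Linux.

```python
import errno
import re

# Assumed layout, taken from the lines quoted in this ticket:
#   LustreError: <pid>:0:(<file>:<line>:<function>()) <message>
LUSTRE_ERROR = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<msg>.*)"
)
# Trailing return code, written as "(-22)", ": -22" or "error -22".
RETURN_CODE = re.compile(r"[\s(](-\d+)\)?$")


def parse_lustre_error(log_line):
    """Return a dict describing one LustreError line, or None if it is not one."""
    match = LUSTRE_ERROR.search(log_line)
    if match is None:
        return None
    record = match.groupdict()
    record["pid"] = int(record["pid"])
    record["line"] = int(record["line"])
    rc = RETURN_CODE.search(record["msg"])
    record["rc"] = int(rc.group(1)) if rc else None
    # Map the negative kernel return code to its symbolic errno name.
    record["errno_name"] = (
        errno.errorcode.get(-record["rc"]) if record["rc"] is not None else None
    )
    record["is_lbug"] = record["msg"].strip() == "LBUG"
    return record
```

      Lines without a trailing return code (for example the "no obd scratch2-OST00a7" message) come back with rc set to None, and plain "Lustre:" lines are skipped.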

      The MGS then starts, and the other OSTs on the OSS begin to start up. About a minute later we hit:
      May 7 17:37:38 lfs-oss-2-16 kernel: LustreError: 10235:0:(service.c:2041:ptlrpc_unregister_service()) ASSERTION(list_empty(&service->srv_active_rqbds)) failed
      May 7 17:37:38 lfs-oss-2-16 kernel: LustreError: 10235:0:(service.c:2041:ptlrpc_unregister_service()) LBUG

      My guess is that it is some kind of race between mount and umount. The obvious workaround is to make sure the MGS is started before the OSTs are mounted, and we will do that in the future, but the HA software isn't perfect and sometimes the MGS simply isn't there.
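
      The workaround described above (do not mount OSTs until the MGS is available) could be sketched roughly as follows. This is only an illustration under stated assumptions: wait_until and start_targets are hypothetical helper names, and the probe and mount actions are passed in as callables because how a site checks that the MGS is up, and how its HA software mounts a target, are not described in this ticket.

```python
import time


def wait_until(probe, timeout_s=60.0, interval_s=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def start_targets(mgs_is_up, mount_target, targets, **wait_options):
    """Mount each target only after the MGS probe has succeeded.

    mgs_is_up:    site-specific callable returning True once the MGS is usable
    mount_target: site-specific callable that mounts one OST
    """
    if not wait_until(mgs_is_up, **wait_options):
        # Fail cleanly instead of attempting mounts that would be torn down.
        raise TimeoutError("MGS not available; not mounting OSTs")
    for target in targets:
        mount_target(target)
```

      Serializing on the MGS this way only avoids the failed mount and its cleanup; it does not address the underlying assertion in ptlrpc_unregister_service().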

          People

            Assignee: Hongchao Zhang (hongchao.zhang)
            Reporter: Kit Westneat (kitwestneat, Inactive)
            Votes: 0
            Watchers: 6
