LU-7778: mount of MDT (== MGS) failed after MDS restart

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0, Lustre 2.9.0
    • Affects Version/s: Lustre 2.8.0
    • Environment: lola
      build: 2.8.50-6-gf9ca359 ; commit f9ca359284357d145819beb08b316e932f7a3060
    • Severity: 3

    Description

      The error happened during soak testing of build '20160215' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20150215). DNE is enabled.
      MDTs had been formatted using ldiskfs, OSTs using ZFS. MDS nodes are configured in an active-active HA failover configuration.

      Please note that build 20160215 is a vanilla build of the master branch.
      This issue might be addressed by the changes included in build '20160210', as we didn't observe it during a two-day test session.

      Sequence of events:

      • 2016-02-15 16:25:21,179:fsmgmt.fsmgmt:INFO triggering fault mds_restart
      • 2016-02-15 16:31:41,282:fsmgmt.fsmgmt:INFO lola-8 is up
      • 2016-02-15 16:36:50,594:fsmgmt.fsmgmt:INFO ... soaked-MDT0001 mounted successfully on lola-8
      • 2016-02-15 16:38:20, mount of MDT0000 (== MGS) fails
        Error message reads as:
        Feb 15 16:38:20 lola-8 kernel: LustreError: 15c-8: MGC192.168.1.108@o2ib10: The configuration from log 'soaked-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
        Feb 15 16:38:20 lola-8 kernel: LustreError: 4538:0:(obd_mount_server.c:1309:server_start_targets()) failed to start server soaked-MDT0000: -5
        Feb 15 16:38:20 lola-8 kernel: LustreError: 4538:0:(obd_mount_server.c:1798:server_fill_super()) Unable to start targets: -5
        Feb 15 16:38:20 lola-8 kernel: LustreError: 4538:0:(obd_mount_server.c:1512:server_put_super()) no obd soaked-MDT0000
        Feb 15 16:38:20 lola-8 kernel: Lustre: server umount soaked-MDT0000 complete
        Feb 15 16:38:20 lola-8 kernel: LustreError: 4538:0:(obd_mount.c:1426:lustre_fill_super()) Unable to mount  (-5)
        
      • I checked the HW and cluster configuration: no problem with the IB HCAs, LNet is working, and the routers are up; the disk device file of MDT0000 can be read and accessed.
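      For reference, the negative codes in these messages are negated kernel errno values: the -5 above is EIO ("Input/output error"), and the -16 appearing in the connection errors quoted in the comments below is EBUSY. A minimal, self-contained C snippet (illustration only, not part of the ticket) confirms the mapping:

        #include <stdio.h>
        #include <string.h>

        /* Lustre logs kernel errno values negated; decode the two codes
         * seen in this ticket: -5 (EIO) and -16 (EBUSY). */
        int main(void)
        {
                int codes[] = { 5, 16 };
                int i;

                for (i = 0; i < 2; i++)
                        printf("-%d: %s\n", codes[i], strerror(codes[i]));
                return 0;
        }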

      Attached are the messages and console logs, plus a manually forced debug log, of node lola-8.

      Attachments

        Activity

          [LU-7778] mount of MDT (== MGS) failed after MDS restart
          pjones Peter Jones added a comment -

          Landed for 2.8 and 2.9


          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18509/
          Subject: LU-7778 osd: check if the object is destroyed
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: a19c1ea92fb8d9909ec9fb98f22a8a9e4835c572


          wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18509
          Subject: LU-7778 osd: check if the object is destroyed
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 3096d9dbeae6bafccf10104b8221b91fac05a08f
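
           The patch subject suggests the fix teaches the OSD layer to notice an already-destroyed object (such as a config llog object touched during mount) and return a clean error instead of letting the access surface as -EIO. A rough, self-contained C sketch of that pattern; the names below are hypothetical and this is not the actual fs/lustre-release code:

           #include <errno.h>
           #include <stdbool.h>
           #include <stdio.h>

           /* Hypothetical model: an OSD object carrying a "destroyed" flag. */
           struct osd_obj {
                   bool destroyed;         /* set once the object is destroyed */
           };

           /*
            * Check the flag before operating on the object, so callers get
            * -ENOENT (object gone, handle gracefully) rather than an I/O
            * attempt that would fail with -EIO as in the mount above.
            */
           int osd_obj_access(struct osd_obj *obj)
           {
                   if (obj->destroyed)
                           return -ENOENT;
                   /* ... perform the real read/write here ... */
                   return 0;
           }

           int main(void)
           {
                   struct osd_obj gone = { .destroyed = true };

                   /* prints -2 (ENOENT) instead of attempting doomed I/O */
                   printf("access on destroyed object: %d\n",
                          osd_obj_access(&gone));
                   return 0;
           }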

          di.wang Di Wang added a comment -
          Feb 15 16:37:47 lola-8 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. quota=on. Opts:
          Feb 15 16:37:48 lola-8 kernel: LustreError: 11-0: soaked-MDT0006-osp-MDT0001: operation mds_connect to node 192.168.1.111@o2ib10 failed: rc = -16
          Feb 15 16:37:48 lola-8 kernel: LustreError: Skipped 3 previous similar messages
          Feb 15 16:37:48 lola-8 kernel: Lustre: 4320:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1455583068/real 1455583068]  req@ffff8804037909c0 x1526289292853684/t0(0) o38->soaked-MDT0000-osp-MDT0001@192.168.1.109@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1455583079 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          Feb 15 16:37:48 lola-8 kernel: Lustre: 4320:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
          Feb 15 16:37:54 lola-8 kernel: LustreError: 137-5: soaked-MDT0003_UUID: not available for connect from 192.168.1.104@o2ib10 (no target). If you are running an HA pair check that the target is mounted on the other server.
          Feb 15 16:37:54 lola-8 kernel: LustreError: Skipped 58 previous similar messages
          Feb 15 16:38:03 lola-8 kernel: Lustre: soaked-MDT0001: Client d26c53bc-3d10-5c53-0c35-f189140fc2e8 (at 192.168.1.131@o2ib100) reconnecting, waiting for 14 clients in recovery for 3:53
          Feb 15 16:38:03 lola-8 kernel: Lustre: Skipped 180 previous similar messages
          Feb 15 16:38:20 lola-8 kernel: LustreError: 15c-8: MGC192.168.1.108@o2ib10: The configuration from log 'soaked-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
          Feb 15 16:38:20 lola-8 kernel: LustreError: 4538:0:(obd_mount_server.c:1309:server_start_targets()) failed to start server soaked-MDT0000: -5
          

          It looks like MDT0 has trouble communicating with the MGS. Unfortunately, there are no logs to indicate what happened. I guess I need to monitor the run.


          People

            Assignee: Di Wang (di.wang)
            Reporter: Frank Heckes (heckes, Inactive)
            Votes: 0
            Watchers: 4
