Lustre / LU-2330

OST unable to complete registration with MGS

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.4.0
    • lustre-orion-2.3.54-6chaos
    • 3
    • 5562

    Description

      We created a new filesystem with 768 zfs-osd OSTs. The OSTs were started in parallel and out of index order. Initial registration failed for about 35 of the OSTs. The console log from one of them:

      2012-11-14 12:30:07 Lustre: Lustre: Build Version: 2.3.54-6chaos-6chaos--PRISTINE-2.6.32-220.23.1.2chaos.ch5.x86_64
      2012-11-14 12:30:19 LustreError: 41374:0:(client.c:1123:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff881011707000 x1418644692140034/t0(0) o253->MGC172.20.5.1@o2ib500@172.20.5.1@o2ib500:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      2012-11-14 12:30:19 Lustre: Error -5 communicating with the MGS, is the MGS running?
      2012-11-14 12:30:25 LustreError: 41374:0:(client.c:1123:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff881011707000 x1418644692140035/t0(0) o101->MGC172.20.5.1@o2ib500@172.20.5.1@o2ib500:26/25 lens 328/384 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      2012-11-14 12:30:31 LustreError: 41374:0:(client.c:1123:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff881011707000 x1418644692140036/t0(0) o101->MGC172.20.5.1@o2ib500@172.20.5.1@o2ib500:26/25 lens 328/384 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      2012-11-14 12:30:31 LustreError: 15c-8: MGC172.20.5.1@o2ib500: The configuration from log 'lsfull-OST0062' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      2012-11-14 12:30:31 LustreError: 41374:0:(obd_mount.c:1851:server_start_targets()) failed to start server lsfull-OST0062: -5
      2012-11-14 12:30:31 Lustre: lsfull-OST0062: Unable to start target: -5
      2012-11-14 12:30:31 LustreError: 41374:0:(obd_mount.c:1350:lustre_disconnect_osp()) Can't end config log lsfull
      2012-11-14 12:30:31 LustreError: 41374:0:(obd_mount.c:2113:server_put_super()) lsfull-OST0062: failed to disconnect osp-on-ost (rc=-2)!
      2012-11-14 12:30:31 LustreError: 41374:0:(obd_mount.c:2143:server_put_super()) no obd lsfull-OST0062
      2012-11-14 12:30:31 LustreError: 41374:0:(obd_mount.c:1418:lustre_stop_osp()) Can not find osp-on-ost lsfull-MDT0000-osp-OST0062
      2012-11-14 12:30:31 LustreError: 41374:0:(obd_mount.c:2158:server_put_super()) lsfull-OST0062: Fail to stop osp-on-ost!
      2012-11-14 12:30:58 Lustre: server umount lsfull-OST0062 complete
      2012-11-14 12:30:58 LustreError: 41374:0:(obd_mount.c:2990:lustre_fill_super()) Unable to mount  (-5)
      2012-11-14 12:32:54 LustreError: 42061:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
      2012-11-14 12:32:54 Lustre: lsfull-OST0062: Initializing new disk
      2012-11-14 12:34:10 LustreError: 166-1: MGC172.20.5.1@o2ib500: Connection to MGS (at 172.20.5.1@o2ib500) was lost; in progress operations using this service will fail
      2012-11-14 12:50:50 Lustre: Evicted from MGS (at MGC172.20.5.1@o2ib500_0) after server handle changed from 0x2ca8c28d7be253b7 to 0xa10ed509d011108e
      2012-11-14 12:50:50 Lustre: MGC172.20.5.1@o2ib500: Connection restored to MGS (at 172.20.5.1@o2ib500)
      2012-11-14 12:56:27 LustreError: 137-5: UUID 'ls1-OST0062_UUID' is not available for connect (no target)
      2012-11-14 12:56:27 LustreError: Skipped 21 previous similar messages
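
With 768 OSTs starting in parallel, it can help to pull the failed targets out of the collected console logs mechanically. The following is a minimal sketch, not part of the original report: it assumes the logs are available as text lines and keys on the "failed to start server" message shown above. The function name `failed_targets` is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   ... (obd_mount.c:1851:server_start_targets()) failed to start server lsfull-OST0062: -5
FAILED_START = re.compile(r"failed to start server (\S+): (-?\d+)")

def failed_targets(lines):
    """Map each target name to the list of return codes it failed with."""
    failures = defaultdict(list)
    for line in lines:
        match = FAILED_START.search(line)
        if match:
            failures[match.group(1)].append(int(match.group(2)))
    return dict(failures)

# Sample lines copied from the logs in this report.
sample = [
    "2012-11-14 12:30:31 LustreError: 41374:0:(obd_mount.c:1851:server_start_targets()) failed to start server lsfull-OST0062: -5",
    "2012-11-14 12:30:31 Lustre: lsfull-OST0062: Unable to start target: -5",
    "2012-11-14 15:52:56 LustreError: 45074:0:(obd_mount.c:1851:server_start_targets()) failed to start server lsfull-OST0062: -2",
]
print(failed_targets(sample))
```

Only the `server_start_targets()` line is matched, so the companion "Unable to start target" message does not produce a duplicate entry.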
      

      df shows the OST mounted, and the lustre:svname property value no longer contains the ":", but the OST has no exports:

      # grove98 /root > df -t lustre
      Filesystem           1K-blocks      Used Available Use% Mounted on
      grove98/lsfull-ost0  67554518656      1152 67554515456   1% /mnt/lustre/local/lsfull-OST0062
      # grove98 /root > zfs get lustre:svname grove98/lsfull-ost0
      NAME                 PROPERTY       VALUE           SOURCE
      grove98/lsfull-ost0  lustre:svname  lsfull-OST0062  local
      # grove98 /root > ls /proc/fs/lustre/obdfilter/lsfull-OST0062/exports/
      clear
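
The state implied by the lustre:svname value can be checked with a small helper. This is a sketch under a stated assumption: that a ":" separator marks a target that still needs to register with the MGS and a "-" separator marks one that considers itself registered, which is how the property is used in this report. The name `describe_svname` and the separator table are illustrative, not a Lustre API.

```python
import re

# Assumed meaning of the separator between fsname and target, as used in this report.
SEPARATOR_STATE = {
    ":": "not yet registered with the MGS",
    "-": "registered",
}

SVNAME = re.compile(r"^(?P<fsname>[A-Za-z0-9_]+)(?P<sep>[:\-])(?P<target>(?:OST|MDT)[0-9a-fA-F]{4})$")

def describe_svname(svname):
    """Split a lustre:svname value into (fsname, target, assumed state)."""
    match = SVNAME.match(svname)
    if match is None:
        raise ValueError(f"unrecognized svname: {svname!r}")
    return (match.group("fsname"), match.group("target"), SEPARATOR_STATE[match.group("sep")])

print(describe_svname("lsfull-OST0062"))
print(describe_svname("lsfull:OST0062"))
```

Here the property value says "registered" on the OSS side even though the MGS has no record of the target, which is the inconsistency this issue describes.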
      

      If I try to restart the OST, the MGS complains it isn't registered:

      MGS:

      LustreError: 142-7: The target lsfull-OST0062 has not registered yet. It must be started before failnids can be added.
      LustreError: 527:0:(mgs_llog.c:2956:mgs_write_log_param()) err -2 on param 'failover.node=172.20.1.97@o2ib500'
      LustreError: 527:0:(mgs_handler.c:393:mgs_handle_target_reg()) Failed to write lsfull-OST0062 log (-2)
      

      OSS:

      2012-11-14 15:52:56 Lustre: Error -2 communicating with the MGS, is the MGS running?
      2012-11-14 15:52:56 LustreError: 45074:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
      2012-11-14 15:52:56 LustreError: 15c-8: MGC172.20.5.1@o2ib500: The configuration from log 'lsfull-OST0062' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      2012-11-14 15:52:56 LustreError: 45074:0:(obd_mount.c:1851:server_start_targets()) failed to start server lsfull-OST0062: -2
      2012-11-14 15:52:56 Lustre: lsfull-OST0062: Unable to start target: -2
      2012-11-14 15:52:56 LustreError: 45074:0:(obd_mount.c:1350:lustre_disconnect_osp()) Can't end config log lsfull
      2012-11-14 15:52:56 LustreError: 45074:0:(obd_mount.c:2113:server_put_super()) lsfull-OST0062: failed to disconnect osp-on-ost (rc=-2)!
      2012-11-14 15:52:56 LustreError: 45074:0:(obd_mount.c:2143:server_put_super()) no obd lsfull-OST0062
      2012-11-14 15:52:56 LustreError: 45074:0:(obd_mount.c:1418:lustre_stop_osp()) Can not find osp-on-ost lsfull-MDT0000-osp-OST0062
      2012-11-14 15:52:56 LustreError: 45074:0:(obd_mount.c:2158:server_put_super()) lsfull-OST0062: Fail to stop osp-on-ost!
      2012-11-14 15:52:57 Lustre: server umount lsfull-OST0062 complete
      2012-11-14 15:52:57 LustreError: 45074:0:(obd_mount.c:2990:lustre_fill_super()) Unable to mount  (-2)
      

      Finally, if I put the ":" back in the lustre:svname property to try to force re-registration, the MGS complains that the index is already in use:

      OSS:

      # grove98 /root > zfs set lustre:svname=lsfull:OST0062 grove98/lsfull-ost0
      # grove98 /root > zfs get lustre:svname grove98/lsfull-ost0
      NAME                 PROPERTY       VALUE           SOURCE
      grove98/lsfull-ost0  lustre:svname  lsfull:OST0062  local
      # grove98 /root > /etc/init.d/lustre start
      Mounting grove98/lsfull-ost0 on /mnt/lustre/local/lsfull-OST0062
      mount.lustre: mount grove98/lsfull-ost0 at /mnt/lustre/local/lsfull-OST0062 failed: No such file or directory
      Is the MGS specification correct?
      Is the filesystem name correct?
      If upgrading, is the copied client log valid? (see upgrade docs)
      # grove98 /root > dmesg | tail
      Lustre: Error -98 communicating with the MGS, is the MGS running?
      LustreError: 45333:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
      LustreError: 15c-8: MGC172.20.5.1@o2ib500: The configuration from log 'lsfull-OST0062' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      LustreError: 45333:0:(obd_mount.c:1851:server_start_targets()) failed to start server lsfull-OST0062: -2
      Lustre: lsfull-OST0062: Unable to start target: -2
      LustreError: 45333:0:(obd_mount.c:1350:lustre_disconnect_osp()) Can't end config log lsfull
      LustreError: 45333:0:(obd_mount.c:2113:server_put_super()) lsfull-OST0062: failed to disconnect osp-on-ost (rc=-2)!
      LustreError: 45333:0:(obd_mount.c:2143:server_put_super()) no obd lsfull-OST0062
      LustreError: 45333:0:(obd_mount.c:1418:lustre_stop_osp()) Can not find osp-on-ost lsfull-MDT0000-osp-OST0062
      LustreError: 45333:0:(obd_mount.c:2158:server_put_super()) lsfull-OST0062: Fail to stop osp-on-ost!
      Lustre: server umount lsfull-OST0062 complete
      LustreError: 45333:0:(obd_mount.c:2990:lustre_fill_super()) Unable to mount  (-2)
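
The negative return codes in these logs are Linux errno values. As a reading aid, here is a small lookup for the three codes that appear in this report; the table is hard-coded from the Linux errno definitions rather than taken from the platform, and `decode_rc` is a hypothetical helper name.

```python
# Linux errno values that appear in this report.
LINUX_ERRNO = {
    2: ("ENOENT", "No such file or directory"),
    5: ("EIO", "Input/output error"),
    98: ("EADDRINUSE", "Address already in use"),
}

def decode_rc(rc):
    """Render a kernel-style negative return code as 'rc: NAME (text)'."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in table"))
    return f"{rc}: {name} ({text})"

for rc in (-5, -2, -98):
    print(decode_rc(rc))
```

The -2 case explains the "No such file or directory" message printed by mount.lustre above.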
      

      MGS:

      LustreError: 140-5: Server lsfull-OST0062 requested index 98, but that index is already in use. Use --writeconf to force
      LustreError: 534:0:(mgs_llog.c:3005:mgs_write_log_target()) Can't get index (-98)
      LustreError: 534:0:(mgs_handler.c:393:mgs_handle_target_reg()) Failed to write lsfull-OST0062 log (-98)
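
The "index 98" in the MGS message refers to the same target: the four-digit suffix in the target name is hexadecimal, and 0x62 is 98 in decimal. A short sketch of the conversion in both directions (the helper names are illustrative):

```python
def target_index(name):
    """Return the decimal index encoded in the 4-digit hex suffix of a target name."""
    return int(name[-4:], 16)

def ost_name(fsname, index):
    """Build a target name from a filesystem name and a decimal OST index."""
    return f"{fsname}-OST{index:04x}"

print(target_index("lsfull-OST0062"))
print(ost_name("lsfull", 98))
```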
      

            People

              liwei Li Wei (Inactive)
              nedbass Ned Bass (Inactive)