Lustre / LU-1257

OST registration snafu

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.4.0
    • Lustre 2.4.0
    • None
    • 3
    • 5533

    Description

      Our sysadmins were expanding one of our 2.1 filesystems yesterday and ran into a problem. Because earlier problems in 1.8 made out-of-order OST registration risky, the admin was using a script to mount the OSTs one at a time, in sequential ID order.

      With each new OST registration the MGS lock is revoked. Lock timeouts were not infrequent.

      One OST hit a timeout while communicating with the MGS, and it could not recover from that by normal means. I was forced to break out hexedit to get things working again.

      2012-03-22 10:07:58 LustreError: 166-1: MGC172.19.1.100@o2ib100: Connection to MGS (at 172.19.1.100@o2ib100) was lost; in progress operations using this service will fail
      2012-03-22 10:07:58 LustreError: 12150:0:(ldlm_request.c:115:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1332435772, 306s ago), entering recovery for MGS@MGC172.19.1.100@o2ib100_0 ns: MGC172.19.1.100@o2ib100 lock: ffff880314ba0480/0x54303899ba77f6ed lrc: 4/1,0 mode: --/CR res: 6517612/0 rrc: 1 type: PLN flags: 0x10000010 remote: 0x1d01b502e0a66a07 expref: -99 pid: 12150 timeout 0
      2012-03-22 10:07:59 LustreError: 12718:0:(obd_mount.c:1163:server_start_targets()) Required registration failed for lsc-OST0174: -4
      2012-03-22 10:07:59 LustreError: 12718:0:(obd_mount.c:1719:server_fill_super()) Unable to start targets: -4
      2012-03-22 10:07:59 LustreError: 12718:0:(obd_mount.c:1508:server_put_super()) no obd lsc-OST0174
      2012-03-22 10:07:59 LustreError: 12718:0:(obd_mount.c:141:server_deregister_mount()) lsc-OST0174 not registered
      2012-03-22 10:07:59 LustreError: 11-0: MGC172.19.1.100@o2ib100: Communicating with 172.19.1.100@o2ib100, operation mgs_connect failed with -16.
      2012-03-22 10:07:59 Lustre: server umount lsc-OST0174 complete
      2012-03-22 10:07:59 LustreError: 12718:0:(obd_mount.c:2160:lustre_fill_super()) Unable to mount  (-4)
      2012-03-22 10:08:24 LustreError: 11-0: MGC172.19.1.100@o2ib100: Communicating with 172.19.1.100@o2ib100, operation mgs_connect failed with -16.
      2012-03-22 10:08:49 LustreError: 11-0: MGC172.19.1.100@o2ib100: Communicating with 172.19.1.100@o2ib100, operation mgs_connect failed with -16.
      2012-03-22 10:09:14 LustreError: 11-0: MGC172.19.1.100@o2ib100: Communicating with 172.19.1.100@o2ib100, operation mgs_connect failed with -16.
      2012-03-22 10:09:38 LustreError: 11-0: MGC172.19.1.100@o2ib100: Communicating with 172.19.1.100@o2ib100, operation mgs_connect failed with -16.
      

      The admin reran the mount command several more times and each time got either -4 (EINTR), which corresponds to a lock timeout like the one above, or -5 (EIO).
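
For readers who want to decode the numeric return codes in these console messages, here is a minimal Python sketch. It is not part of Lustre. The function name `decode_return_codes` and the regular expression are illustrative assumptions, and the table lists only the Linux errno values that appear in this report.

```python
import re

# Linux errno values that appear in this report. Kernel code returns them
# negated, which is why the logs show -4, -16, and so on.
LINUX_ERRNO = {
    4: ("EINTR", "Interrupted system call"),
    5: ("EIO", "Input/output error"),
    16: ("EBUSY", "Device or resource busy"),
    22: ("EINVAL", "Invalid argument"),
    98: ("EADDRINUSE", "Address already in use"),
}

# Assumed message shapes: ": -4", "(-98)", "with -16." and "Err -22".
_RC = re.compile(r"(?:[:(]\s*|\bwith\s+|\bErr\s+)-(\d+)\b")


def decode_return_codes(line):
    """Return a list of (code, name, description) for each negative
    return code found in one console log line."""
    decoded = []
    for match in _RC.finditer(line):
        number = int(match.group(1))
        name, description = LINUX_ERRNO.get(number, ("UNKNOWN", "not in table"))
        decoded.append((-number, name, description))
    return decoded
```

Feeding it the "Required registration failed" line yields a single EINTR entry, and the repeated `mgs_connect failed with -16` lines decode to EBUSY.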

      When I was called in, I tried the mount myself to start gathering info. At that point it returned -98 (EADDRINUSE), and the MGS logs showed that it HAD finally processed the OST's registration:

      2012-03-22 11:09:22 LustreError: 140-5: Server lsc-OST0174 requested index 372, but that index is already in use. Use --writeconf to force
      2012-03-22 11:09:22 LustreError: 12649:0:(mgs_llog.c:2710:mgs_write_log_target()) Can't get index (-98)
      2012-03-22 11:09:22 LustreError: 12649:0:(mgs_handler.c:520:mgs_handle_target_reg()) Failed to write lsc-OST0174 log (-98)
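
The index 372 in the MGS message is the hexadecimal suffix of the target name read as a number (0x0174 = 372). The Python sketch below shows that mapping and the sequential ordering the admin's script relied on. The helper names `parse_target` and `mount_order` are hypothetical, and the pattern is a simplification that assumes a filesystem name without hyphens and a four-digit hex index.

```python
import re

# Simplified target-name pattern: <fsname>-<OST|MDT><4 hex digits>.
# Assumption: the fsname itself contains no hyphen.
_TARGET = re.compile(
    r"^(?P<fsname>[^-]+)-(?P<kind>OST|MDT)(?P<index>[0-9a-fA-F]{4})$"
)


def parse_target(name):
    """Split a target name such as 'lsc-OST0174' into
    (fsname, kind, index), with the index as a decimal integer."""
    match = _TARGET.match(name)
    if match is None:
        raise ValueError("not a recognised target name: %r" % (name,))
    return match.group("fsname"), match.group("kind"), int(match.group("index"), 16)


def mount_order(names):
    """Return target names sorted by numeric index, i.e. the order in
    which a one-at-a-time registration script would mount them."""
    return sorted(names, key=lambda n: parse_target(n)[2])
```

`parse_target("lsc-OST0174")` gives `("lsc", "OST", 372)`, matching the index in the MGS error.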
      

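      But the OST does not know that the registration succeeded, so its mountdata still has the LDD_F_VIRGIN flag set. With that flag set, the OST keeps asking to register as a new target, and the MGS rejects the request because index 372 is already in use. As a result, the MGS will never let the OST connect.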

      That left us two courses of action (to the best of my knowledge):

      1. Unmount the filesystem completely, use --writeconf on the MGS, restart everything
      2. Use a hex editor on the OST's mountdata file to clear the LDD_F_VIRGIN flag

      Since we did not want to cause a downtime for the filesystem, we chose the latter.
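
The second option amounts to clearing one bit in a 32-bit flags word. Here is a minimal Python sketch of that operation on an in-memory copy of the mountdata bytes. The byte offset, the bit value, and the little-endian layout are all ASSUMPTIONS that must be checked against `lustre_disk.h` for the release in use. The report does not state them, and the names `is_flag_set` and `clear_flag` are hypothetical.

```python
import struct

# ASSUMPTIONS, not taken from the report; verify against lustre_disk.h
# for your release before trusting either value.
LDD_FLAGS_OFFSET = 20   # assumed: five 32-bit fields precede the flags word
LDD_F_VIRGIN = 0x0020   # assumed bit value of the "never registered" flag


def is_flag_set(mountdata, flag=LDD_F_VIRGIN, offset=LDD_FLAGS_OFFSET):
    """Report whether `flag` is set in the little-endian 32-bit word at `offset`."""
    (flags,) = struct.unpack_from("<I", mountdata, offset)
    return bool(flags & flag)


def clear_flag(mountdata, flag=LDD_F_VIRGIN, offset=LDD_FLAGS_OFFSET):
    """Return a copy of `mountdata` with `flag` cleared; every other
    byte is left unchanged and the input is never modified."""
    (flags,) = struct.unpack_from("<I", mountdata, offset)
    patched = bytearray(mountdata)
    struct.pack_into("<I", patched, offset, flags & ~flag & 0xFFFFFFFF)
    return bytes(patched)
```

The functions work on bytes rather than on the file itself, so the original can be kept as a backup and the patched copy compared against it before anything is written back.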

      The mount of the OST seemed to mostly go well, and it appears to be functioning fine now, but I did see this error on the MGS/MDS console:

      2012-03-22 15:09:48 Lustre: Found index 372 for lsc-OST0174, updating log
      2012-03-22 15:12:26 LustreError: 6002:0:(obd_config.c:1019:class_process_config()) no device for: lsc-OST0174-osc
      2012-03-22 15:12:26 LustreError: 6002:0:(obd_config.c:1363:class_config_llog_handler()) Err -22 on cfg command:
      2012-03-22 15:12:26 Lustre:    cmd=cf00b 0:lsc-OST0174-osc  1:172.19.1.127@o2ib100  
      

      The OST appears to be working fine, so I am not sure how worried I should be about that llog error.

      I will attach some console logs to show what was going on when the OST registration failed.

      Attachments

        1. console.sumom25
          153 kB
          Christopher Morrone
        2. console.sumom-mds1
          174 kB
          Christopher Morrone

            People

              niu Niu Yawei (Inactive)
              morrone Christopher Morrone (Inactive)