Lustre / LU-10836

MDS hangs on --replace and remount OSTs

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.3
    • Labels: None
    • Environment: CentOS 7.4 w/ Mellanox OFED 4.2-1.2.0.0
      ZFS OSTs, ldiskfs MDT
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      I recreated two OSTs on one of our file systems because they had corrupted ZFS metadata affecting their spacemaps. Their files had already been migrated off, max_create_count had been set to 0 on both, and both had been deactivated ahead of the destroy / --replace.

      I ran these commands on the OSS:

      umount /mnt/lustre/local/iliad-OST0018
      umount /mnt/lustre/local/iliad-OST0019
      zpool list
      zpool destroy iliad-ost18
      zpool destroy iliad-ost19
      zpool list
      mkfs.lustre --fsname=iliad --ost --replace --backfstype=zfs --index=24 --mgsnode=172.16.25.4@o2ib iliad-ost18/ost18 /dev/mapper/mpathc
      mkfs.lustre --fsname=iliad --ost --replace --backfstype=zfs --index=25 --mgsnode=172.16.25.4@o2ib iliad-ost19/ost19 /dev/mapper/mpathd
      systemctl restart lustre
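
A note on the --index values above: the suffix in a target name such as iliad-OST0018 is hexadecimal, while --index takes a decimal number, so OST0018 corresponds to index 24 and OST0019 to index 25. The following is a minimal sketch of that conversion in bash. The helper names ost_index and ost_name are hypothetical and are not part of any Lustre tool.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not Lustre commands) for converting between the
# hex suffix of an OST target name and the decimal --index value.

# ost_index NAME: print the decimal index for a name such as iliad-OST0018.
ost_index() {
    local name=$1
    local hex=${name##*OST}        # keep only the hex digits after "OST"
    printf '%d\n' "$((16#$hex))"   # interpret them as a base-16 number
}

# ost_name FSNAME INDEX: print the target name for a decimal index.
ost_name() {
    printf '%s-OST%04x\n' "$1" "$2"
}

ost_index iliad-OST0018   # prints 24
ost_index iliad-OST0019   # prints 25
ost_name iliad 24         # prints iliad-OST0018
```

Checking the mapping this way before running mkfs.lustre --replace may help avoid recreating a target under the wrong index.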

      Then on the MDS:

      lctl set_param osp.iliad-OST0019-osc-MDT0000.active=1
      lctl set_param osp.iliad-OST0018-osc-MDT0000.active=1
      lctl set_param lod.iliad-MDT0000-mdtlov.qos_threshold_rr=17
      lctl get_param osp.iliad-OST0019-osc-MDT0000.max_create_count
      lctl set_param osp.iliad-OST0019-osc-MDT0000.max_create_count=20000
      lctl set_param osp.iliad-OST0018-osc-MDT0000.max_create_count=20000
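
After re-enabling the OSTs, the settings can be read back on the MDS before any I/O is attempted. The sketch below only assumes the parameter names already used in the commands above. The functions osp_param and show_osp_state are hypothetical helpers, and show_osp_state does nothing useful on a host without lctl.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reading back the OSP settings changed above.

# osp_param FSNAME OST_HEX MDT_HEX FIELD: print an OSP parameter name in
# the same form as the lctl commands in this report.
osp_param() {
    printf 'osp.%s-OST%s-osc-MDT%s.%s\n' "$1" "$2" "$3" "$4"
}

# show_osp_state: print active and max_create_count for both replaced OSTs.
# Intended to run on the MDS; returns non-zero if lctl is not installed.
show_osp_state() {
    command -v lctl >/dev/null 2>&1 || { echo "lctl not found" >&2; return 1; }
    local ost field
    for ost in 0018 0019; do
        for field in active max_create_count; do
            lctl get_param "$(osp_param iliad "$ost" 0000 "$field")"
        done
    done
}

osp_param iliad 0018 0000 max_create_count
# prints osp.iliad-OST0018-osc-MDT0000.max_create_count
```

On the MDS, calling show_osp_state after the set_param commands would show whether the values took effect.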

      At this point, I noticed the inode count on each OST was 10191 and it wasn't increasing. I tried to copy a file but the command hung. I checked the status of the MDS with the following commands, ultimately remounting the ldiskfs MDT:

      lctl get_param osp.iliad-OST00*.active
      dmesg
      less /var/log/messages
      umount /mnt/meta
      dmesg | tail
      mount /mnt/meta
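
The hung test copy mentioned above tied up a shell. One way to probe for this kind of hang is to run the test command under the coreutils timeout utility, as sketched below. The function name probe is hypothetical. Note also that a process blocked in uninterruptible I/O may not respond to signals, so this wrapper is not guaranteed to return in every hang scenario.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command under a time limit and report the result.

# probe SECONDS CMD...: print "hung" if CMD exceeds the limit, else "finished".
probe() {
    local limit=$1 rc=0
    shift
    # Send TERM after $limit seconds, then KILL 5 seconds later if still alive.
    timeout -k 5 "$limit" "$@" || rc=$?
    # timeout exits 124 on expiry, or 137 (128+9) if KILL was required.
    if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
        echo "hung"
        return 1
    fi
    echo "finished"
    return 0
}

# Example with a stand-in command that outlives its limit:
probe 1 sleep 3 || true   # prints hung
```

Against the file system, the stand-in would be replaced by the real test, for example a cp of a small file onto the Lustre mount point.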

      Upon mount, the file system went into recovery, which completed in 20 seconds, and then began operating normally. The stack trace is attached, truncated to avoid redundant traces.
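
Recovery progress after a remount like this can also be read from the MDT's recovery_status parameter instead of being inferred from dmesg. The sketch below assumes the MDT is named iliad-MDT0000 (taken from the OSP parameter names above) and that recovery_status prints "key: value" lines such as status and recovery_duration. The helper recovery_field is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract one field from recovery_status text on stdin.
# Assumes input lines of the form "key: value".
recovery_field() {
    awk -v key="$1:" '$1 == key { print $2; exit }'
}

# On the MDS (only runs when lctl is installed):
if command -v lctl >/dev/null 2>&1; then
    lctl get_param -n mdt.iliad-MDT0000.recovery_status | recovery_field status
fi
```

Polling the status field until it stops reporting recovery in progress gives a clearer signal than waiting for clients to resume.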

      The file system appears to be fine and I am currently migrating files back onto the replaced OSTs.

          People

            Assignee: WC Triage (wc-triage)
            Reporter: Jesse Stroik (jstroik)