Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.10.3
-
None
-
CentOS 7.4 w/ Mellanox OFED 4.2-1.2.0.0
ZFS OSTs, ldiskfs MDT
-
3
Description
I recreated two OSTs on one of our file systems because corrupted ZFS metadata was affecting their spacemaps. Their files had already been migrated off, and both OSTs had been set to max_create_count=0 and deactivated ahead of the destroy and mkfs.lustre --replace.
I ran these commands on the OSS:
umount /mnt/lustre/local/iliad-OST0018
umount /mnt/lustre/local/iliad-OST0019
zpool list
zpool destroy iliad-ost18
zpool destroy iliad-ost19
zpool list
mkfs.lustre --fsname=iliad --ost --replace --backfstype=zfs --index=24 --mgsnode=172.16.25.4@o2ib iliad-ost18/ost18 /dev/mapper/mpathc
mkfs.lustre --fsname=iliad --ost --replace --backfstype=zfs --index=25 --mgsnode=172.16.25.4@o2ib iliad-ost19/ost19 /dev/mapper/mpathd
systemctl restart lustre
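For anyone cross-checking the mkfs.lustre commands above: Lustre target names carry the index as four hexadecimal digits, while --index was given in decimal here, so OST0018 and OST0019 correspond to 24 and 25. A minimal sketch of that check follows; the helper name ost_index_from_name is hypothetical and not part of any Lustre tool.

```shell
# Hypothetical helper (not a Lustre utility): convert a target name such as
# iliad-OST0018 into the decimal index passed to mkfs.lustre --index.
ost_index_from_name() {
    local name="$1"
    local hex="${name##*OST}"    # strip everything through "OST", leaving e.g. "0018"
    printf '%d\n' "0x${hex}"     # interpret the remainder as hexadecimal
}

ost_index_from_name iliad-OST0018   # prints 24
ost_index_from_name iliad-OST0019   # prints 25
```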
Then on the MDS:
lctl set_param osp.iliad-OST0019-osc-MDT0000.active=1
lctl set_param osp.iliad-OST0018-osc-MDT0000.active=1
lctl set_param lod.iliad-MDT0000-mdtlov.qos_threshold_rr=17
lctl get_param osp.iliad-OST0019-osc-MDT0000.max_create_count
lctl set_param osp.iliad-OST0019-osc-MDT0000.max_create_count=20000
lctl set_param osp.iliad-OST0018-osc-MDT0000.max_create_count=20000
At this point, I noticed that the inode count on each OST was 10191 and was not increasing. I tried to copy a file, but the command hung. I then checked the state of the MDS with the following commands and ultimately remounted the ldiskfs MDT:
lctl get_param osp.iliad-OST00*.active
dmesg
less /var/log/messages
umount /mnt/meta
dmesg | tail
mount /mnt/meta
Upon mount, the file system went into recovery, which completed in 20 seconds, and then began operating normally. The stack trace is attached, truncated to avoid redundant traces.
The file system appears to be fine and I am currently migrating files back onto the replaced OSTs.