CentOS 7.9 (3.10.0-1160.6.1.el7_lustre.pl1.x86_64)
We are seeing a weird problem on our Oak storage system where MDT0000 doesn't seem to be able to allocate new objects on specific OSTs after an MDT0000 failover. Lustre doesn't seem to complain about that in the logs (which is already a problem per se), but we can see that something is wrong, as prealloc_status is set to -11 (EAGAIN) for those:
While 388 other OSTs are fine:
Note: the prealloc_status values for MDT0001 and MDT0002, located on the same MDS, are all fine, which likely indicates that the OSTs themselves are fine:
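For reference, these statuses come from the osp proc interface on the MDS, queried along these lines (assuming fsname oak):

    # OSPs on MDT0000 whose preallocation is stuck in EAGAIN
    lctl get_param osp.*-osc-MDT0000.prealloc_status | grep '=-11'
    # same query with -osc-MDT0001 / -osc-MDT0002 to compare the other MDTs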
MDT0 prealloc info for a problematic OST:
osp stats for a problematic OST:
vs. a working one for comparison purposes:
State for the same problematic OST:
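Likewise, the per-OST details above were collected roughly as follows (OST index made up for illustration):

    # preallocation window details (status, next/last IDs, reserved)
    lctl get_param osp.oak-OST0013-osc-MDT0000.prealloc*
    # osp stats
    lctl get_param osp.oak-OST0013-osc-MDT0000.stats
    # import/connection state history
    lctl get_param osp.oak-OST0013-osc-MDT0000.state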
So far, we have no user reports of any issue with the filesystem, probably because Lustre does a good job of NOT using these OSTs for new objects, likely due to a proper check on prealloc_status. We only noticed the problem thanks to a micro-benchmark monitoring script that periodically performs some I/O on every OST of the filesystem by creating files on MDT0000:
We can see that setstripe does time out after 100s on these, which is Lustre's default timeout value (obd_timeout).
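For context, the probe is essentially equivalent to the sketch below; the fsname, path, and error handling are illustrative, not the real script:

    #!/bin/bash
    # Create a single-striped file on each OST from a directory on MDT0000
    # (e.g. created once with: lfs mkdir -i 0 /oak/admin/ostprobe).
    DIR=/oak/admin/ostprobe
    for idx in $(seq 0 463); do  # 464 OSTs (388 OK + 76 stuck), indices assumed contiguous
        f=$DIR/probe-$idx
        # setstripe creates the file with its single stripe forced onto OST $idx;
        # this is the call that hangs for ~100s when preallocation is stuck there
        if ! lfs setstripe -c 1 -i "$idx" "$f"; then
            echo "OST index $idx: setstripe failed" >&2
            continue
        fi
        dd if=/dev/zero of="$f" bs=1M count=1 oflag=direct status=none
        rm -f "$f"
    done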
It looks to me like MDT0000 is somehow not able to run the preallocation routines for these 76 OSTs, and they are stuck with this status. But nothing in the logs at the start of the MDT seems to indicate a problem. Just in case, I am attaching the Lustre logs from the start of both MDT0000 and MDT0003 (a manual failover done for a service maintenance on a server) as oak-md1-s1-lustre-mdt-failover.log, along with the Lustre debug log lustre-log.1676310742.6098.gz, which was dumped during the start. It's not clear from these logs that there was any preallocation problem, though; Lustre should have a better way of logging this when it happens.
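In case it helps anyone reviewing it, the binary dump converts to plain text with lctl df, e.g.:

    gunzip lustre-log.1676310742.6098.gz
    lctl df lustre-log.1676310742.6098 lustre-log.1676310742.6098.txt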
- I tried things like setting force_sync=1, or changing the value of max_create_count, to try to force a refresh of the object preallocation, to no avail (see the sketch after this list).
- Note that we do have the patch (available in 2.12.9):
LU-15009 ofd: continue precreate if LAST_ID is less on MDT
and I am wondering if that could be related, as the patch transforms a CERROR into an LCONSOLE(D_INFO, …) in some cases, which we would likely have missed since we don't run with +info (it's way too verbose). It's important enough that it should probably be logged as an error instead. But I'm really not sure this is related to our issue.
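For reference, here is roughly what I ran per stuck OSP in the first bullet above, without effect (OST index made up; 20000 being the default max_create_count I restored afterwards):

    # try to force the MDT to resynchronize with the OST
    lctl set_param osp.oak-OST0013-osc-MDT0000.force_sync=1
    # bounce max_create_count: 0 stops object creation, then restore the default
    lctl set_param osp.oak-OST0013-osc-MDT0000.max_create_count=0
    lctl set_param osp.oak-OST0013-osc-MDT0000.max_create_count=20000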
As for the server version we're running here: sorry, it's not ideal, but it's somewhere between 2.12.8 and 2.12.9, plus additional patches not in 2.12.x yet (only patches taken from Gerrit, though). For the list of patches we have added since 2.12.8, see git-patches.log.
Any ideas or suggestions of things to try before we attempt a stop/start of the impacted OSTs to force a reconnect? Or is there a way to force a reconnect to a specific OST for MDT0000 only?
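For instance, if reactivating just the OSP device on the MDS is enough to trigger a fresh connection, I was thinking of something like this (OST index made up; assuming deactivate/activate on the osp device does force a reconnect):

    # on the MDS currently running MDT0000
    lctl --device oak-OST0013-osc-MDT0000 deactivate
    lctl --device oak-OST0013-osc-MDT0000 activate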