It looks like there are a couple of different bugs being hit here:
- the OSTs are apparently reporting -ENOSPC when there is space available on them. Is this possibly a case where the OST is reporting -ENOSPC during precreate when there are free inodes, but there are no free blocks? What is the "lfs df" and "lfs df -i" output from the filesystem? If this state is hit again, please also collect the output of "lctl get_param osp.*.sync_* osp.*.create_count" on the MDS.
- the MDS goes into a busy loop trying to create objects on the OSTs, rather than returning -ENOSPC to the client. While it is good to ensure that the OSTs have actually been tried for creates, it doesn't make sense to retry precreate so often if the OST is already out of space.
There was previously debugging code in osp_precreate_reserve() because I was worried about exactly this kind of situation, but my recent patch http://review.whamcloud.com/6219 (commit dc2bcafd2a0b) removed it before the 2.4.0 release because it would have LBUG'd in this case:
@@ -1062,17 +1061,6 @@ int osp_precreate_reserve(const struct lu_env *env, struct osp_device *d)
while ((rc = d->opd_pre_status) == 0 || rc == -ENOSPC ||
rc == -ENODEV || rc == -EAGAIN) {
-#if LUSTRE_VERSION_CODE < OBD_OCD_VERSION(2, 3, 90, 0)
- /*
- * to address Andreas's concern on possible busy-loop
- * between this thread and osp_precreate_send()
- */
- if (unlikely(count++ == 1000)) {
- osp_precreate_timeout_condition(d);
- LBUG();
- }
-#endif
-
/*
* increase number of precreations
*/
It probably makes sense to reinstate this check, but have it loop only, say, twice in the -ENOSPC case, since there are other OSTs that could be used and it doesn't make sense to block the MDS thread for such a long time; see the sketch below.
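Something along these lines, purely as a sketch (the enospc_retries counter and the early break are my additions, not code from the tree, and the loop body is elided):

	int enospc_retries = 0;

	while ((rc = d->opd_pre_status) == 0 || rc == -ENOSPC ||
	       rc == -ENODEV || rc == -EAGAIN) {
		/* don't spin if the OST keeps reporting -ENOSPC; give up
		 * after a couple of iterations so the MDS thread is not
		 * blocked and the allocator can try a different OST */
		if (rc == -ENOSPC && ++enospc_retries > 2)
			break;
		:
		:
	}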
It seems to me that the update of opd_pre_status at the start of osp_pre_update_status() is racy. If rc == 0 (i.e. the statfs succeeded) then opd_pre_status is set to 0, even though it may be set back to -ENOSPC later in the same function. It would be better to have something like:
	if (rc != 0) {
		d->opd_pre_status = rc;
		goto out;
	}
Otherwise, opd_pre_status can transiently be changed to 0 and break the -ENOSPC checking in osp_precreate_ready_condition().
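For illustration, the current ordering is presumably something like the following (my paraphrase of the logic, not a verbatim quote), which opens a window where a waiter can observe status 0 while the OST is still out of space:

	d->opd_pre_status = rc;	/* rc == 0 if the statfs RPC succeeded */
	if (rc)
		goto out;
	/* window: osp_precreate_ready_condition() can see
	 * opd_pre_status == 0 here, even though the free-space check
	 * below may set it back to -ENOSPC */
	if (msfs->os_bavail < used)
		d->opd_pre_status = -ENOSPC;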
It also seems like the -ENOSPC hysteresis that is so well described in the comment in osp_pre_update_status() is not actually implemented. When the free space is < 0.1%, opd_pre_status is set to -ENOSPC, but it is immediately cleared again as soon as the free space is >= 0.1%. There also seems to be a longstanding bug in the code with the min(used blocks, 1 GByte) calculation. It seems something like the following is needed:
	/*
	 * On a very large disk, say 16TB, 0.1% will be 16 GB. We don't want
	 * to lose that much space, so in those cases we report no space left
	 * if there is less than 1 GB left, and only clear the -ENOSPC state
	 * again once there is more than 2 GB free.
	 */
	if (likely(msfs->os_type != 0)) {
		used = min_t(__u64, (msfs->os_blocks - msfs->os_bfree) >> 10,
			     (1ULL << 30) / msfs->os_bsize); /* 1 GB in blocks */
		if (d->opd_pre_status == 0 &&
		    (msfs->os_ffree < 32 || msfs->os_bavail < used)) {
			d->opd_pre_status = -ENOSPC;
			:
			:
		} else if (d->opd_pre_status == -ENOSPC &&
			   msfs->os_ffree > 64 && msfs->os_bavail > used * 2) {
			d->opd_pre_status = 0;
			:
			:
		}
	}
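To make the clamp concrete, assuming a 4 KiB block size (os_bsize = 4096): the 1 GB cap is (1ULL << 30) / 4096 = 262144 blocks. A nearly-full 16 TB OST has about 4.3 billion blocks, so 0.1% of the used blocks is roughly 4.2 million blocks (~16 GB), and min_t() clamps the threshold down to 262144 blocks (1 GB). The OST is then marked -ENOSPC below 1 GB available, and the state is only cleared again above 2 GB.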
The opd_pre_status = 0 state should only be set if there is actually enough free space, not merely because the statfs RPC succeeded.
James, how many clients were mounting this filesystem? If each OST is 250GB, and each client gets a 32MB grant, that means 32 clients/GB of free space, so 8000 clients would essentially pin all of the available space on each OST. I see something a bit strange in the code that might be causing a problem here:
This is subtracting the granted space from the available space returned to the MDS, but I think it should instead be adding the granted space back into os_bavail, so that the MDS does not consider the grant space as "used". Otherwise, if the clients have reserved a lot of space on the OSTs, they may never actually get to use it, because the MDS will never allocate new objects there.
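Roughly, in the OST statfs path, the difference looks like this (a sketch only; tot_granted stands in for however the total outstanding grant is tracked, and blockbits for log2 of the block size, so neither name is quoted from the tree):

	/* current behaviour (problematic): space reserved by client grants
	 * is subtracted, so the MDS sees it as already consumed */
	osfs->os_bavail -= min_t(u64, osfs->os_bavail,
				 tot_granted >> blockbits);

	/* suggested behaviour: add the granted-but-unused space back in,
	 * capped at os_bfree, so the MDS still allocates objects on OSTs
	 * whose free space is merely reserved by client grants */
	osfs->os_bavail = min_t(u64, osfs->os_bfree,
				osfs->os_bavail + (tot_granted >> blockbits));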
A secondary issue is that there is no coordination between the space granted to a specific client and the objects that the MDS allocates for that client, which becomes more important as the free space runs out. There could be some kind of (IMHO complex) coordination between the MDS and the clients/OSTs here, but I think it would be easier to just get the grant shrinking code working again, since there is no guarantee that (a) clients doing IO have any grant at all, and (b) clients hold grant on the OSTs to which they have been asked to write. Returning unused grant to the OSTs as the free space shrinks is the best way to ensure that some space is left for the clients actually doing IO.