[LU-12025] Adding OST may cause EIO - delay activation of new OSTs on existing filesystem Created: 27/Feb/19 Updated: 19/Nov/23 Resolved: 08/Nov/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Di Wang | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
During OST addition, So it probably needs a version added to the config log to prevent this from happening. This happens quite often in cloud environments, though I am not sure whether there is already a duplicate ticket. |
| Comments |
| Comment by Andreas Dilger [ 27/Feb/19 ] |
|
I saw another related comment recently about the desire to be able to configure new OSTs on the filesystem, but not have them immediately active on the MDS until the administrator wants them enabled. I suspect there are a number of simple mechanisms that might allow this to happen:
Allowing both modes (automatic activation after timeout, wait for manual activation) may be useful under different circumstances. |
| Comment by Di Wang [ 27/Feb/19 ] |
|
My initial thought is that the MGS maintains a version number, which is bumped when a new target is added (or removed?); all other nodes then get the version number when they fetch the config log from the MGS. When a client sends a request to a server and their version numbers do not match, either the client or the server needs to refresh its configuration from the MGS. Going even further, each server target could also maintain its own version number (set when it is added to the MGS), and the target could then process a request as long as the request's version number is newer than the target's version. |
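A minimal sketch of the version comparison proposed above, for illustration only. All names (`sk_cfg_check`, `sk_tgt_accepts`, the `sk_cfg_action` enum) are hypothetical and do not exist in Lustre. The sketch assumes the side with the older version is the one that refreshes, and that a request carrying the same version as the target is acceptable; neither detail is stated in the comment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome of comparing config versions on a request. */
enum sk_cfg_action {
	SK_CFG_PROCEED,		/* versions match, handle the request */
	SK_CFG_CLIENT_REFRESH,	/* client is stale, refetch config from MGS */
	SK_CFG_SERVER_REFRESH,	/* server is stale, refetch config from MGS */
};

/* Global scheme: one MGS version, compared on every request.
 * Assumption: whichever side holds the older version refreshes. */
static enum sk_cfg_action sk_cfg_check(uint64_t client_ver, uint64_t server_ver)
{
	if (client_ver == server_ver)
		return SK_CFG_PROCEED;
	return client_ver < server_ver ? SK_CFG_CLIENT_REFRESH
				       : SK_CFG_SERVER_REFRESH;
}

/* Per-target scheme: the target remembers the version at which it was
 * added and accepts any request that is not older than that.
 * Assumption: an equal version counts as new enough. */
static bool sk_tgt_accepts(uint64_t req_ver, uint64_t tgt_added_ver)
{
	return req_ver >= tgt_added_ver;
}
```

The per-target variant avoids a refresh when an unrelated target is added, at the cost of tracking one version per target.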
| Comment by Andreas Dilger [ 27/Feb/19 ] |
|
It isn't clear that having a per-target version number would help. The target being added (OST) can provide a version number, but the client shouldn't have to pass its "connect/config version" for every OST it knows about to the MDS for every file it is creating. Sending a single "config record version" (really just the last MGS config record number that the client processed) from the client to the MDS with each create would be more useful, since this would (indirectly) tell the MDS which OSTs the client is connected to and it could skip ones that were added after that version. Something like storing a "minimum config record number" on each target in LOD which is the config llog record in which the OST was added, and the client request would include their "current config record number" along with each request. A check like:
if (req->current_config_rec < lod->target[ost_idx]->tgt_min_config_rec)
	continue;
during create would be enough to skip the OST for that create. HOWEVER this has some significant drawbacks:
So, my suggestion for a simpler solution is just to use a timeout (maybe variable), based on how quickly config llog records are processed by clients, before an OST (or MDT, for DNE) can be used for new file allocations. There is already a small delay before the OST could be used, because the MDS needs to precreate objects there, but that is only a fraction of a second in most cases. Instead of storing the "config version" in the LOD target, store the "config time" for the target, and skip it for new allocations for e.g. 10s after it connects. This should handle the case where the MDS itself was just mounted and all OSTs are pre-existing (e.g. only delay usage if the lov_objids entry was just added). |
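The timeout approach suggested above could look roughly like the following sketch. It is not the merged patch: the struct, its fields, and the helper name are hypothetical stand-ins for the real LOD target state, and the fixed 10-second delay is just the example value from the comment.

```c
#include <stdbool.h>
#include <time.h>

/* Example delay from the comment above ("e.g. 10s"), in seconds. */
#define SK_ACTIVATE_DELAY 10

/* Hypothetical stand-in for per-target state kept by the MDS. */
struct sk_tgt {
	time_t	st_connect_time;	/* "config time": when the target connected */
	bool	st_newly_added;		/* true only if its lov_objids entry was just added */
};

/* Return true if the target may be used for new object allocations. */
static bool sk_tgt_usable(const struct sk_tgt *tgt, time_t now)
{
	/* Pre-existing OSTs (e.g. after an MDS remount) are never delayed. */
	if (!tgt->st_newly_added)
		return true;

	/* Newly added OSTs are skipped until the delay has elapsed. */
	return now - tgt->st_connect_time >= SK_ACTIVATE_DELAY;
}
```

An allocator loop would call a check like `sk_tgt_usable()` for each candidate OST and `continue` past those that are still inside the delay window, in the same way as the config-record check shown earlier.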
| Comment by Di Wang [ 27/Feb/19 ] |
"it would only fix the problem of one client creating and using a file itself, but would not fix the problem of a different client trying to access that file before it had processed the config llog updates (which may be delayed tens of seconds if there are many clients)"
Oh, I mean the stale client should not be allowed to access the "newer" server until it refreshes its own config from the MGS, but this indeed needs a lot of changes and is probably not worth it for this, as you said. Adding a timeout is probably good enough here. Thanks. |
| Comment by Andreas Dilger [ 27/Feb/19 ] |
|
In relation to The code in osp_pre_update_status() needs to be cleaned up to avoid clearing the OS_STATE_NOPRECREATE flag (and probably OS_STATE_ENOSPC and OS_STATE_ENOINO, so the OST could send these itself as well). The obd_statfs structure is refreshed every few seconds, so the MDS shouldn't need to clear the flags it sets itself. Also, it appears there is a bit of a race in osp_statfs_interpret() calling osp_pre_update_status():
d->opd_statfs = *msfs;
if (d->opd_pre)
	osp_pre_update_status(d, rc);
since there is a window between when the new *msfs is stored in opd_statfs that is returned by osp_statfs(), and when osp_pre_update_status() might set the various OS_STATE_* flags that the create threads check. That may allow file creation on an OST that should otherwise be unavailable (out of space, disabled, etc). Instead, the new *msfs should be passed as an argument to osp_pre_update_status() and updated before it is stored into opd_statfs. |
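The reordering described above can be sketched as follows. This is an illustrative stand-alone model, not the Lustre code: the `sk_`-prefixed types, the flag values, and the `opd_reserved_blocks` threshold are all assumptions, and real kernel code would also need appropriate locking, since a plain struct assignment is not atomic. As a simplification, a failed statfs reply leaves the old state in place.

```c
#include <stdint.h>

/* Hypothetical flag values; the real OS_STATE_* constants differ. */
#define SK_STATE_NOPRECREATE	0x1u
#define SK_STATE_ENOSPC		0x2u

/* Hypothetical, reduced stand-ins for obd_statfs and osp_device. */
struct sk_statfs {
	uint64_t os_bavail;	/* available blocks reported by the OST */
	uint32_t os_state;	/* state flags, possibly set by the OST */
};

struct sk_osp_device {
	struct sk_statfs opd_statfs;		/* what create threads read */
	int		 opd_pre;		/* precreation enabled */
	uint64_t	 opd_reserved_blocks;	/* assumed low-space threshold */
};

/* Set MDS-side flags on the caller's private copy. Flags sent by the
 * OST itself (e.g. NOPRECREATE) are left alone rather than cleared. */
static void sk_pre_update_status(const struct sk_osp_device *d,
				 struct sk_statfs *msfs)
{
	if (msfs->os_bavail < d->opd_reserved_blocks)
		msfs->os_state |= SK_STATE_ENOSPC;
}

/* Update the flags first, then publish the finished copy once, so
 * readers never see new statfs data with not-yet-updated flags. */
static void sk_statfs_interpret(struct sk_osp_device *d,
				const struct sk_statfs *reply, int rc)
{
	struct sk_statfs msfs;

	if (rc != 0)
		return;		/* simplification: keep the previous state */

	msfs = *reply;
	if (d->opd_pre)
		sk_pre_update_status(d, &msfs);
	d->opd_statfs = msfs;
}
```

Because each reply starts from the flags the OST sent, a flag the MDS set on a previous reply disappears on its own once the condition no longer holds, which is why no explicit clearing is needed.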
| Comment by Patrick Farrell (Inactive) [ 28/Feb/19 ] |
|
I think the request from |
| Comment by Gerrit Updater [ 01/Jun/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35029 |
| Comment by Gerrit Updater [ 22/Oct/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35029/ |
| Comment by Andreas Dilger [ 08/Nov/19 ] |
|
This allows the OST to completely disable itself from precreation without having to hack around in the state. Using "degraded" is only partially disabled (can still be used in emergency), and faking "out of space" is a hack. In |
| Comment by Gerrit Updater [ 26/Nov/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36872 |
| Comment by Gerrit Updater [ 12/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36872/ |