Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.5.0, Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0
-
None
-
3
Description
The problem was discovered while testing OST failover. An OST pool with 10 OSTs was created and a stripe count of -1 was assigned to it.
Half of the OSTs (those with even indexes) failed during create.
Object creation was blocked in several places, sometimes after an object had already been reserved on a failed OST. In that case the OSP threads could not start orphan deletion, because the allocation was holding reserved objects and could not release the reservation while it was blocked waiting for recovery on the next assigned OST. With several object allocations running in parallel, the MDT reached a state in which each failed OST had its own reserved object, and object allocation stayed blocked for a long time, especially when it had to wait for every OSP timeout (each one obd_timeout) to expire. This can take a very long time: half an hour or even a full hour.
This bug was introduced as a regression by the move from LOV to LOD on the MDT side.
Original ticket is https://projectlava.xyratex.com/show_bug.cgi?id=18357
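The blocking pattern described above can be illustrated with a small toy model. This is only a sketch and not Lustre code: the class names, the `available`/`reserved` counters, the lockstep scheduling, and the rule that orphan cleanup waits until an OST has no reservations are all assumptions made to mirror the description (10 OSTs, even indexes failed, several allocations running in parallel).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Ost:
    index: int
    failed: bool
    available: int      # precreated objects not yet reserved (hypothetical counter)
    reserved: int = 0   # objects held by in-flight allocations


@dataclass
class Allocation:
    start: int                                  # first OST this allocation visits
    held: List[int] = field(default_factory=list)
    waiting_on: Optional[int] = None            # OST index it is blocked on


def step(alloc: Allocation, osts: List[Ost]) -> bool:
    """Advance one allocation by one OST; return False once it blocks or finishes."""
    if alloc.waiting_on is not None or len(alloc.held) == len(osts):
        return False
    ost = osts[(alloc.start + len(alloc.held)) % len(osts)]
    if ost.available == 0:
        # Blocks here while still holding every earlier reservation.
        alloc.waiting_on = ost.index
        return False
    ost.available -= 1
    ost.reserved += 1
    alloc.held.append(ost.index)
    return True


def run_in_lockstep(allocs: List[Allocation], osts: List[Ost]) -> None:
    """Interleave the allocations one step at a time to imitate parallel creates."""
    progressing = True
    while progressing:
        progressing = False
        for alloc in allocs:
            if step(alloc, osts):
                progressing = True


def blocked_cleanups(osts: List[Ost]) -> List[int]:
    """Failed OSTs whose orphan cleanup cannot start (assumed rule: reserved > 0)."""
    return [o.index for o in osts if o.failed and o.reserved > 0]


def build_scenario():
    # Even-indexed OSTs have failed with one precreated object left each;
    # odd-indexed OSTs are healthy and have plenty of objects.
    osts = [Ost(i, failed=(i % 2 == 0), available=1 if i % 2 == 0 else 100)
            for i in range(10)]
    allocs = [Allocation(start=i) for i in range(0, 10, 2)]
    run_in_lockstep(allocs, osts)
    return osts, allocs
```

In this toy run every failed OST ends up holding exactly one reservation, and every allocation is blocked on a different failed OST, so under the assumed rule no orphan cleanup can start until something times out.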
Issue Links
- is related to
  - LU-9498 osp_precreate_get_fid()) ASSERTION( osp_fid_diff(&d->opd_pre_used_fid, &d->opd_pre_last_created_fid) < 0 ) failed: next fid [0x680000402:0x25de031:0x0] last created fid [0x680000402:0x25de031:0x0] (Resolved)
  - LU-10336 osp: wakeup opd_pre_waitq when decrement opd_pre_reserved (Resolved)
  - LU-16425 Interop recovery-small test_144a: MDT failover took 252 seconds (Resolved)
  - LU-9285 revert LU-8367 and LU-8972 (Resolved)