[LU-8367] delete orphan phase isn't started for multistriped file Created: 05/Jul/16  Updated: 24/Nov/23  Resolved: 03/Jan/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Alexey Lyashkov Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Attachments: HTML File test1    
Issue Links:
Blocker
Duplicate
Related
is related to LU-9498 osp_precreate_get_fid()) ASSERTION( o... Resolved
is related to LU-10336 osp: wakeup opd_pre_waitq when decrem... Resolved
is related to LU-16425 Interop recovery-small test_144a: MDT... Resolved
is related to LU-9285 revert LU-8367 and LU-8972 Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The problem was discovered while testing OST failovers. An OST pool with 10 OSTs was created and a striping of -1 (stripe over all OSTs) was assigned to it.
Half of the OSTs (the even indexes) failed during create.
Object creation was blocked in several places, sometimes after reserving an object on a failed OST. In that case the OSP threads were blocked from starting the delete-orphans phase: an allocation held some reserved objects and could not release that reservation because it was blocked waiting for recovery on the next assigned OST. With several object allocations running in parallel, the MDT ended up in a situation where each failed OST had its own reserved object and object allocation was blocked for a long time, especially until all OSP timeouts (obd_timeout each) had expired. That can take a large amount of time - half an hour to a full hour.
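
As an illustration only, here is a minimal userspace model of that interaction (hypothetical names, pthread-based; this is not the actual OSP code): a create thread reserves an object on one target and can only drop the reservation once the allocation as a whole makes progress, while the orphan-cleanup thread on that target has to wait for all reservations to be released.

#include <pthread.h>
#include <stdbool.h>

/* stands in for the per-OST precreate state kept in struct osp_device */
struct osp_model {
        pthread_mutex_t lock;
        pthread_cond_t  waitq;         /* models opd_pre_waitq */
        int             reserved;      /* models opd_pre_reserved */
        bool            recovery_done; /* models opd_recovery_completed */
};

/* MDT create thread: take a reservation on this OST ... */
void reserve_object(struct osp_model *osp)
{
        pthread_mutex_lock(&osp->lock);
        osp->reserved++;
        pthread_mutex_unlock(&osp->lock);
}

/* ... and drop it only once the allocation as a whole makes progress.
 * If the next OST in the layout has failed, this call is delayed for
 * obd_timeout or more. */
void release_object(struct osp_model *osp)
{
        pthread_mutex_lock(&osp->lock);
        osp->reserved--;
        pthread_cond_broadcast(&osp->waitq);
        pthread_mutex_unlock(&osp->lock);
}

/* OSP precreate thread: orphan cleanup may only run with no outstanding
 * reservations, so it stays blocked while creates are stuck waiting for
 * recovery on other OSTs. */
void cleanup_orphans(struct osp_model *osp)
{
        pthread_mutex_lock(&osp->lock);
        while (osp->reserved != 0 || !osp->recovery_done)
                pthread_cond_wait(&osp->waitq, &osp->lock);
        pthread_mutex_unlock(&osp->lock);
        /* ... now safe to delete orphan objects and resume precreation ... */
}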

This bug was introduced as a regression when LOV was moved to LOD on the MDT side.
The original ticket is https://projectlava.xyratex.com/show_bug.cgi?id=18357



 Comments   
Comment by Alexey Lyashkov [ 05/Jul/16 ]

The problem is easily reproduced with the attached test.

Comment by Alexey Lyashkov [ 05/Jul/16 ]

The main problem is not the delete-orphan start itself, but that the recovery block stops any allocation until delete-orphan has finished, so a single failed create blocks all allocations.

Comment by Gerrit Updater [ 22/Jul/16 ]

Alexander Boyko (alexander.boyko@seagate.com) uploaded a new patch: http://review.whamcloud.com/21483
Subject: LU-8367 test: long waiting for multistripe file creation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d23f8ae00fc85a7fe070a72f35cc90a1f036afb3

Comment by Sergey Cheremencev [ 09/Sep/16 ]

The suggested solution http://review.whamcloud.com/#/c/21785/4 doesn't solve the issue when the reconnection is caused by an OST failover.
I changed the reproducer http://review.whamcloud.com/21483 (patchset 4). Now it shows osp_precreate_cleanup_orphans hanging after OST failover.

Comment by Tejas Bhise (Inactive) [ 05/Oct/16 ]

Hi Peter, Vitaly,

I think this one is also stuck due to a mismatch in expectations. Is it possible to quickly summarize the expectations from each team so we can try to move this forward?

Regards,
Tejas.

Comment by Alex Zhuravlev [ 05/Oct/16 ]

We discussed the issue at LAD; I've been working on the solution.

Comment by Tejas Bhise (Inactive) [ 05/Oct/16 ]

Great!! .. Thanks ..

Comment by Alex Zhuravlev [ 21/Oct/16 ]

http://review.whamcloud.com/#/c/23168/

Comment by Alexey Lyashkov [ 27/Oct/16 ]

Could you explain why this hack is better than disabling the delete-orphan phase except during the MDT recovery phase?

Comment by Gerrit Updater [ 23/Dec/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23168/
Subject: LU-8367 osp: do not block orphan cleanup
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2ce0d5b0640e3e440822080e407eee1ce1cafd75

Comment by Peter Jones [ 23/Dec/16 ]

Landed for 2.10

Comment by Gerrit Updater [ 10/Mar/17 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/25925
Subject: Revert "LU-8367 osp: do not block orphan cleanup"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 27bdaf74982212f9441cefd28cb533289c0f7bfe

Comment by Gerrit Updater [ 09/May/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25926/
Subject: LU-8367 osp: orphan cleanup do not wait for reserved
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1b3028ab142a1f605e6274a6019bb39d89ae8d25

Comment by Peter Jones [ 13/Jun/17 ]

Reverted from master under LU-9498

Comment by Alexander Boyko [ 21/Feb/22 ]

This issue is still present with Lustre 2.15. We have a vmcore with an MDT failover problem, and the root cause analysis brings me to this issue. The test did not include massive failover as in the description, only one-by-one random node failover/failback during IO.

[18971.201645] LustreError: 0-0: Forced cleanup waiting for mdt-kjcf05-MDT0001_UUID namespace with 46 resources in use, (rc=-110)

MDT01 creation threads hung on lq_rw_sem; 27 threads had been sleeping for about 1.5 hours:
[0 01:29:59.961] [UN]  PID: 49816  TASK: ffff98c656af17c0  CPU: 5   COMMAND: "mdt01_000"
lod_ost_alloc_rr held the semaphore for read in 3 threads:
[0 00:05:00.729] [ID]  PID: 56895  TASK: ffff98c67f1217c0  CPU: 22  COMMAND: "mdt05_007"
They woke up every obd_timeout (5 minutes) and did not get any further because there were no objects.
The following message repeated over 4000+ seconds.

00000004:00080000:7.0:1644534589.764767:0:90378:0:(osp_precreate.c:1529:osp_precreate_reserve()) kjcf05-OST0003-osc-MDT0001: slow creates, last=[0x340000400:0x23a4f483:0x0], next=[0x340000400:0x23a4f378:0x0], reserved=267, sync_changes=1, sync_rpcs_in_progress=0, status=-19 
....
00000004:00080000:3.0:1644539373.892843:0:84791:0:(osp_precreate.c:1529:osp_precreate_reserve()) kjcf05-OST0003-osc-MDT0001: slow creates, last=[0x340000400:0x23a4f483:0x0], next=[0x340000400:0x23a4f378:0x0], reserved=267, sync_changes=0, sync_rpcs_in_progress=0, status=0

During this time period MDT0001 successfully reconnected to OST0003 and completed recovery four times, but objects were not created.

00000100:02000000:26.0:1644534989.327955:0:49612:0:(import.c:1642:ptlrpc_import_recovery_state_machine()) kjcf05-OST0003-osc-MDT0001: Connection restored to 10.16.100.58@o2ib (at 10.16.100.58@o2ib)
00000100:02000000:26.0:1644536000.579884:0:49612:0:(import.c:1642:ptlrpc_import_recovery_state_machine()) kjcf05-OST0003-osc-MDT0001: Connection restored to 10.16.100.60@o2ib (at 10.16.100.60@o2ib)
00000100:02000000:27.0:1644537071.966959:0:49612:0:(import.c:1642:ptlrpc_import_recovery_state_machine()) kjcf05-OST0003-osc-MDT0001: Connection restored to 10.16.100.58@o2ib (at 10.16.100.58@o2ib)
00000100:02000000:24.0:1644538459.546055:0:49612:0:(import.c:1642:ptlrpc_import_recovery_state_machine()) kjcf05-OST0003-osc-MDT0001: Connection restored to 10.16.100.60@o2ib (at 10.16.100.60@o2ib)

osp_precreate_thread sleeps here

PID: 49938  TASK: ffff98c63a248000  CPU: 30  COMMAND: "osp-pre-3-1"
 #0 [ffffa545b38efd48] __schedule at ffffffff8e54e1d4
 #1 [ffffa545b38efde0] schedule at ffffffff8e54e648
 #2 [ffffa545b38efdf0] osp_precreate_cleanup_orphans at ffffffffc17d00e9 [osp]
 #3 [ffffa545b38efe70] osp_precreate_thread at ffffffffc17d18da [osp]

The wait is

wait_event_idle(d->opd_pre_waitq,
                        (!d->opd_pre_reserved && d->opd_recovery_completed) ||
                        !d->opd_pre_task || d->opd_got_disconnected);

opd_pre_reserved = 0x10b,

opd_pre_reserved was not decreased to 0, and osp_precreate_thread was blocked in osp_precreate_cleanup_orphans().
The hung component is overstriped with a stripe count of 2000.

crash> lod_layout_component 0xffff98c68f4a8100
struct lod_layout_component {
  llc_extent = {
    e_start = 536870912,
    e_end = 536870912
  },
  llc_id = 4,
  llc_flags = 0,
  llc_stripe_size = 1048576,
  llc_pattern = 513,
  llc_layout_gen = 0,
  llc_stripe_offset = 65535,
  llc_stripe_count = 2000,
  llc_stripes_allocated = 0,
  llc_timestamp = 0,
  llc_pool = 0xffff98c60eef5980 "flash",
  llc_ostlist = {

bzzz ^

Comment by Alex Zhuravlev [ 22/Feb/22 ]

The root cause is object creation holding some objects reserved, for a simple reason: orphan cleanup doesn't know which object to start the cleanup from. In turn, object reservation is required to avoid waiting for an object within a local transaction, which would be bad.
Probably object precreation should be fixed to handle overstriping "faster" when some OSTs are missing.
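
To illustrate the overstriping angle, here is a hypothetical sketch of a wide-stripe allocation loop (reusing the reserve_object()/release_object() helpers from the model in the Description above; this is not the real lod_ost_alloc_*() code): a 2000-stripe allocation takes one reservation per stripe up front and only drops them once the whole layout has been walked, so a single missing OST keeps opd_pre_reserved non-zero on the healthy targets for the entire wait.

/* hypothetical wide-stripe allocation sketch, not the LOD allocator */
int alloc_widestripe(struct osp_model *osts, int nr_osts, int stripe_count)
{
        int i;

        for (i = 0; i < stripe_count; i++)
                /* may block for obd_timeout (5 min), repeatedly, when the
                 * chosen OST is down or still in recovery */
                reserve_object(&osts[i % nr_osts]);

        /* the reservations are consumed or released only here, after the
         * whole layout has been handled; until then every healthy OST in
         * the layout keeps opd_pre_reserved > 0 and its orphan cleanup
         * stays blocked */
        for (i = 0; i < stripe_count; i++)
                release_object(&osts[i % nr_osts]);

        return 0;
}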

Comment by Gerrit Updater [ 08/Aug/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46889/
Subject: LU-8367 osp: enable replay for precreation request
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 63e17799a369e2ff0b140fd41dc5d7d8656d2bf0

Comment by Gerrit Updater [ 14/Nov/22 ]

"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49151
Subject: LU-8367 osp: detect reformatted OST for FID_SEQ_NORMAL
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e39c59d6845b0bff7458fc996bdf02f9f62980a7

Comment by Gerrit Updater [ 03/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49151/
Subject: LU-8367 osp: wait for precreate on reformatted OST
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e06b2ed956f45feffa3adc7e2e7399ab737b37be

Comment by Peter Jones [ 03/Jan/23 ]

Landed for 2.16

Comment by Gerrit Updater [ 05/Apr/23 ]

"Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50543
Subject: LU-8367 osp: unused fail_locs from sanity-27S
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 188103c3203355df222e7b43a45e02405bd8fe4a

Comment by Gerrit Updater [ 22/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50543/
Subject: LU-8367 osp: remove unused fail_locs from sanity/27S,822
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e69eea5f60eec17ac32cea8d2a60768e0738a052
