Lustre / LU-8367

delete orphan phase isn't started for multistriped file

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0
    • Labels: None
    • Severity: 3
    • 9223372036854775807

    Description

      The problem was discovered while testing OST failovers. An OST pool with 10 OSTs was created and a stripe count of -1 was assigned to it.
      Half of the OSTs (those with even indexes) failed during create.
      Object creation was blocked in several places, sometimes after reserving an object on a failed OST. In that case the OSP threads could not start the delete-orphans phase, because the allocation held some reserved objects and could not release those reservations while it was blocked waiting for recovery on the next assigned OST. With several object allocations running in parallel, the MDT ended up in a situation where each failed OST had its own reserved object and object allocation stayed blocked for a long time, especially until all the OSP timeouts (obd_timeout each) had expired. That can take a large amount of time - half an hour or a full hour.

      This bug was introduced as a regression when LOV functionality was moved to LOD on the MDT side.
      The original ticket is https://projectlava.xyratex.com/show_bug.cgi?id=18357
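
      To make the circular dependency concrete, here is a minimal, self-contained sketch (ost_t and can_start_orphan_cleanup() are invented names for illustration, not the real MDT/OSP code): each in-flight create holds a reservation on one failed OST while it waits for precreation on the next one, and the OSP precreate thread will not start the delete-orphans phase while its reservation count is non-zero.

      /* Hypothetical illustration only; compare the real wait in
       * osp_precreate_cleanup_orphans(). */
      #include <stdbool.h>
      #include <stdio.h>

      #define NUM_OSTS 10

      typedef struct {
          int  reserved;            /* objects reserved by in-flight creates */
          bool recovery_completed;  /* recovery with the MDT has finished    */
      } ost_t;

      /* Orphan cleanup (and with it precreation) may only start once nothing
       * is reserved and recovery is done. */
      static bool can_start_orphan_cleanup(const ost_t *ost)
      {
          return ost->reserved == 0 && ost->recovery_completed;
      }

      int main(void)
      {
          ost_t ost[NUM_OSTS] = { { 0 } };
          int i;

          /* Even-indexed OSTs failed during create and later finished
           * recovery after the MDT reconnected. */
          for (i = 0; i < NUM_OSTS; i += 2)
              ost[i].recovery_completed = true;

          /* A stripe-count -1 create reserves an object on each OST in the
           * pool; the creates end up blocked waiting for precreation on one
           * failed OST while still holding reservations on the others. */
          for (i = 0; i < NUM_OSTS; i += 2)
              ost[i].reserved++;

          /* No failed OST can start orphan cleanup, which is exactly what
           * would let the blocked creates make progress again. */
          for (i = 0; i < NUM_OSTS; i += 2)
              printf("OST%04x: cleanup can start: %s (reserved=%d)\n", i,
                     can_start_orphan_cleanup(&ost[i]) ? "yes" : "no",
                     ost[i].reserved);
          return 0;
      }

      The same condition shows up later in this ticket as opd_pre_reserved staying non-zero while osp_precreate_cleanup_orphans() waits.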

      Attachments

        Issue Links

          Activity

            [LU-8367] delete orphan phase isn't started for multistriped file

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50543/
            Subject: LU-8367 osp: remove unused fail_locs from sanity/27S,822
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e69eea5f60eec17ac32cea8d2a60768e0738a052


            "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50543
            Subject: LU-8367 osp: unused fail_locs from sanity-27S
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 188103c3203355df222e7b43a45e02405bd8fe4a

            pjones Peter Jones added a comment -

            Landed for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49151/
            Subject: LU-8367 osp: wait for precreate on reformatted OST
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e06b2ed956f45feffa3adc7e2e7399ab737b37be


            "Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49151
            Subject: LU-8367 osp: detect reformatted OST for FID_SEQ_NORMAL
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e39c59d6845b0bff7458fc996bdf02f9f62980a7


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46889/
            Subject: LU-8367 osp: enable replay for precreation request
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 63e17799a369e2ff0b140fd41dc5d7d8656d2bf0


            bzzz Alex Zhuravlev added a comment -

            The root cause is that object creation was holding some objects reserved, for the simple reason that orphan cleanup does not know which object to start cleanup from. In turn, object reservation is required to avoid waiting for an object within a local transaction, which would be bad.
            Probably object precreation should be fixed to handle overstriping "faster" when some OSTs are missing.

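            For context on the reservation mentioned above: objects are reserved on the OSP before the local MDT transaction is opened, precisely so the transaction itself never sleeps waiting for precreation. A rough sketch of that ordering (fake_ost, reserve_object() and release_object() are invented stand-ins; the real path goes through osp_precreate_reserve()):

            /* Hypothetical sketch only, not the real lod/osp call chain. */
            #include <stdio.h>

            struct fake_ost {
                int precreated;   /* objects already precreated on the OST */
                int reserved;     /* objects reserved by create threads    */
            };

            /* In the real code this may sleep waiting for precreation or
             * recovery, which is exactly why it must happen before the local
             * transaction is opened. */
            static int reserve_object(struct fake_ost *ost)
            {
                if (ost->reserved >= ost->precreated)
                    return -1;    /* would have to wait for precreation */
                ost->reserved++;
                return 0;
            }

            static void release_object(struct fake_ost *ost)
            {
                ost->reserved--;
            }

            int main(void)
            {
                /* three stripe targets; the middle one has nothing
                 * precreated, e.g. it is recovering after a failover */
                struct fake_ost ost[3] = { { 5, 0 }, { 0, 0 }, { 5, 0 } };
                int i, rc = 0;

                /* Phase 1: reserve on every stripe target outside any
                 * transaction.  In the hang described in this ticket the
                 * thread sleeps at the failing step instead of bailing out,
                 * so the reservations taken so far stay held and keep
                 * opd_pre_reserved non-zero. */
                for (i = 0; i < 3; i++) {
                    rc = reserve_object(&ost[i]);
                    if (rc) {
                        printf("stripe %d: would wait for precreation while "
                               "holding %d earlier reservation(s)\n", i, i);
                        break;
                    }
                }

                /* Phase 2 (not reached on the error path): open the local
                 * transaction and use the reserved objects without ever
                 * sleeping inside it. */

                while (--i >= 0)
                    release_object(&ost[i]);
                return rc ? 1 : 0;
            }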

            aboyko Alexander Boyko added a comment -

            This issue is still present with Lustre 2.15. We have a vmcore from an MDT failover problem, and the root cause brings me to this issue. The test did not include massive failovers as in the description, only one-by-one random node failover/failback during IO.

            [18971.201645] LustreError: 0-0: Forced cleanup waiting for mdt-kjcf05-MDT0001_UUID namespace with 46 resources in use, (rc=-110)

            MDT01 creation threads hung on lq_rw_sem; 27 threads had been sleeping for about 1.5 hours:
            [0 01:29:59.961] [UN]  PID: 49816  TASK: ffff98c656af17c0  CPU: 5   COMMAND: "mdt01_000"
            The semaphore was held for read by 3 threads in lod_ost_alloc_rr:
            [0 00:05:00.729] [ID]  PID: 56895  TASK: ffff98c67f1217c0  CPU: 22  COMMAND: "mdt05_007"
            They woke up every obd_timeout (5 minutes) and did not make progress because there were no objects.
            The following message repeated for 4000+ seconds.

            00000004:00080000:7.0:1644534589.764767:0:90378:0:(osp_precreate.c:1529:osp_precreate_reserve()) kjcf05-OST0003-osc-MDT0001: slow creates, last=[0x340000400:0x23a4f483:0x0], next=[0x340000400:0x23a4f378:0x0], reserved=267, sync_changes=1, sync_rpcs_in_progress=0, status=-19 
            ....
            00000004:00080000:3.0:1644539373.892843:0:84791:0:(osp_precreate.c:1529:osp_precreate_reserve()) kjcf05-OST0003-osc-MDT0001: slow creates, last=[0x340000400:0x23a4f483:0x0], next=[0x340000400:0x23a4f378:0x0], reserved=267, sync_changes=0, sync_rpcs_in_progress=0, status=0
            

            During this time period MDT0001 successfully reconnected to OST0003 and completed recovery, etc., four times, but objects were not created.

            00000100:02000000:26.0:1644534989.327955:0:49612:0:(import.c:1642:ptlrpc_import_recovery_state_machine()) kjcf05-OST0003-osc-MDT0001: Connection restored to 10.16.100.58@o2ib (at 10.16.100.58@o2ib)
            00000100:02000000:26.0:1644536000.579884:0:49612:0:(import.c:1642:ptlrpc_import_recovery_state_machine()) kjcf05-OST0003-osc-MDT0001: Connection restored to 10.16.100.60@o2ib (at 10.16.100.60@o2ib)
            00000100:02000000:27.0:1644537071.966959:0:49612:0:(import.c:1642:ptlrpc_import_recovery_state_machine()) kjcf05-OST0003-osc-MDT0001: Connection restored to 10.16.100.58@o2ib (at 10.16.100.58@o2ib)
            00000100:02000000:24.0:1644538459.546055:0:49612:0:(import.c:1642:ptlrpc_import_recovery_state_machine()) kjcf05-OST0003-osc-MDT0001: Connection restored to 10.16.100.60@o2ib (at 10.16.100.60@o2ib)
            

            osp_precreate_thread sleeps here

            PID: 49938  TASK: ffff98c63a248000  CPU: 30  COMMAND: "osp-pre-3-1"
             #0 [ffffa545b38efd48] __schedule at ffffffff8e54e1d4
             #1 [ffffa545b38efde0] schedule at ffffffff8e54e648
             #2 [ffffa545b38efdf0] osp_precreate_cleanup_orphans at ffffffffc17d00e9 [osp]
             #3 [ffffa545b38efe70] osp_precreate_thread at ffffffffc17d18da [osp]
            

            The wait is

            wait_event_idle(d->opd_pre_waitq,
                                    (!d->opd_pre_reserved && d->opd_recovery_completed) ||
                                    !d->opd_pre_task || d->opd_got_disconnected);
            
            opd_pre_reserved = 0x10b,
            

            opd_pre_reserved was not decreased to 0 (0x10b = 267, matching reserved=267 in the slow-create messages above), so osp_precreate_thread stayed blocked in osp_precreate_cleanup_orphans().
            The hung component is overstriped with a stripe count of 2000.

            crash> lod_layout_component 0xffff98c68f4a8100
            struct lod_layout_component {
              llc_extent = {
                e_start = 536870912,
                e_end = 536870912
              },
              llc_id = 4,
              llc_flags = 0,
              llc_stripe_size = 1048576,
              llc_pattern = 513,
              llc_layout_gen = 0,
              llc_stripe_offset = 65535,
              llc_stripe_count = 2000,
              llc_stripes_allocated = 0,
              llc_timestamp = 0,
              llc_pool = 0xffff98c60eef5980 "flash",
              llc_ostlist = {
            

            bzzz ^

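            As a side note on the dump above: llc_pattern = 513 (0x201) decodes as plain RAID0 plus the overstriping flag, which is what marks the component as overstriped. A tiny decode, assuming the usual LOV_PATTERN_* flag values (0x001 and 0x200), which are not quoted in this ticket:

            /* The flag values below are assumed, not taken from this ticket. */
            #include <stdio.h>

            #define LOV_PATTERN_RAID0        0x001   /* assumed value */
            #define LOV_PATTERN_OVERSTRIPING 0x200   /* assumed value */

            int main(void)
            {
                unsigned int llc_pattern = 513;        /* 0x201, from the dump */
                unsigned int llc_stripe_count = 2000;  /* from the dump        */

                printf("pattern 0x%x: raid0=%d overstriping=%d, stripe_count=%u\n",
                       llc_pattern,
                       !!(llc_pattern & LOV_PATTERN_RAID0),
                       !!(llc_pattern & LOV_PATTERN_OVERSTRIPING),
                       llc_stripe_count);
                return 0;
            }
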
            pjones Peter Jones added a comment -

            Reverted from master under LU-9498


            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25926/
            Subject: LU-8367 osp: orphan cleanup do not wait for reserved
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1b3028ab142a1f605e6274a6019bb39d89ae8d25


            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: shadow Alexey Lyashkov
              Votes: 0
              Watchers: 14
