    Description

      Currently ltd_qos.lq_rw_sem is taken on the following LOD paths:

      lod_qos_statfs_update() write - does not protect anything; I hope it will be gone with LU-14277
      lod_qos_calc_rr() write - refills the pool array if LQ_DIRTY was set, rare
      lod_ost_alloc_rr() read - held for the whole object reservation path
      lod_mdt_alloc_rr() read - the same
      lod_ost_alloc_qos() write - held for the whole path of OST weight calculation and object allocation
      lod_mdt_alloc_qos() write - the same
      lu_qos_add_tgt() write - adds a new target, marks LQ_DIRTY, rare
      lu_qos_del_tgt() write - deletes a target, marks LQ_DIRTY, rare

      Call graph for these functions:

      lod_qos_prep_create() {
              lod_qos_statfs_update()
              rc = lod_ost_alloc_qos()
              if (rc == -EAGAIN)
                      rc = lod_ost_alloc_rr() {
                                      lod_qos_calc_rr()
                                      lod_check_and_reserve_ost() {
                                              lod_qos_declare_object_on()
                                      }
                      }
      }
      

      lod_qos_declare_object_on() can block on object creation when an OST is lost, for example during failover. In that case
      lod_ost_alloc_rr() holds ltd_qos.lq_rw_sem for read for the entire failover time, and every other creation thread gets stuck
      in down_write() in lod_ost_alloc_qos(). No matter how many OSTs Lustre could use, all creation threads hang.

      I suggest a patch that unblocks the lod_ost_alloc_qos() threads with -EAGAIN. This sends them to lod_ost_alloc_rr(), where the semaphore is shared for read, so creation threads can pick healthy OSTs and allocate objects.

          Activity

            [LU-15393] object allocation when OST is lost

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49097/
            Subject: LU-15393 tests: check QoS hang with OST failover
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 3692450355585c1a3a8502ce0f96a36650941f96

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49096/
            Subject: LU-15393 lod: skip qos for qos_threshold_rr=100
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 0b1aa418ac26d879d4794db1aab360a2230c891d

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49095/
            Subject: LU-15393 lod: use killable semaphore for creation path
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 18c098261104fef9350e932d124d78296b0cc135

            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49097
            Subject: LU-15393 tests: check QoS hang with OST failover
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 1a44918703ab0f75c3ee7ab45bf9d6db7c1a6674

            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49096
            Subject: LU-15393 lod: skip qos for qos_threshold_rr=100
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 17b646aac70cf702d1358e65bf8ce22f16f41dfd

            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49095
            Subject: LU-15393 lod: use killable semaphore for creation path
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: eaf1700f3d57ae88b48099611219ea6f3d2de75f

            adilger Andreas Dilger added a comment -

            The recovery-small test_152 failed once:
            https://testing.whamcloud.com/test_sets/2ac04215-a77d-4436-8b38-65a379dd5855

            Not sure if this is a problem yet.

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47715/
            Subject: LU-15393 tests: check QoS hang with OST failover
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 52057d85eaef8c7b5262f0718629fabff919ff1d

            "Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/47715
            Subject: LU-15393 tests: check QoS hang with OST failover
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7d87322d92352865cc86438cba517d98aad0c789

            pjones Peter Jones added a comment -

            Landed for 2.16

            People

              aboyko Alexander Boyko