[LU-7235] ZFS: dmu_object_alloc() serializes object creations

Details

    • Improvement
    • Resolution: Fixed
    • Minor

    Description

      dmu_object_alloc() in ZFS serializes object creations with a mutex:

      dmu_object_alloc(objset_t *os, dmu_object_type_t ot, int blocksize,
          dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx)
      {
      	uint64_t object;
      	uint64_t L2_dnode_count = DNODES_PER_BLOCK <<
      	    (DMU_META_DNODE(os)->dn_indblkshift - SPA_BLKPTRSHIFT);
      	dnode_t *dn = NULL;
      	int restarted = B_FALSE;
      
      	mutex_enter(&os->os_obj_lock);
      

      This mutex can become a bottleneck, holding ZFS metadata performance well below that of ldiskfs, where inode creation is more concurrent.
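
As a rough illustration of the pattern described above, here is a minimal user-space sketch in which one mutex covers the whole allocation path. It is not ZFS code: coarse_objset, coarse_object_alloc(), coarse_worker() and coarse_run() are hypothetical names, and a pthread mutex stands in for os_obj_lock. The sketch records how many allocators are ever inside the critical section at once.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define COARSE_MAX_THREADS 16    /* arbitrary cap for this sketch */
#define COARSE_PER_THREAD  1000  /* allocations issued by each thread */

/* Hypothetical stand-in for an object set; not the real objset_t. */
struct coarse_objset {
	pthread_mutex_t obj_lock;  /* plays the role of os_obj_lock */
	uint64_t obj_next;         /* next object number to hand out */
	int inside;                /* allocators currently holding the lock */
	int max_inside;            /* high-water mark of 'inside' */
};

/* The whole allocation, search and setup alike, runs under one mutex. */
static uint64_t
coarse_object_alloc(struct coarse_objset *os)
{
	uint64_t object;

	pthread_mutex_lock(&os->obj_lock);
	if (++os->inside > os->max_inside)
		os->max_inside = os->inside;
	object = os->obj_next++;
	os->inside--;
	pthread_mutex_unlock(&os->obj_lock);
	return (object);
}

static void *
coarse_worker(void *arg)
{
	for (int i = 0; i < COARSE_PER_THREAD; i++)
		(void) coarse_object_alloc(arg);
	return (NULL);
}

/*
 * Run nthreads concurrent allocators. Returns the largest number of
 * threads ever inside the critical section and stores the total number
 * of objects handed out in *allocated.
 */
static int
coarse_run(int nthreads, uint64_t *allocated)
{
	struct coarse_objset os = { .obj_next = 0, .inside = 0, .max_inside = 0 };
	pthread_t tid[COARSE_MAX_THREADS];

	assert(nthreads > 0 && nthreads <= COARSE_MAX_THREADS);
	pthread_mutex_init(&os.obj_lock, NULL);
	for (int i = 0; i < nthreads; i++) {
		int rc = pthread_create(&tid[i], NULL, coarse_worker, &os);
		assert(rc == 0);
		(void) rc;
	}
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	pthread_mutex_destroy(&os.obj_lock);
	*allocated = os.obj_next;
	return (os.max_inside);
}
```

However many threads call coarse_run(), the high-water mark stays at one: every creation waits its turn, so throughput is bounded by the time spent holding the lock.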

          Activity

            donut-crowd Donut Crowd (Inactive) made changes -
            Remote Link Original: This issue links to "Page (HPDD Community Wiki)" [ 15586 ]
            bzzz Alex Zhuravlev made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Closed [ 6 ]

            adilger Andreas Dilger added a comment - I suspect this ticket can be closed, given the changes that have landed in upstream ZFS?
            bzzz Alex Zhuravlev made changes -
            Link New: This issue is blocking LU-2600 [ LU-2600 ]
            dbrady Don Brady (Inactive) added a comment (edited)

            Initial observation

            The os_obj_lock mutex only protects os->os_obj_next. Note that more than 100 additional mutexes are acquired while it is held.

            It looks like we can narrow the lock window by dropping the lock once we successfully hold the free dnode we found. As far as I can tell, os_obj_lock does not need to be held across dnode_allocate() and dnode_rele(). Based on an ftrace forward call graph, that would cut the lock hold time from 948us down to 236us.

            See the attached call graph (line 153) for more context.
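
The narrowing proposed in this comment can be sketched in user space as follows. This is only an illustration under stated assumptions, not the ZFS change itself: toy_objset, toy_object_alloc(), toy_worker() and toy_run() are hypothetical names, the held[] array stands in for taking a hold on a free dnode, and setting ready[] stands in for the work done by dnode_allocate() and dnode_rele().

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define TOY_NOBJ        1024  /* toy table size, an assumption of this sketch */
#define TOY_MAX_THREADS 16

/* Hypothetical stand-in for an object set; not the real objset_t. */
struct toy_objset {
	pthread_mutex_t obj_lock;  /* plays the role of os_obj_lock */
	uint64_t obj_next;         /* plays the role of os_obj_next */
	uint8_t held[TOY_NOBJ];    /* slot claimed by some allocator */
	uint8_t ready[TOY_NOBJ];   /* slot fully initialized */
};

/* Returns the allocated object number, or UINT64_MAX if the table is full. */
static uint64_t
toy_object_alloc(struct toy_objset *os)
{
	uint64_t object = UINT64_MAX;

	/* Lock only while advancing the cursor and claiming a free slot. */
	pthread_mutex_lock(&os->obj_lock);
	while (os->obj_next < TOY_NOBJ) {
		uint64_t candidate = os->obj_next++;
		if (!os->held[candidate]) {
			os->held[candidate] = 1;
			object = candidate;
			break;
		}
	}
	pthread_mutex_unlock(&os->obj_lock);

	/* Per-object setup runs outside the lock; the claim makes it private. */
	if (object != UINT64_MAX) {
		assert(os->ready[object] == 0);
		os->ready[object] = 1;
	}
	return (object);
}

struct toy_worker_arg {
	struct toy_objset *os;
	int count;
};

static void *
toy_worker(void *p)
{
	struct toy_worker_arg *a = p;

	for (int i = 0; i < a->count; i++)
		(void) toy_object_alloc(a->os);
	return (NULL);
}

/* Run nthreads allocators and return the number of initialized slots. */
static int
toy_run(int nthreads, int per_thread)
{
	struct toy_objset os;
	struct toy_worker_arg arg = { &os, per_thread };
	pthread_t tid[TOY_MAX_THREADS];
	int n = 0;

	assert(nthreads > 0 && nthreads <= TOY_MAX_THREADS);
	memset(&os, 0, sizeof (os));
	pthread_mutex_init(&os.obj_lock, NULL);
	for (int i = 0; i < nthreads; i++) {
		int rc = pthread_create(&tid[i], NULL, toy_worker, &arg);
		assert(rc == 0);
		(void) rc;
	}
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	for (int i = 0; i < TOY_NOBJ; i++)
		n += os.ready[i];
	pthread_mutex_destroy(&os.obj_lock);
	return (n);
}
```

Because each slot is claimed under the lock before it is initialized, no two threads can set up the same object, even though the setup itself no longer serializes. Whether the same argument holds for the real dnode code depends on what else os_obj_lock implicitly protects, which this sketch does not model.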
            dbrady Don Brady (Inactive) made changes -
            Attachment New: dmu_object_alloc-call-graph.txt [ 19104 ]
            jlevi Jodi Levi (Inactive) made changes -
            Remote Link New: This issue links to "Page (HPDD Community Wiki)" [ 15586 ]
            pjones Peter Jones made changes -
            Assignee Original: WC Triage [ wc-triage ] New: Alex Zhuravlev [ bzzz ]

            People

              bzzz Alex Zhuravlev
              bzzz Alex Zhuravlev