[LU-1248] all mdt_rdpg_* threads busy in osd_ea_fid_get()

    Description

      The load average on the MDS for a classified production 2.1 filesystem jumped to over 400. Top showed mdt_rdpg_* threads all using 4-7% CPU time. This may have been due to a pathological workload, but we were wondering if there's something like an overly contended lock in ldiskfs going on here.

      Most of the stacks looked like this:

      __cond_resched
      _cond_resched
      ifind_fast
      iget_locked
      ldiskfs_iget
      ? generic_detach_inode
      osd_iget
      osd_ea_fid_get
      osd_it_ea_rec
      mdd_readpage
      cml_readpage
      mdt_readpage
      ? mdt_unpack_req_pack_rep
      mdt_handle_common
      ? lustre_msg_get_transno
      mdt_readpage_handle
      ptlrpc_main
      child_rip

        Activity

          cliffw Cliff White (Inactive) added a comment -

          http://jira.whamcloud.com/browse/LUDOC-68 has been created to track the manual changes.

          pjones Peter Jones added a comment -

          Cliff, could you please create a LUDOC ticket to track Andreas's request?


          adilger Andreas Dilger added a comment -

          The Lustre Manual should be updated to inform users about how to enable "dirdata" on an upgraded 1.8->2.x MDT, once they are sure that they will not be downgrading the MDS to 1.8 again. This will minimize performance impact on newly created files.

          laisiyao Lai Siyao added a comment -

          If a 1.8 system has been upgraded to 2.x successfully, tune2fs can be used to enable the dirdata feature; new directories will then contain the inode FID in their data.


          nedbass Ned Bass (Inactive) added a comment -

          Lai, enabling dir_data seems like a reasonable course of action. We'll start some local testing and propose it to our sysadmin team. Thanks.

          laisiyao Lai Siyao added a comment -

          The command should be `tune2fs -O dirdata /dev/{mdtdev}`, and I've verified that a newly created directory will store the FID in it.

          laisiyao Lai Siyao added a comment -

          Andreas, thanks for your detailed explanation! I'll verify later that `tune2fs` can enable the 'dir_data' feature.

          Ned, are you fine with the result?

          adilger Andreas Dilger added a comment - - edited

          The need to be able to downgrade from 2.x to 1.8 is only in the case of "simple" upgrade to 2.x that has hit problems and needs to be able to downgrade. If the upgrade has been successful, and then the admin (separately) enables the "dir_data" feature using tune2fs on the filesystem, this should be enough to allow storing FIDs in the directory entries. After that point, the filesystem should not be downgraded to 1.8 anymore.

          What definitely should be avoided is any automatic enabling of the "dir_data" feature when the filesystem is first mounted. The problem case is this sequence: FIDs are stored in the directory entries, the filesystem is downgraded to 1.8, the FID-in-LMA is deleted upon access (reverting to IGIF for that inode), and the filesystem is then upgraded again. The FID-in-dirent would then contain invalid data that OI scrub and e2fsck will not fix yet.

          So, my understanding is that if you are sure there is no need to downgrade to 1.8, it should be possible with 2.1+ to use:

          tune2fs -O dirdata /dev/{mdtdev}
          

          to enable this feature, and then newly-created files/links will store the FID in the directory. I don't know if we have tested this process or not.
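
          Before running tune2fs it may help to check whether the feature is already set. The following is only a sketch: it assumes the feature shows up as "dirdata" on the "Filesystem features:" line printed by `dumpe2fs -h`, the helper name has_feature is hypothetical, and /dev/{mdtdev} is still a placeholder.

```shell
# has_feature FEATURE (hypothetical helper): read superblock summary text on
# stdin, as printed by `dumpe2fs -h`, and succeed only if FEATURE is listed
# as a whole word on the "Filesystem features:" line.
has_feature() {
        grep '^Filesystem features:' | tr -s ' \t' '\n' | grep -qx -- "$1"
}

# Intended use (placeholder device, not executed here):
#   dumpe2fs -h /dev/{mdtdev} 2>/dev/null | has_feature dirdata \
#           || tune2fs -O dirdata /dev/{mdtdev}
```

          Matching whole words avoids a false positive from features with similar names, such as dir_index.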

          Assuming this is OK, it would then be possible in that case to "refresh" the directory with a script to re-link filenames that are expected to live for a long time, assuming they are not in use, something like:

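          lfs find /mnt/lustre/some/dir -type f | while IFS= read -r F; do
                  # rename to a temporary name in the same directory and back,
                  # so that the directory entry is re-created
                  FTMP="$F.XXXXXX.$$.$RANDOM"
                  mv "$F" "$FTMP" && mv "$FTMP" "$F"
          done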
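
          The rename pair in the loop above could also be wrapped in a small helper that refuses to overwrite an existing temporary name. This is a sketch only: relink and the ".relink.$$" suffix are hypothetical names, and it has not been tested against a Lustre mount.

```shell
# relink FILE (hypothetical helper): rename FILE to a temporary name in the
# same directory and back again, so that its directory entry is re-created.
# Returns 1 without touching FILE if the temporary name already exists.
relink() {
        f="$1"
        tmp="$f.relink.$$"
        if [ -e "$tmp" ]; then
                echo "skip: $tmp already exists" >&2
                return 1
        fi
        mv -- "$f" "$tmp" && mv -- "$tmp" "$f"
}

# Intended use (not executed here):
#   lfs find /mnt/lustre/some/dir -type f | while IFS= read -r F; do relink "$F"; done
```

          As with the original loop, this assumes the files are not in use while they are being renamed.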
          

          In a later phase of LFSCK, the FID-in-dirent data will be verified and refreshed if missing, but this is not part of the Phase I deliverable.

          This existing behaviour is not a net performance loss in many use cases, since it is prefetching the inode into MDS memory for use as soon as e.g. "ls" does a stat() on the file. There would only be a visible slowdown in the case of e.g. "find" that is not accessing any of the file attributes, and only generating pathnames.

          laisiyao Lai Siyao added a comment -

          Andreas, it looks like we need to change both the 2.x and the 1.8 ldiskfs code to keep both backward and forward compatibility for this. Any suggestions?


          yong.fan nasf (Inactive) added a comment -

          There is an incompatible format between b1_8 and b2_1: in b1_8, lvfs_dentry_params is appended after the name entry in the parent directory, but in b2_1 it is ldiskfs_dentry_param. The two are different and incompatible. So when a system is upgraded from b1_8 to b2_1, a newly created file cannot append ldiskfs_dentry_param (which contains the FID) after its name entry in the parent directory; otherwise, the system cannot be downgraded back to b1_8.

          But without the FID appended after the name entry in the parent directory, there is a performance regression for dir readpage. I do not think that is a good solution, because upgrades are more common than downgrades.

          We should make a patch for b2_x to support appending the FID after the name entry in the parent directory in the upgrade case, and to skip lvfs_dentry_params after the name entry for old files. In addition, we need another patch against b1_8_x (x >= 8) to skip the FID after the name entry in the parent directory, to support downgrading back to b1_8_x (x >= 8).

          People

            laisiyao Lai Siyao
            nedbass Ned Bass (Inactive)