
all mdt_rdpg_* threads busy in osd_ea_fid_get()

Details


    Description

      The load average on the MDS of a classified production 2.1 filesystem jumped to over 400. top showed all of the mdt_rdpg_* threads using 4-7% CPU time each. This may have been due to a pathological workload, but we were wondering whether something like an overly contended lock in ldiskfs is going on here.

      Most of the stacks looked like this:

      __cond_resched
      _cond_resched
      ifind_fast
      iget_locked
      ldiskfs_iget
      ? generic_detach_inode
      osd_iget
      osd_ea_fid_get
      osd_it_ea_rec
      mdd_readpage
      cml_readpage
      mdt_readpage
      ? mdt_unpack_req_pack_rep
      mdt_handle_common
      ? lustre_msg_get_transno
      mdt_readpage_handle
      ptlrpc_main
      child_rip

      Attachments

        Activity

          [LU-1248] all mdt_rdpg_* threads busy in osd_ea_fid_get()
          laisiyao Lai Siyao added a comment -

          Andreas, it looks like we need to change both the 2.x and 1.8 ldiskfs code to keep both backward and forward compatibility for this. Any suggestions?


          yong.fan nasf (Inactive) added a comment -

          There is a format incompatibility between b1_8 and b2_1: in b1_8, lvfs_dentry_params is appended after the name entry in the parent directory, but in b2_1 it is ldiskfs_dentry_param. The two are different and incompatible. So when a system is upgraded from b1_8 to b2_1, a newly created file cannot append ldiskfs_dentry_param (which contains the FID) after its name entry in the parent directory; otherwise the system cannot be downgraded back to b1_8.

          But without the FID appended after the name entry in the parent directory, there is a performance regression (for dir readpage). I do not think that is a good solution, because upgrade is used far more often than downgrade.

          We should make a patch for b2_x to support appending the FID after the name entry in the parent directory for the upgrade case, and to skip lvfs_dentry_params after the name entry for old files. On the other hand, another patch is needed against b1_8_x (x >= 8) to skip the FID after the name entry in the parent directory, to support downgrading back to b1_8_x (x >= 8).
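
          A minimal sketch of that incompatibility, assuming hypothetical field names and sizes (the real on-disk definitions in the two branches differ from this): both branches append a small trailer after the file name in the directory entry, but each branch only understands its own layout, so a 2.x server on an upgraded filesystem cannot write its FID trailer without breaking the ability to downgrade.

          #include <stdint.h>
          #include <stdio.h>

          /* Hypothetical trailer layouts appended after the name in a
           * directory entry; the real field names and sizes differ. */
          struct lvfs_dentry_params_18 {     /* what a b1_8 server writes/expects */
              uint32_t ldp_magic;
              uint32_t ldp_len;
              uint8_t  ldp_data[32];         /* opaque 1.8 payload */
          };

          struct ldiskfs_dentry_param_2x {   /* what a b2_x server writes/expects */
              uint32_t edp_magic;
              uint32_t edp_len;
              uint8_t  edp_data[17];         /* packed FID */
          };

          int main(void)
          {
              /* Different magics, lengths and payloads mean a server from the
               * other branch cannot parse the trailer, which is why a 2.x MDS
               * on an upgraded filesystem skips writing it and falls back to
               * per-inode FID lookups during readdir. */
              printf("1.8 trailer: %zu bytes, 2.x trailer: %zu bytes\n",
                     sizeof(struct lvfs_dentry_params_18),
                     sizeof(struct ldiskfs_dentry_param_2x));
              return 0;
          }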

          laisiyao Lai Siyao added a comment -

          I can't find an easy way to implement this: without a disk format change, there is no way to distinguish a 1.8 directory from a newly created directory. The original design for 1.8 <-> 2.x server interoperability is in bz11826.

          laisiyao Lai Siyao added a comment -

          Yes, on an upgraded system even newly created dirs won't store the fid in dir data; I'll look into whether it's easy to implement this.


          nedbass Ned Bass (Inactive) added a comment -

          Yes, the system was upgraded from 1.8. Will files created after the upgrade store the fid in the dir data?

          There are a few "osd_object_delete() Failed to cleanup: -2" console messages on the MDS from around that time. I didn't find any other errors worth mentioning.

          laisiyao Lai Siyao added a comment -

          Ned, was the system upgraded from 1.8? Normally readdir can get the fid from the dir data, so it doesn't need to read the fid from the EA, but on an upgraded system it has to query each inode. Also, did you see any error messages related to this dir?
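
          A minimal sketch of that fallback, with hypothetical struct and helper names (not the actual osd-ldiskfs code): if the FID was stored beside the name in the directory entry it can be returned directly, otherwise each entry costs an inode lookup plus an EA read, which is where the iget_locked()/ifind_fast() time in the stacks above goes.

          #include <stdio.h>

          /* Hypothetical FID and directory-entry shapes, for illustration only. */
          struct fid { unsigned long long seq; unsigned int oid, ver; };

          struct dir_entry {
              char          name[256];
              unsigned long ino;
              int           fid_in_dirdata;  /* entry created by 2.x with dir data */
              struct fid    fid;             /* valid only if fid_in_dirdata */
          };

          /* Stand-in for the slow path: iget_locked() plus reading the FID EA. */
          static int fid_from_inode_ea(unsigned long ino, struct fid *out)
          {
              /* The real server takes inode locks and reads an xattr here;
               * this just fakes a result so the example runs. */
              out->seq = 0x200000400ULL;
              out->oid = (unsigned int)ino;
              out->ver = 0;
              return 0;
          }

          static int resolve_fid(const struct dir_entry *de, struct fid *out)
          {
              if (de->fid_in_dirdata) {                /* fast path: FID beside the name */
                  *out = de->fid;
                  return 0;
              }
              return fid_from_inode_ea(de->ino, out);  /* slow path seen in the stacks */
          }

          int main(void)
          {
              struct dir_entry upgraded = { "file-from-1.8", 1234, 0, { 0 } };
              struct fid f;

              resolve_fid(&upgraded, &f);
              printf("%s -> [0x%llx:0x%x:0x%x]\n", upgraded.name, f.seq, f.oid, f.ver);
              return 0;
          }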

          laisiyao Lai Siyao added a comment -

          This looks normal from the code. Basically it's an inode scalability problem: these busy threads are contending on inode_lock, and Nick Piggin's inode lock scalability patches are getting merged into kernel 3.x.

          Currently the MDT still uses directory + EA to store metadata, while IAM looks to have better performance and scalability; I'm not clear on why it isn't enabled yet.
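
          As a rough illustration of that kind of single-lock scalability problem (a toy model only, nothing Lustre-specific): when every lookup from every service thread has to pass through one global lock, adding threads mostly adds contention rather than throughput. Build with cc -pthread.

          #include <pthread.h>
          #include <stdio.h>
          #include <time.h>

          /* Toy model: many "service threads" all funnel through one global lock,
           * the way per-entry inode lookups serialize on a single inode_lock in
           * older kernels. Not Lustre code. */

          #define NTHREADS 16
          #define NITERS   200000

          static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
          static unsigned long counter;

          static void *worker(void *arg)
          {
              (void)arg;
              for (int i = 0; i < NITERS; i++) {
                  pthread_mutex_lock(&global_lock);   /* every lookup hits this */
                  counter++;                          /* stand-in for a hash lookup */
                  pthread_mutex_unlock(&global_lock);
              }
              return NULL;
          }

          int main(void)
          {
              pthread_t tid[NTHREADS];
              struct timespec t0, t1;

              clock_gettime(CLOCK_MONOTONIC, &t0);
              for (int i = 0; i < NTHREADS; i++)
                  pthread_create(&tid[i], NULL, worker, NULL);
              for (int i = 0; i < NTHREADS; i++)
                  pthread_join(tid[i], NULL);
              clock_gettime(CLOCK_MONOTONIC, &t1);

              printf("%d threads, %lu lock acquisitions in %.2f seconds\n",
                     NTHREADS, counter,
                     (double)(t1.tv_sec - t0.tv_sec) +
                     (t1.tv_nsec - t0.tv_nsec) / 1e9);
              return 0;
          }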

          pjones Peter Jones added a comment -

          Lai

          Could you please comment on this one?

          Thanks

          Peter


          People

            Assignee: laisiyao Lai Siyao
            Reporter: nedbass Ned Bass (Inactive)
            Votes: 0
            Watchers: 8
