
[LU-8927] osp-syn processes contending for osq_lock drives system cpu usage > 80%

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Environment: lustre-2.8.0_5.chaos-2.ch6.x86_64
      zfs-0.7.0-0.6llnl.ch6.x86_64
      DNE with 16 MDTs
    • Severity: 3

    Description

      We ran jobs which created remote directories (not striped) and then ran mdtest within them; several MDS nodes are now using >80% of their CPU time in osp-syn-* processes.

      There are 36 osp-syn-* processes.

      The processes are spending almost all their time contending for osq_lock. According to perf, the offending stack is:

      osq_lock
      __mutex_lock_slowpath
      mutex_lock
      spa_config_enter
      bp_get_dsize
      dmu_tx_hold_free
      osd_declare_object_destroy
      llog_osd_declare_destroy
      llog_declare_destroy
      llog_cancel_rec
      llog_cat_cancel_records
      osp_sync_process_committed
      osp_sync_process_queues
      llog_process_thread
      llog_process_or_fork
      llog_cat_process_cb
      llog_process_thread
      llog_process_or_fork
      llog_cat_process_or_fork
      llog_cat_process
      osp_sync_thread
      kthread
      ret_from_fork
      osp-syn-X-Y


    Activity

            ofaaland Olaf Faaland added a comment -

            This lock contention has not resulted in problems in production, and there is so much related change in 2.10 and master that it's quite possible the problem does not occur there. Closing the ticket.

            pjones Peter Jones added a comment -

            I would like a level set on this ticket. All of the planned work to improve metadata performance for ZFS has now landed to master (and b2_10). Are there any specific tasks identified and remaining beyond that?

            ofaaland Olaf Faaland added a comment - edited

            Alex,

            I applied LU-7898 on top of our 2.8.0+patch stack and see the same symptoms. The patch didn't appear to me to change any of the functions in the contending stacks, so that's not surprising.

            The full set of patches above would be too much for a stable branch, I would think. So I've rewritten llog_cancel_rec() to destroy the llog in a second transaction, when that's necessary. Maybe this is a poor approach; feedback or an alternative would be welcome. In any case I've pushed it to gerrit and will do local testing after it passes Maloo.

            https://review.whamcloud.com/#/c/24687/
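
            Roughly, the shape of the idea (a sketch only, not the actual change in the gerrit link; llog_cancel_record_only() is a hypothetical stand-in for the record-cancel half, and error handling and locking are elided):

                    /* Sketch of the two-transaction cancel: clear the record in
                     * one transaction, then destroy the now-empty llog in a
                     * second one.  llog_declare_destroy() is no longer part of
                     * every cancel, so dmu_tx_hold_free() and the
                     * spa_config_enter() behind it are only reached on the
                     * rare destroy path. */
                    static int llog_cancel_rec_two_tx(const struct lu_env *env,
                                                      struct llog_handle *loghandle,
                                                      int index)
                    {
                            struct llog_log_hdr *llh = loghandle->lgh_hdr;
                            int rc;

                            /* Transaction 1: clear the bitmap bit and rewrite
                             * the llog header only (hypothetical helper). */
                            rc = llog_cancel_record_only(env, loghandle, index);
                            if (rc < 0)
                                    return rc;

                            /* Transaction 2: destroy the llog only once it is
                             * actually empty (llh_count includes the header
                             * record itself) and is flagged for removal. */
                            if ((llh->llh_flags & LLOG_F_ZAP_WHEN_EMPTY) &&
                                llh->llh_count == 1)
                                    rc = llog_destroy(env, loghandle);

                            return rc;
                    }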


            bzzz Alex Zhuravlev added a comment -

            You're right, I meant LU-8873: basically yet another point to save on the dnode# -> dnode_t lookup.

            ofaaland Olaf Faaland added a comment -

            OK, thanks.

            In the above list of tickets, you include LU-8893. Did you mean LU-8873?

            I looked briefly at the LU-7898 patch to remove unnecessary declarations. I'll see if I can apply and test it.

            bzzz Alex Zhuravlev added a comment - edited

            In very few words: I've been working to make declarations with ZFS cheap. Right now they are quite expensive because the DMU API works with dnode numbers, so every time it needs to translate a dnode number into a dnode structure using the global hash table. A few patches have already landed on the master branch and were released as part of 2.9 (e.g. LU-7898 osd: remove unnecessary declarations). More improvements are expected once the following patches land:

            LU-8882 osd: use bydnode methods to access ZAP
            LU-8928 osd: convert osd-zfs to reference dnode, not db
            LU-8873 osd: use sa_handle_get_from_db()
            LU-2435 osd-zfs: use zfs native dnode accounting
            and https://github.com/zfsonlinux/zfs/pull/5464

            This way the declarations should become mostly lockless and much cheaper.


            ofaaland Olaf Faaland added a comment -

            Yes, please elaborate. I know there are many ways to work on this and it would be great to know the nature and scope of the fix you have in mind.

            I looked again and the MDTs are still working to clear llog records from jobs run about 45 hours ago (the lock has been contended the entire time). I don't think we can go into production without a fix for this.

            pjones Peter Jones added a comment -

            Alex

            Could you please elaborate about the work underway in this area?

            Thanks

            Peter


            bzzz Alex Zhuravlev added a comment -

            When we declare an llog cancellation we don't know whether it will be the last record or not; to know that, we would have to lock the llog from declaration up to transaction stop, which would kill concurrency. Newer versions of Lustre will fix this problem in the osd-zfs module.
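
            For concreteness, a sketch (illustrative only, not proposed code) of the conditional declare and the race it would open:

                    /* Hypothetical conditional declare: only reserve the
                     * destroy when the record being cancelled looks like the
                     * last one (llh_count includes the header record, so "one
                     * payload record left" is llh_count == 2; the exact check
                     * is illustrative). */
                    if ((llh->llh_flags & LLOG_F_ZAP_WHEN_EMPTY) &&
                        llh->llh_count == 2)
                            rc = llog_declare_destroy(env, loghandle, th);
                    /* Between this declaration and transaction stop, other
                     * threads can cancel the remaining records, leaving the
                     * llog empty inside a transaction that never declared the
                     * destroy.  Closing the race would mean holding the llog
                     * lock from declare to stop, serializing all cancels on
                     * that llog. */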


            ofaaland Olaf Faaland added a comment -

            I see that llog_cancel_rec() contains the following:

                    rc = llog_declare_write_rec(env, loghandle, &llh->llh_hdr, index, th);
                    if (rc < 0)
                            GOTO(out_trans, rc);
            
                    if ((llh->llh_flags & LLOG_F_ZAP_WHEN_EMPTY))
                            rc = llog_declare_destroy(env, loghandle, th);
            
                    th->th_wait_submit = 1;
                    rc = dt_trans_start_local(env, dt, th);
            

            So it seems to declare that it will destroy the llog object every time it cancels a record, as if every record is the last one. Why is that? Shouldn't it also depend on how many active records the llog contains?


            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 6
