[LU-12651] High kworker CPU usage (osc_grant_work_handler) on IDLE connections

    Description

      We discovered that on our systems with Lustre mounted, a kworker thread uses a significant amount of CPU.
      On an idle system, perf top shows:

        39.44%  [kernel]                  [k] osc_should_shrink_grant
        12.14%  [kernel]                  [k] osc_grant_work_handler
         2.81%  [kernel]                  [k] process_one_work
         2.64%  [kernel]                  [k] __queue_work
         2.56%  [kernel]                  [k] read_tsc
      

      We currently have grant_shrink=0 on this system.

      Running du -hs /fs makes the problem go away for some time.
      Unmounting the filesystem also makes the problem go away.
      This is a CentOS 7.6 system with Lustre 2.12.0.

          Activity

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37572/
            Subject: LU-12651 osc: always call update_next_shrink
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 10a799263964422df575038d3dfb507a09bfa221

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37572
            Subject: LU-12651 osc: always call update_next_shrink
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 70d299f149e1cb5f396576baf452a5eba911a30a

            pjones Peter Jones added a comment -

            Landed for 2.14

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37429/
            Subject: LU-12651 osc: always call update_next_shrink
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 117f587bc3e60f4dd1c939f8488e43cb752c12ca

            Hi Alexander,
            Our initial testing on a machine with a patched client (2.12.3 + LU-12759 + this patch) shows that the kworker does not go crazy anymore.
            Great job! Thanks!
            Will let you know if we run into any issues with this patch.
            Jacek Tomaka

            zam Alexander Zarochentsev added a comment -

            Jacek,
            > Thanks for looking into it. Would you be so kind to provide patch for 2.12.3 as well?
            The same patch applies to b2_12.

            Hi Alexander,
            Thanks for looking into it. Would you be so kind to provide patch for 2.12.3 as well?
            Regards.
            Jacek Tomaka

            zam Alexander Zarochentsev added a comment - - edited

            My experiments with 2.12-based Lustre and grant_shrink=0:

            Without the fix, kworker starts to eat 100% CPU 20 minutes after Lustre mount time (the default grant shrinking interval):

            top - 00:03:08 up 2 days, 11:32,  3 users,  load average: 2.95, 2.47, 2.22
            Tasks: 258 total,   3 running, 255 sleeping,   0 stopped,   0 zombie
            %Cpu(s):  0.0 us, 25.0 sy,  0.0 ni, 75.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
            KiB Mem :  2914024 total,  1138684 free,   544988 used,  1230352 buff/cache
            KiB Swap:  2113532 total,  2113532 free,        0 used.  2190536 avail Mem 
            
              PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                        
            21631 root      20   0       0      0      0 R 100.0  0.0   3:03.08 kworker/3:2                                                                                    
                1 root      20   0  191032   3912   2584 S   0.0  0.1   0:06.70 systemd                                                                                        
                2 root      20   0       0      0      0 S   0.0  0.0   0:00.06 kthreadd                                                                                       
                3 root      20   0       0      0      0 S   0.0  0.0   0:01.06 ksoftirqd/0                                                                                    
                5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H                                                                                   
            

            With the fix, 22 minutes after start, the system is idle:

            top - 00:32:05 up 2 days, 12:01,  3 users,  load average: 2.00, 2.01, 2.06
            Tasks: 261 total,   2 running, 259 sleeping,   0 stopped,   0 zombie
            %Cpu(s):  0.1 us,  0.1 sy,  0.0 ni, 99.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
            KiB Mem :  2914024 total,  1133004 free,   549940 used,  1231080 buff/cache
            KiB Swap:  2113532 total,  2113532 free,        0 used.  2185136 avail Mem 
            
              PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                        
              367 root      20   0  162180   2456   1584 R   0.3  0.1   0:00.03 top                                                                                            
                1 root      20   0  191032   3912   2584 S   0.0  0.1   0:06.85 systemd                                                                                        
                2 root      20   0       0      0      0 S   0.0  0.0   0:00.07 kthreadd                                                                                       
                3 root      20   0       0      0      0 S   0.0  0.0   0:01.10 ksoftirqd/0                                                                                    
                5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H                                                                                   
                7 root      rt   0       0      0      0 S   0.0  0.0   0:00.61 migration/0                                                                                    
                8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh                                                                                         
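
            The behaviour above can be illustrated with a toy model. This is only a sketch in Python, not Lustre code: it assumes the work handler re-arms itself using a per-client "next shrink" deadline, and that before the fix this deadline was not refreshed when grant_shrink=0, so the computed delay stayed at zero once the first 20-minute interval had passed. The names Client, should_shrink, handler_runs and the always_update flag are hypothetical; only the idea "always call update_next_shrink" comes from the patch subject.

```python
from dataclasses import dataclass

# 20 minutes, the default grant shrinking interval mentioned in the comment
SHRINK_INTERVAL = 1200


@dataclass
class Client:
    grant_shrink: bool  # models the grant_shrink setting (hypothetical field)
    next_shrink: int    # deadline, in seconds, for the next shrink check


def should_shrink(cli, now, always_update):
    """Toy stand-in for the shrink check; not the real kernel function."""
    if not cli.grant_shrink:
        # Assumed bug: without the fix the deadline is left in the past.
        if always_update:
            cli.next_shrink = now + SHRINK_INTERVAL
        return False
    if now >= cli.next_shrink:
        cli.next_shrink = now + SHRINK_INTERVAL
        return True
    return False


def handler_runs(always_update, horizon=3 * SHRINK_INTERVAL, max_runs=10_000):
    """Count handler invocations during `horizon` simulated seconds."""
    cli = Client(grant_shrink=False, next_shrink=SHRINK_INTERVAL)
    now, runs = SHRINK_INTERVAL, 0  # the first wakeup happens at the deadline
    while now <= horizon and runs < max_runs:
        runs += 1
        should_shrink(cli, now, always_update)
        delay = max(cli.next_shrink - now, 0)
        # A zero delay means the work item is requeued immediately; the
        # simulated clock does not advance, which models the busy loop.
        now += delay
    return runs


if __name__ == "__main__":
    print("without fix:", handler_runs(always_update=False))
    print("with fix:   ", handler_runs(always_update=True))
```

            In this model the unfixed variant runs until it reaches the max_runs cap, while the fixed variant wakes up once per interval. The cap exists only so that the simulation terminates.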
            
            zam Alexander Zarochentsev added a comment - Jacek Tomaka, can you try https://review.whamcloud.com/37429 ?

            Alexander Zarochentsev (c17826@cray.com) uploaded a new patch: https://review.whamcloud.com/37429
            Subject: LU-12651 osc: always call update_next_shrink
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2986155c51914c5a63f6c351908c9a49dbe5042f

            People

              zam Alexander Zarochentsev
              Tomaka Jacek Tomaka (Inactive)