[LU-12651] High kworker CPU usage (osc_grant_work_handler) on IDLE connections Created: 09/Aug/19  Updated: 25/Feb/20  Resolved: 14/Feb/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.2
Fix Version/s: Lustre 2.14.0, Lustre 2.12.5

Type: Bug Priority: Major
Reporter: Jacek Tomaka Assignee: Alexander Zarochentsev
Resolution: Fixed Votes: 0
Labels: LTS12

Issue Links:
Duplicate
Related
is related to LU-8708 Grant shrinking disabled all the time Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We discovered that on our systems with Lustre mounted, a kworker thread is using a significant amount of CPU.
perf top shows on an idle system:

 39.44%  [kernel]                  [k] osc_should_shrink_grant
  12.14%  [kernel]                  [k] osc_grant_work_handler
   2.81%  [kernel]                  [k] process_one_work
   2.64%  [kernel]                  [k] __queue_work
   2.56%  [kernel]                  [k] read_tsc

We currently have grant_shrink=0 on this system.

Simply running du -hs /fs makes the problem go away for some time.
Unmounting the filesystem also makes the problem go away.
This is a CentOS 7.6 system with Lustre 2.12.0.
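
To put a number on how much of the sampled time sits in the grant-shrink path, the perf top lines quoted above can be summed per symbol prefix. This is only a small helper sketch: the sample text is copied from this report, while the function names (parse_perf_top, share_for_prefix) and the regular expression are hypothetical and assume the three-column layout shown here.

```python
import re

# perf top lines copied from the description above.
PERF_SAMPLE = """\
 39.44%  [kernel]                  [k] osc_should_shrink_grant
  12.14%  [kernel]                  [k] osc_grant_work_handler
   2.81%  [kernel]                  [k] process_one_work
   2.64%  [kernel]                  [k] __queue_work
   2.56%  [kernel]                  [k] read_tsc
"""

# Assumed layout: "<pct>%  <object>  [<type>] <symbol>".
LINE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+\S+\s+\[\w\]\s+(\S+)\s*$")


def parse_perf_top(text):
    """Return a list of (symbol, percent) tuples for lines that match the layout."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append((match.group(2), float(match.group(1))))
    return rows


def share_for_prefix(rows, prefix):
    """Sum the sample percentage of all symbols starting with `prefix`."""
    return sum(pct for symbol, pct in rows if symbol.startswith(prefix))


rows = parse_perf_top(PERF_SAMPLE)
osc_share = share_for_prefix(rows, "osc_")
print(f"osc_* symbols: {osc_share:.2f}% of samples")
```

On the sample above, the two osc_* symbols together account for roughly half of the samples, which points at the grant work handler rather than generic workqueue overhead.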



 Comments   
Comment by Jacek Tomaka [ 09/Aug/19 ]

Most likely regression from LU-8708

Comment by Jacek Tomaka [ 08/Jan/20 ]

Any news on this ticket?

Comment by Gerrit Updater [ 04/Feb/20 ]

Alexander Zarochentsev (c17826@cray.com) uploaded a new patch: https://review.whamcloud.com/37429
Subject: LU-12651 osc: always call update_next_shrink
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2986155c51914c5a63f6c351908c9a49dbe5042f

Comment by Alexander Zarochentsev [ 04/Feb/20 ]

Jacek Tomaka,
can you try https://review.whamcloud.com/37429 ?

Comment by Alexander Zarochentsev [ 04/Feb/20 ]

My experiments with 2.12-based Lustre and grant_shrink=0:

Without the fix, a kworker starts to consume 100% CPU about 20 min after Lustre mount time (the default grant shrinking interval):

top - 00:03:08 up 2 days, 11:32,  3 users,  load average: 2.95, 2.47, 2.22
Tasks: 258 total,   3 running, 255 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 25.0 sy,  0.0 ni, 75.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  2914024 total,  1138684 free,   544988 used,  1230352 buff/cache
KiB Swap:  2113532 total,  2113532 free,        0 used.  2190536 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                        
21631 root      20   0       0      0      0 R 100.0  0.0   3:03.08 kworker/3:2                                                                                    
    1 root      20   0  191032   3912   2584 S   0.0  0.1   0:06.70 systemd                                                                                        
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.06 kthreadd                                                                                       
    3 root      20   0       0      0      0 S   0.0  0.0   0:01.06 ksoftirqd/0                                                                                    
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H                                                                                   

With the fix, 22 min after start, the system is idle:

top - 00:32:05 up 2 days, 12:01,  3 users,  load average: 2.00, 2.01, 2.06
Tasks: 261 total,   2 running, 259 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us,  0.1 sy,  0.0 ni, 99.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  2914024 total,  1133004 free,   549940 used,  1231080 buff/cache
KiB Swap:  2113532 total,  2113532 free,        0 used.  2185136 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                        
  367 root      20   0  162180   2456   1584 R   0.3  0.1   0:00.03 top                                                                                            
    1 root      20   0  191032   3912   2584 S   0.0  0.1   0:06.85 systemd                                                                                        
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.07 kthreadd                                                                                       
    3 root      20   0       0      0      0 S   0.0  0.0   0:01.10 ksoftirqd/0                                                                                    
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H                                                                                   
    7 root      rt   0       0      0      0 S   0.0  0.0   0:00.61 migration/0                                                                                    
    8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh                                                                                         
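
The two runs above are consistent with a self-rescheduling work item whose next deadline stops advancing once grant shrinking is disabled. The following is a toy model of that idea, not Lustre code: the names Client, should_shrink, handler_delay and count_runs are hypothetical, and the assumption that the patch works by refreshing the deadline even when no shrink is attempted is inferred only from the patch subject ("always call update_next_shrink") and the 20 min onset reported here.

```python
from dataclasses import dataclass

GRANT_SHRINK_INTERVAL = 20 * 60  # seconds; the 20 min default mentioned above


@dataclass
class Client:
    """Hypothetical stand-in for one OSC connection."""
    next_shrink: float            # deadline for the next shrink attempt
    shrink_enabled: bool = False  # False models grant_shrink=0


def should_shrink(client, now, always_update):
    """Model of the shrink check; always_update toggles the assumed fix."""
    if not client.shrink_enabled:
        if always_update:
            # Assumed effect of the patch: advance the deadline anyway.
            client.next_shrink = now + GRANT_SHRINK_INTERVAL
        return False
    if now >= client.next_shrink:
        client.next_shrink = now + GRANT_SHRINK_INTERVAL
        return True
    return False


def handler_delay(clients, now, always_update):
    """One pass of the work handler; returns seconds until the next run."""
    for client in clients:
        should_shrink(client, now, always_update)
    earliest = min(client.next_shrink for client in clients)
    return max(0.0, earliest - now)  # 0.0 means "requeue immediately"


def count_runs(always_update, horizon=3600.0, max_runs=10_000):
    """Count handler invocations during `horizon` seconds of simulated time."""
    clients = [Client(next_shrink=GRANT_SHRINK_INTERVAL)]
    now, runs = 0.0, 0
    while now < horizon and runs < max_runs:
        runs += 1
        # A zero delay does not advance simulated time: a busy loop.
        now += handler_delay(clients, now, always_update)
    return runs


print("runs without fix:", count_runs(always_update=False))
print("runs with fix:   ", count_runs(always_update=True))
```

In this model the unpatched handler sleeps once until the 20 min deadline and then requeues itself with zero delay until the max_runs cap stops the simulation, while the patched variant runs once per interval. The cap exists only to end the simulation; on a real system the loop would keep a CPU busy, as in the first top output.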

Comment by Jacek Tomaka [ 05/Feb/20 ]

Hi Alexander,
Thanks for looking into it. Would you be so kind to provide patch for 2.12.3 as well?
Regards.
Jacek Tomaka

Comment by Alexander Zarochentsev [ 05/Feb/20 ]

Jacek,
>Thanks for looking into it. Would you be so kind to provide patch for 2.12.3 as well?
The same patch applies to b2_12.

Comment by Jacek Tomaka [ 10/Feb/20 ]

Hi Alexander,
Our initial testing on a machine with a patched client (2.12.3 + LU-12759 + this patch) shows that the kworker no longer goes crazy.
Great job! Thanks!
Will let you know if we run into any issues with this patch.
Jacek Tomaka

Comment by Gerrit Updater [ 14/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37429/
Subject: LU-12651 osc: always call update_next_shrink
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 117f587bc3e60f4dd1c939f8488e43cb752c12ca

Comment by Peter Jones [ 14/Feb/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 14/Feb/20 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37572
Subject: LU-12651 osc: always call update_next_shrink
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 70d299f149e1cb5f396576baf452a5eba911a30a

Comment by Gerrit Updater [ 25/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37572/
Subject: LU-12651 osc: always call update_next_shrink
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 10a799263964422df575038d3dfb507a09bfa221
