Details

    Description

      When LRU resizing is enabled on the client, ldlm_poold sometimes shows extremely high CPU load, and at the same time schedule_timeout() complains about a negative timeout value. The problem eventually recovers without any manual intervention, but it recurs frequently while the file system is under high load.

      top - 09:48:51 up 6 days, 11:17,  2 users,  load average: 1.00, 1.01, 1.00
      Tasks: 516 total,   2 running, 514 sleeping,   0 stopped,   0 zombie
      Cpu(s):  0.1%us,  6.4%sy,  0.0%ni, 93.4%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
      Mem:  65903880k total, 24300068k used, 41603812k free,   346516k buffers
      Swap: 65535992k total,        0k used, 65535992k free, 18665656k cached
      
         PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
       37976 root      20   0     0    0    0 R 99.4  0.0   2412:25 ldlm_bl_04
      
      Jul 13 12:49:30 mu01 kernel: LustreError: 11-0: lustre-OST000a-osc-ffff88080fdad800: Communicating with 10.0.2.2@o2ib, operation obd_ping failed with -107.
      Jul 13 12:49:30 mu01 kernel: Lustre: lustre-OST000a-osc-ffff88080fdad800: Connection to lustre-OST000a (at 10.0.2.2@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Jul 13 12:49:30 mu01 kernel: LustreError: 167-0: lustre-OST000a-osc-ffff88080fdad800: This client was evicted by lustre-OST000a; in progress operations using this service will fail.
      Jul 13 12:49:31 mu01 kernel: schedule_timeout: wrong timeout value fffffffff5c2c8c0
      Jul 13 12:49:31 mu01 kernel: Pid: 4054, comm: ldlm_poold Tainted: G           ---------------  T 2.6.32-279.el6.x86_64 #1
      Jul 13 12:49:31 mu01 kernel: Call Trace:
      Jul 13 12:49:31 mu01 kernel: [<ffffffff814fe759>] ? schedule_timeout+0x2c9/0x2e0
      Jul 13 12:49:31 mu01 kernel: [<ffffffffa086612b>] ? ldlm_pool_recalc+0x10b/0x130 [ptlrpc]
      Jul 13 12:49:31 mu01 kernel: [<ffffffffa084cfb9>] ? ldlm_namespace_put+0x29/0x60 [ptlrpc]
      Jul 13 12:49:31 mu01 kernel: [<ffffffffa08670b0>] ? ldlm_pools_thread_main+0x1d0/0x2f0 [ptlrpc]
      Jul 13 12:49:31 mu01 kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
      Jul 13 12:49:31 mu01 kernel: [<ffffffffa0866ee0>] ? ldlm_pools_thread_main+0x0/0x2f0 [ptlrpc]
      Jul 13 12:49:31 mu01 kernel: [<ffffffff81091d66>] ? kthread+0x96/0xa0
      Jul 13 12:49:31 mu01 kernel: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
      Jul 13 12:49:31 mu01 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
      Jul 13 12:49:31 mu01 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Jul 13 12:49:33 mu01 kernel: Lustre: lustre-OST000a-osc-ffff88080fdad800: Connection restored to lustre-OST000a (at 10.0.2.2@o2ib)
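
      The call trace suggests ldlm_pools_thread_main() passes the sleep interval computed during pool recalculation straight to schedule_timeout(), and the logged value fffffffff5c2c8c0 is a negative number when read as a signed long. Below is a minimal user-space sketch of how such a negative wait can arise and how a defensive clamp would avoid it; the names (pool_recalc_sleep, RECALC_INTERVAL_SEC) are hypothetical and this is not the actual Lustre code:

      #include <stdio.h>
      #include <time.h>

      #define RECALC_INTERVAL_SEC 1

      /* Hypothetical model of the pools thread's sleep computation.
       * If the previous recalculation overran the interval (or the
       * clock jumped), "interval - elapsed" goes negative; viewed as
       * an unsigned jiffies count that becomes a huge value such as
       * fffffffff5c2c8c0, which schedule_timeout() rejects, so the
       * thread loops without sleeping and pins a CPU. */
      static long pool_recalc_sleep(time_t now, time_t last_recalc)
      {
              long wait = RECALC_INTERVAL_SEC - (long)(now - last_recalc);

              /* Defensive clamp: never hand a non-positive timeout to
               * the scheduler; wait a full interval instead. */
              if (wait <= 0)
                      wait = RECALC_INTERVAL_SEC;
              return wait;
      }

      int main(void)
      {
              time_t now = time(NULL);

              /* Simulate a recalculation pass that finished 5 seconds
               * ago: without the clamp the computed wait would be -4. */
              printf("sleep %ld s\n", pool_recalc_sleep(now, now - 5));
              return 0;
      }

      With a clamp like this, a thread whose computed wait goes non-positive would sleep for a full interval instead of spinning.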
      

          Activity

            [LU-5415] High ldlm_poold load on client
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-4536 [ LU-4536 ]
            adilger Andreas Dilger made changes -
            Description: formatting only; the text is unchanged, with the logs wrapped in {noformat} tags.
            jgmitter Joseph Gmitter (Inactive) made changes -
            Link New: This issue is related to DELL-86 [ DELL-86 ]
            pjones Peter Jones made changes -
            Labels Original: 22i patch New: patch
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.5.3 [ 11100 ]
            pjones Peter Jones made changes -
            Labels Original: i22 patch New: 22i patch
            pjones Peter Jones made changes -
            Labels Original: patch New: i22 patch
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.7.0 [ 10631 ]
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones made changes -
            Assignee Original: Lai Siyao [ laisiyao ] New: Zhenyu Xu [ bobijam ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-2924 [ LU-2924 ]

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: lixi Li Xi (Inactive)
              Votes: 0
              Watchers: 6
