Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.6.0
-
3
-
15059
Description
When LRU resizing is enabled on the client, ldlm_poold sometimes shows extremely high CPU load. At the same time, schedule_timeout() complains about a negative timeout. After some time the problem clears without any manual intervention, but it recurs frequently when the file system is under high load.
top - 09:48:51 up 6 days, 11:17, 2 users, load average: 1.00, 1.01, 1.00
Tasks: 516 total, 2 running, 514 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 6.4%sy, 0.0%ni, 93.4%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 65903880k total, 24300068k used, 41603812k free, 346516k buffers
Swap: 65535992k total, 0k used, 65535992k free, 18665656k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
37976 root 20 0 0 0 0 R 99.4 0.0 2412:25 ldlm_bl_04

Jul 13 12:49:30 mu01 kernel: LustreError: 11-0: lustre-OST000a-osc-ffff88080fdad800: Communicating with 10.0.2.2@o2ib, operation obd_ping failed with -107.
Jul 13 12:49:30 mu01 kernel: Lustre: lustre-OST000a-osc-ffff88080fdad800: Connection to lustre-OST000a (at 10.0.2.2@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Jul 13 12:49:30 mu01 kernel: LustreError: 167-0: lustre-OST000a-osc-ffff88080fdad800: This client was evicted by lustre-OST000a; in progress operations using this service will fail.
Jul 13 12:49:31 mu01 kernel: schedule_timeout: wrong timeout value fffffffff5c2c8c0
Jul 13 12:49:31 mu01 kernel: Pid: 4054, comm: ldlm_poold Tainted: G --------------- T 2.6.32-279.el6.x86_64 #1
Jul 13 12:49:31 mu01 kernel: Call Trace:
Jul 13 12:49:31 mu01 kernel: [<ffffffff814fe759>] ? schedule_timeout+0x2c9/0x2e0
Jul 13 12:49:31 mu01 kernel: [<ffffffffa086612b>] ? ldlm_pool_recalc+0x10b/0x130 [ptlrpc]
Jul 13 12:49:31 mu01 kernel: [<ffffffffa084cfb9>] ? ldlm_namespace_put+0x29/0x60 [ptlrpc]
Jul 13 12:49:31 mu01 kernel: [<ffffffffa08670b0>] ? ldlm_pools_thread_main+0x1d0/0x2f0 [ptlrpc]
Jul 13 12:49:31 mu01 kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
Jul 13 12:49:31 mu01 kernel: [<ffffffffa0866ee0>] ? ldlm_pools_thread_main+0x0/0x2f0 [ptlrpc]
Jul 13 12:49:31 mu01 kernel: [<ffffffff81091d66>] ? kthread+0x96/0xa0
Jul 13 12:49:31 mu01 kernel: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
Jul 13 12:49:31 mu01 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
Jul 13 12:49:31 mu01 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Jul 13 12:49:33 mu01 kernel: Lustre: lustre-OST000a-osc-ffff88080fdad800: Connection restored to lustre-OST000a (at 10.0.2.2@o2ib)
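
To see why schedule_timeout() rejects the value in the log above, it may help to decode it. The kernel prints the timeout with %lx, so a negative long shows up as a large hexadecimal number. The sketch below is not Lustre code: the helper name to_signed64 is made up for illustration, and the tick rate ASSUMED_HZ = 1000 is an assumption, since the ticket does not state the kernel's HZ setting.

```python
def to_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a two's-complement signed long."""
    value &= (1 << 64) - 1
    # If the top bit is set, the value is negative in two's complement.
    return value - (1 << 64) if value & (1 << 63) else value


# Value taken verbatim from "schedule_timeout: wrong timeout value ..." in the log.
logged = int("fffffffff5c2c8c0", 16)
timeout_jiffies = to_signed64(logged)
print(timeout_jiffies)  # -171784000

# ASSUMPTION: HZ is not given in the ticket; 1000 is used only for illustration.
ASSUMED_HZ = 1000
print(timeout_jiffies / ASSUMED_HZ)  # seconds, under the assumed tick rate
```

The decoded timeout is negative, which matches the complaint in the description: ldlm_poold asked to sleep for a negative interval, so schedule_timeout() logged the value and dumped the stack instead of sleeping.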