Details

    Description

      When LRU resizing is enabled on the client, ldlm_poold sometimes shows extremely high CPU load, and at the same time schedule_timeout() complains about a negative timeout value. The problem eventually recovers without any manual intervention, but it recurs frequently while the file system is under high load.

      top - 09:48:51 up 6 days, 11:17,  2 users,  load average: 1.00, 1.01, 1.00
      Tasks: 516 total,   2 running, 514 sleeping,   0 stopped,   0 zombie
      Cpu(s):  0.1%us,  6.4%sy,  0.0%ni, 93.4%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
      Mem:  65903880k total, 24300068k used, 41603812k free,   346516k buffers
      Swap: 65535992k total,        0k used, 65535992k free, 18665656k cached
      
         PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
       37976 root      20   0     0    0    0 R 99.4  0.0   2412:25 ldlm_bl_04
      
      Jul 13 12:49:30 mu01 kernel: LustreError: 11-0: lustre-OST000a-osc-ffff88080fdad800: Communicating with 10.0.2.2@o2ib, operation obd_ping failed with -107.
      Jul 13 12:49:30 mu01 kernel: Lustre: lustre-OST000a-osc-ffff88080fdad800: Connection to lustre-OST000a (at 10.0.2.2@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Jul 13 12:49:30 mu01 kernel: LustreError: 167-0: lustre-OST000a-osc-ffff88080fdad800: This client was evicted by lustre-OST000a; in progress operations using this service will fail.
      Jul 13 12:49:31 mu01 kernel: schedule_timeout: wrong timeout value fffffffff5c2c8c0
      Jul 13 12:49:31 mu01 kernel: Pid: 4054, comm: ldlm_poold Tainted: G           ---------------  T 2.6.32-279.el6.x86_64 #1
      Jul 13 12:49:31 mu01 kernel: Call Trace:
      Jul 13 12:49:31 mu01 kernel: [<ffffffff814fe759>] ? schedule_timeout+0x2c9/0x2e0
      Jul 13 12:49:31 mu01 kernel: [<ffffffffa086612b>] ? ldlm_pool_recalc+0x10b/0x130 [ptlrpc]
      Jul 13 12:49:31 mu01 kernel: [<ffffffffa084cfb9>] ? ldlm_namespace_put+0x29/0x60 [ptlrpc]
      Jul 13 12:49:31 mu01 kernel: [<ffffffffa08670b0>] ? ldlm_pools_thread_main+0x1d0/0x2f0 [ptlrpc]
      Jul 13 12:49:31 mu01 kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
      Jul 13 12:49:31 mu01 kernel: [<ffffffffa0866ee0>] ? ldlm_pools_thread_main+0x0/0x2f0 [ptlrpc]
      Jul 13 12:49:31 mu01 kernel: [<ffffffff81091d66>] ? kthread+0x96/0xa0
      Jul 13 12:49:31 mu01 kernel: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
      Jul 13 12:49:31 mu01 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
      Jul 13 12:49:31 mu01 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Jul 13 12:49:33 mu01 kernel: Lustre: lustre-OST000a-osc-ffff88080fdad800: Connection restored to lustre-OST000a (at 10.0.2.2@o2ib)
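
      The call trace suggests ldlm_pools_thread_main() passes the sleep interval computed during pool recalculation straight to schedule_timeout(), and the logged value fffffffff5c2c8c0 is a negative number when read as a signed long. Below is a minimal user-space sketch of how such a negative wait can arise and how a defensive clamp would avoid it; the names (pool_recalc_sleep, RECALC_INTERVAL_SEC) are hypothetical and this is not the actual Lustre code:

      #include <stdio.h>
      #include <time.h>

      #define RECALC_INTERVAL_SEC 1

      /* Hypothetical model of the pools thread's sleep computation.
       * If the previous recalculation overran the interval (or the
       * clock jumped), "interval - elapsed" goes negative; viewed as
       * an unsigned jiffies count that becomes a huge value such as
       * fffffffff5c2c8c0, which schedule_timeout() rejects, so the
       * thread loops without sleeping and pins a CPU. */
      static long pool_recalc_sleep(time_t now, time_t last_recalc)
      {
              long wait = RECALC_INTERVAL_SEC - (long)(now - last_recalc);

              /* Defensive clamp: never hand a non-positive timeout to
               * the scheduler; wait a full interval instead. */
              if (wait <= 0)
                      wait = RECALC_INTERVAL_SEC;
              return wait;
      }

      int main(void)
      {
              time_t now = time(NULL);

              /* Simulate a recalculation pass that finished 5 seconds
               * ago: without the clamp the computed wait would be -4. */
              printf("sleep %ld s\n", pool_recalc_sleep(now, now - 5));
              return 0;
      }

      With a clamp like this, a thread whose computed wait goes non-positive would sleep for a full interval instead of spinning.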
      

          Activity

            [LU-5415] High ldlm_poold load on client
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-4536 [ LU-4536 ]
            adilger Andreas Dilger made changes -
            Description: formatting only; the text is unchanged, with the logs wrapped in {noformat} tags.
            jgmitter Joseph Gmitter (Inactive) made changes -
            Link New: This issue is related to DELL-86 [ DELL-86 ]
            pjones Peter Jones made changes -
            Labels Original: 22i patch New: patch
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.5.3 [ 11100 ]
            pjones Peter Jones made changes -
            Labels Original: i22 patch New: 22i patch
            pjones Peter Jones made changes -
            Labels Original: patch New: i22 patch
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.7.0 [ 10631 ]
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones made changes -
            Assignee Original: Lai Siyao [ laisiyao ] New: Zhenyu Xu [ bobijam ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-2924 [ LU-2924 ]

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: lixi Li Xi (Inactive)
              Votes: 0
              Watchers: 6
