
[LU-4195] MDT Slow with ptlrpcd using 100% cpu.

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.1.5
    • None
    • Environment: server running 2.1.5-2nas;
      our source is at git://github.com/jlan/lustre-nas.git
    • Severity: 3
    • 11369

    Description

      MDT response was very slow. top showed ptlrpcd running at 100% CPU. The console showed errors. We were able to run a debug trace; see the attached files.

      Lustre: Service thread pid 7065 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Pid: 7065, comm: mdt_93

      Call Trace:
      [<ffffffff810539b2>] ? enqueue_task_fair+0x142/0x490
      [<ffffffff81096f5d>] ? sched_clock_cpu+0xcd/0x110
      [<ffffffffa04f960e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      [<ffffffffa0a9f6de>] qos_statfs_update+0x7fe/0xa70 [lov]
      [<ffffffff8105fab0>] ? default_wake_function+0x0/0x20
      [<ffffffffa0aa00fd>] alloc_qos+0x1ad/0x21a0 [lov]
      [<ffffffffa0aa306c>] qos_prep_create+0x1ec/0x2380 [lov]
      [<ffffffffa04f98be>] ? cfs_free+0xe/0x10 [libcfs]
      [<ffffffffa0a9c63a>] lov_prep_create_set+0xea/0x390 [lov]
      [<ffffffffa0a84b0c>] lov_create+0x1ac/0x1400 [lov]
      [<ffffffffa0d8a50f>] ? obd_iocontrol+0xef/0x390 [mdd]
      [<ffffffffa0d8f90e>] mdd_lov_create+0x9ee/0x1ba0 [mdd]
      [<ffffffffa0da1871>] mdd_create+0xf81/0x1a90 [mdd]
      [<ffffffffa0ea9df3>] ? osd_oi_lookup+0x83/0x110 [osd_ldiskfs]
      [<ffffffffa0ea456c>] ? osd_object_init+0xdc/0x3e0 [osd_ldiskfs]
      [<ffffffffa0eda3f7>] cml_create+0x97/0x250 [cmm]
      [<ffffffffa0e165e1>] ? mdt_version_get_save+0x91/0xd0 [mdt]
      [<ffffffffa0e2c06e>] mdt_reint_open+0x1aae/0x28a0 [mdt]
      [<ffffffffa078f724>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
      [<ffffffffa0da456e>] ? md_ucred+0x1e/0x60 [mdd]
      [<ffffffffa0e14c81>] mdt_reint_rec+0x41/0xe0 [mdt]
      [<ffffffffa0e0bed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
      [<ffffffffa0e0c53d>] mdt_intent_reint+0x1ed/0x530 [mdt]
      [<ffffffffa0e0ac09>] mdt_intent_policy+0x379/0x690 [mdt]
      [<ffffffffa074b351>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
      [<ffffffffa07711ad>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
      [<ffffffffa0e0b586>] mdt_enqueue+0x46/0x130 [mdt]
      [<ffffffffa0e00772>] mdt_handle_common+0x932/0x1750 [mdt]
      [<ffffffffa0e01665>] mdt_regular_handle+0x15/0x20 [mdt]
      [<ffffffffa079fb4e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
      [<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
      [<ffffffff8100c0ca>] child_rip+0xa/0x20
      [<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
      [<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
      [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
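
      For reference, a debug trace like the one attached is usually captured on the MDS along these lines (a sketch only; the debug mask and output path below are illustrative, not taken from this ticket):

        # enable extra debug flags first if needed (mask is only an example)
        lctl set_param debug=+trace

        # dump the Lustre kernel debug buffer to a file once the problem is seen
        lctl dk /tmp/lustre-debug.log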

      Attachments

        1. bt.LU4195
          1.07 MB
        2. meminfo.LU4195
          33 kB

        Activity

          [LU-4195] MDT Slow with ptlrpcd using 100% cpu.
          pjones Peter Jones added a comment -

          ok thanks Mahmoud


          mhanafi Mahmoud Hanafi added a comment -

          please close.

          jfc John Fuchs-Chesney (Inactive) added a comment -

          Hello again Mahmoud,
          Please advise whether you want us to keep this ticket open.

          If we don't hear back, we'll mark it as resolved with no fix, and we can re-open it later if requested.

          Thanks,
          ~ jfc.

          jfc John Fuchs-Chesney (Inactive) added a comment -

          Hello Mahmoud,
          Just want to check in to see if this is still a 'live' issue for you?
          If the problem has been dealt with, can we mark this issue as 'resolved'?
          Thanks,
          ~ jfc.

          ashehata Amir Shehata (Inactive) added a comment - edited

          Would it be possible to grab the route configuration from one of the nodes that has the problem? Also, could you please highlight the problematic routes.

          As a basic sanity check, please make sure that the routers are actually configured as routers, i.e. forwarding="enabled". If not, the node will drop all messages not destined to itself.

          This error message is hit whenever LNET tries to send a message to a final destination on a net that has no route from the net the current node is on, so no appropriate route can be chosen.

          Furthermore, the routes can exist but be down because the NID specified is not reachable. Is it possible to run an lctl ping <nid> from the node that's having the problem? That should return a list of the target router's NIDs and their status (up/down). If one of those NIDs is down and avoid_asym_router_failure is 1 (which it is by default), then the entire router is considered down, and when sending messages we'll hit the above error.

          NOTE: if you have a router with multiple NIDs, but one of the NIDs is "unused" (i.e. it sends/receives no messages), that NID will be considered down, which leads to the scenario described above.
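
          A minimal sketch of those checks on a 2.1-era node (the NID is just the router NID from the log excerpt further down; the sysfs paths may vary by version):

            # from the node having trouble: ping each NID of the suspect router
            lctl ping 10.151.27.60@o2ib

            # on the router itself: confirm it is actually configured to forward
            cat /sys/module/lnet/parameters/forwarding

            # check whether asymmetric-router-failure handling is in effect
            cat /sys/module/lnet/parameters/avoid_asym_router_failure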


          jlevi Jodi Levi (Inactive) added a comment -

          Amir,
          Could you take a look at this one?
          Thank you!

          niu Niu Yawei (Inactive) added a comment -

          I agree with Lai: the log shows that there is no router from o2ib to o2ib233, and the stack trace shows all ptlrpcd threads were busy acquiring LNET_LOCK (which I think is caused by the router problem).

          Mahmoud, I didn't see anything related to quota in the log; what evidence indicates it was triggered by an 'lfs setquota' command?
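
          One quick way to check for the missing route from the MDS, assuming the /proc interface of that era (the net name o2ib233 comes from the log line quoted further down):

            # no o2ib233 entry here would match the "No route" errors in the debug log
            grep o2ib233 /proc/sys/lnet/routes

            # routers known to this node, with their up/down state
            cat /proc/sys/lnet/routers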


          mhanafi Mahmoud Hanafi added a comment -

          We were not able to recover from this and had to dump the system.
          See the attached files for memory and stack trace info.

          We have 7 other filesystems that use the same routers and clients, and they were not experiencing this issue.

          There is "some" evidence that this may have been triggered by running an "lfs setquota" command.
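
          For context, an "lfs setquota" command of the general form below is what is being referred to; the user name, limits, and mount point are placeholders, not the actual command that was run:

            # set block (kB) and inode soft/hard limits for one user on one filesystem
            lfs setquota -u someuser -b 1000000 -B 1100000 -i 10000 -I 11000 /mnt/lustre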

          laisiyao Lai Siyao added a comment -

          In the debug logs there are lots of messages like this:

          00000400:02000400:2.0:1383201992.510338:0:7620:0:(lib-move.c:1454:lnet_send()) No route to 12345-10.153.1.199@o2ib233 via 10.151.27.60@o2ib (all routers down)
          

          It looks like the routers are down, and all connections are down too, and clients kept reconnecting to the MDS (but failed), so the ptlrpcd threads are always 100% busy.
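
          A quick way to pull those messages out of a dumped debug log (the file name is illustrative, assuming a log captured with "lctl dk" as sketched in the description above):

            lctl dk /tmp/lustre-debug.log
            grep -c 'No route to' /tmp/lustre-debug.log
            grep 'all routers down' /tmp/lustre-debug.log | head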


          niu Niu Yawei (Inactive) added a comment -

          Could you provide the full stack trace (for all tasks on all CPUs) as well? What about the memory usage? Can this situation be recovered from?
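
          For what it's worth, the data being asked for here is typically gathered along these lines (a sketch; compare the attached meminfo.LU4195):

            # dump stack traces for all tasks on all CPUs to the console/dmesg
            echo t > /proc/sysrq-trigger

            # memory usage snapshot
            cat /proc/meminfo
            head -40 /proc/slabinfo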

          People

            Assignee: ashehata Amir Shehata (Inactive)
            Reporter: mhanafi Mahmoud Hanafi
            Votes: 0
            Watchers: 8
