Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2630

Client hang in mdc_enqueue()

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.4.0
    • Sequoia, lustre 2.3.58-5chaos (github.com/chaos/lustre)
    • 3
    • 6154

    Description

      On a number of Sequoia IONs (ppc64 lustre clients) we hit an mdc hang unde mdc_enqueue(). Many processes are stuck in do_path_loopup() like this:

      2013-01-16 13:33:21.160679 {DefaultControlEventListener} [mmcs]{474}.0.0: sysiod        D 00000fffaea576bc     0 15111   3083 0x00000002
      2013-01-16 13:33:21.160730 {DefaultControlEventListener} [mmcs]{474}.0.0: Call Trace:
      2013-01-16 13:33:21.160782 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416aa30] [8000000004012b20] nscheds+0x21dc/0xfffffffffffec004 [ko2iblnd] (unreliable)
      2013-01-16 13:33:21.160835 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416ac00] [c000000000009b2c] .__switch_to+0xc4/0x100
      2013-01-16 13:33:21.160886 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416ac90] [c00000000042a418] .schedule+0x7d4/0x944
      2013-01-16 13:33:21.160937 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416af40] [c00000000042b4a8] .__mutex_lock_slowpath+0x208/0x390
      2013-01-16 13:33:21.160988 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b040] [c00000000042bf44] .mutex_lock+0x38/0x58
      2013-01-16 13:33:21.161040 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b0c0] [8000000005ac1830] .mdc_enqueue+0x6e0/0x1c60 [mdc]
      2013-01-16 13:33:21.161092 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b280] [8000000005ac3060] .mdc_intent_lock+0x2b0/0x820 [mdc]
      2013-01-16 13:33:21.161158 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b400] [8000000007411f64] .lmv_intent_open+0x4e4/0x1940 [lmv]
      2013-01-16 13:33:21.161232 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b590] [80000000074136dc] .lmv_intent_lock+0x31c/0x4b0 [lmv]
      2013-01-16 13:33:21.161314 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b690] [80000000068d3a3c] .ll_revalidate_it+0x41c/0x2220 [lustre]
      2013-01-16 13:33:21.161410 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b7f0] [80000000068d59ec] .ll_revalidate_nd+0x1ac/0x560 [lustre]
      2013-01-16 13:33:21.161539 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b8b0] [c0000000000dc318] .do_lookup+0x180/0x218
      2013-01-16 13:33:21.161626 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b970] [c0000000000dee8c] .__link_path_walk+0xb90/0x11d4
      2013-01-16 13:33:21.161694 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416ba90] [c0000000000df830] .path_walk+0x98/0x148
      2013-01-16 13:33:21.161752 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416bb30] [c0000000000dfa34] .do_path_lookup+0x38/0x78
      2013-01-16 13:33:21.161817 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416bbc0] [c0000000000e08a4] .do_filp_open+0x11c/0xadc
      2013-01-16 13:33:21.161880 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416bd80] [c0000000000cf498] .do_sys_open+0x8c/0x18c
      2013-01-16 13:33:21.161942 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416be30] [c000000000000580] syscall_exit+0x0/0x2c
      

      because this processes is sleeping while holding the lock:

      2013-01-16 13:33:21.078061 {DefaultControlEventListener} [mmcs]{474}.0.0: sysiod        S 00000fff965b7148     0 15097   3083 0x00000000
      2013-01-16 13:33:21.078112 {DefaultControlEventListener} [mmcs]{474}.0.0: Call Trace:
      2013-01-16 13:33:21.078163 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2630] [c000000000027d1c] .resched_task+0x9c/0xb4 (unreliable)
      2013-01-16 13:33:21.078214 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2800] [c000000000009b2c] .__switch_to+0xc4/0x100
      2013-01-16 13:33:21.078266 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2890] [c00000000042a418] .schedule+0x7d4/0x944
      2013-01-16 13:33:21.078317 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2b40] [8000000000ae2090] .cfs_waitq_wait+0x10/0x30 [libcfs]
      2013-01-16 13:33:21.078368 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2bb0] [8000000003b56c3c] .ldlm_completion_ast+0x6dc/0xe30 [ptlrpc]
      2013-01-16 13:33:21.078420 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2cf0] [8000000003b542c4] .ldlm_cli_enqueue_fini+0xa34/0x16b0 [ptlrpc]
      2013-01-16 13:33:21.078472 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2e20] [8000000003b553b4] .ldlm_cli_enqueue+0x474/0x970 [ptlrpc]
      2013-01-16 13:33:21.078524 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2f60] [8000000005ac18ac] .mdc_enqueue+0x75c/0x1c60 [mdc]
      2013-01-16 13:33:21.078576 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3120] [80000000073ffbf4] .lmv_enqueue+0x504/0x19f0 [lmv]
      2013-01-16 13:33:21.078627 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3280] [80000000068de4a4] .ll_get_dir_page+0x354/0x1ee0 [lustre]
      2013-01-16 13:33:21.078679 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3450] [8000000006956a28] .do_statahead_enter+0x6d8/0x21f0 [lustre]
      2013-01-16 13:33:21.078730 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3610] [800000000693c84c] .ll_lookup_it+0xb8c/0xfc0 [lustre]
      2013-01-16 13:33:21.078782 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3770] [800000000693cd50] .ll_lookup_nd+0xd0/0x580 [lustre]
      2013-01-16 13:33:21.078834 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3830] [c0000000000dc2b8] .do_lookup+0x120/0x218
      2013-01-16 13:33:21.078885 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae38f0] [c0000000000dee8c] .__link_path_walk+0xb90/0x11d4
      2013-01-16 13:33:21.078936 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3a10] [c0000000000df830] .path_walk+0x98/0x148
      2013-01-16 13:33:21.078988 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3ab0] [c0000000000dfa34] .do_path_lookup+0x38/0x78
      2013-01-16 13:33:21.079039 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3b40] [c0000000000e05e0] .user_path_at+0x68/0xc0
      2013-01-16 13:33:21.079090 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3c90] [c0000000000d5e30] .vfs_fstatat+0x5c/0xa8
      2013-01-16 13:33:21.079141 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3d30] [c0000000000d5f00] .SyS_newfstatat+0x2c/0x58
      2013-01-16 13:33:21.079193 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3e30] [c000000000000580] syscall_exit+0x0/0x2c
      

      It could be that this situation was a result of an eviction event. But note also that the client claimed to be fully recovered with the mds. See attached console log with sysrq-t output.

      Also note that sending a "kill -9" to process 15097 did break the system out of its hang.

      Attachments

        1. R20-ID-J02.log
          2.44 MB
          Christopher Morrone

        Activity

          People

            yong.fan nasf (Inactive)
            morrone Christopher Morrone (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: