Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.4.0
    • Sequoia, lustre 2.3.58-5chaos (github.com/chaos/lustre)
    • 3
    • 6154

    Description

      On a number of Sequoia IONs (ppc64 lustre clients) we hit an mdc hang unde mdc_enqueue(). Many processes are stuck in do_path_loopup() like this:

      2013-01-16 13:33:21.160679 {DefaultControlEventListener} [mmcs]{474}.0.0: sysiod        D 00000fffaea576bc     0 15111   3083 0x00000002
      2013-01-16 13:33:21.160730 {DefaultControlEventListener} [mmcs]{474}.0.0: Call Trace:
      2013-01-16 13:33:21.160782 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416aa30] [8000000004012b20] nscheds+0x21dc/0xfffffffffffec004 [ko2iblnd] (unreliable)
      2013-01-16 13:33:21.160835 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416ac00] [c000000000009b2c] .__switch_to+0xc4/0x100
      2013-01-16 13:33:21.160886 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416ac90] [c00000000042a418] .schedule+0x7d4/0x944
      2013-01-16 13:33:21.160937 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416af40] [c00000000042b4a8] .__mutex_lock_slowpath+0x208/0x390
      2013-01-16 13:33:21.160988 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b040] [c00000000042bf44] .mutex_lock+0x38/0x58
      2013-01-16 13:33:21.161040 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b0c0] [8000000005ac1830] .mdc_enqueue+0x6e0/0x1c60 [mdc]
      2013-01-16 13:33:21.161092 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b280] [8000000005ac3060] .mdc_intent_lock+0x2b0/0x820 [mdc]
      2013-01-16 13:33:21.161158 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b400] [8000000007411f64] .lmv_intent_open+0x4e4/0x1940 [lmv]
      2013-01-16 13:33:21.161232 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b590] [80000000074136dc] .lmv_intent_lock+0x31c/0x4b0 [lmv]
      2013-01-16 13:33:21.161314 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b690] [80000000068d3a3c] .ll_revalidate_it+0x41c/0x2220 [lustre]
      2013-01-16 13:33:21.161410 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b7f0] [80000000068d59ec] .ll_revalidate_nd+0x1ac/0x560 [lustre]
      2013-01-16 13:33:21.161539 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b8b0] [c0000000000dc318] .do_lookup+0x180/0x218
      2013-01-16 13:33:21.161626 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b970] [c0000000000dee8c] .__link_path_walk+0xb90/0x11d4
      2013-01-16 13:33:21.161694 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416ba90] [c0000000000df830] .path_walk+0x98/0x148
      2013-01-16 13:33:21.161752 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416bb30] [c0000000000dfa34] .do_path_lookup+0x38/0x78
      2013-01-16 13:33:21.161817 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416bbc0] [c0000000000e08a4] .do_filp_open+0x11c/0xadc
      2013-01-16 13:33:21.161880 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416bd80] [c0000000000cf498] .do_sys_open+0x8c/0x18c
      2013-01-16 13:33:21.161942 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416be30] [c000000000000580] syscall_exit+0x0/0x2c
      

      because this processes is sleeping while holding the lock:

      2013-01-16 13:33:21.078061 {DefaultControlEventListener} [mmcs]{474}.0.0: sysiod        S 00000fff965b7148     0 15097   3083 0x00000000
      2013-01-16 13:33:21.078112 {DefaultControlEventListener} [mmcs]{474}.0.0: Call Trace:
      2013-01-16 13:33:21.078163 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2630] [c000000000027d1c] .resched_task+0x9c/0xb4 (unreliable)
      2013-01-16 13:33:21.078214 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2800] [c000000000009b2c] .__switch_to+0xc4/0x100
      2013-01-16 13:33:21.078266 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2890] [c00000000042a418] .schedule+0x7d4/0x944
      2013-01-16 13:33:21.078317 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2b40] [8000000000ae2090] .cfs_waitq_wait+0x10/0x30 [libcfs]
      2013-01-16 13:33:21.078368 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2bb0] [8000000003b56c3c] .ldlm_completion_ast+0x6dc/0xe30 [ptlrpc]
      2013-01-16 13:33:21.078420 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2cf0] [8000000003b542c4] .ldlm_cli_enqueue_fini+0xa34/0x16b0 [ptlrpc]
      2013-01-16 13:33:21.078472 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2e20] [8000000003b553b4] .ldlm_cli_enqueue+0x474/0x970 [ptlrpc]
      2013-01-16 13:33:21.078524 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2f60] [8000000005ac18ac] .mdc_enqueue+0x75c/0x1c60 [mdc]
      2013-01-16 13:33:21.078576 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3120] [80000000073ffbf4] .lmv_enqueue+0x504/0x19f0 [lmv]
      2013-01-16 13:33:21.078627 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3280] [80000000068de4a4] .ll_get_dir_page+0x354/0x1ee0 [lustre]
      2013-01-16 13:33:21.078679 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3450] [8000000006956a28] .do_statahead_enter+0x6d8/0x21f0 [lustre]
      2013-01-16 13:33:21.078730 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3610] [800000000693c84c] .ll_lookup_it+0xb8c/0xfc0 [lustre]
      2013-01-16 13:33:21.078782 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3770] [800000000693cd50] .ll_lookup_nd+0xd0/0x580 [lustre]
      2013-01-16 13:33:21.078834 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3830] [c0000000000dc2b8] .do_lookup+0x120/0x218
      2013-01-16 13:33:21.078885 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae38f0] [c0000000000dee8c] .__link_path_walk+0xb90/0x11d4
      2013-01-16 13:33:21.078936 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3a10] [c0000000000df830] .path_walk+0x98/0x148
      2013-01-16 13:33:21.078988 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3ab0] [c0000000000dfa34] .do_path_lookup+0x38/0x78
      2013-01-16 13:33:21.079039 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3b40] [c0000000000e05e0] .user_path_at+0x68/0xc0
      2013-01-16 13:33:21.079090 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3c90] [c0000000000d5e30] .vfs_fstatat+0x5c/0xa8
      2013-01-16 13:33:21.079141 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3d30] [c0000000000d5f00] .SyS_newfstatat+0x2c/0x58
      2013-01-16 13:33:21.079193 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3e30] [c000000000000580] syscall_exit+0x0/0x2c
      

      It could be that this situation was a result of an eviction event. But note also that the client claimed to be fully recovered with the mds. See attached console log with sysrq-t output.

      Also note that sending a "kill -9" to process 15097 did break the system out of its hang.

      Attachments

        Activity

          [LU-2630] Client hang in mdc_enqueue()

          We cannot reproduce it anymore since LU-2645 patch applied. Reopen it if reproduced it in the future.

          yong.fan nasf (Inactive) added a comment - We cannot reproduce it anymore since LU-2645 patch applied. Reopen it if reproduced it in the future.

          That is fine.

          morrone Christopher Morrone (Inactive) added a comment - That is fine.

          Can we close it since we have not seen it for months since Lu-2645 resolved? We can re-open it when it is reproduced in the future.

          yong.fan nasf (Inactive) added a comment - Can we close it since we have not seen it for months since Lu-2645 resolved? We can re-open it when it is reproduced in the future.
          pjones Peter Jones added a comment -

          ok then I will drop the priority for now

          pjones Peter Jones added a comment - ok then I will drop the priority for now

          No, haven't seen recently.

          morrone Christopher Morrone (Inactive) added a comment - No, haven't seen recently.

          Chris, have you seen the failure recently? Would you please to show me the MDT side log? Thanks!

          yong.fan nasf (Inactive) added a comment - Chris, have you seen the failure recently? Would you please to show me the MDT side log? Thanks!

          It is more similar as LU-2645, the client evicted due to lock completion callback timeout, although LU-2645 is under interop mode. I am working on LU-2645. So if possible, we can see what will happen after LU-2645 resolved, or just re-assign this one to me also.

          yong.fan nasf (Inactive) added a comment - It is more similar as LU-2645 , the client evicted due to lock completion callback timeout, although LU-2645 is under interop mode. I am working on LU-2645 . So if possible, we can see what will happen after LU-2645 resolved, or just re-assign this one to me also.

          Fan Yong, any comment here about the LVB on the IBIT lock?

          adilger Andreas Dilger added a comment - Fan Yong, any comment here about the LVB on the IBIT lock?
          laisiyao Lai Siyao added a comment -
          2013-01-16 08:17:03.466123 {DefaultControlEventListener} [mmcs]{474}.8.3: LustreError: 12612:0:(ldlm_lock.c:1570:ldlm_fill_lvb()) ### Unexpected LVB type ns: ls1-MDT0000-mdc-c00000036832f400 lock: c0000002d4f2ed80/0x10212001ce8912e0 lrc: 5/1,0 mode: --/PR res: 8589991817/133 bits 0x2 rrc: 1 type: IBT flags: 0x400400000000 nid: local remote: 0x481c21d383474cbf expref: -99 pid: 15097 timeout: 0 lvb_type: 0
          

          This line looks quite suspicious, because an IBIT lock shouldn't have lvb(currently only extent/quota/layout lock has lvb), but I can't find any place in the code this could happen. If this can be reproduce, and with more logs (debug logs with ldlm is best) of both client and MDS, it will help a lot.

          laisiyao Lai Siyao added a comment - 2013-01-16 08:17:03.466123 {DefaultControlEventListener} [mmcs]{474}.8.3: LustreError: 12612:0:(ldlm_lock.c:1570:ldlm_fill_lvb()) ### Unexpected LVB type ns: ls1-MDT0000-mdc-c00000036832f400 lock: c0000002d4f2ed80/0x10212001ce8912e0 lrc: 5/1,0 mode: --/PR res: 8589991817/133 bits 0x2 rrc: 1 type: IBT flags: 0x400400000000 nid: local remote: 0x481c21d383474cbf expref: -99 pid: 15097 timeout: 0 lvb_type: 0 This line looks quite suspicious, because an IBIT lock shouldn't have lvb(currently only extent/quota/layout lock has lvb), but I can't find any place in the code this could happen. If this can be reproduce, and with more logs (debug logs with ldlm is best) of both client and MDS, it will help a lot.
          laisiyao Lai Siyao added a comment -

          Chris, I was on vacation last week, how is the MDS logs? I'm looking into the right branch now, last time I was checking 2.3.57 branch.

          laisiyao Lai Siyao added a comment - Chris, I was on vacation last week, how is the MDS logs? I'm looking into the right branch now, last time I was checking 2.3.57 branch.

          People

            yong.fan nasf (Inactive)
            morrone Christopher Morrone (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: