
Many threads stuck on cfs_down_read of lov->lo_type_guard

Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.4.0
    • 3
    • 5698

    Description

      We've had reports on Sequoia of many clients getting stuck during reads. I had a chance to dump the stacks on a client in this state and saw many threads that appeared to be stuck on lov->lo_type_guard in a cfs_down_read. Here's an example stack:

      2012-11-28 09:33:05.886131 {DefaultControlEventListener} [mmcs]{716}.0.0: sysiod        D 00000fffa66863e0     0  8855   3105 0x00000002
      2012-11-28 09:33:05.886181 {DefaultControlEventListener} [mmcs]{716}.0.0: Call Trace:
      2012-11-28 09:33:05.886232 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7160] [c0000003ec8e71f0] 0xc0000003ec8e71f0 (unreliable)
      2012-11-28 09:33:05.886282 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7330] [c000000000008de0] .__switch_to+0xc4/0x100
      2012-11-28 09:33:05.886333 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e73c0] [c00000000042b0e0] .schedule+0x858/0x9c0
      2012-11-28 09:33:05.886384 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7670] [c00000000042dcac] .rwsem_down_failed_common+0x270/0x2b8
      2012-11-28 09:33:05.886435 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7730] [c00000000042dd60] .rwsem_down_read_failed+0x2c/0x44
      2012-11-28 09:33:05.886486 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e77d0] [c00000000042cee8] .down_read+0x30/0x44
      2012-11-28 09:33:05.886537 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7840] [80000000051460d8] .lov_lsm_addref+0x48/0x200 [lov]
      2012-11-28 09:33:05.886587 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e78e0] [8000000005146e94] .lov_io_init+0x84/0x160 [lov]
      2012-11-28 09:33:05.886638 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7980] [800000000247aea4] .cl_io_init0+0x104/0x260 [obdclass]
      2012-11-28 09:33:05.886689 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7a30] [800000000695a09c] .ll_file_io_generic+0x11c/0x670 [lustre]
      2012-11-28 09:33:05.886740 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7b30] [800000000695b134] .ll_file_aio_read+0x1d4/0x3a0 [lustre]
      2012-11-28 09:33:05.886790 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7c00] [800000000695b450] .ll_file_read+0x150/0x320 [lustre]
      2012-11-28 09:33:05.886841 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7ce0] [c0000000000d21a0] .vfs_read+0xd0/0x1c4
      2012-11-28 09:33:05.886893 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7d80] [c0000000000d2390] .SyS_read+0x54/0x98
      2012-11-28 09:33:05.886943 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7e30] [c000000000000580] syscall_exit+0x0/0x2c
      

      My initial guess is that another thread is holding a write lock on that semaphore, but I haven't been able to pin down such a thread by looking at the stacks. We also don't have access to crash, and kdump isn't enabled, so I can't inspect the semaphore directly.

      I've attached the stacks for all processes on the system (sysrq-t). Unfortunately, I forgot to dump the running processes (sysrq-l).
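
As a rough aid for scanning a sysrq-t dump like the attached one, a sketch along these lines could list the tasks in state D whose stacks contain a given symbol. It assumes the console prefix and frame layout seen in the example trace above. The names `Task`, `parse_tasks`, `blocked_in` and `SAMPLE` are hypothetical and not part of Lustre or any kernel tooling.

```python
import re
from dataclasses import dataclass, field

# Console prefix as seen in the trace above (assumed to be uniform), e.g.
# "2012-11-28 09:33:05.886131 {DefaultControlEventListener} [mmcs]{716}.0.0: "
PREFIX = re.compile(r"^\s*\S+ \S+ \{[^}]*\} \[[^\]]*\]\{\d+\}\S*: ")
# Task header: comm, state, pc, free stack, pid, ppid, flags
TASK = re.compile(
    r"^(?P<comm>\S.*?)\s+(?P<state>[A-Z])\s+[0-9a-f]+\s+\d+"
    r"\s+(?P<pid>\d+)\s+\d+\s+0x[0-9a-f]+\s*$"
)
# Stack frame: "[sp] [pc] .symbol+0xoff/0xsize [module]"
FRAME = re.compile(r"^\[[0-9a-f]+\] \[[0-9a-f]+\] \.?(?P<sym>[A-Za-z_][\w.]*)\+0x")


@dataclass
class Task:
    comm: str
    pid: int
    state: str
    symbols: list = field(default_factory=list)  # innermost frame first


def parse_tasks(lines):
    """Group stack frames under the task header line that precedes them."""
    tasks = []
    for raw in lines:
        line = PREFIX.sub("", raw.rstrip("\n")).strip()
        header = TASK.match(line)
        if header:
            tasks.append(Task(header["comm"], int(header["pid"]), header["state"]))
            continue
        frame = FRAME.match(line)
        if frame and tasks:
            # Frames without a resolved symbol (the "unreliable" one) are skipped.
            tasks[-1].symbols.append(frame["sym"])
    return tasks


def blocked_in(tasks, symbol):
    """Tasks in uninterruptible sleep (state D) with `symbol` on their stack."""
    return [t for t in tasks if t.state == "D" and symbol in t.symbols]


# A few lines copied from the example stack above.
SAMPLE = """\
2012-11-28 09:33:05.886131 {DefaultControlEventListener} [mmcs]{716}.0.0: sysiod        D 00000fffa66863e0     0  8855   3105 0x00000002
2012-11-28 09:33:05.886181 {DefaultControlEventListener} [mmcs]{716}.0.0: Call Trace:
2012-11-28 09:33:05.886232 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7160] [c0000003ec8e71f0] 0xc0000003ec8e71f0 (unreliable)
2012-11-28 09:33:05.886486 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e77d0] [c00000000042cee8] .down_read+0x30/0x44
2012-11-28 09:33:05.886537 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7840] [80000000051460d8] .lov_lsm_addref+0x48/0x200 [lov]
"""

for task in blocked_in(parse_tasks(SAMPLE.splitlines()), "down_read"):
    print(task.comm, task.pid, task.symbols)
```

Calling `blocked_in(tasks, "down_write")` on the full dump is one way to look for candidate writers. A writer that is only queued on the rwsem is worth a look too, since it can hold back readers that arrive after it.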

          Activity

            [LU-2404] Many threads stuck on cfs_down_read of lov->lo_type_guard
            pjones Peter Jones added a comment -

            ok thanks Prakash!


            prakash Prakash Surya (Inactive) added a comment -

            Thanks Jinshan. I'll pull that into our branch. I don't have a solid reproducer for this issue, so I'm OK if you want to close it. It can always be reopened if it is seen with the fix applied.

            jay Jinshan Xiong (Inactive) added a comment -

            This problem should have been fixed in LU-1876; the commit is ecaba99677b28536f9c376b2b835b554a7792668.

            Let's leave this ticket open until it's verified in the next test.
            pjones Peter Jones added a comment -

            Jinshan

            Andreas suggested that you should review this one

            Peter

            pjones Peter Jones added a comment -

            Alex, what do you think about this one?


            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: prakash Prakash Surya (Inactive)