Lustre / LU-2404

Many threads stuck on cfs_down_read of lov->lo_type_guard


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.0
    • Severity: 3
    • Rank (Obsolete): 5698

    Description

      We've had reports on Sequoia of many clients getting stuck during reads. I had a chance to dump the stacks on a client in this state and saw many threads that appeared to be stuck in cfs_down_read on lov->lo_type_guard. Here's an example stack:

      2012-11-28 09:33:05.886131 {DefaultControlEventListener} [mmcs]{716}.0.0: sysiod        D 00000fffa66863e0     0  8855   3105 0x00000002
      2012-11-28 09:33:05.886181 {DefaultControlEventListener} [mmcs]{716}.0.0: Call Trace:
      2012-11-28 09:33:05.886232 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7160] [c0000003ec8e71f0] 0xc0000003ec8e71f0 (unreliable)
      2012-11-28 09:33:05.886282 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7330] [c000000000008de0] .__switch_to+0xc4/0x100
      2012-11-28 09:33:05.886333 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e73c0] [c00000000042b0e0] .schedule+0x858/0x9c0
      2012-11-28 09:33:05.886384 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7670] [c00000000042dcac] .rwsem_down_failed_common+0x270/0x2b8
      2012-11-28 09:33:05.886435 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7730] [c00000000042dd60] .rwsem_down_read_failed+0x2c/0x44
      2012-11-28 09:33:05.886486 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e77d0] [c00000000042cee8] .down_read+0x30/0x44
      2012-11-28 09:33:05.886537 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7840] [80000000051460d8] .lov_lsm_addref+0x48/0x200 [lov]
      2012-11-28 09:33:05.886587 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e78e0] [8000000005146e94] .lov_io_init+0x84/0x160 [lov]
      2012-11-28 09:33:05.886638 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7980] [800000000247aea4] .cl_io_init0+0x104/0x260 [obdclass]
      2012-11-28 09:33:05.886689 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7a30] [800000000695a09c] .ll_file_io_generic+0x11c/0x670 [lustre]
      2012-11-28 09:33:05.886740 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7b30] [800000000695b134] .ll_file_aio_read+0x1d4/0x3a0 [lustre]
      2012-11-28 09:33:05.886790 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7c00] [800000000695b450] .ll_file_read+0x150/0x320 [lustre]
      2012-11-28 09:33:05.886841 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7ce0] [c0000000000d21a0] .vfs_read+0xd0/0x1c4
      2012-11-28 09:33:05.886893 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7d80] [c0000000000d2390] .SyS_read+0x54/0x98
      2012-11-28 09:33:05.886943 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7e30] [c000000000000580] syscall_exit+0x0/0x2c
      

      My initial guess is that another thread is holding the write lock on that semaphore, but I haven't been able to pin down a write-lock holder in the stacks. We also don't have crash available or kdump enabled on these nodes, so I can't inspect the semaphore directly.
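      For reference, lo_type_guard is a reader/writer lock: the read path (ll_file_read -> cl_io_init0 -> lov_io_init -> lov_lsm_addref) takes it shared via cfs_down_read(), so a single stuck writer is enough to park every new reader. Here's a minimal userspace sketch of that pile-up, with a pthread rwlock standing in for the kernel rw_semaphore (illustrative only, not the actual Lustre code):

      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      /* Stand-in for lov->lo_type_guard; in the kernel this is an
       * rw_semaphore taken via cfs_down_read() in lov_lsm_addref(). */
      static pthread_rwlock_t lo_type_guard = PTHREAD_RWLOCK_INITIALIZER;

      /* Models the read path seen in the stacks above. */
      static void *reader(void *arg)
      {
              pthread_rwlock_rdlock(&lo_type_guard);  /* parks behind the writer */
              printf("reader %ld acquired the lock\n", (long)arg);
              pthread_rwlock_unlock(&lo_type_guard);
              return NULL;
      }

      int main(void)
      {
              pthread_t tid[8];
              long i;

              /* A writer that never finishes models a stuck layout change;
               * every reader started after this point blocks in rdlock,
               * just like the sysiod threads in the dump. */
              pthread_rwlock_wrlock(&lo_type_guard);

              for (i = 0; i < 8; i++)
                      pthread_create(&tid[i], NULL, reader, (void *)i);

              sleep(2);  /* all readers are now parked */
              pthread_rwlock_unlock(&lo_type_guard);

              for (i = 0; i < 8; i++)
                      pthread_join(tid[i], NULL);
              return 0;
      }

      One thing worth noting: the kernel rw_semaphore queues fairly, so a writer merely waiting in down_write is enough to block new readers. The culprit thread may therefore be blocked itself rather than holding the lock, which would explain why no obvious write-lock holder shows up in the stacks.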

      I've attached the stacks for all processes on the system (sysrq-t); unfortunately, I forgot to also dump the running processes (sysrq-l).
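      For next time, both dumps can be triggered as root with "echo t > /proc/sysrq-trigger" and "echo l > /proc/sysrq-trigger" (assuming the kernel.sysrq setting on the nodes permits it), or programmatically with something like this sketch:

      #include <fcntl.h>
      #include <unistd.h>

      /* Write a single sysrq command character; requires root and a
       * kernel.sysrq value that allows the requested action. */
      static void sysrq(char c)
      {
              int fd = open("/proc/sysrq-trigger", O_WRONLY);
              ssize_t n;

              if (fd < 0)
                      return;
              n = write(fd, &c, 1);  /* best effort */
              (void)n;
              close(fd);
      }

      int main(void)
      {
              sysrq('t');  /* stacks of all tasks (the attached dump) */
              sysrq('l');  /* backtraces of tasks running on each CPU */
              return 0;
      }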

            People

              Assignee: Jinshan Xiong (jay, Inactive)
              Reporter: Prakash Surya (prakash, Inactive)
              Votes: 0
              Watchers: 6
