
Client deadlock in cl_lock_mutex_get

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.4.0
    • Affects Version/s: Lustre 2.4.0
    • Environment: Sequoia ppc64 client, 2.3.58-5chaos; servers are x86_64 running 2.3.58-6chaos
    • 3
    • 6270

    Description

      With 2.3.58-5chaos ppc64 clients, we are seeing Lustre hang with many threads waiting on a mutex under cl_lock_mutex_get(). In the attached file "RA0-ID-J03.log.txt", you can see the sysrq-t and sysrq-l output. The node is responsive, but Lustre is doing nothing because all ptlrpcd threads are stuck in the same path:

      2013-01-25 11:51:10.788870 {DefaultControlEventListener} [mmcs]{131}.0.0: ptlrpcd_0     D 0000000000000000     0  3268      2 0x00000000
      2013-01-25 11:51:10.788922 {DefaultControlEventListener} [mmcs]{131}.0.0: Call Trace:
      2013-01-25 11:51:10.788974 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621ee70] [c00000000068f010] svc_rdma_ops+0xda18/0x1a900 (unreliable)
      2013-01-25 11:51:10.789026 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f040] [c000000000009b2c] .__switch_to+0xc4/0x100
      2013-01-25 11:51:10.789078 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f0d0] [c00000000042a418] .schedule+0x7d4/0x944
      2013-01-25 11:51:10.789129 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f380] [c00000000042b4a8] .__mutex_lock_slowpath+0x208/0x390
      2013-01-25 11:51:10.789181 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f480] [c00000000042bf44] .mutex_lock+0x38/0x58
      2013-01-25 11:51:10.789234 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f500] [8000000002501cf8] .cl_lock_mutex_get+0xc8/0x110 [obdclass]
      2013-01-25 11:51:10.789285 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f590] [800000000514d124] .lovsub_parent_lock+0x94/0x260 [lov]
      2013-01-25 11:51:10.789337 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f640] [800000000514de70] .lovsub_lock_state+0xd0/0x300 [lov]
      2013-01-25 11:51:10.789389 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f710] [80000000024fe344] .cl_lock_state_signal+0xd4/0x2c0 [obdclass]
      2013-01-25 11:51:10.789441 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f7d0] [80000000024ff448] .cl_lock_signal+0xa8/0x260 [obdclass]
      2013-01-25 11:51:10.789492 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f890] [80000000046adb24] .osc_lock_upcall+0x194/0x810 [osc]
      2013-01-25 11:51:10.789544 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f970] [8000000004688a7c] .osc_enqueue_fini+0xfc/0x3f0 [osc]
      2013-01-25 11:51:10.789596 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621fa60] [8000000004697364] .osc_enqueue_interpret+0x104/0x240 [osc]
      2013-01-25 11:51:10.789647 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621fb40] [8000000003b7a308] .ptlrpc_check_set+0x3c8/0x4e50 [ptlrpc]
      2013-01-25 11:51:10.789700 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621fd20] [8000000003bcffec] .ptlrpcd_check+0x66c/0x870 [ptlrpc]
      2013-01-25 11:51:10.789751 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621fe40] [8000000003bd054c] .ptlrpcd+0x35c/0x510 [ptlrpc]
      2013-01-25 11:51:10.789802 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621ff90] [c00000000001b9a0] .kernel_thread+0x54/0x70
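
      The stacks above show ptlrpcd threads blocking in mutex_lock() while taking a parent cl_lock mutex from inside a sub-lock callback, which is the shape of a classic lock-ordering deadlock: one path locks child then parent, another locks parent then child. The sketch below (plain pthreads, not Lustre code; all names are illustrative) reproduces the hazard and shows the usual avoidance of trylock on the second mutex plus back-off, so the child-first thread can never wedge against a parent-first thread:

      ```c
      /* Illustrative sketch of the parent/child lock-ordering hazard.
       * Not Lustre code: "parent" and "child" stand in for the cl_lock
       * mutexes taken by lovsub_parent_lock() and its callers. */
      #include <pthread.h>
      #include <stdio.h>

      static pthread_mutex_t parent = PTHREAD_MUTEX_INITIALIZER;
      static pthread_mutex_t child  = PTHREAD_MUTEX_INITIALIZER;
      static int work_done;

      /* Child-first path: take the child lock, then try the parent.
       * Using trylock + back-off instead of a blocking mutex_lock()
       * means this thread cannot deadlock against a parent-first one. */
      static void *child_then_parent(void *arg)
      {
              (void)arg;
              for (;;) {
                      pthread_mutex_lock(&child);
                      if (pthread_mutex_trylock(&parent) == 0)
                              break;                /* got both locks */
                      pthread_mutex_unlock(&child); /* back off, retry */
              }
              work_done++;                          /* both locks held */
              pthread_mutex_unlock(&parent);
              pthread_mutex_unlock(&child);
              return NULL;
      }

      /* Parent-first path: plain blocking locks in the safe order. */
      static void *parent_then_child(void *arg)
      {
              (void)arg;
              pthread_mutex_lock(&parent);
              pthread_mutex_lock(&child);
              work_done++;                          /* both locks held */
              pthread_mutex_unlock(&child);
              pthread_mutex_unlock(&parent);
              return NULL;
      }

      int main(void)
      {
              pthread_t t1, t2;

              pthread_create(&t1, NULL, child_then_parent, NULL);
              pthread_create(&t2, NULL, parent_then_child, NULL);
              pthread_join(t1, NULL);
              pthread_join(t2, NULL);
              printf("work_done=%d\n", work_done);
              return 0;
      }
      ```

      If both threads used plain pthread_mutex_lock() in opposite orders, they could block forever exactly as the ptlrpcd threads do here; with the trylock variant both always complete.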
      

      Attachments

        Issue Links

          Activity

            [LU-2683] Client deadlock in cl_lock_mutex_get
            pjones Peter Jones added a comment -

            Landed for 2.4


            morrone Christopher Morrone (Inactive) added a comment -

            With change 5208, we have seen no repeats of the problem for four days. It looks like the patch has fixed the problem. Thanks!

            morrone Christopher Morrone (Inactive) added a comment -

            We have about a day of testing with patch set 2 of change 5208, and so far we have had no occurrences of the client deadlock.

            Probably too soon to be definitive, but it is promising.

            jay Jinshan Xiong (Inactive) added a comment -

            Bruno has claimed that this patch can at least fix the original problem, so let's land this patch for now.

            jay Jinshan Xiong (Inactive) added a comment -

            I see, thanks for the explanation.
            prakash Prakash Surya (Inactive) added a comment - - edited

            Well in total, I see that message 12 times for this particular occurrence (from Feb. 11th not running with 5208). AFAIK, all of the OSTs are healthy but when the IONs hit this issue, some of the OSTs will report as "inactive" when running "lfs check servers" on the client. That is how the admins discover IONs in this state, and then manually reboot them.

            With that said, I may have jumped the gun in my previous comments here and here. What I reported in those comments happened on a login node and wasn't actually a deadlock as far as I can tell. I think those were just "slow" threads (cause unknown) as they eventually cleared up (but then hit LU-2779).

            So unless you think differently, it probably isn't worth spending more time on this until it's seen on one of the IONs running with the patch applied. Sorry for the confusion.


            jay Jinshan Xiong (Inactive) added a comment -

            Why were there so many of them? Was one of the OSTs in recovery at that time?

            It would be helpful to dump the debug log, with DLMTRACE, CACHE, and VFSTRACE enabled, when you first see this kind of message. That can easily be done with a script.

            prakash Prakash Surya (Inactive) added a comment -

            Like these messages?

            2013-02-11 13:29:13.788428 {DefaultControlEventListener} [mmcs]{613}.12.0: LustreError: 31967:0:(osc_cache.c:896:osc_extent_wait()) extent c0000001c1fc2fe8@{[26 -> 28/31], [3|0|+|rpc|wihY|c0000001b0756f60], [196608|3|+|-|c0000001cdfaa6e0|16|c00000037ab0f300]} ls1-OST005a-osc-c0000003ec0c5800: wait ext to 0 timedout, recovery in progress?
            2013-02-11 13:30:26.835353 {DefaultControlEventListener} [mmcs]{613}.0.3:  LustreError: 40667:0:(osc_cache.c:896:osc_extent_wait()) extent c0000001cdacee38@{[2 -> 4/15], [3|0|+|rpc|wiY|c0000001b0759480], [196608|3|+|-|c0000001cdfa9600|16|c0000003c54483a0]} ls1-OST007b-osc-c0000003ec0c5800: wait ext to 0 timedout, recovery in progress?
            2013-02-11 13:30:58.855528 {DefaultControlEventListener} [mmcs]{613}.3.0:  LustreError: 32002:0:(osc_cache.c:896:osc_extent_wait()) extent c0000001cdacfca8@{[2 -> 4/15], [3|0|+|rpc|wihY|c0000001b0759d80], [196608|3|+|-|c0000001cdfa99c0|16|c0000003742646e0]} ls1-OST0083-osc-c0000003ec0c5800: wait ext to 0 timedout, recovery in progress?
            2013-02-11 13:30:58.856226 {DefaultControlEventListener} [mmcs]{613}.3.2:  LustreError: 3958:0:(osc_cache.c:896:osc_extent_wait()) extent c0000001cedc8d18@{[42 -> 44/47], [3|0|+|rpc|wihY|c0000002e43ed020], [196608|3|+|-|c0000001c0c13c00|16|c0000003c87a1e60]} ls1-OST014d-osc-c0000003ec0c5800: wait ext to 0 timedout, recovery in progress?
            2013-02-11 13:31:44.273199 {DefaultControlEventListener} [mmcs]{613}.12.0: LustreError: 31983:0:(osc_cache.c:896:osc_extent_wait()) extent c0000001c9b17230@{[2 -> 4/15], [3|0|+|rpc|wihY|c0000001b075a680], [196608|3|+|-|c0000002e1471d10|16|c0000003c80ad4a0]} ls1-OST008b-osc-c0000003ec0c5800: wait ext to 0 timedout, recovery in progress?
            

            jay Jinshan Xiong (Inactive) added a comment -

            Did you find anything printed in osc_extent_wait()?

            jay Jinshan Xiong (Inactive) added a comment -

            I need more information. Let me work out a patch for LU-2779 first so that you can use sysrq-t.

            prakash Prakash Surya (Inactive) added a comment -

            I don't have sysrq-l information from the node because it crashed shortly after due to LU-2779, but these were on the console:

            2013-02-12 16:46:00 INFO: task ptlrpcd_10:3486 blocked for more than 120 seconds.
            2013-02-12 16:46:00 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            2013-02-12 16:46:00 ptlrpcd_10    D 0000000000000000     0  3486      2 0x00008000
            2013-02-12 16:46:00 Call Trace:
            2013-02-12 16:46:00 [c000000f321a2de0] [c000000f321a2e90] 0xc000000f321a2e90 (unreliable)
            2013-02-12 16:46:00 [c000000f321a2fb0] [c0000000000142d8] .__switch_to+0xf8/0x1d0
            2013-02-12 16:46:00 [c000000f321a3040] [c0000000005bb0a8] .schedule+0x3f8/0xd30
            2013-02-12 16:46:00 [c000000f321a3340] [c0000000005bcb4c] .__mutex_lock_slowpath+0x1bc/0x2d0
            2013-02-12 16:46:00 [c000000f321a3420] [c0000000005bd42c] .mutex_lock+0x5c/0x60
            2013-02-12 16:46:00 [c000000f321a34a0] [d00000000f2923e8] .cl_lock_mutex_get+0xc8/0x110 [obdclass]
            2013-02-12 16:46:00 [c000000f321a3530] [d000000011d2d294] .lovsub_parent_lock+0x94/0x260 [lov]
            2013-02-12 16:46:00 [c000000f321a35e0] [d000000011d2f058] .lovsub_lock_modify+0xc8/0x2c0 [lov]
            2013-02-12 16:46:00 [c000000f321a36b0] [d00000000f2914f4] .cl_lock_modify+0x114/0x490 [obdclass]
            2013-02-12 16:46:00 [c000000f321a37a0] [d0000000111ec3a4] .osc_lock_granted+0x184/0x3d0 [osc]
            2013-02-12 16:46:00 [c000000f321a3870] [d0000000111ee1c4] .osc_lock_upcall+0x4c4/0x840 [osc]
            2013-02-12 16:46:00 [c000000f321a3950] [d0000000111c8b3c] .osc_enqueue_fini+0xfc/0x3f0 [osc]
            2013-02-12 16:46:00 [c000000f321a3a40] [d0000000111d7504] .osc_enqueue_interpret+0x104/0x240 [osc]
            2013-02-12 16:46:00 [c000000f321a3b20] [d00000001097a724] .ptlrpc_check_set+0x364/0x4f40 [ptlrpc]
            2013-02-12 16:46:00 [c000000f321a3cc0] [d0000000109d127c] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc]
            2013-02-12 16:46:00 [c000000f321a3de0] [d0000000109d182c] .ptlrpcd+0x37c/0x530 [ptlrpc]
            2013-02-12 16:46:00 [c000000f321a3f90] [c000000000032fd4] .kernel_thread+0x54/0x70
            2013-02-12 16:46:00 INFO: task ptlrpcd_11:3487 blocked for more than 120 seconds.
            2013-02-12 16:46:00 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            2013-02-12 16:46:00 ptlrpcd_11    D 0000000000000000     0  3487      2 0x00008000
            2013-02-12 16:46:00 Call Trace:
            2013-02-12 16:46:00 [c000000f4fefade0] [c000000f4fefae90] 0xc000000f4fefae90 (unreliable)
            2013-02-12 16:46:00 [c000000f4fefafb0] [c0000000000142d8] .__switch_to+0xf8/0x1d0
            2013-02-12 16:46:00 [c000000f4fefb040] [c0000000005bb0a8] .schedule+0x3f8/0xd30
            2013-02-12 16:46:00 [c000000f4fefb340] [c0000000005bcb4c] .__mutex_lock_slowpath+0x1bc/0x2d0
            2013-02-12 16:46:00 [c000000f4fefb420] [c0000000005bd42c] .mutex_lock+0x5c/0x60
            2013-02-12 16:46:00 [c000000f4fefb4a0] [d00000000f2923e8] .cl_lock_mutex_get+0xc8/0x110 [obdclass]
            2013-02-12 16:46:00 [c000000f4fefb530] [d000000011d2d294] .lovsub_parent_lock+0x94/0x260 [lov]
            2013-02-12 16:46:00 [c000000f4fefb5e0] [d000000011d2f058] .lovsub_lock_modify+0xc8/0x2c0 [lov]
            2013-02-12 16:46:00 [c000000f4fefb6b0] [d00000000f2914f4] .cl_lock_modify+0x114/0x490 [obdclass]
            2013-02-12 16:46:00 [c000000f4fefb7a0] [d0000000111ec3a4] .osc_lock_granted+0x184/0x3d0 [osc]
            2013-02-12 16:46:00 [c000000f4fefb870] [d0000000111ee1c4] .osc_lock_upcall+0x4c4/0x840 [osc]
            2013-02-12 16:46:00 [c000000f4fefb950] [d0000000111c8b3c] .osc_enqueue_fini+0xfc/0x3f0 [osc]
            2013-02-12 16:46:00 [c000000f4fefba40] [d0000000111d7504] .osc_enqueue_interpret+0x104/0x240 [osc]
            2013-02-12 16:46:00 [c000000f4fefbb20] [d00000001097a724] .ptlrpc_check_set+0x364/0x4f40 [ptlrpc]
            2013-02-12 16:46:00 [c000000f4fefbcc0] [d0000000109d127c] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc]
            2013-02-12 16:46:00 [c000000f4fefbde0] [d0000000109d182c] .ptlrpcd+0x37c/0x530 [ptlrpc]
            2013-02-12 16:46:00 [c000000f4fefbf90] [c000000000032fd4] .kernel_thread+0x54/0x70
            2013-02-12 16:46:00 INFO: task ldlm_poold:3516 blocked for more than 120 seconds.
            2013-02-12 16:46:00 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            2013-02-12 16:46:00 ldlm_poold    D 0000000000000000     0  3516      2 0x00008000
            2013-02-12 16:46:00 Call Trace:
            2013-02-12 16:46:00 [c000000f545b31b0] [c000000f545b3260] 0xc000000f545b3260 (unreliable)
            2013-02-12 16:46:00 [c000000f545b3380] [c0000000000142d8] .__switch_to+0xf8/0x1d0
            2013-02-12 16:46:00 [c000000f545b3410] [c0000000005bb0a8] .schedule+0x3f8/0xd30
            2013-02-12 16:46:00 [c000000f545b3710] [c0000000005bc388] .schedule_timeout+0x258/0x340
            2013-02-12 16:46:00 [c000000f545b3830] [c0000000005bbf88] .wait_for_common+0x198/0x210
            2013-02-12 16:46:00 [c000000f545b3910] [d0000000109597c8] .__ldlm_bl_to_thread+0x1f8/0x280 [ptlrpc]
            2013-02-12 16:46:00 [c000000f545b39c0] [d000000010959c2c] .ldlm_bl_to_thread+0x3dc/0x720 [ptlrpc]
            2013-02-12 16:46:00 [c000000f545b3b70] [d00000001094d488] .ldlm_cancel_lru+0xb8/0x2c0 [ptlrpc]
            2013-02-12 16:46:00 [c000000f545b3c40] [d000000010967010] .ldlm_cli_pool_recalc+0x320/0x3e0 [ptlrpc]
            2013-02-12 16:46:00 [c000000f545b3ce0] [d0000000109671fc] .ldlm_pool_recalc+0x12c/0x180 [ptlrpc]
            2013-02-12 16:46:00 [c000000f545b3d80] [d000000010969198] .ldlm_pools_recalc+0x1c8/0x480 [ptlrpc]
            2013-02-12 16:46:00 [c000000f545b3e70] [d000000010969550] .ldlm_pools_thread_main+0x100/0x490 [ptlrpc]
            2013-02-12 16:46:00 [c000000f545b3f90] [c000000000032fd4] .kernel_thread+0x54/0x70
            2013-02-12 16:46:00 INFO: task bgxlC_r:5444 blocked for more than 120 seconds.
            2013-02-12 16:46:00 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            2013-02-12 16:46:00 bgxlC_r       D 000000000fedd308     0  5444      1 0x00008012
            2013-02-12 16:46:00 Call Trace:
            2013-02-12 16:46:00 [c000000afa5c6b90] [c000000afa5c6c40] 0xc000000afa5c6c40 (unreliable)
            2013-02-12 16:46:00 [c000000afa5c6d60] [c0000000000142d8] .__switch_to+0xf8/0x1d0
            2013-02-12 16:46:00 [c000000afa5c6df0] [c0000000005bb0a8] .schedule+0x3f8/0xd30
            2013-02-12 16:46:00 [c000000afa5c70f0] [c0000000005bc388] .schedule_timeout+0x258/0x340
            2013-02-12 16:46:00 [c000000afa5c7210] [c0000000005bbf88] .wait_for_common+0x198/0x210
            2013-02-12 16:46:00 [c000000afa5c72f0] [d0000000111f0878] .osc_io_setattr_end+0xd8/0x1e0 [osc]
            2013-02-12 16:46:00 [c000000afa5c73b0] [d00000000f2987fc] .cl_io_end+0xcc/0x2b0 [obdclass]
            2013-02-12 16:46:00 [c000000afa5c7460] [d000000011d25950] .lov_io_end_wrapper+0x180/0x1f0 [lov]
            2013-02-12 16:46:00 [c000000afa5c7500] [d000000011d23ed4] .lov_io_call+0xe4/0x230 [lov]
            2013-02-12 16:46:00 [c000000afa5c75b0] [d000000011d24484] .lov_io_end+0x54/0x110 [lov]
            2013-02-12 16:46:00 [c000000afa5c7640] [d00000000f2987fc] .cl_io_end+0xcc/0x2b0 [obdclass]
            2013-02-12 16:46:00 [c000000afa5c76f0] [d00000000f2a22a4] .cl_io_loop+0x1a4/0x2c0 [obdclass]
            2013-02-12 16:46:00 [c000000afa5c77a0] [d0000000136a197c] .cl_setattr_ost+0x25c/0x390 [lustre]
            2013-02-12 16:46:00 [c000000afa5c7880] [d000000013659030] .ll_setattr_raw+0xc10/0x1930 [lustre]
            2013-02-12 16:46:00 [c000000afa5c79a0] [c0000000001e3814] .notify_change+0x1f4/0x400
            2013-02-12 16:46:00 [c000000afa5c7a60] [c0000000001c1d28] .do_truncate+0x78/0xd0
            2013-02-12 16:46:00 [c000000afa5c7b40] [c0000000001d6140] .do_filp_open+0x4f0/0xdc0
            2013-02-12 16:46:00 [c000000afa5c7d10] [c0000000001c0724] .do_sys_open+0x84/0x180
            2013-02-12 16:46:00 [c000000afa5c7dc0] [c0000000002209b4] .compat_sys_open+0x24/0x40
            2013-02-12 16:46:00 [c000000afa5c7e30] [c000000000008564] syscall_exit+0x0/0x40
            

            The stacks of ptlrpcd_10 and ptlrpcd_11 look similar to what's in the description, so I thought it was the same bug. Do you think this is something different?


            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 11
