
Client deadlock in cl_lock_mutex_get

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.4.0
    • Affects Version/s: Lustre 2.4.0
    • Environment: Sequoia ppc64 client, 2.3.58-5chaos; servers are x86_64 running 2.3.58-6chaos
    • 3
    • 6270

    Description

      With 2.3.58-5chaos ppc64 clients, we are seeing Lustre hang with many threads waiting on a mutex under cl_lock_mutex_get(). In the attached file "RA0-ID-J03.log.txt", you can see the sysrq-t and sysrq-l output. The node is responsive, but Lustre is doing nothing because all ptlrpcd threads are stuck in the same path:

      2013-01-25 11:51:10.788870 {DefaultControlEventListener} [mmcs]{131}.0.0: ptlrpcd_0     D 0000000000000000     0  3268      2 0x00000000
      2013-01-25 11:51:10.788922 {DefaultControlEventListener} [mmcs]{131}.0.0: Call Trace:
      2013-01-25 11:51:10.788974 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621ee70] [c00000000068f010] svc_rdma_ops+0xda18/0x1a900 (unreliable)
      2013-01-25 11:51:10.789026 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f040] [c000000000009b2c] .__switch_to+0xc4/0x100
      2013-01-25 11:51:10.789078 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f0d0] [c00000000042a418] .schedule+0x7d4/0x944
      2013-01-25 11:51:10.789129 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f380] [c00000000042b4a8] .__mutex_lock_slowpath+0x208/0x390
      2013-01-25 11:51:10.789181 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f480] [c00000000042bf44] .mutex_lock+0x38/0x58
      2013-01-25 11:51:10.789234 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f500] [8000000002501cf8] .cl_lock_mutex_get+0xc8/0x110 [obdclass]
      2013-01-25 11:51:10.789285 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f590] [800000000514d124] .lovsub_parent_lock+0x94/0x260 [lov]
      2013-01-25 11:51:10.789337 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f640] [800000000514de70] .lovsub_lock_state+0xd0/0x300 [lov]
      2013-01-25 11:51:10.789389 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f710] [80000000024fe344] .cl_lock_state_signal+0xd4/0x2c0 [obdclass]
      2013-01-25 11:51:10.789441 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f7d0] [80000000024ff448] .cl_lock_signal+0xa8/0x260 [obdclass]
      2013-01-25 11:51:10.789492 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f890] [80000000046adb24] .osc_lock_upcall+0x194/0x810 [osc]
      2013-01-25 11:51:10.789544 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621f970] [8000000004688a7c] .osc_enqueue_fini+0xfc/0x3f0 [osc]
      2013-01-25 11:51:10.789596 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621fa60] [8000000004697364] .osc_enqueue_interpret+0x104/0x240 [osc]
      2013-01-25 11:51:10.789647 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621fb40] [8000000003b7a308] .ptlrpc_check_set+0x3c8/0x4e50 [ptlrpc]
      2013-01-25 11:51:10.789700 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621fd20] [8000000003bcffec] .ptlrpcd_check+0x66c/0x870 [ptlrpc]
      2013-01-25 11:51:10.789751 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621fe40] [8000000003bd054c] .ptlrpcd+0x35c/0x510 [ptlrpc]
      2013-01-25 11:51:10.789802 {DefaultControlEventListener} [mmcs]{131}.0.0: [c0000003c621ff90] [c00000000001b9a0] .kernel_thread+0x54/0x70
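
      The stacks above show ptlrpcd threads blocking in mutex_lock() while taking a parent cl_lock mutex from inside a sub-lock callback, which is the shape of a classic lock-ordering deadlock: one path locks child then parent, another locks parent then child. The sketch below (plain pthreads, not Lustre code; all names are illustrative) reproduces the hazard and shows the usual avoidance of trylock on the second mutex plus back-off, so the child-first thread can never wedge against a parent-first thread:

      ```c
      /* Illustrative sketch of the parent/child lock-ordering hazard.
       * Not Lustre code: "parent" and "child" stand in for the cl_lock
       * mutexes taken by lovsub_parent_lock() and its callers. */
      #include <pthread.h>
      #include <stdio.h>

      static pthread_mutex_t parent = PTHREAD_MUTEX_INITIALIZER;
      static pthread_mutex_t child  = PTHREAD_MUTEX_INITIALIZER;
      static int work_done;

      /* Child-first path: take the child lock, then try the parent.
       * Using trylock + back-off instead of a blocking mutex_lock()
       * means this thread cannot deadlock against a parent-first one. */
      static void *child_then_parent(void *arg)
      {
              (void)arg;
              for (;;) {
                      pthread_mutex_lock(&child);
                      if (pthread_mutex_trylock(&parent) == 0)
                              break;                /* got both locks */
                      pthread_mutex_unlock(&child); /* back off, retry */
              }
              work_done++;                          /* both locks held */
              pthread_mutex_unlock(&parent);
              pthread_mutex_unlock(&child);
              return NULL;
      }

      /* Parent-first path: plain blocking locks in the safe order. */
      static void *parent_then_child(void *arg)
      {
              (void)arg;
              pthread_mutex_lock(&parent);
              pthread_mutex_lock(&child);
              work_done++;                          /* both locks held */
              pthread_mutex_unlock(&child);
              pthread_mutex_unlock(&parent);
              return NULL;
      }

      int main(void)
      {
              pthread_t t1, t2;

              pthread_create(&t1, NULL, child_then_parent, NULL);
              pthread_create(&t2, NULL, parent_then_child, NULL);
              pthread_join(t1, NULL);
              pthread_join(t2, NULL);
              printf("work_done=%d\n", work_done);
              return 0;
      }
      ```

      If both threads used plain pthread_mutex_lock() in opposite orders, they could block forever exactly as the ptlrpcd threads do here; with the trylock variant both always complete.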
      

      Attachments

        Issue Links

          Activity

            [LU-2683] Client deadlock in cl_lock_mutex_get
            pjones Peter Jones added a comment -

            Landed for 2.4


            morrone Christopher Morrone (Inactive) added a comment -

            With change 5208, we have seen no repeats of the problem for four days. It looks like the patch has fixed the problem. Thanks!

            morrone Christopher Morrone (Inactive) added a comment -

            We have about a day of testing with patch set 2 of change 5208, and so far we have had no occurrences of the client deadlock.

            Probably too soon to be definitive, but it is promising.

            jay Jinshan Xiong (Inactive) added a comment -

            Bruno has claimed that this patch can at least fix the original problem, so let's land this patch for now.

            jay Jinshan Xiong (Inactive) added a comment -

            I see, thanks for the explanation.
            prakash Prakash Surya (Inactive) added a comment - - edited

            Well in total, I see that message 12 times for this particular occurrence (from Feb. 11th not running with 5208). AFAIK, all of the OSTs are healthy but when the IONs hit this issue, some of the OSTs will report as "inactive" when running "lfs check servers" on the client. That is how the admins discover IONs in this state, and then manually reboot them.

            With that said, I may have jumped the gun in my previous comments here and here. What I reported in those comments happened on a login node and wasn't actually a deadlock as far as I can tell. I think those were just "slow" threads (cause unknown) as they eventually cleared up (but then hit LU-2779).

            So unless you think differently, it probably isn't worth spending more time on this until it's seen on one of the IONs running with the patch applied. Sorry for the confusion.


            jay Jinshan Xiong (Inactive) added a comment -

            Why were there so many of them? Was one of the OSTs in recovery at that time?

            It would be helpful to dump the debug log, with DLMTRACE, CACHE, and VFSTRACE enabled, when you first see this kind of message. That can easily be done with a script.

            prakash Prakash Surya (Inactive) added a comment -

            Like these messages?

            2013-02-11 13:29:13.788428 {DefaultControlEventListener} [mmcs]{613}.12.0: LustreError: 31967:0:(osc_cache.c:896:osc_extent_wait()) extent c0000001c1fc2fe8@{[26 -> 28/31], [3|0|+|rpc|wihY|c0000001b0756f60], [196608|3|+|-|c0000001cdfaa6e0|16|c00000037ab0f300]} ls1-OST005a-osc-c0000003ec0c5800: wait ext to 0 timedout, recovery in progress?
            2013-02-11 13:30:26.835353 {DefaultControlEventListener} [mmcs]{613}.0.3:  LustreError: 40667:0:(osc_cache.c:896:osc_extent_wait()) extent c0000001cdacee38@{[2 -> 4/15], [3|0|+|rpc|wiY|c0000001b0759480], [196608|3|+|-|c0000001cdfa9600|16|c0000003c54483a0]} ls1-OST007b-osc-c0000003ec0c5800: wait ext to 0 timedout, recovery in progress?
            2013-02-11 13:30:58.855528 {DefaultControlEventListener} [mmcs]{613}.3.0:  LustreError: 32002:0:(osc_cache.c:896:osc_extent_wait()) extent c0000001cdacfca8@{[2 -> 4/15], [3|0|+|rpc|wihY|c0000001b0759d80], [196608|3|+|-|c0000001cdfa99c0|16|c0000003742646e0]} ls1-OST0083-osc-c0000003ec0c5800: wait ext to 0 timedout, recovery in progress?
            2013-02-11 13:30:58.856226 {DefaultControlEventListener} [mmcs]{613}.3.2:  LustreError: 3958:0:(osc_cache.c:896:osc_extent_wait()) extent c0000001cedc8d18@{[42 -> 44/47], [3|0|+|rpc|wihY|c0000002e43ed020], [196608|3|+|-|c0000001c0c13c00|16|c0000003c87a1e60]} ls1-OST014d-osc-c0000003ec0c5800: wait ext to 0 timedout, recovery in progress?
            2013-02-11 13:31:44.273199 {DefaultControlEventListener} [mmcs]{613}.12.0: LustreError: 31983:0:(osc_cache.c:896:osc_extent_wait()) extent c0000001c9b17230@{[2 -> 4/15], [3|0|+|rpc|wihY|c0000001b075a680], [196608|3|+|-|c0000002e1471d10|16|c0000003c80ad4a0]} ls1-OST008b-osc-c0000003ec0c5800: wait ext to 0 timedout, recovery in progress?
            

            jay Jinshan Xiong (Inactive) added a comment -

            Did you find anything printed in osc_extent_wait()?

            jay Jinshan Xiong (Inactive) added a comment -

            I need more information. Let me work out a patch for LU-2779 first so that you can use sysrq-t.

            prakash Prakash Surya (Inactive) added a comment -

            I don't have sysrq-l information from the node because it crashed shortly after due to LU-2779, but these were on the console:

            2013-02-12 16:46:00 INFO: task ptlrpcd_10:3486 blocked for more than 120 seconds.
            2013-02-12 16:46:00 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            2013-02-12 16:46:00 ptlrpcd_10    D 0000000000000000     0  3486      2 0x00008000
            2013-02-12 16:46:00 Call Trace:
            2013-02-12 16:46:00 [c000000f321a2de0] [c000000f321a2e90] 0xc000000f321a2e90 (unreliable)
            2013-02-12 16:46:00 [c000000f321a2fb0] [c0000000000142d8] .__switch_to+0xf8/0x1d0
            2013-02-12 16:46:00 [c000000f321a3040] [c0000000005bb0a8] .schedule+0x3f8/0xd30
            2013-02-12 16:46:00 [c000000f321a3340] [c0000000005bcb4c] .__mutex_lock_slowpath+0x1bc/0x2d0
            2013-02-12 16:46:00 [c000000f321a3420] [c0000000005bd42c] .mutex_lock+0x5c/0x60
            2013-02-12 16:46:00 [c000000f321a34a0] [d00000000f2923e8] .cl_lock_mutex_get+0xc8/0x110 [obdclass]
            2013-02-12 16:46:00 [c000000f321a3530] [d000000011d2d294] .lovsub_parent_lock+0x94/0x260 [lov]
            2013-02-12 16:46:00 [c000000f321a35e0] [d000000011d2f058] .lovsub_lock_modify+0xc8/0x2c0 [lov]
            2013-02-12 16:46:00 [c000000f321a36b0] [d00000000f2914f4] .cl_lock_modify+0x114/0x490 [obdclass]
            2013-02-12 16:46:00 [c000000f321a37a0] [d0000000111ec3a4] .osc_lock_granted+0x184/0x3d0 [osc]
            2013-02-12 16:46:00 [c000000f321a3870] [d0000000111ee1c4] .osc_lock_upcall+0x4c4/0x840 [osc]
            2013-02-12 16:46:00 [c000000f321a3950] [d0000000111c8b3c] .osc_enqueue_fini+0xfc/0x3f0 [osc]
            2013-02-12 16:46:00 [c000000f321a3a40] [d0000000111d7504] .osc_enqueue_interpret+0x104/0x240 [osc]
            2013-02-12 16:46:00 [c000000f321a3b20] [d00000001097a724] .ptlrpc_check_set+0x364/0x4f40 [ptlrpc]
            2013-02-12 16:46:00 [c000000f321a3cc0] [d0000000109d127c] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc]
            2013-02-12 16:46:00 [c000000f321a3de0] [d0000000109d182c] .ptlrpcd+0x37c/0x530 [ptlrpc]
            2013-02-12 16:46:00 [c000000f321a3f90] [c000000000032fd4] .kernel_thread+0x54/0x70
            2013-02-12 16:46:00 INFO: task ptlrpcd_11:3487 blocked for more than 120 seconds.
            2013-02-12 16:46:00 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            2013-02-12 16:46:00 ptlrpcd_11    D 0000000000000000     0  3487      2 0x00008000
            2013-02-12 16:46:00 Call Trace:
            2013-02-12 16:46:00 [c000000f4fefade0] [c000000f4fefae90] 0xc000000f4fefae90 (unreliable)
            2013-02-12 16:46:00 [c000000f4fefafb0] [c0000000000142d8] .__switch_to+0xf8/0x1d0
            2013-02-12 16:46:00 [c000000f4fefb040] [c0000000005bb0a8] .schedule+0x3f8/0xd30
            2013-02-12 16:46:00 [c000000f4fefb340] [c0000000005bcb4c] .__mutex_lock_slowpath+0x1bc/0x2d0
            2013-02-12 16:46:00 [c000000f4fefb420] [c0000000005bd42c] .mutex_lock+0x5c/0x60
            2013-02-12 16:46:00 [c000000f4fefb4a0] [d00000000f2923e8] .cl_lock_mutex_get+0xc8/0x110 [obdclass]
            2013-02-12 16:46:00 [c000000f4fefb530] [d000000011d2d294] .lovsub_parent_lock+0x94/0x260 [lov]
            2013-02-12 16:46:00 [c000000f4fefb5e0] [d000000011d2f058] .lovsub_lock_modify+0xc8/0x2c0 [lov]
            2013-02-12 16:46:00 [c000000f4fefb6b0] [d00000000f2914f4] .cl_lock_modify+0x114/0x490 [obdclass]
            2013-02-12 16:46:00 [c000000f4fefb7a0] [d0000000111ec3a4] .osc_lock_granted+0x184/0x3d0 [osc]
            2013-02-12 16:46:00 [c000000f4fefb870] [d0000000111ee1c4] .osc_lock_upcall+0x4c4/0x840 [osc]
            2013-02-12 16:46:00 [c000000f4fefb950] [d0000000111c8b3c] .osc_enqueue_fini+0xfc/0x3f0 [osc]
            2013-02-12 16:46:00 [c000000f4fefba40] [d0000000111d7504] .osc_enqueue_interpret+0x104/0x240 [osc]
            2013-02-12 16:46:00 [c000000f4fefbb20] [d00000001097a724] .ptlrpc_check_set+0x364/0x4f40 [ptlrpc]
            2013-02-12 16:46:00 [c000000f4fefbcc0] [d0000000109d127c] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc]
            2013-02-12 16:46:00 [c000000f4fefbde0] [d0000000109d182c] .ptlrpcd+0x37c/0x530 [ptlrpc]
            2013-02-12 16:46:00 [c000000f4fefbf90] [c000000000032fd4] .kernel_thread+0x54/0x70
            2013-02-12 16:46:00 INFO: task ldlm_poold:3516 blocked for more than 120 seconds.
            2013-02-12 16:46:00 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            2013-02-12 16:46:00 ldlm_poold    D 0000000000000000     0  3516      2 0x00008000
            2013-02-12 16:46:00 Call Trace:
            2013-02-12 16:46:00 [c000000f545b31b0] [c000000f545b3260] 0xc000000f545b3260 (unreliable)
            2013-02-12 16:46:00 [c000000f545b3380] [c0000000000142d8] .__switch_to+0xf8/0x1d0
            2013-02-12 16:46:00 [c000000f545b3410] [c0000000005bb0a8] .schedule+0x3f8/0xd30
            2013-02-12 16:46:00 [c000000f545b3710] [c0000000005bc388] .schedule_timeout+0x258/0x340
            2013-02-12 16:46:00 [c000000f545b3830] [c0000000005bbf88] .wait_for_common+0x198/0x210
            2013-02-12 16:46:00 [c000000f545b3910] [d0000000109597c8] .__ldlm_bl_to_thread+0x1f8/0x280 [ptlrpc]
            2013-02-12 16:46:00 [c000000f545b39c0] [d000000010959c2c] .ldlm_bl_to_thread+0x3dc/0x720 [ptlrpc]
            2013-02-12 16:46:00 [c000000f545b3b70] [d00000001094d488] .ldlm_cancel_lru+0xb8/0x2c0 [ptlrpc]
            2013-02-12 16:46:00 [c000000f545b3c40] [d000000010967010] .ldlm_cli_pool_recalc+0x320/0x3e0 [ptlrpc]
            2013-02-12 16:46:00 [c000000f545b3ce0] [d0000000109671fc] .ldlm_pool_recalc+0x12c/0x180 [ptlrpc]
            2013-02-12 16:46:00 [c000000f545b3d80] [d000000010969198] .ldlm_pools_recalc+0x1c8/0x480 [ptlrpc]
            2013-02-12 16:46:00 [c000000f545b3e70] [d000000010969550] .ldlm_pools_thread_main+0x100/0x490 [ptlrpc]
            2013-02-12 16:46:00 [c000000f545b3f90] [c000000000032fd4] .kernel_thread+0x54/0x70
            2013-02-12 16:46:00 INFO: task bgxlC_r:5444 blocked for more than 120 seconds.
            2013-02-12 16:46:00 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            2013-02-12 16:46:00 bgxlC_r       D 000000000fedd308     0  5444      1 0x00008012
            2013-02-12 16:46:00 Call Trace:
            2013-02-12 16:46:00 [c000000afa5c6b90] [c000000afa5c6c40] 0xc000000afa5c6c40 (unreliable)
            2013-02-12 16:46:00 [c000000afa5c6d60] [c0000000000142d8] .__switch_to+0xf8/0x1d0
            2013-02-12 16:46:00 [c000000afa5c6df0] [c0000000005bb0a8] .schedule+0x3f8/0xd30
            2013-02-12 16:46:00 [c000000afa5c70f0] [c0000000005bc388] .schedule_timeout+0x258/0x340
            2013-02-12 16:46:00 [c000000afa5c7210] [c0000000005bbf88] .wait_for_common+0x198/0x210
            2013-02-12 16:46:00 [c000000afa5c72f0] [d0000000111f0878] .osc_io_setattr_end+0xd8/0x1e0 [osc]
            2013-02-12 16:46:00 [c000000afa5c73b0] [d00000000f2987fc] .cl_io_end+0xcc/0x2b0 [obdclass]
            2013-02-12 16:46:00 [c000000afa5c7460] [d000000011d25950] .lov_io_end_wrapper+0x180/0x1f0 [lov]
            2013-02-12 16:46:00 [c000000afa5c7500] [d000000011d23ed4] .lov_io_call+0xe4/0x230 [lov]
            2013-02-12 16:46:00 [c000000afa5c75b0] [d000000011d24484] .lov_io_end+0x54/0x110 [lov]
            2013-02-12 16:46:00 [c000000afa5c7640] [d00000000f2987fc] .cl_io_end+0xcc/0x2b0 [obdclass]
            2013-02-12 16:46:00 [c000000afa5c76f0] [d00000000f2a22a4] .cl_io_loop+0x1a4/0x2c0 [obdclass]
            2013-02-12 16:46:00 [c000000afa5c77a0] [d0000000136a197c] .cl_setattr_ost+0x25c/0x390 [lustre]
            2013-02-12 16:46:00 [c000000afa5c7880] [d000000013659030] .ll_setattr_raw+0xc10/0x1930 [lustre]
            2013-02-12 16:46:00 [c000000afa5c79a0] [c0000000001e3814] .notify_change+0x1f4/0x400
            2013-02-12 16:46:00 [c000000afa5c7a60] [c0000000001c1d28] .do_truncate+0x78/0xd0
            2013-02-12 16:46:00 [c000000afa5c7b40] [c0000000001d6140] .do_filp_open+0x4f0/0xdc0
            2013-02-12 16:46:00 [c000000afa5c7d10] [c0000000001c0724] .do_sys_open+0x84/0x180
            2013-02-12 16:46:00 [c000000afa5c7dc0] [c0000000002209b4] .compat_sys_open+0x24/0x40
            2013-02-12 16:46:00 [c000000afa5c7e30] [c000000000008564] syscall_exit+0x0/0x40
            

            The stacks of ptlrpcd_10 and ptlrpcd_11 look similar to what's in the description, so I thought it was the same bug. Do you think this is something different?


            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 11
