[LU-5052] threads stuck in jbd2_journal_start
Created: 12/May/14  Updated: 30/Apr/15  Resolved: 30/Apr/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Mahmoud Hanafi
Assignee: Zhenyu Xu
Resolution: Duplicate
Votes: 0
Labels: None
Environment:

lustre: 2.1.5
kernel: 2.6.32-279.19.1.el6.20130516.x86_64.lustre215
build: 2nasS_ofed154

SRC at https://github.com/jlan/lustre-nas


Attachments: File service200.gz    
Issue Links:
Duplicate
duplicates LU-4794 MDS threads all stuck in jbd2_journal... Resolved
Severity: 3
Rank (Obsolete): 13955

 Description   

The MDS builds up a high load with no CPU activity, and Lustre is dumping call traces to the console. This looks like a duplicate of LU-4794; if so, please advise when the patch will land.

Attached is the full stack trace for all threads.

INFO: task ldlm_cn_00:6299 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ldlm_cn_00    D 000000000000001a     0  6299      2 0x00000080
 ffff881ec525db30 0000000000000046 0000000000000000 ffffffff8129507e
 ffff881ec525dad0 00000000dcd2dc2e ffff881fb0bd8d00 ffff881ec525dad0
 ffff881fafe73098 ffff881ec525dfd8 000000000000fc40 ffff881fafe73098
Call Trace:
 [<ffffffff8129507e>] ? number+0x2ee/0x320
 [<ffffffffa055c14a>] start_this_handle+0x27a/0x4a0 [jbd2]
 [<ffffffff8108ff00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa055c570>] jbd2_journal_start+0xd0/0x110 [jbd2]
 [<ffffffffa08e6338>] ldiskfs_journal_start_sb+0x58/0x90 [ldiskfs]
 [<ffffffffa072c017>] fsfilt_ldiskfs_start+0x77/0x5e0 [fsfilt_ldiskfs]
 [<ffffffffa07a9ac0>] llog_origin_handle_cancel+0x4b0/0xd70 [ptlrpc]
 [<ffffffffa076f71f>] ldlm_cancel_handler+0x1bf/0x5e0 [ptlrpc]
 [<ffffffffa079fb4e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
 [<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
 [<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
INFO: task ldlm_cb_00:6302 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ldlm_cb_00    D 0000000000000002     0  6302      2 0x00000080
 ffff881ec5265b20 0000000000000046 0000000000000000 000000ab00000000
 ffff881ec5265b50 ffffffff8129507e 3634333236363330 3134363536363336
 ffff881ec5263af8 ffff881ec5265fd8 000000000000fc40 ffff881ec5263af8
Call Trace:
 [<ffffffff8129507e>] ? number+0x2ee/0x320
 [<ffffffff8151ecc5>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8151ee23>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff812992f3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff8151e322>] ? down_write+0x32/0x40
 [<ffffffffa09d543e>] dqacq_handler+0x35e/0xd20 [lquota]
 [<ffffffffa07b8486>] ? __req_capsule_get+0x176/0x750 [ptlrpc]
 [<ffffffffa07921e0>] ? lustre_swab_qdata+0x0/0x30 [ptlrpc]
 [<ffffffffa075e1d8>] target_handle_dqacq_callback+0x668/0xb90 [ptlrpc]
 [<ffffffffa09d50e0>] ? dqacq_handler+0x0/0xd20 [lquota]
 [<ffffffffa076df87>] ldlm_callback_handler+0xa17/0x1ff0 [ptlrpc]
 [<ffffffffa0503ea1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa04ff4a4>] ? libcfs_id2str+0x74/0xb0 [libcfs]
 [<ffffffffa079fb4e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
 [<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
 [<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
INFO: task ldlm_cb_01:6303 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ldlm_cb_01    D 000000000000000d     0  6303      2 0x00000080
 ffff881ec5267b20 0000000000000046 0000000000000000 000000ab00000000
 ffff881ec5267b50 ffffffff8129507e ffff881ec5267ad0 000000005c2ae174
 ffff881ec5263098 ffff881ec5267fd8 000000000000fc40 ffff881ec5263098
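
When many threads are reported as blocked, it can help to reduce each hung-task report to the function it is actually waiting in. The following is a minimal sketch, not part of Lustre or the kernel tooling: the names TASK_RE, FRAME_RE, SAMPLE and summarize_hung_tasks are hypothetical, and the regular expressions assume the 2.6.32-style console format shown above, where frames marked with "?" are unreliable stack entries and are skipped.

```python
import re

# Matches the hung-task header, e.g.
# "INFO: task ldlm_cn_00:6299 blocked for more than 120 seconds."
TASK_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than \d+ seconds")

# Matches one call-trace frame; group 1 is set when the frame is marked "?",
# group 2 is the function name, group 3 the module name (if any).
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def summarize_hung_tasks(log_text):
    """Map (task name, pid) to the first non-'?' frame as (function, module).

    A task whose trace has no usable frame (for example a truncated
    report) maps to None.
    """
    summary = {}
    current = None
    for line in log_text.splitlines():
        task = TASK_RE.search(line)
        if task:
            current = (task.group(1), int(task.group(2)))
            summary[current] = None
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None and summary[current] is None:
            if frame.group(1) is None:  # skip unreliable "?" frames
                summary[current] = (frame.group(2), frame.group(3))
    return summary


# A few lines copied from the traces in this ticket.
SAMPLE = """\
INFO: task ldlm_cn_00:6299 blocked for more than 120 seconds.
ldlm_cn_00    D 000000000000001a     0  6299      2 0x00000080
Call Trace:
 [<ffffffff8129507e>] ? number+0x2ee/0x320
 [<ffffffffa055c14a>] start_this_handle+0x27a/0x4a0 [jbd2]
 [<ffffffffa055c570>] jbd2_journal_start+0xd0/0x110 [jbd2]
INFO: task ldlm_cb_00:6302 blocked for more than 120 seconds.
Call Trace:
 [<ffffffff8129507e>] ? number+0x2ee/0x320
 [<ffffffff8151ecc5>] rwsem_down_failed_common+0x95/0x1d0
INFO: task ldlm_cb_01:6303 blocked for more than 120 seconds.
"""

if __name__ == "__main__":
    for (name, pid), top in summarize_hung_tasks(SAMPLE).items():
        print(name, pid, top)
```

Run against a full dmesg capture, a summary like this makes it easier to see how many threads are waiting in the jbd2 journal path and how many are waiting on the rwsem taken in dqacq_handler.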


 Comments   
Comment by Peter Jones [ 12/May/14 ]

Bobijam

Does this appear to be a duplicate of LU-4794?

Peter

Comment by Bruno Faccini (Inactive) [ 13/May/14 ]

Bobi,
According to the full stack traces dumped in dmesg, this looks more like a duplicate of LU-4271; it has not yet been proven to be a duplicate of LU-4794 itself. What do you think?

Comment by Zhenyu Xu [ 13/May/14 ]

LU-4794 relates to an llog handling get/journal transaction deadlock. The bt-all.merged.txt file in LU-4794 also contains thread stack traces similar to the two threads you posted here.

Comment by Mahmoud Hanafi [ 30/Apr/15 ]

Please close

Comment by Peter Jones [ 30/Apr/15 ]

ok - thanks Mahmoud
