Details
Bug
Resolution: Unresolved
Major
Lustre 1.8.8
None
3
10994
Description
One OST became unavailable and kept dumping stack traces until its service was taken over by another OSS. This issue has occurred a couple of times on different servers.
After some investigation, we found that many service threads were hung at different places. Here is a list of where they are stuck:
ll_ost_01:10226, ll_ost_07:10232, ll_ost_09:10234, ll_ost_11:10236, ll_ost_13:10238, ll_ost_15:10240, ll_ost_18:10243
filter_lvbo_init
  filter_fid2dentry
    filter_parent_lock
      filter_lock_dentry
        LOCK_INODE_MUTEX(dparent->d_inode);
ll_ost_06:10231, ll_ost_16:10241, ll_ost_484, ll_ost_io_129, ll_ost_io_123, ll_ost_383
fsfilt_ext3_start
  ext3_journal_start
    journal_start
      start_this_handle
        __jbd2_log_wait_for_space
          mutex_lock(&journal->j_checkpoint_mutex);
ll_ost_17:10242
filter_lvbo_init
  filter_fid2dentry
    filter_parent_lock
    lookup_one_len
      __lookup_hash
        inode->i_op->lookup = ext4_lookup
          ext4_iget
            iget_locked
              ifind_fast
                find_inode_fast
                  __wait_on_freeing_inode
                    ? ldiskfs_bread... Child dentry's inode __I_LOCK
ll_ost_io_15
ost_brw_write
  filter_commitrw_write
    fsfilt_ext3_commit_wait
      autoremove_wake_function
        fsfilt_log_wait_commit = jbd2_log_wait_commit
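For anyone triaging a similar dump, the grouping above can be reproduced mechanically. The following is only a minimal sketch: it assumes a hypothetical plain-text layout in which each group is a comma-separated line of thread names followed by its call chain, with a blank line between groups. The function name and the sample input are illustrative and are not part of Lustre or any kernel tooling.

```python
from collections import defaultdict


def group_by_innermost_frame(dump):
    """Map the innermost frame of each call chain to the threads stuck there.

    Assumed (hypothetical) input: blocks separated by blank lines; the first
    line of a block lists thread names separated by commas, the remaining
    lines are the call chain, outermost frame first.
    """
    groups = defaultdict(list)
    for block in dump.strip().split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 2:
            continue  # a header without frames carries no information
        threads = [t.strip() for t in lines[0].split(",") if t.strip()]
        groups[lines[-1]].extend(threads)  # last line = innermost frame
    return dict(groups)


# Abbreviated sample built from two of the groups listed in this ticket.
SAMPLE = """\
ll_ost_01:10226, ll_ost_07:10232
filter_lvbo_init
  filter_fid2dentry
    filter_parent_lock
      filter_lock_dentry

ll_ost_06:10231, ll_ost_16:10241
fsfilt_ext3_start
  ext3_journal_start
    journal_start
      start_this_handle
        __jbd2_log_wait_for_space
"""

if __name__ == "__main__":
    grouped = group_by_innermost_frame(SAMPLE)
    # Print the most crowded wait points first.
    for frame, threads in sorted(grouped.items(), key=lambda kv: -len(kv[1])):
        print(f"{len(threads):3d} threads in {frame}: {', '.join(threads)}")
```

Sorting by thread count puts the most contended wait point first, which is usually the best place to start looking for the lock holder.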
We think this is not necessarily a problem in the Lustre code. We also found a nearly merged patch that fixes a similar deadlock in __jbd2_log_wait_for_space(). Could that be the root cause?