[LU-10402] Service thread hung at jbd2_journal_start Created: 15/Dec/17  Updated: 16/Jun/18  Resolved: 16/Jun/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Mahmoud Hanafi
Assignee: Yang Sheng
Resolution: Fixed
Votes: 0
Labels: None
Environment:

Lustre 2.7.3 fe


Attachments: File bt.all    
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

The OSS started to become unresponsive, with lots of stack traces logged.

The first stack trace was:

<4>LNet: Service thread pid 30365 was inactive for 962.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
<4>LNet: Skipped 4 previous similar messages
<4>Pid: 30365, comm: ll_ost_io00_100
<4>
<4>Call Trace:
<4> [<ffffffff810a3f5e>] ? prepare_to_wait+0x4e/0x80
<4> [<ffffffffa0df0fca>] start_this_handle+0x25a/0x480 [jbd2]
<4> [<ffffffff810a3c30>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffffa0df13d5>] jbd2_journal_start+0xb5/0x100 [jbd2]
<4> [<ffffffffa0e49b86>] ldiskfs_journal_start_sb+0x56/0xe0 [ldiskfs]
<4> [<ffffffffa0f08ebf>] osd_trans_start+0x1df/0x660 [osd_ldiskfs]
<4> [<ffffffffa10ac4e5>] ofd_write_attr_set+0x2c5/0x8c0 [ofd]
<4> [<ffffffffa10ad4c6>] ofd_commitrw_write+0x256/0x11a0 [ofd]
<4> [<ffffffffa10b47ad>] ? ofd_fmd_find_nolock+0xad/0xd0 [ofd]
<4> [<ffffffffa10ae9c3>] ofd_commitrw+0x5b3/0xba0 [ofd]
<4> [<ffffffffa07045a1>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
<4> [<ffffffffa09b438d>] obd_commitrw.clone.0+0x11d/0x390 [ptlrpc]
<4> [<ffffffffa09bc299>] tgt_brw_write+0xc69/0x1520 [ptlrpc]
<4> [<ffffffffa090dd10>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
<4> [<ffffffffa09baece>] tgt_request_handle+0x8be/0x1020 [ptlrpc]
<4> [<ffffffffa0964ca1>] ptlrpc_main+0xf41/0x1a80 [ptlrpc]
<4> [<ffffffffa0963d60>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
<4> [<ffffffff810a379e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] child_rip+0xa/0x20
<4> [<ffffffff810a3700>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20

I will attach bt for all threads.

Is this a dup of LU-6918?
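
When triaging a console log like the one above, the watchdog messages can be pulled out with a short script. This is a minimal sketch, assuming the message format matches the "Service thread pid ... was inactive for ...s" line quoted in the description; the function name `find_hung_threads` is hypothetical and not part of Lustre.

```python
import re

# Matches the LNet watchdog line quoted in the description, e.g.
# "<4>LNet: Service thread pid 30365 was inactive for 962.00s. ..."
HUNG_RE = re.compile(r"Service thread pid (\d+) was inactive for ([\d.]+)s")


def find_hung_threads(log_text):
    """Return a list of (pid, inactive_seconds) for each watchdog message."""
    return [(int(m.group(1)), float(m.group(2)))
            for m in HUNG_RE.finditer(log_text)]


sample = ("<4>LNet: Service thread pid 30365 was inactive for 962.00s. "
          "The thread might be hung, or it might only be slow")
print(find_hung_threads(sample))  # [(30365, 962.0)]
```

The reported pids can then be looked up in the full backtrace dump (here, the attached bt.all).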



 Comments   
Comment by Peter Jones [ 18/Dec/17 ]

Yang Sheng

Can you please look into this one?

Thanks

Peter

Comment by Yang Sheng [ 19/Dec/17 ]

From stack trace:

PID: 22328  TASK: ffff881b2cbed520  CPU: 5   COMMAND: "ll_ost00_003"
 #0 [ffff881b2cbf3630] schedule at ffffffff81581292
 #1 [ffff881b2cbf3708] __wait_on_freeing_inode at ffffffff811b1f78
 #2 [ffff881b2cbf3778] find_inode_fast at ffffffff811b1ff8
 #3 [ffff881b2cbf37a8] ifind_fast at ffffffff811b315c
 #4 [ffff881b2cbf37d8] iget_locked at ffffffff811b33f9
 #5 [ffff881b2cbf3818] ldiskfs_iget at ffffffffa0e247b7 [ldiskfs]
 #6 [ffff881b2cbf3888] osd_iget at ffffffffa0f04d4e [osd_ldiskfs]
 #7 [ffff881b2cbf38b8] osd_obj_map_lookup at ffffffffa0f34743 [osd_ldiskfs]
 #8 [ffff881b2cbf3938] osd_oi_lookup at ffffffffa0f2117a [osd_ldiskfs]
 #9 [ffff881b2cbf3968] osd_object_init at ffffffffa0f16499 [osd_ldiskfs]
#10 [ffff881b2cbf3a48] lu_object_alloc at ffffffffa0729e18 [obdclass]
#11 [ffff881b2cbf3aa8] lu_object_find_try at ffffffffa072b361 [obdclass]
#12 [ffff881b2cbf3b38] lu_object_find_at at ffffffffa072b521 [obdclass]
#13 [ffff881b2cbf3bc8] lu_object_find at ffffffffa072b566 [obdclass]
#14 [ffff881b2cbf3bd8] ofd_object_find at ffffffffa10a5aa5 [ofd]
#15 [ffff881b2cbf3c08] ofd_lvbo_init at ffffffffa10b977f [ofd]
#16 [ffff881b2cbf3cb8] ldlm_handle_enqueue0 at ffffffffa09308dd [ptlrpc]
#17 [ffff881b2cbf3d28] tgt_enqueue at ffffffffa09b9eb1 [ptlrpc]
#18 [ffff881b2cbf3d48] tgt_request_handle at ffffffffa09baece [ptlrpc]
#19 [ffff881b2cbf3da8] ptlrpc_main at ffffffffa0964ca1 [ptlrpc]
#20 [ffff881b2cbf3ee8] kthread at ffffffff810a379e
#21 [ffff881b2cbf3f48] kernel_thread at ffffffff8
...............
PID: 61556  TASK: ffff8804b009a040  CPU: 8   COMMAND: "perfquery"
 #0 [ffff8804b009f5a8] schedule at ffffffff81581292
 #1 [ffff8804b009f680] start_this_handle at ffffffffa0df0fca [jbd2]
 #2 [ffff8804b009f740] jbd2_journal_start at ffffffffa0df13d5 [jbd2]
 #3 [ffff8804b009f780] ldiskfs_journal_start_sb at ffffffffa0e49b86 [ldiskfs]
 #4 [ffff8804b009f7a0] ldiskfs_dquot_drop at ffffffffa0e49f15 [ldiskfs]
 #5 [ffff8804b009f7d0] vfs_dq_drop at ffffffff811f7ca2
 #6 [ffff8804b009f7e0] clear_inode at ffffffff811b2623
 #7 [ffff8804b009f800] dispose_list at ffffffff811b2710
 #8 [ffff8804b009f840] shrink_icache_memory at ffffffff811b2a64
 #9 [ffff8804b009f8a0] shrink_slab at ffffffff8114253a
#10 [ffff8804b009f900] do_try_to_free_pages at ffffffff811448df
#11 [ffff8804b009f9a0] try_to_free_pages at ffffffff81144d85
#12 [ffff8804b009fa50] __alloc_pages_nodemask at ffffffff81138d8d
#13 [ffff8804b009fba0] alloc_pages_current at ffffffff8117255a
#14 [ffff8804b009fbd0] __get_free_pages at ffffffff8113655e
#15 [ffff8804b009fbe0] get_zeroed_page at ffffffff811365b6
#16 [ffff8804b009fbf0] sysfs_follow_link at ffffffff81215136
#17 [ffff8804b009fc50] __link_path_walk at ffffffff811a4f26
#18 [ffff8804b009fd30] path_walk at ffffffff811a5e0a
#19 [ffff8804b009fd70] filename_lookup at ffffffff811a601b
#20 [ffff8804b009fdb0] do_filp_open at ffffffff811a74f4
#21 [ffff8804b009ff20] do_sys_open at ffffffff81191607
#22 [ffff8804b009ff70] sys_open at ffffffff81191710
#23 [ffff8804b009ff80] system_call_fastpath at ffffffff8100b0d2

This is indeed a duplicate of LU-6918.

Thanks,
YangSheng
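
The two backtraces quoted in the comment above show one task sleeping in __wait_on_freeing_inode and another that entered memory reclaim (shrink_icache_memory) and is sleeping in start_this_handle. To look for the same pattern across a full dump such as bt.all, something like the following could be used. It is only a sketch: it assumes the dump uses the "PID: ... COMMAND:" header and "#N [addr] symbol at addr" frame layout seen above, and the names `parse_bt` and `classify` and the category labels are hypothetical.

```python
import re

# Task header and frame lines in the layout quoted in this ticket.
HEADER_RE = re.compile(
    r'^PID:\s*(\d+)\s+TASK:\s*\S+\s+CPU:\s*\d+\s+COMMAND:\s*"([^"]+)"')
FRAME_RE = re.compile(r'^\s*#\d+\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s')


def parse_bt(text):
    """Split a backtrace dump into tasks with their list of frame symbols."""
    tasks = []
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            tasks.append({"pid": int(header.group(1)),
                          "comm": header.group(2),
                          "frames": []})
            continue
        frame = FRAME_RE.match(line)
        if frame and tasks:
            tasks[-1]["frames"].append(frame.group(1))
    return tasks


def classify(task):
    """Label a task by which of the two quoted patterns its stack contains."""
    frames = set(task["frames"])
    if {"shrink_icache_memory", "start_this_handle"} <= frames:
        return "reclaim blocked on journal"
    if "__wait_on_freeing_inode" in frames:
        return "waiting on freeing inode"
    return "other"


sample = '''PID: 22328  TASK: ffff881b2cbed520  CPU: 5   COMMAND: "ll_ost00_003"
 #0 [ffff881b2cbf3630] schedule at ffffffff81581292
 #1 [ffff881b2cbf3708] __wait_on_freeing_inode at ffffffff811b1f78
PID: 61556  TASK: ffff8804b009a040  CPU: 8   COMMAND: "perfquery"
 #1 [ffff8804b009f680] start_this_handle at ffffffffa0df0fca [jbd2]
 #8 [ffff8804b009f840] shrink_icache_memory at ffffffff811b2a64
'''

for task in parse_bt(sample):
    print(task["pid"], task["comm"], classify(task))
```

Matching on symbol names alone is deliberately coarse; it narrows down which tasks to read in full rather than proving a deadlock.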

Comment by Peter Jones [ 19/Dec/17 ]

As per Alex, LU-6918 was a duplicate of LU-6969, so this issue should be fixed in more recent releases.

Comment by Mahmoud Hanafi [ 14/Jun/18 ]

This can be closed

Comment by Peter Jones [ 16/Jun/18 ]

Thanks Mahmoud
