[LU-5717] Dead lock of nrs_tbf_timer_cb Created: 08/Oct/14  Updated: 28/Jun/16  Resolved: 16/Mar/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Li Xi (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-3558 NRS TBF policy for QoS purposes Resolved
is related to LU-7448 sleeping under spinlock somewhere in ... Resolved
Severity: 3
Rank (Obsolete): 16034

 Description   

When TBF is enabled, the following deadlock can be triggered while the system is under heavy load:

<0>Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
<4>Pid: 24831, comm: ll_ost_io00_074 Not tainted 2.6.32-431.23.3.el6_lustre.2.5.24.ddn3.x86_64 #1
<4>Call Trace:
<4> <NMI> [<ffffffff8152896c>] ? panic+0xa7/0x16f
<4> [<ffffffff81014969>] ? sched_clock+0x9/0x10
<4> [<ffffffff810e67fd>] ? watchdog_overflow_callback+0xcd/0xd0
<4> [<ffffffff8111c707>] ? __perf_event_overflow+0xa7/0x240
<4> [<ffffffff8101d93d>] ? x86_perf_event_set_period+0xdd/0x170
<4> [<ffffffff8111ccd4>] ? perf_event_overflow+0x14/0x20
<4> [<ffffffff81022d87>] ? intel_pmu_handle_irq+0x187/0x2f0
<4> [<ffffffff8152e646>] ? kprobe_exceptions_notify+0x16/0x430
<4> [<ffffffff8152d1b9>] ? perf_event_nmi_handler+0x39/0xb0
<4> [<ffffffff8152ec75>] ? notifier_call_chain+0x55/0x80
<4> [<ffffffffa08517c0>] ? nrs_tbf_timer_cb+0x0/0x60 [ptlrpc]
<4> [<ffffffff8152ecda>] ? atomic_notifier_call_chain+0x1a/0x20
<4> [<ffffffff810a11de>] ? notify_die+0x2e/0x30
<4> [<ffffffff8152c93b>] ? do_nmi+0x1bb/0x340
<4> [<ffffffff8152c200>] ? nmi+0x20/0x30
<4> [<ffffffffa08517c0>] ? nrs_tbf_timer_cb+0x0/0x60 [ptlrpc]
<4> [<ffffffff8152ba6e>] ? _spin_lock+0x1e/0x30
<4> <<EOE>> <IRQ> [<ffffffffa08517ea>] ? nrs_tbf_timer_cb+0x2a/0x60 [ptlrpc]
<4> [<ffffffff8109f6be>] ? __run_hrtimer+0x8e/0x1a0
<4> [<ffffffff810a6a9f>] ? ktime_get_update_offsets+0x4f/0xd0
<4> [<ffffffff8109fa26>] ? hrtimer_interrupt+0xe6/0x260
<4> [<ffffffff81031f1d>] ? local_apic_timer_interrupt+0x3d/0x70
<4> [<ffffffff81532805>] ? smp_apic_timer_interrupt+0x45/0x60
<4> [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
<4> <EOI> [<ffffffffa084648a>] ? nrs_resource_get_safe+0x4a/0x100 [ptlrpc]
<4> [<ffffffffa0848a98>] ? ptlrpc_nrs_req_initialize+0x38/0x90 [ptlrpc]
<4> [<ffffffffa080ef41>] ? ptlrpc_server_handle_req_in+0x901/0xcd0 [ptlrpc]
<4> [<ffffffffa0815f0c>] ? ptlrpc_main+0x9ec/0x1990 [ptlrpc]
<4> [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
<4> [<ffffffff8152907e>] ? thread_return+0x4e/0x760
<4> [<ffffffffa0815520>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
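
One reading of the trace above: the hrtimer interrupt arrived while the service thread was inside nrs_resource_get_safe, and nrs_tbf_timer_cb then spun in _spin_lock on the same CPU. That fits the classic pattern where a lock shared between process context and a timer callback is taken in process context without masking interrupts. The ticket does not name the lock involved or describe what the patch changes, so the following is only a minimal userspace model of that pattern, not Lustre code. Every name in it (model_cpu, model_timer_cb, and so on) is hypothetical. The irqsafe flag models the kernel's spin_lock_irqsave()/spin_unlock_irqrestore() pair, which is a common remedy for this pattern; whether the merged patch uses it is not stated here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are Lustre or kernel APIs. */
struct model_cpu {
    bool irqs_enabled;  /* can a timer interrupt be delivered on this CPU? */
    bool lock_held;     /* does this CPU currently hold the modeled spinlock? */
};

/* Models the timer callback. Returns false where a real spinlock would
 * spin forever because the interrupted code on this CPU owns the lock. */
static bool model_timer_cb(struct model_cpu *cpu)
{
    if (cpu->lock_held)
        return false;           /* self-deadlock: the owner can never run again */
    cpu->lock_held = true;
    /* ... timer work would happen here ... */
    cpu->lock_held = false;
    return true;
}

/* Models the timer hardware firing. A masked interrupt is simply deferred. */
static bool model_deliver_timer_irq(struct model_cpu *cpu)
{
    if (!cpu->irqs_enabled)
        return true;            /* deferred until interrupts are re-enabled */
    return model_timer_cb(cpu);
}

/* Models the process-context path that takes the same lock.
 * irqsafe == true  models spin_lock_irqsave()/spin_unlock_irqrestore().
 * irqsafe == false models plain spin_lock()/spin_unlock().
 * Returns false if the timer interrupt would have deadlocked. */
static bool model_process_path(struct model_cpu *cpu, bool irqsafe)
{
    bool saved = cpu->irqs_enabled;
    bool ok;

    assert(!cpu->lock_held);    /* the modeled lock is not recursive */
    if (irqsafe)
        cpu->irqs_enabled = false;
    cpu->lock_held = true;

    /* Worst case: the timer fires inside the critical section. */
    ok = model_deliver_timer_irq(cpu);

    cpu->lock_held = false;
    cpu->irqs_enabled = saved;
    return ok;
}
```

Starting from a CPU with interrupts enabled and the lock free, model_process_path(&cpu, false) reports the deadlock, while model_process_path(&cpu, true) completes because the interrupt is held off until the lock is released.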



 Comments   
Comment by Li Xi (Inactive) [ 08/Oct/14 ]

Here is a patch that attempts to fix this problem:
http://review.whamcloud.com/#/c/12228/

Comment by Peter Jones [ 08/Oct/14 ]

Niu

Could you please review and comment on this patch?

Thanks

Peter

Comment by Gerrit Updater [ 09/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12228/
Subject: LU-5717 ptlrpc: fix deadlock problem of nrs_tbf_timer_cb
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f32919289d43dc9df37b93cf6fea483116fac5cb

Comment by Joseph Gmitter (Inactive) [ 16/Mar/16 ]

Ticket cleanup: this patch has landed on master for 2.8.0.
