[LU-1046]  fsstress - Watchdog detected hard LOCKUP on cpu 0 Created: 26/Jan/12  Updated: 17/Feb/12  Resolved: 09/Feb/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: Lustre 2.2.0

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

2.1.55 - Hyperion


Severity: 3
Rank (Obsolete): 4745

 Description   

Running fsstress: a single-client run passes, but a 10-client run exited with the error below.

I am not certain this is a Lustre error.
Stack:
----------
2012-01-26 16:28:27 Lustre: 6811:0:(cmm_object.c:686:cml_rename_warn()) cml_rename failed for mdo_rename, should revoke: [mo_po [0x200000400:0x18a:0x0]] [mo_pn [0x200000400:0x18a:0x0]] [lf [0x2000013a1:0x14938:0x0]] [sname fstest_b2f496891449bc8de19b7849182f2acc] [mo_t [0x2000013a1:0x149a9:0x0]] [tname fstest_30283864daf0badb6075fdf78d2da862] [err -39]
2012-01-26 16:28:28 Lustre: 6654:0:(cmm_object.c:686:cml_rename_warn()) cml_rename failed for mdo_rename, should revoke: [mo_po [0x200000400:0x18a:0x0]] [mo_pn [0x200000400:0x18a:0x0]] [lf [0x2000013a1:0x14938:0x0]] [sname fstest_b2f496891449bc8de19b7849182f2acc] [mo_t [0x2000013a1:0x149a9:0x0]] [tname fstest_30283864daf0badb6075fdf78d2da862] [err -39]

<ConMan> Console [hyperion-rst6] departed by <root@localhost> on pts/131 at 01-26 16:44.
2012-01-26 16:44:25 Lustre: 6594:0:(ldlm_lib.c:909:target_handle_connect()) MGS: connection from 7215a55e-8798-a29d-836c-c59c854cdb88@192.168.115.141@o2ib t0 exp (null) cur 1327625065 last 0
2012-01-26 16:44:25 Lustre: 6594:0:(ldlm_lib.c:909:target_handle_connect()) Skipped 12 previous similar messages
2012-01-26 16:45:29 Lustre: 6817:0:(ldlm_lib.c:909:target_handle_connect()) lustre-MDT0000: connection from c013f4e3-7f4f-0cba-76c9-53313da4adf0@192.168.115.130@o2ib1 t0 exp (null) cur 1327625129 last 0
2012-01-26 16:45:29 Lustre: 6817:0:(ldlm_lib.c:909:target_handle_connect()) Skipped 42 previous similar messages
2012-01-26 16:57:22 Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
2012-01-26 16:57:22 Pid: 6147, comm: ptlrpcd_8 Not tainted 2.6.32-220.el6_lustre.x86_64 #1
2012-01-26 16:57:22 Call Trace:
2012-01-26 16:57:22 <NMI> [<ffffffff814ec701>] ? panic+0x78/0x143
2012-01-26 16:57:22 [<ffffffff810d8fad>] ? watchdog_overflow_callback+0xcd/0xd0
2012-01-26 16:57:22 [<ffffffff8110a89d>] ? __perf_event_overflow+0x9d/0x230
2012-01-26 16:57:22 [<ffffffff8110ae54>] ? perf_event_overflow+0x14/0x20
2012-01-26 16:57:22 [<ffffffff8101e096>] ? intel_pmu_handle_irq+0x336/0x550
2012-01-26 16:57:22 [<ffffffff814f22d6>] ? kprobe_exceptions_notify+0x16/0x430
2012-01-26 16:57:22 [<ffffffff814f0db9>] ? perf_event_nmi_handler+0x39/0xb0
2012-01-26 16:57:22 [<ffffffff814f2905>] ? notifier_call_chain+0x55/0x80
2012-01-26 16:57:22 [<ffffffff814f296a>] ? atomic_notifier_call_chain+0x1a/0x20
2012-01-26 16:57:22 [<ffffffff81096bce>] ? notify_die+0x2e/0x30
2012-01-26 16:57:22 [<ffffffff814f0583>] ? do_nmi+0x173/0x2b0
2012-01-26 16:57:22 [<ffffffff814efe90>] ? nmi+0x20/0x30
2012-01-26 16:57:22 [<ffffffff814ef5f2>] ? _spin_lock_irqsave+0x32/0x40
2012-01-26 16:57:22 <<EOE>> [<ffffffff810517f2>] ? __wake_up+0x32/0x70
2012-01-26 16:57:22 [<ffffffffa040175a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
2012-01-26 16:57:22 [<ffffffffa05b3bd9>] ? ldlm_cb_interpret+0x179/0x1b0 [ptlrpc]
2012-01-26 16:57:22 [<ffffffffa05cc9d0>] ? ptlrpc_check_set+0x5d0/0x1ac0 [ptlrpc]
2012-01-26 16:57:22 [<ffffffff8107c0ec>] ? lock_timer_base+0x3c/0x70
2012-01-26 16:57:22 [<ffffffffa0600a03>] ? ptlrpcd_check+0x5f3/0x610 [ptlrpc]
2012-01-26 16:57:22 [<ffffffffa060221b>] ? ptlrpcd+0x24b/0x3c0 [ptlrpc]
2012-01-26 16:57:22 [<ffffffff8105fa50>] ? default_wake_function+0x0/0x20
2012-01-26 16:57:22 [<ffffffffa0601fd0>] ? ptlrpcd+0x0/0x3c0 [ptlrpc]
2012-01-26 16:57:22 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
2012-01-26 16:57:22 [<ffffffffa0601fd0>] ? ptlrpcd+0x0/0x3c0 [ptlrpc]
2012-01-26 16:57:22 [<ffffffffa0601fd0>] ? ptlrpcd+0x0/0x3c0 [ptlrpc]
2012-01-26 16:57:22 [<ffffffff8100c140>] ? child_rip+0x0/0x20
2012-01-26 16:57:22 Initializing cgroup subsys cpuset
-------------



 Comments   
Comment by Peter Jones [ 27/Jan/12 ]

Fanyong

Oleg wonders whether this might be related to the multi-threaded ptlrpcd changes.

Could you please advise whether this is the case?

Thanks

Peter

Comment by nasf (Inactive) [ 30/Jan/12 ]

From the stack, it seems that when ptlrpcd_8 tried to wake up another thread that was waiting on the ldlm callback-set wait queue, it behaved as if the wait queue (arg->waitq) lock had been taken recursively, at the point marked below:

========================
void fastcall __wake_up(wait_queue_head_t *q, unsigned int mode,
                        int nr_exclusive, void *key)
{
        unsigned long flags;

====>   spin_lock_irqsave(&q->lock, flags);
        __wake_up_common(q, mode, nr_exclusive, 0, key);
        spin_unlock_irqrestore(&q->lock, flags);
}
========================

After checking the Lustre code, I do not think Lustre can take the wait queue lock recursively. But if the wait queue had already been released or become invalid before ptlrpcd_8 called "wake_up()", the stale memory could mislead the system and cause arbitrary behavior.
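In generic terms, the hazard looks like this (a minimal sketch of the problem class, not actual Lustre code):

========================
#include <linux/wait.h>
#include <linux/kthread.h>

/* The waiter allocates the wait queue on its own stack and hands a
 * pointer to another thread. */
static int waiter_thread(void *unused)
{
        DECLARE_WAIT_QUEUE_HEAD(waitq);   /* lives in this stack frame */

        /* ... publish &waitq to a signaller thread, wait on it ... */

        return 0;   /* stack frame, including waitq, is now gone */
}

/* If the signaller calls wake_up(&waitq) after waiter_thread() has
 * returned, __wake_up() takes q->lock inside reused stack memory;
 * spinning there can be reported as a hard LOCKUP. */
========================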

Thread1 (some lock blocking/completion callback) was in "ldlm_run_ast_work()":
========================
int ldlm_run_ast_work(struct ldlm_namespace *ns, cfs_list_t *rpc_list,
                      ldlm_desc_ast_t ast_type)
{
        struct ldlm_cb_set_arg arg = { 0 };
        ...
        arg.threshold = 1;
(step2) l_wait_event(arg.waitq, cfs_atomic_read(&arg.rpcs) == 0, &lwi);

(step3) RETURN(cfs_atomic_read(&arg.restart) ? -ERESTART : 0);
}
========================

Thread2 (ptlrpcd_8) was in "ldlm_cb_interpret()"
========================
static int ldlm_cb_interpret(const struct lu_env *env,
                             struct ptlrpc_request *req, void *data, int rc)
{
        struct ldlm_cb_async_args *ca = data;
        struct ldlm_lock *lock = ca->ca_lock;
        struct ldlm_cb_set_arg *arg = ca->ca_set_arg;
        ...
(step1) if (cfs_atomic_dec_return(&arg->rpcs) < arg->threshold)
(step4)         cfs_waitq_signal(&arg->waitq);
        RETURN(0);
}
========================

The actual processing flow may have been:
1) thread2::step1 on CPU0
2) thread1::step2 on CPUX
3) thread1::step3 on CPUX
4) thread2::step4 on CPU0

In that interleaving, by the time thread2 tried to wake up thread1 in step4, thread1 had already returned from "ldlm_run_ast_work()". The wait queue "arg->waitq", allocated on the stack of "ldlm_run_ast_work()", was therefore already invalid, and its memory may have been reused by other functions.

If that is what happened, then when thread2 (ptlrpcd_8) performed "spin_lock_irqsave(&q->lock, flags);" inside "__wake_up()", it would spin on stale memory, which the watchdog would report as a hard LOCKUP on cpu 0.

There are not enough logs to prove that this is exactly what caused the failure; regardless, the race condition above should be fixed.

Comment by nasf (Inactive) [ 31/Jan/12 ]

The patch for the above race condition:

http://review.whamcloud.com/#change,2065
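The general idea matches the patch title "hold ldlm_cb_set_arg until the last user exit": move the shared arg off the stack and reference-count it, so whichever thread drops the last reference frees it. A minimal sketch of that pattern, with illustrative (not actual) names:

========================
#include <linux/slab.h>
#include <linux/wait.h>
#include <linux/atomic.h>

struct cb_set_arg {
        atomic_t          refcount;   /* one ref per in-flight RPC + one for the waiter */
        atomic_t          rpcs;       /* RPCs still outstanding */
        int               threshold;
        wait_queue_head_t waitq;
};

static void arg_put(struct cb_set_arg *arg)
{
        /* The last user, waiter or callback, frees the structure, so a
         * late wake_up() can never touch freed memory. */
        if (atomic_dec_and_test(&arg->refcount))
                kfree(arg);
}

/* Callback side (thread2): signal first, then drop the reference.
 * The callback's own reference keeps arg->waitq valid across wake_up(). */
static int cb_interpret(struct cb_set_arg *arg)
{
        if (atomic_dec_return(&arg->rpcs) < arg->threshold)
                wake_up(&arg->waitq);
        arg_put(arg);
        return 0;
}

/* Waiter side (thread1): wait, then drop its own reference; the stack
 * no longer owns the wait queue, so returning early is safe. */
static int run_ast_work(struct cb_set_arg *arg)
{
        wait_event(arg->waitq, atomic_read(&arg->rpcs) == 0);
        arg_put(arg);
        return 0;
}
========================

With this pattern, the step1..step4 interleaving above is harmless: even if the waiter returns between step1 and step4, the callback still holds a reference, so the wait queue memory stays valid until after the signal.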

Comment by Peter Jones [ 02/Feb/12 ]

FanYong

Could you please rebase your patch on the tip of master to pick up the fix for LU-797?

Thanks

Peter

Comment by nasf (Inactive) [ 02/Feb/12 ]

The patch has been rebased against the latest master branch, which includes the LU-797 fix:

http://review.whamcloud.com/#change,2065,set2

Comment by Build Master (Inactive) [ 09/Feb/12 ]

Integrated in lustre-master » x86_64,client,el5,ofa #462
LU-1046 ldlm: hold ldlm_cb_set_arg until the last user exit (Revision 7d3723d80078c0852b9511523960ba5f4590643c)

Result = SUCCESS
Oleg Drokin : 7d3723d80078c0852b9511523960ba5f4590643c
Files :

  • lustre/ldlm/ldlm_lockd.c
  • lustre/ldlm/ldlm_lock.c
  • lustre/ldlm/ldlm_internal.h
Comment by Build Master (Inactive) [ 09/Feb/12 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #462
LU-1046 ldlm: hold ldlm_cb_set_arg until the last user exit (Revision 7d3723d80078c0852b9511523960ba5f4590643c)

Result = SUCCESS
Oleg Drokin : 7d3723d80078c0852b9511523960ba5f4590643c
Files :

  • lustre/ldlm/ldlm_lockd.c
  • lustre/ldlm/ldlm_lock.c
  • lustre/ldlm/ldlm_internal.h
Comment by Build Master (Inactive) [ 09/Feb/12 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #462
LU-1046 ldlm: hold ldlm_cb_set_arg until the last user exit (Revision 7d3723d80078c0852b9511523960ba5f4590643c)

Result = SUCCESS
Oleg Drokin : 7d3723d80078c0852b9511523960ba5f4590643c
Files :

  • lustre/ldlm/ldlm_lockd.c
  • lustre/ldlm/ldlm_lock.c
  • lustre/ldlm/ldlm_internal.h
Comment by Build Master (Inactive) [ 09/Feb/12 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #462
LU-1046 ldlm: hold ldlm_cb_set_arg until the last user exit (Revision 7d3723d80078c0852b9511523960ba5f4590643c)

Result = SUCCESS
Oleg Drokin : 7d3723d80078c0852b9511523960ba5f4590643c
Files :

  • lustre/ldlm/ldlm_lockd.c
  • lustre/ldlm/ldlm_lock.c
  • lustre/ldlm/ldlm_internal.h
Comment by Build Master (Inactive) [ 09/Feb/12 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #462
LU-1046 ldlm: hold ldlm_cb_set_arg until the last user exit (Revision 7d3723d80078c0852b9511523960ba5f4590643c)

Result = SUCCESS
Oleg Drokin : 7d3723d80078c0852b9511523960ba5f4590643c
Files :

  • lustre/ldlm/ldlm_lockd.c
  • lustre/ldlm/ldlm_lock.c
  • lustre/ldlm/ldlm_internal.h
Comment by Build Master (Inactive) [ 09/Feb/12 ]

Integrated in lustre-master » i686,server,el5,inkernel #462
LU-1046 ldlm: hold ldlm_cb_set_arg until the last user exit (Revision 7d3723d80078c0852b9511523960ba5f4590643c)

Result = SUCCESS
Oleg Drokin : 7d3723d80078c0852b9511523960ba5f4590643c
Files :

  • lustre/ldlm/ldlm_internal.h
  • lustre/ldlm/ldlm_lockd.c
  • lustre/ldlm/ldlm_lock.c
Comment by Build Master (Inactive) [ 09/Feb/12 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #462
LU-1046 ldlm: hold ldlm_cb_set_arg until the last user exit (Revision 7d3723d80078c0852b9511523960ba5f4590643c)

Result = SUCCESS
Oleg Drokin : 7d3723d80078c0852b9511523960ba5f4590643c
Files :

  • lustre/ldlm/ldlm_internal.h
  • lustre/ldlm/ldlm_lock.c
  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 09/Feb/12 ]

Integrated in lustre-master » x86_64,server,el5,ofa #462
LU-1046 ldlm: hold ldlm_cb_set_arg until the last user exit (Revision 7d3723d80078c0852b9511523960ba5f4590643c)

Result = SUCCESS
Oleg Drokin : 7d3723d80078c0852b9511523960ba5f4590643c
Files :

  • lustre/ldlm/ldlm_lock.c
  • lustre/ldlm/ldlm_internal.h
  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 09/Feb/12 ]

Integrated in lustre-master » i686,server,el5,ofa #462
LU-1046 ldlm: hold ldlm_cb_set_arg until the last user exit (Revision 7d3723d80078c0852b9511523960ba5f4590643c)

Result = SUCCESS
Oleg Drokin : 7d3723d80078c0852b9511523960ba5f4590643c
Files :

  • lustre/ldlm/ldlm_lockd.c
  • lustre/ldlm/ldlm_internal.h
  • lustre/ldlm/ldlm_lock.c
Comment by Build Master (Inactive) [ 09/Feb/12 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #462
LU-1046 ldlm: hold ldlm_cb_set_arg until the last user exit (Revision 7d3723d80078c0852b9511523960ba5f4590643c)

Result = SUCCESS
Oleg Drokin : 7d3723d80078c0852b9511523960ba5f4590643c
Files :

  • lustre/ldlm/ldlm_internal.h
  • lustre/ldlm/ldlm_lock.c
  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 09/Feb/12 ]

Integrated in lustre-master » i686,client,el5,inkernel #462
LU-1046 ldlm: hold ldlm_cb_set_arg until the last user exit (Revision 7d3723d80078c0852b9511523960ba5f4590643c)

Result = SUCCESS
Oleg Drokin : 7d3723d80078c0852b9511523960ba5f4590643c
Files :

  • lustre/ldlm/ldlm_lock.c
  • lustre/ldlm/ldlm_internal.h
  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 09/Feb/12 ]

Integrated in lustre-master » i686,client,el5,ofa #462
LU-1046 ldlm: hold ldlm_cb_set_arg until the last user exit (Revision 7d3723d80078c0852b9511523960ba5f4590643c)

Result = SUCCESS
Oleg Drokin : 7d3723d80078c0852b9511523960ba5f4590643c
Files :

  • lustre/ldlm/ldlm_lock.c
  • lustre/ldlm/ldlm_lockd.c
  • lustre/ldlm/ldlm_internal.h
Comment by Peter Jones [ 09/Feb/12 ]

Landed for 2.2

Comment by Build Master (Inactive) [ 09/Feb/12 ]

Integrated in lustre-master » i686,client,el6,inkernel #462
LU-1046 ldlm: hold ldlm_cb_set_arg until the last user exit (Revision 7d3723d80078c0852b9511523960ba5f4590643c)

Result = SUCCESS
Oleg Drokin : 7d3723d80078c0852b9511523960ba5f4590643c
Files :

  • lustre/ldlm/ldlm_lock.c
  • lustre/ldlm/ldlm_internal.h
  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 09/Feb/12 ]

Integrated in lustre-master » i686,server,el6,inkernel #462
LU-1046 ldlm: hold ldlm_cb_set_arg until the last user exit (Revision 7d3723d80078c0852b9511523960ba5f4590643c)

Result = SUCCESS
Oleg Drokin : 7d3723d80078c0852b9511523960ba5f4590643c
Files :

  • lustre/ldlm/ldlm_internal.h
  • lustre/ldlm/ldlm_lock.c
  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,server,el6,ofa #480
LU-1046 ldlm: hold ldlm_cb_set_arg until the last user exit (Revision 7d3723d80078c0852b9511523960ba5f4590643c)

Result = FAILURE
Oleg Drokin : 7d3723d80078c0852b9511523960ba5f4590643c
Files :

  • lustre/ldlm/ldlm_internal.h
  • lustre/ldlm/ldlm_lock.c
  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,client,el6,ofa #480
LU-1046 ldlm: hold ldlm_cb_set_arg until the last user exit (Revision 7d3723d80078c0852b9511523960ba5f4590643c)

Result = FAILURE
Oleg Drokin : 7d3723d80078c0852b9511523960ba5f4590643c
Files :

  • lustre/ldlm/ldlm_lockd.c
  • lustre/ldlm/ldlm_lock.c
  • lustre/ldlm/ldlm_internal.h
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » i686,client,el6,ofa #480
LU-1046 ldlm: hold ldlm_cb_set_arg until the last user exit (Revision 7d3723d80078c0852b9511523960ba5f4590643c)

Result = ABORTED
Oleg Drokin : 7d3723d80078c0852b9511523960ba5f4590643c
Files :

  • lustre/ldlm/ldlm_lock.c
  • lustre/ldlm/ldlm_internal.h
  • lustre/ldlm/ldlm_lockd.c