[LU-1046] fsstress - Watchdog detected hard LOCKUP on cpu 0 Created: 26/Jan/12 Updated: 17/Feb/12 Resolved: 09/Feb/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0 |
| Fix Version/s: | Lustre 2.2.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cliff White (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
2.1.55 - Hyperion |
||
| Severity: | 3 |
| Rank (Obsolete): | 4745 |
| Description |
|
Running fsstress: a single-client run passes, but a 10-client run exited with this error. I am not certain this is a Lustre error. <ConMan> Console [hyperion-rst6] departed by <root@localhost> on pts/131 at 01-26 16:44. |
| Comments |
| Comment by Peter Jones [ 27/Jan/12 ] |
|
Fanyong, Oleg wonders whether this might be related to the multi-threaded ptlrpcd changes. Could you please advise whether this is the case? Thanks, Peter |
| Comment by nasf (Inactive) [ 30/Jan/12 ] |
|
From the stack, it seems that when ptlrpcd_8 tried to wake up another thread that was waiting for the ldlm callback interpret, it found the lock of the wait queue (arg->waitq) recursively locked, as follows: ======================= ====> spin_lock_irqsave(&q->lock, flags);
After checking the Lustre code, I do not think it can trigger a recursive lock on the wait queue. But if the wait queue was released or became invalid before ptlrpcd_8 called "wake_up()", that may mislead the system and cause undefined behavior.
Thread1 (some lock blocking/completion callback) was in "ldlm_run_ast_work()": struct ldlm_cb_set_arg arg = { 0 }; arg.threshold = 1; (step3) RETURN(cfs_atomic_read(&arg.restart) ? -ERESTART : 0);
Thread2 (ptlrpcd_8) was in "ldlm_cb_interpret()".
The real processing flow may be like: So when thread2 tried to wake up thread1 on the wait queue in step4, thread1 had already exited "ldlm_run_ast_work()", so the wait queue "arg->waitq" (allocated on the stack of "ldlm_run_ast_work()") was already invalid and may have been reused by other functions. In that case, when thread2 (ptlrpcd_8) performed "spin_lock_irqsave(&q->lock, flags);" in "__wake_up()", it may have spun on the invalid lock, which would be reported as a hard LOCKUP on cpu 0.
Currently, there are not enough logs to prove that this is what caused the failure. In any case, the race condition above should be fixed. |
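The race described above is a use-after-return: the waker still touches a wait queue that lived on the waiter's stack. The following is a minimal userspace sketch of one common way to close this kind of race, by moving the shared argument to the heap and reference-counting it so it outlives whichever thread finishes last. It uses POSIX threads rather than kernel wait queues, and every name in it (`cb_set_arg`, `cb_set_arg_put`, `interpret_thread`, `run_ast_work_demo`) is hypothetical; it is not the Lustre code and not necessarily what the landed patch does.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared callback argument (not the Lustre struct). */
struct cb_set_arg {
    pthread_mutex_t lock;
    pthread_cond_t  waitq;    /* plays the role of arg->waitq */
    int             pending;  /* callbacks still in flight */
    atomic_int      refcount; /* one reference per thread that may touch the struct */
};

static struct cb_set_arg *cb_set_arg_alloc(int pending, int refs)
{
    struct cb_set_arg *arg = malloc(sizeof(*arg));

    if (arg == NULL)
        return NULL;
    pthread_mutex_init(&arg->lock, NULL);
    pthread_cond_init(&arg->waitq, NULL);
    arg->pending = pending;
    atomic_init(&arg->refcount, refs);
    return arg;
}

/* Drop one reference; the last holder destroys and frees the struct. */
static void cb_set_arg_put(struct cb_set_arg *arg)
{
    if (atomic_fetch_sub(&arg->refcount, 1) == 1) {
        pthread_cond_destroy(&arg->waitq);
        pthread_mutex_destroy(&arg->lock);
        free(arg);
    }
}

/* Plays the role of thread2: completes the callback and wakes the waiter. */
static void *interpret_thread(void *data)
{
    struct cb_set_arg *arg = data;

    pthread_mutex_lock(&arg->lock);
    arg->pending--;
    pthread_cond_broadcast(&arg->waitq);
    pthread_mutex_unlock(&arg->lock);
    /* The waiter may already have returned; our reference keeps arg valid. */
    cb_set_arg_put(arg);
    return NULL;
}

/* Plays the role of thread1: waits for the callback, then returns. 0 on success. */
static int run_ast_work_demo(void)
{
    pthread_t tid;
    struct cb_set_arg *arg = cb_set_arg_alloc(1, 2); /* waiter ref + waker ref */

    if (arg == NULL)
        return -1;
    if (pthread_create(&tid, NULL, interpret_thread, arg) != 0) {
        cb_set_arg_put(arg); /* waker reference, never handed over */
        cb_set_arg_put(arg); /* waiter reference */
        return -1;
    }
    pthread_mutex_lock(&arg->lock);
    while (arg->pending > 0)
        pthread_cond_wait(&arg->waitq, &arg->lock);
    assert(arg->pending == 0);
    pthread_mutex_unlock(&arg->lock);
    cb_set_arg_put(arg); /* safe even if the waker has not dropped its ref yet */
    pthread_join(tid, NULL);
    return 0;
}
```

Calling `run_ast_work_demo()` from a test harness exercises both orderings over repeated runs: whichever thread drops its reference last frees the struct, so neither side can touch freed or reused memory.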
| Comment by nasf (Inactive) [ 31/Jan/12 ] |
|
The patch for above race condition: |
| Comment by Peter Jones [ 02/Feb/12 ] |
|
FanYong, could you please rebase your patch on the tip of master to pick up the fix for LU-797? Thanks, Peter |
| Comment by nasf (Inactive) [ 02/Feb/12 ] |
|
The patch against latest master branch with |
| Comment by Build Master (Inactive) [ 09/Feb/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Peter Jones [ 09/Feb/12 ] |
|
Landed for 2.2 |
| Comment by Build Master (Inactive) [ 17/Feb/12 ] |
|
Integrated in Result = FAILURE
|
| Comment by Build Master (Inactive) [ 17/Feb/12 ] |
|
Integrated in Result = ABORTED
|