[LU-1061] LBUG on cl_locks_prune() Created: 01/Feb/12 Updated: 17/Feb/12 Resolved: 13/Feb/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0 |
| Fix Version/s: | Lustre 2.2.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Niu Yawei (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4738 |
| Description |
|
Jan 31 14:13:43 spoon02 kernel: Lustre: DEBUG MARKER: == racer racer.sh test complete, duration 659 sec ====================================================
Jan 31 14:29:31 spoon02 kernel: LustreError: 25321:0:(file.c:157:ll_close_inode_openhandle()) Skipped 1 previous similar message

It may be caused by AGL (async glimpse lock), which holds a user count on the lock for later AGL RPC reply processing. |
| Comments |
| Comment by nasf (Inactive) [ 02/Feb/12 ] |
|
The patch for this bug: http://review.whamcloud.com/#change,2079

The AGL sponsor holds a user reference count on the cl_lock before triggering the AGL RPC. That user reference count is released later by the AGL RPC reply upcall (osc_lock_upcall()). This AGL mechanism conflicts with cl_locks_prune(), which requires that no lock is actively in use when the last iput() is called. So introduce another cl_lock operation, cl_lock_operations::clo_abort(), to resolve the conflict as follows:

1) If cl_locks_prune() finds a lock that is in use, it calls clo_abort() to abort other cl_lock operations on that lock. Currently this mainly handles the pending AGL RPC reply upcall.

2) The AGL RPC reply upcall first checks whether the lock has been aborted. If so, it does nothing; otherwise, it sets a flag to prevent the lock from being aborted by cl_locks_prune(), and then processes the upcall as normal. |
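For illustration only, here is a minimal user-space sketch of the abort-vs-upcall handshake described above. The struct and function names (agl_lock, agl_abort(), agl_reply_upcall(), AGL_ABORTED, AGL_UPCALL_BUSY) are hypothetical placeholders, not the real cl_lock/osc_lock symbols; the actual patch works through the client lock state machine under its own locking rather than a bare flag field.

```c
/* Hypothetical model of the race between cl_locks_prune() (abort path)
 * and osc_lock_upcall() (AGL RPC reply path): whichever side runs first
 * claims the lock, and the other side backs off. */
#include <stdio.h>

enum agl_flags {
    AGL_ABORTED     = 1 << 0,  /* prune aborted the pending AGL reply */
    AGL_UPCALL_BUSY = 1 << 1,  /* upcall claimed the lock; abort must not touch it */
};

struct agl_lock {
    unsigned int flags;        /* protected by a per-lock lock in real code */
    int users;                 /* user reference held for the AGL RPC reply */
};

/* Prune path (cl_locks_prune() in the description): if the reply upcall
 * has not claimed the lock yet, mark it aborted and drop the user
 * reference the AGL sponsor took. */
static void agl_abort(struct agl_lock *lk)
{
    if (lk->flags & AGL_UPCALL_BUSY)
        return;                 /* upcall already owns the lock */
    lk->flags |= AGL_ABORTED;
    lk->users--;                /* release the sponsor's user count */
}

/* Reply path (osc_lock_upcall() in the description): bail out if prune
 * already aborted the lock; otherwise claim it so a concurrent prune
 * cannot abort it, then finish normally. */
static void agl_reply_upcall(struct agl_lock *lk)
{
    if (lk->flags & AGL_ABORTED)
        return;                 /* nothing to do, reference already dropped */
    lk->flags |= AGL_UPCALL_BUSY;
    /* ... normal glimpse reply processing would go here ... */
    lk->users--;                /* release the sponsor's user count */
}

int main(void)
{
    struct agl_lock lk = { .flags = 0, .users = 1 };

    agl_abort(&lk);             /* prune wins the race in this run */
    agl_reply_upcall(&lk);      /* upcall sees AGL_ABORTED and returns */
    printf("users = %d\n", lk.users);   /* expect 0, not -1 */
    return 0;
}
```

Either ordering drops the sponsor's user count exactly once, which is the property cl_locks_prune() needs before the last iput(). |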
| Comment by James A Simmons [ 03/Feb/12 ] |
|
I'm in the process of testing your patch. |
| Comment by James A Simmons [ 03/Feb/12 ] |
|
The first problem I saw was a regression in sanity test 132, which now fails. The second problem is that racer now fails in another way: this time I'm seeing client nodes just rebooting. I attached a file called oops with the details of the run. The server side seems normal; only the clients had problems. |
| Comment by nasf (Inactive) [ 05/Feb/12 ] |
|
Can you paste the test_132 failure log? I cannot reproduce the failure myself. As for the client reboots, from the "oops" log it is difficult to say what caused the rebooting. I suspect there was some memory corruption on the client, but it is not captured in the "oops" file. Would you please collect more crash logs through crash or a serial port configuration? |
| Comment by James A Simmons [ 07/Feb/12 ] |
|
With your latest patch I still see the sanity test 132 failure most of the time; once in a while it does pass. As for racer, I stopped seeing the oops and the test does complete, but now the various dd processes on all the nodes are still running, which prevents the test suite from moving on. I'm going to let it run for several hours tonight to see if the dd processes ever quit. If they don't, then we have a locking issue. |
| Comment by nasf (Inactive) [ 08/Feb/12 ] |
|
Can you test Lustre without my patch in the same test environment to verify whether sanity test_132 fails or not? If it still fails, please open a new JIRA ticket for the test_132 failure. |
| Comment by James A Simmons [ 08/Feb/12 ] |
|
Yes, the bug for test 132 still exists without the patch. I will reopen it. |
| Comment by Peter Jones [ 13/Feb/12 ] |
|
Landed for 2.2 |
| Comment by Build Master (Inactive) [ 13/Feb/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 17/Feb/12 ] |
|
Integrated in Result = FAILURE
|
| Comment by Build Master (Inactive) [ 17/Feb/12 ] |
|
Integrated in Result = ABORTED
|