[LU-4692] (osc_lock.c:1204:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6 Created: 03/Mar/14 Updated: 12/Nov/14 Resolved: 11/Sep/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Li Xi (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
2.5.54 |
||
| Issue Links: |
|
| Severity: | 3 | ||||||||||||||||||||
| Rank (Obsolete): | 12901 | ||||||||||||||||||||
| Description |
|
The assertion failed on our system. The system is running Lustre-2.5.54, so patch from <3>LustreError: 109372:0:(file.c:3087:ll_inode_revalidate_fini()) home2: revalidate FID [0x2000048a2:0x17:0x0] error: rc = -116 |
| Comments |
| Comment by Peter Jones [ 03/Mar/14 ] |
|
Bobijam, could you please assess this issue reported against master? Thanks, Peter |
| Comment by Zhenyu Xu [ 04/Mar/14 ] |
|
When an IO starts, it checks whether an existing lock matches what the IO needs. The call path is cl_lockset_lock_one()=>cl_lock_request()=>cl_lock_hold_mutex()=>cl_lock_find()=>cl_lock_lookup(), where the match test is:
        matched = cl_lock_ext_match(&lock->cll_descr, need) &&
                  lock->cll_state < CLS_FREEING &&
                  lock->cll_error == 0 &&
                  !(lock->cll_flags & CLF_CANCELLED) &&
                  cl_lock_fits_into(env, lock, need, io);
and in osc_lock_fits_into():
        struct osc_lock *ols = cl2osc_lock(slice);

        if (need->cld_enq_flags & CEF_NEVER)
                return 0;
        if (ols->ols_state >= OLS_CANCELLED)
                return 0;
So this means that after the osc_lock_fits_into() check passed, the osc_lock was cancelled by another thread. The new IO thinks it has found a fitting lock, but when it tries to enqueue the lock, it finds the osc_lock in OLS_CANCELLED instead of OLS_NEW. |
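The gap between the match check and the enqueue can be replayed in a single thread. The following is only a minimal sketch under stated assumptions: every `sketch_` / `SK_` name is hypothetical, the enum is a reduced stand-in for the real osc lock states, and the real code asserts where this sketch returns -1.

```c
#include <stdbool.h>

/* Hypothetical, reduced stand-in for the osc lock states (not the real enum). */
enum sketch_ols_state {
    SK_OLS_NEW = 0,
    SK_OLS_GRANTED,
    SK_OLS_CANCELLED
};

/* Hypothetical stand-in for struct osc_lock; only the state matters here. */
struct sketch_osc_lock {
    enum sketch_ols_state state;
};

/* Same shape as the quoted osc_lock_fits_into() test: reject cancelled locks. */
static bool sketch_fits_into(const struct sketch_osc_lock *ols)
{
    return ols->state < SK_OLS_CANCELLED;
}

/* Stands in for the cancel done by another thread. */
static void sketch_cancel(struct sketch_osc_lock *ols)
{
    ols->state = SK_OLS_CANCELLED;
}

/* Returns 0 when the enqueue precondition holds, -1 where the real code asserts. */
static int sketch_enqueue(const struct sketch_osc_lock *ols)
{
    return ols->state == SK_OLS_NEW ? 0 : -1;
}

/*
 * Replays the sequence: match check, optional cancel in the window,
 * then enqueue. Returns -2 if the lock was rejected by the match check.
 */
static int sketch_replay(bool cancel_between)
{
    struct sketch_osc_lock ols = { SK_OLS_NEW };

    if (!sketch_fits_into(&ols))
        return -2;
    if (cancel_between)
        sketch_cancel(&ols);    /* the window described in the comment above */
    return sketch_enqueue(&ols);
}
```

With these stand-ins, `sketch_replay(false)` returns 0 and `sketch_replay(true)` returns -1, because the check result is stale by the time the enqueue runs.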
| Comment by Jinshan Xiong (Inactive) [ 05/Mar/14 ] |
|
The lock will be held in this case so it won't be cancelled. |
| Comment by Zhenyu Xu [ 05/Mar/14 ] |
|
Let's make up a scenario: thread1 is starting an IO, thread2 has the fitting lock (both top_lock and sub_lock), and thread2 is handling an osc lock cancel AST requested by the server.

thread1                        thread2                                  note
cl_lock_request()
+->cl_lock_hold_mutex()
 +-->cl_lock_find()                                                     found the fitting top lock, and its sub osc lock is not in OLS_CANCELLED
 |
 |                             osc_ldlm_blocking_ast()
 |                             +->osc_ldlm_blocking_ast0()              get mutex of sub lock
 |                             |+->osc_lock_blocking()
 |                             | +->cl_lock_cancel(sub_lock)
 |                             |  +->cl_lock_cancel0(sub_lock)          set sub_lock as CLF_CANCELLED, reverse call clo_cancel()
 |                             |  |+->osc_lock_cancel(sub_osc_lock)     set osc_lock as OLS_CANCELLED
 |                             | +->cl_lock_delete(sub_lock)
 |                             +->cl_lock_mutex_put(sub_lock)
 |
 cl_lock_mutex_get(top_lock)                                            at this time, top_lock is not in CLF_CANCELLED while sub osc_lock is in OLS_CANCELLED state. |
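The end state of that scenario, a top lock that still passes the flag test while its sub osc lock is already cancelled, can be modelled with two small structs. This is a sketch under assumptions, not the CLIO implementation: the `sk_` / `SK_` names, the flag value, and the struct layout are all hypothetical.

```c
#include <stdbool.h>

#define SK_CLF_CANCELLED 0x1u   /* hypothetical flag value */

/* Hypothetical sub lock: a flag word plus the state of its osc slice. */
struct sk_sub_lock {
    unsigned int flags;
    bool         osc_cancelled;
};

/* Hypothetical top lock that refers to one sub lock. */
struct sk_top_lock {
    unsigned int        flags;
    struct sk_sub_lock *sub;
};

/* thread2's path in the scenario: the cancel only touches the sub lock. */
static void sk_blocking_ast(struct sk_sub_lock *sub)
{
    sub->flags |= SK_CLF_CANCELLED;
    sub->osc_cancelled = true;
}

/* thread1's lookup test, reduced to the flag check on the top lock. */
static bool sk_top_matches(const struct sk_top_lock *top)
{
    return !(top->flags & SK_CLF_CANCELLED);
}

/* True when the top lock still looks usable but its sub osc lock is cancelled. */
static bool sk_inconsistent(const struct sk_top_lock *top)
{
    return sk_top_matches(top) && top->sub->osc_cancelled;
}

/* Runs the scenario once; returns true if the inconsistency appears only after the AST. */
static bool sk_run_scenario(void)
{
    struct sk_sub_lock sub = { 0, false };
    struct sk_top_lock top = { 0, &sub };
    bool before = sk_inconsistent(&top);

    sk_blocking_ast(&sub);
    return !before && sk_inconsistent(&top);
}
```

In this model nothing propagates the cancel from the sub lock to the top lock, which is the state the scenario ends in; whether the real code can reach it is what the following comments discuss.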
| Comment by Andreas Dilger [ 10/Mar/14 ] |
|
Li Xi, could you please comment about what test was being run to cause this problem? |
| Comment by Jinshan Xiong (Inactive) [ 10/Mar/14 ] |
|
This is probably a different representation of |
| Comment by Li Xi (Inactive) [ 12/Mar/14 ] |
|
Hi Andreas, Unfortunately, I don't know what test was running. I will let you know as soon as I get more information about it. |
| Comment by Wojciech Turek (Inactive) [ 12/Mar/14 ] |
|
We hit the same problem shortly after upgrading to 2.4.2 clients. We have not seen this on 2.4.1. crash> bt |
| Comment by Jinshan Xiong (Inactive) [ 12/Mar/14 ] |
|
Hi Turek, Can you share the core dump with us? You can upload it to ftp.whamcloud.com, thanks. Jinshan |
| Comment by Wojciech Turek (Inactive) [ 14/Mar/14 ] |
|
uploaded to ftp://ftp.whamcloud.com/uploads/LU-4692/ |
| Comment by Zhenyu Xu [ 21/Mar/14 ] |
|
Hi Wojciech, Can you upload the supporting files (uncompressed vmlinuz, System.map) as well? |
| Comment by Shuichi Ihara (Inactive) [ 24/Mar/14 ] |
|
Here is the crash dump of the original problem. ftp://ftp.whamcloud.com/uploads/LU-4692-2/ |
| Comment by Wojciech Turek (Inactive) [ 25/Mar/14 ] |
|
We use patchless clients, so the kernel is the standard RHEL 6.5 kernel. |
| Comment by Jinshan Xiong (Inactive) [ 25/Mar/14 ] |
|
This problem should have been introduced by |
| Comment by Jinshan Xiong (Inactive) [ 25/Mar/14 ] |
|
I think this is introduced by It lacks the Lustre module files, so I can't analyze the core dump file. |
| Comment by Aurelien Degremont (Inactive) [ 26/Mar/14 ] |
|
Jinshan, |
| Comment by Jinshan Xiong (Inactive) [ 26/Mar/14 ] |
|
I see, thanks Aurelien. Hi Wojciech, please upload the Lustre modules to the FTP server so I can take a further look at the core dump. |
| Comment by Jodi Levi (Inactive) [ 02/Jun/14 ] |
|
The first part of this bug was related to |
| Comment by Zhenyu Xu [ 11/Sep/14 ] |
|
dup of |