[LU-3941] (lov_lock.c:671:lov_lock_enqueue()) ASSERTION( sublock->cll_conflict == ((void *)0) ) failed Created: 12/Sep/13 Updated: 27/Nov/13 Resolved: 27/Nov/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Sebastien Buisson (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mn1 | ||
| Severity: | 3 |
| Rank (Obsolete): | 10424 |
| Description |
|
A LBUG was hit at MeteoFrance running Lustre 2.1.6 with the following messages:

<ConMan> Console [beaufix1034] log at 2013-09-08 04:00:00 UTC.
2013-09-08 04:03:18 LustreError: 11-0: an error occurred while communicating with 172.23.80.66@o2ib1. The obd_ping operation failed with -107
2013-09-08 04:03:18 Lustre: scratch-OST0014-osc-ffff880464f5a400: Connection to scratch-OST0014 (at 172.23.80.66@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
2013-09-08 04:03:18 LustreError: 167-0: This client was evicted by scratch-OST0014; in progress operations using this service will fail.
2013-09-08 04:03:18 LustreError: 31914:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST0014-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
2013-09-08 04:03:18 LustreError: 31914:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff88016d5fbe40 (48993688/0/0/0) (rc: 1)
2013-09-08 04:03:18 Lustre: scratch-OST0014-osc-ffff880464f5a400: Connection restored to scratch-OST0014 (at 172.23.80.66@o2ib1)
2013-09-08 04:19:08 LustreError: 11-0: an error occurred while communicating with 172.23.80.69@o2ib1. The obd_ping operation failed with -107
2013-09-08 04:19:08 Lustre: scratch-OST001f-osc-ffff880464f5a400: Connection to scratch-OST001f (at 172.23.80.69@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
2013-09-08 04:19:08 LustreError: 167-0: This client was evicted by scratch-OST001f; in progress operations using this service will fail.
2013-09-08 04:19:08 LustreError: 32128:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST001f-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
2013-09-08 04:19:08 LustreError: 32128:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff88047ba341c0 (48992886/0/0/0) (rc: 1)
2013-09-08 04:19:08 Lustre: scratch-OST001f-osc-ffff880464f5a400: Connection restored to scratch-OST001f (at 172.23.80.69@o2ib1)
2013-09-08 04:29:08 LustreError: 11-0: an error occurred while communicating with 172.23.64.63@o2ib. The obd_ping operation failed with -107
2013-09-08 04:29:08 Lustre: scratch-OST0004-osc-ffff880464f5a400: Connection to scratch-OST0004 (at 172.23.64.63@o2ib) was lost; in progress operations using this service will wait for recovery to complete
2013-09-08 04:29:08 LustreError: 167-0: This client was evicted by scratch-OST0004; in progress operations using this service will fail.
2013-09-08 04:29:08 LustreError: 32232:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST0004-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
2013-09-08 04:29:08 LustreError: 32232:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff880812ed6bc0 (48991503/0/0/0) (rc: 1)
2013-09-08 04:29:08 Lustre: scratch-OST0004-osc-ffff880464f5a400: Connection restored to scratch-OST0004 (at 172.23.64.63@o2ib)
<ConMan> Console [beaufix1034] log at 2013-09-08 05:00:00 UTC.
2013-09-08 05:04:58 LustreError: 11-0: an error occurred while communicating with 172.23.80.68@o2ib1. The obd_ping operation failed with -107
2013-09-08 05:04:58 Lustre: scratch-OST002b-osc-ffff880464f5a400: Connection to scratch-OST002b (at 172.23.80.68@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
2013-09-08 05:04:58 LustreError: 167-0: This client was evicted by scratch-OST002b; in progress operations using this service will fail.
2013-09-08 05:04:58 LustreError: 33732:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST002b-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
2013-09-08 05:04:58 LustreError: 33732:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff880320afb480 (48996861/0/0/0) (rc: 1)
2013-09-08 05:04:59 Lustre: scratch-OST002b-osc-ffff880464f5a400: Connection restored to scratch-OST002b (at 172.23.80.68@o2ib1)
2013-09-08 05:05:05 LustreError: 33253:0:(lov_lock.c:671:lov_lock_enqueue()) ASSERTION( sublock->cll_conflict == ((void *)0) ) failed:
2013-09-08 05:05:05 LustreError: 33253:0:(lov_lock.c:671:lov_lock_enqueue()) LBUG
2013-09-08 05:05:05 Pid: 33253, comm: arpege
2013-09-08 05:05:05 Call Trace:
2013-09-08 05:05:05 [<ffffffffa03977f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2013-09-08 05:05:05 [<ffffffffa0397e07>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-09-08 05:05:05 [<ffffffffa0741690>] ? osc_lock_enqueue+0x0/0x810 [osc]
2013-09-08 05:05:05 [<ffffffffa08858dd>] lov_lock_enqueue+0x71d/0x820 [lov]
2013-09-08 05:05:05 [<ffffffffa04b37fc>] cl_enqueue_try+0xfc/0x300 [obdclass]
2013-09-08 05:05:05 [<ffffffffa04b4bfd>] cl_enqueue_locked+0x6d/0x200 [obdclass]
2013-09-08 05:05:05 [<ffffffffa04b587e>] cl_lock_request+0x7e/0x1c0 [obdclass]
2013-09-08 05:05:05 [<ffffffffa09f02cf>] cl_glimpse_lock+0x16f/0x410 [lustre]
2013-09-08 05:05:05 [<ffffffffa09f0a94>] cl_glimpse_size+0x154/0x160 [lustre]
2013-09-08 05:05:05 [<ffffffffa09b49ae>] ll_inode_revalidate_it+0x5e/0x170 [lustre]
2013-09-08 05:05:05 [<ffffffffa09b4b09>] ll_getattr_it+0x49/0x170 [lustre]
2013-09-08 05:05:05 [<ffffffffa09b4c67>] ll_getattr+0x37/0x40 [lustre]
2013-09-08 05:05:05 [<ffffffff8116c591>] vfs_getattr+0x51/0x80
2013-09-08 05:05:05 [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
2013-09-08 05:05:05 [<ffffffff8116c84f>] vfs_fstat+0x3f/0x60
2013-09-08 05:05:05 [<ffffffff8116c894>] sys_newfstat+0x24/0x40
2013-09-08 05:05:05 [<ffffffff81173e85>] ? putname+0x35/0x50
2013-09-08 05:05:06 [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
2013-09-08 05:05:06 [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
2013-09-08 05:05:06 Kernel panic - not syncing: LBUG
2013-09-08 05:05:06 Pid: 33253, comm: arpege Tainted: G W --------------- H 2.6.32-279.5.2.bl6.Bull.36.x86_64 #1
2013-09-08 05:05:06 Call Trace:
2013-09-08 05:05:06 [<ffffffff81495fe3>] ? panic+0xa0/0x168
2013-09-08 05:05:06 [<ffffffffa0397e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2013-09-08 05:05:06 [<ffffffffa0741690>] ? osc_lock_enqueue+0x0/0x810 [osc]
2013-09-08 05:05:06 [<ffffffffa08858dd>] ? lov_lock_enqueue+0x71d/0x820 [lov]
2013-09-08 05:05:06 [<ffffffffa04b37fc>] ? cl_enqueue_try+0xfc/0x300 [obdclass]
2013-09-08 05:05:06 [<ffffffffa04b4bfd>] ? cl_enqueue_locked+0x6d/0x200 [obdclass]
2013-09-08 05:05:06 [<ffffffffa04b587e>] ? cl_lock_request+0x7e/0x1c0 [obdclass]
2013-09-08 05:05:06 [<ffffffffa09f02cf>] ? cl_glimpse_lock+0x16f/0x410 [lustre]
2013-09-08 05:05:06 [<ffffffffa09f0a94>] ? cl_glimpse_size+0x154/0x160 [lustre]
2013-09-08 05:05:06 [<ffffffffa09b49ae>] ? ll_inode_revalidate_it+0x5e/0x170 [lustre]
2013-09-08 05:05:06 [<ffffffffa09b4b09>] ? ll_getattr_it+0x49/0x170 [lustre]
2013-09-08 05:05:06 [<ffffffffa09b4c67>] ? ll_getattr+0x37/0x40 [lustre]
2013-09-08 05:05:06 [<ffffffff8116c591>] ? vfs_getattr+0x51/0x80
2013-09-08 05:05:06 [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
2013-09-08 05:05:06 [<ffffffff8116c84f>] ? vfs_fstat+0x3f/0x60
2013-09-08 05:05:06 [<ffffffff8116c894>] ? sys_newfstat+0x24/0x40
2013-09-08 05:05:06 [<ffffffff81173e85>] ? putname+0x35/0x50
2013-09-08 05:05:06 [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
2013-09-08 05:05:06 [<ffffffff810030f2>] ? system_call_fastpath+0x16/0x1b

This looks very similar to
Unfortunately, we do not have any dump to analyze further.
Thanks, |
| Comments |
| Comment by Bruno Faccini (Inactive) [ 12/Sep/13 ] |
|
Hello Seb, |
| Comment by Oleg Drokin [ 13/Sep/13 ] |
|
I believe this is all the evidence we need to confirm the long-suspected fact that LU-3874 is a long-standing issue that the patch for LU-3027 just made easier to hit. As such, patch http://review.whamcloud.com/#/c/7569/2 should help here too. We should probably close this as a duplicate of LU-3874 once confirmed. |
| Comment by Sebastien Buisson (Inactive) [ 18/Sep/13 ] |
|
Hi, I am not sure I understand what we should do on our Lustre 2.1 here: integrate patch http://review.whamcloud.com/#/c/7569/2? For what purpose: to fix the issue, or to ease issue diagnosis? TIA, |
| Comment by Sebastien Piechurski [ 19/Sep/13 ] |
|
Oleg, |
| Comment by Bruno Faccini (Inactive) [ 20/Sep/13 ] |
|
Hello Seb, |
| Comment by Bruno Faccini (Inactive) [ 04/Oct/13 ] |
|
Hello Sebs!

INFO: task arpege:10985 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
arpege D 0000000000000025 0 10985 10978 0x00000000
 ffff88045d703968 0000000000000086 ffff880419aec000 0000000000000086
 ffff880419aec000 ffff88042b25fa60 0000000000000000 ffff88042b25fa78
 ffff88035de67b10 ffff88045d703fd8 000000000000e160 ffff88035de67b10
Call Trace:
 [<ffffffff81497c5e>] __mutex_lock_slowpath+0x14e/0x190
 [<ffffffff81497aeb>] mutex_lock+0x2b/0x50
 [<ffffffffa0545d87>] cl_lock_mutex_get+0x77/0xe0 [obdclass]
 [<ffffffffa0548107>] cl_lock_enclosure+0x187/0x230 [obdclass]
 [<ffffffffa05481f3>] cl_lock_closure_build+0x43/0x150 [obdclass]
 [<ffffffffa09729b6>] lovsub_lock_closure+0x76/0x150 [lov]
 [<ffffffffa0548230>] cl_lock_closure_build+0x80/0x150 [obdclass]
 [<ffffffffa0968d5b>] lov_sublock_lock+0x6b/0x2f0 [lov]
 [<ffffffffa096a325>] lov_lock_enqueue+0x165/0x820 [lov]
 [<ffffffffa05477fc>] cl_enqueue_try+0xfc/0x300 [obdclass]
 [<ffffffffa0548bfd>] cl_enqueue_locked+0x6d/0x200 [obdclass]
 [<ffffffffa054987e>] cl_lock_request+0x7e/0x1c0 [obdclass]
 [<ffffffffa0a212cf>] cl_glimpse_lock+0x16f/0x410 [lustre]
 [<ffffffffa0a21a94>] cl_glimpse_size+0x154/0x160 [lustre]
 [<ffffffffa09e59ae>] ll_inode_revalidate_it+0x5e/0x170 [lustre]
 [<ffffffffa09e5b09>] ll_getattr_it+0x49/0x170 [lustre]
 [<ffffffffa09e5c67>] ll_getattr+0x37/0x40 [lustre]
 [<ffffffff8116c591>] vfs_getattr+0x51/0x80
 [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
 [<ffffffff8116c84f>] vfs_fstat+0x3f/0x60
 [<ffffffff8116c894>] sys_newfstat+0x24/0x40
 [<ffffffff81173e85>] ? putname+0x35/0x50
 [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
 [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b

and also several lock dumps like the following:

LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) lock@ffff8804655ec0b8[7 4 0 4 0 00000000] P(0):[0, 18446744073709551615]@[0x200dbc17a:0x19529:0x0] {
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) vvp@ffff8802fb74e6f0:
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) lov@ffff880465f7ca20: 4
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) 0 0: ---
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) 1 0: lock@ffff88084a564a58[0 5 0 0 0 00000000] R(1):[0, 18446744073709551615]@[0x100260000:0x31d165b:0x0] {
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) lovsub@ffff88086678baa0: [1 ffff880465f7ca20 P(0):[0, 18446744073709551615]@[0x200dbc17a:0x19529:0x0]]
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) osc@ffff880869f0eee8: ffff88086a61f000 00101001 0x75459d96781d6cfc 4 (null) size: 2097152 mtime: 1378913901 atime: 1378913901 ctime: 1378913901 blocks: 4096
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff88084a564a58
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) 2 0: lock@ffff8806b82be7b8[0 5 0 0 0 00000000] R(1):[0, 18446744073709551615]@[0x100160000:0x31d36ce:0x0] {
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) lovsub@ffff88087991c960: [2 ffff880465f7ca20 P(0):[0, 18446744073709551615]@[0x200dbc17a:0x19529:0x0]]
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) osc@ffff880869f0e368: ffff8807f14e3d80 00101001 0x75459d96781d6d03 4 (null) size: 2097152 mtime: 1378913901 atime: 1378913901 ctime: 1378913901 blocks: 4096
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff8806b82be7b8
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) 3 0: lock@ffff880869fcb078[0 5 0 0 0 00000000] R(1):[0, 18446744073709551615]@[0x100100000:0x31d431c:0x0] {
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) lovsub@ffff88068ae36a20: [3 ffff880465f7ca20 P(0):[0, 18446744073709551615]@[0x200dbc17a:0x19529:0x0]]
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) osc@ffff880869f0e700: ffff88068ae3ad80 00101001 0x75459d96781d6d0a 4 (null) size: 2097152 mtime: 1378913901 atime: 1378913901 ctime: 1378913901 blocks: 4096
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff880869fcb078
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff8804655ec0b8
LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) unuse return -108
Since these stacks look similar to the panic stack (and to the stacks of other threads racing against the same [sub-]lock structs), it is a bit suspect, and I wonder if this is a usual behavior/symptom on-site? |
| Comment by Oleg Drokin [ 05/Oct/13 ] |
|
Apparently this is being hit in Maloo too: https://maloo.whamcloud.com/sub_tests/5eac2b4e-174d-11e3-9d30-52540035b04c (on the 2.4.92 tag) |
| Comment by Bruno Faccini (Inactive) [ 05/Oct/13 ] |
|
Oleg, thanks. Sebs, does this problem occur frequently on nodes running the "Arpege" app? If yes, do you have some details about it (MPI? OpenMP? I/O profile/workload? MPI-IO? ...)? Also, can you add some Lustre debug flags (mainly "warning dlmtrace error emerg") to get more info on the next occurrence? Currently you seem to be running with /proc/sys/lnet/debug=0 ... |
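| For reference, a minimal sketch of how the requested debug mask could be set on a client via the /proc interface mentioned above (flag names taken from the comment; verify the exact syntax against the Lustre version in use before relying on it): |
|
```shell
# Inspect the current debug mask (0 here means almost no Lustre debug logging).
cat /proc/sys/lnet/debug

# Enable the flags requested above; this does not persist across a reboot.
echo "warning dlmtrace error emerg" > /proc/sys/lnet/debug

# Confirm the new mask took effect.
cat /proc/sys/lnet/debug
```
|
| After the next occurrence, `lctl dk <file>` can then be used to dump the accumulated debug log for analysis. |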
| Comment by Bruno Faccini (Inactive) [ 07/Oct/13 ] |
|
Crash-dump analysis seems to indicate that the reason for the LBUG is that the glimpse lock's per-stripe lov_lock_sub.sub_got/osc_lock sub-lock structs do not have CEF_ASYNC and ols_glimpse set, respectively. This allows conflicts/cll_conflict to be resolved (and CLO_WAIT returned) in osc_lock_enqueue_wait(), but this is not expected to occur for a glimpse lock (i.e., with cld_enq_flags & CEF_ASYNC and ols_flags & LDLM_FL_HAS_INTENT) in lov_lock_enqueue_one(), where a CLO_WAIT return is avoided without regard to cll_conflict. Based on this, I now strongly suspect that change #7569 for |
| Comment by Bruno Faccini (Inactive) [ 08/Oct/13 ] |
|
b2_1 backport of change #7569 from |
| Comment by Bruno Faccini (Inactive) [ 11/Oct/13 ] |
|
The patch successfully passed the standard auto-tests and the additional parallel-scale* tests requested. Sebs, any news on your side? Did you trigger new occurrences already? Have you been able to add the Lustre debug flags? Any details about the Arpege app to provide? |
| Comment by Sebastien Piechurski [ 11/Oct/13 ] |
|
Hi Bruno, We don't have any new occurrences. We are also tracking another issue where this application causes some lock callback timeouts on the OSS, with eviction of some clients. I don't know if those 2 problems are tied or not, but they seem quite close. We will open a new Jira ticket when we have more data about it. |
| Comment by Bruno Faccini (Inactive) [ 11/Oct/13 ] |
|
Cool, thanks Seb for these details and for adding the debug flags. |
| Comment by Bruno Faccini (Inactive) [ 04/Nov/13 ] |
|
Seb, looking into the history/comments of this ticket, I wonder if you already gave my patch a try on-site, or if the problem simply did not re-occur? |
| Comment by Sebastien Piechurski [ 04/Nov/13 ] |
|
Hi Bruno, Actually, there has been no new occurrence of this issue. There were only 4 occurrences in 5 days in mid-September, and then nothing. |
| Comment by Peter Jones [ 13/Nov/13 ] |
|
Sebastien, this patch would be under consideration to land on b2_1 if/when we work on a 2.1.7 release, but we have no immediate plans to do so at the time of writing. Regards, Peter |
| Comment by Peter Jones [ 27/Nov/13 ] |
|
Issue resolved |