Details
- Bug
- Resolution: Fixed
- Major
- None
- Lustre 2.1.6
- 3
- 10424
Description
An LBUG was hit at Météo-France running Lustre 2.1.6, with the following console messages:
<ConMan> Console [beaufix1034] log at 2013-09-08 04:00:00 UTC.
2013-09-08 04:03:18 LustreError: 11-0: an error occurred while communicating with 172.23.80.66@o2ib1. The obd_ping operation failed with -107
2013-09-08 04:03:18 Lustre: scratch-OST0014-osc-ffff880464f5a400: Connection to scratch-OST0014 (at 172.23.80.66@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
2013-09-08 04:03:18 LustreError: 167-0: This client was evicted by scratch-OST0014; in progress operations using this service will fail.
2013-09-08 04:03:18 LustreError: 31914:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST0014-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
2013-09-08 04:03:18 LustreError: 31914:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff88016d5fbe40 (48993688/0/0/0) (rc: 1)
2013-09-08 04:03:18 Lustre: scratch-OST0014-osc-ffff880464f5a400: Connection restored to scratch-OST0014 (at 172.23.80.66@o2ib1)
2013-09-08 04:19:08 LustreError: 11-0: an error occurred while communicating with 172.23.80.69@o2ib1. The obd_ping operation failed with -107
2013-09-08 04:19:08 Lustre: scratch-OST001f-osc-ffff880464f5a400: Connection to scratch-OST001f (at 172.23.80.69@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
2013-09-08 04:19:08 LustreError: 167-0: This client was evicted by scratch-OST001f; in progress operations using this service will fail.
2013-09-08 04:19:08 LustreError: 32128:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST001f-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
2013-09-08 04:19:08 LustreError: 32128:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff88047ba341c0 (48992886/0/0/0) (rc: 1)
2013-09-08 04:19:08 Lustre: scratch-OST001f-osc-ffff880464f5a400: Connection restored to scratch-OST001f (at 172.23.80.69@o2ib1)
2013-09-08 04:29:08 LustreError: 11-0: an error occurred while communicating with 172.23.64.63@o2ib. The obd_ping operation failed with -107
2013-09-08 04:29:08 Lustre: scratch-OST0004-osc-ffff880464f5a400: Connection to scratch-OST0004 (at 172.23.64.63@o2ib) was lost; in progress operations using this service will wait for recovery to complete
2013-09-08 04:29:08 LustreError: 167-0: This client was evicted by scratch-OST0004; in progress operations using this service will fail.
2013-09-08 04:29:08 LustreError: 32232:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST0004-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
2013-09-08 04:29:08 LustreError: 32232:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff880812ed6bc0 (48991503/0/0/0) (rc: 1)
2013-09-08 04:29:08 Lustre: scratch-OST0004-osc-ffff880464f5a400: Connection restored to scratch-OST0004 (at 172.23.64.63@o2ib)
<ConMan> Console [beaufix1034] log at 2013-09-08 05:00:00 UTC.
2013-09-08 05:04:58 LustreError: 11-0: an error occurred while communicating with 172.23.80.68@o2ib1. The obd_ping operation failed with -107
2013-09-08 05:04:58 Lustre: scratch-OST002b-osc-ffff880464f5a400: Connection to scratch-OST002b (at 172.23.80.68@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
2013-09-08 05:04:58 LustreError: 167-0: This client was evicted by scratch-OST002b; in progress operations using this service will fail.
2013-09-08 05:04:58 LustreError: 33732:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST002b-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
2013-09-08 05:04:58 LustreError: 33732:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff880320afb480 (48996861/0/0/0) (rc: 1)
2013-09-08 05:04:59 Lustre: scratch-OST002b-osc-ffff880464f5a400: Connection restored to scratch-OST002b (at 172.23.80.68@o2ib1)
2013-09-08 05:05:05 LustreError: 33253:0:(lov_lock.c:671:lov_lock_enqueue()) ASSERTION( sublock->cll_conflict == ((void *)0) ) failed:
2013-09-08 05:05:05 LustreError: 33253:0:(lov_lock.c:671:lov_lock_enqueue()) LBUG
2013-09-08 05:05:05 Pid: 33253, comm: arpege
2013-09-08 05:05:05 Call Trace:
2013-09-08 05:05:05 [<ffffffffa03977f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2013-09-08 05:05:05 [<ffffffffa0397e07>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-09-08 05:05:05 [<ffffffffa0741690>] ? osc_lock_enqueue+0x0/0x810 [osc]
2013-09-08 05:05:05 [<ffffffffa08858dd>] lov_lock_enqueue+0x71d/0x820 [lov]
2013-09-08 05:05:05 [<ffffffffa04b37fc>] cl_enqueue_try+0xfc/0x300 [obdclass]
2013-09-08 05:05:05 [<ffffffffa04b4bfd>] cl_enqueue_locked+0x6d/0x200 [obdclass]
2013-09-08 05:05:05 [<ffffffffa04b587e>] cl_lock_request+0x7e/0x1c0 [obdclass]
2013-09-08 05:05:05 [<ffffffffa09f02cf>] cl_glimpse_lock+0x16f/0x410 [lustre]
2013-09-08 05:05:05 [<ffffffffa09f0a94>] cl_glimpse_size+0x154/0x160 [lustre]
2013-09-08 05:05:05 [<ffffffffa09b49ae>] ll_inode_revalidate_it+0x5e/0x170 [lustre]
2013-09-08 05:05:05 [<ffffffffa09b4b09>] ll_getattr_it+0x49/0x170 [lustre]
2013-09-08 05:05:05 [<ffffffffa09b4c67>] ll_getattr+0x37/0x40 [lustre]
2013-09-08 05:05:05 [<ffffffff8116c591>] vfs_getattr+0x51/0x80
2013-09-08 05:05:05 [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
2013-09-08 05:05:05 [<ffffffff8116c84f>] vfs_fstat+0x3f/0x60
2013-09-08 05:05:05 [<ffffffff8116c894>] sys_newfstat+0x24/0x40
2013-09-08 05:05:05 [<ffffffff81173e85>] ? putname+0x35/0x50
2013-09-08 05:05:06 [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
2013-09-08 05:05:06 [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
2013-09-08 05:05:06 Kernel panic - not syncing: LBUG
2013-09-08 05:05:06 Pid: 33253, comm: arpege Tainted: G W --------------- H 2.6.32-279.5.2.bl6.Bull.36.x86_64 #1
2013-09-08 05:05:06 Call Trace:
2013-09-08 05:05:06 [<ffffffff81495fe3>] ? panic+0xa0/0x168
2013-09-08 05:05:06 [<ffffffffa0397e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2013-09-08 05:05:06 [<ffffffffa0741690>] ? osc_lock_enqueue+0x0/0x810 [osc]
2013-09-08 05:05:06 [<ffffffffa08858dd>] ? lov_lock_enqueue+0x71d/0x820 [lov]
2013-09-08 05:05:06 [<ffffffffa04b37fc>] ? cl_enqueue_try+0xfc/0x300 [obdclass]
2013-09-08 05:05:06 [<ffffffffa04b4bfd>] ? cl_enqueue_locked+0x6d/0x200 [obdclass]
2013-09-08 05:05:06 [<ffffffffa04b587e>] ? cl_lock_request+0x7e/0x1c0 [obdclass]
2013-09-08 05:05:06 [<ffffffffa09f02cf>] ? cl_glimpse_lock+0x16f/0x410 [lustre]
2013-09-08 05:05:06 [<ffffffffa09f0a94>] ? cl_glimpse_size+0x154/0x160 [lustre]
2013-09-08 05:05:06 [<ffffffffa09b49ae>] ? ll_inode_revalidate_it+0x5e/0x170 [lustre]
2013-09-08 05:05:06 [<ffffffffa09b4b09>] ? ll_getattr_it+0x49/0x170 [lustre]
2013-09-08 05:05:06 [<ffffffffa09b4c67>] ? ll_getattr+0x37/0x40 [lustre]
2013-09-08 05:05:06 [<ffffffff8116c591>] ? vfs_getattr+0x51/0x80
2013-09-08 05:05:06 [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
2013-09-08 05:05:06 [<ffffffff8116c84f>] ? vfs_fstat+0x3f/0x60
2013-09-08 05:05:06 [<ffffffff8116c894>] ? sys_newfstat+0x24/0x40
2013-09-08 05:05:06 [<ffffffff81173e85>] ? putname+0x35/0x50
2013-09-08 05:05:06 [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
2013-09-08 05:05:06 [<ffffffff810030f2>] ? system_call_fastpath+0x16/0x1b
This looks very similar to LU-3874, but the patch suspected in that ticket is not applied to our version (it landed only on master and b2_4).
Unfortunately, we do not have any dump to analyze further.
Thanks,
Sebastien.
Hi Bruno,
We don't have any new occurrences.
We will enable debug logs so that we can capture data if it happens again.
The application is a hybrid one (MPI + OpenMP) that may access some files concurrently, inducing multiple locks on the same region of the same objects.
We are also tracking another issue in which this application triggers lock callback timeouts on the OSS, leading to eviction of some clients. I don't know whether these two problems are related, but they seem quite close. We will open a new Jira ticket when we have more data about it.