Lustre > LU-3941

(lov_lock.c:671:lov_lock_enqueue()) ASSERTION( sublock->cll_conflict == ((void *)0) ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.6
    • 3
    • 10424

    Description

      An LBUG was hit at MeteoFrance running Lustre 2.1.6, with the following console messages:

      <ConMan> Console [beaufix1034] log at 2013-09-08 04:00:00 UTC.
      2013-09-08 04:03:18 LustreError: 11-0: an error occurred while communicating with 172.23.80.66@o2ib1. The obd_ping operation failed with -107
      2013-09-08 04:03:18 Lustre: scratch-OST0014-osc-ffff880464f5a400: Connection to scratch-OST0014 (at 172.23.80.66@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 04:03:18 LustreError: 167-0: This client was evicted by scratch-OST0014; in progress operations using this service will fail.
      2013-09-08 04:03:18 LustreError: 31914:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST0014-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 04:03:18 LustreError: 31914:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff88016d5fbe40 (48993688/0/0/0) (rc: 1)
      2013-09-08 04:03:18 Lustre: scratch-OST0014-osc-ffff880464f5a400: Connection restored to scratch-OST0014 (at 172.23.80.66@o2ib1)
      2013-09-08 04:19:08 LustreError: 11-0: an error occurred while communicating with 172.23.80.69@o2ib1. The obd_ping operation failed with -107
      2013-09-08 04:19:08 Lustre: scratch-OST001f-osc-ffff880464f5a400: Connection to scratch-OST001f (at 172.23.80.69@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 04:19:08 LustreError: 167-0: This client was evicted by scratch-OST001f; in progress operations using this service will fail.
      2013-09-08 04:19:08 LustreError: 32128:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST001f-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 04:19:08 LustreError: 32128:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff88047ba341c0 (48992886/0/0/0) (rc: 1)
      2013-09-08 04:19:08 Lustre: scratch-OST001f-osc-ffff880464f5a400: Connection restored to scratch-OST001f (at 172.23.80.69@o2ib1)
      2013-09-08 04:29:08 LustreError: 11-0: an error occurred while communicating with 172.23.64.63@o2ib. The obd_ping operation failed with -107
      2013-09-08 04:29:08 Lustre: scratch-OST0004-osc-ffff880464f5a400: Connection to scratch-OST0004 (at 172.23.64.63@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 04:29:08 LustreError: 167-0: This client was evicted by scratch-OST0004; in progress operations using this service will fail.
      2013-09-08 04:29:08 LustreError: 32232:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST0004-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 04:29:08 LustreError: 32232:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff880812ed6bc0 (48991503/0/0/0) (rc: 1)
      2013-09-08 04:29:08 Lustre: scratch-OST0004-osc-ffff880464f5a400: Connection restored to scratch-OST0004 (at 172.23.64.63@o2ib)
      
      <ConMan> Console [beaufix1034] log at 2013-09-08 05:00:00 UTC.
      2013-09-08 05:04:58 LustreError: 11-0: an error occurred while communicating with 172.23.80.68@o2ib1. The obd_ping operation failed with -107
      2013-09-08 05:04:58 Lustre: scratch-OST002b-osc-ffff880464f5a400: Connection to scratch-OST002b (at 172.23.80.68@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 05:04:58 LustreError: 167-0: This client was evicted by scratch-OST002b; in progress operations using this service will fail.
      2013-09-08 05:04:58 LustreError: 33732:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST002b-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 05:04:58 LustreError: 33732:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff880320afb480 (48996861/0/0/0) (rc: 1)
      2013-09-08 05:04:59 Lustre: scratch-OST002b-osc-ffff880464f5a400: Connection restored to scratch-OST002b (at 172.23.80.68@o2ib1)
      2013-09-08 05:05:05 LustreError: 33253:0:(lov_lock.c:671:lov_lock_enqueue()) ASSERTION( sublock->cll_conflict == ((void *)0) ) failed:
      2013-09-08 05:05:05 LustreError: 33253:0:(lov_lock.c:671:lov_lock_enqueue()) LBUG
      2013-09-08 05:05:05 Pid: 33253, comm: arpege
      2013-09-08 05:05:05
      2013-09-08 05:05:05 Call Trace:
      2013-09-08 05:05:05 [<ffffffffa03977f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-09-08 05:05:05 [<ffffffffa0397e07>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-09-08 05:05:05 [<ffffffffa0741690>] ? osc_lock_enqueue+0x0/0x810 [osc]
      2013-09-08 05:05:05 [<ffffffffa08858dd>] lov_lock_enqueue+0x71d/0x820 [lov]
      2013-09-08 05:05:05 [<ffffffffa04b37fc>] cl_enqueue_try+0xfc/0x300 [obdclass]
      2013-09-08 05:05:05 [<ffffffffa04b4bfd>] cl_enqueue_locked+0x6d/0x200 [obdclass]
      2013-09-08 05:05:05 [<ffffffffa04b587e>] cl_lock_request+0x7e/0x1c0 [obdclass]
      2013-09-08 05:05:05 [<ffffffffa09f02cf>] cl_glimpse_lock+0x16f/0x410 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09f0a94>] cl_glimpse_size+0x154/0x160 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09b49ae>] ll_inode_revalidate_it+0x5e/0x170 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09b4b09>] ll_getattr_it+0x49/0x170 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09b4c67>] ll_getattr+0x37/0x40 [lustre]
      2013-09-08 05:05:05 [<ffffffff8116c591>] vfs_getattr+0x51/0x80
      2013-09-08 05:05:05 [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
      2013-09-08 05:05:05 [<ffffffff8116c84f>] vfs_fstat+0x3f/0x60
      2013-09-08 05:05:05 [<ffffffff8116c894>] sys_newfstat+0x24/0x40
      2013-09-08 05:05:05 [<ffffffff81173e85>] ? putname+0x35/0x50
      2013-09-08 05:05:06 [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
      2013-09-08 05:05:06 [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
      2013-09-08 05:05:06
      2013-09-08 05:05:06 Kernel panic - not syncing: LBUG
      2013-09-08 05:05:06 Pid: 33253, comm: arpege Tainted: G W --------------- H 2.6.32-279.5.2.bl6.Bull.36.x86_64 #1
      2013-09-08 05:05:06 Call Trace:
      2013-09-08 05:05:06 [<ffffffff81495fe3>] ? panic+0xa0/0x168
      2013-09-08 05:05:06 [<ffffffffa0397e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      2013-09-08 05:05:06 [<ffffffffa0741690>] ? osc_lock_enqueue+0x0/0x810 [osc]
      2013-09-08 05:05:06 [<ffffffffa08858dd>] ? lov_lock_enqueue+0x71d/0x820 [lov]
      2013-09-08 05:05:06 [<ffffffffa04b37fc>] ? cl_enqueue_try+0xfc/0x300 [obdclass]
      2013-09-08 05:05:06 [<ffffffffa04b4bfd>] ? cl_enqueue_locked+0x6d/0x200 [obdclass]
      2013-09-08 05:05:06 [<ffffffffa04b587e>] ? cl_lock_request+0x7e/0x1c0 [obdclass]
      2013-09-08 05:05:06 [<ffffffffa09f02cf>] ? cl_glimpse_lock+0x16f/0x410 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09f0a94>] ? cl_glimpse_size+0x154/0x160 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09b49ae>] ? ll_inode_revalidate_it+0x5e/0x170 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09b4b09>] ? ll_getattr_it+0x49/0x170 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09b4c67>] ? ll_getattr+0x37/0x40 [lustre]
      2013-09-08 05:05:06 [<ffffffff8116c591>] ? vfs_getattr+0x51/0x80
      2013-09-08 05:05:06 [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
      2013-09-08 05:05:06 [<ffffffff8116c84f>] ? vfs_fstat+0x3f/0x60
      2013-09-08 05:05:06 [<ffffffff8116c894>] ? sys_newfstat+0x24/0x40
      2013-09-08 05:05:06 [<ffffffff81173e85>] ? putname+0x35/0x50
      2013-09-08 05:05:06 [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
      2013-09-08 05:05:06 [<ffffffff810030f2>] ? system_call_fastpath+0x16/0x1b
      

      This looks very similar to LU-3874, but the patch suspected in that ticket is not applied to our version (it landed only on master and b2_4).

      Unfortunately, we do not have any dump to analyze further.

      Thanks,
      Sebastien.

      Attachments

        Activity

          pjones Peter Jones added a comment -

          Issue resolved

          pjones Peter Jones added a comment -

          Sebastien

          This patch would be under consideration to land on b2_1 if/when we are working on a 2.1.7 release, but we have no immediate plans to do this at the time of writing.

          Regards

          Peter


          spiechurski Sebastien Piechurski added a comment -

          Hi Bruno,

          Actually, there has been no new occurrence of this issue. There were only 4 occurrences in 5 days in mid-September, and then nothing.
          Is the patch planned for full review and landing on b2_1?

          bfaccini Bruno Faccini (Inactive) added a comment -

          Seb, looking into the history/comments of this ticket, I wonder if you already gave my patch a try on-site, or if the problem simply did not re-occur?

          bfaccini Bruno Faccini (Inactive) added a comment -

          Cool, thanks Seb for these details and for adding the debug flags.

          spiechurski Sebastien Piechurski added a comment -

          Hi Bruno,

          We don't have any new occurrences.
          We will enable debug logs in case a new occurrence appears.
          The application is a hybrid one (MPI + OpenMP) which might access some files concurrently, inducing multiple locks on the same region of the same objects.

          We are also tracking another issue where this application causes lock callback timeouts on the OSS, with eviction of some clients. I don't know whether these 2 problems are tied, but they seem quite close. We will open a new Jira ticket when we have more data about it.

          bfaccini Bruno Faccini (Inactive) added a comment -

          The patch successfully passed the standard auto-tests and the additional parallel-scale* runs requested.

          Seb, any news on your side? Did you trigger new occurrences? Have you been able to add the Lustre debug flags? Any details about the Arpege application to provide?

          People

            bfaccini Bruno Faccini (Inactive)
            sebastien.buisson Sebastien Buisson (Inactive)
            Votes: 0
            Watchers: 6
