LU-12142

Hang in OSC on eviction - threads stuck in read() and ldlm_bl_NN

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.15.0
    • Affects Version/s: Lustre 2.10.7, Lustre 2.12.3

    Description

      This appears to be related to LU-6271.

      A private customer ticket reported a hang on a client which was suffering repeated evictions.

      The client threads all seem to be waiting in two connected places.

      First, the eviction:

      [<ffffffffc11f8c05>] osc_object_invalidate+0x115/0x290 [osc]
      [<ffffffffc11e9f4f>] osc_ldlm_resource_invalidate+0xaf/0x190 [osc]
      [<ffffffffc0ce8d10>] cfs_hash_for_each_relax+0x250/0x450 [libcfs]
      [<ffffffffc0cec0a5>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
      [<ffffffffc11f1427>] osc_import_event+0x497/0x1370 [osc]
      [<ffffffffc13b3590>] ptlrpc_invalidate_import+0x220/0x8f0 [ptlrpc]
      [<ffffffffc13b50c8>] ptlrpc_invalidate_import_thread+0x48/0x2b0 [ptlrpc]
      [<ffffffffa52c1c71>] kthread+0xd1/0xe0
      [<ffffffffa5974c1d>] ret_from_fork_nospec_begin+0x7/0x21
      [<ffffffffffffffff>] 0xffffffffffffffff 

      And then the other side:

      [<ffffffffc11faf65>] osc_lru_alloc+0x265/0x390 [osc]
      [<ffffffffc11fb1c2>] osc_page_init+0x132/0x1d0 [osc]
      [<ffffffffc0ff146f>] lov_page_init_composite+0x26f/0x4c0 [lov]
      [<ffffffffc0fe8b11>] lov_page_init+0x21/0x60 [lov]
      [<ffffffffc0e849bd>] cl_page_alloc+0x10d/0x280 [obdclass]
      [<ffffffffc0e84ba4>] cl_page_find+0x74/0x280 [obdclass]
      [<ffffffffc1111653>] ll_readpage+0x83/0x6e0 [lustre]
      [<ffffffffa53b81f0>] generic_file_aio_read+0x3f0/0x790
      [<ffffffffc1139037>] vvp_io_read_start+0x4b7/0x600 [lustre]
      [<ffffffffc0e87b78>] cl_io_start+0x68/0x130 [obdclass]
      [<ffffffffc0e89f5e>] cl_io_loop+0x12e/0xc90 [obdclass]
      [<ffffffffc10e43c8>] ll_file_io_generic+0x498/0xc80 [lustre]
      [<ffffffffc10e547a>] ll_file_aio_read+0x34a/0x3e0 [lustre]
      [<ffffffffc10e55de>] ll_file_read+0xce/0x1e0 [lustre]
      [<ffffffffa54414bf>] vfs_read+0x9f/0x170
      [<ffffffffa544237f>] SyS_read+0x7f/0xf0
      [<ffffffffa5974ddb>] system_call_fastpath+0x22/0x27
      [<ffffffffffffffff>] 0xffffffffffffffff

      The eviction side is waiting for:

      l_wait_event(osc->oo_io_waitq, atomic_read(&osc->oo_nr_ios) == 0, &lwi);

      This is the first action in osc_object_invalidate.

       

      And the other side, in osc_lru_alloc, sleeps with no timeout on the osc_lru_waitq:

              struct l_wait_info lwi = LWI_INTR(LWI_ON_SIGNAL_NOOP, NULL);
      [.....]
                      rc = l_wait_event(osc_lru_waitq,
                                      atomic_long_read(cli->cl_lru_left) > 0,
                                      &lwi); 

      osc_lru_alloc is called after osc_io_iter_init, which increments oo_nr_ios, so the thread sleeps here with oo_nr_ios elevated.
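
The circular wait described above can be illustrated with a toy model. This is only a sketch, not Lustre code: `struct model_state` and the three helper functions are hypothetical names introduced for illustration, and the two fields merely stand in for `oo_nr_ios` and the counter behind `cl_lru_left`.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not Lustre code. Field names mirror the counters discussed above. */
struct model_state {
	long nr_ios;   /* stands in for osc->oo_nr_ios */
	long lru_left; /* stands in for the value behind cli->cl_lru_left */
};

/* Condition the eviction thread sleeps on in osc_object_invalidate. */
bool evict_may_proceed(const struct model_state *s)
{
	return s->nr_ios == 0;
}

/* Condition the reader sleeps on in osc_lru_alloc. */
bool reader_may_proceed(const struct model_state *s)
{
	return s->lru_left > 0;
}

/*
 * Both sleepers are blocked: the reader holds nr_ios elevated while it
 * waits for an LRU slot, and the evictor will not free pages (and so
 * will not raise lru_left) until nr_ios drops to zero.
 */
bool model_is_stuck(const struct model_state *s)
{
	return !evict_may_proceed(s) && !reader_may_proceed(s);
}
```

With `nr_ios = 1` and `lru_left = 0`, `model_is_stuck` returns true. Under this model, only some third party freeing an LRU slot would let the reader finish and drop `nr_ios`.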

       

      The OSC eviction path does not wake osc_lru_waitq directly; it does so only as a side effect of freeing pages from objects. So if the first object to be invalidated has threads waiting for pages, I think it will get stuck here.  (We would also expect the failure of whatever is holding these LRU pages to free them up, so we may have an ordering issue here.)

      Additionally, the osc_lru_alloc code does not appear to have any way to fail if the import is being evicted.  It looks like we have to successfully get a page here before we return to the larger I/O path, which will eventually notice the eviction and fail.
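
One possible direction, sketched here as a toy model and not as the change that actually landed, is to let the LRU wait also complete when the import has been invalidated, and to return an error in that case. Everything in this block is hypothetical: `struct lru_model`, the `import_invalid` flag, and the helpers `lru_wait_done`, `lru_alloc_finish` and `io_fini` are names made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model, not Lustre code. */
struct lru_model {
	long lru_left;       /* stands in for the value behind cli->cl_lru_left */
	long nr_ios;         /* stands in for osc->oo_nr_ios */
	bool import_invalid; /* hypothetical "import is being evicted" flag */
};

/* Wait predicate: done when a slot is free OR the import is invalid. */
bool lru_wait_done(const struct lru_model *m)
{
	return m->lru_left > 0 || m->import_invalid;
}

/* After waking: fail on eviction, otherwise consume one LRU slot. */
int lru_alloc_finish(struct lru_model *m)
{
	assert(lru_wait_done(m));
	if (m->import_invalid)
		return -EIO;
	m->lru_left--;
	return 0;
}

/* I/O teardown drops the in-flight count the evictor is waiting on. */
void io_fini(struct lru_model *m)
{
	m->nr_ios--;
}
```

In this model, setting `import_invalid` lets the reader leave the wait with `-EIO`, after which `io_fini` brings `nr_ios` back to zero and the eviction side could continue. A real implementation would also have to wake the sleepers on the wait queue after setting the flag, since a sleeping thread only re-evaluates its condition when woken.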

            People

              wshilong Wang Shilong (Inactive)
              pfarrell Patrick Farrell (Inactive)