LU-17305

LustreError: 4731:0:(osp_precreate.c:220:osp_statfs_update()) ASSERTION( imp ) failed:

Details

    • Bug
    • Resolution: Unresolved
    • Minor

    Description

      __class_new_export() adds every export except self exports to the obd's obd_exports_timed list.

      That may lead to the following failure:

      An osp device (e.g. lustre-MDT0001-osp-MDT0000) gets onto ll_evictor's list via:

      void ptlrpc_update_export_timer(struct obd_export *exp, time64_t extra_delay)
      ..
                      if (ktime_get_real_seconds() >
                          (exp->exp_obd->obd_eviction_timer + extra_delay)) {
                              /*
                               * The evictor won't evict anyone who we've heard from
                               * recently, so we don't have to check before we start
                               * it.
                               */
                              if (!ping_evictor_wake(exp))
                                      exp->exp_obd->obd_eviction_timer = 0;
      

      A single export of the osp device may actually expire while ll_evictor is still processing a previous obd. The log below is a real example of how long class_fail_export() may take (about 84 seconds between the two messages):

      00000020:00080000:1.0:1697800331.318457:0:11259:0:(genops.c:1602:class_fail_export()) disconnecting export 00000000aebb178e/1f7b0f9c-d105-4dcb-b264-be1a9fe6c818
      00000020:00080000:1.0:1697800415.211208:0:11259:0:(genops.c:1619:class_fail_export()) disconnected export 00000000aebb178e/1f7b0f9c-d105-4dcb-b264-be1a9fe6c818
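
      For reference, the delay can be read off the two timestamps above. The helper names below are illustrative only and are not part of Lustre; this is a small sketch of the arithmetic, assuming the timestamp field is "seconds.microseconds":

```c
#include <assert.h>
#include <stdint.h>

/* Convert a "seconds.microseconds" debug-log timestamp to microseconds.
 * Hypothetical helper, not a Lustre function. */
static int64_t log_ts_to_usec(int64_t sec, int64_t usec)
{
	assert(usec >= 0 && usec < 1000000);
	return sec * 1000000 + usec;
}

/* Elapsed time between two log timestamps, in microseconds. */
static int64_t elapsed_usec(int64_t start_usec, int64_t end_usec)
{
	assert(end_usec >= start_usec);
	return end_usec - start_usec;
}
```

      With the "disconnecting" timestamp (1697800331, 318457) and the "disconnected" timestamp (1697800415, 211208), elapsed_usec() gives 83892751 microseconds, i.e. roughly 83.9 seconds spent in one class_fail_export() call.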
      

      Now the osp's export looks "dead" to ll_evictor:

      00000100:00080000:1.0:1697800415.211212:0:11259:0:(pinger.c:498:ping_evictor_main()) evicting all exports of obd lustre-MDT0002-osp-MDT0001 older than 1697800385
      00000100:02000400:1.0:1697800415.211217:0:11259:0:(pinger.c:525:ping_evictor_main()) lustre-MDT0002-osp-MDT0001: haven't heard from client lustre-MDT0001-mdtlov_UUID (at 0@lo) in 60 seconds. I think it's dead, and I am evicting it. exp 00000000917f6020, cur 1697800415 expire 1697800385 last 1697800355
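
      The numbers in that message are consistent with a simple comparison of the export's last request time against the cutoff. The sketch below only restates that arithmetic; the function names are hypothetical, and the rule "last older than expire means evict" is an assumption inferred from the message text rather than a copy of the pinger code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Seconds since the export was last heard from ("cur" minus "last"). */
static int64_t silent_for(int64_t cur, int64_t last)
{
	assert(cur >= last);
	return cur - last;
}

/* Assumed rule: an export is an eviction candidate when its last
 * request time is older than the cutoff ("expire" in the message). */
static bool looks_dead(int64_t last, int64_t expire)
{
	return last < expire;
}
```

      For cur 1697800415, expire 1697800385 and last 1697800355, silent_for() yields the 60 seconds reported in the message, and looks_dead() is true because last is 30 seconds older than the cutoff.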
      

      class_fail_export() for that export does a lot, including clearing obd->u.cli.cl_import via obd_cleanup_client_import():

      ping_evictor_main
         class_fail_export
            obd_disconnect
               osp_obd_disconnect
                  class_manual_cleanup
                     class_process_config(LCFG_CLEANUP)
                        class_cleanup
                           obd_precleanup
                              osp_device_fini
                                 client_obd_cleanup
                                    obd_cleanup_client_import
                                       obd->u.cli.cl_import = NULL;
      

      As the osp-pre threads are not stopped by the evictor, that leads to
      an assertion failure in:

      osp_precreate_thread
         osp_statfs_update
            imp = d->opd_obd->u.cli.cl_import;
            LASSERT(imp);
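
      The ordering problem can be modelled in a few lines. This is a toy model with hypothetical stand-in types (toy_import, toy_obd), not Lustre code; it returns an error where the real code asserts, purely so that the sequence can be exercised:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the Lustre structures. */
struct toy_import { int unused; };
struct toy_obd { struct toy_import *cl_import; };

/* Models the last step of the cleanup chain: the import pointer is cleared. */
static void toy_cleanup_client_import(struct toy_obd *obd)
{
	assert(obd != NULL);
	obd->cl_import = NULL;
}

/* Models the check in the statfs update path. Returns 0 when an import
 * is present and -1 where the real code would hit LASSERT(imp). */
static int toy_statfs_update(const struct toy_obd *obd)
{
	assert(obd != NULL);
	return obd->cl_import != NULL ? 0 : -1;
}
```

      Calling toy_statfs_update() before toy_cleanup_client_import() succeeds; calling it afterwards hits the NULL import, which is the situation a still-running osp-pre thread ends up in once the evictor has cleaned up the device.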
      

      If such an export (created by client_connect_import()) were not linked to the obd_exports_timed list, the problem would not exist.
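
      One way to picture that idea is sketched below. It is a toy model only: toy_export, its fields and toy_new_export() are hypothetical, and the from_import marker is an assumption made for illustration; how such exports would actually be recognised in __class_new_export() is not stated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in; field names are illustrative only. */
struct toy_export {
	bool is_self;        /* the obd's own self export */
	bool from_import;    /* assumed marker: created for a client import */
	bool on_timed_list;  /* linked to the timed (evictable) list */
};

/* Link the export to the timed list only if it is neither a self
 * export nor one created on behalf of a client import. */
static void toy_new_export(struct toy_export *exp, bool is_self,
			   bool from_import)
{
	assert(exp != NULL);
	exp->is_self = is_self;
	exp->from_import = from_import;
	exp->on_timed_list = !is_self && !from_import;
}
```

      In this model an ordinary client export still ends up on the timed list, while a self export or an import-created export does not, so the evictor would never consider the latter.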

          People

            vsaveliev Vladimir Saveliev