[LU-17305] LustreError: 4731:0:(osp_precreate.c:220:osp_statfs_update()) ASSERTION( imp ) failed: Created: 21/Nov/23  Updated: 21/Nov/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Vladimir Saveliev Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

__class_new_export() links every export, except self-exports, into the obd's obd_exports_timed list.

That may lead to the following failure:

An osp device (e.g. lustre-MDT0001-osp-MDT0000) gets onto ll_evictor's list via:

void ptlrpc_update_export_timer(struct obd_export *exp, time64_t extra_delay)
..
                if (ktime_get_real_seconds() >
                    (exp->exp_obd->obd_eviction_timer + extra_delay)) {
                        /*
                         * The evictor won't evict anyone who we've heard from
                         * recently, so we don't have to check before we start
                         * it.
                         */
                        if (!ping_evictor_wake(exp))
                                exp->exp_obd->obd_eviction_timer = 0;

A single export of the osp device may genuinely expire while ll_evictor is still processing the previous obd. Below is a real example of how long class_fail_export() can take (about 84 seconds between the two log lines):

00000020:00080000:1.0:1697800331.318457:0:11259:0:(genops.c:1602:class_fail_export()) disconnecting export 00000000aebb178e/1f7b0f9c-d105-4dcb-b264-be1a9fe6c818
00000020:00080000:1.0:1697800415.211208:0:11259:0:(genops.c:1619:class_fail_export()) disconnected export 00000000aebb178e/1f7b0f9c-d105-4dcb-b264-be1a9fe6c818

Now the osp's export looks "dead" to ll_evictor:

00000100:00080000:1.0:1697800415.211212:0:11259:0:(pinger.c:498:ping_evictor_main()) evicting all exports of obd lustre-MDT0002-osp-MDT0001 older than 1697800385
00000100:02000400:1.0:1697800415.211217:0:11259:0:(pinger.c:525:ping_evictor_main()) lustre-MDT0002-osp-MDT0001: haven't heard from client lustre-MDT0001-mdtlov_UUID (at 0@lo) in 60 seconds. I think it's dead, and I am evicting it. exp 00000000917f6020, cur 1697800415 expire 1697800385 last 1697800355
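
The timestamps in the two log excerpts above can be checked with a little arithmetic. The sketch below is not Lustre code: evict_cutoff() and export_is_stale() are hypothetical helpers, and the 30 second window is only inferred from the log line (cur 1697800415, expire 1697800385), not taken from the source tree.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: the evictor treats an export as stale when its last request
 * time is older than (now - window). The 30 s window is inferred from the
 * log above (cur 1697800415, expire 1697800385).
 */
#define EVICT_WINDOW_SEC 30

/* Cutoff below which an export counts as "not heard from recently". */
static int64_t evict_cutoff(int64_t now)
{
        return now - EVICT_WINDOW_SEC;
}

/* True if the export would be picked for eviction at time 'now'. */
static bool export_is_stale(int64_t last_request, int64_t now)
{
        return last_request < evict_cutoff(now);
}
```

With the values from the log, export_is_stale(1697800355, 1697800415) is true, while the same export evaluated at the moment it was last heard from (1697800355) is not stale. The evictor spent the interval from 1697800331 to 1697800415 inside class_fail_export() for the previous obd, which is longer than the inferred window.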

class_fail_export() for that export does a lot, including clearing obd->u.cli.cl_import via obd_cleanup_client_import():

ping_evictor_main
   class_fail_export
      obd_disconnect
         osp_obd_disconnect
            class_manual_cleanup
               class_process_config(LCFG_CLEANUP)
                  class_cleanup
                     obd_precleanup
                        osp_device_fini
                           client_obd_cleanup
                              obd_cleanup_client_import
                                 obd->u.cli.cl_import = NULL;

As the osp-pre threads are not stopped by the evictor, this leads to
the assertion failure in:

osp_precreate_thread
   osp_statfs_update
      imp = d->opd_obd->u.cli.cl_import;
      LASSERT(imp);

If such an export (created by client_connect_import) were not linked to the obd_exports_timed list, the problem would not exist.
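
The idea in the previous sentence can be illustrated with a toy model. Everything here is hypothetical: struct toy_export, toy_link_export() and toy_count_evictable() are stand-ins invented for illustration, not struct obd_export or any Lustre function, and the sketch does not claim to reflect what the actual patch does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for an export; NOT struct obd_export. */
struct toy_export {
        int64_t last_request;   /* seconds, last time the peer was heard */
        bool    on_timed_list;  /* hypothetical: visible to the evictor? */
};

/*
 * Hypothetical policy: an export created on the client side of an import
 * (the client_connect_import case in this ticket) stays off the timed list.
 */
static void toy_link_export(struct toy_export *exp, bool from_client_import,
                            int64_t now)
{
        exp->last_request = now;
        exp->on_timed_list = !from_client_import;
}

/* Count exports the toy evictor would evict: only timed ones can expire. */
static size_t toy_count_evictable(const struct toy_export *exps, size_t n,
                                  int64_t cutoff)
{
        size_t hits = 0;

        for (size_t i = 0; i < n; i++)
                if (exps[i].on_timed_list && exps[i].last_request < cutoff)
                        hits++;
        return hits;
}
```

In this model, two exports last heard from at 1697800355 and checked against the cutoff 1697800385 yield one eviction candidate if only one of them came from a client import: the import-side export is never scanned, so its device is never torn down by the evictor.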



 Comments   
Comment by Gerrit Updater [ 21/Nov/23 ]

"Vladimir Saveliev <vladimir.saveliev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53192
Subject: LU-17305 obdclass: do not link all exports to obd's timed list
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9383339a726b46e33a758dda77a864f38b28795e
