Details
-
Bug
-
Resolution: Fixed
-
Minor
Description
__class_new_export() adds every export, except self exports, to the obd's obd_exports_timed list.
That may lead to the following failure:
An osp device (e.g. lustre-MDT0001-osp-MDT0000) gets onto the ll_evictor's list via:
void ptlrpc_update_export_timer(struct obd_export *exp, time64_t extra_delay)
..
        if (ktime_get_real_seconds() >
            (exp->exp_obd->obd_eviction_timer + extra_delay)) {
                /*
                 * The evictor won't evict anyone who we've heard from
                 * recently, so we don't have to check before we start
                 * it.
                 */
                if (!ping_evictor_wake(exp))
                        exp->exp_obd->obd_eviction_timer = 0;
The osp device's single export may then really expire while ll_evictor is still processing a previous obd. Below is a real example of how long class_fail_export() may take (about 84 seconds between the two messages):
00000020:00080000:1.0:1697800331.318457:0:11259:0:(genops.c:1602:class_fail_export()) disconnecting export 00000000aebb178e/1f7b0f9c-d105-4dcb-b264-be1a9fe6c818
00000020:00080000:1.0:1697800415.211208:0:11259:0:(genops.c:1619:class_fail_export()) disconnected export 00000000aebb178e/1f7b0f9c-d105-4dcb-b264-be1a9fe6c818
Now the osp's export looks "dead" to ll_evictor:
00000100:00080000:1.0:1697800415.211212:0:11259:0:(pinger.c:498:ping_evictor_main()) evicting all exports of obd lustre-MDT0002-osp-MDT0001 older than 1697800385
00000100:02000400:1.0:1697800415.211217:0:11259:0:(pinger.c:525:ping_evictor_main()) lustre-MDT0002-osp-MDT0001: haven't heard from client lustre-MDT0001-mdtlov_UUID (at 0@lo) in 60 seconds. I think it's dead, and I am evicting it. exp 00000000917f6020, cur 1697800415 expire 1697800385 last 1697800355
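The staleness check behind that message can be modelled in a few lines. This is only a sketch: struct toy_export and the toy_* helpers are hypothetical names, not Lustre code, and the 30-second timeout is inferred from the log above (cur 1697800415 minus expire 1697800385), not read from the source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an export; only the field that matters here. */
struct toy_export {
	int64_t last_request_time;	/* seconds, "last" in the log line */
};

/* Cutoff the evictor compares against ("expire" in the log line). */
static int64_t toy_expire_time(int64_t now, int64_t evict_timeout)
{
	assert(evict_timeout >= 0);
	return now - evict_timeout;
}

/* Assumption: an export counts as dead when its last request predates the cutoff. */
static bool toy_export_is_stale(const struct toy_export *exp,
				int64_t now, int64_t evict_timeout)
{
	return toy_expire_time(now, evict_timeout) > exp->last_request_time;
}
```

With the logged values (last 1697800355, cur 1697800415) the export is stale. Under the same model it would not have been stale had the evictor reached it by 1697800385, which is why the long class_fail_export() on the previous obd matters.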
class_fail_export() for that export does a lot, including clearing obd->u.cli.cl_import via obd_cleanup_client_import():
ping_evictor_main
  class_fail_export
    obd_disconnect
      osp_obd_disconnect
        class_manual_cleanup
          class_process_config(LCFG_CLEANUP)
            class_cleanup
              obd_precleanup
                osp_device_fini
                  client_obd_cleanup
                    obd_cleanup_client_import
                      obd->u.cli.cl_import = NULL;
As the osp-pre threads are not stopped by the evictor, that leads to an assertion failure in:
osp_precreate_thread
  osp_statfs_update
    imp = d->opd_obd->u.cli.cl_import;
    LASSERT(imp);
If such an export (created by client_connect_import()) were not linked into the obd_exports_timed list, the problem would not exist.
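One way to picture that condition is the toy model below. It is not the actual patch: struct toy_export, its from_client_import flag, and both helpers are hypothetical, and how a real export would be recognised as coming from client_connect_import() is left open here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an export; not the real struct obd_export. */
struct toy_export {
	bool is_self;			/* self export, already skipped today */
	bool from_client_import;	/* assumed marker: made by client_connect_import() */
	bool on_timed_list;		/* stands in for membership of obd_exports_timed */
};

/* Only exports that the ping evictor should watch get timed. */
static bool toy_should_time_export(const struct toy_export *exp)
{
	return !exp->is_self && !exp->from_client_import;
}

/* Models the linking step: record whether the export joins the timed list. */
static void toy_link_export(struct toy_export *exp)
{
	assert(exp != NULL);
	exp->on_timed_list = toy_should_time_export(exp);
}
```

In this model an export flagged from_client_import never reaches the timed list, so ll_evictor never considers it and class_fail_export() is never run against the osp device on its behalf.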