Details
Type: Bug
Resolution: Unresolved
Priority: Minor
Description
__class_new_export() adds every export except self exports to the obd's obd_exports_timed list.
That may lead to the following failure.
An osp device (e.g. lustre-MDT0001-osp-MDT0000) gets onto ll_evictor's list via:
void ptlrpc_update_export_timer(struct obd_export *exp, time64_t extra_delay)
..
        if (ktime_get_real_seconds() >
            (exp->exp_obd->obd_eviction_timer + extra_delay)) {
                /*
                 * The evictor won't evict anyone who we've heard from
                 * recently, so we don't have to check before we start
                 * it.
                 */
                if (!ping_evictor_wake(exp))
                        exp->exp_obd->obd_eviction_timer = 0;
A single export of the osp device may genuinely expire while ll_evictor is still processing a previous obd. The log below is a real example of how long class_fail_export() can take (about 84 seconds between the two messages):
00000020:00080000:1.0:1697800331.318457:0:11259:0:(genops.c:1602:class_fail_export()) disconnecting export 00000000aebb178e/1f7b0f9c-d105-4dcb-b264-be1a9fe6c818
00000020:00080000:1.0:1697800415.211208:0:11259:0:(genops.c:1619:class_fail_export()) disconnected export 00000000aebb178e/1f7b0f9c-d105-4dcb-b264-be1a9fe6c818
Now the osp's exports look "dead" to ll_evictor:
00000100:00080000:1.0:1697800415.211212:0:11259:0:(pinger.c:498:ping_evictor_main()) evicting all exports of obd lustre-MDT0002-osp-MDT0001 older than 1697800385
00000100:02000400:1.0:1697800415.211217:0:11259:0:(pinger.c:525:ping_evictor_main()) lustre-MDT0002-osp-MDT0001: haven't heard from client lustre-MDT0001-mdtlov_UUID (at 0@lo) in 60 seconds. I think it's dead, and I am evicting it. exp 00000000917f6020, cur 1697800415 expire 1697800385 last 1697800355
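The timestamps in the log lines above can be checked with a few lines of arithmetic. The following is only a user-space sketch, not Lustre code: the constants are copied from the logs, while the helper names and the "last older than cutoff" comparison are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the debug log (seconds since the epoch). */
#define FAIL_EXPORT_START 1697800331LL /* "disconnecting export" */
#define FAIL_EXPORT_END   1697800415LL /* "disconnected export" */
#define EVICT_CUR         1697800415LL /* "cur" in the eviction message */
#define EVICT_EXPIRE      1697800385LL /* "expire": eviction cutoff */
#define EXPORT_LAST       1697800355LL /* "last": last time the export was heard from */

/* Hypothetical helper: how long the evictor was busy in class_fail_export(). */
static int64_t evictor_stall_seconds(void)
{
	return FAIL_EXPORT_END - FAIL_EXPORT_START;
}

/* Hypothetical helper: the silence period reported in the eviction message. */
static int64_t export_silence_seconds(void)
{
	return EVICT_CUR - EXPORT_LAST;
}

/*
 * Assumed model of the check: an export looks dead when its last
 * request time is older than the eviction cutoff.
 */
static bool export_looks_dead(int64_t last, int64_t expire)
{
	assert(last > 0 && expire > 0);
	return last < expire;
}
```

With these values the evictor was stalled for 84 seconds, the export had been silent for 60 seconds, and `export_looks_dead(EXPORT_LAST, EVICT_EXPIRE)` is true. Note that `EXPORT_LAST` is later than `FAIL_EXPORT_START`, i.e. the export was still being heard from after the evictor had begun the long class_fail_export() call.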
class_fail_export() for that export does a lot, including clearing obd->u.cli.cl_import in obd_cleanup_client_import(), which is reached through the following call chain:
ping_evictor_main
class_fail_export
obd_disconnect
osp_obd_disconnect
class_manual_cleanup
class_process_config(LCFG_CLEANUP)
class_cleanup
obd_precleanup
osp_device_fini
client_obd_cleanup
obd_cleanup_client_import
obd->u.cli.cl_import = NULL;
Because the osp-pre threads are not stopped by the evictor, this leads to an assertion failure in:
osp_precreate_thread
osp_statfs_update
imp = d->opd_obd->u.cli.cl_import;
LASSERT(imp);
If such an export (created by client_connect_import()) were not linked into the obd_exports_timed list, the problem would not exist.
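The idea can be modelled in user space as follows. This is a sketch under stated assumptions rather than a patch: `struct model_export`, its `from_client_import` marker and all `model_*` functions are hypothetical names that do not exist in Lustre, and how the real code would recognise such an export is left open.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct obd_export; not the real layout. */
struct model_export {
	bool is_self;            /* the obd's own self export */
	bool from_client_import; /* assumed marker: created via client_connect_import() */
	bool on_timed_list;      /* models membership in obd_exports_timed */
};

/* Behaviour as described in this ticket: all exports but self ones are linked. */
static void model_new_export_current(struct model_export *exp)
{
	assert(exp != 0);
	exp->on_timed_list = !exp->is_self;
}

/* Suggested behaviour: also skip exports created by a client import connect. */
static void model_new_export_proposed(struct model_export *exp)
{
	assert(exp != 0);
	exp->on_timed_list = !exp->is_self && !exp->from_client_import;
}

/* In this model the evictor only considers exports on the timed list. */
static bool model_evictor_may_evict(const struct model_export *exp)
{
	return exp->on_timed_list;
}
```

In this model an export with `from_client_import` set is visible to the evictor under the current rule and invisible under the proposed one, while ordinary client exports are unaffected.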