[LU-17305] LustreError: 4731:0:(osp_precreate.c:220:osp_statfs_update()) ASSERTION( imp ) failed: Created: 21/Nov/23 Updated: 21/Nov/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Vladimir Saveliev | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
__class_new_export() adds every export (except self exports) to the obd's obd_exports_timed list. That can lead to the following failure. An osp device (e.g. lustre-MDT0001-osp-MDT0000) gets onto ll_evictor's list via:
void ptlrpc_update_export_timer(struct obd_export *exp, time64_t extra_delay)
..
        if (ktime_get_real_seconds() >
            (exp->exp_obd->obd_eviction_timer + extra_delay)) {
                /*
                 * The evictor won't evict anyone who we've heard from
                 * recently, so we don't have to check before we start
                 * it.
                 */
                if (!ping_evictor_wake(exp))
                        exp->exp_obd->obd_eviction_timer = 0;
A single export of the osp device may genuinely expire while ll_evictor is still processing a previous obd. Below is a real example of how long class_fail_export() can take (about 84 seconds between the two lines):
00000020:00080000:1.0:1697800331.318457:0:11259:0:(genops.c:1602:class_fail_export()) disconnecting export 00000000aebb178e/1f7b0f9c-d105-4dcb-b264-be1a9fe6c818
00000020:00080000:1.0:1697800415.211208:0:11259:0:(genops.c:1619:class_fail_export()) disconnected export 00000000aebb178e/1f7b0f9c-d105-4dcb-b264-be1a9fe6c818
Now the osp's export looks "dead" to ll_evictor:
00000100:00080000:1.0:1697800415.211212:0:11259:0:(pinger.c:498:ping_evictor_main()) evicting all exports of obd lustre-MDT0002-osp-MDT0001 older than 1697800385
00000100:02000400:1.0:1697800415.211217:0:11259:0:(pinger.c:525:ping_evictor_main()) lustre-MDT0002-osp-MDT0001: haven't heard from client lustre-MDT0001-mdtlov_UUID (at 0@lo) in 60 seconds. I think it's dead, and I am evicting it. exp 00000000917f6020, cur 1697800415 expire 1697800385 last 1697800355
class_fail_export() for that export does a lot, including clearing obd->u.cli.cl_import via obd_cleanup_client_import():
ping_evictor_main
  class_fail_export
    obd_disconnect
      osp_obd_disconnect
        class_manual_cleanup
          class_process_config(LCFG_CLEANUP)
            class_cleanup
              obd_precleanup
                osp_device_fini
                  client_obd_cleanup
                    obd_cleanup_client_import
                      obd->u.cli.cl_import = NULL;
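The timestamps in the log excerpts above can be cross-checked with a little arithmetic. The following is a standalone user-space sketch, not Lustre code: the struct and function names are hypothetical, and the values are copied from the quoted log lines. It suggests that the osp export was last heard from while the evictor was still busy in class_fail_export() for the previous export, and that by the time the evictor reached it, its last-request time was already older than the cutoff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Timestamps (seconds) copied from the debug log lines quoted above. */
struct evict_timeline {
	int64_t fail_start; /* class_fail_export(): "disconnecting export" */
	int64_t fail_end;   /* class_fail_export(): "disconnected export" */
	int64_t last;       /* "last" in the eviction message */
	int64_t expire;     /* cutoff: "older than" / "expire" in the log */
	int64_t cur;        /* "cur" in the eviction message */
};

static const struct evict_timeline logged = {
	.fail_start = 1697800331,
	.fail_end   = 1697800415,
	.last       = 1697800355,
	.expire     = 1697800385,
	.cur        = 1697800415,
};

/* How long the evictor was busy failing the previous export. */
static int64_t busy_seconds(const struct evict_timeline *t)
{
	assert(t->fail_end >= t->fail_start);
	return t->fail_end - t->fail_start;
}

/* Silence reported in the "haven't heard from client" message. */
static int64_t silent_seconds(const struct evict_timeline *t)
{
	return t->cur - t->last;
}

/* True if the export was last heard from while the evictor was busy. */
static bool last_seen_while_busy(const struct evict_timeline *t)
{
	return t->last >= t->fail_start && t->last <= t->fail_end;
}

/* True if the export's last request is older than the eviction cutoff. */
static bool older_than_cutoff(const struct evict_timeline *t)
{
	return t->last < t->expire;
}
```

With the logged values, busy_seconds() gives 84 and silent_seconds() gives 60, which matches the "in 60 seconds" text of the eviction message.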
Because the osp-pre threads are not stopped by the evictor, this leads to:
osp_precreate_thread
  osp_statfs_update
    imp = d->opd_obd->u.cli.cl_import;
    LASSERT(imp);
If such an export (created by client_connect_import()) were not linked into the obd_exports_timed list, the problem would not exist. |
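The last point can be illustrated with a toy user-space model. This is only a sketch under stated assumptions: every type and function name below is hypothetical, nothing here is Lustre code, and it does not claim to reflect what the uploaded patch actually does. It models an evictor that can only fail exports found on the timed list, and shows that an export never linked there keeps its import, so the check in osp_statfs_update() would still hold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the client import. */
struct model_import {
	int unused;
};

/* Hypothetical stand-in for the osp obd device. */
struct model_obd {
	struct model_import *cl_import; /* models obd->u.cli.cl_import */
	bool export_on_timed_list;      /* models membership in obd_exports_timed */
};

/*
 * Models export creation. link_timed == true mimics the behaviour described
 * in this ticket; link_timed == false mimics the suggested alternative.
 */
static void model_new_export(struct model_obd *obd, struct model_import *imp,
			     bool link_timed)
{
	assert(obd != NULL && imp != NULL);
	obd->cl_import = imp;
	obd->export_on_timed_list = link_timed;
}

/*
 * Models one evictor pass over an export that has already gone stale:
 * only an export on the timed list is failed, and failing it clears
 * the import, as in the cleanup call chain above.
 */
static void model_evictor_pass(struct model_obd *obd)
{
	if (obd->export_on_timed_list) {
		obd->cl_import = NULL;
		obd->export_on_timed_list = false;
	}
}

/* Models the check in osp_statfs_update(): would LASSERT(imp) hold? */
static bool model_statfs_assert_holds(const struct model_obd *obd)
{
	return obd->cl_import != NULL;
}
```

Calling model_new_export() with link_timed set to true, then model_evictor_pass(), leaves model_statfs_assert_holds() false. With link_timed set to false the import survives the pass and the check holds.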
| Comments |
| Comment by Gerrit Updater [ 21/Nov/23 ] |
|
"Vladimir Saveliev <vladimir.saveliev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53192 |