When a failure occurs while processing the llog, Lustre does not properly clean up OBDs. This has been an issue for a long time, and I usually see it when attempting to mount with an incorrect configuration or something otherwise out of the ordinary. These cases usually require a reboot because an OBD is stuck with references to it. The particular case I am running into is when an OSS doesn't have a key loaded for SSK and the MDT attempts to connect to the OSP and fails.
When ptlrpc_connect_import() returns an error, the export created earlier in osp_obd_connect() is not released. This seems to be related to LU-7184: before that change this case would LBUG, but now it is simply not cleaned up properly. I will submit a patch to call obd_disconnect() if osp_obd_connect()->ptlrpc_connect_import() fails.
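To illustrate the leak and the intended fix, here is a minimal userspace sketch. It is not the Lustre code and not the patch itself: every name prefixed with sketch_ is a hypothetical stand-in, the reference count is a plain integer, and the import connect is simulated by a stored return code. It only models the ordering problem, where a reference is taken early and the later error path returns without dropping it.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for an OBD device; not the real struct obd_device. */
struct sketch_obd {
	int refcount;   /* references that would block cleanup and module unload */
	int import_rc;  /* return code the simulated import connect will produce */
};

/* Stand-in for the early export setup in the connect path: takes a reference. */
static int sketch_connect_export(struct sketch_obd *obd)
{
	obd->refcount++;
	return 0;
}

/* Stand-in for the disconnect call: drops the reference taken above. */
static void sketch_disconnect_export(struct sketch_obd *obd)
{
	obd->refcount--;
}

/* Stand-in for the import connect, which fails in the reported case. */
static int sketch_connect_import(struct sketch_obd *obd)
{
	return obd->import_rc;
}

/* Models the reported behaviour: the error path returns with the export held. */
int sketch_osp_connect_leaky(struct sketch_obd *obd)
{
	int rc = sketch_connect_export(obd);

	if (rc)
		return rc;

	rc = sketch_connect_import(obd);
	if (rc) {
		fprintf(stderr, "can't connect obd: rc = %d\n", rc);
		return rc; /* export reference is leaked here */
	}
	return 0;
}

/* Models the proposed behaviour: disconnect before returning the error. */
int sketch_osp_connect_fixed(struct sketch_obd *obd)
{
	int rc = sketch_connect_export(obd);

	if (rc)
		return rc;

	rc = sketch_connect_import(obd);
	if (rc) {
		fprintf(stderr, "can't connect obd: rc = %d\n", rc);
		sketch_disconnect_export(obd); /* undo the early export setup */
		return rc;
	}
	return 0;
}
```

With import_rc set to -ECONNREFUSED (the -111 seen in the logs), the leaky variant leaves refcount at 1 while the fixed variant returns it to 0, which is the difference between a stuck OBD and one that can be cleaned up.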
There is another issue that I don't have an easy solution for, and I wouldn't mind hearing ideas from someone more familiar with llog processing. Since this failure causes llog processing to be aborted for the MDT, no more devices are added, but the first OSP, the one that caused the failure, is already attached and set up. Is there a proper place in the llog processing where something like class_manual_cleanup() could be called?
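For discussion, the kind of rollback I have in mind could be sketched as follows. This is a userspace illustration only and does not answer where the call belongs in Lustre: the record and device structures, the sketch_ names, and the idea of unwinding in reverse order are all assumptions made for the example, and sketch_manual_cleanup() merely stands in for something like class_manual_cleanup().

```c
#include <stddef.h>

#define SKETCH_MAX_DEVS 8

/* Hypothetical device entry; 'setup' mimics an attached and set-up OBD. */
struct sketch_dev {
	const char *name;
	int setup;
};

/* Hypothetical config-log record: device to add and the rc its connect yields. */
struct sketch_cfg_record {
	const char *devname;
	int rc;
};

/* Stand-in for a manual cleanup call on one device. */
static void sketch_manual_cleanup(struct sketch_dev *dev)
{
	dev->setup = 0;
}

/*
 * Process records in order. Each device is attached and set up before its
 * connect can fail, so on the first failure every device set up so far,
 * including the failing one, is cleaned up in reverse order.
 */
int sketch_process_log(const struct sketch_cfg_record *recs, size_t nrecs,
		       struct sketch_dev *devs, size_t *ndevs)
{
	size_t i;

	*ndevs = 0;
	for (i = 0; i < nrecs && i < SKETCH_MAX_DEVS; i++) {
		devs[*ndevs].name = recs[i].devname;
		devs[*ndevs].setup = 1;
		(*ndevs)++;

		if (recs[i].rc != 0) {
			int rc = recs[i].rc;

			while (*ndevs > 0) {
				(*ndevs)--;
				sketch_manual_cleanup(&devs[*ndevs]);
			}
			return rc; /* processing is aborted, nothing left set up */
		}
	}
	return 0;
}
```

The point of the sketch is only that the abort path owns the cleanup of whatever it set up, rather than leaving the half-configured OSP behind for a later umount to trip over.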
The failure from the logs looks something like:
[432236.495419] LustreError: 16927:0:(sec_gss.c:2036:gss_svc_handle_init()) target 'SiteA2-MDT0000_UUID' is not available for context init (no target)
[432236.669171] LustreError: 17057:0:(gss_keyring.c:849:gss_sec_lookup_ctx_kr()) failed request key: -126
[432236.679584] LustreError: 17057:0:(sec.c:448:sptlrpc_req_get_ctx()) req ffff880f05f10600: fail to get context
[432236.690680] LustreError: 17057:0:(osp_dev.c:1452:osp_obd_connect()) SiteA2-OST0001-osc-MDT0000: can't connect obd: rc = -111
[432236.703333] LustreError: 17057:0:(lod_lov.c:302:lod_add_device()) SiteA2-OST0001-osc-MDT0000: cannot connect to next dev SiteA2-OST0001_UUID (-111)
[432236.718584] LustreError: 17057:0:(obd_config.c:1716:class_config_llog_handler()) MGC10.10.10.19@o2ib: cfg command failed: rc = -111
[432236.731905] Lustre: cmd=cf00d 0:SiteA2-MDT0000-mdtlov 1:SiteA2-OST0001_UUID 2:1 3:1
[432236.732069] LustreError: 15c-8: MGC10.10.10.19@o2ib: The configuration from log 'SiteA2-MDT0000' failed (-111). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
[432236.758276] LustreError: 16836:0:(obd_mount_server.c:1383:server_start_targets()) failed to start server SiteA2-MDT0000: -111
[432236.771206] LustreError: 16836:0:(obd_mount_server.c:1934:server_fill_super()) Unable to start targets: -111
[432236.782390] Lustre: Failing over SiteA2-MDT0000
[432236.940170] Lustre: server umount SiteA2-MDT0000 complete
[432236.940180] LustreError: 16836:0:(obd_mount.c:1583:lustre_fill_super()) Unable to mount (-111)
[432239.712652] Lustre: MGS: Connection restored to 0b840ce7-6105-bed2-c3ed-d79df6a411a4 (at 10.10.4.11@o2ib)
[432248.702230] Lustre: 17432:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1521563934/real 1521563934] req@ffff880f132b0000 x1595465066874656/t0(0) o251->MGC10.10.10.19@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1521563940 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[432248.704582] Lustre: server umount MGS complete
Trying to clean things manually after the fact:
[root@sitea-oss-1 ~]# lctl dl
3 UP osd-zfs SiteA2-MDT0000-osd SiteA2-MDT0000-osd_UUID 3
9 IN osp SiteA2-OST0001-osc-MDT0000 SiteA2-MDT0000-mdtlov_UUID 3
[root@sitea-oss-1 ~]# lctl --device SiteA2-OST0001-osc-MDT0000 cleanup
[root@sitea-oss-1 ~]# lctl --device SiteA2-OST0001-osc-MDT0000 detach
[root@sitea-oss-1 ~]# lctl dl
The device is cleaned up and detached, but without the patch I will submit, the osp module can't be unloaded because references are still active.
From looking at the code, there are a handful of other places where things may not be disconnected in failure cases that happen after obd_connect(), but I don't have time to investigate them.