Description
There is nothing to prevent a coordinator from restarting after it was shut down. This is a problem at cleanup time, for example when unmounting an MDT.
The race is a bit difficult to trigger. The simplest way to do it is to add a delay in mdt_hsm_cdt_stop() right after the coordinator's state is set to CDT_STOPPED.
After applying the patch (cf. attachment), concurrently run:
while [ $(lctl get_param -n mdt.lustre-MDT0000.hsm_control) != "stopped" ]; do
    sleep 1
done
lctl set_param mdt.lustre-MDT0000.hsm_control enabled
and
lustre/tests/llmountcleanup.sh
This should trigger the following:
kernel:LustreError: 20570:0:(mdt_coordinator.c:391:hsm_cdt_procfs_fini()) ASSERTION( cdt->cdt_state == CDT_STOPPED ) failed:
kernel:LustreError: 20570:0:(mdt_coordinator.c:391:hsm_cdt_procfs_fini()) LBUG
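For illustration only, here is a minimal userspace model of the race, not the Lustre code and not the attached patch. Everything in it (struct model_cdt, model_stop(), model_restart(), simulate_race()) is hypothetical, and the two-value enum is a simplified stand-in for the coordinator states. One thread plays the shutdown path with the injected delay, the other plays the shell loop that waits for "stopped" and then re-enables the coordinator.

```c
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Simplified stand-ins for the coordinator states (model only). */
enum cdt_state { CDT_STOPPED, CDT_RUNNING };

/* Hypothetical model of the coordinator; not the kernel structure. */
struct model_cdt {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	enum cdt_state  state;
	unsigned int    delay_ms;     /* injected delay after CDT_STOPPED is set */
	int             check_failed; /* 1 if cleanup saw a state != CDT_STOPPED */
};

/* Models the shutdown path: mark stopped, stall, then run the cleanup check. */
static void *model_stop(void *arg)
{
	struct model_cdt *cdt = arg;
	struct timespec ts = {
		.tv_sec  = cdt->delay_ms / 1000,
		.tv_nsec = (long)(cdt->delay_ms % 1000) * 1000000L,
	};

	pthread_mutex_lock(&cdt->lock);
	cdt->state = CDT_STOPPED;
	pthread_cond_broadcast(&cdt->cond);
	pthread_mutex_unlock(&cdt->lock);

	nanosleep(&ts, NULL); /* the window opened by the debug delay */

	/* Cleanup expects the state to still be CDT_STOPPED. */
	pthread_mutex_lock(&cdt->lock);
	cdt->check_failed = (cdt->state != CDT_STOPPED);
	pthread_mutex_unlock(&cdt->lock);
	return NULL;
}

/* Models the shell loop: wait until "stopped" is visible, then re-enable. */
static void *model_restart(void *arg)
{
	struct model_cdt *cdt = arg;

	pthread_mutex_lock(&cdt->lock);
	while (cdt->state != CDT_STOPPED)
		pthread_cond_wait(&cdt->cond, &cdt->lock);
	cdt->state = CDT_RUNNING; /* nothing prevents the restart */
	pthread_mutex_unlock(&cdt->lock);
	return NULL;
}

/* Runs both threads once; returns 1 if the cleanup check would have failed. */
int simulate_race(unsigned int delay_ms)
{
	struct model_cdt cdt = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.cond = PTHREAD_COND_INITIALIZER,
		.state = CDT_RUNNING,
		.delay_ms = delay_ms,
		.check_failed = 0,
	};
	pthread_t stopper, restarter;
	int rc;

	rc = pthread_create(&restarter, NULL, model_restart, &cdt);
	assert(rc == 0);
	rc = pthread_create(&stopper, NULL, model_stop, &cdt);
	assert(rc == 0);

	pthread_join(stopper, NULL);
	pthread_join(restarter, NULL);
	return cdt.check_failed;
}
```

With a delay of a few hundred milliseconds, simulate_race() is expected to return 1: the restarting thread flips the state during the delay, which corresponds to the failed ASSERTION( cdt->cdt_state == CDT_STOPPED ) above. With no delay the outcome depends on scheduling, which matches the race being hard to trigger without the patch.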