[LU-9063] hsm: race on the coordinator's state Created: 30/Jan/17  Updated: 16/Dec/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: CEA Assignee: Hongchao Zhang
Resolution: Unresolved Votes: 0
Labels: HSM, patch

Attachments: File 0001-LU-0000-hsm-add-a-delay-to-easily-trigger-a-race-at-.patch    
Severity: 3

 Description   

Nothing prevents the coordinator from being restarted after it has been shut down. This is a problem at cleanup time, for example when unmounting an MDT.

The race is hard to trigger. The simplest way to reproduce it is to add a delay in mdt_hsm_cdt_stop() right after the coordinator's state is set to CDT_STOPPED, which widens the window in which a restart can slip in.

After applying the patch (cf. attachment), concurrently run:

# Wait until the coordinator reports "stopped", then immediately re-enable it.
while [ "$(lctl get_param -n mdt.lustre-MDT0000.hsm_control)" != "stopped" ]; do
    sleep 1
done
lctl set_param mdt.lustre-MDT0000.hsm_control enabled

and

lustre/tests/llmountcleanup.sh

This should trigger the following:
kernel:LustreError: 20570:0:(mdt_coordinator.c:391:hsm_cdt_procfs_fini()) ASSERTION( cdt->cdt_state == CDT_STOPPED ) failed:
kernel:LustreError: 20570:0:(mdt_coordinator.c:391:hsm_cdt_procfs_fini()) LBUG
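
To illustrate the window described above, here is a minimal userspace sketch of one possible way to close it. It assumes an intermediate "stopping" state, so that "stopped" is only published once teardown has finished and a start request is refused in between. Every name here (model_cdt, MODEL_STOPPING, model_start() and so on) is hypothetical and is not taken from mdt_coordinator.c. The sketch also uses a C11 atomic compare-and-swap rather than a mutex, so it should be read as an illustration of the idea and not as the content of any of the patches referenced in this ticket.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical userspace model; these names are illustrative only and
 * are not the identifiers used in mdt_coordinator.c. */
enum model_state { MODEL_STOPPED, MODEL_RUNNING, MODEL_STOPPING };

struct model_cdt {
	_Atomic int state;
};

/* Atomically move from 'from' to 'to'; return -EBUSY if the current
 * state is anything other than 'from'. */
int model_transition(struct model_cdt *cdt, int from, int to)
{
	int expected = from;

	if (atomic_compare_exchange_strong(&cdt->state, &expected, to))
		return 0;
	return -EBUSY;
}

void model_init(struct model_cdt *cdt)
{
	atomic_store(&cdt->state, MODEL_STOPPED);
}

/* A start is only allowed from the fully stopped state. */
int model_start(struct model_cdt *cdt)
{
	return model_transition(cdt, MODEL_STOPPED, MODEL_RUNNING);
}

/* Teardown begins: the state is no longer "running", but it is not
 * yet "stopped" either, so model_start() fails during this phase. */
int model_stop_begin(struct model_cdt *cdt)
{
	return model_transition(cdt, MODEL_RUNNING, MODEL_STOPPING);
}

/* Teardown finished: only now is "stopped" made visible. */
int model_stop_end(struct model_cdt *cdt)
{
	return model_transition(cdt, MODEL_STOPPING, MODEL_STOPPED);
}

/* Stand-in for the cleanup-time check that fails in the log above. */
void model_fini(struct model_cdt *cdt)
{
	assert(atomic_load(&cdt->state) == MODEL_STOPPED);
}
```

In this model, a start request issued between model_stop_begin() and model_stop_end() returns -EBUSY instead of restarting the coordinator, so the check in model_fini() holds when it runs directly after model_stop_end(). A real fix would also have to keep restarts out between the end of the stop and the cleanup itself, which this sketch does not attempt to model.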



 Comments   
Comment by Peter Jones [ 30/Jan/17 ]

Hongchao

Could you please advise on this one?

Thanks

Peter

Comment by Gerrit Updater [ 31/Jan/17 ]

Vinayak (vinayakswami.hariharmath@seagate.com) uploaded a new patch: https://review.whamcloud.com/25170
Subject: LU-9063 hsm: protect cdt_state with mutex lock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 52e08b6ec58240ce6f88f4ddbcdb3dd4216797bf

Comment by Gerrit Updater [ 06/Feb/17 ]

Vinayak (vinayakswami.hariharmath@seagate.com) uploaded a new patch: https://review.whamcloud.com/25269
Subject: LU-9063 tests: patch to create the race
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3f97bdee4ae56a2e89c6a6eff13ba2aff52bc36b

Comment by Quentin Bouget [ 05/Apr/17 ]

I think this patch would be a fix: https://review.whamcloud.com/#/c/22667

Comment by Cory Spitz [ 16/Dec/23 ]

Seems that this is an old issue that can be resolved. Quentin B. was probably right.

Comment by Cory Spitz [ 16/Dec/23 ]

And I'm guessing that https://review.whamcloud.com/c/fs/lustre-release/+/25170 can be abandoned.

Generated at Sat Feb 10 02:22:56 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.