[LU-6478] MDT unmount may complete before HSM coordinator stops, causing list corruption
Created: 20/Apr/15  Updated: 22/Jun/15

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: John Hammond
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: hsm

Severity: 3

Description

The HSM coordinator thread (hsm_cdtr) may still be running after the MDT has been unmounted. To reproduce, let sanity-hsm.sh run for a while, interrupt it (the ^C during test 9 below), and then unmount the MDT.

# export agt1_HOST=$HOSTNAME
# ./lustre/tests/llmount.sh
...
# bash lustre/tests/sanity-hsm.sh
...
== sanity-hsm test 9: Use of explicit archive number, with dedicated copytool == 09:53:53 (1429541633)
Purging archive on t
Starting copytool agt1 on t
Copytool is stopped on t
Copytool has stopped in  2s.
mdt.lustre-MDT0000.hsm_control=shutdown
Waiting 20 secs for update
^C
# umount /mnt/mds1
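
One way to confirm that the coordinator outlives the unmount is to look for the thread by name as soon as umount returns. The following is a minimal sketch, assuming the thread's comm is hsm_cdtr (as reported in the console output of this issue) and that /proc is mounted; thread_running is a hypothetical helper and not part of the Lustre test framework.

```shell
# Hypothetical helper: succeed if any task's comm equals the given name.
thread_running() {
    local name=$1 f
    for f in /proc/[0-9]*/comm; do
        # A task may exit between the glob and the read; ignore read errors.
        if [ "$(cat "$f" 2>/dev/null)" = "$name" ]; then
            return 0
        fi
    done
    return 1
}

# Run this right after "umount /mnt/mds1" has returned.
if thread_running hsm_cdtr; then
    echo "hsm_cdtr is still running after unmount"
else
    echo "hsm_cdtr has exited"
fi
```

Because the window between the unmount completing and the warning is short (well under a second in the console log), the check may need to be run immediately, or in a loop, to catch the thread.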

From the console:

...
[ 1489.933004] Lustre: DEBUG MARKER: == sanity-hsm test 9: Use of explicit archive number, with dedicated copytool == 09:53:53 (1429541633)
[ 1499.709003] Lustre: Failing over lustre-MDT0000
[ 1499.823727] Lustre: server umount lustre-MDT0000 complete
[ 1500.063068] ------------[ cut here ]------------
[ 1500.063872] WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted)
[ 1500.064035] Hardware name: Bochs
[ 1500.064035] list_del corruption. prev->next should be ffff8800c792dde8, but was 6b6b6b6b6b6b6b6b
[ 1500.064035] Modules linked in: ...
[ 1500.064035] Pid: 9019, comm: hsm_cdtr Not tainted 2.6.32-431.29.2.el6.lustre.x86_64 #1
[ 1500.064035] Call Trace:
[ 1500.064035]  [<ffffffff810741b7>] ? warn_slowpath_common+0x87/0xc0
[ 1500.064035]  [<ffffffff810742a6>] ? warn_slowpath_fmt+0x46/0x50
[ 1500.064035]  [<ffffffff812b528e>] ? list_del+0x6e/0xa0
[ 1500.064035]  [<ffffffff8109efd1>] ? remove_wait_queue+0x31/0x50
[ 1500.064035]  [<ffffffffa0f1e9b4>] ? mdt_coordinator+0xd94/0x1620 [mdt]
[ 1500.064035]  [<ffffffff81061d90>] ? default_wake_function+0x0/0x20
[ 1500.064035]  [<ffffffff81553065>] ? thread_return+0x4e/0x7e9
[ 1500.064035]  [<ffffffffa0f1dc20>] ? mdt_coordinator+0x0/0x1620 [mdt]
[ 1500.064035]  [<ffffffff8109e856>] ? kthread+0x96/0xa0
[ 1500.064035]  [<ffffffff8100c30a>] ? child_rip+0xa/0x20
[ 1500.064035]  [<ffffffff815562e0>] ? _spin_unlock_irq+0x30/0x40
[ 1500.064035]  [<ffffffff8100bb10>] ? restore_args+0x0/0x30
[ 1500.064035]  [<ffffffff8109e7c0>] ? kthread+0x0/0xa0
[ 1500.064035]  [<ffffffff8100c300>] ? child_rip+0x0/0x20
[ 1500.064035] ---[ end trace a2fe1cd64beca73d ]---
[ 1500.064035] ------------[ cut here ]------------
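
The value 6b6b6b6b6b6b6b6b reported for prev->next is consistent with the byte 0x6b that the Linux slab allocator writes into freed objects (POISON_FREE in include/linux/poison.h) when slab poisoning is enabled. That would suggest the wait queue the coordinator removes itself from had already been freed by the unmount, although the log alone does not prove it. The sketch below pulls the bad pointer out of a list_debug warning line and checks it against that pattern; is_free_poison and bad_pointer are hypothetical helper names used only for illustration.

```shell
# Hypothetical helper: succeed if the 64-bit hex value is the repeated
# 0x6b byte used as freed-object poison by the slab allocator.
is_free_poison() {
    [ "$1" = "6b6b6b6b6b6b6b6b" ]
}

# Hypothetical helper: print the "but was <value>" field of a
# list_debug warning line read from stdin.
bad_pointer() {
    sed -n 's/.*but was \([0-9a-f]\{1,16\}\).*/\1/p'
}

line='[ 1500.064035] list_del corruption. prev->next should be ffff8800c792dde8, but was 6b6b6b6b6b6b6b6b'
val=$(printf '%s\n' "$line" | bad_pointer)

if is_free_poison "$val"; then
    echo "$val looks like freed-slab poison (possible use-after-free)"
else
    echo "$val does not match the freed-slab poison pattern"
fi
```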
