  1. Lustre
  2. LU-6478

MDT unmount may complete before HSM coordinator stops, causing list corruption


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.8.0
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      The HSM coordinator may still be running after the MDT has been unmounted. To reproduce, let sanity-hsm run for a while, interrupt it, and then unmount the MDT:

      # export agt1_HOST=$HOSTNAME
      # ./lustre/tests/llmount.sh
      ...
      # bash lustre/tests/sanity-hsm.sh
      ...
      == sanity-hsm test 9: Use of explicit archive number, with dedicated copytool == 09:53:53 (1429541633)
      Purging archive on t
      Starting copytool agt1 on t
      Copytool is stopped on t
      Copytool has stopped in  2s.
      mdt.lustre-MDT0000.hsm_control=shutdown
      Waiting 20 secs for update
      ^C
      # umount /mnt/mds1
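
A quick way to confirm the symptom right after the unmount is to check whether the coordinator thread is still alive. The following is only a sketch: it assumes the thread is named hsm_cdtr (the "comm" value in the console warning in this report) and that ps from procps is available. The helper name thread_running is hypothetical and not part of the Lustre test framework.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if any task has exactly the command name $1.
# Kernel threads appear in the ps listing alongside ordinary processes.
thread_running() {
    ps -e -o comm= | grep -qx "$1"
}

# Assumed usage, run immediately after `umount /mnt/mds1`.
# "hsm_cdtr" is taken from the "comm:" field of the console warning.
if thread_running hsm_cdtr; then
    echo "hsm_cdtr still running after MDT unmount"
else
    echo "hsm_cdtr not found"
fi
```

If the thread is still listed after the unmount has returned, the coordinator has outlived the MDT, which is the condition described above.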
      

      From the console:

      ...
      [ 1489.933004] Lustre: DEBUG MARKER: == sanity-hsm test 9: Use of explicit archive number, with dedicated copytool == 09:53:53 (1429541633)
      [ 1499.709003] Lustre: Failing over lustre-MDT0000
      [ 1499.823727] Lustre: server umount lustre-MDT0000 complete
      [ 1500.063068] ------------[ cut here ]------------
      [ 1500.063872] WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted)
      [ 1500.064035] Hardware name: Bochs
      [ 1500.064035] list_del corruption. prev->next should be ffff8800c792dde8, but was 6b6b6b6b6b6b6b6b
      [ 1500.064035] Modules linked in: ...
      [ 1500.064035] Pid: 9019, comm: hsm_cdtr Not tainted 2.6.32-431.29.2.el6.lustre.x86_64 #1
      [ 1500.064035] Call Trace:
      [ 1500.064035]  [<ffffffff810741b7>] ? warn_slowpath_common+0x87/0xc0
      [ 1500.064035]  [<ffffffff810742a6>] ? warn_slowpath_fmt+0x46/0x50
      [ 1500.064035]  [<ffffffff812b528e>] ? list_del+0x6e/0xa0
      [ 1500.064035]  [<ffffffff8109efd1>] ? remove_wait_queue+0x31/0x50
      [ 1500.064035]  [<ffffffffa0f1e9b4>] ? mdt_coordinator+0xd94/0x1620 [mdt]
      [ 1500.064035]  [<ffffffff81061d90>] ? default_wake_function+0x0/0x20
      [ 1500.064035]  [<ffffffff81553065>] ? thread_return+0x4e/0x7e9
      [ 1500.064035]  [<ffffffffa0f1dc20>] ? mdt_coordinator+0x0/0x1620 [mdt]
      [ 1500.064035]  [<ffffffff8109e856>] ? kthread+0x96/0xa0
      [ 1500.064035]  [<ffffffff8100c30a>] ? child_rip+0xa/0x20
      [ 1500.064035]  [<ffffffff815562e0>] ? _spin_unlock_irq+0x30/0x40
      [ 1500.064035]  [<ffffffff8100bb10>] ? restore_args+0x0/0x30
      [ 1500.064035]  [<ffffffff8109e7c0>] ? kthread+0x0/0xa0
      [ 1500.064035]  [<ffffffff8100c300>] ? child_rip+0x0/0x20
      [ 1500.064035] ---[ end trace a2fe1cd64beca73d ]---
      [ 1500.064035] ------------[ cut here ]------------
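
The value 6b6b6b6b6b6b6b6b reported by list_del is the byte 0x6b repeated, which is the pattern the kernel's slab debugging writes into freed objects (POISON_FREE). That suggests the coordinator thread removed itself from a wait queue whose memory had already been freed during unmount. The filter below is a small triage sketch, not part of Lustre: the function name is hypothetical, and it assumes console text in the format shown in this report.

```shell
#!/bin/sh
# Hypothetical triage filter: read kernel log text on stdin and print the
# list corruption warnings whose reported pointer is 0x6b repeated
# (4 to 8 bytes, covering 32-bit and 64-bit pointers).
find_poisoned_list_warnings() {
    grep -E 'list_(add|del) corruption' |
        grep -E '(^|[^0-9a-f])(6b){4,8}([^0-9a-f]|$)'
}

# Assumed usage on the affected node:
#   dmesg | find_poisoned_list_warnings
```

A match points to a use-after-free rather than an ordinary list handling bug, which is consistent with the unmount completing while hsm_cdtr is still running.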
      

          People

            wc-triage WC Triage
            jhammond John Hammond