Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14883

umount blocked due to remaining statahead threads

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.12.6
    • None
    • CentOS 7.9 clients (3.10.0-1160.24.1.el7.x86_64)
    • 3
    • 9223372036854775807

    Description

      We noticed today on a cluster running 2.12.6 clients that a large number of them were blocked at umount when applying a security patch. No jobs were running anymore on these clients, but umount doesn't finish.

      Dumping the tasks on a client has revealed that 3 ll_sa threads are still running:

      PID: 25857  TASK: ffff9a78cd32b180  CPU: 36  COMMAND: "ll_sa_24325"
       #0 [ffff9a7519997c98] __schedule at ffffffffa2f8899f
       #1 [ffff9a7519997d28] schedule at ffffffffa2f88eb9
       #2 [ffff9a7519997d38] ll_statahead_thread at ffffffffc133787e [lustre]
       #3 [ffff9a7519997ec8] kthread at ffffffffa28c5da1
       #4 [ffff9a7519997f50] ret_from_fork_nospec_begin at ffffffffa2f95de4
      
      PID: 34631  TASK: ffff9b754a283180  CPU: 10  COMMAND: "ll_sa_30018"
       #0 [ffff99fd4ebd7c98] __schedule at ffffffffa2f8899f
       #1 [ffff99fd4ebd7d28] schedule at ffffffffa2f88eb9
       #2 [ffff99fd4ebd7d38] ll_statahead_thread at ffffffffc133787e [lustre]
       #3 [ffff99fd4ebd7ec8] kthread at ffffffffa28c5da1
       #4 [ffff99fd4ebd7f50] ret_from_fork_nospec_begin at ffffffffa2f95de4
      
      PID: 41809  TASK: ffff9afeeeb8b180  CPU: 40  COMMAND: "ll_sa_40406"
       #0 [ffff99fe617c7c98] __schedule at ffffffffa2f8899f
       #1 [ffff99fe617c7d28] schedule at ffffffffa2f88eb9
       #2 [ffff99fe617c7d38] ll_statahead_thread at ffffffffc133787e [lustre]
       #3 [ffff99fe617c7ec8] kthread at ffffffffa28c5da1
       #4 [ffff99fe617c7f50] ret_from_fork_nospec_begin at ffffffffa2f95de4
      

      This is blocking umount/ll_kill_super():

      crash> bt -f 19912
      PID: 19912  TASK: ffff9b7532fae300  CPU: 44  COMMAND: "umount"
       #0 [ffff9a002139bd08] __schedule at ffffffffa2f8899f
          ffff9a002139bd10: 0000000000000082 ffff9a002139bfd8 
          ffff9a002139bd20: ffff9a002139bfd8 ffff9a002139bfd8 
          ffff9a002139bd30: 000000000001acc0 ffff9afb94b4e300 
          ffff9a002139bd40: ffff9b797f2d3940 00000001e05d9d9f 
          ffff9a002139bd50: 0000000000000282 000000006a88e28b 
          ffff9a002139bd60: ffff9a002139bdc0 ffff9b797f313940 
          ffff9a002139bd70: ffff9a002139bdc0 00000001e05d9e1c 
          ffff9a002139bd80: ffff9b797f313940 0000000000015980 
          ffff9a002139bd90: ffff9a002139bda0 ffffffffa2f88eb9 
       #1 [ffff9a002139bd98] schedule at ffffffffa2f88eb9
          ffff9a002139bda0: ffff9a002139be48 ffffffffa2f868d8 
       #2 [ffff9a002139bda8] schedule_timeout at ffffffffa2f868d8
          ffff9a002139bdb0: ffff9a0073ffd888 0000000000000282 
          ffff9a002139bdc0: ffff9b797f314b48 ffff9b797f314b48 
          ffff9a002139bdd0: 00000001e05d9e1c ffff9b797f313940 
          ffff9a002139bde0: ffffffffa28ad7f0 ffff9b7532fae300 
          ffff9a002139bdf0: ffffffffffffffff 0000000000000000 
          ffff9a002139be00: 0000000000000000 0000000000000000 
          ffff9a002139be10: 0000000000000282 000000006a88e28b 
          ffff9a002139be20: ffff9b7532fae300 ffff9a0030bf0000 
          ffff9a002139be30: ffff9b7532fae300 ffffffffa3907570 
          ffff9a002139be40: ffff9b7532faead0 ffff9a002139be70 
          ffff9a002139be50: ffffffffc130a449 
       #3 [ffff9a002139be50] ll_kill_super at ffffffffc130a449 [lustre]
          ffff9a002139be58: 0000000000000000 ffff9b75f7eba000 
          ffff9a002139be68: ffffffffc0fa8ae0 ffff9a002139be88 
          ffff9a002139be78: ffffffffc0f599ad 
       #4 [ffff9a002139be78] lustre_kill_super at ffffffffc0f599ad [obdclass]
          ffff9a002139be80: ffff9b75f7eba000 ffff9a002139bea8 
          ffff9a002139be90: ffffffffa2a50fae 
       #5 [ffff9a002139be90] deactivate_locked_super at ffffffffa2a50fae
          ffff9a002139be98: ffff9b75f7eba000 0000000000000000 
          ffff9a002139bea8: ffff9a002139bec0 ffffffffa2a51736 
       #6 [ffff9a002139beb0] deactivate_super at ffffffffa2a51736
          ffff9a002139beb8: ffff9a0073ffd680 ffff9a002139bed8 
          ffff9a002139bec8: ffffffffa2a70dbf 
       #7 [ffff9a002139bec8] cleanup_mnt at ffffffffa2a70dbf
          ffff9a002139bed0: ffff9a0073ffd6b8 ffff9a002139bee8 
          ffff9a002139bee0: ffffffffa2a70e52 
       #8 [ffff9a002139bee0] __cleanup_mnt at ffffffffa2a70e52
          ffff9a002139bee8: ffff9a002139bf28 ffffffffa28c28db 
       #9 [ffff9a002139bef0] task_work_run at ffffffffa28c28db
          ffff9a002139bef8: ffff9b7532faeb04 0000000000000002 
          ffff9a002139bf08: ffff9a002139bf58 0000000000000000 
          ffff9a002139bf18: 0000000000000000 0000000000000000 
          ffff9a002139bf28: ffff9a002139bf48 ffffffffa282cc65 
      #10 [ffff9a002139bf30] do_notify_resume at ffffffffa282cc65
          ffff9a002139bf38: 0000000000000000 0000000000000000 
          ffff9a002139bf48: 0000000000000000 ffffffffa2f962ef 
      #11 [ffff9a002139bf50] int_signal at ffffffffa2f962ef
          RIP: 00002b88f55a5a97  RSP: 00007fffaf522e48  RFLAGS: 00000246
          RAX: 0000000000000000  RBX: 00005603f7e53040  RCX: ffffffffffffffff
          RDX: 0000000000000001  RSI: 0000000000000000  RDI: 00005603f7e54840
          RBP: 00005603f7e54840   R8: 0000000000000000   R9: 0000000000000005
          R10: 00007fffaf5228a0  R11: 0000000000000246  R12: 00002b88f4e3ad78
          R13: 0000000000000000  R14: 00005603f7e54960  R15: 00005603f7e53040
          ORIG_RAX: 00000000000000a6  CS: 0033  SS: 002b
      
      sb = 0xffff9b75f7eba000
      
      crash> struct super_block.s_fs_info ffff9b75f7eba000
        s_fs_info = 0xffff9b75f7eb9000
      crash> struct lustre_sb_info.lsi_llsbi 0xffff9b75f7eb9000
        lsi_llsbi = 0xffff9a0030bf0000
      crash> struct ll_sb_info.ll_umounting 0xffff9a0030bf0000
        ll_umounting = 1
      crash> struct ll_sb_info.ll_sa_running 0xffff9a0030bf0000
        ll_sa_running = {
          counter = 3
        }
      

      Here on this cluster clients are all 2.12.6, and server is Oak which is running 2.12.6 too.
      We have one client that is available for live troubleshooting (it's still blocked at umount).

      Is there another known bug with such problem in 2.12? Happy to provide more info.

      Attachments

        1. foreach_bt.log
          452 kB
          Stephane Thiell

        Issue Links

          Activity

            People

              qian_wc Qian Yingjin
              sthiell Stephane Thiell
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated: