Lustre / LU-14110

Race during several client mount instances (--> rmmod lustre hang)


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.5
    • Fix Version/s: Lustre 2.15.0
    • Environment: VMs with Lustre 2.12.5/master on ldiskfs
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      I am creating this ticket to track the issue seen by @apercher (cf. the comment in LU-8346).

      Here are the commands/configs to reproduce the issue:

      fstab:

      <serv1@ib1>:<serv2@ib1>:/fs1 /mnt/fs1 lustre defaults,_netdev,noauto,x-systemd.requires=lnet.service,flock,user_xattr,nosuid 0 0
      <serv1@ib1>:<serv2@ib1>:/fs1/home /mnt/home lustre defaults,_netdev,noauto,x-systemd.requires=lnet.service,flock,user_xattr,nosuid 0 0

      commands:

      while true; do
       mount /mnt/home & mount /mnt/fs1
       umount /mnt/home
       umount /mnt/fs1
       lustre_rmmod
      done
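
A bounded variant of the loop above can make the hang easier to catch unattended. The following is only a sketch: `run_step`, `reproduce`, `HANG_TIMEOUT` and `MAX_ITER` are hypothetical names that do not come from this ticket, and the 180-second default is an arbitrary choice above the 120-second hung-task threshold reported in dmesg. The watchdog only polls the process and never tries to kill it, on the assumption that a task blocked in the kernel may not react to signals.

```shell
# Sketch only: bounded reproduction loop with a simple hang watchdog.
# HANG_TIMEOUT, MAX_ITER, run_step and reproduce are illustrative names.
HANG_TIMEOUT=${HANG_TIMEOUT:-180}   # seconds before a step is reported as hung
MAX_ITER=${MAX_ITER:-100}           # stop after this many iterations

# run_step LIMIT CMD [ARGS...]
# Runs CMD in the background and polls it once per second.
# Returns CMD's exit status, or 124 if CMD is still running after LIMIT seconds.
run_step() {
    _limit=$1
    shift
    "$@" &
    _pid=$!
    _waited=0
    while kill -0 "$_pid" 2>/dev/null; do
        if [ "$_waited" -ge "$_limit" ]; then
            return 124
        fi
        sleep 1
        _waited=$((_waited + 1))
    done
    wait "$_pid"
}

# Same command sequence as the original loop, with lustre_rmmod watched.
reproduce() {
    _i=1
    while [ "$_i" -le "$MAX_ITER" ]; do
        mount /mnt/home & mount /mnt/fs1
        umount /mnt/home
        umount /mnt/fs1
        run_step "$HANG_TIMEOUT" lustre_rmmod
        if [ $? -eq 124 ]; then
            echo "iteration $_i: lustre_rmmod still running after ${HANG_TIMEOUT}s"
            return 1
        fi
        _i=$((_i + 1))
    done
    echo "no hang after $MAX_ITER iterations"
}
```

Calling `reproduce` as root on a client with the fstab entries above starts the loop. When it reports a hang, the stuck process is left in place so that it can be inspected with crash.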
      

      After some iterations, "rmmod lustre" hangs in "lu_context_key_degister":

      dmesg (master branch):

       [ 1560.484463] INFO: task rmmod:6430 blocked for more than 120 seconds.
       [ 1560.484480] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       [ 1560.484496] rmmod D ffff9ddbdfd9acc0 0 6430 6396 0x00000080
       [ 1560.484499] Call Trace:
       [ 1560.484504] [<ffffffff8b0266d2>] ? kmem_cache_free+0x1e2/0x200
       [ 1560.484508] [<ffffffff8b585da9>] schedule+0x29/0x70
       [ 1560.484531] [<ffffffffc0a0284d>] lu_context_key_degister+0xcd/0x150 [obdclass]
       [ 1560.484534] [<ffffffff8aec7880>] ? wake_bit_function_rh+0x40/0x40
       [ 1560.484548] [<ffffffffc0a02a72>] lu_context_key_degister_many+0x72/0xb0 [obdclass]
       [ 1560.484550] [<ffffffff8b0266d2>] ? kmem_cache_free+0x1e2/0x200
       [ 1560.484564] [<ffffffffc0d67347>] vvp_type_fini+0x27/0x30 [lustre]
       [ 1560.484577] [<ffffffffc09fc01b>] lu_device_type_fini+0x1b/0x20 [obdclass]
       [ 1560.484586] [<ffffffffc0d68d75>] vvp_global_fini+0x15/0x30 [lustre]
       [ 1560.484596] [<ffffffffc0d7beb4>] lustre_exit+0x31/0x17d [lustre]
       [ 1560.484599] [<ffffffff8af1c46e>] SyS_delete_module+0x19e/0x310
       [ 1560.484601] [<ffffffff8b592e09>] ? system_call_after_swapgs+0x96/0x13a
       [ 1560.484603] [<ffffffff8b592e15>] ? system_call_after_swapgs+0xa2/0x13a
       [ 1560.484604] [<ffffffff8b592e09>] ? system_call_after_swapgs+0x96/0x13a
       [ 1560.484606] [<ffffffff8b592e15>] ? system_call_after_swapgs+0xa2/0x13a
       [ 1560.484607] [<ffffffff8b592e09>] ? system_call_after_swapgs+0x96/0x13a
       [ 1560.484609] [<ffffffff8b592ed2>] system_call_fastpath+0x25/0x2a
       [ 1560.484611] [<ffffffff8b592e15>] ? system_call_after_swapgs+0xa2/0x13a

      crash backtrace (master branch):

      crash> bt -F 6430
      PID: 6430   TASK: ffff9ddbd5c0c1c0  CPU: 3   COMMAND: "rmmod"
       #0 [ffff9ddbd5d2bd18] __schedule at ffffffff8b5858fa
          ffff9ddbd5d2bd20: 0000000000000082 ffff9ddbd5d2bfd8
          ffff9ddbd5d2bd30: ffff9ddbd5d2bfd8 ffff9ddbd5d2bfd8
          ffff9ddbd5d2bd40: 000000000001acc0 [task_struct]
          ffff9ddbd5d2bd50: kmem_cache_free+482 [dm_rq_target_io]
          ffff9ddbd5d2bd60: 0000000000000000 00000000a8325962
          ffff9ddbd5d2bd70: 0000000000000246 ll_thread_key
          ffff9ddbd5d2bd80: bit_wait_table+2664 ffff9ddbd5d2bdd8
          ffff9ddbd5d2bd90: 0000000000000000 0000000000000000
          ffff9ddbd5d2bda0: ffff9ddbd5d2bdb0 schedule+41
       #1 [ffff9ddbd5d2bda8] schedule at ffffffff8b585da9
          ffff9ddbd5d2bdb0: ffff9ddbd5d2be20 lu_context_key_degister+205
       #2 [ffff9ddbd5d2bdb8] lu_context_key_degister at ffffffffc0a0284d [obdclass]
          ffff9ddbd5d2bdc0: ll_thread_key+36 00000000ffffffff
          ffff9ddbd5d2bdd0: 0000000000000000 0000000000000000
          ffff9ddbd5d2bde0: [task_struct]    var_wake_function
          ffff9ddbd5d2bdf0: bit_wait_table+2672 bit_wait_table+2672
          ffff9ddbd5d2be00: 00000000a8325962 fffffffffffffff5
          ffff9ddbd5d2be10: __this_module    0000000000000800
          ffff9ddbd5d2be20: ffff9ddbd5d2be80 lu_context_key_degister_many+114
       #3 [ffff9ddbd5d2be28] lu_context_key_degister_many at ffffffffc0a02a72 [obdclass]
          ffff9ddbd5d2be30: ffff9ddb00000008 ffff9ddbd5d2be90
          ffff9ddbd5d2be40: ffff9ddbd5d2be50 00000000a8325962
          ffff9ddbd5d2be50: kmem_cache_free+482 vvp_session_key
          ffff9ddbd5d2be60: vvp_thread_key   0000000000000000
      crash> sym ll_thread_key
      ffffffffc0da4a00 (D) ll_thread_key [lustre]
      crash> struct lu_context_key ll_thread_key
      struct lu_context_key {
        lct_tags = 1073741832,
        lct_init = 0xffffffffc0d67d20 <ll_thread_key_init>,
        lct_fini = 0xffffffffc0d67e30 <ll_thread_key_fini>,
        lct_exit = 0x0,
        lct_index = 14,
        lct_used = {
          counter = 1
        },
        lct_owner = 0xffffffffc0da8b80 <__this_module>,
        lct_reference = {<No data fields>}
      }
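
When reading the structure dump above, it can help to see the decimal `lct_tags` value in hexadecimal and as individual bit positions. The helper below is a hypothetical convenience function, not part of Lustre or crash, and it deliberately stops at bit positions: mapping them to flag names would require the Lustre headers of the build under test, which this ticket does not quote.

```shell
# Illustrative helper (not a Lustre tool): print a flag word in hex
# together with the positions of the bits that are set.
decode_bits() {
    _val=$1
    _bit=0
    _out=""
    while [ "$_val" -gt 0 ]; do
        if [ $((_val & 1)) -eq 1 ]; then
            _out="${_out}${_out:+ }${_bit}"
        fi
        _val=$((_val >> 1))
        _bit=$((_bit + 1))
    done
    printf '0x%x bits: %s\n' "$1" "$_out"
}

decode_bits 1073741832   # prints "0x40000008 bits: 3 30"
```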
      

       

      The issue seems to occur more frequently on the b2_12 branch.

       

            People

              Assignee: eaujames Etienne Aujames
              Reporter: eaujames Etienne Aujames