[LU-8509] drop_caches hangs in cl_inode_fini()

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.8.0

    Description

      Running Lustre 2.8.0_0.0.llnlpreview.18 on the clients (see lustre-release-fe-llnl), we regularly see the /etc/slurm/prolog script hang when it triggers drop_caches. This script runs before each job to clear out the cache left by any previous jobs.

      In particular it hangs here:

      #  Flush slab cache entries
      echo 2 >/proc/sys/vm/drop_caches
      

      And this is the backtrace for where it is getting stuck:

      crash> bt -xs 1386
      PID: 1386   TASK: ffff88201b0a5080  CPU: 10  COMMAND: "prolog"
       #0 [ffff882011bd3af8] __schedule+0x295 at ffffffff81651975
       #1 [ffff882011bd3b60] schedule+0x29 at ffffffff81652049
       #2 [ffff882011bd3b70] cl_inode_fini+0x1ac at ffffffffa0c6b3ac [lustre]
       #3 [ffff882011bd3c10] ll_clear_inode+0x21c at ffffffffa0c377ec [lustre]
       #4 [ffff882011bd3c38] ll_delete_inode+0x58 at ffffffffa0c39048 [lustre]
       #5 [ffff882011bd3c60] evict+0xa7 at ffffffff81204077
       #6 [ffff882011bd3c88] dispose_list+0x3e at ffffffff8120417e
       #7 [ffff882011bd3cb0] prune_icache_sb+0x163 at ffffffff81205113
       #8 [ffff882011bd3d18] prune_super+0x143 at ffffffff811ea343
       #9 [ffff882011bd3d50] shrink_slab+0x175 at ffffffff81183a25
      #10 [ffff882011bd3e08] drop_caches_sysctl_handler+0x283 at ffffffff8124a743
      #11 [ffff882011bd3e90] proc_sys_call_handler+0xd3 at ffffffff81260f03
      #12 [ffff882011bd3ee8] proc_sys_write+0x14 at ffffffff81260f34
      #13 [ffff882011bd3ef8] vfs_write+0xbd at ffffffff811e7bfd
      #14 [ffff882011bd3f38] sys_write+0x7f at ffffffff811e869f
      #15 [ffff882011bd3f80] system_call_fastpath+0x16 at ffffffff8165d709
          RIP: 00007ffff76d3500  RSP: 00007fffffffe180  RFLAGS: 00010206
          RAX: 0000000000000001  RBX: ffffffff8165d709  RCX: 0000000000000400
          RDX: 0000000000000002  RSI: 00007ffff7ff8000  RDI: 0000000000000001
          RBP: 00007ffff7ff8000   R8: 000000000000000a   R9: 00007ffff7fbd740
          R10: 00007fffffffe670  R11: 0000000000000246  R12: 0000000000000001
          R13: 0000000000000002  R14: 00007ffff79a7400  R15: 0000000000000002
          ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
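
Until a fix is installed, a prolog could at least detect the hang instead of blocking on it. The following is a minimal sketch, not part of the reported prolog script: the function name `drop_caches_with_timeout` and its parameters are hypothetical. It performs the write in a child process and waits for a bounded time. It assumes that a child blocked inside the kernel may not respond to signals, so on timeout it deliberately does not kill and reap the child (that could block the caller in turn); it only reports the condition.

```python
import subprocess


def drop_caches_with_timeout(value="2", path="/proc/sys/vm/drop_caches",
                             timeout_s=60.0):
    """Write `value` to `path` from a child process.

    Returns True if the child exited successfully within `timeout_s`
    seconds, and False if it failed or is still blocked after that time.
    """
    # Run the write in a child so the caller itself never blocks in the kernel.
    child = subprocess.Popen(["sh", "-c", 'echo "$1" > "$2"', "sh", value, path])
    try:
        return child.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Assumption: the child may be unkillable while stuck in the kernel,
        # so we do not call kill() followed by wait() here.
        return False
```

A caller would typically log a `False` result and mark the node for inspection. The sketch only makes the hang visible; it does not release the stuck inode eviction.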
      

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22745/
            Subject: LU-8509 llite: drop_caches hangs in cl_inode_fini()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c594026329e6a78a6c9f3188514211647b3040d8

            gerrit Gerrit Updater added a comment -

            Andrew Perepechko (andrew.perepechko@seagate.com) uploaded a new patch: http://review.whamcloud.com/22745
            Subject: LU-8509 llite: drop_caches hangs in cl_inode_fini()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 124b249c8ccaa4aba925916752d0a3fa51fda2f1

            gerrit Gerrit Updater added a comment -

            Andrew Perepechko (andrew.perepechko@seagate.com) uploaded a new patch: http://review.whamcloud.com/22743
            Subject: LU-8509 tests: drop_caches hangs in cl_inode_fini()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9b1a256b2cede7866fa7c86916ddebab88800ad0

            amk Ann Koehler (Inactive) added a comment -

            Cray has hit this bug several times. apinit, one of the Cray workload manager daemons, hangs while dropping VM caches (echo 3 > /proc/sys/vm/drop_caches). apinit drops the VM caches at the end of each job, after dropping the ldlm caches. When apinit fails to complete the drop_caches write, the Node Health Checker (NHC) first marks the node suspect and then marks it admindown. In this state no new jobs are scheduled.

            apinit appears to be hung waiting for the loh_ref count of a cl_object to drop from 2 to 1.

            I uploaded a dump to ftp.intel.com:/uploads/LU-8509 in case it is of some help.

            morrone Christopher Morrone (Inactive) added a comment -

            Perhaps the next time it happens.
            bobijam Zhenyu Xu added a comment -

            Can you dump the stack traces of all threads when this hits? (echo t > /proc/sysrq-trigger)
            morrone Christopher Morrone (Inactive) added a comment -

            "Was the image built with --enable-lu_ref passed to configure?"

            No, we are not setting that.
            bobijam Zhenyu Xu added a comment -

            Was the image built with --enable-lu_ref passed to configure? cl_inode_fini() is waiting for the lli_clob reference count to drop to 1, and it seems that another thread took a reference on the object and has not released it.
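
The failure mode described in this comment (a teardown path waiting for a reference count to fall to 1 while some other holder never drops its reference) can be illustrated with a toy model. This is a user-space sketch only, not Lustre code: the class `RefCounted` and its methods are hypothetical stand-ins, and the timeout exists purely so that the demonstration terminates.

```python
import threading


class RefCounted:
    """Toy stand-in for an object whose owner holds one reference."""

    def __init__(self):
        self._cond = threading.Condition()
        self.refs = 1  # the owner's own reference

    def get(self):
        """Take an additional reference."""
        with self._cond:
            self.refs += 1

    def put(self):
        """Drop a reference and wake anyone waiting on the count."""
        with self._cond:
            self.refs -= 1
            self._cond.notify_all()

    def wait_for_last_ref(self, timeout_s):
        """Block until only the owner's reference remains.

        Returns True once the count is 1, or False if `timeout_s` elapses
        first. Without a timeout, a leaked reference would block forever.
        """
        with self._cond:
            return self._cond.wait_for(lambda: self.refs == 1,
                                       timeout=timeout_s)
```

If a second holder calls `get()` and later `put()`, `wait_for_last_ref()` returns True. If the `put()` never happens, the count stays at 2 and the waiter never makes progress, which matches the symptom reported here.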
            pjones Peter Jones added a comment -

            Bobijam

            Could you please assist with this issue?

            Thanks

            Peter

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: morrone Christopher Morrone (Inactive)