[LU-8509] drop_caches hangs in cl_inode_fini() Created: 17/Aug/16  Updated: 06/Jan/20  Resolved: 05/Oct/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug
Priority: Major
Reporter: Christopher Morrone
Assignee: Zhenyu Xu
Resolution: Fixed
Votes: 0
Labels: llnl

Issue Links:
Duplicate
is duplicated by LU-8936 Client LBUGs with cl_object.c:735:cl_... Resolved
is duplicated by LU-8743 client stuck in cl_inode_fini() Resolved
Related
is related to LU-8743 client stuck in cl_inode_fini() Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Running Lustre 2.8.0_0.0.llnlpreview.18 on the clients (see lustre-release-fe-llnl), we regularly see the /etc/slurm/prolog script hang when it triggers drop_caches. This script runs before each job to clear out the cache left by any previous jobs.

In particular it hangs here:

#  Flush slab cache entries
echo 2 >/proc/sys/vm/drop_caches

And this is the backtrace for where it is getting stuck:

crash> bt -xs 1386
PID: 1386   TASK: ffff88201b0a5080  CPU: 10  COMMAND: "prolog"
 #0 [ffff882011bd3af8] __schedule+0x295 at ffffffff81651975
 #1 [ffff882011bd3b60] schedule+0x29 at ffffffff81652049
 #2 [ffff882011bd3b70] cl_inode_fini+0x1ac at ffffffffa0c6b3ac [lustre]
 #3 [ffff882011bd3c10] ll_clear_inode+0x21c at ffffffffa0c377ec [lustre]
 #4 [ffff882011bd3c38] ll_delete_inode+0x58 at ffffffffa0c39048 [lustre]
 #5 [ffff882011bd3c60] evict+0xa7 at ffffffff81204077
 #6 [ffff882011bd3c88] dispose_list+0x3e at ffffffff8120417e
 #7 [ffff882011bd3cb0] prune_icache_sb+0x163 at ffffffff81205113
 #8 [ffff882011bd3d18] prune_super+0x143 at ffffffff811ea343
 #9 [ffff882011bd3d50] shrink_slab+0x175 at ffffffff81183a25
#10 [ffff882011bd3e08] drop_caches_sysctl_handler+0x283 at ffffffff8124a743
#11 [ffff882011bd3e90] proc_sys_call_handler+0xd3 at ffffffff81260f03
#12 [ffff882011bd3ee8] proc_sys_write+0x14 at ffffffff81260f34
#13 [ffff882011bd3ef8] vfs_write+0xbd at ffffffff811e7bfd
#14 [ffff882011bd3f38] sys_write+0x7f at ffffffff811e869f
#15 [ffff882011bd3f80] system_call_fastpath+0x16 at ffffffff8165d709
    RIP: 00007ffff76d3500  RSP: 00007fffffffe180  RFLAGS: 00010206
    RAX: 0000000000000001  RBX: ffffffff8165d709  RCX: 0000000000000400
    RDX: 0000000000000002  RSI: 00007ffff7ff8000  RDI: 0000000000000001
    RBP: 00007ffff7ff8000   R8: 000000000000000a   R9: 00007ffff7fbd740
    R10: 00007fffffffe670  R11: 0000000000000246  R12: 0000000000000001
    R13: 0000000000000002  R14: 00007ffff79a7400  R15: 0000000000000002
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
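
When no crash dump is at hand, a comparable kernel stack can often be read from a live node through /proc/<pid>/stack. The following is a minimal sketch, not part of the original report: the function name show_kernel_stacks and the PROC_ROOT variable are hypothetical, and it assumes a kernel that exposes /proc/<pid>/stack and a user (typically root) allowed to read it.

```shell
#!/bin/sh
# Hypothetical helper, not from the ticket: print the kernel stack of every
# process whose command name matches the argument (for example "prolog").
# PROC_ROOT is overridable only so the function can be tried on a fake tree.
PROC_ROOT="${PROC_ROOT:-/proc}"

show_kernel_stacks() {
    name="$1"
    for dir in "$PROC_ROOT"/[0-9]*; do
        # Skip entries that vanished or that are not process directories.
        [ -r "$dir/comm" ] || continue
        if [ "$(cat "$dir/comm")" = "$name" ]; then
            echo "== PID ${dir##*/} ($name) =="
            # Reading the stack usually needs root; report instead of failing.
            cat "$dir/stack" 2>/dev/null || echo "(stack not readable)"
        fi
    done
}

# Example: show_kernel_stacks prolog
```

A stack that shows cl_inode_fini near the top, as in the crash output above, would suggest the same hang.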


 Comments   
Comment by Peter Jones [ 17/Aug/16 ]

Bobijam

Could you please assist with this issue?

Thanks

Peter

Comment by Zhenyu Xu [ 18/Aug/16 ]

Was the image built with --enable-lu_ref passed to configure? cl_inode_fini() is waiting for the lli_clob reference count to drop to 1; it seems that another thread took a reference on the object and never released it.

Comment by Christopher Morrone [ 18/Aug/16 ]
Was the image built with --enable-lu_ref passed to configure?

No, we are not setting that.

Comment by Zhenyu Xu [ 23/Aug/16 ]

Can you dump the stack traces of all threads when this hits? (echo t > /proc/sysrq-trigger)
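
The request above can be wrapped in a small helper that also saves the result. This is only a sketch under stated assumptions: the function name dump_task_traces and the SYSRQ_TRIGGER variable are hypothetical, the caller is root, and the traces are assumed to still be in the kernel ring buffer when dmesg reads it (on a busy node they may be truncated, in which case the console or syslog is the better source).

```shell
#!/bin/sh
# Hypothetical helper, not from the ticket: ask the kernel to log the stack
# of every task, then save the kernel log to a file.
# SYSRQ_TRIGGER is overridable only so the function can be tried safely.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

dump_task_traces() {
    out="$1"
    # 't' asks the kernel to dump the current tasks and their stacks.
    echo t > "$SYSRQ_TRIGGER" || return 1
    # The traces go to the kernel log; capture whatever the ring buffer holds.
    dmesg > "$out" 2>/dev/null ||
        echo "dmesg failed; check the console or syslog instead" >&2
}

# Example: dump_task_traces /tmp/lu-8509-tasks.txt
```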

Comment by Christopher Morrone [ 23/Aug/16 ]

Perhaps the next time it happens.

Comment by Ann Koehler (Inactive) [ 24/Aug/16 ]

Cray has hit this bug several times. apinit, one of the Cray workload manager daemons, hangs while dropping vm caches: echo 3 > /proc/sys/vm/drop_caches. apinit drops vm caches at the end of each job after dropping ldlm caches. When apinit fails to complete the vm drop_caches, the Node Health Checker (NHC) first marks the node suspect and then marks it admindown. In this state no new jobs are scheduled.

apinit appears to be hung waiting for the loh_ref count of a cl_object to drop from 2 to 1.

I uploaded a dump to ftp.intel.com:/uploads/LU-8509 in case it may be of some help.
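
A prolog or health check can at least notice this kind of hang rather than block on it. The sketch below is an illustration and not how apinit, NHC, or the LLNL prolog are implemented: run_with_deadline is a hypothetical name, and the return code 124 is borrowed from the convention of timeout(1). The stuck command is deliberately left running, because a task blocked inside the kernel may not respond to signals.

```shell
#!/bin/sh
# Hypothetical helper, not from the ticket.
# run_with_deadline SECONDS CMD...: run CMD in the background and poll it.
# Returns CMD's exit status if it finishes in time, 124 otherwise.
run_with_deadline() {
    deadline="$1"
    shift
    "$@" &
    pid=$!
    elapsed=0
    # kill -0 only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$deadline" ]; then
            echo "PID $pid still running after ${deadline}s" >&2
            return 124
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # The command has exited; collect its status.
    wait "$pid"
}

# Example (as root):
#   run_with_deadline 60 sh -c 'echo 3 > /proc/sys/vm/drop_caches' ||
#       echo "drop_caches did not complete; node may need attention"
```

Polling once per second keeps the helper portable to plain POSIX sh; the deadline is approximate, not exact.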

Comment by Gerrit Updater [ 26/Sep/16 ]

Andrew Perepechko (andrew.perepechko@seagate.com) uploaded a new patch: http://review.whamcloud.com/22743
Subject: LU-8509 tests: drop_caches hangs in cl_inode_fini()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9b1a256b2cede7866fa7c86916ddebab88800ad0

Comment by Gerrit Updater [ 26/Sep/16 ]

Andrew Perepechko (andrew.perepechko@seagate.com) uploaded a new patch: http://review.whamcloud.com/22745
Subject: LU-8509 llite: drop_caches hangs in cl_inode_fini()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 124b249c8ccaa4aba925916752d0a3fa51fda2f1

Comment by Gerrit Updater [ 05/Oct/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22745/
Subject: LU-8509 llite: drop_caches hangs in cl_inode_fini()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c594026329e6a78a6c9f3188514211647b3040d8

Comment by Peter Jones [ 05/Oct/16 ]

Landed for 2.9

Generated at Sat Feb 10 02:18:11 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.