[LU-8509] drop_caches hangs in cl_inode_fini() Created: 17/Aug/16 Updated: 06/Jan/20 Resolved: 05/Oct/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Christopher Morrone | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Running lustre 2.8.0_0.0.llnlpreview.18 on the clients (see the lustre-release-fe-llnl), we are regularly seeing hangs of the /etc/slurm/prolog script when it triggers drop_caches. This script runs before each job to clear out the cache from any previous jobs. In particular it hangs here:

    # Flush slab cache entries
    echo 2 >/proc/sys/vm/drop_caches

And this is the backtrace for where it is getting stuck:

crash> bt -xs 1386
PID: 1386 TASK: ffff88201b0a5080 CPU: 10 COMMAND: "prolog"
#0 [ffff882011bd3af8] __schedule+0x295 at ffffffff81651975
#1 [ffff882011bd3b60] schedule+0x29 at ffffffff81652049
#2 [ffff882011bd3b70] cl_inode_fini+0x1ac at ffffffffa0c6b3ac [lustre]
#3 [ffff882011bd3c10] ll_clear_inode+0x21c at ffffffffa0c377ec [lustre]
#4 [ffff882011bd3c38] ll_delete_inode+0x58 at ffffffffa0c39048 [lustre]
#5 [ffff882011bd3c60] evict+0xa7 at ffffffff81204077
#6 [ffff882011bd3c88] dispose_list+0x3e at ffffffff8120417e
#7 [ffff882011bd3cb0] prune_icache_sb+0x163 at ffffffff81205113
#8 [ffff882011bd3d18] prune_super+0x143 at ffffffff811ea343
#9 [ffff882011bd3d50] shrink_slab+0x175 at ffffffff81183a25
#10 [ffff882011bd3e08] drop_caches_sysctl_handler+0x283 at ffffffff8124a743
#11 [ffff882011bd3e90] proc_sys_call_handler+0xd3 at ffffffff81260f03
#12 [ffff882011bd3ee8] proc_sys_write+0x14 at ffffffff81260f34
#13 [ffff882011bd3ef8] vfs_write+0xbd at ffffffff811e7bfd
#14 [ffff882011bd3f38] sys_write+0x7f at ffffffff811e869f
#15 [ffff882011bd3f80] system_call_fastpath+0x16 at ffffffff8165d709
RIP: 00007ffff76d3500 RSP: 00007fffffffe180 RFLAGS: 00010206
RAX: 0000000000000001 RBX: ffffffff8165d709 RCX: 0000000000000400
RDX: 0000000000000002 RSI: 00007ffff7ff8000 RDI: 0000000000000001
RBP: 00007ffff7ff8000 R8: 000000000000000a R9: 00007ffff7fbd740
R10: 00007fffffffe670 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000002 R14: 00007ffff79a7400 R15: 0000000000000002
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
|
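For readers triaging similar hangs, the backtrace above can be read mechanically: frames are listed innermost first, and the first frame tagged with a module name shows where the module itself is blocked. The following is a minimal sketch, assuming the frame layout shown in this ticket; the regular expression, the helper names, and the "kernel" label for untagged frames are illustrative choices, not part of crash or Lustre.

```python
import re

# One frame of `crash> bt` output as it appears in this ticket, e.g.
#   #2 [ffff882011bd3b70] cl_inode_fini+0x1ac at ffffffffa0c6b3ac [lustre]
FRAME_RE = re.compile(
    r"#(?P<index>\d+)\s+\[(?P<sp>[0-9a-f]+)\]\s+"
    r"(?P<func>\w+)\+0x(?P<offset>[0-9a-f]+)\s+at\s+(?P<addr>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(text):
    """Return one dict per stack frame found in the given bt output."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append({
                "index": int(m.group("index")),
                "func": m.group("func"),
                "offset": int(m.group("offset"), 16),
                # Frames without a [module] tag are labelled "kernel" here
                # (a convention of this sketch only).
                "module": m.group("module") or "kernel",
            })
    return frames


def first_module_frame(frames, module):
    """Return the innermost frame that belongs to the given module, or None."""
    for frame in frames:  # frames are ordered innermost first
        if frame["module"] == module:
            return frame
    return None


# First four frames copied from the backtrace in this ticket.
SAMPLE = """\
 #0 [ffff882011bd3af8] __schedule+0x295 at ffffffff81651975
 #1 [ffff882011bd3b60] schedule+0x29 at ffffffff81652049
 #2 [ffff882011bd3b70] cl_inode_fini+0x1ac at ffffffffa0c6b3ac [lustre]
 #3 [ffff882011bd3c10] ll_clear_inode+0x21c at ffffffffa0c377ec [lustre]
"""

frames = parse_frames(SAMPLE)
blocked = first_module_frame(frames, "lustre")
print(blocked["func"], hex(blocked["offset"]))  # prints: cl_inode_fini 0x1ac
```

Here the innermost lustre frame is cl_inode_fini+0x1ac, sitting directly above schedule(), which matches the summary of this issue.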
| Comments |
| Comment by Peter Jones [ 17/Aug/16 ] |
|
Bobijam, could you please assist with this issue? Thanks, Peter |
| Comment by Zhenyu Xu [ 18/Aug/16 ] |
|
Was the image built with --enable-lu_ref passed to configure? cl_inode_fini() is waiting for the lli_clob reference count to drop to 1, and it seems that another thread took a reference on the object and never released it. |
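The failure mode described in this comment can be illustrated with a toy model: a finalizer waits until only its own reference remains, and a holder that never drops its reference leaves the finalizer waiting. This is a sketch in Python, not Lustre code; the class, the method names, and the timeout are hypothetical and exist only so the demonstration terminates. The ticket does not show how cl_inode_fini() actually implements its wait.

```python
import threading


class RefCounted:
    """Toy stand-in for a reference-counted object (not a Lustre structure)."""

    def __init__(self):
        self._cond = threading.Condition()
        self.refs = 1  # the owner's own reference

    def get(self):
        """Take an additional reference."""
        with self._cond:
            self.refs += 1

    def put(self):
        """Drop a reference and wake anyone waiting on the count."""
        with self._cond:
            self.refs -= 1
            self._cond.notify_all()

    def wait_for_last_ref(self, timeout):
        """Block until only the owner's reference remains.

        Returns True if the count reached 1, False if the timeout expired.
        """
        with self._cond:
            return self._cond.wait_for(lambda: self.refs == 1, timeout=timeout)


def demo(leak, timeout=0.2):
    """Run one finalization attempt; `leak` makes the other user keep its reference."""
    obj = RefCounted()
    obj.get()  # a second user takes a reference (count is now 2)

    def user():
        if not leak:
            obj.put()  # well-behaved user releases its reference

    worker = threading.Thread(target=user)
    worker.start()
    worker.join()
    return obj.wait_for_last_ref(timeout)


print(demo(leak=False))  # prints: True
print(demo(leak=True))   # prints: False
```

With leak=True the count stays at 2 and the wait only ends because the sketch uses a timeout; a wait with no timeout would block indefinitely.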
| Comment by Christopher Morrone [ 18/Aug/16 ] |
"Was the image built with --enable-lu_ref passed to configure?"

No, we are not setting that. |
| Comment by Zhenyu Xu [ 23/Aug/16 ] |
|
Can you dump the stack traces of all threads when this hits? (echo t > /proc/sysrq-trigger) |
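The requested dump could be scripted roughly as follows. This is a minimal sketch, assuming root privileges and a Linux kernel that exposes the SysRq trigger file at /proc/sysrq-trigger (note the hyphen); the function names are hypothetical, and reading the result through dmesg is one option among several (the traces may also land in the system log).

```python
import subprocess


def request_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Ask the kernel to log a stack trace for every task (SysRq 't').

    Equivalent to `echo t > /proc/sysrq-trigger`; requires root.
    """
    with open(trigger_path, "w") as trigger:
        trigger.write("t")


def read_kernel_log():
    """Return the kernel ring buffer as text by running dmesg.

    May require elevated privileges depending on system configuration.
    """
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

A typical use on a hung node would be to call request_task_dump() and then save the text returned by read_kernel_log() to a file for attachment to the ticket. On a busy node the ring buffer can wrap before the full dump is read, so the output may be incomplete.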
| Comment by Christopher Morrone [ 23/Aug/16 ] |
|
Perhaps the next time it happens. |
| Comment by Ann Koehler (Inactive) [ 24/Aug/16 ] |
|
Cray has hit this bug several times. apinit, one of the Cray workload manager daemons, hangs while dropping vm caches: echo 3 > /proc/sys/vm/drop_caches. apinit drops vm caches at the end of each job after dropping ldlm caches. When apinit fails to complete the vm drop_caches, the Node Health Checker (NHC) first marks the node suspect and then marks it admindown. In this state no new jobs are scheduled. apinit appears to be hung waiting for the loh_ref count of a cl_object to drop from 2 to 1. I uploaded a dump to ftp.intel.com:/uploads/ |
| Comment by Gerrit Updater [ 26/Sep/16 ] |
|
Andrew Perepechko (andrew.perepechko@seagate.com) uploaded a new patch: http://review.whamcloud.com/22743 |
| Comment by Gerrit Updater [ 26/Sep/16 ] |
|
Andrew Perepechko (andrew.perepechko@seagate.com) uploaded a new patch: http://review.whamcloud.com/22745 |
| Comment by Gerrit Updater [ 05/Oct/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22745/ |
| Comment by Peter Jones [ 05/Oct/16 ] |
|
Landed for 2.9 |