[LU-8509] drop_caches hangs in cl_inode_fini() - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.9.0
Affects Version/s: Lustre 2.8.0
Labels:
- llnl

Severity: 3

Description

Running Lustre 2.8.0_0.0.llnlpreview.18 on the clients (see lustre-release-fe-llnl), we are regularly seeing the /etc/slurm/prolog script hang when it triggers drop_caches. This script runs before each job to clear out the cache left by any previous jobs.

In particular it hangs here:

#  Flush slab cache entries
echo 2 >/proc/sys/vm/drop_caches

And this is the backtrace for where it is getting stuck:

crash> bt -xs 1386
PID: 1386   TASK: ffff88201b0a5080  CPU: 10  COMMAND: "prolog"
 #0 [ffff882011bd3af8] __schedule+0x295 at ffffffff81651975
 #1 [ffff882011bd3b60] schedule+0x29 at ffffffff81652049
 #2 [ffff882011bd3b70] cl_inode_fini+0x1ac at ffffffffa0c6b3ac [lustre]
 #3 [ffff882011bd3c10] ll_clear_inode+0x21c at ffffffffa0c377ec [lustre]
 #4 [ffff882011bd3c38] ll_delete_inode+0x58 at ffffffffa0c39048 [lustre]
 #5 [ffff882011bd3c60] evict+0xa7 at ffffffff81204077
 #6 [ffff882011bd3c88] dispose_list+0x3e at ffffffff8120417e
 #7 [ffff882011bd3cb0] prune_icache_sb+0x163 at ffffffff81205113
 #8 [ffff882011bd3d18] prune_super+0x143 at ffffffff811ea343
 #9 [ffff882011bd3d50] shrink_slab+0x175 at ffffffff81183a25
#10 [ffff882011bd3e08] drop_caches_sysctl_handler+0x283 at ffffffff8124a743
#11 [ffff882011bd3e90] proc_sys_call_handler+0xd3 at ffffffff81260f03
#12 [ffff882011bd3ee8] proc_sys_write+0x14 at ffffffff81260f34
#13 [ffff882011bd3ef8] vfs_write+0xbd at ffffffff811e7bfd
#14 [ffff882011bd3f38] sys_write+0x7f at ffffffff811e869f
#15 [ffff882011bd3f80] system_call_fastpath+0x16 at ffffffff8165d709
    RIP: 00007ffff76d3500  RSP: 00007fffffffe180  RFLAGS: 00010206
    RAX: 0000000000000001  RBX: ffffffff8165d709  RCX: 0000000000000400
    RDX: 0000000000000002  RSI: 00007ffff7ff8000  RDI: 0000000000000001
    RBP: 00007ffff7ff8000   R8: 000000000000000a   R9: 00007ffff7fbd740
    R10: 00007fffffffe670  R11: 0000000000000246  R12: 0000000000000001
    R13: 0000000000000002  R14: 00007ffff79a7400  R15: 0000000000000002
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
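
Because the prolog blocks indefinitely on that write, one way to keep such a hang visible without stalling the whole script is to run the write in a child process and stop waiting after a deadline. The following is a minimal sketch under that assumption, not part of the reported prolog: the function name `drop_caches_with_deadline` and its parameters are hypothetical, and the target path is overridable only so the function can be exercised against an ordinary file. A writer that is blocked inside the kernel may not respond to signals, so the sketch only reports the hang and does not try to kill it.

```shell
#!/bin/sh
# Sketch: write to drop_caches from a child process and give up waiting
# after a deadline, so the caller can log the hang instead of blocking.
# drop_caches_with_deadline and its arguments are hypothetical names.
drop_caches_with_deadline() {
    deadline=${1:-60}                       # seconds to wait for the write
    target=${2:-/proc/sys/vm/drop_caches}   # overridable for testing
    value=${3:-2}                           # 2 = reclaimable slab objects (dentries, inodes)

    # Run the write in a subshell so a blocked write does not block the caller.
    ( echo "$value" > "$target" ) &
    pid=$!

    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$deadline" ]; then
            echo "drop_caches still running after ${deadline}s (pid $pid)" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done

    # The writer has exited; return its exit status.
    wait "$pid"
}
```

A prolog could call `drop_caches_with_deadline 60 || logger "drop_caches hang"` and then decide whether to continue; the stuck writer itself would still remain on the node until the underlying reference is released.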

Issue Links

is duplicated by
- LU-8936 Client LBUGs with cl_object.c:735:cl_env_attach() ASSERTION( rc == 0 ) in process ldlm_bl_02 (Resolved)
- LU-8743 client stuck in cl_inode_fini() (Resolved)

is related to
- LU-8743 client stuck in cl_inode_fini() (Resolved)

Activity

Peter Jones added a comment - 05/Oct/16 11:57 AM

Landed for 2.9

Gerrit Updater added a comment - 05/Oct/16 3:51 AM

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22745/
Subject: LU-8509 llite: drop_caches hangs in cl_inode_fini()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c594026329e6a78a6c9f3188514211647b3040d8

Gerrit Updater added a comment - 26/Sep/16 9:16 PM

Andrew Perepechko (andrew.perepechko@seagate.com) uploaded a new patch: http://review.whamcloud.com/22745
Subject: LU-8509 llite: drop_caches hangs in cl_inode_fini()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 124b249c8ccaa4aba925916752d0a3fa51fda2f1

Gerrit Updater added a comment - 26/Sep/16 8:46 PM

Andrew Perepechko (andrew.perepechko@seagate.com) uploaded a new patch: http://review.whamcloud.com/22743
Subject: LU-8509 tests: drop_caches hangs in cl_inode_fini()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9b1a256b2cede7866fa7c86916ddebab88800ad0

Ann Koehler (Inactive) added a comment - 24/Aug/16 3:39 PM

Cray has hit this bug several times. apinit, one of the Cray workload manager daemons, hangs while dropping vm caches: echo 3 > /proc/sys/vm/drop_caches. apinit drops vm caches at the end of each job after dropping ldlm caches. When apinit fails to complete the vm drop_caches, the Node Health Checker (NHC) first marks the node suspect and then marks it admindown. In this state no new jobs are scheduled.

apinit appears to be hung waiting for the loh_ref count of a cl_object to drop from 2 to 1.

I uploaded a dump to ftp.intel.com:/uploads/LU-8509 in case it may be of some help.
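
The end-of-job sequence described in this comment (LDLM caches first, then VM caches) can be sketched roughly as follows. This is an illustration under stated assumptions, not apinit's actual implementation: it assumes "dropping ldlm caches" corresponds to clearing the LDLM lock LRU with `lctl set_param`, it uses `sysctl -w vm.drop_caches=3` as the equivalent of `echo 3 > /proc/sys/vm/drop_caches`, and the names `run_or_print`, `flush_client_caches`, and `DRY_RUN` are hypothetical.

```shell
#!/bin/sh
# Sketch of an end-of-job cache flush: LDLM locks first, then VM caches.
# run_or_print, flush_client_caches and DRY_RUN are hypothetical names.

# Print the command instead of running it when DRY_RUN is set.
run_or_print() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

flush_client_caches() {
    # Assumption: "dropping ldlm caches" means clearing the lock LRU of
    # every LDLM namespace. The pattern is quoted so that lctl, not the
    # shell, expands the wildcard.
    run_or_print lctl set_param 'ldlm.namespaces.*.lru_size=clear' || return 1

    # Same effect as: echo 3 > /proc/sys/vm/drop_caches
    # This is the step reported to hang in cl_inode_fini().
    run_or_print sysctl -w vm.drop_caches=3
}
```

Running `DRY_RUN=1 flush_client_caches` prints the two commands without executing them, which makes the ordering easy to inspect on a node without Lustre or root access.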

Christopher Morrone (Inactive) added a comment - 23/Aug/16 11:51 PM

Perhaps the next time it happens.

Zhenyu Xu added a comment - 23/Aug/16 11:13 PM

Can you dump the traces of all threads for this hit? (echo t > /proc/sysrq-trigger)
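
A small helper along these lines could capture the requested dump on an affected node. It is a sketch under stated assumptions: the function name `dump_all_threads`, its parameters, and the default output path are hypothetical, the defaults require root, and on a busy node the task list may be larger than the kernel log buffer, in which case the saved log would be incomplete.

```shell
#!/bin/sh
# Sketch: ask the kernel to log the state of every task, then save the
# kernel log to a file. dump_all_threads and its arguments are
# hypothetical names; the default paths require root.
dump_all_threads() {
    out=${1:-/tmp/sysrq-t.log}           # where to save the kernel log
    trigger=${2:-/proc/sysrq-trigger}    # overridable for testing

    # SysRq 't' dumps the current tasks and their stacks to the kernel log.
    echo t > "$trigger" || return 1

    # Save the kernel ring buffer, which now contains the task dump.
    dmesg > "$out"
}
```

The resulting file can then be searched for other threads with Lustre frames in their stacks, which is where a holder of the outstanding reference would be expected to show up.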

Christopher Morrone (Inactive) added a comment - 18/Aug/16 6:04 PM

Is the image built with --enable-lu_ref defined in configure?

No, we are not setting that.

Zhenyu Xu added a comment - 18/Aug/16 1:40 PM

Is the image built with --enable-lu_ref defined in configure? cl_inode_fini() is waiting for the lli_clob reference count to drop to 1, and it seems that another thread took a reference on the object and has not released it.

Peter Jones added a comment - 17/Aug/16 3:03 PM

Bobijam

Could you please assist on this issue?

Thanks

Peter

People

Assignee: Zhenyu Xu
Reporter: Christopher Morrone (Inactive)
Votes: 0
Watchers: 13

Dates

Created: 17/Aug/16 12:08 AM
Updated: 06/Jan/20 10:19 PM
Resolved: 05/Oct/16 11:57 AM