[LUDOC-225] Possible ll_stop_statahead deadlock Created: 13/Feb/14  Updated: 14/Feb/14  Resolved: 14/Feb/14

Status: Resolved
Project: Lustre Documentation
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Christopher Morrone Assignee: Jodi Levi (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 12652

 Description   

It looks like we may have a deadlock in the statahead client code. Running Lustre 2.4.0-19chaos (see http://github.com/chaos/lustre), we have some clients hanging in close(), with the following stacks:

PID: 67983  TASK: ffff8804448d0080  CPU: 10  COMMAND: "java"
 #0 [ffff88041b969d08] schedule at ffffffff815109b2
 #1 [ffff88041b969dd0] cfs_waitq_wait+0xe at ffffffffa043577e [libcfs]
 #2 [ffff88041b969de0] ll_stop_statahead+0x1b8 at ffffffffa0aad1b8 [lustre]
 #3 [ffff88041b969e60] ll_file_release+0x2d8 at ffffffffa0a6b698 [lustre]
 #4 [ffff88041b969ea0] ll_dir_release+0xdb at ffffffffa0a52d5b [lustre]
 #5 [ffff88041b969ec0] __fput+0x108 at ffffffff81183898
 #6 [ffff88041b969f10] fput+0x25 at ffffffff811839e5
 #7 [ffff88041b969f20] filp_close+0x5d at ffffffff8117eddd
 #8 [ffff88041b969f50] sys_close+0xa5 at ffffffff8117eeb5
 #9 [ffff88041b969f80] system_call_fastpath+0x16 at ffffffff8100b0b2
    RIP: 00002aaaaacdb5ad  RSP: 00002aaab505f3f8  RFLAGS: 00010206
    RAX: 0000000000000003  RBX: ffffffff8100b0b2  RCX: 000000000000003a
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000131
    RBP: 0000000000000131   R8: 00002aaaab586580   R9: 00002aaaab33a780
    R10: 0000000000000131  R11: 0000000000000293  R12: 0000000000000002
    R13: 00002aaabc013df0  R14: 00002aaabc107700  R15: 00002aaabc013df0
    ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b

PID: 67984  TASK: ffff880639bcb500  CPU: 10  COMMAND: "ll_sa_66672"
 #0 [ffff88065abc9d40] schedule at ffffffff815109b2
 #1 [ffff88065abc9e08] cfs_waitq_wait+0xe at ffffffffa043577e [libcfs]
 #2 [ffff88065abc9e18] ll_statahead_thread+0x59e at ffffffffa0ab288e [lustre]
 #3 [ffff88065abc9f48] child_rip+0xa at ffffffff8100c10a
 
PID: 67985  TASK: ffff880d2faba080  CPU: 11  COMMAND: "ll_agl_66672"
 #0 [ffff8809d06fddc0] schedule at ffffffff815109b2
 #1 [ffff8809d06fde88] cfs_waitq_wait+0xe at ffffffffa043577e [libcfs]
 #2 [ffff8809d06fde98] ll_agl_thread+0x44a at ffffffffa0aadbba [lustre]
 #3 [ffff8809d06fdf48] child_rip+0xa at ffffffff8100c10a


 Comments   
Comment by Christopher Morrone [ 13/Feb/14 ]

Whoops, wrong project. You can close this, I'll open the LU ticket.

Comment by Peter Jones [ 14/Feb/14 ]

OK, Chris. In future, if this happens, let us know and I can move the ticket between projects.
