[LU-4624] ll_stop_statahead deadlock Created: 13/Feb/14  Updated: 11/Oct/14  Resolved: 27/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: Lustre 2.6.0, Lustre 2.5.4

Type: Bug Priority: Critical
Reporter: Christopher Morrone Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-3270 ptlrpcd strnlen crash trying to log a... Resolved
Severity: 3
Rank (Obsolete): 12653

 Description   

We are seeing processes on Lustre clients hang in close():

PID: 67983  TASK: ffff8804448d0080  CPU: 10  COMMAND: "java"
 #0 [ffff88041b969d08] schedule at ffffffff815109b2
 #1 [ffff88041b969dd0] cfs_waitq_wait+0xe at ffffffffa043577e [libcfs]
 #2 [ffff88041b969de0] ll_stop_statahead+0x1b8 at ffffffffa0aad1b8 [lustre]
 #3 [ffff88041b969e60] ll_file_release+0x2d8 at ffffffffa0a6b698 [lustre]
 #4 [ffff88041b969ea0] ll_dir_release+0xdb at ffffffffa0a52d5b [lustre]
 #5 [ffff88041b969ec0] __fput+0x108 at ffffffff81183898
 #6 [ffff88041b969f10] fput+0x25 at ffffffff811839e5
 #7 [ffff88041b969f20] filp_close+0x5d at ffffffff8117eddd
 #8 [ffff88041b969f50] sys_close+0xa5 at ffffffff8117eeb5
 #9 [ffff88041b969f80] system_call_fastpath+0x16 at ffffffff8100b0b2
    RIP: 00002aaaaacdb5ad  RSP: 00002aaab505f3f8  RFLAGS: 00010206
    RAX: 0000000000000003  RBX: ffffffff8100b0b2  RCX: 000000000000003a
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000131
    RBP: 0000000000000131   R8: 00002aaaab586580   R9: 00002aaaab33a780
    R10: 0000000000000131  R11: 0000000000000293  R12: 0000000000000002
    R13: 00002aaabc013df0  R14: 00002aaabc107700  R15: 00002aaabc013df0
    ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b
 
PID: 67984  TASK: ffff880639bcb500  CPU: 10  COMMAND: "ll_sa_66672"
 #0 [ffff88065abc9d40] schedule at ffffffff815109b2
 #1 [ffff88065abc9e08] cfs_waitq_wait+0xe at ffffffffa043577e [libcfs]
 #2 [ffff88065abc9e18] ll_statahead_thread+0x59e at ffffffffa0ab288e [lustre]
 #3 [ffff88065abc9f48] child_rip+0xa at ffffffff8100c10a
 
PID: 67985  TASK: ffff880d2faba080  CPU: 11  COMMAND: "ll_agl_66672"
 #0 [ffff8809d06fddc0] schedule at ffffffff815109b2
 #1 [ffff8809d06fde88] cfs_waitq_wait+0xe at ffffffffa043577e [libcfs]
 #2 [ffff8809d06fde98] ll_agl_thread+0x44a at ffffffffa0aadbba [lustre]
 #3 [ffff8809d06fdf48] child_rip+0xa at ffffffff8100c10a

We are running 2.4.0-19chaos (see http://github.com/chaos/lustre) on these clients.



 Comments   
Comment by Peter Jones [ 13/Feb/14 ]

Niu

Could you please advise on this one?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 14/Feb/14 ]

I can't see why the agl & statahead thread can't be stopped now, is there any error messages in the log? Is it possible to reproduce it and get some debug log with D_READA enabled?

Comment by Christopher Morrone [ 22/Feb/14 ]

The synchronization between the statahead threads is pretty weak. It definitely needs work.

The specific problem that we are seeing right now has to do with a race between startup and shutdown of the statahead threads. The statahead threads blindly overwrite the thread status flags to be SVC_RUNNING. If the state was set to SVC_STOPPING before that, we have now lost that notice and we wind up deadlocked.

Here's quick-and-dirty patch to address the problem:

http://review.whamcloud.com/9358

I only have the new threads overwrite the thread state flags if the flag is still uninitialized. Otherwise, it remains unchanged. This allows the SVC_STOPPING to remain set.

I call this "quick-and-dirty" because really it would be better to recognize that SVC_STOPPING is set and immediately abort startup. My patch still allows the full startup to continue: launch the agl thread, and all the other initializations. But then in the main loop it will see that it needs to shutdown and unwind all of that.

Better would be to exit early without proceding through all of that startup. But the cleanup code in both the ll_statahead_thread and ll_agl_thread both need more reworking than I have time for at the moment. But I would suggest that really needs to be done before this issue is closed.

Nevertheless, I suspect there is no harm in landing my patch in the interim.

I also include a patch to make the statahead debug messages a bit more useful.

http://review.whamcloud.com/9360

Comment by Christopher Morrone [ 22/Feb/14 ]

Here are some thoughts about things that are wrong and need to be fixed in the statahead code:

  • The code abuses the ptlrpc_thread structures. It reuses them to keep state about the ll_statahead_thread() and the ll_agl_thread(), but uses very few of the actual fields. Even the fields that are used don't appear to be used properly. For instance it uses the t_flags field to represent the state of the threads, but fails to employ all of the states correctly (SVC_STARTING is missed). But the more important problem is that these are not ptlrpc services! They should not be using the ptlrpc_thread structure. Period. That is wrong and completely misleading to anyone new to the code.
  • Cleanup at the end of the functions needs to be in reverse order of setup. This would facilitate early exit from the threads (the race exposed in this ticket is a perfect example).
  • There needs to be more symetry and uniformity in the functions for the sa thread and the agl thread. The sa thread has an ll_stop_statahead() function. For the agl thread, that functionality is just inlined in the ll_statahead_thread(). It would be better to make that its own static scope function.
  • sai_agl_valid would appear to be a completely superfluous and confusing flag. It looks like yet another flag to represent the state of the agl thread, but we already have a different state flag that seems to server the same purpose.
  • LS_NONE_FIRST_DE should be renamed LS_NOT_FIRST_DE
  • is_first_dirent() needs a really good comment explaining what it does. Why do we care about filenames that start with "."?
  • There are places where "parent" is dereferenced after doing a dput(parent). Are we really sure that is safe? (and if so, we need a comment explaining why it is safe)
  • The "sai_entries_stated" is difficult to understand for an english reader. At the very least, the comment by its declaration, " /* entries stated */" is insufficient. Perhaps it should be reworded "/* entries stat()ed */". The problem here is that "stated" is a real english word. On first read, it does not seem like the past tense of stat(), which is a word that we made up. The same goes for the comment for the function ll_sa_entry_stated().
  • "thread" is too generic of a local variable name, and makes the code more difficult to read. When it refers to the statahead thread, it should be renamed to "sa_thread" and when it refers to the agl thread is should be renamed "agl_thread".
Comment by Niu Yawei (Inactive) [ 24/Feb/14 ]

he specific problem that we are seeing right now has to do with a race between startup and shutdown of the statahead threads. The statahead threads blindly overwrite the thread status flags to be SVC_RUNNING. If the state was set to SVC_STOPPING before that, we have now lost that notice and we wind up deadlocked.
Here's quick-and-dirty patch to address the problem:
http://review.whamcloud.com/9358
I only have the new threads overwrite the thread state flags if the flag is still uninitialized. Otherwise, it remains unchanged. This allows the SVC_STOPPING to remain set.

I think the most common use pattern of start/stop ptlrpc_thread is that: caller calls kthread_run() to start thread, then wait until the thread start running, then stop the thread later. For the statahead, if there could be multiple callers which can concurrently start/stop the statahead thread, they must be serialized outside of the statahead thread.

Comment by Niu Yawei (Inactive) [ 24/Feb/14 ]

Lai & Fanyong are statahead/agl experts, I'd like to ask their opinions. Lai/nasf, any comments to Chris' questions?

Comment by Christopher Morrone [ 24/Feb/14 ]

(whoops sorry, accidentally edited your comment instead of replying...think I fix that now)

I think the most common use pattern of start/stop ptlrpc_thread is that: caller calls kthread_run() to start thread, then wait until the thread start running, then stop the thread later. For the statahead, if there could be multiple callers which can concurrently start/stop the statahead thread, they must be serialized outside of the statahead thread.

That is a strange position to take, and I would argue that it is wrong.

The statahead code explicitly exports interfaces for starting and stopping the statahead threads, and that is clearly the correct place to be doing the synchronization. Forcing the calling code to know the internal sychronization requirements of the statahead code would just be bad software design.

Comment by Niu Yawei (Inactive) [ 25/Feb/14 ]

That is a strange position to take, and I would argue that it is wrong.
The statahead code explicitly exports interfaces for starting and stopping the statahead threads, and that is clearly the correct place to be doing the synchronization. Forcing the calling code to know the internal sychronization requirements of the statahead code would just be bad software design.

Yes, the function of starting/stopping statahead thread is a good place to do the synchronization. I meant that we'd not do that inside of statahead thread, your patch looks like trying to resolve the race in the statahead thread.

Comment by Christopher Morrone [ 25/Feb/14 ]

Yes, the function of starting/stopping statahead thread is a good place to do the synchronization. I meant that we'd not do that inside of statahead thread, your patch looks like trying to resolve the race in the statahead thread.

The statahead thread initializes the threads's t_flags without checking the current state of the flags. That is a bug. Any change to address that problem will necessarily require changes to the ll_statahead_thread() code.

If you think there is some way to fix this without touching the ll_statahead_thread(), please show the code. Otherwise, please remove the -1 review on the patch.

Comment by Christopher Morrone [ 25/Feb/14 ]

I think the most common use pattern of start/stop ptlrpc_thread is that

What is "common" is pretty much irrelevant to the conversation. The code needs to be correct for all cases, not just the common case.

Comment by Niu Yawei (Inactive) [ 25/Feb/14 ]

The statahead thread initializes the threads's t_flags without checking the current state of the flags. That is a bug. Any change to address that problem will necessarily require changes to the ll_statahead_thread() code.
If you think there is some way to fix this without touching the ll_statahead_thread(), please show the code. Otherwise, please remove the -1 review on the patch.

I see the race of starting statahead thread in do_statahead_enter() vs. stopping thread in ll_file_release(). and looks it's hard to be fixed outside of the statahead thread, because close() should not be blocked. (current ll_file_release() wait on the thread exit, it'll be changed in the fix of LU-3270 http://review.whamcloud.com/#/c/6392/25)

Thanks for your patch, Chris.

Comment by Lai Siyao [ 25/Feb/14 ]

Chris, there is a patch http://review.whamcloud.com/#/c/6392/ which addresses many statahead issues, I'll try to land that ASAP. So below suggestions should better be fixed in a new patch.

Here are some thoughts about things that are wrong and need to be fixed in the statahead code:
The code abuses the ptlrpc_thread structures. It reuses them to keep state about the ll_statahead_thread() and the ll_agl_thread(), but uses very few of the actual fields. Even the fields that are used don't appear to be used properly. For instance it uses the t_flags field to represent the state of the threads, but fails to employ all of the states correctly (SVC_STARTING is missed). But the more important problem is that these are not ptlrpc services! They should not be using the ptlrpc_thread structure. Period. That is wrong and completely misleading to anyone new to the code.
Cleanup at the end of the functions needs to be in reverse order of setup. This would facilitate early exit from the threads (the race exposed in this ticket is a perfect example).
There needs to be more symetry and uniformity in the functions for the sa thread and the agl thread. The sa thread has an ll_stop_statahead() function. For the agl thread, that functionality is just inlined in the ll_statahead_thread(). It would be better to make that its own static scope function.
sai_agl_valid would appear to be a completely superfluous and confusing flag. It looks like yet another flag to represent the state of the agl thread, but we already have a different state flag that seems to server the same purpose.
LS_NONE_FIRST_DE should be renamed LS_NOT_FIRST_DE
is_first_dirent() needs a really good comment explaining what it does. Why do we care about filenames that start with "."?
There are places where "parent" is dereferenced after doing a dput(parent). Are we really sure that is safe? (and if so, we need a comment explaining why it is safe)
The "sai_entries_stated" is difficult to understand for an english reader. At the very least, the comment by its declaration, " /* entries stated /" is insufficient. Perhaps it should be reworded "/ entries stat()ed */". The problem here is that "stated" is a real english word. On first read, it does not seem like the past tense of stat(), which is a word that we made up. The same goes for the comment for the function ll_sa_entry_stated().
"thread" is too generic of a local variable name, and makes the code more difficult to read. When it refers to the statahead thread, it should be renamed to "sa_thread" and when it refers to the agl thread is should be renamed "agl_thread".

Comment by Christopher Morrone [ 25/Feb/14 ]

Lai, thanks for the pointer. While I think there is some good cleanup work in your patch, I think a lot of my comments are not addressed by that.

Also, I fear that there is too much going on in that patch to get it landed to b2_4/b2_5 very soon.

I think in the short term (and for the b2_5 branch) my patch should be fully reviewed. If it is acceptable, it can land and fix this deadlock while we are waiting for the larger reworking of the code to be ready.

Comment by Christopher Morrone [ 26/Feb/14 ]

I pushed an update to my patch. There was a follow-on issue to the one I addressed already, a failure to refcount the sai buffer correctly.

But the more I look at the statahead code, the more potential problems I see. Hopefully Lai's patch will be more comprehensive.

Comment by Christopher Morrone [ 26/Feb/14 ]

I don't think that Lai's patch addresses the first issue identified here.

Comment by Peter Jones [ 27/Mar/14 ]

All patches tracked under this ticket have landed for 2.6. Further work in this area is taking place under LU-3270

Generated at Sat Feb 10 01:44:25 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.