
[LU-14713] Process hung with waiting for mmap_sem

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.15.0
    • Affects Version/s: Lustre 2.12.7, Lustre 2.12.8
    • Labels: None
    • Severity: 3
    • 9223372036854775807

    Description

      Write and truncate IO are serialized on ll_trunc_sem::ll_trunc_{readers|waiters}. If one process quits abruptly (e.g. is killed), the other will keep waiting for the semaphore (with its task state set to TASK_INTERRUPTIBLE):

       INFO: task a.out:109684 blocked for more than 120 seconds.
            Tainted: G          IOE    --------- -  - 4.18.0-240.15.1.el8_3.x86_64 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       Call Trace:
       __schedule+0x2a6/0x700
       schedule+0x38/0xa0
       trunc_sem_down_read+0xa6/0xb0 [lustre]
       vvp_io_write_start+0x107/0xb80 [lustre]
       cl_io_start+0x59/0x110 [obdclass]
       cl_io_loop+0x9a/0x1e0 [obdclass]
       ll_file_io_generic+0x380/0xb10 [lustre]
       ll_file_write_iter+0x136/0x5a0 [lustre]
       new_sync_write+0x124/0x170
       vfs_write+0xa5/0x1a0
       ksys_write+0x4f/0xb0
       do_syscall_64+0x5b/0x1a0
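
      For reference, the semantics of this semaphore can be sketched as follows (a simplified illustration based on the description above, not the verbatim Lustre source; the exact waiting primitives used by the shipped code may differ). Write IO takes the read side, truncate takes the write side, ll_trunc_readers is -1 while write-locked, and ll_trunc_waiters counts waiting truncates:

       /*
        * Simplified sketch of the ll_trunc_sem read side (illustration of the
        * semantics discussed in this ticket, not the verbatim Lustre source).
        * ll_trunc_readers > 0: that many write-IO threads hold the lock;
        * -1: a truncate holds it for write; 0: unlocked.
        * ll_trunc_waiters counts truncates waiting for the write lock.
        */
       #include <linux/atomic.h>
       #include <linux/wait_bit.h>

       struct ll_trunc_sem {
               atomic_t ll_trunc_readers;
               atomic_t ll_trunc_waiters;
       };

       static inline void trunc_sem_down_read(struct ll_trunc_sem *sem)
       {
               /* sleep until no writer holds the lock and none are waiting */
               wait_var_event(&sem->ll_trunc_readers,
                              atomic_read(&sem->ll_trunc_waiters) == 0 &&
                              atomic_inc_unless_negative(&sem->ll_trunc_readers));
       }

       static inline void trunc_sem_up_read(struct ll_trunc_sem *sem)
       {
               /* last reader out wakes any waiting truncate */
               if (atomic_dec_return(&sem->ll_trunc_readers) == 0 &&
                   atomic_read(&sem->ll_trunc_waiters))
                       wake_up_var(&sem->ll_trunc_readers);
       }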
      


          Activity

            neilb Neil Brown added a comment -

            Could you please report precisely which version of Lustre is in use (ideally the git commit hash)?  The "Affected versions" above says "None", which isn't helpful.

             

            neilb Neil Brown added a comment -

            Thanks for the excellent explanation.  The bug must be hiding somewhere else then.  I'll keep digging.

             

            bobijam Zhenyu Xu added a comment -

            The clio stack is always (at least for now) layered as VVP on top of LOV on top of OSC/MDC, and cl_io_start() calls the layers from top to bottom: vvp_io_xxx_start(), then lov_io_xxx_start(), which in turn nests the {mdc|osc}_io_xxx_start() calls. That makes vvp_io_setattr_start() the first to be called in the CIT_SETATTR cl_io_loop() call path.

            As the following example shows:

            00000020:00000001:0.0:1622711194.323939:0:20024:0:(cl_io.c:755:cl_io_loop()) Process entered
            00000020:00000001:0.0:1622711194.323940:0:20024:0:(cl_io.c:390:cl_io_iter_init()) Process entered
            ...
            00000020:00000001:0.0:1622711194.323955:0:20024:0:(cl_io.c:402:cl_io_iter_init()) Process leaving (rc=0 : 0 : 0)
            00000020:00000001:0.0:1622711194.323958:0:20024:0:(cl_io.c:311:cl_io_lock()) Process entered
            ...
            00000020:00000001:0.0:1622711194.324421:0:20024:0:(cl_io.c:331:cl_io_lock()) Process leaving (rc=0 : 0 : 0)
            00000020:00000001:0.0:1622711194.324422:0:20024:0:(cl_io.c:516:cl_io_start()) Process entered
            00000080:00000001:0.0:1622711194.324422:0:20024:0:(vvp_io.c:737:vvp_io_setattr_start()) Process entered
            ...
            00000080:00000001:0.0:1622711194.324427:0:20024:0:(vvp_io.c:763:vvp_io_setattr_start()) Process leaving (rc=0 : 0 : 0)
            00020000:00000001:0.0:1622711194.324427:0:20024:0:(lov_io.c:1487:lov_io_setattr_start()) Process entered
            ...
            00000008:00000001:0.0:1622711194.324429:0:20024:0:(osc_io.c:582:osc_io_setattr_start()) Process entered
            ...
            00000008:00000001:0.0:1622711194.324459:0:20024:0:(osc_io.c:687:osc_io_setattr_start()) Process leaving (rc=0 : 0 : 0)
            ...
            00020000:00000001:0.0:1622711194.324461:0:20024:0:(lov_io.c:1502:lov_io_setattr_start()) Process leaving (rc=0 : 0 : 0)
            00000020:00000001:0.0:1622711194.324462:0:20024:0:(cl_io.c:528:cl_io_start()) Process leaving (rc=0 : 0 : 0)
            

            So basically vvp_io_setattr_start() will always be called first in the CIT_SETATTR cl_io_start() path.
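
            The top-to-bottom dispatch can be modelled with a small userspace toy like the one below (an illustration only; names are hypothetical and it is not the Lustre cl_io_start() source). Each layer's start callback runs in stacking order, and the loop stops at the first layer that returns non-zero:

            /* Toy model of top-to-bottom start dispatch; build with: cc demo.c */
            #include <stdio.h>

            struct layer { const char *name; int (*start)(void); };

            static int vvp_start(void) { puts("vvp start"); return 0; }
            static int lov_start(void) { puts("lov start"); return 0; }
            static int osc_start(void) { puts("osc start"); return 0; }

            int main(void)
            {
                    struct layer layers[] = {
                            { "vvp", vvp_start },
                            { "lov", lov_start },
                            { "osc", osc_start },
                    };
                    int i, rc = 0;

                    /* layers below a failing one are never started */
                    for (i = 0; i < 3 && rc == 0; i++)
                            rc = layers[i].start();
                    return rc;
            }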


            adilger Andreas Dilger added a comment -

            @Andreas does this seem at all possible?

            That is really a question for Bobijam to answer, as I'm not nearly as expert in the client IO code as he is.

            neilb Neil Brown added a comment -

            Thanks for the crash data.  I could not find the other process that I was sure must exist.

            The only process that looks at all suspicious is PID 46556 (ll_imp_inval) and I suspect it is waiting on the a.out thread, not the other way around.

            It is very likely that the ll_trunc_sem that a.out is blocked on has ll_trunc_waiters == 0 and ll_trunc_readers == -1.

            There are two ways this can happen:

            1/ a thread calls trunc_sem_down_write, but never calls trunc_sem_up_write

            2/ a thread calls trunc_sem_up_write without having first called trunc_sem_down_write.  If it does this while some other thread is holding a read lock, then the up_write call will set ll_trunc_readers to 0, and when the other thread calls trunc_sem_up_read, ll_trunc_readers will be decremented to -1.

            I cannot see a code path that leads to option 1.  I can see a code path that might lead to option 2, but there may be considerations that I'm not aware of which make it impossible.  I'll explain it in case it is possible.

            trunc_sem_up/down_write are only called from vvp_io_setattr_start/end.  These are only called from cl_io_start/end which are always paired.

            They are called directly in ll_readahead_handle_work and cl_io_loop, where cl_io_end will always follow cl_io_start, and indirectly through lov_io_start/end, which will be paired properly exactly when vvp_io_setattr_start/end are paired properly.

            cl_io_end will always call all of the ->cio_end functions, so missing a call to trunc_sem_up_write is (should be) impossible.

            However, cl_io_start will sometimes not call all cio_start functions.  If one of the cio_start functions in the stack returns non-zero, the remainder are not called.  However, the matching cio_end function will be called.  This seems very strange.  I might be missing something, but it certainly seems that is what would happen.

            So if some cio_start function before vvp_io_setattr_start returned non-zero, then trunc_sem_down_write would not be called, but trunc_sem_up_write would.  This would often not cause a problem.  But if some other thread had called trunc_sem_down_read(), then it would produce exactly the symptom you observe.
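
            To make the counter arithmetic of option 2 concrete, a deliberately simplified userspace toy model (not the kernel code) of the sequence would be:

            /* Toy single-threaded model of the option-2 counter sequence
             * (simplified illustration, not the Lustre implementation).
             * Build with: cc -o trunc_sem_demo trunc_sem_demo.c */
            #include <stdio.h>

            struct trunc_sem { int readers; int waiters; };

            static void down_read(struct trunc_sem *s) { s->readers++; }   /* no writer present */
            static void up_read(struct trunc_sem *s)   { s->readers--; }
            static void up_write(struct trunc_sem *s)  { s->readers = 0; } /* "releases" the write lock */

            int main(void)
            {
                    struct trunc_sem sem = { 0, 0 };

                    down_read(&sem);  /* thread A: write IO takes the read lock, readers = 1 */
                    up_write(&sem);   /* thread B: unbalanced up_write forces readers to 0   */
                    up_read(&sem);    /* thread A finishes its write, readers = -1           */

                    printf("readers=%d waiters=%d\n", sem.readers, sem.waiters);
                    /* readers == -1 with waiters == 0: every later down_read() waits forever,
                     * which matches the hung a.out thread in the stack trace above. */
                    return 0;
            }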

            @Andreas does this seem at all possible?

             


            mnishizawa Mitsuhiro Nishizawa added a comment -

            Ah, we usually don't share our customer's data publicly, so I attached it to DDN-2177, but it would be better to dig into the root cause.
            Here is the stack trace: [^crash.log]

            I may delete the attachment later.

            neilb Neil Brown added a comment -

            I don't know how to access "DDN-2177".  I don't suppose you can attach them here?

             


            mnishizawa Mitsuhiro Nishizawa added a comment -

            I attached stack traces in DDN-2177. Please note that this issue occurred twice on this client, and the process "a.out:109684" in the problem description is from when it occurred the first time. The customer restarted the client, the issue occurred again, and then a vmcore was captured (the vmcore is available in DDN-2177). The stack traces attached are from this vmcore, so the process is "a.out:4718". Hope this helps.

            neilb Neil Brown added a comment -

            Thanks for the stack trace.

            I think there must have been at least one other process blocked on a mutex or semaphore.  Was anything else in the logs?

            trunc_sem_down_read() can only block in two circumstances: 1/ ll_trunc_readers is negative, so a write lock is held; 2/ ll_trunc_waiters is non-zero, so some process is waiting for a write lock.

            In case 2, that process will be blocked on trunc_sem_down_write() and should get a soft-lockup warning.

            In case 1, the writer must have called vvp_io_setattr_start and either

            A/ not yet called vvp_io_setattr_end(), or

            B/ changed io->ci_type or cio->u.ci_setattr.sa_subtype so that when vvp_io_setattr_end() was called it didn't release the lock.

            B seems extremely unlikely.

            A implies that the process is still waiting, as _end is always called after _start before the process can exit or return to user space.

            The only scenario that I can think of (other than 1B) where there would be no other process triggering the soft-lockup detector is that the process holding the write lock is in a TASK_INTERRUPTIBLE or TASK_IDLE wait.  That seems unlikely, but should be easy to find if it is the case.
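
            For reference, the write side that cases 1 and A refer to pairs roughly like the following sketch (an illustration consistent with the behaviour described above, not the verbatim source):

            /* Simplified write-side sketch: a truncate takes the write lock in
             * vvp_io_setattr_start() and releases it in vvp_io_setattr_end(). */
            static inline void trunc_sem_down_write(struct ll_trunc_sem *sem)
            {
                    atomic_inc(&sem->ll_trunc_waiters);
                    /* wait until there are no readers, then mark write-locked (-1) */
                    wait_var_event(&sem->ll_trunc_readers,
                                   atomic_cmpxchg(&sem->ll_trunc_readers, 0, -1) == 0);
                    atomic_dec(&sem->ll_trunc_waiters);
            }

            static inline void trunc_sem_up_write(struct ll_trunc_sem *sem)
            {
                    /* unlock and wake anyone sleeping in trunc_sem_down_read() */
                    atomic_set(&sem->ll_trunc_readers, 0);
                    wake_up_var(&sem->ll_trunc_readers);
            }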

            Given that the program that generated the stack trace was "a.out", I assume you can reproduce the problem.  Could you do that, then run "echo t > /proc/sysrq-trigger" to generate a stack trace of all tasks, and provide me with the full set of stack traces?

             

            I agree that it would be good to be able to kill tasks blocked on mutexes or semaphores, but that tends to take a lot more careful effort than just returning an error code.

             


            gerrit Gerrit Updater added a comment -

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/43844
            Subject: LU-14713 llite: make trunc_sem_down_xxx waiting killable
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e335dad76a6edc7ede7a45b4ed60098dfa2c356f


            adilger Andreas Dilger added a comment -

            Making the wait interruptible is good. It would also be good if the interrupted process woke up any waiters for the lock, so that they do not wait forever.
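
            A killable read-side wait along these lines might look like the sketch below (an illustration only, assuming the kernel's wait_var_event_killable() and wake_up_var() helpers; the actual patch under review may differ):

            /* Sketch of a killable trunc_sem_down_read() that also wakes other
             * sleepers when it bails out, so they re-check the lock state
             * instead of waiting forever.  Illustration only, not the patch
             * from https://review.whamcloud.com/43844. */
            static inline int trunc_sem_down_read_killable(struct ll_trunc_sem *sem)
            {
                    int rc;

                    rc = wait_var_event_killable(&sem->ll_trunc_readers,
                                    atomic_read(&sem->ll_trunc_waiters) == 0 &&
                                    atomic_inc_unless_negative(&sem->ll_trunc_readers));
                    if (rc)
                            /* fatal signal: poke other waiters so they re-evaluate */
                            wake_up_var(&sem->ll_trunc_readers);
                    return rc;
            }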


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: bobijam Zhenyu Xu
              Votes: 0
              Watchers: 11
