[LU-14713] Process hung with waiting for mmap_sem

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.15.0
    • Affects Version/s: Lustre 2.12.7, Lustre 2.12.8
    • Labels: None
    • Severity: 3

    Description

      Write and truncate I/O are serialized on ll_trunc_sem::ll_trunc_{readers|waiters}. If one process quits abruptly (e.g. is killed), the other keeps waiting for the semaphore (with its task state set to TASK_INTERRUPTIBLE):

       INFO: task a.out:109684 blocked for more than 120 seconds.
            Tainted: G          IOE    --------- -  - 4.18.0-240.15.1.el8_3.x86_64 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       Call Trace:
       __schedule+0x2a6/0x700
       schedule+0x38/0xa0
       trunc_sem_down_read+0xa6/0xb0 [lustre]
       vvp_io_write_start+0x107/0xb80 [lustre]
       cl_io_start+0x59/0x110 [obdclass]
       cl_io_loop+0x9a/0x1e0 [obdclass]
       ll_file_io_generic+0x380/0xb10 [lustre]
       ll_file_write_iter+0x136/0x5a0 [lustre]
       new_sync_write+0x124/0x170
       vfs_write+0xa5/0x1a0
       ksys_write+0x4f/0xb0
       do_syscall_64+0x5b/0x1a0
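
      For context, ll_trunc_sem is a small two-counter semaphore in the llite layer: buffered writes take it shared while truncate takes it exclusively. Below is a minimal sketch of the shared (reader) side, reconstructed from the names that appear in this ticket (trunc_sem_down_read(), ll_trunc_readers, ll_trunc_waiters); the exact in-tree helpers may differ in detail:

       /* Sketch only, not copied from the tree.  ll_trunc_readers > 0 means
        * that many readers (e.g. writes) hold the semaphore, -1 means it is
        * held for truncate, and 0 means it is free.  ll_trunc_waiters counts
        * waiting "writers" (truncates).
        */
       #include <linux/atomic.h>
       #include <linux/wait_bit.h>

       struct ll_trunc_sem {
               atomic_t ll_trunc_readers;
               atomic_t ll_trunc_waiters;
       };

       static inline void trunc_sem_init(struct ll_trunc_sem *sem)
       {
               atomic_set(&sem->ll_trunc_readers, 0);
               atomic_set(&sem->ll_trunc_waiters, 0);
       }

       /* shared side, taken around buffered writes (vvp_io_write_start) */
       static inline void trunc_sem_down_read(struct ll_trunc_sem *sem)
       {
               /* back off while a truncate holds or is waiting for the sem */
               wait_var_event(&sem->ll_trunc_readers,
                              atomic_read(&sem->ll_trunc_waiters) == 0 &&
                              atomic_add_unless(&sem->ll_trunc_readers, 1, -1));
       }

       static inline void trunc_sem_up_read(struct ll_trunc_sem *sem)
       {
               /* the last reader out wakes any waiting truncate */
               if (atomic_dec_return(&sem->ll_trunc_readers) == 0 &&
                   atomic_read(&sem->ll_trunc_waiters))
                       wake_up_var(&sem->ll_trunc_readers);
       }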
      

          Activity

            [LU-14713] Process hung with waiting for mmap_sem
            neilb Neil Brown added a comment -

            Looking at the vmcore, "bt -FF 4718" shows:

             

            #1 [ffffbe8540f3bb88] schedule at ffffffff8f6d38a8
            ffffbe8540f3bb90: [ffff9f9b3d8c37a0:lustre_inode_cache] trunc_sem_down_read+166
            #2 [ffffbe8540f3bb98] trunc_sem_down_read at ffffffffc12523c6 [lustre]
            ffffbe8540f3bba0: [ffff9f9b3d8c37a0:lustre_inode_cache] 00000000ffffffff

             

            So ffff9f9b3d8c37a0 (the pointer stored on the stack at ffffbe8540f3bba0) must be the address of the ll_trunc_sem.

            crash> x/2wx 0xffff9f9b3d8c37a0
            0xffff9f9b3d8c37a0: 0x00000000 0x00000000

            So both ll_trunc_waiters and ll_trunc_readers are zero, which means trunc_sem_down_read() shouldn't block.

            This suggests a missed wake-up.  It could only be the wakeup from trunc_sem_up_write().  Maybe a barrier is needed after the atomic_set, and before the atomic_add_unless_negative.

            But I thought barriers like that weren't needed on x86.

            I'll read up about memory ordering again.
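
            For reference, here is the matching sketch of the exclusive (truncate) side, again reconstructed from the names in this discussion rather than copied from the tree; the comments mark the atomic_set/wakeup pair in question:

            /* Exclusive side, taken around truncate/setattr.  Sketch only. */
            static inline void trunc_sem_down_write(struct ll_trunc_sem *sem)
            {
                    /* advertise a waiting truncate so trunc_sem_down_read()
                     * backs off ... */
                    atomic_inc(&sem->ll_trunc_waiters);
                    /* ... then wait for the readers to drain and claim the
                     * semaphore (0 -> -1) */
                    wait_var_event(&sem->ll_trunc_readers,
                                   atomic_cmpxchg(&sem->ll_trunc_readers, 0, -1) == 0);
                    atomic_dec(&sem->ll_trunc_waiters);
            }

            static inline void trunc_sem_up_write(struct ll_trunc_sem *sem)
            {
                    /* release: -1 -> 0.  This is the atomic_set mentioned
                     * above; the open question is whether a barrier is needed
                     * between it and the atomic_add_unless re-check done by a
                     * sleeping reader, or whether the wake_up_var() below
                     * already provides the required ordering. */
                    atomic_set(&sem->ll_trunc_readers, 0);
                    wake_up_var(&sem->ll_trunc_readers);
            }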

             

            bobijam Zhenyu Xu added a comment -

            > I can open it, but I'm having difficulty to find 'struct ll_trunc_sem'. vmcore is available from:

            Same here.

            crash> struct ll_trunc_sem
            struct: invalid data structure reference: ll_trunc_sem

            It seems the lustre module debug info is missing.

            rpm -qpl lustre-client-debuginfo-2.12.6_ddn19-1.el8.x86_64.rpm

            shows that it does not contain debug info for the lustre modules, only for some tools.

            neilb Neil Brown added a comment -

            > The reported version is 2.12.6-ddn19

            The public "lustre-release" git tree doesn't have any branch or tag with a name like that, and the public "2.12.6" doesn't contain trunc_sem_down_read() at all.

            Is there somewhere I can get the source code for that release?

            bobijam Zhenyu Xu added a comment -

            The reported version is 2.12.6-ddn19 (both client and server). And unfortunately I failed to load the vmcore with crash:

            # crash ./120035_YITP_client_ioerror/usr/lib/debug/lib/modules/4.18.0-240.15.1.el8_3.x86_64/vmlinux 120035_YITP_client_ioerror/vmcore 
            
            
            crash 7.2.3-11.el7_9.1
            Copyright (C) 2002-2017  Red Hat, Inc.
            Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
            Copyright (C) 1999-2006  Hewlett-Packard Co
            Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
            Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
            Copyright (C) 2005, 2011  NEC Corporation
            Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
            Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
            This program is free software, covered by the GNU General Public License,
            and you are welcome to change it and/or distribute copies of it under
            certain conditions.  Enter "help copying" to see the conditions.
            This program has absolutely no warranty.  Enter "help warranty" for details.
             
            GNU gdb (GDB) 7.6
            Copyright (C) 2013 Free Software Foundation, Inc.
            License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
            This is free software: you are free to change and redistribute it.
            There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
            and "show warranty" for details.
            This GDB was configured as "x86_64-unknown-linux-gnu"...
            
            
            WARNING: kernel relocated [222MB]: patching 97897 gdb minimal_symbol values
            
            
            please wait... (gathering task table data)                             
            crash: radix trees do not exist or have changed their format
            
             
            neilb Neil Brown added a comment -

            Another question: if you still have the core file and can find the 'struct ll_trunc_sem' that the process is blocked on, could you report its contents? I'm guessing the two atomic_t's are -1 and 0, but it would be highly informative if they had other values. Maybe some sort of memory corruption caused the problem.

             

            neilb Neil Brown added a comment -

            Could you please report precisely which version of Lustre is in use (ideally the git commit hash)? The "Affected versions" field above says "None", which isn't helpful.

             

            neilb Neil Brown added a comment -

            Thanks for the excellent explanation.  The bug must be hiding somewhere else then.  I'll keep digging.

             

            bobijam Zhenyu Xu added a comment -

            The clio layer always (at least for now) stacks VVP on top of LOV on top of OSC/MDC, and cl_io_start() calls the layers from top to bottom: vvp_io_xxx_start() first, then lov_io_xxx_start(), which nests the {mdc|osc}_io_xxx_start() calls. That makes vvp_io_setattr_start() the first function called in the CIT_SETATTR cl_io_loop() path.

            As the following debug log shows:

            00000020:00000001:0.0:1622711194.323939:0:20024:0:(cl_io.c:755:cl_io_loop()) Process entered
            00000020:00000001:0.0:1622711194.323940:0:20024:0:(cl_io.c:390:cl_io_iter_init()) Process entered
            ...
            00000020:00000001:0.0:1622711194.323955:0:20024:0:(cl_io.c:402:cl_io_iter_init()) Process leaving (rc=0 : 0 : 0)
            00000020:00000001:0.0:1622711194.323958:0:20024:0:(cl_io.c:311:cl_io_lock()) Process entered
            ...
            00000020:00000001:0.0:1622711194.324421:0:20024:0:(cl_io.c:331:cl_io_lock()) Process leaving (rc=0 : 0 : 0)
            00000020:00000001:0.0:1622711194.324422:0:20024:0:(cl_io.c:516:cl_io_start()) Process entered
            00000080:00000001:0.0:1622711194.324422:0:20024:0:(vvp_io.c:737:vvp_io_setattr_start()) Process entered
            ...
            00000080:00000001:0.0:1622711194.324427:0:20024:0:(vvp_io.c:763:vvp_io_setattr_start()) Process leaving (rc=0 : 0 : 0)
            00020000:00000001:0.0:1622711194.324427:0:20024:0:(lov_io.c:1487:lov_io_setattr_start()) Process entered
            ...
            00000008:00000001:0.0:1622711194.324429:0:20024:0:(osc_io.c:582:osc_io_setattr_start()) Process entered
            ...
            00000008:00000001:0.0:1622711194.324459:0:20024:0:(osc_io.c:687:osc_io_setattr_start()) Process leaving (rc=0 : 0 : 0)
            ...
            00020000:00000001:0.0:1622711194.324461:0:20024:0:(lov_io.c:1502:lov_io_setattr_start()) Process leaving (rc=0 : 0 : 0)
            00000020:00000001:0.0:1622711194.324462:0:20024:0:(cl_io.c:528:cl_io_start()) Process leaving (rc=0 : 0 : 0)
            

            So basically vvp_io_setattr_start() will always be called for CIT_SETATTR in cl_io_start().
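
            To illustrate the dispatch order, here is a simplified sketch of cl_io_start() (the cl_io.c routine in the obdclass module seen in the stack above); invariant checks and tracing are dropped, so treat it as an outline rather than the exact code:

            int cl_io_start(const struct lu_env *env, struct cl_io *io)
            {
                    const struct cl_io_slice *scan;
                    int result = 0;

                    io->ci_state = CIS_IO_GOING;
                    /* ci_layers is ordered top to bottom (vvp, lov, osc/mdc),
                     * so for CIT_SETATTR the vvp slice runs first */
                    list_for_each_entry(scan, &io->ci_layers, cis_linkage) {
                            if (scan->cis_iop->op[io->ci_type].cio_start == NULL)
                                    continue;
                            result = scan->cis_iop->op[io->ci_type].cio_start(env, scan);
                            if (result != 0)
                                    break;  /* remaining (lower) layers are skipped */
                    }
                    return result > 0 ? 0 : result;
            }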


            adilger Andreas Dilger added a comment -

            > @Andreas does this seem at all possible?

            That is really a question for Bobijam to answer, as I'm not nearly as expert in the client IO code as he is.
            neilb Neil Brown added a comment -

            Thanks for the crash data.  I could not find the other process that I was sure must exist.

            The only process that looks at all suspicious is PID 46556 (ll_imp_inval) and I suspect it is waiting on the a.out thread, not the other way around.

            It is very likely that the ll_trunc_sem that a.out is blocked on has ll_trunc_waiters == 0 and ll_trunc_readers == -1.

            There are two ways this can happen:

            1/ a thread calls trunc_sem_down_write, but never calls trunc_sem_up_write

            2/ a thread calls trunc_sem_up_write without having first called trunc_sem_down_write. If it does this while some other thread is holding a read lock, then the up_write call will set ll_trunc_readers to 0, and when the other thread calls trunc_sem_up_read, ll_trunc_readers will be decremented to -1.

            I cannot see a code path that leads to option 1.  I can see a code path that might lead to option 2, but there may be considerations that I'm not aware of which make it impossible.  I'll explain it in case it is possible.

            trunc_sem_up/down_write are only called from vvp_io_setattr_start/end.  These are only called from cl_io_start/end which are always paired.

            They are called directly in ll_readahead_handle_work and cl_io_loop, where cl_io_end will always follow cl_io_start, and indirectly through lov_io_start/end, which will be paired properly exactly if vvp_io_setattr_start/end are paired properly.

            cl_io_end will always call all of the ->cio_end functions, so missing a call to trunc_sem_up_write is (should be) impossible.
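
            In outline, cl_io_end() looks like the following simplified sketch (assertions dropped), which is why every layer's cio_end runs no matter how far cl_io_start() got:

            void cl_io_end(const struct lu_env *env, struct cl_io *io)
            {
                    const struct cl_io_slice *scan;

                    /* walk all layers, bottom to top; nothing here skips a
                     * layer based on what cl_io_start() managed to do */
                    list_for_each_entry_reverse(scan, &io->ci_layers, cis_linkage) {
                            if (scan->cis_iop->op[io->ci_type].cio_end != NULL)
                                    scan->cis_iop->op[io->ci_type].cio_end(env, scan);
                    }
                    io->ci_state = CIS_IO_FINISHED;
            }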

            However cl_io_start will sometimes not call all of the cio_start functions. If one of the cio_start functions in the stack returns non-zero, the remainder are not called. However the matching cio_end functions will still be called. This seems very strange. I might be missing something, but it certainly seems that is what would happen.

            So if some cio_start function before vvp_io_setattr_start returned non-zero, then trunc_sem_down_write would not be called, but trunc_sem_up_write would. This would often not cause a problem. But if some other thread had called trunc_sem_down_read(), it would produce exactly the symptom you observe.
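
            To make the arithmetic of that scenario concrete, here is a small stand-alone model (hypothetical interleaving; plain C11 atomics stand in for the kernel atomic_t, and the names only mirror this ticket):

            #include <stdatomic.h>
            #include <stdio.h>

            int main(void)
            {
                    atomic_int ll_trunc_readers = 0;  /* >0 readers, -1 writer, 0 free */
                    atomic_int ll_trunc_waiters = 0;  /* waiting writers */

                    /* a.out's write takes the shared side: 0 -> 1 */
                    atomic_fetch_add(&ll_trunc_readers, 1);

                    /* setattr path: a cio_start above vvp fails, so the
                     * down_write is skipped, yet cl_io_end still runs the
                     * up_write, i.e. atomic_set(readers, 0): 1 -> 0 */
                    atomic_store(&ll_trunc_readers, 0);

                    /* the write finishes: up_read decrements 0 -> -1 */
                    atomic_fetch_sub(&ll_trunc_readers, 1);

                    /* every later down_read on this inode now blocks, and no
                     * writer is left to reset the count or issue a wakeup */
                    printf("ll_trunc_readers=%d ll_trunc_waiters=%d\n",
                           atomic_load(&ll_trunc_readers),
                           atomic_load(&ll_trunc_waiters));
                    return 0;
            }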

            @Andreas does this seem at all possible?

             


            mnishizawa Mitsuhiro Nishizawa added a comment -

            Ah, we usually don't share our customer's data publicly, so I attached it to DDN-2177, but it would be better to dig into the root cause.
            Here is the stack trace: [^crash.log]

            I may delete the attachment later.

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: bobijam Zhenyu Xu
              Votes: 0
              Watchers: 11
