[LU-13137] User process segfaults since 2.13 client upgrade Created: 15/Jan/20 Updated: 14/May/20 Resolved: 14/May/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | Qian Yingjin |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Environment: |
CentOS 7.6, Lustre client 2.13.0 from WC |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We're not 100% sure this comes from Lustre, but it is the main hint we have so far. We started upgrading our clients to Lustre 2.13 on Sherlock in mid-December 2019, and it took about three weeks to upgrade the whole cluster in a rolling fashion. All clients are now on 2.13. Note that no other changes have been made since then (e.g. no system, kernel, or OFED upgrades).

Users have been reporting random segmentation faults from all clients. In the few cases that we checked, the binary executed was stored in Lustre, although we're not sure at this point that this is always the case. Examples of segfaults:

2020-01-14T10:47:50-08:00 sh-116-04 kernel: python[44357]: segfault at a4996 ip 00000000000a4996 sp 00007fc7c945dee8 error 14
2020-01-14T10:46:22-08:00 sh-108-38 kernel: angsd[335614]: segfault at 0 ip (null) sp 00007ffeec25ab08 error 14
2020-01-14T10:33:27-08:00 sh-107-31 kernel: minimap2[100552]: segfault at 2206 ip 0000000000002206 sp 00007f942ce0cbf8 error 14
2020-01-14T10:33:23-08:00 sh-107-10 kernel: samtools[305687]: segfault at f936 ip 000000000000f936 sp 00007ffebc4dfd18 error 14 in samtools[5631f2a6d000+7000]
2020-01-14T10:32:49-08:00 sh-ln06 kernel: conda[15953]: segfault at 3206 ip 0000000000003206 sp 00007fe3c0495a18 error 14 in python3.7[5590a194e000+5b000]
2020-01-14T10:25:19-08:00 sh-27-30 kernel: gatk[123897]: segfault at 0 ip (null) sp 00007ffea874efe8 error 14 in python3.8[556487517000+5e000]
2020-01-14T10:25:08-08:00 sh-27-29 kernel: bwa[39718]: segfault at 0 ip (null) sp 00007f1eebffec08 error 14 in bwa[400000+59000]

Most are error 14. From https://utcc.utoronto.ca/~cks/space/blog/linux/KernelSegfaultErrorCodes:

"error 14: attempt to execute code from an unmapped area. This is the sign of trying to call through a mangled function pointer (or a NULL one), or perhaps returning from a call when the stack is in an unexpected or corrupted state so that the return address isn't valid. One source of mangled function pointers is use-after-free issues where the (freed) object contains embedded function pointers. (Error 14 with a faulting address of 0 often means a function call through a NULL pointer, which in turn often means 'making an indirect call to a function without checking that it's defined'. There are various larger scale causes of this in code.)"

A core dump of one of these segfaults didn't give me any more insight either:

$ gdb -c core.118324 /scratch/users/xunger08/run_debug_2018/dqmc_stack
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /scratch/users/xunger08/run_debug_2018/dqmc_stack...done.
[New LWP 118364]
[New LWP 118324]
Core was generated by `/scratch/users/xunger08/run_debug_2018/./dqmc_stack -t 35000 /scratch/users/xun'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000000000 in ?? ()
(gdb) thread apply all bt

Thread 2 (LWP 118324):
#0  0x00007f393383cf19 in ?? ()
#1  0xcccccccccccccccc in ?? ()
#2  0x00007f3931ceae00 in ?? ()
#3  0x00007f3900000001 in ?? ()
#4  0xcccccccc00000001 in ?? ()
#5  0x00007ffd624a05e0 in ?? ()
#6  0x00007f3900000000 in ?? ()
#7  0x00007f3931ce9d00 in ?? ()
#8  0x00007ffd00000001 in ?? ()
#9  0x00007ffd00000001 in ?? ()
#10 0x00007f3931cea000 in ?? ()
#11 0x0009b4232aa60776 in ?? ()
#12 0x00007ffd00000000 in ?? ()
#13 0x00007f3931cea000 in ?? ()
#14 0x00007ffd00000001 in ?? ()
#15 0x000000005c1b37a8 in ?? ()
#16 0x00007ffd624a09f0 in ?? ()
#17 0x00007ffd00000000 in ?? ()
#18 0x0000000000000001 in ?? ()
#19 0x0000000000000000 in ?? ()

Thread 1 (LWP 118364):
#0  0x0000000000000000 in ?? ()
#1  0x0000000000000000 in ?? ()

We're strongly wondering if this could be related to the recent 2.12 to 2.13 client upgrade. Checking our logs through Splunk shows that the issue started at the same time we began upgrading our Lustre clients to 2.13, which is why I'm opening this ticket. Could this be a comeback of an exec-in-Lustre issue? Any suggestions are welcome; otherwise, we'll start downgrading to 2.12 LTS and see if that fixes the problem. Thanks! |
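For anyone triaging similar reports, a quick way to confirm whether a crashing binary actually lives on Lustre and to collect the kernel segfault lines. This is a minimal sketch: the binary path is a placeholder, and the human-readable filesystem name printed by stat depends on the coreutils version.

    # filesystem type backing the executable ("lustre" expected for a Lustre mount)
    stat -f -c '%T  %n' /path/to/binary
    # recent segfault messages from the kernel log
    dmesg -T | grep -i 'segfault at' | tail -n 20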
| Comments |
| Comment by Andreas Dilger [ 15/Jan/20 ] |
|
Stephane, looking at the commit messages and affected files for patches that were landed in 2.13, I see the following that are related to mmap:

https://review.whamcloud.com/35500 LU-12400 llite: Use the new vm_fault_t type
https://review.whamcloud.com/34247 LU-11403 llite: ll_fault fixes [in 2.12.3]
https://review.whamcloud.com/34242 LU-8299 llite: ll_fault should fail for insane file offsets
https://review.whamcloud.com/32966 LU-10092 pcc: Non-blocking PCC caching
https://review.whamcloud.com/32963 LU-10092 llite: Add persistent cache on client

The first one looks harmless (mostly a compile fix for new kernels). The second one looks like it is the most likely to affect the mmap/page fault code, but it was also backported to 2.12.3, so I think you would already have been running it on your clients? If you were previously running 2.12.2 on your clients then this would be a likely candidate, but otherwise it seems unlikely. The PCC patches did modify llite_mmap.c but shouldn't have affected the functionality unless PCC was configured on the client. |
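For reference, this kind of candidate search can be reproduced from a lustre-release checkout. A sketch, using the same revision range quoted later in this ticket:

    # patches between 2.12.50 and 2.13.50 that touch the llite mmap code
    git log --oneline 2.12.50..2.13.50 -- lustre/llite/llite_mmap.c
    # or search commit messages for mmap-related landings in the same range
    git log --oneline --grep='mmap' 2.12.50..2.13.50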
| Comment by Stephane Thiell [ 15/Jan/20 ] |
|
Hi Andreas, We're still investigating this. Since yesterday:
- Users are reporting that crashes are quite random and happen after hours of run time; some of them seem to be able to reproduce it.
- We have also checked: we were using 2.12.3 on the clients before upgrading to 2.13, and we had not noticed this many segfaults with 2.12.3. |
| Comment by Peter Jones [ 16/Jan/20 ] |
|
Bobijam, can you please advise? Thanks, Peter |
| Comment by Stanford Research Computing Center [ 16/Jan/20 ] |
|
Hi all, I just wanted to add that we had several users who confirmed that they moved their binaries from Lustre (Fir or Oak) to NFS filesystems and that they haven't encountered any segfault since then. Thanks! Kilian |
| Comment by Stephane Thiell [ 17/Jan/20 ] |
|
Good news: Kilian and I were able to reproduce the problem this morning! We figured out that this problem is related to client-side LRU cached locks, because most of the binaries on Lustre crashed after 65 minutes, which is the default value of lru_max_age. A segmentation fault can easily be reproduced on a Lustre 2.13 client by executing a binary stored in Lustre and then clearing the locks (MDT-only seems to be enough, even without DoM):

1. Execute any binary stored in Lustre: ./binary
2. At the same time, clear the locks: lctl set_param ldlm.namespaces.*MDT*.lru_size=clear

This is also reproducible by reducing lru_max_age to a low value, something like 30s; the binary will then segfault after exactly 30 seconds. A scripted version of the reproducer is sketched below. As a partial workaround, we have increased lru_max_age on all Sherlock clients to 1209600s (14 days), which matches our maximum allowed job time, but I don't think this will be enough to avoid all segfaults. |
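A scripted version of the reproducer, for convenience. This is only a sketch: the BINARY path is a placeholder for any executable stored on a Lustre mount, and lctl set_param has to be run as root.

    #!/bin/bash
    # Reproducer sketch for the exec-vs-lock-cancel segfault described above.
    BINARY=/lustre/path/to/some_binary   # placeholder: any executable on Lustre
    "$BINARY" &                          # start the executable from the Lustre mount
    PID=$!
    sleep 5                              # give it time to map its text pages
    # cancel the client's cached MDT DLM locks while the program is still running
    lctl set_param 'ldlm.namespaces.*MDT*.lru_size=clear'
    wait "$PID"; echo "exit status: $?"  # 139 (128+SIGSEGV) on an affected 2.13 client

    # partial workaround used on Sherlock: raise the lock LRU age to 14 days
    # (value and "s" suffix as quoted in the comment above)
    # lctl set_param 'ldlm.namespaces.*.lru_max_age=1209600s'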
| Comment by Andreas Dilger [ 18/Jan/20 ] |
|
Bobijam, probably the easiest way to debug this would be to make a patch with a sanity test for master that triggers the failure as described by Stephane, run only that test with "Test-Parameters: fortestonly testlist=sanity envdefinitions=ONLY=N", then use git bisect to cherry-pick the test to different versions of Lustre between 2.12.50 and 2.13.50.

One idea that has been proposed in the past is to write a script similar to git bisect that submits several versions of the patch (maybe 33 by default, with different Change-Id: values) to Gerrit so that they can be run in parallel. This would likely be able to isolate the problem within a few hours, since "git log --pretty=oneline 2.12.50..2.13.50" reports 1049 patches. In the first set of 33 patches, the patch at 2.12.50 should pass and the patch at 2.13.50 should fail, with the transition somewhere in between, leaving 32 or 33 patches between the passing and failing patches. A second set of patches could then be generated to isolate the failure to a single patch if the source of the problem still isn't clear. Having a script to do this would be useful for isolating all kinds of problems in the future, so it should be checked in to Git as well.

It might be necessary to run the subtest multiple times to ensure we get a strong verification of whether the subtest is passing or failing, in case it is an intermittent failure, so it would be useful to allow run_one_logged() to call the same subtest multiple times in a row (e.g. with ONLY_REPEAT=N in envdefinitions). |
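A local, serial version of this approach might look like the sketch below. It assumes the reproducer has been added as a sanity.sh subtest; the subtest number 255b and the cherry-picked commit are placeholders.

    # from a built lustre-release tree, with a test filesystem mounted
    cd lustre/tests
    for i in $(seq 10); do                 # repeat to catch intermittent failures
        ONLY=255b bash sanity.sh || { echo "failed on iteration $i"; break; }
    done
    # when bisecting, cherry-pick the new subtest onto each candidate revision first:
    #   git bisect start 2.13.50 2.12.50    # bad tag first, then good tag
    #   git cherry-pick <test-case-commit>  # placeholder: the sanity.sh test patch
    #   (rebuild, remount, rerun the loop, then "git bisect good" or "git bisect bad")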
| Comment by Gerrit Updater [ 19/Jan/20 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/37278 |
| Comment by Gerrit Updater [ 19/Jan/20 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/37279 |
| Comment by Zhenyu Xu [ 21/Jan/20 ] |
|
$ git bisect good f172b116885753d0f316549a2fb9d451e9b4bd2e |
| Comment by Stephane Thiell [ 23/Jan/20 ] |
|
Hello! Is a patch against master being actively worked on? Even with our lru_max_age workaround, we are seeing up to 200 segfaults per hour on Sherlock. We have told our users to move all executables out of Lustre (we have an NFS filer for that), but some have large software setups on Lustre. I'm asking so we can decide whether to downgrade all our clients to 2.12, or whether we'll have a chance to apply a patch on top of 2.13 (within the next few days) that fixes this issue. Thanks much! |
| Comment by Andreas Dilger [ 23/Jan/20 ] |
|
Stephane, I'm assuming you are not using PCC, so if we supply a temporary patch that breaks PCC functionality to get the executables working again, that would be OK? I have an idea what the culprit is, so I could "fix" this, at the likely cost of breaking PCC. |
| Comment by Stephane Thiell [ 23/Jan/20 ] |
|
Hi Andreas, no, we're not using PCC (yet!). Thanks! |
| Comment by Andreas Dilger [ 23/Jan/20 ] |
|
In particular, I think it is the following hunk that is causing the problem:
@@ -487,7 +499,7 @@ int ll_teardown_mmaps(struct address_space *mapping, __u64 first, __u64 last)
if (mapping_mapped(mapping)) {
rc = 0;
unmap_mapping_range(mapping, first + PAGE_SIZE - 1,
- last - first + 1, 0);
+ last - first + 1, 1);
}
RETURN(rc);
This is causing all mmapped pages to be cleared, which is probably not the right thing. I will submit a patch to test that theory shortly, but you may be able to test/deploy a patch that reverts that change faster than I can. |
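To see every llite call site of unmap_mapping_range() and how each one changed (useful in hindsight, since a second changed call turned up later in this ticket), a sketch run from a lustre-release checkout:

    # all unmap_mapping_range() callers under llite/
    grep -rn 'unmap_mapping_range' lustre/llite/
    # how those calls changed between the releases discussed in this ticket
    git log -p -S 'unmap_mapping_range' 2.12.50..2.13.50 -- lustre/llite/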
| Comment by Andreas Dilger [ 23/Jan/20 ] |
|
I've updated https://review.whamcloud.com/37278 (the test case) with a potential fix. We'll see in an hour or so whether it solves the problem, since the test case could reproduce the problem easily. |
| Comment by Stephane Thiell [ 23/Jan/20 ] |
|
Andreas, I built 2.13 plus the revert of that change and just tested it on a client, but I'm still able to reproduce the segfault when executing a binary and clearing the lock LRU. |
| Comment by Andreas Dilger [ 23/Jan/20 ] |
|
It looks like there was a second case of unmap_mapping_range() being changed by the PCC patch. v12 of the patch reverts both of those changes, and it is now passing the regression test that previously failed. Please skip v11 of the patch, which has only the ll_teardown_mmaps() change; you need the change in vvp_conf_set() as well. |
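For anyone who wants to test-build this locally before it lands, a patchset can be pulled straight from Gerrit. A sketch only: the refs/changes path follows the usual Gerrit convention, and the patchset number matches the v12 mentioned above.

    # in a lustre-release checkout, fetch and apply patchset 12 of change 37278
    git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/78/37278/12
    git cherry-pick FETCH_HEAD
    # then rebuild and reinstall the client packages as usual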
| Comment by Stephane Thiell [ 23/Jan/20 ] |
|
Andreas, I tested with the two changes as in v12 on top of 2.13 and it seems to fix the problem: no binary segfault anymore when clearing the LRU manually or reducing lru_max_age. We might perform additional tests and wait for the test results from Gerrit before deploying it. Thanks |
| Comment by Andreas Dilger [ 24/Jan/20 ] |
|
You can see from the initial test results that the patch is passing the fault-specific regression tests (it runs each newly-added test 10x for <ldiskfs,ZFS>x<single,DNE> in the "special" sessions, and each of those tests runs exec+lock cancel 10x), in addition to the 3x custom tests, so this test has already passed a few hundred times when it was otherwise failing 100% of the time. Also, since this is just reverting the behaviour to that of 2.12.x and earlier, the risk is very low. Note that there is currently a totally unrelated regression in ZFS testing (https://review.whamcloud.com/37320) that is causing all ZFS test sessions to fail, so do not be alarmed when you see the patch fail all of the ZFS test sessions. |
| Comment by Qian Yingjin [ 14/Apr/20 ] |
|
This is an mmap problem for PCC when multiple clients access a shared mmapped file. Fixing it seems difficult without patching the client kernel. As Lixi suggested, it seems reasonable to relax the mmap semantics when using PCC on a file shared among different clients. |
| Comment by Gerrit Updater [ 14/May/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37278/ |
| Comment by Peter Jones [ 14/May/20 ] |
|
Landed for 2.14 |