[LU-13588] sigbus sent to mmap writer that is a long way below quota Created: 19/May/20 Updated: 27/Jun/20 Resolved: 25/Jun/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.8, Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | SC Admin (Inactive) | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
centos 7.8, zfs 0.8.3, lustre 2.12.4 on servers |
||
| Epic/Theme: | Quota |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hi,

we've been seeing SIGBUS from a tensorflow build, and possibly other builds and codes, since moving to 2.12.4 on servers. we moved to centos 7.8 on servers and clients at the same time. our previous Lustre version on servers was 2.10.5 (plus many patches) with zfs 0.7.9. the old server lustre versions had no issues with SIGBUS that we know of. we have been running 2.10.8 on clients for about 6 months and that is unchanged.

after a week or so narrowing down the issue, we have found a reproducer in a tensorflow build ld step that will reliably SIGBUS, and have also found that this is related to group block quotas.

the .so file that ld (ld.gold, collect2) is writing into is initially nulls and sparse, is about 210M in size, is mmap'd, and probably receives a lot (>600k) of small memcpy/memset's into the file before it gets a SIGBUS. a strace -f -t snippet is

62275 16:15:47 mmap(NULL, 258627360, PROT_READ|PROT_WRITE, MAP_SHARED, 996, 0) = 0x2b3b15905000
62275 16:15:48 --- SIGBUS {si_signo=SIGBUS, si_code=BUS_ADRERR, si_addr=0x2b3b23a7cd23} ---
62275 16:15:48 +++ killed by SIGBUS +++
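
The addresses in the strace output above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the three values are copied from the trace, while the 4 KiB page size is an assumption (typical for x86_64) that is not stated in the ticket.

```python
# Values copied from the strace snippet in the ticket.
map_base = 0x2B3B15905000   # address returned by mmap()
map_len = 258627360         # length argument passed to mmap()
si_addr = 0x2B3B23A7CD23    # faulting address reported with the SIGBUS

PAGE_SIZE = 4096  # assumption: 4 KiB pages, not stated in the ticket

# Offset of the faulting address relative to the start of the mapping.
offset = si_addr - map_base
in_bounds = 0 <= offset < map_len

# Which page of the mapping was touched, and where inside that page.
page_index, page_offset = divmod(offset, PAGE_SIZE)

print(f"offset into mapping: {offset} bytes ({offset / 2**20:.1f} MiB)")
print(f"inside the mapped range: {in_bounds}")
print(f"page {page_index}, byte {page_offset} within that page")
```

If the offset falls inside the mapped length, the fault was raised on a legitimate store into the file rather than on a stray pointer.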
if that value of si_addr is correct, then it's well within the size of the file, so it doesn't look like ld is writing in the wrong place. ltrace also shows no addresses out of bounds.

it gets interesting if we change the group quota limit on the account. if there is less than ~9TB of group quota free in the account, then we reliably get a SIGBUS.

# lfs setquota -g oz997 -B 6000000000 /fred
to
# lfs setquota -g oz997 -B 14000000000 /fred

where only about 5TB is actually used in the account ->

# lfs quota -g oz997 /fred
Disk quotas for grp oz997 (gid 10273):
Filesystem kbytes quota limit grace files quota limit grace
/fred 5166964968 0 6000000000 - 936838 0 2000000 -
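
The free quota under the two block limits mentioned in this ticket can be worked out directly from the figures above. This is a rough sketch, assuming the kbytes column and the -B values are in units of 1024 bytes; the labels attached to each limit reflect what the reporter observed, not anything the code measures.

```python
# Figures copied from the ticket (block usage and block hard limits, in kbytes).
used_kb = 5166964968
limits_kb = {
    "limit 6000000000 (SIGBUS reported)": 6000000000,
    "limit 14000000000 (no SIGBUS reported)": 14000000000,
}

BYTES_PER_KB = 1024  # assumption: lfs reports kbytes as 1024-byte units


def free_kb(limit_kb, used_kb):
    """Remaining block quota, in kbytes, before the hard limit is reached."""
    return limit_kb - used_kb


for label, limit in limits_kb.items():
    free = free_kb(limit, used_kb)
    free_tb = free * BYTES_PER_KB / 1e12
    print(f"{label}: {free} kbytes free (about {free_tb:.2f} TB)")
```

Under that assumption the smaller limit leaves under 1 TB of headroom and the larger one leaves roughly 9 TB, which matches the threshold the reporter describes.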
then we get a sigbus ->

$ /apps/skylake/software/core/gcccore/6.4.0/bin/gcc @bazel-out/k8-py2-opt/bin/tensorflow/python/_pywrap_tensorflow_internal.so-2.params
collect2: fatal error: ld terminated with signal 7 [Bus error]
compilation terminated.

but when there is ~9TB free quota, or more ->

# lfs setquota -g oz997 -B 14000000000 /fred

then we do not see a SIGBUS and the ld step completes ok.

other things to mention

on the surface this seems similar to

as we have a reproducer and a test methodology, we could probably build a 2.12.4 client image and try that if you would find that useful. we weren't planning to move to 2.12.x client in production just yet, but we could try it as an experiment.

cheers, |
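
The write pattern described in this ticket (a sparse file, mapped MAP_SHARED, receiving many small stores) can be approximated with a short script. This is a hypothetical sketch and not the reporter's reproducer, which was an ld.gold link step; the function name, chunk size and random offsets are all assumptions, and whether it triggers the SIGBUS on an affected client is untested. If a SIGBUS were delivered, the Python process would simply be killed, much like ld was.

```python
import mmap
import os
import random
import tempfile


def sparse_mmap_writer(path, size, n_writes, chunk=64, seed=0):
    """Create a sparse file of `size` bytes and scatter small writes into it
    through a shared, writable mapping. Returns the number of bytes stored."""
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # Extend the empty file to its final size; this leaves a hole,
        # so the file is sparse until pages are dirtied and written back.
        os.ftruncate(fd, size)
        with mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as m:
            written = 0
            for _ in range(n_writes):
                off = rng.randrange(0, size - chunk)
                m[off:off + chunk] = b"\xff" * chunk  # small memset-like store
                written += chunk
            m.flush()  # write the dirty pages back to the file
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    # Small demo on a local temporary file. To mimic the ticket, point `path`
    # at a file on the Lustre mount and raise size/n_writes (~250 MB, >600k).
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "sparse.bin")
        total = sparse_mmap_writer(p, size=1 << 20, n_writes=1000)
        print(total, os.path.getsize(p))
```

Running it once under a low group block limit and once under a high one would follow the same test methodology as the lfs setquota steps in the description.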
| Comments |
| Comment by Oleg Drokin [ 19/May/20 ] |
|
Any chance you can still give this patch a try? https://review.whamcloud.com/38292

I have seen it failing even in the total absence of quotas, just based on grant dynamics, which is the other way you can get that codepath triggered. |
| Comment by SC Admin (Inactive) [ 21/May/20 ] |
|
Hi Oleg, I tried 2.12.4 client (no patches except a build patch for rhel7.8) instead of 2.10.8, and the sigbus issue is still there. do I need to apply the patch in https://review.whamcloud.com/38292 to the servers as well as clients? cheers, |
| Comment by Oleg Drokin [ 21/May/20 ] |
|
no, it's a client only patch |
| Comment by SC Admin (Inactive) [ 22/May/20 ] |
|
Hi Oleg, 2.12.4 + the patch in https://review.whamcloud.com/38292 seems to have fixed it. thanks! BTW any idea if 2.12.5 is out soon? cheers, |
| Comment by Peter Jones [ 22/May/20 ] |
|
scadmin yes 2.12.5 should be out soon - we're aiming to have an RC next week |
| Comment by Peter Jones [ 06/Jun/20 ] |
|
Robin

We're at an advanced stage of release testing on RC1 and so far so good

Peter |
| Comment by Peter Jones [ 13/Jun/20 ] |
|
Robin

2.12.5 is now GA

Peter |
| Comment by SC Admin (Inactive) [ 25/Jun/20 ] |
|
Hi Oleg and Peter, we have all clients at 2.12.5 now, and no sign of SIGBUS. please close this ticket. cheers, |
| Comment by Peter Jones [ 25/Jun/20 ] |
|
Good news! Thanks |