[LU-13588] sigbus sent to mmap writer that is a long way below quota Created: 19/May/20  Updated: 27/Jun/20  Resolved: 25/Jun/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.8, Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: SC Admin (Inactive) Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None
Environment:

centos 7.8, zfs 0.8.3, lustre 2.12.4 on servers
zfs compression is enabled on OSTs.
centos 7.8, lustre 2.10.8 on clients
all x86_64
group block and inode quotas set and enforcing


Epic/Theme: Quota
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hi,

we've been seeing SIGBUS from a tensorflow build, and possibly other builds and codes, since moving to 2.12.4 on servers. we moved to centos 7.8 on servers and clients at the same time. our previous Lustre version on servers was 2.10.5 (plus many patches) and zfs 0.7.9. the old server lustre versions had no issues with SIGBUS that we know of. we have been running 2.10.8 on clients for about 6 months and that is unchanged.

after a week or so narrowing down the issue, we have found a reproducer in a tensorflow build ld step that will reliably SIGBUS, and have also found that this is related to group block quotas.

the .so file that ld (ld.gold, collect2) is writing into is initially all nulls and sparse, is about 210M in size, is mmap'd, and probably receives a lot (>600k) of small memcpy/memset calls into the file before it gets a SIGBUS.

a strace -f -t snippet is

62275 16:15:47 mmap(NULL, 258627360, PROT_READ|PROT_WRITE, MAP_SHARED, 996, 0) = 0x2b3b15905000
62275 16:15:48 --- SIGBUS {si_signo=SIGBUS, si_code=BUS_ADRERR, si_addr=0x2b3b23a7cd23} ---
62275 16:15:48 +++ killed by SIGBUS +++

if that value of si_addr is correct, then it's well within the size of the file, so it doesn't look like ld is writing in the wrong place. ltrace also shows no addresses out of bounds.
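for reference, here is a minimal sketch (mine, not from the ticket; the filename, sizes and loop counts are made up) of the write pattern described above: mmap a sparse output file MAP_SHARED and dirty it with many small writes. if the filesystem cannot allocate blocks (or, on Lustre, obtain grant/quota) for a dirtied page, the kernel delivers SIGBUS at the faulting address, which is what the strace output shows.

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* roughly the size of the .so being linked */
#define FILESIZE (210UL << 20)

/* report where the fault landed, like si_addr in the strace output */
static void bus_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    fprintf(stderr, "SIGBUS at %p\n", si->si_addr);
    _exit(1);
}

int main(void)
{
    struct sigaction sa = { 0 };
    sa.sa_sigaction = bus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGBUS, &sa, NULL);

    /* create a sparse file of nulls, like the initial .so */
    int fd = open("out.so", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, FILESIZE) < 0) { perror("ftruncate"); return 1; }

    char *map = mmap(NULL, FILESIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* many small scattered writes, like the >600k memcpy/memsets from ld.gold */
    for (size_t off = 0; off + 64 <= FILESIZE; off += 256)
        memset(map + off, 0xab, 64);

    if (msync(map, FILESIZE, MS_SYNC) < 0) perror("msync");
    munmap(map, FILESIZE);
    close(fd);
    return 0;
}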

it gets interesting if we change the group block quota limit on the account. if there is less than ~9TB of group quota free in the account, then we reliably get a SIGBUS.
i.e. with a limit anywhere in the range ->

# lfs setquota  -g oz997 -B 6000000000 /fred

to

# lfs setquota  -g oz997 -B 14000000000 /fred

where only about 5TB is actually used in the account ->

# lfs quota  -g oz997 /fred
Disk quotas for grp oz997 (gid 10273):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
          /fred 5166964968       0 6000000000       -  936838       0 2000000       -

then we get a SIGBUS ->

 $ /apps/skylake/software/core/gcccore/6.4.0/bin/gcc @bazel-out/k8-py2-opt/bin/tensorflow/python/_pywrap_tensorflow_internal.so-2.params
collect2: fatal error: ld terminated with signal 7 [Bus error]
compilation terminated.

but when there is ~9TB free quota, or more ->

# lfs setquota  -g oz997 -B 14000000000 /fred

then we do not see a SIGBUS and the ld step completes ok.
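for reference, a quick back-of-the-envelope from the numbers above (my arithmetic, not part of the original report), showing how the two limits relate to the ~9TB-free threshold:

  6,000,000,000 KB limit - 5,166,964,968 KB used ~   833,035,032 KB ~ 0.8 TB free
 14,000,000,000 KB limit - 5,166,964,968 KB used ~ 8,833,035,032 KB ~ 8.8 TB free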

other things to mention

  • we have tried various different (much newer) gcc versions and they all see the same thing.
  • as far as I can tell from strace and ltrace output of the memcpy/memsets, all of the addresses it is writing to are well within the bounds of the file and so should not be getting a SIGBUS, i.e. it's probably not a bug in ld.gold.
  • ld.gold is the default linker. if we pick ld.bfd instead, then ld.bfd does ordinary (not mmap'd) i/o to the output .so and that succeeds with the smallest quota above, so this just seems to affect mmap'd i/o (a write()-path contrast sketch follows this list).
  • we've tried a couple of different user and group accounts and the pattern is similar, so I don't think it's anything odd in an account's limits or settings.
  • another user with a much larger quota is also seeing SIGBUS on a build, but that group is within 30T of a 2P quota, so is "close" to going over by some measure. I haven't dug into that report yet, but I suspect it's the same issue as this one.
  • builds to XFS work ok. I haven't tried XFS with a quota.
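to illustrate the ld.bfd point above, here is an equally hypothetical sketch (not from the ticket; filename and sizes made up) of the ordinary write() path: when block allocation or quota fails here, the application gets an errno such as EDQUOT or ENOSPC back from write() and can report it, instead of being killed by SIGBUS mid-memcpy.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("out.so", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    memset(buf, 0xab, sizeof(buf));

    /* ~210M of ordinary writes, the way ld.bfd writes its output */
    for (int i = 0; i < 53760; i++) {
        if (write(fd, buf, sizeof(buf)) < 0) {
            /* over quota this fails with EDQUOT (or ENOSPC) rather than SIGBUS */
            fprintf(stderr, "write failed: %s\n", strerror(errno));
            close(fd);
            return 1;
        }
    }
    close(fd);
    return 0;
}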

on the surface this seems similar to LU-13228 but we do not set any soft quotas, and the accounts are many TB away from being over quota. also our only recent lustre changes have been on the server side, and AFAICT that ticket is a client side fix.

as we have a reproducer and a test methodology, we could probably build a 2.12.4 client image and try that if you would find that useful. we weren't planning to move to 2.12.x client in production just yet, but we could try it as an experiment.

cheers,
robin



 Comments   
Comment by Oleg Drokin [ 19/May/20 ]

any chance you can still give this patch a try? https://review.whamcloud.com/38292

I have seen it fail even in the total absence of quotas, purely due to grant dynamics, which is the other way that codepath can be triggered.

Comment by SC Admin (Inactive) [ 21/May/20 ]

Hi Oleg,

I tried a 2.12.4 client (no patches except a build patch for rhel7.8) instead of 2.10.8, and the SIGBUS issue is still there.

do I need to apply the patch in https://review.whamcloud.com/38292 to the servers as well as clients?

cheers,
robin

Comment by Oleg Drokin [ 21/May/20 ]

no, it's a client-only patch

Comment by SC Admin (Inactive) [ 22/May/20 ]

Hi Oleg,

2.12.4 + the patch in https://review.whamcloud.com/38292 seems to have fixed it. thanks!

BTW any idea if 2.12.5 is out soon?
it would be good to have all those fixes as well as this one before we make the jump to 2.12 clients.

cheers,
robin

Comment by Peter Jones [ 22/May/20 ]

scadmin yes 2.12.5 should be out soon - we're aiming to have an RC next week

Comment by Peter Jones [ 06/Jun/20 ]

Robin

We're at an advanced stage of release testing on RC1 and so far so good

Peter

Comment by Peter Jones [ 13/Jun/20 ]

Robin

2.12.5 is now GA

Peter

Comment by SC Admin (Inactive) [ 25/Jun/20 ]

Hi Oleg and Peter,

we have all clients at 2.12.5 now, and no sign of SIGBUS.

please close this ticket.
thanks!

cheers,
robin

Comment by Peter Jones [ 25/Jun/20 ]

Good news! Thanks
