LU-13588

sigbus sent to mmap writer that is a long way below quota


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.10.8, Lustre 2.12.4
    • Environment: centos 7.8, zfs 0.8.3, lustre 2.12.4 on servers
      zfs compression is enabled on OSTs.
      centos 7.8, lustre 2.10.8 on clients
      all x86_64
      group block and inode quotas set and enforcing
    • Severity: 3

    Description

      Hi,

      We've been seeing SIGBUS from a tensorflow build, and possibly from other builds and codes, since moving to 2.12.4 on the servers. We moved to centos 7.8 on servers and clients at the same time. Our previous server versions were Lustre 2.10.5 (plus many patches) and zfs 0.7.9, and those had no SIGBUS issues that we know of. We have been running 2.10.8 on the clients for about 6 months and that is unchanged.

      After a week or so of narrowing down the issue, we have found a reproducer: an ld step in the tensorflow build that will reliably SIGBUS, and we have also found that the failure is related to group block quotas.

      The .so file that ld (ld.gold, via collect2) is writing is initially all nulls and sparse, is about 210M in size, is mmap'd, and probably receives a lot (>600k) of small memcpy/memsets before it gets a SIGBUS.

      A strace -f -t snippet is:

      62275 16:15:47 mmap(NULL, 258627360, PROT_READ|PROT_WRITE, MAP_SHARED, 996, 0) = 0x2b3b15905000
      62275 16:15:48 --- SIGBUS {si_signo=SIGBUS, si_code=BUS_ADRERR, si_addr=0x2b3b23a7cd23} ---
      62275 16:15:48 +++ killed by SIGBUS +++
      

      If that value of si_addr is correct, then it is well within the size of the file, so it doesn't look like ld is writing to the wrong place. ltrace also shows no out-of-bounds addresses.
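
      In case it's useful, the i/o pattern boils down to something like the sketch below. This is just our approximation of what ld.gold is doing, not its actual code: the output path and fill pattern are made up, the mapping size is the one from the strace above, and the SIGBUS handler prints how far into the mapping the faulting address is. Compiled with e.g. gcc -o sigbus-test sigbus-test.c and run in a directory on /fred where new files get the quota-limited group, it should show whether plain mmap'd writes hit the same problem independently of ld.

      /* sketch of the ld.gold-style output pattern: sparse file, MAP_SHARED
       * mapping, many small stores; sizes and path are illustrative only */
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <signal.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <unistd.h>
      
      static char *base;   /* mapping base, so the handler can report an offset */
      
      static void bus_handler(int sig, siginfo_t *si, void *ctx)
      {
          /* fprintf isn't async-signal-safe, but fine for a throwaway test */
          fprintf(stderr, "SIGBUS at %p, offset %ld into the mapping\n",
                  si->si_addr, (long)((char *)si->si_addr - base));
          _exit(1);
      }
      
      int main(void)
      {
          const size_t len = 258627360;          /* mapping size from the strace */
          const char *path = "sigbus-test.out";  /* illustrative; put it on /fred */
      
          struct sigaction sa;
          memset(&sa, 0, sizeof(sa));
          sa.sa_sigaction = bus_handler;
          sa.sa_flags = SA_SIGINFO;
          sigaction(SIGBUS, &sa, NULL);
      
          int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
          if (fd < 0 || ftruncate(fd, len) < 0) {   /* file starts sparse/nulls */
              perror("open/ftruncate");
              return 1;
          }
      
          base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (base == MAP_FAILED) {
              perror("mmap");
              return 1;
          }
      
          /* many small scattered stores, like the memcpy/memsets ld does */
          for (size_t off = 0; off + 64 <= len; off += 4096)
              memset(base + off, 0x5a, 64);
      
          if (msync(base, len, MS_SYNC) < 0)        /* flush the dirty pages */
              perror("msync");
          munmap(base, len);
          close(fd);
          puts("completed without SIGBUS");
          return 0;
      }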

      It gets interesting if we change the group quota limit on the account: if there is less than ~9TB of group block quota free, then we reliably get a SIGBUS,
      i.e. anywhere in the range ->

      # lfs setquota  -g oz997 -B 6000000000 /fred
      

      to

      # lfs setquota  -g oz997 -B 14000000000 /fred
      

      where only about 5TB is actually used in the account (so roughly 0.8TB to 8.8TB of block quota is free across that range) ->

      # lfs quota  -g oz997 /fred
      Disk quotas for grp oz997 (gid 10273):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
                /fred 5166964968       0 6000000000       -  936838       0 2000000       -
      

      then we get a SIGBUS ->

       $ /apps/skylake/software/core/gcccore/6.4.0/bin/gcc @bazel-out/k8-py2-opt/bin/tensorflow/python/_pywrap_tensorflow_internal.so-2.params
      collect2: fatal error: ld terminated with signal 7 [Bus error]
      compilation terminated.
      

      but when there is ~9TB of free quota, or more ->

      # lfs setquota  -g oz997 -B 14000000000 /fred
      

      then we do not see a SIGBUS and the ld step completes ok.

      Other things to mention:

      • We have tried various different (much newer) gcc versions and they see the same thing.
      • As far as I can tell from the strace and ltrace output of the memcpy/memsets, all of the addresses being written are well within the bounds of the file and so should not be getting SIGBUS, i.e. it's probably not a bug in ld.gold.
      • ld.gold is the default linker. If we pick ld.bfd instead, it does ordinary (not mmap'd) i/o to the output .so and succeeds with the smallest quota above, so this seems to affect only mmap'd i/o (see the sketch after this list).
      • We've tried a couple of different user and group accounts and the pattern is similar, so I don't think it's anything odd in one account's limits or settings.
      • Another user with a much larger quota is also seeing SIGBUS on a build; that group is within 30T of a 2P quota, so it is "close" to going over by some measure. I haven't dug into that report yet, but I suspect it's the same issue as this one.
      • Builds to XFS work ok. I haven't tried XFS with a quota.
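
      To spell out why the ld.bfd vs ld.gold difference makes sense to us (this is just our understanding of the mechanism, and the function names below are purely illustrative): with ordinary write() i/o a quota failure can come back to the caller as an error return, but a store into a MAP_SHARED mapping has no return value at all, so if the filesystem cannot allocate or reserve blocks when the page is dirtied or written back, about all the kernel can do is deliver SIGBUS.

      #include <errno.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      
      /* ld.bfd-style output: a quota failure surfaces as -1/EDQUOT (or ENOSPC)
       * from write(), which the linker can report as a normal error */
      ssize_t buffered_store(int fd, const void *buf, size_t n)
      {
          ssize_t rc = write(fd, buf, n);
          if (rc < 0 && (errno == EDQUOT || errno == ENOSPC))
              fprintf(stderr, "out of quota/space: %s\n", strerror(errno));
          return rc;
      }
      
      /* ld.gold-style output: the "write" is just a memory store into the
       * MAP_SHARED mapping, so there is no error return at all -- a failed
       * block allocation at fault/writeback time becomes SIGBUS instead */
      void mapped_store(char *map, size_t off, const void *buf, size_t n)
      {
          memcpy(map + off, buf, n);
      }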

      On the surface this seems similar to LU-13228, but we do not set any soft quotas and the accounts are many TB away from being over quota. Also, our only recent lustre changes have been on the server side, and AFAICT that ticket is a client-side fix.

      As we have a reproducer and a test methodology, we could probably build a 2.12.4 client image and try that if you would find it useful. We weren't planning to move to 2.12.x clients in production just yet, but we could try it as an experiment.

      cheers,
      robin


          People

            Assignee: Oleg Drokin (green)
            Reporter: SC Admin (scadmin)
