Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      The following test

          addr = mmap(.., MAP_SHARED, ..);
      
          for(;;)
              memcpy(addr, ...);
      

      leads to sigbus signal as vvp_io_fault_start() runs out of grants.

      static int vvp_io_fault_start(const struct lu_env *env,
                                    const struct cl_io_slice *ios)
      ...
                              result = cl_io_commit_async(env, io, plist, 0, to,
                                                          mkwrite_commit_callback);
      ...
                                      /* we're in big trouble, what can we do now? */
                                      if (result == -EDQUOT)
                                              result = -ENOSPC;
      

      Attachments

        Activity

          [LU-11703] sigbus on shared mapping memset

          The issue has bene fixed by LU-13228 clio: mmap write when overquota: vvp_io_commit_sync() called by vvp_io_fault_start() forces the required transaction commit.

          vsaveliev Vladimir Saveliev added a comment - The issue has bene fixed by LU-13228 clio: mmap write when overquota: vvp_io_commit_sync() called by vvp_io_fault_start() forces the required transaction commit.

          Jacek, hello

          You should check whether your SIGBUS is related to this issue. Check whether you run out of grants.

          The below are two runs. In first a client has enough grants, so it is able to survive 10 seconds of the memset.

           ./test  /mnt/lustre/file
          Alarm clock
          [root@sl754 LU-11703]# lctl get_param osc.*.cur_grant_bytes
          osc.lustre-OST0000-osc-ffff9a11149bf800.cur_grant_bytes=497221632
          osc.lustre-OST0001-osc-ffff9a11149bf800.cur_grant_bytes=8437760
          

          In second case, the client runs out of grants and gets SIGBUS:

          ./test  /mnt/lustre/file2
          Bus error (core dumped)
          [root@sl754 LU-11703]# lctl get_param osc.*.cur_grant_bytes
          osc.lustre-OST0000-osc-ffff9a10679ac800.cur_grant_bytes=20926464
          osc.lustre-OST0001-osc-ffff9a10679ac800.cur_grant_bytes=20480
          

          Lack of grants usually happens when servers do not have much free disk space.
          The max_pages_per_rpc parameter is less critical for the problem. If you have enough disk space and no lack of grants you are certainly not seeing this issue.

          vsaveliev Vladimir Saveliev added a comment - Jacek, hello You should check whether your SIGBUS is related to this issue. Check whether you run out of grants. The below are two runs. In first a client has enough grants, so it is able to survive 10 seconds of the memset. ./test /mnt/lustre/file Alarm clock [root@sl754 LU-11703]# lctl get_param osc.*.cur_grant_bytes osc.lustre-OST0000-osc-ffff9a11149bf800.cur_grant_bytes=497221632 osc.lustre-OST0001-osc-ffff9a11149bf800.cur_grant_bytes=8437760 In second case, the client runs out of grants and gets SIGBUS: ./test /mnt/lustre/file2 Bus error (core dumped) [root@sl754 LU-11703]# lctl get_param osc.*.cur_grant_bytes osc.lustre-OST0000-osc-ffff9a10679ac800.cur_grant_bytes=20926464 osc.lustre-OST0001-osc-ffff9a10679ac800.cur_grant_bytes=20480 Lack of grants usually happens when servers do not have much free disk space. The max_pages_per_rpc parameter is less critical for the problem. If you have enough disk space and no lack of grants you are certainly not seeing this issue.

          Hey Vladimir!,
          We are running Lustre: Build Version: 2.10.3 and seeing similar issue, but i can't seem to be able to reproduce it using mmap_sanity tst 9. I suspect that this is because some of our parameters are not default. I noticed that it depends on max_pages_per_rpc and page size being 4096. Our max_pages_per_rpc is 4096 on some of the OSTs and 1024 on others. I have updated the file size and still can't reproduce. Should i be able to on my version?
          What other parameters could affect this?

          The issue we are seeing is with sqlite3. We are using it in read only mode but the problems happens when there is a on the fly index created, that gets dumped to temporary file on Lustre.
          I do not have full crash dump, just java hs_err file.
          It goes like that:

          # Problematic frame:
          # C  [libintlc.so.5+0x3c012]  __intel_mic_avx512f_memcpy+0x92 
          
          siginfo: si_signo: 7 (SIGBUS), si_code: 2 (BUS_ADRERR), si_addr: 0x00002ae0eb909000
          
          2ae0eb909000-2ae0eb90c000 rw-s 00000000 fab:cb7e8 144115656126652496     /kz5/20200710_job_12159204.XPuW/etilqs_boLeelqyY0xqGKU (deleted)
          

          What i find weird is that 0x00002ae0eb909000 points to the beginning of the file, not somewhere in the middle.

          The following is from a different run, file is not on Lustre but it is executing the same piece of code and connects to the same sqlite db.

          120601 access("/dev/shm/etilqs_kVSlJLGaN5xGwo0", F_OK <unfinished ...>
          120601 <... access resumed> )           = -1 ENOENT (No such file or directory)
          120601 open("/dev/shm/etilqs_kVSlJLGaN5xGwo0", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600 <unfinished ...>
          120601 <... open resumed> )             = 162
          120601 fstat(162,  <unfinished ...>
          120601 <... fstat resumed> {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
          120601 unlink("/dev/shm/etilqs_kVSlJLGaN5xGwo0" <unfinished ...>
          120601 <... unlink resumed> )           = 0
          120601 fstat(162,  <unfinished ...>
          120601 <... fstat resumed> {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
          120601 ftruncate(162, 11951 <unfinished ...>
          120601 <... ftruncate resumed> )        = 0
          120601 fstat(162,  <unfinished ...>
          120601 <... fstat resumed> {st_mode=S_IFREG|0600, st_size=11951, ...}) = 0
          120601 mmap(NULL, 11951, PROT_READ|PROT_WRITE, MAP_SHARED, 162, 0 <unfinished ...>
          120601 <... mmap resumed> )             = 0x7f6445d11000
          120601 ftruncate(162, 14774 <unfinished ...>
          120601 <... ftruncate resumed> )        = 0
          120601 munmap(0x7f6445d11000, 11951 <unfinished ...>
          120601 <... munmap resumed> )           = 0
          120601 lseek(162, 11944, SEEK_SET <unfinished ...>
          120601 <... lseek resumed> )            = 11944
          120601 write(162, ""..., 344 <unfinished ...>
          120601 <... write resumed> )            = 344
          120601 lseek(162, 12288, SEEK_SET <unfinished ...>
          120601 <... lseek resumed> )            = 12288
          120601 write(162, ""..., 2479 <unfinished ...>
          120601 <... write resumed> )            = 2479
          120601 fstat(162,  <unfinished ...>
          120601 <... fstat resumed> {st_mode=S_IFREG|0600, st_size=14774, ...}) = 0
          120601 mmap(NULL, 14774, PROT_READ|PROT_WRITE, MAP_SHARED, 162, 0 <unfinished ...>
          120601 <... mmap resumed> )             = 0x7f6445d10000
          120601 munmap(0x7f6445d10000, 14774 <unfinished ...>
          120601 <... munmap resumed> )           = 0
          120601 close(162 <unfinished ...>
          120601 <... close resumed> )            = 0
          

          Is it worth a new issue?
          Regards.
          Jacek Tomaka

          Tomaka Jacek Tomaka (Inactive) added a comment - Hey Vladimir!, We are running Lustre: Build Version: 2.10.3 and seeing similar issue, but i can't seem to be able to reproduce it using mmap_sanity tst 9. I suspect that this is because some of our parameters are not default. I noticed that it depends on max_pages_per_rpc and page size being 4096. Our max_pages_per_rpc is 4096 on some of the OSTs and 1024 on others. I have updated the file size and still can't reproduce. Should i be able to on my version? What other parameters could affect this? The issue we are seeing is with sqlite3. We are using it in read only mode but the problems happens when there is a on the fly index created, that gets dumped to temporary file on Lustre. I do not have full crash dump, just java hs_err file. It goes like that: # Problematic frame: # C [libintlc.so.5+0x3c012] __intel_mic_avx512f_memcpy+0x92 siginfo: si_signo: 7 (SIGBUS), si_code: 2 (BUS_ADRERR), si_addr: 0x00002ae0eb909000 2ae0eb909000-2ae0eb90c000 rw-s 00000000 fab:cb7e8 144115656126652496 /kz5/20200710_job_12159204.XPuW/etilqs_boLeelqyY0xqGKU (deleted) What i find weird is that 0x00002ae0eb909000 points to the beginning of the file, not somewhere in the middle. The following is from a different run, file is not on Lustre but it is executing the same piece of code and connects to the same sqlite db. 120601 access("/dev/shm/etilqs_kVSlJLGaN5xGwo0", F_OK <unfinished ...> 120601 <... access resumed> ) = -1 ENOENT (No such file or directory) 120601 open("/dev/shm/etilqs_kVSlJLGaN5xGwo0", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600 <unfinished ...> 120601 <... open resumed> ) = 162 120601 fstat(162, <unfinished ...> 120601 <... fstat resumed> {st_mode=S_IFREG|0600, st_size=0, ...}) = 0 120601 unlink("/dev/shm/etilqs_kVSlJLGaN5xGwo0" <unfinished ...> 120601 <... unlink resumed> ) = 0 120601 fstat(162, <unfinished ...> 120601 <... fstat resumed> {st_mode=S_IFREG|0600, st_size=0, ...}) = 0 120601 ftruncate(162, 11951 <unfinished ...> 120601 <... ftruncate resumed> ) = 0 120601 fstat(162, <unfinished ...> 120601 <... fstat resumed> {st_mode=S_IFREG|0600, st_size=11951, ...}) = 0 120601 mmap(NULL, 11951, PROT_READ|PROT_WRITE, MAP_SHARED, 162, 0 <unfinished ...> 120601 <... mmap resumed> ) = 0x7f6445d11000 120601 ftruncate(162, 14774 <unfinished ...> 120601 <... ftruncate resumed> ) = 0 120601 munmap(0x7f6445d11000, 11951 <unfinished ...> 120601 <... munmap resumed> ) = 0 120601 lseek(162, 11944, SEEK_SET <unfinished ...> 120601 <... lseek resumed> ) = 11944 120601 write(162, ""..., 344 <unfinished ...> 120601 <... write resumed> ) = 344 120601 lseek(162, 12288, SEEK_SET <unfinished ...> 120601 <... lseek resumed> ) = 12288 120601 write(162, ""..., 2479 <unfinished ...> 120601 <... write resumed> ) = 2479 120601 fstat(162, <unfinished ...> 120601 <... fstat resumed> {st_mode=S_IFREG|0600, st_size=14774, ...}) = 0 120601 mmap(NULL, 14774, PROT_READ|PROT_WRITE, MAP_SHARED, 162, 0 <unfinished ...> 120601 <... mmap resumed> ) = 0x7f6445d10000 120601 munmap(0x7f6445d10000, 14774 <unfinished ...> 120601 <... munmap resumed> ) = 0 120601 close(162 <unfinished ...> 120601 <... close resumed> ) = 0 Is it worth a new issue? Regards. Jacek Tomaka

          Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/33724
          Subject: LU-11703 target: force transaction commit on grant lack
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: c9ce35eef3945d9b446e8e535a586b06ec68dfd0

          gerrit Gerrit Updater added a comment - Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/33724 Subject: LU-11703 target: force transaction commit on grant lack Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: c9ce35eef3945d9b446e8e535a586b06ec68dfd0

          People

            vsaveliev Vladimir Saveliev
            vsaveliev Vladimir Saveliev
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: