[LU-11703] sigbus on shared mapping memset Created: 26/Nov/18 Updated: 12/Jan/22 Resolved: 12/Jan/22 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Vladimir Saveliev | Assignee: | Vladimir Saveliev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
The following test
addr = mmap(.., MAP_SHARED, ..);
for(;;)
memcpy(addr, ...);
leads to sigbus signal as vvp_io_fault_start() runs out of grants. static int vvp_io_fault_start(const struct lu_env *env, const struct cl_io_slice *ios) ... result = cl_io_commit_async(env, io, plist, 0, to, mkwrite_commit_callback); ... /* we're in big trouble, what can we do now? */ if (result == -EDQUOT) result = -ENOSPC; |
| Comments |
| Comment by Gerrit Updater [ 26/Nov/18 ] |
|
Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/33724 |
| Comment by Jacek Tomaka [ 25/Jul/20 ] |
|
Hey Vladimir!, The issue we are seeing is with sqlite3. We are using it in read only mode but the problems happens when there is a on the fly index created, that gets dumped to temporary file on Lustre. # Problematic frame: # C [libintlc.so.5+0x3c012] __intel_mic_avx512f_memcpy+0x92 siginfo: si_signo: 7 (SIGBUS), si_code: 2 (BUS_ADRERR), si_addr: 0x00002ae0eb909000 2ae0eb909000-2ae0eb90c000 rw-s 00000000 fab:cb7e8 144115656126652496 /kz5/20200710_job_12159204.XPuW/etilqs_boLeelqyY0xqGKU (deleted) What i find weird is that 0x00002ae0eb909000 points to the beginning of the file, not somewhere in the middle. The following is from a different run, file is not on Lustre but it is executing the same piece of code and connects to the same sqlite db. 120601 access("/dev/shm/etilqs_kVSlJLGaN5xGwo0", F_OK <unfinished ...>
120601 <... access resumed> ) = -1 ENOENT (No such file or directory)
120601 open("/dev/shm/etilqs_kVSlJLGaN5xGwo0", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600 <unfinished ...>
120601 <... open resumed> ) = 162
120601 fstat(162, <unfinished ...>
120601 <... fstat resumed> {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
120601 unlink("/dev/shm/etilqs_kVSlJLGaN5xGwo0" <unfinished ...>
120601 <... unlink resumed> ) = 0
120601 fstat(162, <unfinished ...>
120601 <... fstat resumed> {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
120601 ftruncate(162, 11951 <unfinished ...>
120601 <... ftruncate resumed> ) = 0
120601 fstat(162, <unfinished ...>
120601 <... fstat resumed> {st_mode=S_IFREG|0600, st_size=11951, ...}) = 0
120601 mmap(NULL, 11951, PROT_READ|PROT_WRITE, MAP_SHARED, 162, 0 <unfinished ...>
120601 <... mmap resumed> ) = 0x7f6445d11000
120601 ftruncate(162, 14774 <unfinished ...>
120601 <... ftruncate resumed> ) = 0
120601 munmap(0x7f6445d11000, 11951 <unfinished ...>
120601 <... munmap resumed> ) = 0
120601 lseek(162, 11944, SEEK_SET <unfinished ...>
120601 <... lseek resumed> ) = 11944
120601 write(162, ""..., 344 <unfinished ...>
120601 <... write resumed> ) = 344
120601 lseek(162, 12288, SEEK_SET <unfinished ...>
120601 <... lseek resumed> ) = 12288
120601 write(162, ""..., 2479 <unfinished ...>
120601 <... write resumed> ) = 2479
120601 fstat(162, <unfinished ...>
120601 <... fstat resumed> {st_mode=S_IFREG|0600, st_size=14774, ...}) = 0
120601 mmap(NULL, 14774, PROT_READ|PROT_WRITE, MAP_SHARED, 162, 0 <unfinished ...>
120601 <... mmap resumed> ) = 0x7f6445d10000
120601 munmap(0x7f6445d10000, 14774 <unfinished ...>
120601 <... munmap resumed> ) = 0
120601 close(162 <unfinished ...>
120601 <... close resumed> ) = 0
Is it worth a new issue? |
| Comment by Vladimir Saveliev [ 27/Jul/20 ] |
|
Jacek, hello You should check whether your SIGBUS is related to this issue. Check whether you run out of grants. The below are two runs. In first a client has enough grants, so it is able to survive 10 seconds of the memset. ./test /mnt/lustre/file Alarm clock [root@sl754 LU-11703]# lctl get_param osc.*.cur_grant_bytes osc.lustre-OST0000-osc-ffff9a11149bf800.cur_grant_bytes=497221632 osc.lustre-OST0001-osc-ffff9a11149bf800.cur_grant_bytes=8437760 In second case, the client runs out of grants and gets SIGBUS: ./test /mnt/lustre/file2 Bus error (core dumped) [root@sl754 LU-11703]# lctl get_param osc.*.cur_grant_bytes osc.lustre-OST0000-osc-ffff9a10679ac800.cur_grant_bytes=20926464 osc.lustre-OST0001-osc-ffff9a10679ac800.cur_grant_bytes=20480 Lack of grants usually happens when servers do not have much free disk space. |
| Comment by Vladimir Saveliev [ 12/Jan/22 ] |
|
The issue has bene fixed by |