leads to sigbus signal as vvp_io_fault_start() runs out of grants.
staticint vvp_io_fault_start(const struct lu_env *env,
const struct cl_io_slice *ios)
...
result = cl_io_commit_async(env, io, plist, 0, to,
mkwrite_commit_callback);
...
/* we're in big trouble, what can we do now? */if (result == -EDQUOT)
result = -ENOSPC;
The issue has bene fixed by LU-13228 clio: mmap write when overquota: vvp_io_commit_sync() called by vvp_io_fault_start() forces the required transaction commit.
Vladimir Saveliev
added a comment - The issue has bene fixed by LU-13228 clio: mmap write when overquota: vvp_io_commit_sync() called by vvp_io_fault_start() forces the required transaction commit.
Lack of grants usually happens when servers do not have much free disk space.
The max_pages_per_rpc parameter is less critical for the problem. If you have enough disk space and no lack of grants you are certainly not seeing this issue.
Vladimir Saveliev
added a comment - Jacek, hello
You should check whether your SIGBUS is related to this issue. Check whether you run out of grants.
The below are two runs. In first a client has enough grants, so it is able to survive 10 seconds of the memset.
./test /mnt/lustre/file
Alarm clock
[root@sl754 LU-11703]# lctl get_param osc.*.cur_grant_bytes
osc.lustre-OST0000-osc-ffff9a11149bf800.cur_grant_bytes=497221632
osc.lustre-OST0001-osc-ffff9a11149bf800.cur_grant_bytes=8437760
In second case, the client runs out of grants and gets SIGBUS:
./test /mnt/lustre/file2
Bus error (core dumped)
[root@sl754 LU-11703]# lctl get_param osc.*.cur_grant_bytes
osc.lustre-OST0000-osc-ffff9a10679ac800.cur_grant_bytes=20926464
osc.lustre-OST0001-osc-ffff9a10679ac800.cur_grant_bytes=20480
Lack of grants usually happens when servers do not have much free disk space.
The max_pages_per_rpc parameter is less critical for the problem. If you have enough disk space and no lack of grants you are certainly not seeing this issue.
Hey Vladimir!,
We are running Lustre: Build Version: 2.10.3 and seeing similar issue, but i can't seem to be able to reproduce it using mmap_sanity tst 9. I suspect that this is because some of our parameters are not default. I noticed that it depends on max_pages_per_rpc and page size being 4096. Our max_pages_per_rpc is 4096 on some of the OSTs and 1024 on others. I have updated the file size and still can't reproduce. Should i be able to on my version?
What other parameters could affect this?
The issue we are seeing is with sqlite3. We are using it in read only mode but the problems happens when there is a on the fly index created, that gets dumped to temporary file on Lustre.
I do not have full crash dump, just java hs_err file.
It goes like that:
Jacek Tomaka (Inactive)
added a comment - Hey Vladimir!,
We are running Lustre: Build Version: 2.10.3 and seeing similar issue, but i can't seem to be able to reproduce it using mmap_sanity tst 9. I suspect that this is because some of our parameters are not default. I noticed that it depends on max_pages_per_rpc and page size being 4096. Our max_pages_per_rpc is 4096 on some of the OSTs and 1024 on others. I have updated the file size and still can't reproduce. Should i be able to on my version?
What other parameters could affect this?
The issue we are seeing is with sqlite3. We are using it in read only mode but the problems happens when there is a on the fly index created, that gets dumped to temporary file on Lustre.
I do not have full crash dump, just java hs_err file.
It goes like that:
# Problematic frame:
# C [libintlc.so.5+0x3c012] __intel_mic_avx512f_memcpy+0x92
siginfo: si_signo: 7 (SIGBUS), si_code: 2 (BUS_ADRERR), si_addr: 0x00002ae0eb909000
2ae0eb909000-2ae0eb90c000 rw-s 00000000 fab:cb7e8 144115656126652496 /kz5/20200710_job_12159204.XPuW/etilqs_boLeelqyY0xqGKU (deleted)
What i find weird is that 0x00002ae0eb909000 points to the beginning of the file, not somewhere in the middle.
The following is from a different run, file is not on Lustre but it is executing the same piece of code and connects to the same sqlite db.
120601 access("/dev/shm/etilqs_kVSlJLGaN5xGwo0", F_OK <unfinished ...>
120601 <... access resumed> ) = -1 ENOENT (No such file or directory)
120601 open("/dev/shm/etilqs_kVSlJLGaN5xGwo0", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600 <unfinished ...>
120601 <... open resumed> ) = 162
120601 fstat(162, <unfinished ...>
120601 <... fstat resumed> {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
120601 unlink("/dev/shm/etilqs_kVSlJLGaN5xGwo0" <unfinished ...>
120601 <... unlink resumed> ) = 0
120601 fstat(162, <unfinished ...>
120601 <... fstat resumed> {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
120601 ftruncate(162, 11951 <unfinished ...>
120601 <... ftruncate resumed> ) = 0
120601 fstat(162, <unfinished ...>
120601 <... fstat resumed> {st_mode=S_IFREG|0600, st_size=11951, ...}) = 0
120601 mmap(NULL, 11951, PROT_READ|PROT_WRITE, MAP_SHARED, 162, 0 <unfinished ...>
120601 <... mmap resumed> ) = 0x7f6445d11000
120601 ftruncate(162, 14774 <unfinished ...>
120601 <... ftruncate resumed> ) = 0
120601 munmap(0x7f6445d11000, 11951 <unfinished ...>
120601 <... munmap resumed> ) = 0
120601 lseek(162, 11944, SEEK_SET <unfinished ...>
120601 <... lseek resumed> ) = 11944
120601 write(162, ""..., 344 <unfinished ...>
120601 <... write resumed> ) = 344
120601 lseek(162, 12288, SEEK_SET <unfinished ...>
120601 <... lseek resumed> ) = 12288
120601 write(162, ""..., 2479 <unfinished ...>
120601 <... write resumed> ) = 2479
120601 fstat(162, <unfinished ...>
120601 <... fstat resumed> {st_mode=S_IFREG|0600, st_size=14774, ...}) = 0
120601 mmap(NULL, 14774, PROT_READ|PROT_WRITE, MAP_SHARED, 162, 0 <unfinished ...>
120601 <... mmap resumed> ) = 0x7f6445d10000
120601 munmap(0x7f6445d10000, 14774 <unfinished ...>
120601 <... munmap resumed> ) = 0
120601 close(162 <unfinished ...>
120601 <... close resumed> ) = 0
Is it worth a new issue?
Regards.
Jacek Tomaka
Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/33724
Subject: LU-11703 target: force transaction commit on grant lack
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c9ce35eef3945d9b446e8e535a586b06ec68dfd0
Gerrit Updater
added a comment - Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/33724
Subject: LU-11703 target: force transaction commit on grant lack
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c9ce35eef3945d9b446e8e535a586b06ec68dfd0
The issue has bene fixed by
LU-13228clio: mmap write when overquota: vvp_io_commit_sync() called by vvp_io_fault_start() forces the required transaction commit.