[LU-15815] fast_read/stale data/reclaim workround causes SIGBUS Created: 03/May/22  Updated: 19/Jan/24  Resolved: 09/May/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Critical
Reporter: John Hammond Assignee: Andrew Perepechko
Resolution: Fixed Votes: 0
Labels: None

Attachments: PNG File image_2022-05-03_13-25-27.png     File mmap_fault_vs_drop_caches3.c     File osu_alltoall_trimmed.dk    
Issue Links:
Duplicate
is duplicated by LU-15819 Executables run from Lustre may recei... Closed
Related
is related to LU-8633 SIGBUS under memory pressure with fas... Resolved
is related to LU-14541 Memory reclaim caused a stale data read Resolved
is related to LU-12587 DIO fallback to Buffer IO unexpectedly Reopened
is related to LU-16160 take ldlm lock when queue sync pages Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The fast_read stale data workaround from LU-14541 can cause applications to receive a spurious SIGBUS when reclaim runs concurrently with page fault handler for mmaped files.



 Comments   
Comment by Gerrit Updater [ 03/May/22 ]

"John L. Hammond <jhammond@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47204
Subject: LU-15815 llite: disable fast_read and workaround
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d1720e2b774bb4137324fb9e80c3d6151e3b9c0f

Comment by John Hammond [ 03/May/22 ]

Here is a reliable if inconvenient reproducer:

$LUSTRE/tests/llmount.sh
lctl set_param debug_mb=512 debug='+trace page mmap'
lctl set_param llite.*.max_read_ahead_mb=0 # Not needed to reproduce.

yum install openmpi openmpi-devel
mv /usr/lib64/openmpi /mnt/lustre/openmpi
ln -s /mnt/lustre/openmpi /usr/lib64/openmpi
cd /mnt/lustre
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.9.tar.gz
tar -xzf osu-micro-benchmarks-5.9.tar.gz
cd osu-micro-benchmarks-5.9
./configure CC=/usr/lib64/openmpi/bin/mpicc CXX=/usr/lib64/openmpi/bin/mpicxx && make -j4

while true; do echo 3 > /proc/sys/vm/drop_caches ; done &
lctl clear
while /mnt/lustre/openmpi/bin/mpirun --allow-run-as-root -np 2 --oversubscribe --host k /mnt/lustre/osu-micro-benchmarks-5.9//mpi/\
collective/osu_alltoall -f -m 65536; do
  true
done
lctl dk > /tmp/osu_alltoall.dk
kill %1
...
[k:18404] *** Process received signal ***
[k:18404] Signal: Bus error (7)
[k:18404] Signal code: Non-existant physical address (2)
[k:18404] Failing at address: 0x7fdb20a62e73
[k:18404] [ 0] /lib64/libpthread.so.0(+0xf5f0)[0x7fdb1fcf45f0]
[k:18404] [ 1] /lib64/ld-linux-x86-64.so.2(+0x19d72)[0x7fdb20f88d72]
[k:18404] [ 2] /lib64/ld-linux-x86-64.so.2(+0x8ae2)[0x7fdb20f77ae2]
[k:18404] [ 3] /lib64/ld-linux-x86-64.so.2(+0x14254)[0x7fdb20f83254]
[k:18404] [ 4] /lib64/ld-linux-x86-64.so.2(+0xf784)[0x7fdb20f7e784]
[k:18404] [ 5] /lib64/ld-linux-x86-64.so.2(+0x13b3b)[0x7fdb20f82b3b]
[k:18404] [ 6] /lib64/libdl.so.2(+0xeeb)[0x7fdb2084beeb]
[k:18404] [ 7] /lib64/ld-linux-x86-64.so.2(+0xf784)[0x7fdb20f7e784]
[k:18404] [ 8] /lib64/libdl.so.2(+0x14ed)[0x7fdb2084c4ed]
[k:18404] [ 9] /lib64/libdl.so.2(dlopen+0x31)[0x7fdb2084bf81]
[k:18404] [10] /usr/lib64/openmpi/lib/libopen-pal.so.13(+0x59edd)[0x7fdb20aa8edd]
[k:18404] [11] /usr/lib64/openmpi/lib/libopen-pal.so.13(+0x3c7d1)[0x7fdb20a8b7d1]
[k:18404] [12] /usr/lib64/openmpi/lib/libopen-pal.so.13(mca_base_component_find+0x78a)[0x7fdb20a8cd4a]
[k:18404] [13] /usr/lib64/openmpi/lib/libopen-pal.so.13(mca_base_framework_components_register+0x56)[0x7fdb20a96cb6]
[k:18404] [14] /usr/lib64/openmpi/lib/libopen-pal.so.13(mca_base_framework_register+0x196)[0x7fdb20a97166]
[k:18404] [15] /usr/lib64/openmpi/lib/libopen-pal.so.13(mca_base_framework_open+0x12)[0x7fdb20a971c2]
[k:18404] [16] /usr/lib64/openmpi/lib/openmpi/mca_ess_hnp.so(+0x4a48)[0x7fdb1eadea48]
[k:18404] [17] /usr/lib64/openmpi/lib/libopen-rte.so.12(orte_init+0x168)[0x7fdb20d09398]
[k:18404] [18] /mnt/lustre/openmpi/bin/mpirun[0x40449f]
[k:18404] [19] /mnt/lustre/openmpi/bin/mpirun[0x40361d]
[k:18404] [20] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fdb1f939505]
[k:18404] [21] /mnt/lustre/openmpi/bin/mpirun[0x403539]
[k:18404] *** End of error message ***
Bus error (core dumped)
# grep SIGBUS /tmp/osu_alltoall.dk
00000080:00008000:3.0:1651610586.302378:0:18404:0:(vvp_io.c:1353:vvp_io_kernel_fault()) got addr 00007fdb20a62000 - SIGBUS

Trimmed logs are attached as osu_alltoall_trimmed.dk. PID 18404 is osu_alltoall, 16737 is bash wrring to drop_caches.

Comment by Gerrit Updater [ 05/May/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47204/
Subject: LU-15815 llite: disable fast_read and workaround
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 201ade9442828fbb3bedb3b31154d51ead10af41

Comment by Peter Jones [ 05/May/22 ]

Landed for 2.15

Comment by Cory Spitz [ 06/May/22 ]

jhammond, can we re-open this ticket? It makes no sense to me to revert LU-14541 when you've confirmed will reintroduce a data corruption. Can we find another solution please?

Comment by Peter Jones [ 06/May/22 ]

Reopening until this discussion is settled

Comment by John Hammond [ 06/May/22 ]

> Can we find another solution please?

panda shared some ideas on LU-15819. Is that work in progress?

Comment by Andrew Perepechko [ 06/May/22 ]

Is that work in progress?

Yes, it is.

Comment by Sebastien Piechurski [ 06/May/22 ]

We tested this patch on a 2.12.6 basis, and disabling fast_read is actually catastrophic in terms of performance on our customer code.

The revert of LU-14541 had already been done several months ago (due to the SIGBUS errors), but we recently found that it was causing the corruptions on mmap'ed pages. Disabling fast_read on top of it is causing a x5 slowdown on the application.

So, yes please can we find another solution ?

Comment by John Hammond [ 06/May/22 ]

> The revert of LU-14541 had already been done several months ago (due to the SIGBUS errors),

Where? On what branch?

> but we recently found that it was causing the corruptions on mmap'ed pages.

Do you mean that LU-14541 causes corruption? Or that reverting it causes corruption? Did you open an issue?

> So, yes please can we find another solution ?

There seems to be a lot of hopeful use of "we" in this ticket.

Comment by John Hammond [ 06/May/22 ]

With LU-14541 in place we have the following:

[  179.609120] mmap_fault_vs_d (11006): drop_caches: 3
[  179.613492] mmap_fault_vs_d (11006): drop_caches: 3
[  179.618134] mmap_fault_vs_d (11006): drop_caches: 3
[  179.621452] LustreError: 11007:0:(osc_cache.c:2495:osc_teardown_async_page()) extent 000000009717064a@{[3 -> 3/1023], [2|0|-|cache|wi|00000000b2cc94b9], [28672|1|+|-|00000000d41a65d9|1024|00000000baa546d7]} trunc at 3.
[  179.622752] LustreError: 11007:0:(osc_cache.c:2495:osc_teardown_async_page()) ### extent: 000000009717064a ns: lustre-OST0000-osc-ffff8ba637eb9000 lock: 00000000d41a65d9/0xa4961a626f53297d lrc: 4/0,1 mode: PW/PW res: [0x19:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 12288->16383) gid 0 flags: 0x800020000000000 nid: local remote: 0xa4961a626f532984 expref: -99 pid: 11007 timeout: 0 lvb_type: 1
[  179.622988] mmap_fault_vs_d (11006): drop_caches: 3
[  179.625126] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) page@00000000285d5729[3 00000000214d642c 5 1 00000000baa546d7]
[  179.625126] 
[  179.626325] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) vvp-page@000000002564fdc0(1:0) vm@00000000c9fa97b2 17ffffc0001035 2:0 ffff8ba5f1774140 3 lru
[  179.626325] 
[  179.627359] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) lov-page@00000000b1b9d4b5
[  179.627359] 
[  179.628033] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) osc-page@00000000a3522c1f 3: 1< 0x845fed 2 + - > 2< 12288 0 4096 0x0 0x40420 | 00000000baa546d7 00000000cd5d95e4 00000000b2cc94b9 > 3< 0 0 0 > 4< 0 0 8 667238400 - | - - + - > 5< - - + - | 0 - | 1 - ->
[  179.628033] 
[  179.629656] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) end page@00000000285d5729
[  179.629656] 
[  179.629826] mmap_fault_vs_d (11006): drop_caches: 3
[  179.630301] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) Trying to teardown failed: -16
[  179.630302] LustreError: 11007:0:(osc_page.c:183:osc_page_delete()) ASSERTION( 0 ) failed: 
[  179.630304] LustreError: 11007:0:(osc_page.c:183:osc_page_delete()) LBUG
[  179.632209] Pid: 11007, comm: mmap_fault_vs_d 4.18.0-348.7.1.el8.x86_64 #1 SMP Thu Mar 3 10:39:00 CST 2022
[  179.632823] Call Trace TBD:
[  179.633101] [<0>] libcfs_call_trace+0x8f/0xe0 [libcfs]
[  179.633455] [<0>] lbug_with_loc+0x53/0xb0 [libcfs]
[  179.633774] [<0>] osc_page_delete+0x666/0x670 [osc]
[  179.634124] [<0>] cl_page_delete0+0x9d/0x2d0 [obdclass]
[  179.634491] [<0>] cl_page_delete+0x3e/0x130 [obdclass]
[  179.634516] mmap_fault_vs_d (11006): drop_caches: 3
[  179.634854] [<0>] ll_invalidatepage+0xc0/0x230 [lustre]
[  179.634862] [<0>] truncate_cleanup_page+0x8d/0x170
[  179.635861] [<0>] generic_error_remove_page+0x31/0x80
[  179.636274] [<0>] vvp_page_discard+0x49/0x120 [lustre]
[  179.636640] [<0>] cl_page_discard+0x5e/0xc0 [obdclass]
[  179.636993] [<0>] cl_page_list_discard+0x59/0x160 [obdclass]
[  179.637394] [<0>] ll_io_read_page+0x27c/0xd00 [lustre]
[  179.637749] [<0>] ll_readpage+0x133/0xb70 [lustre]
[  179.638076] [<0>] filemap_fault+0x84c/0xa60
[  179.638369] [<0>] vvp_io_fault_start+0x53f/0x12b0 [lustre]
[  179.638753] [<0>] cl_io_start+0x66/0x190 [obdclass]
[  179.638928] mmap_fault_vs_d (11006): drop_caches: 3
[  179.639107] [<0>] cl_io_loop+0xde/0x2a0 [obdclass]
[  179.639755] [<0>] ll_fault+0x916/0xc60 [lustre]
[  179.640066] [<0>] __do_fault+0x48/0x130
[  179.640323] [<0>] do_fault+0x110/0x660
[  179.640572] [<0>] __handle_mm_fault+0xa0f/0xdb0
[  179.640873] [<0>] handle_mm_fault+0x11e/0x310
[  179.641168] [<0>] __do_page_fault+0x317/0x900
[  179.641461] [<0>] do_page_fault+0x87/0x300
[  179.641737] [<0>] page_fault+0x1e/0x30

See mmap_fault_vs_drop_caches3.c to reproduce.

Comment by Sebastien Piechurski [ 07/May/22 ]

>> The revert of LU-14541 had already been done several months ago (due to the SIGBUS errors),

>Where? On what branch?

The site is ECMWF, and it was done on a 2.12.6 basis. I unfortunately don't have the whole history as I was not involved in these at the time and this was handled between the local team and DDN directly.

>> but we recently found that it was causing the corruptions on mmap'ed pages.

>Do you mean that LU-14541 causes corruption? Or that reverting it causes corruption? Did you open an issue?

Reverting it caused the corruption, or at least we get the corruptions with it reverted. As we only understood recently that the crash/hang issues we observed were related to in-memory data corruption, it is actually only an assumption that the revert caused it.

There is a DDN support case open for this.

>> So, yes please can we find another solution ?

>There seems to be a lot of hopeful use of "we" in this ticket.

Obviously, this last "we" was probably more a "you". As much as I would like to provide solutions, my understanding of the memory management subsystem is too limited for this. 

 

Regards,

 

Sebastien.

Comment by Patrick Farrell [ 09/May/22 ]

So, we landed a patch from John here to resolve the SIGBUS issue by removing the clearpageuptodate() call in vvp_page_delete, ie, reverting LU-14541.  I have more comments on the SIGBUS issue, etc, which I'll put there, but basically, I think this is correct - It's clear from the page fault code in the kernel that we can't unset pageuptodate() without causing problems.  (I was wrong about this previously.)

So, that means the SIGBUS issue is resolved, but we have to find another way to solve LU-14541.

So, I'm going to close this ticket as resolved, and let's move the discussion to LU-14541.  panda, I've added you as a watcher there.  I have a few thoughts on it but I'm still hoping you and/or Shadow have a good idea...

Comment by Patrick Farrell [ 09/May/22 ]

Resolved by revert of LU-14541.  See LU-14541 for further discussion.

Generated at Sat Feb 10 03:21:32 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.