[LU-2909] Failure on test suite sanity-benchmark test_fsx: fsx bus error, core dumped Created: 05/Mar/13  Updated: 18/Dec/14  Resolved: 30/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: HB

Attachments: Text File log.gz    
Issue Links:
Duplicate
is duplicated by LU-3878 sanity-benchmark test fsx: Bus error Resolved
Related
Severity: 3
Rank (Obsolete): 6999

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/5a852a8a-85b1-11e2-9f8d-52540035b04c.

The sub-test test_fsx failed with the following error:

fsx failed

truncating to largest ever: 0x25701f
018000 1362442021.951758 trunc from 0x169dd8  to  0x2e6079	(0x17c2a2 bytes)
019000 1362442023.494191 mapwrite   0x1dd935 thru 0x1ec6de	(0xedaa bytes)
020000 1362442025.029135 read       0x17d39e thru 0x1848a8	(0x750b bytes)
021000 1362442026.608768 mapwrite   0x21696e thru 0x223441	(0xcad4 bytes)
022000 1362442028.147407 write      0x065ab3 thru 0x071b3c	(0xc08a bytes)
023000 1362442029.708216 write      0x142a88 thru 0x149749	(0x6cc2 bytes)
024000 1362442031.298952 trunc from 0x16cef5  to  0x19c5bf	(0x2f6cb bytes)
025000 1362442032.760608 mapread    0x050a4e thru 0x05ea37	(0xdfea bytes)
026000 1362442034.337654 write      0x100d42 thru 0x10c4b8	(0xb777 bytes)
027000 1362442035.889125 mapread    0x0a0142 thru 0x0a0412	(0x2d1 bytes)
028000 1362442037.474502 read       0x1556d1 thru 0x15c086	(0x69b6 bytes)
029000 1362442039.086174 write      0x009b42 thru 0x00e01c	(0x44db bytes)
030000 1362442040.692829 mapwrite   0x18135d thru 0x184f86	(0x3c2a bytes)
031000 1362442042.344050 write      0x142774 thru 0x14e54f	(0xbddc bytes)
032000 1362442043.844112 mapwrite   0x0a2628 thru 0x0a295f	(0x338 bytes)
033000 1362442045.438181 write      0x18f5c4 thru 0x19c18c	(0xcbc9 bytes)
034000 1362442047.032646 trunc from 0x1e89cc  to  0x3b16df	(0x1c8d14 bytes)
035000 1362442048.552844 write      0x1e22ba thru 0x1eb190	(0x8ed7 bytes)
036000 1362442050.084864 read       0x0abc33 thru 0x0b5f8c	(0xa35a bytes)
037000 1362442051.712007 trunc from 0x0af05a  to  0x0b8740	(0x96e7 bytes)
038000 1362442053.320923 mapread    0x055a8e thru 0x05a120	(0x4693 bytes)
039000 1362442054.897862 write      0x236d76 thru 0x23733f	(0x5ca bytes)
040000 1362442056.462382 mapread    0x116f2e thru 0x12395c	(0xca2f bytes)
041000 1362442058.009744 mapwrite   0x1493c5 thru 0x156198	(0xcdd4 bytes)
042000 1362442059.557940 write      0x1cc6db thru 0x1cf588	(0x2eae bytes)
/usr/lib64/lustre/tests/sanity-benchmark.sh: line 186: 22372 Bus error               (core dumped) $CMD
 sanity-benchmark test_fsx: @@@@@@ FAIL: fsx failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:3973:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:3996:error()
  = /usr/lib64/lustre/tests/sanity-benchmark.sh:199:test_fsx()
  = /usr/lib64/lustre/tests/test-framework.sh:4251:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:4284:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4139:run_test()
  = /usr/lib64/lustre/tests/sanity-benchmark.sh:203:main()
Dumping lctl log to /logdir/test_logs/2013-03-04/lustre-master-el6-x86_64--full--2_1_1__1286__-70011898087180-120645/sanity-benchmark.test_fsx.*.1362442060.log
CMD: client-20-ib,client-21-ib,client-22-ib,client-23-ib.lab.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-03-04/lustre-master-el6-x86_64--full--2_1_1__1286__-70011898087180-120645/sanity-benchmark.test_fsx.debug_log.\$(hostname -s).1362442060.log;
         dmesg > /logdir/test_logs/2013-03-04/lustre-master-el6-x86_64--full--2_1_1__1286__-70011898087180-120645/sanity-benchmark.test_fsx.dmesg.\$(hostname -s).1362442060.log


 Comments   
Comment by Ben Evans (Inactive) [ 05/Mar/13 ]

Another datapoint: I've also hit this on SLES11 SP2, but not on Ubuntu 10.04.

Simple test to reproduce (in lustre/tests)

for i in `seq 1 100`; do
    echo $i
    mount -t lustre $(your mount stuff here)
    ./fsx -c 100 -N 100000 -l 5M -o 1M /mnt/fs/testfile
    rm -rf /mnt/fs/testfile*
    umount /mnt/fs
done

Once it starts failing, it fails on roughly 50% of runs.

Comment by Keith Mannthey (Inactive) [ 07/Mar/13 ]

I am attempting to reproduce this on a standard RHEL 6.3 VM client. So far I have not encountered any errors, but I am going to let it run longer. I would really like a usable core file to dig into this issue.

If I can't hit this locally in the next day or so I may engage with the autotest world to have the VMs held after the next failure.

Comment by Ben Evans (Inactive) [ 08/Mar/13 ]

Keith, I think this is OS-specific. I've only seen it on SLES11 SP2, not on RHEL 6.2 or Ubuntu.

Comment by Keith Mannthey (Inactive) [ 08/Mar/13 ]

Well, I was able to catch it on RHEL 6.3 with a client running a Lustre-patched kernel; I got it in about 25% of my runs last night. Also, our autotest runs mostly use unpatched RHEL 6.3 clients, and we hit it about 5% of the time there, so I don't think this is related to the base kernel. I am going to try over the weekend to run this version of fsx (I understand this version of the test has a bit of Lustre hacking in it) on ext4.

For some reason I didn't get a core dump on my local system so I adjusted some limits in my shell and am trying again.
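(The limits in question are presumably the core file size limit; a minimal sketch of that adjustment, assuming a bash shell, would be:)

ulimit -c unlimited                  # allow core files of any size in this shell
cat /proc/sys/kernel/core_pattern    # check where core files will be written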

Comment by Jinshan Xiong (Inactive) [ 08/Mar/13 ]

Ben, can you please try to reproduce this issue with the following patch reverted:

commit d1eb3357bab215018bcb29877ed8b2990e65cc73
Author: Niu Yawei <niu@whamcloud.com>
Date: Sun Jan 6 03:21:03 2013 -0500
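(As a sketch, reverting only that commit on a local test branch would look like the following; the branch name is illustrative:)

git checkout -b lu2909-revert-test
git revert d1eb3357bab215018bcb29877ed8b2990e65cc73
# rebuild the client and rerun the fsx reproducer against this build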

Thanks,
Jinshan

Comment by Keith Mannthey (Inactive) [ 08/Mar/13 ]

Jinshan,
I will try as well. At this point my system has lots of dirty pages.

Also first debug datapoint:

The test dumped core at:

047000 1362775116.526873 trunc from 0x04dd1a  to  0x270fc3      (0x2232aa bytes)
048000 1362775122.237376 trunc from 0x24c2cc  to  0x44a81b      (0x1fe550 bytes)
049000 1362775127.917614 write      0x0b91b7 thru 0x0c3304      (0xa14e bytes)
050000 1362775132.986038 mapread    0x03446b thru 0x03bb6a      (0x7700 bytes)
051000 1362775138.608533 mapwrite   0x252e46 thru 0x2570c6      (0x4281 bytes)

Similar to the other data points: we grow the file and then call mapwrite on the newly expanded area.

(gdb) bt
#0  memcpy () at ../sysdeps/x86_64/memcpy.S:406
#1  0x0000000000403a86 in domapwrite (offset=<value optimized out>, size=60322)
    at fsx.c:875
#2  0x000000000040434d in test () at fsx.c:1055
#3  0x0000000000404c3a in main (argc=<value optimized out>, 
    argv=<value optimized out>) at fsx.c:1381
(gdb) frame 1
#1  0x0000000000403a86 in domapwrite (offset=<value optimized out>, size=60322)
    at fsx.c:875
875             memcpy(p + pg_offset, good_buf + offset, size);
(gdb) print p
$2 = 0x7f8de405f000 <Address 0x7f8de405f000 out of bounds>

It seems gdb thinks the p pointer is out of bounds.
Also, size seems a tad odd, but that could be a units thing; not sure yet.

void domapwrite(unsigned offset, unsigned size)
...
        if ((p = mmap(0, map_size, PROT_READ | PROT_WRITE, MAP_FILE|MAP_SHARED,
                      fd, (off_t)(offset - pg_offset))) == MAP_FAILED) {
                prterr("domapwrite: mmap");
                report_failure(202);
        }
...
        memcpy(p + pg_offset, good_buf + offset, size); <=== fsx.c:875
...

Somehow mmap returns a pointer that becomes invalid a short time later?

I am still digging into this.
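(For reference, the backtrace and pointer inspection above can be obtained from the core file in a single batch gdb invocation; the core file name below is illustrative:)

gdb -batch -ex bt -ex 'frame 1' -ex 'print p' -ex 'print pg_offset' -ex 'print size' ./fsx core.22372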

Comment by Jinshan Xiong (Inactive) [ 08/Mar/13 ]

Yes, this is a mmap write problem for sure.

Comment by Keith Mannthey (Inactive) [ 08/Mar/13 ]

I added some real debugging to domapwrite and I am retesting. I hope to know a bit more tomorrow.

Comment by Ben Evans (Inactive) [ 11/Mar/13 ]

Jinshan, rolling back the patch seems to have made the issue go away.

Comment by Keith Mannthey (Inactive) [ 11/Mar/13 ]

I kicked off a test last night on a build with the patch reverted. So far, no sign of the issue. With the stock build, once I started seeing the issue it was quite common. I am going to continue letting it run for a while, and I will update the LU ASAP if I hit it.

All user-space debugging of fsx confirmed it is an mmap write problem. It does not seem to be a test problem.

I emailed this issue to Niu Yawei as well. As I see it, there are really two options: 1. fix the patch, or 2. drop the patch.

Comment by Niu Yawei (Inactive) [ 12/Mar/13 ]

Hi Keith, it seems I missed your mail. Next time you can just add me as a watcher on this ticket and put a comment here. Thanks.

Comment by Niu Yawei (Inactive) [ 12/Mar/13 ]

The problem looks like mmap is trying to access an invalid memory address; I can't imagine how it could be related to the fix in osc_enter_cache() (d1eb3357bab215018bcb29877ed8b2990e65cc73).

Keith, how did you run your test? With master and only the patch d1eb3357bab215018bcb29877ed8b2990e65cc73 reverted, or with an old master that predates d1eb3357bab215018bcb29877ed8b2990e65cc73 (i.e. with all later patches removed)? Thanks.

Comment by Keith Mannthey (Inactive) [ 12/Mar/13 ]

I was able to revert just that patch (it was a clean revert). I ran for 18 to 20 hours. It took at least 10 hours to reproduce on master, so I felt safe calling it at 18-20 hours, together with Ben's data point. I can set it up again and let it crank for days if needed.

Ben, did you roll back the whole git tree, or just that patch as well?

I ran a modified version of Ben's script with our fsx invocation.

#!/bin/bash

set -e

for i in `seq 1 100`; do
    echo kmannth $i
    mount -t lustre mds@tcp:/lustre /mnt/lustre/
    ./a.out -c 50 -p 1000 -S 5662 -P /tmp -l 2453703 -N 100000 /mnt/lustre/f0.fsxfile 2>&1
    rm -rf /mnt/lustre/testfile*
    umount /mnt/lustre
done

On my system it does not happen right away (1 GB of memory in the client). On master it took, say, 10-12 hours to reproduce. Once it was reproducing on the system it would fail quite easily (50% or more of runs); the system has to be exhausted in some way to get into the reproducing state.
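(A sketch of checking that exhausted client state with generic tools; nothing here is from the original runs:)

free -m                        # overall memory pressure on the 1 GB client
grep -i dirty /proc/meminfo    # kernel-wide dirty page counters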

Comment by Jinshan Xiong (Inactive) [ 12/Mar/13 ]

The problem looks like mmap is trying to access an invalid memory address; I can't imagine how it could be related to the fix in osc_enter_cache() (d1eb3357bab215018bcb29877ed8b2990e65cc73).

I think the problem is that the mmap write failed to reserve grant at the osc layer, so -EDQUOT was returned to the kernel, which caused the SIGBUS.

Comment by Oleg Drokin [ 13/Mar/13 ]

so, this needs to be fixed, right? Unless there is no free space on the FS, the grant reservation should not really fail.

Comment by Niu Yawei (Inactive) [ 13/Mar/13 ]

I think the problem is that the mmap write failed to reserve grant at the osc layer, so -EDQUOT was returned to the kernel, which caused the SIGBUS.

Yes, probably. If this is the case, then I think returning SIGBUS to the application could be the best we can do. What do you think?

Comment by Niu Yawei (Inactive) [ 13/Mar/13 ]

So this needs to be fixed, right? Unless there is no free space on the FS, the grant reservation should not really fail.

The problem is that the client doesn't know whether the server is out of space; so far it can only rely on grant. I think enlarging obd_max_dirty_pages & cl_dirty_max may alleviate the problem?

Comment by Alex Zhuravlev [ 13/Mar/13 ]

How can a client have space granted while the server is out of space?

Comment by Niu Yawei (Inactive) [ 13/Mar/13 ]

Alex, osc_enter_cache() may return -EDQUOT in two cases:

  • out of grant;
  • too many dirty pages (in osc_enter_cache(), we can only trigger a dirty flush on the current osc, not on all oscs, so the global dirty max could be exceeded);

I guess the test failed for the second case.
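(To see which of the two cases applies on a client, a sketch of the counters one could check; the parameter names are the usual osc proc entries and are assumed to exist in this build:)

lctl get_param osc.*.cur_grant_bytes     # remaining grant per osc
lctl get_param osc.*.cur_dirty_bytes     # dirty bytes per osc
lctl get_param osc.*.max_dirty_mb        # per-osc dirty cap
cat /proc/sys/lustre/max_dirty_mb        # global dirty cap (obd_max_dirty_pages)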

Comment by Alex Zhuravlev [ 13/Mar/13 ]

In that case there should be a blocking point somewhere, so that we do not return to the kernel with an error?

Comment by Johann Lombardi (Inactive) [ 13/Mar/13 ]

I think we wait if there are BRWs in flight (which might return grant space back). However, there is still the case where there is no bulk write in flight. It was exactly for this problem that, when Jinshan implemented support for mkwrite, I suggested adding a new RPC type (or an "empty" write) to request grant from the OSTs.

Comment by Niu Yawei (Inactive) [ 13/Mar/13 ]

Hi Keith, could you rerun the test with a larger obd_max_dirty_pages to see whether the situation is alleviated? If the test confirms the root cause, we can take the next step and consider how to fix it. Thanks.

Comment by Jinshan Xiong (Inactive) [ 13/Mar/13 ]

This should be a reproduction of LU-1028. If I remember it correctly, the grant was lost due to truncate.

Comment by Keith Mannthey (Inactive) [ 13/Mar/13 ]

I will kick off a test with master and see how a larger obd_max_dirty_pages affects things.

Comment by Keith Mannthey (Inactive) [ 14/Mar/13 ]

I managed to accidentally reboot my debug client that was in the failing state. I restarted testing a while ago. Sorry for the delay.

I also wanted to confirm my current settings:

lctl get_param osc.*.max_dirty_mb
osc.lustre-OST0000-osc-ffff88001d064000.max_dirty_mb=32

My plan is to get into the failing state again and then double the number to 64. I hope to observe a cleanly running fsx loop after that.

Also, I tried setting osc.*.max_dirty_mb to 4 MB in an attempt to shorten the time to reproduction.

Comment by Jinshan Xiong (Inactive) [ 14/Mar/13 ]

This issue can be easily reproduced on my node with the command: ./fsx -l $((4*1024*1024)) /mnt/lustre/fsxfile

Please check the attachment for the debug log.

Comment by Niu Yawei (Inactive) [ 14/Mar/13 ]

Xiong, your log shows osc_enter_cache() returned -EDQUOT because the per-osc dirty is exceeding 32M. I think it's caused by this change (66be2e6014a7bf20b2f749c877cb141b209f15d9):

diff --git a/lustre/osc/osc_cache.c b/lustre/osc/osc_cache.c
index 37b249f..581beb6 100644
--- a/lustre/osc/osc_cache.c
+++ b/lustre/osc/osc_cache.c
@@ -1533,12 +1533,7 @@ static int osc_enter_cache(const struct lu_env *env, struct client_obd *cli,
                ocw.ocw_rc = 0;
                client_obd_list_unlock(&cli->cl_loi_list_lock);

-               /* First osc_io_unplug() tries to put current object
-                * on ready list, second osc_io_unplug() makes sure that
-                * dirty flush can still be triggered even if current
-                * object hasn't any dirty pages */
-               osc_io_unplug(env, cli, osc, PDL_POLICY_ROUND);
-               osc_io_unplug(env, cli, NULL, PDL_POLICY_ROUND);
+               osc_io_unplug_async(env, cli, NULL);

                CDEBUG(D_CACHE, "%s: sleeping for cache space @ %p for %p\n",
                       cli->cl_import->imp_obd->obd_name, &ocw, oap);

There are two problems:

1. the current object isn't being added onto the ready list (so the dirty pages of the current object may not be flushed);
2. the flush was changed to asynchronous, so we can't be sure the dirty pages are flushed in time.

I'm not sure if Keith's test included this fix.

Comment by Niu Yawei (Inactive) [ 14/Mar/13 ]

Keith, if your original test didn't include (66be2e6014a7bf20b2f749c877cb141b209f15d9), I guess your case could be a little different. Please make sure to enlarge the global dirty max, /proc/sys/lustre/max_dirty_mb (not only the per-osc dirty max), as well when you test.
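(A sketch of enlarging both knobs on the client before re-testing; the sizes are illustrative:)

lctl set_param osc.*.max_dirty_mb=64        # per-osc dirty limit (was 32 above)
echo 256 > /proc/sys/lustre/max_dirty_mb    # global dirty limit
lctl get_param osc.*.max_dirty_mb           # confirm both values
cat /proc/sys/lustre/max_dirty_mb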

Comment by Jinshan Xiong (Inactive) [ 14/Mar/13 ]

1. the current object isn't being added onto the ready list (so the dirty pages of the current object may not be flushed)

Actually, if an object has dirty pages, it should be on client_obd::cl_loi_write_list, unless its extents are being used by someone - apparently that is not the case for fsx. Please let me know if I missed something.

Comment by Niu Yawei (Inactive) [ 15/Mar/13 ]

Actually, if an object has dirty pages, it should be on client_obd::cl_loi_write_list, unless its extents are being used by someone - apparently that is not the case for fsx. Please let me know if I missed something.

It looks like the object is added onto cl_loi_write_list only when osc_list_maint() (or __osc_list_maint()) is called on the object, and we call osc_io_unplug(obj) -> __osc_list_maint(obj) in osc_enter_cache() for exactly this purpose. Do we add the object onto the write list anywhere else? I didn't see it in the code.

Comment by Niu Yawei (Inactive) [ 15/Mar/13 ]

Per discussion with Xiong, the object should be on the write list after osc_extent_release() is called, but the async flush problem has to be addressed.

This should address the problem caused by async flush in osc_enter_cache(): http://review.whamcloud.com/5732

Comment by Keith Mannthey (Inactive) [ 19/Mar/13 ]

Sorry for the slow progress on this. I am collecting a few reproduction times before I change any values. I should have enough data to move forward with the test tomorrow.

I will double /proc/sys/lustre/max_dirty_mb.

Comment by Keith Mannthey (Inactive) [ 19/Mar/13 ]

In the original case the issue reproduces in somewhere between 30 minutes and 8 hours (yes, a big window).

cat /proc/sys/lustre/max_dirty_mb
127.9

now

cat /proc/sys/lustre/max_dirty_mb
460.7

The test is running; I will update when I know more. To confirm, the code this client/MDS/OST trio is running is stock RHEL6-based 2.3.62.

Comment by Keith Mannthey (Inactive) [ 20/Mar/13 ]

Well, with the change I seem to be seeing 1.5-3 hours between bus errors, but I don't have many runs in yet. It is hard to say at this point whether it makes things better.

Comment by Niu Yawei (Inactive) [ 20/Mar/13 ]

Thank you, Keith. Then I suppose it is because there is high contention on grant, and the grant is consumed by another process immediately after the flush. Given that the original implementation was fine, I think we should change the 'if' back to a 'while' in osc_enter_cache(). I've updated the patch.

Comment by Keith Mannthey (Inactive) [ 29/Mar/13 ]

I just wanted to confirm that http://review.whamcloud.com/5732 has been tested successfully. My test has been running without error for over 36 hours. The patch status has been updated as well.

Comment by Peter Jones [ 30/Mar/13 ]

Landed for 2.4

Comment by Keith Mannthey (Inactive) [ 01/Apr/13 ]

I just wanted to note that I redeployed the VMs the fsx loop test was running on today. All in all, the test ran successfully for over 5 days.
