Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2909

Failure on test suite sanity-benchmark test_fsx: fsx bus error, core dumped

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.4.0
    • 3
    • 6999

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/5a852a8a-85b1-11e2-9f8d-52540035b04c.

      The sub-test test_fsx failed with the following error:

      fsx failed

      truncating to largest ever: 0x25701f
      018000 1362442021.951758 trunc from 0x169dd8  to  0x2e6079	(0x17c2a2 bytes)
      019000 1362442023.494191 mapwrite   0x1dd935 thru 0x1ec6de	(0xedaa bytes)
      020000 1362442025.029135 read       0x17d39e thru 0x1848a8	(0x750b bytes)
      021000 1362442026.608768 mapwrite   0x21696e thru 0x223441	(0xcad4 bytes)
      022000 1362442028.147407 write      0x065ab3 thru 0x071b3c	(0xc08a bytes)
      023000 1362442029.708216 write      0x142a88 thru 0x149749	(0x6cc2 bytes)
      024000 1362442031.298952 trunc from 0x16cef5  to  0x19c5bf	(0x2f6cb bytes)
      025000 1362442032.760608 mapread    0x050a4e thru 0x05ea37	(0xdfea bytes)
      026000 1362442034.337654 write      0x100d42 thru 0x10c4b8	(0xb777 bytes)
      027000 1362442035.889125 mapread    0x0a0142 thru 0x0a0412	(0x2d1 bytes)
      028000 1362442037.474502 read       0x1556d1 thru 0x15c086	(0x69b6 bytes)
      029000 1362442039.086174 write      0x009b42 thru 0x00e01c	(0x44db bytes)
      030000 1362442040.692829 mapwrite   0x18135d thru 0x184f86	(0x3c2a bytes)
      031000 1362442042.344050 write      0x142774 thru 0x14e54f	(0xbddc bytes)
      032000 1362442043.844112 mapwrite   0x0a2628 thru 0x0a295f	(0x338 bytes)
      033000 1362442045.438181 write      0x18f5c4 thru 0x19c18c	(0xcbc9 bytes)
      034000 1362442047.032646 trunc from 0x1e89cc  to  0x3b16df	(0x1c8d14 bytes)
      035000 1362442048.552844 write      0x1e22ba thru 0x1eb190	(0x8ed7 bytes)
      036000 1362442050.084864 read       0x0abc33 thru 0x0b5f8c	(0xa35a bytes)
      037000 1362442051.712007 trunc from 0x0af05a  to  0x0b8740	(0x96e7 bytes)
      038000 1362442053.320923 mapread    0x055a8e thru 0x05a120	(0x4693 bytes)
      039000 1362442054.897862 write      0x236d76 thru 0x23733f	(0x5ca bytes)
      040000 1362442056.462382 mapread    0x116f2e thru 0x12395c	(0xca2f bytes)
      041000 1362442058.009744 mapwrite   0x1493c5 thru 0x156198	(0xcdd4 bytes)
      042000 1362442059.557940 write      0x1cc6db thru 0x1cf588	(0x2eae bytes)
      /usr/lib64/lustre/tests/sanity-benchmark.sh: line 186: 22372 Bus error               (core dumped) $CMD
       sanity-benchmark test_fsx: @@@@@@ FAIL: fsx failed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:3973:error_noexit()
        = /usr/lib64/lustre/tests/test-framework.sh:3996:error()
        = /usr/lib64/lustre/tests/sanity-benchmark.sh:199:test_fsx()
        = /usr/lib64/lustre/tests/test-framework.sh:4251:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:4284:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:4139:run_test()
        = /usr/lib64/lustre/tests/sanity-benchmark.sh:203:main()
      Dumping lctl log to /logdir/test_logs/2013-03-04/lustre-master-el6-x86_64--full--2_1_1__1286__-70011898087180-120645/sanity-benchmark.test_fsx.*.1362442060.log
      CMD: client-20-ib,client-21-ib,client-22-ib,client-23-ib.lab.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-03-04/lustre-master-el6-x86_64--full--2_1_1__1286__-70011898087180-120645/sanity-benchmark.test_fsx.debug_log.\$(hostname -s).1362442060.log;
               dmesg > /logdir/test_logs/2013-03-04/lustre-master-el6-x86_64--full--2_1_1__1286__-70011898087180-120645/sanity-benchmark.test_fsx.dmesg.\$(hostname -s).1362442060.log
      

      Attachments

        Issue Links

          Activity

            [LU-2909] Failure on test suite sanity-benchmark test_fsx: fsx bus error, core dumped

            The problem looks like mmap is trying to access invalid memory address, I can't imagine how it can be related to the fix in osc_enter_cache() (d1eb3357bab215018bcb29877ed8b2990e65cc73
            ).

            Keith, how did you run your test? With master reverted the patch d1eb3357bab215018bcb29877ed8b2990e65cc73 only, or with old master which reverted all the patches since d1eb3357bab215018bcb29877ed8b2990e65cc73 ? Thanks.

            niu Niu Yawei (Inactive) added a comment - The problem looks like mmap is trying to access invalid memory address, I can't imagine how it can be related to the fix in osc_enter_cache() (d1eb3357bab215018bcb29877ed8b2990e65cc73 ). Keith, how did you run your test? With master reverted the patch d1eb3357bab215018bcb29877ed8b2990e65cc73 only, or with old master which reverted all the patches since d1eb3357bab215018bcb29877ed8b2990e65cc73 ? Thanks.

            Hi, Keith, seems I missed your mail, next time you can just add me as watcher for this ticket and put comment here. thanks.

            niu Niu Yawei (Inactive) added a comment - Hi, Keith, seems I missed your mail, next time you can just add me as watcher for this ticket and put comment here. thanks.

            I kicked of a depatched build test last night. So far no sign of the issue. With the stock build once I started seeing the issue it was quite common. I am going to continue to let it run for a while and I will update the LU if I hit it ASAP.

            All user space debugging of fsx confirmed it is a mmap write problem. It does not seem to be a test problem.

            I emailed this issue to Niu Yawei as well. As I see it there are really two options. 1. Fix the patch. 2. Drop the patch.

            keith Keith Mannthey (Inactive) added a comment - I kicked of a depatched build test last night. So far no sign of the issue. With the stock build once I started seeing the issue it was quite common. I am going to continue to let it run for a while and I will update the LU if I hit it ASAP. All user space debugging of fsx confirmed it is a mmap write problem. It does not seem to be a test problem. I emailed this issue to Niu Yawei as well. As I see it there are really two options. 1. Fix the patch. 2. Drop the patch.

            Jinshan, rolling back the patch seems to have made the issue go away.

            ben@terascala.com Ben Evans (Inactive) added a comment - Jinshan, rolling back the patch seems to have made the issue go away.

            I added some real debugging to domapwrite and I am retesting. I hope to know a bit more tomorrow.

            keith Keith Mannthey (Inactive) added a comment - I added some real debugging to domapwrite and I am retesting. I hope to know a bit more tomorrow.

            Yes, this is a mmap write problem for sure.

            jay Jinshan Xiong (Inactive) added a comment - Yes, this is a mmap write problem for sure.
            keith Keith Mannthey (Inactive) added a comment - - edited

            Jinshan,
            I will try as well. At this point my system has lots of dirty pages.

            Also first debug datapoint:

            Test dumped on:

            047000 1362775116.526873 trunc from 0x04dd1a  to  0x270fc3      (0x2232aa bytes)
            048000 1362775122.237376 trunc from 0x24c2cc  to  0x44a81b      (0x1fe550 bytes)
            049000 1362775127.917614 write      0x0b91b7 thru 0x0c3304      (0xa14e bytes)
            050000 1362775132.986038 mapread    0x03446b thru 0x03bb6a      (0x7700 bytes)
            051000 1362775138.608533 mapwrite   0x252e46 thru 0x2570c6      (0x4281 bytes)
            

            Similar to other data points we grow a file and then call mapwrite on the newly expanded area.

            (gdb) bt
            #0  memcpy () at ../sysdeps/x86_64/memcpy.S:406
            #1  0x0000000000403a86 in domapwrite (offset=<value optimized out>, size=60322)
                at fsx.c:875
            #2  0x000000000040434d in test () at fsx.c:1055
            #3  0x0000000000404c3a in main (argc=<value optimized out>, 
                argv=<value optimized out>) at fsx.c:1381
            (gdb) frame 1
            #1  0x0000000000403a86 in domapwrite (offset=<value optimized out>, size=60322)
                at fsx.c:875
            875             memcpy(p + pg_offset, good_buf + offset, size);
            (gdb) print p
            $2 = 0x7f8de405f000 <Address 0x7f8de405f000 out of bounds>
            

            It seems gdb thinks the p pointer is out of bounds.
            Also size seem a tad odd but it could be a units things not sure yet.

            void domapwrite(unsigned offset, unsigned size)
            ...
                    if ((p = mmap(0, map_size, PROT_READ | PROT_WRITE, MAP_FILE|MAP_SHARED,
                                  fd, (off_t)(offset - pg_offset))) == MAP_FAILED) {
                            prterr("domapwrite: mmap");
                            report_failure(202);
                    }
            ...
                    memcpy(p + pg_offset, good_buf + offset, size); <=== fsx.c:875
            ...
            

            Somehow the mmap returns a pointer that is invalid a short time later?

            I am still digging into this.

            keith Keith Mannthey (Inactive) added a comment - - edited Jinshan, I will try as well. At this point my system has lots of dirty pages. Also first debug datapoint: Test dumped on: 047000 1362775116.526873 trunc from 0x04dd1a to 0x270fc3 (0x2232aa bytes) 048000 1362775122.237376 trunc from 0x24c2cc to 0x44a81b (0x1fe550 bytes) 049000 1362775127.917614 write 0x0b91b7 thru 0x0c3304 (0xa14e bytes) 050000 1362775132.986038 mapread 0x03446b thru 0x03bb6a (0x7700 bytes) 051000 1362775138.608533 mapwrite 0x252e46 thru 0x2570c6 (0x4281 bytes) Similar to other data points we grow a file and then call mapwrite on the newly expanded area. (gdb) bt #0 memcpy () at ../sysdeps/x86_64/memcpy.S:406 #1 0x0000000000403a86 in domapwrite (offset=<value optimized out>, size=60322) at fsx.c:875 #2 0x000000000040434d in test () at fsx.c:1055 #3 0x0000000000404c3a in main (argc=<value optimized out>, argv=<value optimized out>) at fsx.c:1381 (gdb) frame 1 #1 0x0000000000403a86 in domapwrite (offset=<value optimized out>, size=60322) at fsx.c:875 875 memcpy(p + pg_offset, good_buf + offset, size); (gdb) print p $2 = 0x7f8de405f000 <Address 0x7f8de405f000 out of bounds> It seems gdb thinks the p pointer is out of bounds. Also size seem a tad odd but it could be a units things not sure yet. void domapwrite(unsigned offset, unsigned size) ... if ((p = mmap(0, map_size, PROT_READ | PROT_WRITE, MAP_FILE|MAP_SHARED, fd, (off_t)(offset - pg_offset))) == MAP_FAILED) { prterr("domapwrite: mmap"); report_failure(202); } ... memcpy(p + pg_offset, good_buf + offset, size); <=== fsx.c:875 ... Somehow the mmap returns a pointer that is invalid a short time later? I am still digging into this.

            Ben, can you please try to reproduce this issue by reverting the following patch:

            commit d1eb3357bab215018bcb29877ed8b2990e65cc73
            Author: Niu Yawei <niu@whamcloud.com>
            Date: Sun Jan 6 03:21:03 2013 -0500

            Thanks,
            Jinshan

            jay Jinshan Xiong (Inactive) added a comment - Ben, can you please try to reproduce this issue by reverting the following patch: commit d1eb3357bab215018bcb29877ed8b2990e65cc73 Author: Niu Yawei <niu@whamcloud.com> Date: Sun Jan 6 03:21:03 2013 -0500 Thanks, Jinshan

            Well I was able to catch it on RHEL6.3 with a client running a Lustre patched kernel. I got it in about 25% of my runs last night. Also our autotest run RHEL6.3 unpatched client for the most part and we hit it about 5% of the time there. I don't this could be related to the base kernel. I am going to try over the weekend to run this version of FSX (I understand this version of the test has a bit of Lustre hacking in it) on ext4.

            For some reason I didn't get a core dump on my local system so I adjusted some limits in my shell and am trying again.

            keith Keith Mannthey (Inactive) added a comment - Well I was able to catch it on RHEL6.3 with a client running a Lustre patched kernel. I got it in about 25% of my runs last night. Also our autotest run RHEL6.3 unpatched client for the most part and we hit it about 5% of the time there. I don't this could be related to the base kernel. I am going to try over the weekend to run this version of FSX (I understand this version of the test has a bit of Lustre hacking in it) on ext4. For some reason I didn't get a core dump on my local system so I adjusted some limits in my shell and am trying again.

            Keith. I think this is OS-specific. I've only seen it on SLES11 SP2, not RHEL 6.2 or Ubuntu.

            ben@terascala.com Ben Evans (Inactive) added a comment - Keith. I think this is OS-specific. I've only seen it on SLES11 SP2, not RHEL 6.2 or Ubuntu.

            I am attempting to reproduce on a standard RHEL 6.3 vm client. So far I have not encountered any errors but I am going to let it run more. I would really like a usable core file to look into this issue more.

            If I can't hit this locally in the next day or so I may engage with the autotest world to have the VMs held after the next failure.

            keith Keith Mannthey (Inactive) added a comment - I am attempting to reproduce on a standard RHEL 6.3 vm client. So far I have not encountered any errors but I am going to let it run more. I would really like a usable core file to look into this issue more. If I can't hit this locally in the next day or so I may engage with the autotest world to have the VMs held after the next failure.

            People

              niu Niu Yawei (Inactive)
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: