[LU-1028] Bus error (core dumped) during fsx test

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.2.0, Lustre 2.4.0
    • Lustre 2.2.0
    • None
    • 3
    • 4251

    Description

      This bug appeared after commit e8ffe16619baf1ef7c5c6b117d338956372aa752, "LU-884 clio: client in memory checksum".
      Unfortunately, our tests do not report the fsx failure in sanity-benchmark, and sanity-benchmark is not part of every autotest run.

      The issue looks like the following:
      /usr/lib64/lustre/tests/sanity-benchmark.sh: line 186: 16826 Bus error (core dumped) fsx -c 50 -p 1000 -S $FSX_SEED -P $TMP -l $FSX_SIZE -N $(($FSX_COUNT * 100)) $testfile

      Example of report (master):
      https://maloo.whamcloud.com/test_sets/02485a3a-45d0-11e1-8d6e-5254004bbbd3

      sanity-benchmark is green, but fsx failed as shown above. The same code recently landed on orion, and we started experiencing the same issue there.
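
The gap described above (fsx is killed by a signal while the suite still reports green) can be illustrated with a small wrapper. This is a minimal sketch, not the code in sanity-benchmark.sh: `run_checked` is a hypothetical helper name, and the only fact it relies on is that a POSIX shell reports a signal death as exit status 128 plus the signal number (a Bus error is SIGBUS, which is signal 7 on Linux x86_64, so status 135).

```shell
# Hypothetical helper (not part of the Lustre test framework):
# run a command and report clearly when it was killed by a signal.
run_checked() {
    local rc
    "$@"
    rc=$?
    if [ "$rc" -gt 128 ]; then
        # Shells encode "killed by signal N" as exit status 128 + N.
        echo "FAIL: '$1' killed by signal $((rc - 128)) (exit status $rc)" >&2
    elif [ "$rc" -ne 0 ]; then
        echo "FAIL: '$1' exited with status $rc" >&2
    fi
    # Propagate the status so the caller can mark the test as failed.
    return "$rc"
}
```

A caller would wrap the fsx invocation, for example `run_checked fsx ... || exit 1` with the original arguments, so that a core dump cannot pass silently.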

          Activity

            We also trigger the same situation (osc.cur_{dirty|grant|lost_grant}_bytes = 0) on a CEA test system running our (Bull) build of Lustre v2.1.2. This build integrates the LU-1299 (patch set 11) and ORNL-22 patches, but not the patch for this LU-1028.

            The very bad consequence, which does not appear clearly in the comments on this ticket, is that this situation causes file corruption on affected clients: applications and commands are still allowed to write to the cache, while the later asynchronous flushes never succeed (-EDQUOT), and this failure occurs silently and out of context.

            As a workaround, we are able to recover grants, and with them a working grant mechanism, by issuing synchronous/O_DIRECT writes to the affected OSTs/OSCs. But this problem is a showstopper for the customer's migration to v2.1.2, since there is always a timing window in which corruption can occur.

            bfaccini Bruno Faccini (Inactive) added the comment above.
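
The synchronous-write workaround mentioned in the comment can be sketched as follows. This is an illustration under assumptions, not the exact commands used on the CEA system: `sync_write` is a hypothetical helper, the 4096-byte block size is arbitrary, and it uses GNU dd's `oflag=sync` (O_SYNC). For O_DIRECT, `oflag=direct` would be the corresponding dd flag, which additionally needs suitably aligned block sizes. Whether a given write actually restores grant depends on the server state described above.

```shell
# Hypothetical helper: write COUNT blocks of zeros to FILE synchronously.
# oflag=sync makes dd open the output with O_SYNC, so each write must
# reach the server before dd continues, bypassing the cached
# asynchronous write path discussed in the comment.
sync_write() {
    local file=$1 count=${2:-1}
    dd if=/dev/zero of="$file" bs=4096 count="$count" \
        oflag=sync conv=notrunc 2>/dev/null
}
```

The target file has to live on the affected OST for the write to reach the matching OSC; how the file is placed there is left out of this sketch.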

            This is also a grant issue: when it happens, all osc.cur_{dirty|grant|lost_grant}_bytes values are zero.

            139896832 bytes (140 MB) copied, 182.781 s, 765 kB/s
            Success!
            osc.lustre-OST0000-osc-ffff880105ae6000.cur_dirty_bytes=0
            osc.lustre-OST0000-osc-ffff88020f12b400.cur_dirty_bytes=0
            osc.lustre-OST0001-osc-ffff880105ae6000.cur_dirty_bytes=0
            osc.lustre-OST0001-osc-ffff88020f12b400.cur_dirty_bytes=0
            osc.lustre-OST0000-osc-ffff880105ae6000.cur_grant_bytes=0
            osc.lustre-OST0000-osc-ffff88020f12b400.cur_grant_bytes=0
            osc.lustre-OST0001-osc-ffff880105ae6000.cur_grant_bytes=0
            osc.lustre-OST0001-osc-ffff88020f12b400.cur_grant_bytes=0
            osc.lustre-OST0000-osc-ffff880105ae6000.cur_lost_grant_bytes=0
            osc.lustre-OST0000-osc-ffff88020f12b400.cur_lost_grant_bytes=0
            osc.lustre-OST0001-osc-ffff880105ae6000.cur_lost_grant_bytes=0
            osc.lustre-OST0001-osc-ffff88020f12b400.cur_lost_grant_bytes=0
            Resetting fail_loc on all nodes...done.
            PASS 15 (198s)
            
            == sanityn test 16: 2500 iterations of dual-mount fsx == 14:39:45 (1345066785)
            

            So we hit the same issue. A temporary fix is to delete all test files of test_15 and write some bytes in sync mode so that more grant can be allocated.

            jay Jinshan Xiong (Inactive) added the comment above.
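
The all-zero condition shown in the log above can be checked mechanically. The following is a minimal sketch that assumes input in the `name=value` form printed in that log (as produced by, for example, `lctl get_param osc.*.cur_grant_bytes`); `zero_grant_oscs` is a hypothetical helper name.

```shell
# Hypothetical helper: read "name=value" lines on stdin and print the
# OSC device names whose cur_grant_bytes value is 0.
zero_grant_oscs() {
    awk -F= '
        /\.cur_grant_bytes=/ && $2 == 0 {
            name = $1
            gsub(/^[ \t]+/, "", name)             # drop leading indentation
            sub(/\.cur_grant_bytes$/, "", name)   # keep only the device name
            print name
        }'
}
```

Lines for other parameters (cur_dirty_bytes, cur_lost_grant_bytes) are ignored, so the full log excerpt can be piped in unchanged.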

            I'm still able to reproduce this problem in local testing if sanityn.sh test_15() and test_16() both run with smaller OSTs that do not cause test_15 to be skipped.

            adilger Andreas Dilger added the comment above.

            Integrated in lustre-master » i686,server,el6,inkernel #487
            LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

            Result = SUCCESS
            Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
            Files :

            • lustre/tests/sanity-benchmark.sh
            • lustre/tests/sanityn.sh
            hudson Build Master (Inactive) added the comment above.

            Integrated in lustre-master » x86_64,server,el6,ofa #487
            LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

            Result = SUCCESS
            Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
            Files :

            • lustre/tests/sanity-benchmark.sh
            • lustre/tests/sanityn.sh
            hudson Build Master (Inactive) added the comment above.

            Integrated in lustre-master » i686,client,el6,ofa #487
            LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

            Result = SUCCESS
            Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
            Files :

            • lustre/tests/sanity-benchmark.sh
            • lustre/tests/sanityn.sh
            hudson Build Master (Inactive) added the comment above.

            Integrated in lustre-master » x86_64,client,el6,inkernel #487
            LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

            Result = SUCCESS
            Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
            Files :

            • lustre/tests/sanityn.sh
            • lustre/tests/sanity-benchmark.sh
            hudson Build Master (Inactive) added the comment above.

            Integrated in lustre-master » i686,server,el6,ofa #487
            LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

            Result = SUCCESS
            Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
            Files :

            • lustre/tests/sanity-benchmark.sh
            • lustre/tests/sanityn.sh
            hudson Build Master (Inactive) added the comment above.

            Integrated in lustre-master » x86_64,server,el6,inkernel #487
            LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

            Result = SUCCESS
            Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
            Files :

            • lustre/tests/sanity-benchmark.sh
            • lustre/tests/sanityn.sh
            hudson Build Master (Inactive) added the comment above.

            Integrated in lustre-master » i686,server,el5,inkernel #487
            LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

            Result = SUCCESS
            Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
            Files :

            • lustre/tests/sanityn.sh
            • lustre/tests/sanity-benchmark.sh
            hudson Build Master (Inactive) added the comment above.

            Integrated in lustre-master » i686,client,el5,inkernel #487
            LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

            Result = SUCCESS
            Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
            Files :

            • lustre/tests/sanity-benchmark.sh
            • lustre/tests/sanityn.sh
            hudson Build Master (Inactive) added the comment above.

            People

              jay Jinshan Xiong (Inactive)
              tappro Mikhail Pershin
              Votes: 0
              Watchers: 8