LU-1028

Bus error (core dumped) during fsx test

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.2.0, Lustre 2.4.0
    • Lustre 2.2.0
    • None
    • 3
    • 4251

    Description

      This bug appeared after commit e8ffe16619baf1ef7c5c6b117d338956372aa752, "LU-884 clio: client in memory checksum".
      Unfortunately, our tests do not show the fsx failure in sanity-benchmark, and sanity-benchmark is not part of every autotest run.

      The issue looks like the following:
      /usr/lib64/lustre/tests/sanity-benchmark.sh: line 186: 16826 Bus error (core dumped) fsx -c 50 -p 1000 -S $FSX_SEED -P $TMP -l $FSX_SIZE -N $(($FSX_COUNT * 100)) $testfile

      Example of report (master):
      https://maloo.whamcloud.com/test_sets/02485a3a-45d0-11e1-8d6e-5254004bbbd3

      sanity-benchmark is reported green, but fsx failed as shown above. Recently the same code was landed on orion and we started experiencing the same issue.
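
      For reference, the failing fsx invocation from sanity-benchmark can be re-run by hand. The sketch below is only an illustration: the seed, file size, operation count, and target path are assumed values, not the autotest defaults.

        # Hypothetical standalone run of the same fsx command line (values assumed)
        FSX_SEED=5661
        FSX_SIZE=$((256 * 1024 * 1024))          # 256 MB test file
        FSX_COUNT=1000
        testfile=/mnt/lustre/fsxfile
        fsx -c 50 -p 1000 -S $FSX_SEED -P /tmp -l $FSX_SIZE -N $(($FSX_COUNT * 100)) $testfile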

      Attachments

        Issue Links

          Activity

            [LU-1028] Bus error (core dumped) during fsx test

            adilger Andreas Dilger added a comment -

            Actually, this is happening a LOT for what appears to be a wide striping test at IU:
            https://maloo.whamcloud.com/test_sets/7114751c-d0c6-11e1-8d8f-52540035b04c

            I'm going to file a new bug.

            adilger Andreas Dilger added a comment -

            I saw this recently in https://maloo.whamcloud.com/test_sets/235473d8-d8b7-11e1-9e3b-52540035b04c

            bfaccini Bruno Faccini (Inactive) added a comment -

            No problem, I understand. I will file a new Jira starting next week with all the related info.

            jay Jinshan Xiong (Inactive) added a comment -

            If enabling full traces, we can see the -EDQUOT errno returned from osc_enter_cache(), which, I think, seems to happen during lock revoke and the subsequent page flush mechanism ...

            osc_enter_cache() is called to add the page into the OSC write cache. When it runs out of grants on this OSC, -EDQUOT will be returned to the caller (ll_commit_write() in llite), and then a SYNC I/O will be issued so more grants should be returned by the OST. Please take a look at the log again to make sure this process happened.

            In addition to Jinshan, Johann should be CC'd on the bug, since he was working on the grant code recently.

            Yes, I agree with Andreas that a new ticket should be created, and I will add Johann to the cc list.
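
            One way to confirm that sequence on a test client is sketched below; the mount point, file name, and debug flag are assumptions, not taken from this ticket.

              # Look at the grant counters around the failing write
              lctl get_param osc.*.cur_dirty_bytes osc.*.cur_grant_bytes osc.*.cur_lost_grant_bytes

              # Enable cache tracing, run some buffered I/O, then dump and search the debug log
              lctl set_param debug=+cache
              lctl clear
              dd if=/dev/zero of=/mnt/lustre/grant-check bs=1M count=64
              lctl dk /tmp/lustre-debug.log
              grep -E 'osc_enter_cache|EDQUOT' /tmp/lustre-debug.log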

            adilger Andreas Dilger added a comment -

            Hi Bruno, could you please file a separate bug for your "cur_grant is zero" issue, and copy the last comments over. This bug is closed, and was filed specifically for the fsx core dump problem, but your issue may be different. It is easier to track and prioritize if you file a new bug.

            In addition to Jinshan, Johann should be CC'd on the bug, since he was working on the grant code recently.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Jinshan, thanks for your quick answer already!!

            Even if we don't know how we get into such a situation, I don't think we are running the mmap case you described. Now that we are aware of this problem, we frequently monitor "/proc/fs/lustre/osc/<OST-import>/cur_{dirty|grant|lost_grant}_bytes" on each Client node, and at some point in time we find some of them with all 3 counters at zero while the OST is far from being full and other Clients still have grants for it!!

            On these Clients we are able to successfully create a file that we limited to only one of these OSTs/stripes (lfs setstripe), and then we can easily re-read it from the same Client. But when we try to access it from another Client, the file appears empty with a 0 size, and from then on also from its creator/original Client node!!!

            If enabling full traces, we can see the -EDQUOT errno returned from osc_enter_cache(), which, I think, seems to happen during lock revoke and the subsequent page flush mechanism ...
            So, according to me, it looks more like some scenario or I/O sequence is able to break the grant flow/mechanism.

            And again, when the problem is there on a Client, all 3 "cur_{dirty|grant|lost_grant}_bytes" stay at zero and we can repeat the same proof of failure multiple times. The only way we know to return to normal is to run a small program doing O_DIRECT writes to each affected OST.

            Will try to get you the related info/traces out from the site soon.
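
            A minimal sketch of that kind of recovery program, assuming a client mount at /mnt/lustre and illustrative file names and sizes (only the lfs setstripe and O_DIRECT idea comes from the comment above):

              # For each affected OST index, pin a small file to that OST and write it
              # with O_DIRECT so the I/O bypasses the client cache and goes straight
              # to the OST, which can hand grants back to this client.
              for ost in 0 1; do
                  f=/mnt/lustre/grant-recover-$ost
                  lfs setstripe -c 1 -i $ost $f
                  dd if=/dev/zero of=$f bs=1M count=4 oflag=direct conv=notrunc
              done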

            jay Jinshan Xiong (Inactive) added a comment -

            Hi Bruno, there is no ->page_mkwrite() method yet in v2.1.2, so yes, there is an issue where writing via mmap won't reserve grants, so the later flush will fail if there isn't enough space on the OST. As a result, an error flag is set in the address_space, so subsequent I/O will see this error (this is a common problem for async writes, because the app may have already exited and so will never see the error at all). Have you ever seen the OST running out of space while the application writes via mmap?

            BTW, if the OST is really running out of space, the grants can't be recovered by issuing an O_DIRECT I/O.

            There is a report about data corruption which involves all series of clients - 1.8, 2.1 and 2.3. There are huge changes in the client code between 1.8 and 2.x, but the grant mechanism has never changed. I suspect there may be some problem with it.

            Do you have a steady way to reproduce this problem?

            bfaccini Bruno Faccini (Inactive) added a comment -

            We also trigger the same situation (osc.cur_{dirty|grant|lost_grant}_bytes = 0) on a CEA test system running our/Bull build of Lustre v2.1.2. This build/version integrates LU-1299 (patch set 11) and the ORNL-22 patches, but not the one for this LU-1028.

            The very bad news/consequence, which does not clearly appear in this JIRA's comments, is that this situation causes file corruption on affected Clients, because applications/commands are still allowed to write into the cache while the later asynchronous flushes never succeed (-EDQUOT), and this occurs silently and out of context.

            As a work-around, we are also able to recover grants and the associated working mechanism by writing synchronous/O_DIRECT I/Os to the affected OSTs/OSCs. But this problem is a showstopper for the customer to migrate to v2.1.2, since there is always a timing window where corruption can occur.

            jay Jinshan Xiong (Inactive) added a comment -

            This is also the grant issue; when it happens, all osc.cur_{dirty|grant|lost_grant}_bytes are zero.

            139896832 bytes (140 MB) copied, 182.781 s, 765 kB/s
            Success!
            osc.lustre-OST0000-osc-ffff880105ae6000.cur_dirty_bytes=0
            osc.lustre-OST0000-osc-ffff88020f12b400.cur_dirty_bytes=0
            osc.lustre-OST0001-osc-ffff880105ae6000.cur_dirty_bytes=0
            osc.lustre-OST0001-osc-ffff88020f12b400.cur_dirty_bytes=0
            osc.lustre-OST0000-osc-ffff880105ae6000.cur_grant_bytes=0
            osc.lustre-OST0000-osc-ffff88020f12b400.cur_grant_bytes=0
            osc.lustre-OST0001-osc-ffff880105ae6000.cur_grant_bytes=0
            osc.lustre-OST0001-osc-ffff88020f12b400.cur_grant_bytes=0
            osc.lustre-OST0000-osc-ffff880105ae6000.cur_lost_grant_bytes=0
            osc.lustre-OST0000-osc-ffff88020f12b400.cur_lost_grant_bytes=0
            osc.lustre-OST0001-osc-ffff880105ae6000.cur_lost_grant_bytes=0
            osc.lustre-OST0001-osc-ffff88020f12b400.cur_lost_grant_bytes=0
            Resetting fail_loc on all nodes...done.
            PASS 15 (198s)
            
            == sanityn test 16: 2500 iterations of dual-mount fsx == 14:39:45 (1345066785)
            

            So we hit the same issue, and a temporary fix is to delete all the test files from test_15 and write some bytes in sync mode so that more grants can be allocated.
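
            A rough sketch of that temporary fix, with assumed paths (the actual test_15 file names are not shown in this log):

              # 1. Remove the files left behind by test_15 (actual paths omitted) to free space.
              # 2. Write a few bytes in sync mode so the OST hands back fresh grants:
              dd if=/dev/zero of=/mnt/lustre/grant-refill bs=4k count=16 oflag=sync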

            adilger Andreas Dilger added a comment -

            I'm still able to reproduce this problem in local testing if sanityn.sh test_15() and test_16() both run with smaller OSTs that do not cause test_15 to be skipped.
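
            A way to run just those two subtests against small OSTs might look like the line below; OSTSIZE and ONLY are standard Lustre test-framework variables, but the exact size needed to keep test_15 from being skipped is an assumption.

              # Run only sanityn test_15 and test_16 from the lustre/tests directory,
              # with OSTs small enough that test_15 is not skipped (size in KB).
              OSTSIZE=200000 ONLY="15 16" sh sanityn.sh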

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: tappro Mikhail Pershin
              Votes: 0
              Watchers: 8
