Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
    • Fix Version/s: Lustre 2.3.0, Lustre 2.4.0
    • Labels: None
    • Environment: CONFIG_DEBUG_SLAB=y
    • Severity: 3
    • Rank (Obsolete): 4237

    Description

      Lustre: DEBUG MARKER: == sanity test 103: acl test ========================================================================= 19:57:07 (1346774227)
      /work/lustre/head/clean/lustre/utils/l_getidentity
      Slab corruption (Tainted: P --------------- ): size-2048 start=dac6c470, len=2048
      Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
      Last user: [<dff39e58>](cfs_free+0x8/0x10 [libcfs])
      310: 02 00 00 00 01 00 07 00 ff ff ff ff 02 00 05 00
      320: 01 00 00 00 02 00 07 00 02 00 00 00 04 00 07 00
      330: ff ff ff ff 10 00 07 00 ff ff ff ff 20 00 05 00
      340: ff ff ff ff 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
      Next obj: start=dac6cc88, len=2048
      Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
      Last user: [<dff39e58>](cfs_free+0x8/0x10 [libcfs])
      000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
      010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

      02000000:00000010:1.0:1346774231.327841:1804:3373:0:(sec_null.c:217:null_alloc_repbuf()) kmalloced 'req->rq_repbuf': 2048 at dac6c470.
      ...

      02000000:00000010:1.0:1346774231.328361:836:3373:0:(sec_null.c:231:null_free_repbuf()) kfreed 'req->rq_repbuf': 2048 at dac6c470.
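      The 0x6b bytes in the dump are the poison pattern that CONFIG_DEBUG_SLAB writes into freed objects, and the two debug-log lines above show 'req->rq_repbuf' being kmalloced and later kfreed at dac6c470, the same address where the corrupted slab object starts, which suggests a write after free. As a hedged sketch (the log file path is illustrative), the correlation can be made by dumping the Lustre debug buffer and searching it for the corrupted address:

        # hypothetical sketch: match a corrupted slab address against the
        # Lustre debug log captured with full debugging enabled
        lctl dk /tmp/lustre-debug.log        # dump the kernel debug buffer to a file
        grep dac6c470 /tmp/lustre-debug.log  # find the kmalloc/kfree records for that object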

          Activity

            [LU-1823] sanity/103: slab corruption
            yujian Jian Yu added a comment -

            Per the above test report, the slab corruption issue occurred only on the MDS (fat-intel-2):

            fat-intel-2: Slab corruption (Not tainted): size-2048 start=ffff8802e1b534f8, len=2048
            fat-intel-2: Slab corruption (Not tainted): size-2048 start=ffff8802e1d776f8, len=2048
            fat-intel-2: Slab corruption (Not tainted): size-2048 start=ffff8802e13ca4c8, len=2048
             sanity test_103: @@@@@@ FAIL: slab corruption detected 
            

            keith Keith Mannthey (Inactive) added a comment -

            I have started a git bisect to narrow down the code change, but I fear the data is not reliable. I am not sure what has happened on my local VMs (I shuffled some VMs around yesterday), but I am no longer able to reproduce the core issue. I am running Lustre 2.3.50 (from master) with kernel-2.6.32-279.5.2 and not triggering the issue. I am moving back to kernel-2.6.32-279.1.1 (confirmed failed with Yu's test run) to see if the issue reappears.

            I will update when I know more.

            adilger Andreas Dilger added a comment -

            If there are no obvious sources of this corruption, it probably makes sense to submit this test patch as several separate changes, each based on one of the recent 2.2.* tags, to see if we can isolate when this corruption started. After that, it should be possible to do a (manual?) git-bisect to find which patch is the culprit, or at least narrow down the range of patches that need to be examined manually. It is also important to check, in each of the failure cases, which node type the corruption is seen on (MDS, OSS, client), since that will also reduce the number of changes that might have introduced the problem.

            It would make sense to include a check for the LU-1844 list_add/list_del corruption messages as well, since I suspect that is also a sign of random memory corruption.
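            A manual bisect along these lines could look like the following sketch; the tag name, build command, and test invocation are assumptions for illustration, not a confirmed procedure:

              # hypothetical sketch: bisect between an assumed known-good tag and master,
              # marking a revision bad when sanity test 103 leaves slab corruption in dmesg
              git bisect start master v2_2_59_0
              git bisect run sh -c '
                  make -j8 || exit 125                                  # 125 = skip unbuildable revisions
                  REFORMAT=y ONLY=103 sh lustre/tests/sanity.sh || exit 1
                  ! dmesg | grep -q "Slab corruption"                   # nonzero exit marks the revision bad
              '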
            yujian Jian Yu added a comment -

            Hi Keith,

            FYI, with the build for patch set 5 of http://review.whamcloud.com/#change,3876, I reproduced the issue with PTLDEBUG=-1 manually:
            https://maloo.whamcloud.com/test_sets/59a5ca46-f832-11e1-b114-52540035b04c

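            For reference, the manual reproduction amounts to something like the following, assuming the test framework picks PTLDEBUG up from the environment (the ONLY=103 invocation is the one Alex quotes below):

              # hypothetical sketch: run only sanity test 103 with full Lustre debugging
              export PTLDEBUG=-1                # -1 enables every debug message class
              REFORMAT=y ONLY=103 sh sanity.sh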
            yujian Jian Yu added a comment -

            Hi Keith,

            By using the build http://build.whamcloud.com/job/lustre-reviews/8904/ in http://review.whamcloud.com/#change,3876, I can manually reproduce the slab corruption issue on the RHEL6 distro by running only sanity test 103:
            https://maloo.whamcloud.com/test_sets/2c479ade-f7d3-11e1-8b95-52540035b04c

            The autotest run for the above build skipped sanity test 103 because it is in the EXCEPT_SLOW list. I'm updating the commit message to add SLOW=yes to the test parameters.


            keith Keith Mannthey (Inactive) added a comment -

            My config test didn't make it through the build on the first pass, but Yu has a very nice patch/test here that I am watching: http://review.whamcloud.com/#change,3876

            keith Keith Mannthey (Inactive) added a comment -

            I submitted a few config changes for b2_3 as suggested: http://review.whamcloud.com/3875

            I have been able to get some local testing done today. I tried an older 2.2.59 code base (I had it set up on one of my build servers) and I don't seem to see the problem there, but I do see it with master. I will work to narrow down the window of possible changes. Sorting out whether b2_3 is affected is my next step.

            bzzz Alex Zhuravlev added a comment -

            It would probably make sense to add a trivial (dmesg | grep 'Slab corruption' && error) to t-f so that we don't miss it.
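            A minimal sketch of such a check in test-framework shell style; the function name is hypothetical, and the error helper is assumed to produce the FAIL line quoted above:

              # hypothetical sketch: fail the test if the kernel log shows slab corruption
              check_slab_corruption() {
                  # CONFIG_DEBUG_SLAB prints "Slab corruption" when a redzone or poison check fails
                  if dmesg | grep -q 'Slab corruption'; then
                      error "slab corruption detected"
                  fi
              }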

            adilger Andreas Dilger added a comment -

            I know that in the past, Oleg, Johann, and I have wanted to run early development kernels with various debug options enabled for all kernel builds, so that this kind of problem is flushed out when patches land, instead of only being caught by users who happen to run debug kernels, or surfacing as silent corruption. This was previously suggested by Johann in TT-359, but I think it could be done with a patch to the kernel config options during the development cycle instead of via the test environment (which would need more effort/complexity).

            Since we are early in the 2.4 release cycle, I think it makes sense to enable these config options for all our server kernels (so they will be in effect on servers, and on clients running the server kernel). We can leave this as a blocker bug for the 2.4 release as a reminder to revert the debug kernel config changes.

            Given the relatively small number of patches that have landed on master compared to 2.3, it probably also makes sense to submit a patch to b2_3 to enable CONFIG_DEBUG_SLAB, CONFIG_DEBUG_SPINLOCK and possibly some others, with:

            Test-Parameters: fortestonly testgroup=full
            

            to see if there is a similar failure for b2_3.

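            As a hedged sketch, the suggested options could be flipped in the server kernel's .config with the kernel's own scripts/config helper before rebuilding (which options to enable beyond DEBUG_SLAB and DEBUG_SPINLOCK is left open, as above):

              # hypothetical sketch: enable slab and spinlock debugging in an existing .config
              ./scripts/config --file .config --enable DEBUG_SLAB --enable DEBUG_SPINLOCK
              make oldconfig    # re-resolve config dependencies after the edit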

            bzzz Alex Zhuravlev added a comment -

            I'm able to reproduce this almost 100% of the time with REFORMAT=y ONLY=103 sh sanity.sh, within a single vbox instance.

            keith Keith Mannthey (Inactive) added a comment -

            Are there more logs from the rest of the systems? Is there anything special needed to reproduce this?

            People

              Assignee: green Oleg Drokin
              Reporter: bzzz Alex Zhuravlev
              Votes: 0
              Watchers: 10
