Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.3.0, Lustre 2.4.0
    • Lustre 2.3.0, Lustre 2.4.0
    • None
    • CONFIG_DEBUG_SLAB=y
    • 3
    • 4237

    Description

      Lustre: DEBUG MARKER: == sanity test 103: acl test ========================================================================= 19:57:07 (1346774227)
      /work/lustre/head/clean/lustre/utils/l_getidentity
      Slab corruption (Tainted: P --------------- ): size-2048 start=dac6c470, len=2048
      Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
      Last user: [<dff39e58>](cfs_free+0x8/0x10 [libcfs])
      310: 02 00 00 00 01 00 07 00 ff ff ff ff 02 00 05 00
      320: 01 00 00 00 02 00 07 00 02 00 00 00 04 00 07 00
      330: ff ff ff ff 10 00 07 00 ff ff ff ff 20 00 05 00
      340: ff ff ff ff 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
      Next obj: start=dac6cc88, len=2048
      Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
      Last user: [<dff39e58>](cfs_free+0x8/0x10 [libcfs])
      000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
      010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

      02000000:00000010:1.0:1346774231.327841:1804:3373:0:(sec_null.c:217:null_alloc_repbuf()) kmalloced 'req->rq_repbuf': 2048 at dac6c470.
      ...

      02000000:00000010:1.0:1346774231.328361:836:3373:0:(sec_null.c:231:null_free_repbuf()) kfreed 'req->rq_repbuf': 2048 at dac6c470.
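With CONFIG_DEBUG_SLAB the kernel fills freed objects with the poison byte 0x6b, so the non-6b bytes in the dump lines at offsets 310 through 340 suggest that something wrote into the buffer after it was freed. The following is a minimal sketch of a filter that lists such bytes; the function name `find_unpoisoned` is hypothetical, and the input format is assumed from the dump excerpt above.

```shell
#!/bin/sh
# find_unpoisoned: hypothetical helper. Reads slab dump lines of the form
# "310: 02 00 ... 00" on stdin and prints "<line offset>+<index> <byte>"
# for every byte that is not the freed-object poison value 0x6b.
find_unpoisoned() {
    awk '
        # Match only "<hex offset>: <two-digit hex bytes...>" lines, so that
        # Redzone/Last user lines and Lustre debug log lines are skipped.
        /^[ \t]*[0-9a-fA-F]+:([ \t]+[0-9a-fA-F][0-9a-fA-F])+[ \t]*$/ {
            off = $1
            sub(/:$/, "", off)
            for (i = 2; i <= NF; i++) {
                if (tolower($i) != "6b") {
                    # i - 2 is the position of the byte within this dump line.
                    printf "%s+%d %s\n", off, i - 2, $i
                }
            }
        }
    '
}
```

Assumed usage: save the relevant part of the console log to a file and run `find_unpoisoned < dump.txt`. An empty result means every dumped byte still holds the poison value.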

          Activity

            [LU-1823] sanity/103: slab corruption
            yujian Jian Yu added a comment -

            Hi Keith,

            FYI, with the build for patch set 5 of http://review.whamcloud.com/#change,3876, I reproduced the issue with PTLDEBUG=-1 manually:
            https://maloo.whamcloud.com/test_sets/59a5ca46-f832-11e1-b114-52540035b04c

            yujian Jian Yu added a comment -

            Hi Keith,

            By using the build http://build.whamcloud.com/job/lustre-reviews/8904/ in http://review.whamcloud.com/#change,3876, I can manually reproduce the slab corruption issue on RHEL6 distro by only running sanity test 103:
            https://maloo.whamcloud.com/test_sets/2c479ade-f7d3-11e1-8b95-52540035b04c

            The autotest run for the above build skipped sanity test 103 because it's in the EXCEPT_SLOW list. I'm updating the commit message to add SLOW=yes into the test parameters.


            keith Keith Mannthey (Inactive) added a comment -

            My config test didn't make it through the build on the first pass, but Yu has a very nice patch/test here that I am watching: http://review.whamcloud.com/#change,3876

            keith Keith Mannthey (Inactive) added a comment -

            I submitted a few config changes for b2_3 as suggested: http://review.whamcloud.com/3875

            I have been able to get some local testing done today. I tried an older 2.2.59 code base (I had it set up on one of my build servers) and I don't seem to see the problem there, but I do see it with master. I will work to narrow down the window of possible changes. Sorting out whether b2_3 is affected is my next step.

            bzzz Alex Zhuravlev added a comment -

            It would probably make sense to add a trivial check, (dmesg|grep 'Slab corruption' && error), to t-f so that we don't miss it.
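The check suggested above could be sketched roughly as follows. This is an illustration only: `check_slab_corruption` is a hypothetical name, not an existing function in the Lustre test framework, and the sketch reads the kernel log from stdin so it can be tried without touching the real dmesg buffer.

```shell
#!/bin/sh
# check_slab_corruption: hypothetical helper, not part of the Lustre test
# framework. Reads kernel log text on stdin and returns non-zero if it
# contains a "Slab corruption" report, printing the matching lines to stderr.
check_slab_corruption() {
    if matches=$(grep 'Slab corruption'); then
        echo "kernel reported slab corruption:" >&2
        echo "$matches" >&2
        return 1
    fi
    return 0
}

# Assumed usage after a test run (dmesg may need root privileges):
#   dmesg | check_slab_corruption || exit 1
```

Returning a status instead of calling an error function directly keeps the sketch independent of any particular test harness; the caller decides how to fail the test.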

            adilger Andreas Dilger added a comment -

            I know in the past, Oleg, Johann, and I have wanted to run early development kernels with various debug options enabled for all kernel builds, so that this kind of problem is flushed out when patches land, instead of surfacing only for users who run these debug kernels or as silent corruption. Johann previously suggested this in TT-359, but I think it could be done with a patch to the kernel config options during the development cycle instead of via the test environment (which would need more effort/complexity).

            Since we are early in the 2.4 release cycle, I think it makes sense to enable these config options for all our server kernels (so they will be seen for servers and for clients running the server kernel). We can leave this as a blocker bug for the 2.4 release as a reminder to revert the debug kernel config changes.

            Given the relatively small number of patches that have landed on master compared to 2.3, it probably also makes sense to submit a patch to b2_3 to enable CONFIG_DEBUG_SLAB, CONFIG_DEBUG_SPINLOCK and possibly some others, with:

            Test-Parameters: fortestonly testgroup=full

            to see if there is a similar failure for b2_3.
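Before relying on a test node to catch this class of bug, it may help to confirm that its kernel was actually built with the debug options named above. A minimal sketch, assuming the kernel config is available as a plain-text file; `missing_debug_options` is a hypothetical helper name and the config path in the usage comment varies by distribution.

```shell
#!/bin/sh
# missing_debug_options: hypothetical helper. Takes a kernel config file
# path followed by option names, and prints each option that is not set
# to "y" in that file. Returns non-zero if any option is missing.
missing_debug_options() {
    config=$1
    shift
    rc=0
    for opt in "$@"; do
        # An enabled built-in option appears as a line "CONFIG_FOO=y".
        if ! grep -q "^${opt}=y\$" "$config"; then
            echo "$opt"
            rc=1
        fi
    done
    return $rc
}

# Assumed usage; the config path varies by distribution:
#   missing_debug_options "/boot/config-$(uname -r)" CONFIG_DEBUG_SLAB CONFIG_DEBUG_SPINLOCK
```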

            bzzz Alex Zhuravlev added a comment -

            I'm able to reproduce this almost 100% of the time with REFORMAT=y ONLY=103 sh sanity.sh, within a single vbox instance.

            keith Keith Mannthey (Inactive) added a comment -

            Are there more logs from the rest of the systems? Is there anything special needed to reproduce this?

            pjones Peter Jones added a comment -

            Keith is going to try and reproduce this with a debug kernel


            People

              green Oleg Drokin
              bzzz Alex Zhuravlev