  Lustre / LU-2902

sanity test_156: NOT IN CACHE: before: , after:

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.4.1, Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.0
    • 3
    • 6990

    Description

      This issue was created by maloo for Oleg Drokin <green@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/406900a6-84d3-11e2-9ab1-52540035b04c.

      The sub-test test_156 failed with the following error:

      NOT IN CACHE: before: 16741, after: 16741

      This seems to have an astounding 21% failure rate, and nobody has filed a ticket for it yet.
      It might be related to the older LU-2009, which it seems was never really investigated.

      Info required for matching: sanity 156
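      As a rough illustration of the check behind this failure message (a sketch only: the stat name "cache_access" and the awk-over-stats approach are assumptions based on this ticket, not the actual sanity.sh code):

```shell
# Hedged sketch of the before/after cache check that produces this error.
# In a real run the input would come from something like
# 'lctl get_param -n osd-*.*OST*.stats' rather than the canned text below.
sum_cache_access() {
    awk '$1 == "cache_access" { sum += $2 } END { print sum + 0 }'
}

# Canned stats standing in for the live /proc values; identical before and
# after counts reproduce the failure mode reported in this ticket.
before=$(printf 'cache_access 16741 samples [pages]\n' | sum_cache_access)
# ... a read of a file expected to be in the OSS cache would happen here ...
after=$(printf 'cache_access 16741 samples [pages]\n' | sum_cache_access)

if [ "$after" -le "$before" ]; then
    echo "NOT IN CACHE: before: $before, after: $after"
fi
```

      When the cache counter never moves (or the stats entry is missing entirely, so both sums are empty), the check fails exactly this way.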

      Attachments

        Issue Links

          Activity


            Keith Mannthey (Inactive) added a comment -

            OK, one week into the LU-2979 patch landing and there are no failures of sanity 151 or 156. This is good news.

            Keith Mannthey (Inactive) added a comment -

            I did a quick search in Maloo today. I didn't see any failures in the last 2 days. I will check again next week.
            Sarah Liu added a comment -

            Hit this issue when running interop between a 2.3.0 client and 2.4-tag-2.3.65:
            https://maloo.whamcloud.com/test_sets/e89c3aac-bbee-11e2-b013-52540035b04c
            John Hammond added a comment -

            The cases with missing values are probably occurrences of LU-2979.

            Keith Mannthey (Inactive) added a comment -

            I have started a debug run of sorts here: http://review.whamcloud.com/6006

            Keith Mannthey (Inactive) added a comment -

            LU-3094 does not seem to be the root issue for this problem.

            In summary, there is no sign of the /proc entries we are looking for. The kernel is not reporting values in this area.

            I have not been able to reproduce this outside of autotest. Other Lustre entries in /proc seem to be working, so I am guessing that for some reason this chunk fails to initialize, rather than lproc collapsing as a whole?

            I am starting to work on a debug patch for Lustre to help identify what might be happening.
            Sarah Liu added a comment -

            Also seen this issue after upgrading from 1.8.9 to 2.4 and then adding one new MDT:

            https://maloo.whamcloud.com/test_sets/a02cc9b2-9ec5-11e2-975f-52540035b04c

            Keith Mannthey (Inactive) added a comment -

            I opened LU-3094 to track the test 132 issues.

            Keith Mannthey (Inactive) added a comment - edited

            Thanks for the links:

            https://maloo.whamcloud.com/test_sets/016cbfcc-9816-11e2-879d-52540035b04c
            sanity test_151 AND test_156: @@@@@@ FAIL: NOT IN CACHE: before: , after:

            This is the "after" call with some spaces added for readability.

            snapshot_time 1364501362.939531 secs.usecs read_bytes 11 samples [bytes] 4096 1048576 1429504 write_bytes 6 samples [bytes] 1910 1048576 1910466 get_info 165 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 84 samples [reqs] create 2 samples [reqs] destroy 75 samples [reqs] setattr 1 samples [reqs] punch 2 samples [reqs] sync 7 samples [reqs] preprw 17 samples [reqs] commitrw 17 samples [reqs] ping 95 samples [reqs]
            snapshot_time 1364501362.939706 secs.usecs read_bytes 770 samples [bytes] 4096 1048576 805322752 write_bytes 770 samples [bytes] 3013 1048576 805321669 get_info 164 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 84 samples [reqs] create 2 samples [reqs] destroy 77 samples [reqs] punch 5 samples [reqs] sync 6 samples [reqs] preprw 1540 samples [reqs] commitrw 1540 samples [reqs] ping 83 samples [reqs]
            snapshot_time 1364501362.939814 secs.usecs read_bytes 3 samples [bytes] 8192 8192 24576 write_bytes 2 samples [bytes] 6096 6096 12192 get_info 157 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 84 samples [reqs] create 2 samples [reqs] destroy 74 samples [reqs] punch 1 samples [reqs] sync 3 samples [reqs] preprw 5 samples [reqs] commitrw 5 samples [reqs] ping 96 samples [reqs]
            snapshot_time 1364501362.939862 secs.usecs read_bytes 2 samples [bytes] 8192 8192 16384 write_bytes 3 samples [bytes] 1916 6096 12108 get_info 153 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 84 samples [reqs] create 2 samples [reqs] destroy 75 samples [reqs] punch 1 samples [reqs] sync 4 samples [reqs] preprw 5 samples [reqs] commitrw 5 samples [reqs] ping 95 samples [reqs]
            snapshot_time 1364501362.939991 secs.usecs read_bytes 2 samples [bytes] 8192 8192 16384 write_bytes 2 samples [bytes] 777 6096 6873 get_info 156 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 84 samples [reqs] create 2 samples [reqs] destroy 76 samples [reqs] setattr 2 samples [reqs] punch 3 samples [reqs] sync 3 samples [reqs] preprw 4 samples [reqs] commitrw 4 samples [reqs] ping 94 samples [reqs]
            snapshot_time 1364501362.940079 secs.usecs read_bytes 2 samples [bytes] 8192 8192 16384 write_bytes 2 samples [bytes] 1916 6096 8012 get_info 154 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 84 samples [reqs] create 2 samples [reqs] destroy 76 samples [reqs] setattr 1 samples [reqs] punch 1 samples [reqs] sync 2 samples [reqs] preprw 4 samples [reqs] commitrw 4 samples [reqs] ping 95 samples [reqs]
            snapshot_time 1364501362.940243 secs.usecs read_bytes 2 samples [bytes] 4096 12288 16384 write_bytes 2 samples [bytes] 12288 50400 62688 get_info 150 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 84 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] setattr 1 samples [reqs] sync 2 samples [reqs] preprw 4 samples [reqs] commitrw 4 samples [reqs] ping 97 samples [reqs]
            There are 7 snapshot_time entries (one per OST), and they are all obdfilter.lustre-OST*.stats results. There has been some activity on the filesystem.

            There is no sign of the stats from "osd-*.lustre-OST*.stats".

            There are zero cache accesses reported on all 7 OSTs; either the cache is broken or it is just not reporting correctly.
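            A quick way to confirm what the dump above shows (a sketch: only an excerpt of the first OST's stats is embedded here, with the field layout taken from the dump in this ticket):

```shell
# Count OST snapshots and cache_access lines in a stats dump. Against the
# full 7-OST dump above, the cache_access count likewise comes out 0.
stats='snapshot_time 1364501362.939531 secs.usecs
read_bytes 11 samples [bytes] 4096 1048576 1429504
write_bytes 6 samples [bytes] 1910 1048576 1910466
ping 95 samples [reqs]'

summary=$(echo "$stats" | awk '
    $1 == "snapshot_time" { osts++ }
    $1 == "cache_access"  { hits++ }
    END { printf "OSTs: %d, cache_access lines: %d", osts + 0, hits + 0 }')
echo "$summary"
```

            Ordinary I/O stats are present while the cache counters are simply absent, which matches the "this chunk fails to initialize" theory above.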

            I noticed in the dmesg of the OST that there are some really bad things happening in test 132. https://maloo.whamcloud.com/test_logs/2613697e-9817-11e2-879d-52540035b04c/download

            It looks like there is some communication breakdown, and the OSTs remount a few times?

            There are bad messages that don't appear in other runs of the test. Perhaps the system is in some bad state after test 132.

            Part of the OST dmesg:

            Lustre: DEBUG MARKER: test -b /dev/lvm-OSS/P5
            Lustre: DEBUG MARKER: mkdir -p /mnt/ost5; mount -t lustre   		                   /dev/lvm-OSS/P5 /mnt/ost5
            LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. quota=on. Opts: 
            Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
            LustreError: 137-5: UUID 'lustre-OST0005_UUID' is not available for connect (no target)
            LustreError: Skipped 1 previous similar message
            LustreError: 4490:0:(ldlm_resource.c:1161:ldlm_resource_get()) lvbo_init failed for resource 663: rc -2
            LustreError: 4490:0:(ldlm_resource.c:1161:ldlm_resource_get()) Skipped 184 previous similar messages
            
            

            There is a lot more in the dmesg, but all in all it does not look good.

            Sebastien Buisson (Inactive) added a comment -

            It seems we hit two new occurrences of this problem on master yesterday:
            https://maloo.whamcloud.com/test_sets/016cbfcc-9816-11e2-879d-52540035b04c
            https://maloo.whamcloud.com/test_sets/ab3b7452-97f4-11e2-a652-52540035b04c

            Keith Mannthey (Inactive) added a comment -

            I thought I would update: the issue (test 151 or 156) has not failed since the debug patch was landed; I spent some time looking in Maloo. There are failures on OpenSFS with old code, but nothing in the current master space.

            I am all for leaving the debug patch in (Andreas considered it OK to leave in for the longer term), and we will see what happens.

            People

              Assignee: Keith Mannthey (Inactive)
              Reporter: Maloo
              Votes: 0
              Watchers: 12