LU-9305: Running File System Aging creates write checksum errors

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.10.0, Lustre 2.11.0

    Description

      My most recent reproduction of this was:
      ZFS based on 0.7.0 RC4 fs/zfs:coral-rc1-combined
      Lustre tagged release 2.9.57 (but 2.9.58 fails as well)
      CentOS 7.3 3.10.0-514.16.1.el7.x86_64

      I have personally verified that this fails on Lustre 2.8, 2.9, and the latest tagged release, on ZFS from 0.6.5 through current ZoL master, and on the most recent CentOS 7.1, 7.2, and 7.3 kernels.

      This may well be a Lustre issue; I still need to try to reproduce it on raidz, without large RPCs, etc.

      On both the clients and OSS nodes we see checksum errors while the file aging test is running, such as:
      [ 9354.968454] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.6@o2ib inode [0x200000401:0x254:0x0] object 0x0:292 extent [117440512-125698047]: client csum de357896, server csum 5cd77893

      [ 9394.315856] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.6@o2ib inode [0x200000401:0x28c:0x0] object 0x0:320 extent [67108864-82968575]: client csum df6bd34a, server csum 8480d352
      [ 9404.371609] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.6@o2ib inode [0x200000401:0x298:0x0] object 0x0:326 extent [67108864-74448895]: client csum 2ced4ec0, server csum 1f814ec4
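
      The message itself shows what the check does: the client checksums the pages of a bulk write before sending, and the OST recomputes the checksum over the pages it received, so any change to a page while the RPC is in flight produces exactly this client/server mismatch. A minimal userspace sketch of that pattern (illustrative C only, not Lustre code; csum() is a stand-in for the negotiated checksum algorithm):

      #include <inttypes.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      /* Stand-in digest (CRC32); any stable checksum works for the illustration. */
      static uint32_t csum(const unsigned char *buf, size_t len)
      {
              uint32_t c = 0xffffffff;
              for (size_t i = 0; i < len; i++) {
                      c ^= buf[i];
                      for (int k = 0; k < 8; k++)
                              c = (c >> 1) ^ (0xedb88320 & -(c & 1));
              }
              return ~c;
      }

      int main(void)
      {
              unsigned char page[4096];
              memset(page, 0xab, sizeof(page));

              /* Client side: checksum the pages, then hand them to the network. */
              uint32_t client_csum = csum(page, sizeof(page));

              /* Bug analogue: the page is released and reused while the bulk
               * RPC still references it, so its contents change in flight. */
              page[100] = 0x00;

              /* Server side: recompute over what actually arrived. */
              uint32_t server_csum = csum(page, sizeof(page));

              if (client_csum != server_csum)
                      printf("BAD WRITE CHECKSUM: client csum %08" PRIx32
                             ", server csum %08" PRIx32 "\n",
                             client_csum, server_csum);
              return 0;
      }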

      Attachments

        1. BasicLibs.py
          6 kB
        2. debug_info.20170406_143409_48420_wolf-3.wolf.hpdd.intel.com.tgz
          3.45 MB
        3. debug_vmalloc_lustre.patch
          6 kB
        4. debug_vmalloc_spl.patch
          14 kB
        5. debug_vmalloc.patch
          22 kB
        6. FileAger-wolf6.py
          6 kB
        7. FileAger-wolf7.py
          6 kB
        8. FileAger-wolf8.py
          6 kB
        9. FileAger-wolf9.py
          6 kB
        10. Linux_x64_Memory_Address_Mapping.pdf
          224 kB
        11. wolf-6_client.tgz
          5.67 MB


          Activity


            phils@dugeo.com Phil Schwan (Inactive) added a comment:

            Thanks, Jinshan! After another 20 crashes yesterday, we installed the patch about 12 hours ago. So far so good.

            jay Jinshan Xiong (Inactive) added a comment:

            Hi Phil,

            The problem discovered in this ticket is about memory corruption, so I wouldn't be surprised if the symptom is related to this bug. Please go ahead and try the patch and see how it goes.
            phils@dugeo.com Phil Schwan (Inactive) added a comment (edited):

            Do we think that this bug is inherently to blame for the "Error -14" / put_page crash that I see referenced in a couple of the comments?

            19:00:33:[ 2346.715685] LNetError: 6844:0:(socklnd_cb.c:1149:ksocknal_process_receive()) [ffff880061190000] Error -14 on read from 12345-10.9.4.124@tcp ip 10.9.4.124:1021
            19:00:33:[ 2346.766822] CPU: 1 PID: 6842 Comm: socknal_reaper Tainted: P OE ------------ 3.10.0-514.21.1.el7_lustre.x86_64 #1
            19:00:33:[ 2346.766822] RIP: 0010:[<ffffffff8118edaa>] [<ffffffff8118edaa>] put_page+0xa/0x60

            (I give up trying to get it to format correctly.)

            We've hit this crash about 20 times in the past month in normal production use (ZFS 0.7.1-1 / Lustre 2.9.51_45_g3b3eeeb), but generally without a "BAD WRITE CHECKSUM" error anywhere in sight.

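            A userspace analogue suggests the two symptoms can share a cause (an assumption based on the fix below, not something confirmed in this comment thread): an extra page release drops the reference count to zero while a legitimate holder remains, and that holder's later put_page() then runs on a freed page - the shape of the put_page+0xa/0x60 oops above. Simplified model, not kernel code:

            #include <stdatomic.h>
            #include <stdio.h>
            #include <stdlib.h>

            /* Toy refcounted page; names here are illustrative only. */
            struct page_ref {
                    atomic_int count;
                    void *data;
            };

            static void get_ref(struct page_ref *p)
            {
                    atomic_fetch_add(&p->count, 1);
            }

            static void put_ref(struct page_ref *p)
            {
                    /* Analogue of put_page_testzero(): free on the last reference. */
                    if (atomic_fetch_sub(&p->count, 1) == 1) {
                            free(p->data);
                            free(p);
                    }
            }

            int main(void)
            {
                    struct page_ref *p = malloc(sizeof(*p));
                    atomic_init(&p->count, 1);      /* owner's reference */
                    p->data = malloc(4096);

                    get_ref(p);     /* I/O path takes its own reference: count 2 */

                    put_ref(p);     /* I/O completes: 2 -> 1, fine */
                    put_ref(p);     /* erroneous second release: 1 -> 0, freed */

                    /* The owner's own eventual release is now a use-after-free,
                     * the analogue of the put_page() oops in the report: */
                    /* put_ref(p); */
                    return 0;
            }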

            gerrit Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27950/
            Subject: LU-9305 osd: do not release pages twice
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6d039389735c52f965505643de9d8e4772e3f87f

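            The subject line "do not release pages twice" names the classic double-release idiom. A self-contained sketch of the general pattern such a fix enforces (illustration only - free() stands in for the page release, and the actual change is the one merged in review 27950):

            #include <stdlib.h>

            #define NPAGES 16

            /* Hypothetical helper, not Lustre code: release each slot exactly
             * once and clear it so an overlapping cleanup path skips it. */
            static void release_pages_once(void *pages[], int npages)
            {
                    for (int i = 0; i < npages; i++) {
                            if (pages[i] != NULL) {
                                    free(pages[i]);    /* stand-in for put_page() */
                                    pages[i] = NULL;   /* later passes see NULL */
                            }
                    }
            }

            int main(void)
            {
                    void *pages[NPAGES];

                    for (int i = 0; i < NPAGES; i++)
                            pages[i] = malloc(4096);

                    release_pages_once(pages, NPAGES);
                    /* A second, overlapping cleanup pass is now harmless. */
                    release_pages_once(pages, NPAGES);
                    return 0;
            }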

            bzzz Alex Zhuravlev added a comment:

            Jinshan, thanks for testing.

            jay Jinshan Xiong (Inactive) added a comment:

            The patch should fix the problem - I have run the patched Lustre for several hours without hitting the issue; without the patch, it died quickly.
            jay Jinshan Xiong (Inactive) added a comment (edited):

            This is indeed a race condition. I wonder why I couldn't catch the race by enabling VM_BUG_ON_PAGE() in put_page_testzero().

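            One plausible reason the assertion stays quiet (an assumption, not confirmed in the ticket): a check at put time only trips once the count is already bad, and while another holder still has a reference, every individual put - including the erroneous extra one - sees a positive count and looks legal. Toy model with a plain counter:

            #include <assert.h>
            #include <stdio.h>

            /* Toy analogue of put_page_testzero() with a VM_BUG_ON_PAGE-style
             * check; the counter and names are illustrative, not kernel code. */
            static int refcount = 2;        /* owner + I/O path */

            static int put_testzero(void)
            {
                    assert(refcount > 0);   /* the VM_BUG_ON_PAGE analogue */
                    return --refcount == 0; /* true on the last reference */
            }

            int main(void)
            {
                    put_testzero();  /* I/O path put: 2 -> 1, check passes */
                    put_testzero();  /* extra put: 1 -> 0, check still passes,
                                      * yet the page is freed under its owner */
                    printf("both puts looked legal; the page is already gone\n");
                    /* Only a third put would trip the assertion, and the freed
                     * page may be reused before it ever happens. */
                    return 0;
            }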

            bzzz Alex Zhuravlev added a comment:

            With https://review.whamcloud.com/27950 I can't reproduce the issue.

            gerrit Gerrit Updater added a comment:

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/27950
            Subject: LU-9305 osd: do not release pages twice
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 12d3b3dcfbc17fc201dc9de463720e3a3a994f49


            bzzz Alex Zhuravlev added a comment:

            No, I'm not familiar with the wolf cluster.

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: jsalians_intel John Salinas (Inactive)
              Votes: 1
              Watchers: 23
