LU-9305: Running File System Aging creates write checksum errors

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.10.0, Lustre 2.11.0

    Description

      My most recent reproduction of this was:
      ZFS based on 0.7.0 RC4 fs/zfs:coral-rc1-combined
      Lustre tagged release 2.9.57 (but 2.9.58 fails as well)
      CentOS 7.3 3.10.0-514.16.1.el7.x86_64

      I have personally verified that this fails on Lustre 2.8, 2.9, and the latest tagged release, on ZFS from 0.6.5 through current ZoL master, and on the most recent CentOS 7.1, 7.2, and 7.3 kernels.

      This may well be a Lustre issue; I still need to try to reproduce it on raidz, without large RPCs, etc.

      On both the clients and OSS nodes we see checksum errors while the file aging test is running, such as:
      [ 9354.968454] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.6@o2ib inode [0x200000401:0x254:0x0] object 0x0:292 extent [117440512-125698047]: client csum de357896, server csum 5cd77893

      [ 9394.315856] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.6@o2ib inode [0x200000401:0x28c:0x0] object 0x0:320 extent [67108864-82968575]: client csum df6bd34a, server csum 8480d352
      [ 9404.371609] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.6@o2ib inode [0x200000401:0x298:0x0] object 0x0:326 extent [67108864-74448895]: client csum 2ced4ec0, server csum 1f814ec4
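
      The message itself shows what the check does: the client checksums the pages of a bulk write before sending, and the OST recomputes the checksum over the pages it received, so any change to a page while the RPC is in flight produces exactly this client/server mismatch. A minimal userspace sketch of that pattern (illustrative C only, not Lustre code; csum() is a stand-in for the negotiated checksum algorithm):

      #include <inttypes.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      /* Stand-in digest (CRC32); any stable checksum works for the illustration. */
      static uint32_t csum(const unsigned char *buf, size_t len)
      {
              uint32_t c = 0xffffffff;
              for (size_t i = 0; i < len; i++) {
                      c ^= buf[i];
                      for (int k = 0; k < 8; k++)
                              c = (c >> 1) ^ (0xedb88320 & -(c & 1));
              }
              return ~c;
      }

      int main(void)
      {
              unsigned char page[4096];
              memset(page, 0xab, sizeof(page));

              /* Client side: checksum the pages, then hand them to the network. */
              uint32_t client_csum = csum(page, sizeof(page));

              /* Bug analogue: the page is released and reused while the bulk
               * RPC still references it, so its contents change in flight. */
              page[100] = 0x00;

              /* Server side: recompute over what actually arrived. */
              uint32_t server_csum = csum(page, sizeof(page));

              if (client_csum != server_csum)
                      printf("BAD WRITE CHECKSUM: client csum %08" PRIx32
                             ", server csum %08" PRIx32 "\n",
                             client_csum, server_csum);
              return 0;
      }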

      Attachments

        1. BasicLibs.py
          6 kB
        2. debug_info.20170406_143409_48420_wolf-3.wolf.hpdd.intel.com.tgz
          3.45 MB
        3. debug_vmalloc_lustre.patch
          6 kB
        4. debug_vmalloc_spl.patch
          14 kB
        5. debug_vmalloc.patch
          22 kB
        6. FileAger-wolf6.py
          6 kB
        7. FileAger-wolf7.py
          6 kB
        8. FileAger-wolf8.py
          6 kB
        9. FileAger-wolf9.py
          6 kB
        10. Linux_x64_Memory_Address_Mapping.pdf
          224 kB
        11. wolf-6_client.tgz
          5.67 MB


          Activity


            phils@dugeo.com Phil Schwan (Inactive) added a comment:

            Thanks, Jinshan! After another 20 crashes yesterday, we installed the patch about 12 hours ago. So far so good.

            jay Jinshan Xiong (Inactive) added a comment:

            Hi Phil,

            The problem discovered in this ticket is about memory corruption, so I wouldn't be surprised if the symptom is related to this bug. Please go ahead and try the patch and see how it goes.
            phils@dugeo.com Phil Schwan (Inactive) added a comment (edited):

            Do we think that this bug is inherently to blame for the "Error -14" / put_page crash that I see referenced in a couple of the comments?

            19:00:33:[ 2346.715685] LNetError: 6844:0:(socklnd_cb.c:1149:ksocknal_process_receive()) [ffff880061190000] Error -14 on read from 12345-10.9.4.124@tcp ip 10.9.4.124:1021
            19:00:33:[ 2346.766822] CPU: 1 PID: 6842 Comm: socknal_reaper Tainted: P OE ------------ 3.10.0-514.21.1.el7_lustre.x86_64 #1
            19:00:33:[ 2346.766822] RIP: 0010:[<ffffffff8118edaa>] [<ffffffff8118edaa>] put_page+0xa/0x60

            (I give up trying to get it to format correctly.)

            We've hit this crash about 20 times in the past month in normal production use (ZFS 0.7.1-1 / Lustre 2.9.51_45_g3b3eeeb), but generally without a "BAD WRITE CHECKSUM" error anywhere in sight.

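            A userspace analogue suggests the two symptoms can share a cause (an assumption based on the fix below, not something confirmed in this comment thread): an extra page release drops the reference count to zero while a legitimate holder remains, and that holder's later put_page() then runs on a freed page - the shape of the put_page+0xa/0x60 oops above. Simplified model, not kernel code:

            #include <stdatomic.h>
            #include <stdio.h>
            #include <stdlib.h>

            /* Toy refcounted page; names here are illustrative only. */
            struct page_ref {
                    atomic_int count;
                    void *data;
            };

            static void get_ref(struct page_ref *p)
            {
                    atomic_fetch_add(&p->count, 1);
            }

            static void put_ref(struct page_ref *p)
            {
                    /* Analogue of put_page_testzero(): free on the last reference. */
                    if (atomic_fetch_sub(&p->count, 1) == 1) {
                            free(p->data);
                            free(p);
                    }
            }

            int main(void)
            {
                    struct page_ref *p = malloc(sizeof(*p));
                    atomic_init(&p->count, 1);      /* owner's reference */
                    p->data = malloc(4096);

                    get_ref(p);     /* I/O path takes its own reference: count 2 */

                    put_ref(p);     /* I/O completes: 2 -> 1, fine */
                    put_ref(p);     /* erroneous second release: 1 -> 0, freed */

                    /* The owner's own eventual release is now a use-after-free,
                     * the analogue of the put_page() oops in the report: */
                    /* put_ref(p); */
                    return 0;
            }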

            gerrit Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27950/
            Subject: LU-9305 osd: do not release pages twice
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6d039389735c52f965505643de9d8e4772e3f87f

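            The subject line "do not release pages twice" names the classic double-release idiom. A self-contained sketch of the general pattern such a fix enforces (illustration only - free() stands in for the page release, and the actual change is the one merged in review 27950):

            #include <stdlib.h>

            #define NPAGES 16

            /* Hypothetical helper, not Lustre code: release each slot exactly
             * once and clear it so an overlapping cleanup path skips it. */
            static void release_pages_once(void *pages[], int npages)
            {
                    for (int i = 0; i < npages; i++) {
                            if (pages[i] != NULL) {
                                    free(pages[i]);    /* stand-in for put_page() */
                                    pages[i] = NULL;   /* later passes see NULL */
                            }
                    }
            }

            int main(void)
            {
                    void *pages[NPAGES];

                    for (int i = 0; i < NPAGES; i++)
                            pages[i] = malloc(4096);

                    release_pages_once(pages, NPAGES);
                    /* A second, overlapping cleanup pass is now harmless. */
                    release_pages_once(pages, NPAGES);
                    return 0;
            }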

            bzzz Alex Zhuravlev added a comment:

            Jinshan, thanks for testing.

            jay Jinshan Xiong (Inactive) added a comment:

            The patch should fix the problem - I have run the patched Lustre for several hours without hitting the issue; without the patch, it died quickly.
            jay Jinshan Xiong (Inactive) added a comment (edited):

            This is indeed a race condition. I wonder why I couldn't catch the race by enabling VM_BUG_ON_PAGE() in put_page_testzero().

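            One plausible reason the assertion stays quiet (an assumption, not confirmed in the ticket): a check at put time only trips once the count is already bad, and while another holder still has a reference, every individual put - including the erroneous extra one - sees a positive count and looks legal. Toy model with a plain counter:

            #include <assert.h>
            #include <stdio.h>

            /* Toy analogue of put_page_testzero() with a VM_BUG_ON_PAGE-style
             * check; the counter and names are illustrative, not kernel code. */
            static int refcount = 2;        /* owner + I/O path */

            static int put_testzero(void)
            {
                    assert(refcount > 0);   /* the VM_BUG_ON_PAGE analogue */
                    return --refcount == 0; /* true on the last reference */
            }

            int main(void)
            {
                    put_testzero();  /* I/O path put: 2 -> 1, check passes */
                    put_testzero();  /* extra put: 1 -> 0, check still passes,
                                      * yet the page is freed under its owner */
                    printf("both puts looked legal; the page is already gone\n");
                    /* Only a third put would trip the assertion, and the freed
                     * page may be reused before it ever happens. */
                    return 0;
            }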

            bzzz Alex Zhuravlev added a comment:

            With https://review.whamcloud.com/27950 I can't reproduce the issue.

            gerrit Gerrit Updater added a comment:

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/27950
            Subject: LU-9305 osd: do not release pages twice
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 12d3b3dcfbc17fc201dc9de463720e3a3a994f49


            bzzz Alex Zhuravlev added a comment:

            No, I'm not familiar with the wolf cluster.

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: jsalians_intel John Salinas (Inactive)
              Votes: 1
              Watchers: 23
