[LU-4714] Corruption detected when copying large number of files occasionally Created: 05/Mar/14  Updated: 21/Sep/18  Resolved: 21/Sep/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.3, Lustre 2.1.5, Lustre 2.1.6, Lustre 2.4.2
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jay Lee Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Server : RHEL 6.3 (Kernel 2.6.32-279.19.1), In-kernel OFED, Lustre 2.1.5/2.1.3/2.4.2/2.1.6
Client : RHEL 6.3 (Kernel 2.6.32-279.19.1)
OFED-1.5.4, Lustre 2.1.5/2.1.6
In-kernel OFED, Lustre 2.1.3

90 compute nodes
2 MDS, active/backup
4 OSS, active/active

Mainly used for weather prediction.


Severity: 3
Rank (Obsolete): 12956

 Description   

Hi,

The customer site is using Lustre 2.1.5 with RHEL 6.3 clients on the new cluster, and 2.1.3 with SLES 11 clients on the old cluster.

The problem has only occurred on 2.1.5 with RHEL 6.3 clients.

At first, the customer reported file corruption that appeared to occur randomly.
After drilling down into the situation, we found that corrupted files show up occasionally when converting a 'GRIB' file on some nodes.
We also confirmed that the original files, as written, were not corrupted. However, one or more other nodes could not read one or more of the files.

File corruption was not observed when dealing with a small number of files (fewer than 100; a single file is usually around 55 MiB), but the larger the number of files, up to 660, the more corruption was observed (checked with diff or md5sum after copying the files).
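
For reference, the post-copy check described above can be sketched in Python roughly as follows. This is only an illustration, not the script used at the customer site: the function names `md5_of` and `find_mismatches` and the directory layout are assumptions, and it simply compares the MD5 digest of each source file with that of its copy.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(src_dir, dst_dir):
    """Compare every regular file under src_dir with its copy under dst_dir.

    Returns a sorted list of relative paths whose copy is missing or whose
    MD5 digest differs from the source (hypothetical helper, for illustration).
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    bad = []
    for src in sorted(p for p in src_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(src_dir)
        dst = dst_dir / rel
        if not dst.is_file() or md5_of(src) != md5_of(dst):
            bad.append(str(rel))
    return bad
```

Calling `find_mismatches(source_dir, copy_dir)` from a different client node than the one that performed the copy would match the symptom described here, where the written data is intact but another node reads it back differently. Note that repeated reads on the same node may be served from the client page cache.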

I found LU-4380 and LU-3219, and suspect this may also be related to FIEMAP and recent coreutils behavior.

I tried upgrading both server and client to 2.1.6. It did not help.
I also tried 2.4.2 on the server with 2.1.5/2.1.6/2.1.3 on the client. None of these combinations helped.

I found that disabling the server-side read cache helps to decrease the rate of corruption.
However, it does not eliminate the corruption issue.

What should I try next?

