Details
-
Bug
-
Resolution: Won't Fix
-
Major
-
Lustre 2.3.0, Lustre 2.1.1
-
3
-
4520
Description
A data integrity test run periodically by our storage group found two occurrences of corrupt files written to Lustre. The original files contain 300 MB of random data. The corrupt copies contain several 4096-byte regions of zeros aligned on 1MiB boundaries. The two corrupt files were written to the same filesystem from two different login nodes on the same cluster within five minutes of each other. The stripe count is 100.
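The corruption signature described above (4096-byte runs of zeros starting on 1MiB boundaries) can be searched for directly. The following is a minimal sketch, not the storage group's actual test; the function name `find_zero_blocks` and its parameters are hypothetical, and it assumes both copies fit in memory.

```python
BLOCK = 4096           # size of each zeroed region reported in this ticket
MIB = 1024 * 1024      # the zeroed regions were aligned on 1 MiB boundaries


def find_zero_blocks(original: bytes, restored: bytes, block=BLOCK, align=MIB):
    """Return aligned offsets where `restored` holds an all-zero block
    but `original` does not (hypothetical helper for illustration)."""
    zero = bytes(block)
    hits = []
    limit = min(len(original), len(restored))
    for off in range(0, limit, align):
        chunk = restored[off:off + block]
        # Only full blocks count; a short tail cannot match the signature.
        if len(chunk) == block and chunk == zero and original[off:off + block] != zero:
            hits.append(off)
    return hits
```

Checking only the aligned offsets keeps the scan cheap; a full byte-for-byte comparison is still needed to rule out other kinds of damage.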
The client application is a parallel ftp client reading data out of our storage archive into Lustre. The test checks for differences between the restored files and the original copies. For a 300MB file it uses 4 threads, which together issue four 64MB pwrite() calls and one 44MB pwrite() call. It is possible that a pwrite() gets restarted due to SIGUSR2 from a master process, though we don't know whether this occurred in the corrupting case. This test has seen years of widespread use on all of our clusters, and this is the first reported incident of this type of corruption, so we can characterize the frequency as rare.
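To make the write pattern concrete, here is a rough sketch of how a 300MB transfer splits into four 64MB writes plus one 44MB write spread over 4 threads. This is not the ftp client's code: `plan_chunks`, `pwrite_all`, and `parallel_write` are hypothetical names, and the short-write loop is an assumption about how a careful writer would behave.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def plan_chunks(total, chunk):
    """Split `total` units into (offset, length) pieces of at most `chunk`."""
    return [(off, min(chunk, total - off)) for off in range(0, total, chunk)]


def pwrite_all(fd, data, offset):
    """Call os.pwrite() until every byte of `data` is written at `offset`.
    pwrite() may write fewer bytes than requested, so loop on the remainder."""
    view = memoryview(data)
    written = 0
    while written < len(view):
        written += os.pwrite(fd, view[written:], offset + written)
    return written


def parallel_write(path, data, chunk, threads=4):
    """Write `data` to `path` in `chunk`-sized pieces using a thread pool."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        with ThreadPoolExecutor(max_workers=threads) as pool:
            futures = [
                pool.submit(pwrite_all, fd, data[off:off + length], off)
                for off, length in plan_chunks(len(data), chunk)
            ]
            return sum(f.result() for f in futures)
    finally:
        os.close(fd)
```

With sizes expressed in MB, `plan_chunks(300, 64)` yields offsets 0, 64, 128, 192, and 256, the last piece being 44 long, which matches the pattern described in this report.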
When I examine an OST object containing a corrupt region, I see there is no block allocated for the corrupt region (in this case, logical block 256 is missing).
# pigs58 /root > debugfs -c -R "dump_extents /O/0/d$((30205348 % 32))/30205348" /dev/sdb
debugfs 1.41.12 (17-May-2010)
/dev/sdb: catastrophic mode - not reading inode or group bitmaps
Level Entries     Logical              Physical Length Flags
 0/ 0   1/  3     0 - 255 813140480 - 813140735    256
 0/ 0   2/  3   257 - 511 813142528 - 813142782    255
 0/ 0   3/  3   512 - 767 813143040 - 813143295    256
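The gap in the extent list above can be found mechanically. The sketch below takes the logical ranges copied by hand from the debugfs output and reports unallocated blocks; `find_holes` is a hypothetical helper, and the 4096-byte block size is an assumption based on the size of the zeroed regions.

```python
def find_holes(extents):
    """Given (first, last) logical block ranges, return the uncovered
    (first, last) ranges that lie before the final extent."""
    holes = []
    expected = 0  # next logical block we expect to be allocated
    for first, last in sorted(extents):
        if first > expected:
            holes.append((expected, first - 1))
        expected = max(expected, last + 1)
    return holes


# Logical ranges from the dump_extents output in this ticket.
extents = [(0, 255), (257, 511), (512, 767)]
holes = find_holes(extents)                       # [(256, 256)]

BLOCK_SIZE = 4096                                 # assumed filesystem block size
byte_offsets = [first * BLOCK_SIZE for first, _ in holes]   # [1048576]
```

Under that block-size assumption, logical block 256 begins at byte 1048576 of the object, which is exactly 1MiB and agrees with the alignment of the zeroed regions seen in the corrupt files.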
Finally, the following server-side console messages appeared at the same time one of the corrupted files was written, and mention the NID of the implicated client. The consoles of the OSTs containing the corrupt objects were quiet at the time.
May 17 01:06:08 pigs-mds1 kernel: LustreError: 20418:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1402165306385077 has mismatched opc: new 101 old 0
May 17 01:06:08 pigs-mds1 kernel: Lustre: 20418:0:(mdt_recovery.c:1022:mdt_steal_ack_locks()) Stealing 1 locks from rs ffff880410f62000 x1402165306385077.t125822723745 o0 NID 192.168.114.155@o2ib5
May 17 01:06:08 pigs-mds1 kernel: Lustre: All locks stolen from rs ffff880410f62000 x1402165306385077.t125822723745 o0 NID 192.168.114.155@o2ib5
Issue Links
- is duplicated by: LU-1680 LBUG cl_lock.c:1949:discard_cb()) (ORI-726) (Resolved)
- is related to: LU-1458 lustre-rsync-test test_2b: old lustre_rsync does not work with new llog_changelog_ext_rec remove changelog (Resolved)
- is related to: LU-1703 b2_1 can't pass acc-sm test (Resolved)