[LU-6925] oss buffer cache corruption Created: 29/Jul/15 Updated: 15/Oct/15 Resolved: 15/Oct/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Oleg Drokin |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 1 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
User reported file corruption as shown below. The file is striped across 4 OSTs at 1MB. The corruption is 4KB in size, and its end aligns with an OST stripe boundary. The data found in the corrupt region comes from a process that runs on the OSS, writing and reading data on the local OSS filesystem. We have a cron job that dumps OST metadata once a day like so:

/sbin/dumpe2fs /dev/ostdevice > /root/ostdevice.meta 2>/dev/null

The output file is read every 15 minutes, so its inodes and pages stay cached on the OSS. Excerpt of the corrupted user file (the job's expected floating-point output, interrupted by 4KB of NUL bytes and dumpe2fs output):

0.1926E-04 0.8636E-05 -0.5430E-05 -0.1747E-04 -0.2318E-04 -0.2108E-04 -0.1270E-04 -0.1492E-05 0.8965E-05 0.1638E-04 0.2025E-04 0.2143E-04 0.2111E-04 0.2007E-04 0.1847E-04 0.1629E-04 0.1384E-04 0.1204E-04 0.1206E-
^@^@T^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^B^@^@^@^@^@^@^@^@^@^@^@^D^@^@^@^@^@^@^@^@^@^@^@^H^@^@^@^@^@^@^@^@^@^@^@ ^@^@^@^A^@^@^@E~L~G^@ ^@^@^@^A^@^@^@E~L~G^@^L^@^@^@^@^@^@^@^@^@^@^@^L^@^@^@^@^@^@^@^@^@^@^@^L^@^@^@^@^@^@^@^@^@^@^^ @^L^@^@^@^@^@^@^@^@^@^@^@^L^@^@^@^@^@^@^@^@^@^@
k bitmap at 2181038173 (bg #66560 + 93), Inode bitmap at 2181038429 (bg #66560 + 349)
Inode table at 2181039336-2181039343 (bg #66560 + 1256)
1780 free blocks, 128 free inodes, 0 directories, 128 unused inodes
Free blocks: 2184085760-2184086271, 2184094208-2184094451, 2184097792-2184098815
Free inodes: 8531585-8531712
Group 66654: (Blocks 2184118272-2184151039) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0xdc38, unused inodes 128
  Block bitmap at 2181038174 (bg #66560 + 94), Inode bitmap at 2181038430 (bg #66560 + 350)
  Inode table at 2181039344-2181039351 (bg #66560 + 1264)
  239 free blocks, 128 free inodes, 0 directories, 128 unused inodes
  Free blocks: 2184121088-2184121321, 2184121339-2184121343
  Free inodes: 8531713-8531840
Group 66655: (Blocks 2184151040-2184183807) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0xc8b0, unused inodes 128
  Block bitmap at 2181038175 (bg #66560 + 95), Inode bitmap at 2181038431 (bg #66560 + 351)
  Inode table at 2181039352-2181039359 (bg #66560 + 1272)
  5119 free blocks, 128 free inodes, 0 directories, 128 unused inodes
  Free blocks: 2184151297-2184151551, 2184154624-2184155135, 2184165376-2184166399, 2184167168-2184167423, 2184171520-2184172543, 2184179712-2184181759
  Free inodes: 8531841-8531968
Group 66656: (Blocks 2184183808-2184216575) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x7ce0, unused inodes 128
  Block bitmap at 2181038176 (bg #66560 + 96), Inode bitmap at 2181038432 (bg #66560 + 352)
  Inode table at 2181039360-2181039367 (bg #66560 + 1280)
  2816 free blocks, 128 free inodes, 0 directories, 128 unused inodes
  Free blocks: 2184184832-2184185855, 2184198144-2184198911, 2184205312-2184206335
  Free inodes: 8531969-8532096
Group 66657: (Blocks 2184216576-2184249343) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0xe3a2, unused inodes 128
  Block bitmap at 2181038177 (bg #66560 + 97), Inode bitmap at 2181038433 (bg #66560 + 353)
  Inode table at 2181039368-2181039375 (bg #66560 + 1288)
  2574 free blocks, 128 free inodes, 0 directories, 128 unused inodes
  Free blocks: 2184217600-2184218623, 2184221416-2184221437, 2184236544-2184237045, 2184237054-2184237055, 2184240128-2184241151
  Free inodes: 8532097-8532224
Group 66658: (Blocks 2184249344-2184282111) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0xe5c6, unused inodes 128
  Block bitmap at 2181038178 (bg #66560 + 98), Inode bitmap at 2181038434 (bg #66560 + 354)
  Inode table at 2181039376-2181039383 (bg #66560 + 1296)
  5426 free blocks, 128 free inodes, 0 directories, 128 unused inodes
  Free blocks: 2184251392-2184251647, 2184252160-2184252407, 2184252413-2184252415, 2184253440-2184254463, 2184255488-2184255743, 2184255924-2184256511, 2184259584-2184260095, 2184260352-2184260602, 2184260608-2184261631, 2184272896-2184273919, 2184276992-2184277229, 2184277246-2184277247
  Free inodes: 8532225-8532352
Group 66659: (Blocks 2184282112-2184314879) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x16f0, unused inodes 128
  Block bitmap at 2181038179 (bg #66560 + 99), Inode bitmap at 2181038435 (bg #66560 + 355)
  Inode table at 2181039384-2181039391 (bg #66560 + 1304)
  3751 free blocks, 128 free inodes, 0 directories, 128 unused inodes
  Free blocks: 2184288256-2184289279, 2184292355-2184292607, 2184293376-2184294000, 2184294867-2184294911, 2184297216-2184298495, 2184299491-2184299519, 2184302848-2184303094, 2184303870-2184304116, 2184304127
  Free inodes: 8532353-8532480
Group 66660: (Blocks 2184314880-2184347647) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x7c1a, unused inodes 128
  Block bitmap at 2181038180 (bg #66560 + 100), Inode bitmap at 2181038436 (bg #66560 + 356)
  Inode table at 2181039392-2181039399 (bg #66560 + 1312)
  9197 free blocks, 128 free inodes, 0 directories, 128 unused inodes
  Free blocks: 2184320256-2184321023, 2184322048-2184323071, 2184323585-2184324055, 2184324057, 2184324074-2184324095, 2184324354-2184325119, 2184325632-2184326143, 2184326655-2184355
.1385E-04 0.2720E-04 0.3428E-04 0.3470E-04 0.3125E-04 0.2717E-04 0.2375E-04 0.1968E-04 0.1258E-04 0.1537E-05 -0.1135E-04 -0.2146E-04 0.2531E-04 0.2365E-04 0.2503E-04 0.2827E-04 0.2984E-04 0.2598E-04 0.1521E-04 -0.2827E-06 -0.1534E-04 -0.2416E-04

No errors are logged on the OSS. |
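A minimal sketch (editor's illustration, not from the original report; file names are hypothetical and a known-good reference copy is assumed) of how such a corrupt region can be located and checked against the stripe layout:

    # Find the first and last differing byte offsets (1-based) against a reference copy.
    cmp -l corrupted.dat reference.dat | head -n 1
    cmp -l corrupted.dat reference.dat | tail -n 1
    # Show the file's stripe layout (here: stripe_count 4, stripe_size 1MB).
    lfs getstripe corrupted.dat
    # LAST_OFFSET is the offset reported by the second cmp above. If it is a
    # multiple of the 1MB stripe size, the 4KB corrupt region ends exactly on
    # an OST stripe boundary.
    echo $((LAST_OFFSET % 1048576))   # 0 means stripe-aligned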
| Comments |
| Comment by Mahmoud Hanafi [ 29/Jul/15 ] |
|
Please fix the typo in the title. |
| Comment by Peter Jones [ 29/Jul/15 ] |
|
Oleg, can you please advise? Thanks, Peter |
| Comment by Oleg Drokin [ 29/Jul/15 ] |
|
What do you mean by "process that run on the oss writing and reading data to the local oss filesystem" - is it writing directly to the ldiskfs? Or is dumpe2fs what you run on the OST device, while /root/ostdevice.meta is on a filesystem that has nothing to do with Lustre whatsoever? |
| Comment by Mahmoud Hanafi [ 29/Jul/15 ] |
|
"local oss filesystem" is the root drive of the OSS. We only read from the ldiskfs, and any data written goes to the local filesystem. So the corruption must be occurring in the page cache of the OSSes. I think the dumpe2fs output may be just a coincidence, because that data is written and read a lot. |
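One way to probe the page-cache theory (an editor's sketch, not something done in this ticket; run on the OSS during a quiet window, since dropping caches costs performance):

    sync                               # flush dirty pages to disk first
    echo 3 > /proc/sys/vm/drop_caches  # evict clean page/dentry/inode caches
    # If a client re-read of the affected file range is clean afterwards,
    # the bad 4KB lived only in OSS memory rather than on the OST's disk.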
| Comment by Oleg Drokin [ 29/Jul/15 ] |
|
I guess I am just confused - if the write target is the local filesystem, then there could not be any "ost stripe boundary" in there? |
| Comment by Mahmoud Hanafi [ 30/Jul/15 ] |
|
Sorry, maybe I am not explaining this well. This is a very strange issue. The user was running a job on a Lustre client, writing the file to Lustre. The corruption is in the user's file on Lustre. But the data that was inserted into the user's file is data that is read and written on the local filesystem of the OSS. So somehow, data being read and written on the OSS root filesystem corrupted part of the user's file on Lustre. The corruption was exactly 4KB, and it was at the end of an OST stripe. |
| Comment by Oleg Drokin [ 30/Jul/15 ] |
|
Hm. This is quite a mystery indeed. The OST where this occurred (the corrupted stripe) - did it happen to be low on space? There's |
| Comment by Mahmoud Hanafi [ 03/Aug/15 ] |
|
"low on space" - do you mean OST disk space? I don't think we were low on disk space, but there was a large spike in load, and most of the memory was consumed by page/buffer cache. |
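Both suspected preconditions can be checked after the fact; a sketch, with a hypothetical client mount point:

    lfs df -h /mnt/lustre                                 # run on a client: free space per OST
    grep -E '^(MemFree|Buffers|Cached)' /proc/meminfo     # run on the OSS: cache consumption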
| Comment by Oleg Drokin [ 04/Aug/15 ] |
|
Yes, I did mean disk space, since this is what was reported as one of the preconditions in |
| Comment by Jay Lan (Inactive) [ 04/Aug/15 ] |
|
I posted a request for a b2_5 port of |
| Comment by Mahmoud Hanafi [ 02/Sep/15 ] |
|
Could enabling quota enforcement increase the likelihood of hitting this bug? |
| Comment by Oleg Drokin [ 05/Sep/15 ] |
|
Alex, what do you think about this? I imagine quota might cause writes to fail at times too, even if otherwise there's plenty of space? |
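A sketch of why quota could mimic the low-on-space precondition (editor's illustration; the user name and mount point are hypothetical): quota enforcement can make a write fail with EDQUOT even while the OSTs report plenty of free space.

    lfs quota -u someuser /mnt/lustre   # usage vs. soft/hard limits for the user
    lfs df -h /mnt/lustre               # per-OST free space for comparison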
| Comment by Alex Zhuravlev [ 05/Sep/15 ] |
|
|
| Comment by Peter Jones [ 15/Oct/15 ] |
|
As per NASA, the fix worked. |