[LU-13535] Files truncated/corruption due to lfsck Created: 07/May/20  Updated: 09/Jul/21  Resolved: 27/May/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.4
Fix Version/s: Lustre 2.14.0, Lustre 2.12.5

Type: Bug Priority: Blocker
Reporter: Stephane Thiell Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 7.6


Attachments: File dk.fir-md1-s1.log.gz     File dk.fir-md1-s2.log.gz     File fir_lfsck_trunc_getstripe_all.log.gz    
Issue Links:
Related
is related to LU-14837 Layout corruption (lmm_oi) inside mdd... Open
is related to LU-13619 migrate does not update lmm_fid Open
is related to LU-12013 Crashes in sanity-lfsck test 13 Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

Following several server crashes (e.g. LU-13511) when running lfs migrate, we decided to run lfsck on Fir (Lustre 2.12.4). Today, users are reporting that some of their files have been truncated to 128MB (strangely, this matches the size of the first component of our new default PFL layout).

What led to this situation is likely the following scenario:

  • files were created originally using DoM + PFL (default setup)
  • we changed our default layout to PFL with the first OST component set to 128MB (stripe count 1) to avoid new DoM files
  • because of issues with DoM, we have restriped most of the existing DoM files using lfs migrate -c 1 (DoM/PFL to plain layout); this was done several months ago without problems
  • two days ago, we started to run lfsck namespace + layout
  • today, users are reporting truncated files, only the ones with plain layout > 128MB

I'm wondering if this could be related to LU-13426. We consider this issue Sev 2 at least as lfsck is likely corrupting files that have been migrated to plain layout.

More information below.

Example with file:

/fir/groups/astraigh/Magda/fachinettie19/fachinetti_CC_DLD/extracted/fachinetti_CCr1oCAr1.k25.ci10.madx5.r1.singleline.fa
[root@fir-rbh01 ~]# stat /fir/groups/astraigh/Magda/fachinettie19/fachinetti_CC_DLD/extracted/fachinetti_CCr1oCAr1.k25.ci10.madx5.r1.singleline.fa
  File: ‘/fir/groups/astraigh/Magda/fachinettie19/fachinetti_CC_DLD/extracted/fachinetti_CCr1oCAr1.k25.ci10.madx5.r1.singleline.fa’
  Size: 134217728 	Blocks: 262152     IO Block: 4194304 regular file
Device: e64e03a8h/3863872424d	Inode: 144119811155193635  Links: 1
Access: (0644/-rw-r--r--)  Uid: (65488/ mgebala)   Gid: (52067/astraigh)
Access: 2020-05-07 11:18:32.000000000 -0700
Modify: 2020-04-08 23:24:19.000000000 -0700
Change: 2020-04-29 11:26:53.000000000 -0700
 Birth: -
[root@fir-rbh01 ~]# lfs getstripe /fir/groups/astraigh/Magda/fachinettie19/fachinetti_CC_DLD/extracted/fachinetti_CCr1oCAr1.k25.ci10.madx5.r1.singleline.fa
/fir/groups/astraigh/Magda/fachinettie19/fachinetti_CC_DLD/extracted/fachinetti_CCr1oCAr1.k25.ci10.madx5.r1.singleline.fa
lmm_stripe_count:  1
lmm_stripe_size:   4194304
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 80
	obdidx		 objid		 objid		 group
	    80	      17475505	    0x10aa7b1	  0x1700000402

FID is: [0x200043465:0x6f23:0x0]

[root@fir-rbh01 ~]# lfs path2fid /fir/groups/astraigh/Magda/fachinettie19/fachinetti_CC_DLD/extracted/fachinetti_CCr1oCAr1.k25.ci10.madx5.r1.singleline.fa
[0x200043465:0x6f23:0x0]

Thanks to Robinhood, we know that the file size was ~132MB and not 128MB.

MariaDB [robinhood_fir]> select * from ENTRIES where id='0x200043465:0x6f23:0x0';
+------------------------+---------+----------+-----------+--------+---------------+-------------+------------+---------------+------+------+-------+------------+---------+-----------+--------------+--------------+----------------+--------------+---------------+----------------+----------------+------------------------+
| id                     | uid     | gid      | size      | blocks | creation_time | last_access | last_mod   | last_mdchange | type | mode | nlink | md_update  | invalid | fileclass | class_update | alert_status | checkdv_status | alert_lstchk | alert_lstalrt | checkdv_lstchk | checkdv_lstsuc | checkdv_out            |
+------------------------+---------+----------+-----------+--------+---------------+-------------+------------+---------------+------+------+-------+------------+---------+-----------+--------------+--------------+----------------+--------------+---------------+----------------+----------------+------------------------+
| 0x200043465:0x6f23:0x0 | mgebala | astraigh | 138323718 | 270176 |    1586607743 |  1586413465 | 1586413459 |    1588184813 | file |  420 |     1 | 1588185083 |       0 | +groups+  |   1588185083 |              | ok             |            0 |             0 |     1588184813 |     1588184813 | 60239190574:1586607743 |
+------------------------+---------+----------+-----------+--------+---------------+-------------+------------+---------------+------+------+-------+------------+---------+-----------+--------------+--------------+----------------+--------------+---------------+----------------+----------------+------------------------+
1 row in set (0.00 sec)
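
As a quick cross-check of the numbers above, here is a minimal sketch in Python. The variable names are illustrative only and do not come from any Lustre or Robinhood tool; the values are copied from the stat output and the Robinhood ENTRIES row.

```python
MIB = 1024 * 1024

stat_size = 134217728        # "Size:" reported by stat on the truncated file
robinhood_size = 138323718   # "size" column of the Robinhood ENTRIES row

# The current size is exactly 128 MiB, the end of the first PFL component.
is_exactly_128_mib = stat_size == 128 * MIB

# The size recorded by Robinhood before the incident, in MiB.
robinhood_mib = robinhood_size / MIB

# Bytes that are no longer visible through the file.
lost_bytes = robinhood_size - stat_size

print(is_exactly_128_mib)    # True
print(round(robinhood_mib, 1))
print(lost_bytes)            # 4105990
```

So roughly 4 MB past the 128 MiB boundary is missing for this particular file.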

Also the original data_version was 60239190574 but now it's:

[root@fir-rbh01 ~]# lfs data_version /fir/groups/astraigh/Magda/fachinettie19/fachinetti_CC_DLD/extracted/fachinetti_CCr1oCAr1.k25.ci10.madx5.r1.singleline.fa
30120416758

This file is on MDT0 and lfsck logs show that something was fixed for this FID 0x200043465:0x6f23:0x0:

[root@fir-rbh01 ~]# grep 0x200043465:0x6f23:0x0 lfsck.fir-md1-s1.log 
00100000:10000000:24.0:1588797550.743684:0:126810:0:(lfsck_layout.c:4033:lfsck_layout_repair_owner()) fir-MDT0000-osd: layout LFSCK assistant repaired inconsistent file owner for: parent [0x200043465:0x6f23:0x0], child [0x1340000401:0x10bc4c3:0x0], OST-index 65, stripe-index 1, old owner 0/0, new owner 65488/52067: rc = 1
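
To pull the relevant fields out of log lines like the one above, a small parser can help. This is only a sketch: the regular expression is written against this one sample line, and the function name parse_repair_owner is hypothetical, not part of Lustre.

```python
import re

# Sample line copied from the lfsck debug log above.
LINE = (
    "00100000:10000000:24.0:1588797550.743684:0:126810:0:"
    "(lfsck_layout.c:4033:lfsck_layout_repair_owner()) fir-MDT0000-osd: "
    "layout LFSCK assistant repaired inconsistent file owner for: "
    "parent [0x200043465:0x6f23:0x0], child [0x1340000401:0x10bc4c3:0x0], "
    "OST-index 65, stripe-index 1, old owner 0/0, new owner 65488/52067: rc = 1"
)

# Assumed format, derived only from the sample line.
PATTERN = re.compile(
    r"parent \[(?P<parent>[^\]]+)\], child \[(?P<child>[^\]]+)\], "
    r"OST-index (?P<ost_index>\d+), stripe-index (?P<stripe_index>\d+), "
    r"old owner (?P<old_owner>\d+/\d+), new owner (?P<new_owner>\d+/\d+)"
)


def parse_repair_owner(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["ost_index"] = int(record["ost_index"])
    record["stripe_index"] = int(record["stripe_index"])
    return record


record = parse_repair_owner(LINE)
print(record["parent"], record["ost_index"], record["stripe_index"])
```

For this file, the repaired child object is stripe-index 1 on OST 65, while lfs getstripe now shows a single object on obdidx 80. The new owner 65488/52067 matches the Uid/Gid shown by stat.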

Robinhood also shows that the file was previously striped on two OSTs, but Robinhood doesn't support DoM or migration, so that is from the original striping info:

MariaDB [robinhood_fir]> select * from STRIPE_ITEMS where id='0x200043465:0x6f23:0x0';
+------------------------+--------------+--------+----------------------+
| id                     | stripe_index | ostidx | details              |
+------------------------+--------------+--------+----------------------+
| 0x200043465:0x6f23:0x0 |            0 |     64 | (binary data)        |
| 0x200043465:0x6f23:0x0 |            1 |     65 | (binary data)        |
+------------------------+--------------+--------+----------------------+
2 rows in set (0.00 sec)

LFSCK layout has fixed many files like that:

[root@fir-hn01 sthiell.root]# clush -w@mds -R exec -bL 'tgt=$(printf fir-MDT%%04x %n); ssh %h lctl get_param -n mdd.$tgt.lfsck_layout' | grep status
fir-md1-s[1-4]: status: completed
[root@fir-hn01 sthiell.root]# clush -w@mds -R exec -bL 'tgt=$(printf fir-MDT%%04x %n); ssh %h lctl get_param -n mdd.$tgt.lfsck_layout' | grep repaired
fir-md1-s[1,4]: repaired_dangling: 0
fir-md1-s[2-3]: repaired_dangling: 1
fir-md1-s[1-4]: repaired_unmatched_pair: 0
fir-md1-s[1-4]: repaired_multiple_referenced: 0
fir-md1-s[1-4]: repaired_orphan: 0
fir-md1-s1: repaired_inconsistent_owner: 10494922
fir-md1-s2: repaired_inconsistent_owner: 26336224
fir-md1-s3: repaired_inconsistent_owner: 36300505
fir-md1-s4: repaired_inconsistent_owner: 15102845
fir-md1-s1: repaired_others: 429814
fir-md1-s2: repaired_others: 46955127
fir-md1-s3: repaired_others: 0
fir-md1-s4: repaired_others: 1716650
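
To get filesystem-wide totals from the folded clush output above, the per-node counters can be summed. The sketch below is an illustration only: the helper names are hypothetical, and the node-set expansion handles just the simple bracket forms seen here (such as [1,4] and [2-3]), not the full ClusterShell syntax.

```python
import re
from collections import defaultdict

# Output copied from the "grep repaired" command above.
OUTPUT = """\
fir-md1-s[1,4]: repaired_dangling: 0
fir-md1-s[2-3]: repaired_dangling: 1
fir-md1-s[1-4]: repaired_unmatched_pair: 0
fir-md1-s[1-4]: repaired_multiple_referenced: 0
fir-md1-s[1-4]: repaired_orphan: 0
fir-md1-s1: repaired_inconsistent_owner: 10494922
fir-md1-s2: repaired_inconsistent_owner: 26336224
fir-md1-s3: repaired_inconsistent_owner: 36300505
fir-md1-s4: repaired_inconsistent_owner: 15102845
fir-md1-s1: repaired_others: 429814
fir-md1-s2: repaired_others: 46955127
fir-md1-s3: repaired_others: 0
fir-md1-s4: repaired_others: 1716650
"""


def expand_nodeset(nodeset):
    """Expand a simple folded name such as 'fir-md1-s[1,4]' or 'fir-md1-s[2-3]'."""
    match = re.fullmatch(r"(.*)\[([^\]]+)\](.*)", nodeset)
    if match is None:
        return [nodeset]
    prefix, body, suffix = match.groups()
    nodes = []
    for part in body.split(","):
        if "-" in part:
            low, high = part.split("-")
            nodes.extend(f"{prefix}{i}{suffix}" for i in range(int(low), int(high) + 1))
        else:
            nodes.append(f"{prefix}{part}{suffix}")
    return nodes


def total_counters(text):
    """Sum each counter over all nodes; a folded line counts once per node."""
    totals = defaultdict(int)
    for line in text.splitlines():
        if not line.strip():
            continue
        nodeset, key, value = (field.strip() for field in line.split(":"))
        totals[key] += int(value) * len(expand_nodeset(nodeset))
    return dict(totals)


totals = total_counters(OUTPUT)
print(totals["repaired_inconsistent_owner"])  # 88234496
print(totals["repaired_others"])              # 49101591
```

Across the four MDTs this adds up to about 88 million "inconsistent owner" repairs and about 49 million "others" repairs.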

Can you confirm whether this could be due to LFSCK? I'm not sure why "inconsistent file owner" would corrupt files, but this is the only pointer that we have now. If that's the case, do you think there is a way to repair what LFSCK has "fixed"?



 Comments   
Comment by Stephane Thiell [ 07/May/20 ]

Attached debug logs with lfsck for fir-MDT0000 as dk.fir-md1-s1.log.gz and fir-MDT0001 as dk.fir-md1-s2.log.gz. The example file in this case is located on fir-MDT0000.

Comment by Stephane Thiell [ 08/May/20 ]

We're not sure anymore if all of these files were originally created with DoM actually. More and more users are reporting truncated files. It seems like users are even reporting truncated files that have been created recently with the non-DoM, PFL config, but only the first stripe (128MiB) remains after LFSCK was run. What seems to be a common cause could be that the parent directories of these files have recently been migrated to another MDT (MDT1). Could a lfs migrate -m followed by a later lfsck_layout be able to truncate PFL files like that to plain layout (with the first component only)?

Our default PFL config:

[root@fir-rbh01 ~]# lfs getstripe -d /fir
  lcm_layout_gen:    0
  lcm_mirror_count:  1
  lcm_entry_count:   3
    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 0
    lcme_extent.e_end:   134217728
      stripe_count:  1       stripe_size:   4194304       pattern:       raid0       stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 134217728
    lcme_extent.e_end:   137438953472
      stripe_count:  2       stripe_size:   4194304       pattern:       raid0       stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 137438953472
    lcme_extent.e_end:   EOF
      stripe_count:  4       stripe_size:   4194304       pattern:       raid0       stripe_offset: -1
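
To illustrate why only files larger than 128MiB appear truncated, here is a simplified model of the three-component layout above. It is a sketch under stated assumptions: the extents are copied from the lfs getstripe -d output, the function names are hypothetical, and it assumes that a file which keeps only its first component exposes at most that component's extent (which is what users are reporting). It ignores sparse files and component instantiation details.

```python
EOF = None  # stands for the open-ended "EOF" extent

# (e_start, e_end, stripe_count) from the default PFL config shown above
COMPONENTS = [
    (0, 134217728, 1),
    (134217728, 137438953472, 2),
    (137438953472, EOF, 4),
]


def components_needed(size):
    """Indices of the components that hold data for a file of `size` bytes."""
    return [i for i, (start, _end, _count) in enumerate(COMPONENTS) if size > start]


def readable_bytes(size, kept_components=1):
    """Bytes still addressable if only the first `kept_components` remain."""
    end = COMPONENTS[kept_components - 1][1]
    return size if end is EOF else min(size, end)


# The example file from the description (size recorded by Robinhood).
print(components_needed(138323718))   # [0, 1]
print(readable_bytes(138323718))      # 134217728
# A small file fits entirely in the first component and is unaffected.
print(readable_bytes(1000))           # 1000
```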
Comment by Peter Jones [ 08/May/20 ]

Mike

Could you please advise?

Thanks

Peter

Comment by Mikhail Pershin [ 11/May/20 ]

Yes, I am working on that.

Comment by Stephane Thiell [ 11/May/20 ]

Thanks, Mike. This seems pretty bad.

It looks like all files in some MDT-migrated directories have lost their PFL layout after running LFSCK. They just seem to have a plain layout now. Files < 128MiB (size of our first PFL component) are not truncated and are still usable, although with a plain layout; the larger files are truncated.

Example of previously PFL'ed small file, that now has a plain layout:

[root@fir-rbh01 job034]# lfs getstripe /fir/users/alpays/hongli-backup/GCGR/relion_gcgr_vpp_20180212_tem4/Class2D/job034/run_it025_optimiser.star
/fir/users/alpays/hongli-backup/GCGR/relion_gcgr_vpp_20180212_tem4/Class2D/job034/run_it025_optimiser.star
lmm_stripe_count:  1
lmm_stripe_size:   4194304
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 57
	obdidx		 objid		 objid		 group
	    57	      13579585	     0xcf3541	  0x1140000400

Its directory still has the PFL config:

[root@fir-rbh01 job034]# lfs getstripe -d /fir/users/alpays/hongli-backup/GCGR/relion_gcgr_vpp_20180212_tem4/Class2D/job034
  lcm_layout_gen:    0
  lcm_mirror_count:  1
  lcm_entry_count:   3
    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 0
    lcme_extent.e_end:   134217728
      stripe_count:  1       stripe_size:   4194304       pattern:       raid0       stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 134217728
    lcme_extent.e_end:   137438953472
      stripe_count:  2       stripe_size:   4194304       pattern:       raid0       stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 137438953472
    lcme_extent.e_end:   EOF
      stripe_count:  4       stripe_size:   4194304       pattern:       raid0       stripe_offset: -1

How is that possible?

Comment by Peter Jones [ 12/May/20 ]

Details from email

"I’m contacting you regarding https://jira.whamcloud.com/browse/LU-13535 "Files truncated/corruption due to lfsck"

We wanted to send you this email to provide updated information about our specific situation.

A single LFSCK run has changed the layout of millions of files on Fir to plain layout of 1 OST, with 128 MiB maximum. Larger files have been truncated, so today we know that lfsck has corrupted the content of about 215k files. At this stage, it looks like it has only happened on directories which have previously been migrated to another MDT (lfs migrate -m 1), directories which worked fine until we ran LFSCK namespace + layout.

We have indeed scanned the filesystem (Fir on Sherlock) for files that are 128MiB in size with a plain layout of 1 OST, which is our way to detect when a file has lost its default PFL layout and very, very likely been truncated by LFSCK last week. During the weekend, we ran lfs find -size 128M -c 1 on the whole filesystem (665M inodes) and it has completed: we have found 214,695 files total that have been truncated to 128MiB after this LFSCK run. All files I checked manually have indeed been truncated/corrupted. Also, users are reporting that their quota is still showing the previous volume used, so we think there could be a chance that the objects are still somehow on the OSTs. Some users have lost tens of TB of scratch research data due to very large files being truncated.

Thanks for assigning Mike to this ticket. Any insights would be appreciated as soon as possible so we can adjust the communication to our users. My guess is that the layouts are lost, but perhaps you will find a way to reattach the component to these files?"

Comment by Mikhail Pershin [ 12/May/20 ]

Stephane, could you please get the extended output of the striping info from the affected files via lfs getstripe -R -v? It will show all fields in the layout and can give some clues. I am trying to reproduce that behavior locally and am also inspecting the lfsck code in 2.12.4 right now.

Also, please provide the exact lfsck command used.

Comment by Mikhail Pershin [ 12/May/20 ]

I managed to reproduce the bug and have found why it happens; a fix for lfsck is on the way. I am now trying to figure out what can be done about the lost stripes.

Comment by Stephane Thiell [ 12/May/20 ]

Thanks Mike, this is great news that you were able to reproduce it yourself! Let us know if you find a way to reattach the lost stripes; we have moved/quarantined the files into directories using the same project IDs, so the FIDs should be the same.

I'm attaching the output of lfs getstripe -R -v on all affected files (the truncated ones only) as fir_lfsck_trunc_getstripe_all.log.gz

As for lfsck, we started it with lctl lfsck_start -M fir-MDTxxx -t namespace first on all 4 MDTs, and then once done, I did lctl lfsck_start -M fir-MDTxxx -t layout on all 4 MDTs.

Comment by Gerrit Updater [ 12/May/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38584
Subject: LU-13535 lfsck: fix possible PFL layout corruption
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: db1aa8f1880162e186467a9a52da21fb319cb1b2

Comment by Gerrit Updater [ 12/May/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38585
Subject: LU-13535 lfsck: fix possible PFL layout corruption
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: a29eecf94a2fb0642256e600074173428ccf5304

Comment by Stephane Thiell [ 13/May/20 ]

Hi Mike – This is really awesome that you seem to have found the source of the problem! Congrats and many thanks for that! For our truncated files, is there any chance that the old composite layout is still around somewhere? We have been working today with our users and hopefully most of the truncated files are scratch files that can be regenerated, but still, unfortunately, a few of them were not transferred from this filesystem to longer-term storage and we would like to know if they could somehow still be "fixed". Thx.

Comment by Mikhail Pershin [ 13/May/20 ]

Stephane, was the layout of these files the FS default, so that they all have the same one, or are there many different cases?

Comment by Stephane Thiell [ 13/May/20 ]

Prior to the lfsck layout incident, these files were likely all using the PFL layout defined by their parent directories. We have set up the following PFL layout on all directories, and then it's inherited for new directories (as I don't think we can set a PFL layout as FS-default):

lfs setstripe -E 128M -c 1 -S 4M -E 128G -c 2 -S 4M -E -1 -c 4 -S 4M /fir

The FS-default is still plain layout of 1 stripe, as we haven't modified it. With tunefs.lustre on MDT0 I can see:

lov.stripecount=1 lov.stripesize=1048576 
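
As a sanity check that the lfs setstripe command above corresponds to the component extents shown earlier by lfs getstripe -d, the -E arguments can be converted to bytes. This is only a sketch: the helper names are hypothetical, and it assumes the size suffixes are binary multiples (1024-based), which is consistent with the 134217728 and 137438953472 values in the getstripe output.

```python
import shlex

COMMAND = "lfs setstripe -E 128M -c 1 -S 4M -E 128G -c 2 -S 4M -E -1 -c 4 -S 4M /fir"

# Assumed binary multipliers for the size suffixes.
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def to_bytes(text):
    """Convert '128M' or '4M' to bytes; '-1' means end of file (None)."""
    if text == "-1":
        return None
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


def option_values(command, option):
    """Return the converted value following each occurrence of `option`."""
    args = shlex.split(command)
    return [to_bytes(args[i + 1]) for i, arg in enumerate(args) if arg == option]


component_ends = option_values(COMMAND, "-E")
stripe_sizes = option_values(COMMAND, "-S")

print(component_ends)  # [134217728, 137438953472, None]
print(stripe_sizes)    # [4194304, 4194304, 4194304]
```

The first two values match lcme_extent.e_end of the first two components, and None corresponds to the EOF extent of the third.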
Comment by Stephane Thiell [ 19/May/20 ]

We have been able to recover the composite layout of all our truncated files thanks to Mike! Feel free to close this ticket once the lfsck patch has landed (in 2.12 also please!!).

Comment by Gerrit Updater [ 27/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38584/
Subject: LU-13535 lfsck: fix possible PFL layout corruption
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: be009cb4a73b3bef7302083bec7d1d6289d515b7

Comment by Peter Jones [ 27/May/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 27/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38585/
Subject: LU-13535 lfsck: fix possible PFL layout corruption
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 775ce1c26c843d9ef9e6919f85e5284828762095
