[LU-8071] lvcreate --snapshot of MDT hangs in ldiskfs_journal_start_sb Created: 26/Apr/16  Updated: 14/Jun/18  Resolved: 02/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Major
Reporter: Nathan Dauchy (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS-6.7
lustre-2.5.3
lvm2-2.02.118-3.el6_7.4
Also note that the MDT uses an external journal device.


Attachments: File bt.all     File dmesg.out     File lostfound_nonzero.lst     File nagtest.toobig.stripes    
Issue Links:
Related
is related to LU-7616 creation of LVM snapshot on ldiskfs b... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Similar to LU-7616 "creation of LVM snapshot on ldiskfs based MDT hangs until MDT activity/use is halted", but opening a new case for tracking.

The goal is to use LVM snapshots and tar to make file-level MDT backups. The procedure worked fine two or three times, then a recent attempt triggered the following problem.

The MDS became extremely sluggish and all MDT threads went into D state while running the following command:

lvcreate -l95%FREE -s -p r -n mdt_snap /dev/nbp9-vg/mdt9

(the command never returned, and any further lv* commands hung as well)

In the logs...

Apr 25 17:09:35 nbp9-mds kernel: WARNING: at /usr/src/redhat/BUILD/lustre-2.5.3/ldiskfs/super.c:280 ldiskfs_journal_start_sb+0xce/0xe0 [ldiskfs]() (Not tainted)
Apr 25 17:14:45 nbp9-mds ]
Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa0e5c4e5>] ? mds_readpage_handle+0x15/0x20 [mdt]
Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08a90c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa05d18d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08a1a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08ab89d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc]
Apr 25 17:14:45 nbp9-mds kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08aada0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
Apr 25 17:14:45 nbp9-mds kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Apr 25 17:14:45 nbp9-mds kernel: ---[ end trace c9f3339c0e103edf ]---

Apr 25 17:14:57 nbp9-mds kernel: WARNING: at /usr/src/redhat/BUILD/lustre-2.5.3/ldiskfs/super.c:280 ldiskfs_journal_start_sb+0xce/0xe0 [ldiskfs]() (Tainted: G        W  ---------------   )
Apr 25 17:14:57 nbp9-mds kernel: Hardware name: AltixXE270
Apr 25 17:14:57 nbp9-mds kernel: Modules linked in: dm_snapshot dm_bufio osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) jbd2 lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic crc32c_intel libcfs(U) sunrpc bonding ib_ucm(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) configfs ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) dm_round_robin scsi_dh_rdac dm_multipath microcode iTCO_wdt iTCO_vendor_support i2c_i801 lpc_ich mfd_core shpchp sg igb dca ptp pps_core tcp_bic ext3 jbd sd_mod crc_t10dif sr_mod cdrom ahci pata_acpi ata_generic pata_jmicron mptfc scsi_transport_fc scsi_tgt mptsas mptscsih mptbase scsi_transport_sas mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) ipv6 mlx4_core(U) mlx_compat(U) memtrack(U) usb_storage radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod gru [last unloaded: scsi_wait_scan]
Apr 25 17:14:57 nbp9-mds kernel: Pid: 85906, comm: mdt_rdpg00_042 Tainted: G        W  ---------------    2.6.32-504.30.3.el6.20151008.x86_64.lustre253 #1
Apr 25 17:14:57 nbp9-mds kernel: Call Trace:
Apr 25 17:14:57 nbp9-mds kernel: [<ffffffff81074127>] ? warn_slowpath_common+0x87/0xc0
Apr 25 17:14:57 nbp9-mds kernel: [<ffffffff8107417a>] ? warn_slowpath_null+0x1a/0x20
Apr 25 17:14:57 nbp9-mds kernel: [<ffffffffa0a1c33e>] ? ldiskfs_journal_start_sb+0xce/0xe0 [ldiskfs]
Apr 25 17:14:57 nbp9-mds kernel: [<ffffffffa0d6069f>] ? osd_trans_start+0x1df/0x660 [osd_ldiskfs]
Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0ef3619>] ? lod_trans_start+0x1b9/0x250 [lod]
Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0f7af07>] ? mdd_trans_start+0x17/0x20 [mdd]
Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0f61ece>] ? mdd_close+0x6be/0xb80 [mdd]
Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0e48be9>] ? mdt_mfd_close+0x4a9/0x1bc0 [mdt]
Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0899525>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa08c07f6>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa089a53e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa06ebf05>] ? class_handle2object+0x95/0x190 [obdclass]
Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0e4b6a2>] ? mdt_close+0x642/0xa80 [mdt]
Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0e1fada>] ? mdt_handle_common+0x52a/0x1470 [mdt]
Apr 25 17:15:10 nbp9-mds multipathd: nbp9_MGS_MDS: sdc - rdac checker reports path is down
Apr 25 17:15:10 nbp9-mds multipathd: checker failed path 8:32 in map nbp9_MGS_MDS
Apr 25 17:15:10 nbp9-mds multipathd: nbp9_MGS_MDS: remaining active paths: 1
Apr 25 17:15:10 nbp9-mds multipathd: sdd: remove path (uevent)
Apr 25 17:15:10 nbp9-mds multipathd: nbp9_MGS_MDS: failed in domap for removal of path sdd
Apr 25 17:15:10 nbp9-mds multipathd: uevent trigger error
Apr 25 17:15:10 nbp9-mds multipathd: sdc: remove path (uevent)
Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa0e5c4e5>] ? mds_readpage_handle+0x15/0x20 [mdt]
Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08a90c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa05d18d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08a1a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08ab89d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc]
Apr 25 17:15:10 nbp9-mds kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08aada0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
Apr 25 17:15:10 nbp9-mds kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20

Apr 25 17:15:16 nbp9-mds multipathd: nbp9_MGS_MDS: map in use
Apr 25 17:15:16 nbp9-mds multipathd: nbp9_MGS_MDS: can't flush

The server had to be rebooted and e2fsck run to get it back into production.



 Comments   
Comment by Nathan Dauchy (Inactive) [ 26/Apr/16 ]

Not sure how closely tied this is, but... after the reboot we ran e2fsck on the MDT. The fsck was able to replay the journal, but during the scan it found lots of unattached inodes and moved them to /lost+found. Interestingly, most of those inodes were almost sequential:

Unattached inode 584092565
Connect to /lost+found<y>? yes
Unattached inode 584092567
Connect to /lost+found<y>? yes
Unattached inode 584092637
Connect to /lost+found<y>? yes
Unattached inode 584092640
Connect to /lost+found<y>? yes
Unattached inode 584092643
Connect to /lost+found<y>? yes
Unattached inode 584092645
Connect to /lost+found<y>? yes
Unattached inode 584092648
Connect to /lost+found<y>? yes
Unattached inode 584092715
Connect to /lost+found<y>? yes
Unattached inode 584092717
Connect to /lost+found<y>? yes
Unattached inode 584092720
Connect to /lost+found<y>? yes
Unattached inode 584092723
Connect to /lost+found<y>? yes
Unattached inode 584092784

We mounted the MDT as ldiskfs and saw that all but about 20 files were owned by root and had surprisingly old date stamps:

nbp9-mds ~ # ls -l /mnt/lustre/nbp9-mdt/lost+found/
total 6816
-rw------- 1 root     root   5792 Jul 26  1970 #17831719
-rw------- 1 root     root   5792 Jul 26  1970 #17831723
-rw------- 1 root     root   5792 Feb  4  1988 #571005637
-rw------- 1 root     root   5792 Feb  4  1988 #571016437
-rw------- 1 root     root   5792 Feb  4  1988 #571029133
...
-rw------- 1 root     root   4832 Jul  5  1988 #584109225
-rw------- 1 root     root   4832 Jul  5  1988 #584109227
-rw------- 1 root     root   4832 Jul  5  1988 #584109231
-rw------- 1 root     root   4832 Jul  5  1988 #584109233
-rw------- 1 root     root   4832 Jul  5  1988 #584109235

nbp9-mds ~ # ls -l /mnt/lustre/nbp9-mdt/lost+found | grep -v " root "
total 47328
-rwxr----- 1 dmenemen g26209    0 Feb 10  2014 #570436464
-rw-r--r-- 1 dmenemen g26209    0 Feb 10  2014 #570436465
-rwxr----- 1 dmenemen g26209    0 Feb 10  2014 #570436466
-rw-r--r-- 1 dmenemen g26209    0 Feb 10  2014 #570436467
-rw-r--r-- 1 dmenemen g26209    0 Feb 10  2014 #570436468
-rw-r--r-- 1 dmenemen g26209    0 Feb 10  2014 #570436469
...
(another dozen or so more similar files)

Ran "ll_recover_lost_found_objs" on the MDT mounted as ldiskfs, and it reported results like:

Object /mnt/lustre/nbp9-mdt/O/22d0c13d/d23/cb410817 restored.
Object /mnt/lustre/nbp9-mdt/O/22d0c13f/d27/cb41083b restored.
Object /mnt/lustre/nbp9-mdt/O/22d0c145/d1/cb504661 restored.
Object /mnt/lustre/nbp9-mdt/O/22d0c147/d27/cb50467b restored.
Object /mnt/lustre/nbp9-mdt/O/22d0c153/d16/cb504810 restored.
Object /mnt/lustre/nbp9-mdt/O/22d0c155/d20/cb504814 restored.

(It got all the non-root owned files)
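
For reference, the tool is pointed at the backend lost+found directory; a reconstructed invocation is shown below (the -d option and the mount point are assumptions, not a record of the exact command that was run):

# Reconstructed example only; MDT mounted as ldiskfs as above:
ll_recover_lost_found_objs -d /mnt/lustre/nbp9-mdt/lost+found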

I believe the rest of the files in lost+found were probably either useless or never pointed to user data... they have no xattrs:

# getfattr -d -m ".*" -e hex /mnt/lustre/nbp9-mdt/lost+found/*

(returned nothing!)

So, we ran another e2fsck (which came back clean!) and put the filesystem back into production.

Comment by Nathan Dauchy (Inactive) [ 26/Apr/16 ]

Uploading backtrace and dmesg from the MDS.

Comment by Mahmoud Hanafi [ 28/Apr/16 ]

The recovery of the MDT has resulted in 25776 files losing their stripe info.

<code>
pfe24 /nobackupp9/gmao_SIteam/ObservingSystem/reanalysis/aqua_disc/hdf # cd /nobackupp9//nfeather/marti/Builds/Scaling/Dynamo/Np256_san_v/
pfe24 /nobackupp9/nfeather/marti/Builds/Scaling/Dynamo/Np256_san_v # lfs getstripe temperature_energy.dat
temperature_energy.dat has no stripe info
pfe24 /nobackupp9/nfeather/marti/Builds/Scaling/Dynamo/Np256_san_v # lfs path2fid temperature_energy.dat
[0x20010fbee:0x66:0x0]
pfe24 /nobackupp9/nfeather/marti/Builds/Scaling/Dynamo/Np256_san_v # stat temperature_energy.dat
File: `temperature_energy.dat'
Size: 0 Blocks: 0 IO Block: 4194304 regular empty file
Device: 162deh/90846d Inode: 144133862291669094 Links: 1
Access: (0644/-rw-r--r--) Uid: (10670/nfeather) Gid: (41430/ s1430)
Access: 2016-04-19 11:04:25.000000000 -0700
Modify: 2016-04-19 11:04:25.000000000 -0700
Change: 2016-04-19 11:04:25.000000000 -0700
Birth: -
</code>

We need help to recover the stripe info!

Comment by Peter Jones [ 28/Apr/16 ]

Fan Yong is looking into this.

Comment by nasf (Inactive) [ 28/Apr/16 ]

Mahmoud,

Mahmoud,

Usually, for a file with a small LOV EA (four stripes or so), the LOV EA is stored inside the inode space itself. So if the inode is there, the LOV EA should be there as well, unless the inode is corrupted. But since e2fsck completed cleanly, that does not seem to be the case, so the lost LOV EAs were probably stored in separate blocks. Do you know the stripe count of the files that lost their stripe info? Have you ever enabled large EAs to support thousands of stripes?

On the other hand, the files in /lost+found look abnormal. The time "Jul 26 1970" is right at the start of the Unix epoch, which is impossible for normal Lustre objects, and "Feb 4 1988" is not a reasonable Lustre timestamp either. I therefore think that e2fsck scanned some uninitialized area; the inode bitmap may have been corrupted.

nbp9-mds ~ # ls -l /mnt/lustre/nbp9-mdt/lost+found/
total 6816
-rw------- 1 root root 5792 Jul 26  1970 #17831719
-rw------- 1 root root 5792 Jul 26  1970 #17831723
-rw------- 1 root root 5792 Feb  4  1988 #571005637
-rw------- 1 root root 5792 Feb  4  1988 #571016437
-rw------- 1 root root 5792 Feb  4  1988 #571029133
...
-rw------- 1 root root 4832 Jul  5  1988 #584109225
-rw------- 1 root root 4832 Jul  5  1988 #584109227
-rw------- 1 root root 4832 Jul  5  1988 #584109231
-rw------- 1 root root 4832 Jul  5  1988 #584109233
-rw------- 1 root root 4832 Jul  5  1988 #584109235

As for recovering objects from /lost+found via "ll_recover_lost_found_objs", that tool is intended for OST-objects. I am surprised that it recovered something on your MDT! That makes me suspect further corruption in your system.

Back to your direct question of how to recover the lost stripe info: our new layout LFSCK can recover a lost (MDT-object's) LOV EA from the orphan OST-object(s); that functionality has been available since Lustre 2.6. Your system appears to be Lustre 2.5 based, so unless you upgrade it, you will have to recover manually, for example as follows:

1) Get the "temperature_energy.dat" FID via "lfs path2fid temperature_energy.dat"
2) Get the PFID EA from the OST-objects. Since we do not know which OST-object(s) belong to "temperature_energy.dat", we have to scan the OST-objects one by one.
3) If some OST-object's PFID EA matches the "temperature_energy.dat" FID, then we have found one stripe of temperature_energy.dat.

The above process is very inefficient and error-prone (a rough sketch is given below). So unless you are confident in such manual operations and have the time, I would suggest upgrading your system and running the layout LFSCK for recovery.
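
A rough sketch of steps 2) and 3), assuming each OST (or a snapshot of it) can be mounted read-only as ldiskfs at /mnt/ost and that ll_decode_filter_fid is used to print the parent FID stored in each object's PFID EA (the mount point, the device placeholder, and the exact output format are all assumptions):

# Repeat per OST device:
mount -t ldiskfs -o ro /dev/<ost_dev> /mnt/ost
# Print the parent FID recorded in each object's PFID EA and keep the
# entries matching the FID from "lfs path2fid" in step 1) (the FID
# below is the example from the earlier comment):
find /mnt/ost/O -type f -print0 | xargs -0 ll_decode_filter_fid 2>/dev/null \
    | grep '0x20010fbee:0x66'
umount /mnt/ost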

Comment by Mahmoud Hanafi [ 28/Apr/16 ]

I have a modified e2scan that dumps the OST objects and looks up the PFID on each OST. We use it to quickly identify users who are filling up an OST. So if I run that and get a list of all OST objects, what would be the next step to put things back together?

We could also upgrade to 2.7.1. How fast does the lfsck run?

Comment by nasf (Inactive) [ 28/Apr/16 ]

If you can get the OST-objects list, then I can make a patch to re-generate the MDT-objects' LOV EA from that list.

As for "how fast the LFSCK run", depends on your system environment. I have tested the layout LFSCK on OpenSFS cluster, for single MDT case, the routine scan (without repairing) speed can be up to 100K objects/sec, the repair speed is about 30K objects/sec.

Comment by Mahmoud Hanafi [ 28/Apr/16 ]

Since we would need to install a patch, we may want to just upgrade. I'll update the case tomorrow as to what we decide to do.

Comment by nasf (Inactive) [ 28/Apr/16 ]

Sorry, I was not clear. I meant that if you can get the OST-objects list, then I will make a tool to re-generate the stripe info from it. You do not need to patch your current system.

Comment by Nathan Dauchy (Inactive) [ 28/Apr/16 ]

Prior to the hang, when I had (what I thought was) a successful LVM snapshot, I happened to be looking for files with dates out of the expected range:

nohup find /mnt/snap/ROOT/ -mtime +10000 -o -mtime -0 > /tmp/nbp9_big_time_t.out 2>&1 < /dev/null &

looking at those files from a client, the results are like:

pfe20 /u/ndauchy/tmp # xargs -a nbp9_big_time_t.txt -d '\n' ls -ldtr
-rwxr-xr--  1 sviken   a1426             0 Dec 31  1969 /nobackupp9/sviken/rev3mod3/grid3b/sally_grid_info/?c}x;?<+?"?:8??.??K?????????f?$Z??????????O?????c^+?c?????Dol??:H?N5?#??;G??Z???Vq?s??VX?i??????O?L?
-rwxr-xr--  1 sviken   a1426             0 Dec 31  1969 /nobackupp9/sviken/rev3mod3/grid3b/sally_grid_info/??#???a??O????\?a?????QU?4???W?~??I?k???????w[?~???????"??K?}???????e??a??{???g?|?|?/??k????1??.?lXD
-rw-r--r--  1 yfan2    s0969       7663656 Feb 21  2018 /nobackupp9/yfan2/mfe_f90_mpi3d/polysph_3dfulltherm/run.15.3/eng_1
drwx------ 18 yfan2    s0969         94208 Feb 21  2018 /nobackupp9/yfan2/mfe_f90_mpi3d/polysph_3dfulltherm
-rwxr-xr--  1 sviken   a1540       1703069 Jun 20  2018 /nobackupp9/sviken/rev1wing/40deg/old_wrong_BL/m096a4b0p3TA/flow.o3964429
-rwxr-xr--  1 sviken   a1540   14925662688 Jun 20  2018 /nobackupp9/sviken/rev1wing/40deg/old_wrong_BL/m096a4b0p3TA/gaw-215-flap40.flow
-rwxr-xr--  1 sviken   a1540      14055821 Jun 20  2018 /nobackupp9/sviken/rev1wing/40deg/old_wrong_BL/m096a4b0p3TA/gaw-215-flap40.grid_info
-rwxr-xr--  1 sviken   a1540      10454535 Jun 20  2018 /nobackupp9/sviken/rev1wing/40deg/old_wrong_BL/m096a4b0p3TA/source_grid_00000.p3d

...

-rw-r--r--  1 tsandstr vistech         234 Oct  5  2280 /nobackupp9/chenze/data/rogers/dta_cfd/met120_dta_ET128/m3.913a5.129b0.080/SRBcut_1.map
-rw-r--r--  1 tsandstr vistech        1424 Oct  5  2280 /nobackupp9/chenze/data/rogers/dta_cfd/met120_dta_ET128/m3.913a5.129b0.080/SRBcut.fvx
-rw-r--r--  1 tsandstr vistech         394 Oct 10  2280 /nobackupp9/chenze/data/rogers/dta_cfd/met120_dta_ET128/m3.913a5.129b0.080/sslv_top.vct
-rw-r--r--  1 tsandstr vistech   208629828 Feb 28  2281 /nobackupp9/chenze/data/rogers/dta_cfd/met100_dta_ET128/m3.370a2.558b-0.041/q.surf

Of course, I don't know whether these are A) from the same event, B) from taking the previous LVM snapshot, or C) have been that way for a long time and are completely unrelated. I can upload the full file with all (590) such files if needed.

Comment by Mahmoud Hanafi [ 29/Apr/16 ]

We have upgraded to 2.7.1 and started a dry-run lfsck. This is the command we used:

lctl lfsck_start --dryrun -M nbp9-MDT0000 -A

Is that correct?

Comment by nasf (Inactive) [ 29/Apr/16 ]

The file list in nbp9_big_time_t.txt looks abnormal, even though you said the files came from a previously successful LVM backup. For further verification, you can dump their LMA EA ("trusted.lma") and linkEA ("trusted.link").
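
One way to dump those EAs (a hedged example; it assumes the MDT, or a snapshot of it, can be mounted read-only as ldiskfs at /mnt/mdt, under which the client-visible paths appear under ROOT/, and <suspect file> is a placeholder):

mount -t ldiskfs -o ro /dev/nbp9-vg/mdt9 /mnt/mdt
# Dump the LMA EA and the linkEA of one of the suspect files in hex:
getfattr -e hex -n trusted.lma  /mnt/mdt/ROOT/<suspect file>
getfattr -e hex -n trusted.link /mnt/mdt/ROOT/<suspect file>
umount /mnt/mdt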

As for the layout LFSCK, if you want to run it in "dryrun" mode, please try "lctl lfsck_start -M nbp9-MDT0000 -t layout -o --dryrun"

Comment by Mahmoud Hanafi [ 29/Apr/16 ]

Doesn't -A do both layout and orphan repair?

Comment by nasf (Inactive) [ 29/Apr/16 ]

"-A" is enabled automatically if "-o" specified.

Comment by Nathan Dauchy (Inactive) [ 29/Apr/16 ]

The lfsck completed last night after the upgrade to 2.7.1.

Initial indications are that there was a lot of improvement: out of the initial list of 24150 problem files, only 125 remain with no stripe info. They appear to be mostly grouped into a handful of directories (or their subdirectories) for a small set of users:

# sed 's/ has no .*//' /nobackupnfs2/mhanafi/nbp9_ost_scan/nbp9.no_stripe_files.out.after_lfsck | xargs -d '\n' ls -l | awk '{print $3}' | sort | uniq -c
ls: cannot access /nobackupp9/jchang/newfile: No such file or directory
     65 bduda
     45 bmauer
      1 cbrehm1
      1 dhixon
     11 mchoudha
      1 nfeather
# sed 's/ has no .*//' /nobackupnfs2/mhanafi/nbp9_ost_scan/nbp9.no_stripe_files.out.after_lfsck | xargs -d '\n' -L 1 dirname | xargs -L 1 dirname | sort | uniq -c
      1 /nobackupp9
     65 /nobackupp9/bduda
      1 /nobackupp9/cbrehm1/OpenRotor/6848rpm_ibio_with_Plate_newRun
      1 /nobackupp9/dhixon
     45 /nobackupp9/gmao_SIteam/ObservingSystem/reanalysis/aqua_disc/hdf
      6 /nobackupp9/mchoudha/powerflow
      1 /nobackupp9/mchoudha/powerflow/fine
      4 /nobackupp9/mchoudha/powerflow/medium_aoa0_copy
      1 /nobackupp9/nfeather/marti/Builds/Scaling/Dynamo

It also looks like the recovered files still lost their ACL information. Is it expected that lfsck is unable to recover those attributes?

Comment by nasf (Inactive) [ 29/Apr/16 ]

There are several possible cases for the remaining 125 no-stripe files:

1) The file was never written after it was created, so the OST-objects have no parent FID information.
2) The OST-object's parent FID EA is corrupted, so the OST-object does not know which file it belongs to. The layout LFSCK then marks it as a Lustre orphan and puts it under the directory "$mount_point/.lustre/lost+found/MDT0000/".
3) The OST-object was lost as well.
4) Something went wrong during the layout LFSCK such that it failed to re-generate the LOV EA.

In my experience, the first case is the most likely.

The layout LFSCK can recover file ownership information, but it cannot recover ACLs. Sorry for the inconvenience.

Comment by Nathan Dauchy (Inactive) [ 29/Apr/16 ]

There are indeed about 464350 files in the .lustre/lost+found/MDT0000/ directory. The vast majority of them are owned by root, have size zero, and are date-stamped around when the lfsck was run... so I'm assuming those are useless.

In going through the remaining ~969 "interesting" ones, several have IO errors:

pfe27 /nobackupp9/.lustre/lost+found/MDT0000 # file [0x2000a1e14:0x10e4b:0x0]-R-0
[0x2000a1e14:0x10e4b:0x0]-R-0: ERROR: cannot read `' (Input/output error)

pfe27 /nobackupp9/.lustre/lost+found/MDT0000 # ls -l [0x2000a1e14:0x10e4b:0x0]-R-0
-r-------- 1 dlrodrig g28100 20971520 Apr 28 20:31 [0x2000a1e14:0x10e4b:0x0]-R-0

pfe27 /nobackupp9/.lustre/lost+found/MDT0000 # lfs getstripe !$
lfs getstripe [0x2000a1e14:0x10e4b:0x0]-R-0
[0x2000a1e14:0x10e4b:0x0]-R-0
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_pattern:        40000001
lmm_layout_gen:     0
lmm_stripe_offset:  0
	obdidx		 objid		 objid		 group
	     0	             0	            0	             0
	     0	             0	            0	             0
	     0	             0	            0	             0
	    23	       7103687	     0x6c64c7	             0

Would I be correct in assuming that those are non-recoverable?

Comment by nasf (Inactive) [ 29/Apr/16 ]

For the files under .lustre/lost+found/MDT0000/, handle them as follows:

1) If the file is accessible (no IO error) and its size is NOT zero, then it contains some valid data. But since we do not know its original position in the namespace, you need to move it back into the namespace according to your memory or some new rule.

2) If the file is accessible and its size is zero, then it is safe to remove it directly; it contains no valid data, although removing it may affect the related inode quota a bit.

3) If the file is inaccessible with an IO error, that means only part of the stripes were found and there is a 'hole' in the re-generated LOV EA. You can dump the readable part of the data via "dd conv=sync,noerror ..." (see the example below). If that partial data is still valuable, save it to a new file; otherwise you will have to discard it.
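
For case 3), a minimal example of the suggested dd salvage (the output path is an assumption; conv=sync,noerror keeps going past read errors and pads the unreadable blocks with zeros):

# Salvage what is readable from a file whose missing stripes return EIO:
dd if='/nobackupp9/.lustre/lost+found/MDT0000/[0x2000a1e14:0x10e4b:0x0]-R-0' \
   of=/tmp/recovered.partial bs=1M conv=sync,noerror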

Comment by Nathan Dauchy (Inactive) [ 02/May/16 ]

Note: we do eventually still want to get to the bottom of the snapshot journal call trace and hang issue.

But first, another possible corruption scenario... A file used for hourly I/O tests grew from 960M before the lfsck to 57G afterward.  It appears the extra size is mostly zeros, embedded in the middle of the file.

# stat /nobackupp9/dtalcott/nagtest.toobig
  File: `/nobackupp9/dtalcott/nagtest.toobig'
  Size: 60150513664	Blocks: 3899408    IO Block: 4194304 regular file
Device: 162deh/90846d	Inode: 144117204932100247  Links: 1
Access: (0444/-r--r--r--)  Uid: (10491/dtalcott)   Gid: ( 1179/  cstaff)
Access: 2016-04-29 12:36:38.000000000 -0700
Modify: 2016-04-28 14:06:34.000000000 -0700
Change: 2016-04-29 11:55:56.000000000 -0700
 Birth: -

# stat /nobackupp9/dtalcott/nagtest
  File: `/nobackupp9/dtalcott/nagtest'
  Size: 1006632960	Blocks: 1966080    IO Block: 4194304 regular file
Device: 162deh/90846d	Inode: 144134705246109726  Links: 1
Access: (0644/-rw-r--r--)  Uid: (10491/dtalcott)   Gid: ( 1179/  cstaff)
Access: 2016-05-02 10:06:50.000000000 -0700
Modify: 2016-05-02 10:06:56.000000000 -0700
Change: 2016-05-02 10:06:56.000000000 -0700
 Birth: -

(The file was simply renamed, and a new one was created for other testing.)

# lfs getstripe /nobackupp9/dtalcott/nagtest.toobig
/nobackupp9/dtalcott/nagtest.toobig
lmm_stripe_count:   240
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     236
lmm_stripe_offset:  65
	obdidx		 objid		 objid		 group
	    65	       9682882	     0x93bfc2	             0
	   104	      10385635	     0x9e78e3	             0
	   121	       9870950	     0x969e66	             0
	    87	       9941827	     0x97b343	             0
	   170	        332727	      0x513b7	             0
	    13	        332418	      0x51282	             0
	     6	        332591	      0x5132f	             0
	   171	        332503	      0x512d7	             0
	    33	        332294	      0x51206	             0
	    21	        332406	      0x51276	             0
	   195	        332547	      0x51303	             0
	   126	        332574	      0x5131e	             0
	   162	        332560	      0x51310	             0
	   103	        332341	      0x51235	             0
	   144	        332517	      0x512e5	             0
	   196	        332547	      0x51303	             0
	   184	        332443	      0x5129b	             0
	   239	        332564	      0x51314	             0
	    89	        332596	      0x51334	             0
	    60	        332234	      0x511ca	             0
	    26	        332362	      0x5124a	             0
	    61	        332282	      0x511fa	             0
	   134	        332304	      0x51210	             0
	   187	        332391	      0x51267	             0
	   209	        332400	      0x51270	             0
	   213	        333037	      0x514ed	             0
	   211	        332515	      0x512e3	             0
	    30	        332396	      0x5126c	             0
	    50	        332564	      0x51314	             0
	   151	        332187	      0x5119b	             0
	    32	        332532	      0x512f4	             0
	    20	        332721	      0x513b1	             0
	    72	        332643	      0x51363	             0
	   143	        332666	      0x5137a	             0
	    73	        332940	      0x5148c	             0
	    92	        332507	      0x512db	             0
	    10	        332659	      0x51373	             0
	   109	        332645	      0x51365	             0
	   166	        332316	      0x5121c	             0
	   203	        332306	      0x51212	             0
	   193	        332351	      0x5123f	             0
	   149	        332595	      0x51333	             0
	     3	        332815	      0x5140f	             0
	    94	        332773	      0x513e5	             0
	    18	        332809	      0x51409	             0
	   183	        332545	      0x51301	             0
	    80	        332312	      0x51218	             0
	   212	        332258	      0x511e2	             0
	   120	        332442	      0x5129a	             0
	   191	        332265	      0x511e9	             0
	     9	        332443	      0x5129b	             0
	   108	        332580	      0x51324	             0
	   122	        332412	      0x5127c	             0
	   173	        332369	      0x51251	             0
	    38	        332405	      0x51275	             0
	   123	        332523	      0x512eb	             0
	     1	        332561	      0x51311	             0
	     5	        332717	      0x513ad	             0
	   131	        332752	      0x513d0	             0
	   158	        332602	      0x5133a	             0
	   210	        332550	      0x51306	             0
	   167	        332499	      0x512d3	             0
	    64	        332551	      0x51307	             0
	   116	        332689	      0x51391	             0
	    24	        332897	      0x51461	             0
	   111	        332624	      0x51350	             0
	   169	        332665	      0x51379	             0
	   220	        332578	      0x51322	             0
	   186	        332801	      0x51401	             0
	   205	        332960	      0x514a0	             0
	    70	        332644	      0x51364	             0
	    75	        332327	      0x51227	             0
	   225	        332395	      0x5126b	             0
	   133	        332705	      0x513a1	             0
	   147	        332449	      0x512a1	             0
	    46	        332438	      0x51296	             0
	   114	        332486	      0x512c6	             0
	   231	        332310	      0x51216	             0
	   112	        332828	      0x5141c	             0
	    52	        332709	      0x513a5	             0
	   136	        332738	      0x513c2	             0
	    95	        332467	      0x512b3	             0
	   105	        332314	      0x5121a	             0
	   172	        332398	      0x5126e	             0
	   218	        332224	      0x511c0	             0
	   189	        332112	      0x51150	             0
	    86	        332331	      0x5122b	             0
	    27	        332303	      0x5120f	             0
	   161	        332587	      0x5132b	             0
	   117	        332643	      0x51363	             0
	   179	        332506	      0x512da	             0
	   238	        332437	      0x51295	             0
	   226	        332206	      0x511ae	             0
	    23	        332669	      0x5137d	             0
	   208	        332492	      0x512cc	             0
	   228	        332685	      0x5138d	             0
	   216	        332582	      0x51326	             0
	   223	        332566	      0x51316	             0
	   185	        332800	      0x51400	             0
	    28	        332903	      0x51467	             0
	   138	        332639	      0x5135f	             0
	   157	        332528	      0x512f0	             0
	    22	        332328	      0x51228	             0
	   235	        332309	      0x51215	             0
	   113	        332504	      0x512d8	             0
	   181	        332486	      0x512c6	             0
	   227	        332562	      0x51312	             0
	   174	        332442	      0x5129a	             0
	   178	        332552	      0x51308	             0
	    55	        332324	      0x51224	             0
	    48	        332698	      0x5139a	             0
	   148	        332589	      0x5132d	             0
	   152	        332613	      0x51345	             0
	   159	        332651	      0x5136b	             0
	    41	        332617	      0x51349	             0
	   156	        332609	      0x51341	             0
	   202	        332762	      0x513da	             0
	   125	        332840	      0x51428	             0
	   230	        332751	      0x513cf	             0
	    43	        332886	      0x51456	             0
	   145	        332769	      0x513e1	             0
	   229	        332760	      0x513d8	             0
	    83	        332440	      0x51298	             0
	    14	        332357	      0x51245	             0
	    82	        332559	      0x5130f	             0
	   119	        332679	      0x51387	             0
	    16	        332860	      0x5143c	             0
	   180	        332702	      0x5139e	             0
	   200	        332568	      0x51318	             0
	   175	        332710	      0x513a6	             0
	   153	        332796	      0x513fc	             0
	   124	        332438	      0x51296	             0
	    58	        332344	      0x51238	             0
	   141	        332188	      0x5119c	             0
	   182	        332413	      0x5127d	             0
	    91	        332378	      0x5125a	             0
	    81	        332441	      0x51299	             0
	    69	        332581	      0x51325	             0
	    51	        332391	      0x51267	             0
	   190	        332423	      0x51287	             0
	    98	        332557	      0x5130d	             0
	   199	        332471	      0x512b7	             0
	   160	        332528	      0x512f0	             0
	    84	        332693	      0x51395	             0
	    88	        332739	      0x513c3	             0
	    79	        332455	      0x512a7	             0
	    57	        332518	      0x512e6	             0
	    44	        332375	      0x51257	             0
	    90	        332369	      0x51251	             0
	    45	        332629	      0x51355	             0
	   214	        332659	      0x51373	             0
	   107	        332818	      0x51412	             0
	    49	        332966	      0x514a6	             0
	    37	        332933	      0x51485	             0
	    99	        332822	      0x51416	             0
	   110	        332478	      0x512be	             0
	   130	        332654	      0x5136e	             0
	    71	        332394	      0x5126a	             0
	    96	        332281	      0x511f9	             0
	     4	        332616	      0x51348	             0
	   104	        332652	      0x5136c	             0
	    15	        332590	      0x5132e	             0
	   121	        332463	      0x512af	             0
	    76	        332578	      0x51322	             0
	    42	        332582	      0x51326	             0
	    29	        332589	      0x5132d	             0
	   102	        332444	      0x5129c	             0
	   219	        332330	      0x5122a	             0
	   129	        332467	      0x512b3	             0
	    85	        332737	      0x513c1	             0
	    19	        332804	      0x51404	             0
	   206	        332591	      0x5132f	             0
	    34	        332583	      0x51327	             0
	     7	        332498	      0x512d2	             0
	   176	        332351	      0x5123f	             0
	   164	        332292	      0x51204	             0
	     8	        332694	      0x51396	             0
	    47	        332683	      0x5138b	             0
	   217	        332318	      0x5121e	             0
	   188	        333078	      0x51516	             0
	    74	        332354	      0x51242	             0
	    77	        332476	      0x512bc	             0
	   198	        332637	      0x5135d	             0
	    11	        332451	      0x512a3	             0
	   177	        333010	      0x514d2	             0
	   101	        332725	      0x513b5	             0
	   163	        332432	      0x51290	             0
	    62	        332414	      0x5127e	             0
	   146	        332412	      0x5127c	             0
	   135	        332565	      0x51315	             0
	   128	        332862	      0x5143e	             0
	   100	        332828	      0x5141c	             0
	   168	        332484	      0x512c4	             0
	   127	        332413	      0x5127d	             0
	   201	        332432	      0x51290	             0
	   140	        332907	      0x5146b	             0
	   154	        332681	      0x51389	             0
	   221	        332738	      0x513c2	             0
	    54	        332759	      0x513d7	             0
	   155	        332498	      0x512d2	             0
	    17	        332498	      0x512d2	             0
	   165	        332624	      0x51350	             0
	    67	        332566	      0x51316	             0
	   142	        332525	      0x512ed	             0
	   194	        332587	      0x5132b	             0
	    87	        332708	      0x513a4	             0
	   224	        332711	      0x513a7	             0
	   132	        332822	      0x51416	             0
	   232	        332903	      0x51467	             0
	    63	        332711	      0x513a7	             0
	   233	        332720	      0x513b0	             0
	    12	        332866	      0x51442	             0
	   234	        332943	      0x5148f	             0
	    93	        332571	      0x5131b	             0
	   118	        332352	      0x51240	             0
	    59	        332427	      0x5128b	             0
	    65	        332296	      0x51208	             0
	   197	        332254	      0x511de	             0
	    35	        332315	      0x5121b	             0
	   222	        332360	      0x51248	             0
	    66	        332418	      0x51282	             0
	   215	        332613	      0x51345	             0
	   192	        332397	      0x5126d	             0
	    36	        332430	      0x5128e	             0
	    40	        332322	      0x51222	             0
	    31	        332412	      0x5127c	             0
	    25	        332424	      0x51288	             0
	   204	        332576	      0x51320	             0
	   106	        332455	      0x512a7	             0
	   237	        332340	      0x51234	             0
	   150	        332376	      0x51258	             0
	   139	        332451	      0x512a3	             0
	    97	        332461	      0x512ad	             0
	    53	        332565	      0x51315	             0
	   115	        332749	      0x513cd	             0
	    78	        332803	      0x51403	             0
	     2	        332885	      0x51455	             0
	    39	        332810	      0x5140a	             0
	     0	        333030	      0x514e6	             0
	    68	        332710	      0x513a6	             0

So far I have not been able to figure out any reasonable way to scan for other files that grew a lot and got filled with zeros. The best I can think of is a find for all files modified since 4/28 that are larger than ~1GB, then compare the apparent size (ls -l) with the blocks actually used (du -s); if they are more than ~10x apart it could be a problem file to look at individually. Ugh! Any other ideas?
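
One way to automate exactly that comparison (a sketch only; the date, size, and 10x thresholds are the arbitrary ones proposed above, and legitimately sparse files will also match):

# Apparent size (%s, bytes) vs. allocated space (%b, 512-byte blocks)
# for files over 1G modified since 28 Apr; print those more than ~10x apart:
find /nobackupp9 -type f -newermt 2016-04-28 -size +1G -printf '%s %b %p\n' \
    | awk '$1 > 10 * $2 * 512'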

Comment by nasf (Inactive) [ 03/May/16 ]

How many OSTs are in your system? Did you ever create the file nagtest.toobig with 240 stripes? If not, then it may be that some OST-object(s) contained a bad PFID EA and claimed that their parent is the file nagtest.toobig, and LFSCK could NOT know whether that was right or not. Since nothing else conflicted with it/them, nagtest.toobig's LOV EA was enlarged to include those OST-object(s), which is why the size increased.

According to the "lfs getstripe" output, we know every OST-object's ID, then we can use such ID to check on related OST. For example:

obdidx objid objid group
65 9682882 0x93bfc2 0

Here "65" corresponding to the OST0041 (HEX), the "9682882" corresponding to the sub-dir "9682882 % 32 = 2" under /O, you can check such OST-object via: debugfs -c -R "stat /O/0/d2/9682882" $OST0041_dev

On the other hand, the OST-objects with IDs around 332xxx look abnormal. Usually the OST-object IDs on different OSTs would be quite scattered, but in your case they are nearly adjacent numbers. That makes me suspect that the LOV EA was overwritten by some patterned data.
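
A sketch of turning the "lfs getstripe" output into the corresponding per-OST debugfs checks (it only prints the command to run on each OST; the $OST_DEV placeholder is an assumption, since the device path differs per OSS):

lfs getstripe /nobackupp9/dtalcott/nagtest.toobig \
    | awk 'NF == 4 && $1 ~ /^[0-9]+$/ { print $1, $2 }' \
    | while read obdidx objid; do
        # OST index in hex, object sub-dir = objid % 32, as described above:
        printf 'OST%04x: debugfs -c -R "stat /O/0/d%d/%d" $OST_DEV\n' \
            "$obdidx" "$((objid % 32))" "$objid"
      done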

Comment by Mahmoud Hanafi [ 03/May/16 ]

Is there a way to log which files lfsck makes changes to?

As for /nobackupp9/dtalcott/nagtest.toobig, I am attaching an object scan of all OSTs for this file. You can see that there were two objects claiming stripe 0. All of the stripes of this file are 4194304 bytes in size, but the stripe on OST 65 is 251658240 bytes.

#filename:type,blockgroup,uid,gid,inode,size,type,fid,stripe,ctime,atime,mtime
56.out.service169:F,32528,10491,1179,4163662,4194304,lma,[0x100000000:0x512a2:0x0],0,fid,[0x20001d596:0x97:0x0],0,1461629209,1461629204,1461629209
65.out.service162:F,2833,10491,1179,362692,251658240,lma,[0x100000000:0x93bfc2:0x0],0,fid,[0x20001d596:0x97:0x0],0,1461812807,1461812805,1461812807

and OST 65 also has stripe 62:
#filename:type,blockgroup,uid,gid,inode,size,type,fid,stripe,ctime,atime,mtime
65.out.service162:F,39432,10491,1179,5047326,4194304,lma,[0x100000000:0x51208:0x0],0,fid,[0x20001d596:0x97:0xd8],216,1461629208,1461629204,1461629208

Comment by nasf (Inactive) [ 04/May/16 ]

By default, the repair actions are recorded in the Lustre debug log under the label "D_LFSCK". But the Lustre kernel debug log lives in RAM only, so unless you dumped it periodically it will have been overwritten. I am not sure whether you still have such a log.

On the other hand, the log "nagtest.toobig.stripes" shows that there are 244 OST-objects claiming to be stripes of [0x20001d596:0x97:0x0], but the "lfs getstripe" output shows that the file "nagtest.toobig" only has 240 stripes, so at least 4 of those OST-objects are fake. I found that the 4 OST-objects with size 251658240 bytes conflict with another 4 OST-objects with size 4194304 bytes, as shown below:

A) 65.out.service162:F,2833,10491,1179,362692,251658240,lma,[0x100000000:0x93bfc2:0x0],0,fid,[0x20001d596:0x97:0x0],0,1461812807,1461812805,1461812807
B) 104.out.service169:F,240,10491,1179,30814,251658240,lma,[0x100000000:0x9e78e3:0x0],0,fid,[0x20001d596:0x97:0x1],1,1461812807,1461812805,1461812807
C) 121.out.service170:F,2375,10491,1179,304121,251658240,lma,[0x100000000:0x969e66:0x0],0,fid,[0x20001d596:0x97:0x2],2,1461812807,1461812805,1461812807
D) 87.out.service168:F,4361,10491,1179,558333,251658240,lma,[0x100000000:0x97b343:0x0],0,fid,[0x20001d596:0x97:0x3],3,1461812807,1461812805,1461812807

a) 56.out.service169:F,32528,10491,1179,4163662,4194304,lma,[0x100000000:0x512a2:0x0],0,fid,[0x20001d596:0x97:0x0],0,1461629209,1461629204,1461629209
b) 207.out.service176:F,37845,10491,1179,4844182,4194304,lma,[0x100000000:0x5139a:0x0],0,fid,[0x20001d596:0x97:0x1],1,1461629209,1461629204,1461629209
c) 137.out.service170:F,40964,10491,1179,5243393,4194304,lma,[0x100000000:0x512d2:0x0],0,fid,[0x20001d596:0x97:0x2],2,1461629209,1461629204,1461629209
d) 236.out.service173:F,35030,10491,1179,4483887,4194304,lma,[0x100000000:0x51320:0x0],0,fid,[0x20001d596:0x97:0x3],3,1461629209,1461629204,1461629209

The OST-objects A-D with size 251658240 bytes claim to be the first 4 stripes of the file "nagtest.toobig". The likely explanation is as follows: the file "nagtest.toobig" lost its LOV EA for some unknown reason, and either the OST-objects A-D or a-d have a corrupted PFID EA. Judging by the size attribute, A-D appear to be the ones with the corrupted PFID EA. During the LFSCK processing, the layout LFSCK tried to re-generate the "nagtest.toobig" LOV EA from these 244 orphan OST-objects. Unfortunately, the OST-objects A-D were found earlier than a-d, so the regenerated LOV EA contains the A-D information. Then, when the orphan OST-objects a-d were handled, the layout LFSCK found that they conflicted with existing entries, so it should have created some new files with names like "[0x20001d596:0x97:0x0]-[0x100000000:0x512a2:0x0]-C-0" under the directory ".lustre/lost+found/MDT0000/". Please check for verification.

Comment by Nathan Dauchy (Inactive) [ 04/May/16 ]

To follow up on question from yesterday, which you probably already guessed...

There are 240 OSTs on the corrupted file system (across 16 OSSes).

Comment by Nathan Dauchy (Inactive) [ 04/May/16 ]

The closest match I could find in lost+found was:

[0x20011c02e:0x1e13:0x0]-[0x20001d596:0x97:0x0]-3-C-0
[0x20011c02e:0x69c7:0x0]-[0x20001d596:0x97:0x0]-0-C-0
[0x20011c030:0x1219c:0x0]-[0x20001d596:0x97:0x0]-2-C-0
[0x20011c030:0x13ffc:0x0]-[0x20001d596:0x97:0x0]-1-C-0

And the file ownership is the same as /nobackupp9/dtalcott/nagtest.toobig

Comment by Nathan Dauchy (Inactive) [ 04/May/16 ]

Full listing of nonzero files in lost+found, gathered as:

cd /nobackupp9/.lustre/lost+found/MDT0000
find . -type f -a ! -size 0 | xargs ls -ld > /u/ndauchy/tmp/nbp9_diag/lostfound_nonzero.lst

Comment by nasf (Inactive) [ 04/May/16 ]

There are 3 kinds of files under your Lustre lost+found directory (a quick way to count each kind is sketched after this list):

1) The file name contains the infix "-C-". That means multiple OST-objects claimed the same MDT-object and the same slot in its LOV EA, so the layout LFSCK created new MDT-object(s) to hold the conflicting OST-object(s). Searching your "lostfound_nonzero.lst" file, there are 4 files with "-C-" names. As described in my former comment, they correspond to the file "nagtest.toobig" and conflict with the OST-objects A-D. If you think the layout LFSCK made the wrong decision when re-generating the "nagtest.toobig" LOV EA, we need to make a new patch to recover it. But since there is only one file with conflicting OST-objects, you can also recover "nagtest.toobig" manually.

2) The file name contains the infix "-R-". That means the orphan OST-object knows its parent MDT-object FID, but not its position (the file name) in the namespace, so a file had to be created under the known parent FID. In such a case, if you can recognize the file from its data, you can rename it back into the normal namespace.

3) The file name contains the infix "-O-". These are orphan MDT-objects. The backend e2fsck put them into the local /lost+found directory, which is invisible to the client. The LFSCK scans the MDT-objects under the local /lost+found directory; if an MDT-object has a valid linkEA, the LFSCK moves it back to the normal namespace, otherwise it is put under the global Lustre lost+found directory that is visible to the client. As in case 2), if you can recognize the file from its data, you can rename it back into the normal namespace. On the other hand, I noticed that the files with "-O-" names all have size 5792, which is strange; unless e2fsck found some corrupted inodes, that is almost impossible. So check these files with more human knowledge, and if you decide they are invalid, remove them directly.
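
A quick way to count how many of each kind you have (a simple sketch, run as root from a client):

cd /nobackupp9/.lustre/lost+found/MDT0000
ls | grep -c -- '-C-'    # conflicting OST-objects, case 1)
ls | grep -c -- '-R-'    # known parent FID but unknown name, case 2)
ls | grep -c -- '-O-'    # orphan MDT-objects from the backend lost+found, case 3)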

Comment by Nathan Dauchy (Inactive) [ 04/May/16 ]

OK, I think we have a decent handle on the lost+found contents now, and the e2fsck and lfsck behaviour. It sounds like perhaps there is room for an improvement in lfsck to handle the conflicts more elegantly?

What is still a significant unknown is how this corruption happened to begin with and what we can do to avoid it in the future. For now, we are no longer taking LVM snapshots of the MDT. Do you have a more complete explanation for the source of the lost striping information on the MDT? It is probably not just the LVM snapshot or e2fsck, as each has been used widely before. The server reboot while "lvcreate --snapshot" was running could have something to do with it, I suppose. Do the stack traces and ldiskfs_journal_start_sb messages provide any clues?

Comment by nasf (Inactive) [ 05/May/16 ]

OK, I think we have a decent handle on the lost+found contents now, and the e2fsck and lfsck behaviour. It sounds like perhaps there is room for an improvement in lfsck to handle the conflicts more elegantly?

If two orphan OST-objects conflict with each other when rebuilding the lost LOV EA, it is NOT easy, from the LFSCK's point of view, to know which one is right, because the LFSCK cannot analyse the OST-object data. We only found the trouble because of the abnormal file size, based on human judgement. From the LFSCK's point of view, since the file/object size is in the valid range, it cannot say that the current OST-objects A-D are invalid. So to that degree, the LFSCK needs some human assistance to make the right choice.

As for the reason for the lost stripe information, I do not know; I only have some guesses. Rebooting the system during "lvcreate --snapshot" is the most suspect. For a large LOV EA (with tens or hundreds of stripes), the LOV EA is stored in a separate block. If something interrupts the process in the interval between creating the inode and storing the LOV EA, the inode may lose its LOV EA. On the other hand, the implementation of the LVM snapshot may also be a factor. For example, there may be some internal data re-location during "lvcreate --snapshot"; if something interrupts that process, the system may be left in an internally inconsistent state.

According to the stack traces in this ticket, there were users accessing the system during "lvcreate --snapshot", right? If so, I think that would cause some copy-on-write, which may be blocked by the in-progress "lvcreate --snapshot" or handled concurrently with it, depending on the LVM implementation. Unfortunately, that is outside Lustre's scope, and I am not familiar with it.

Honestly, I am not a fan of LVM snapshots because of the many criticisms of their performance.

Andreas, do you have any idea about the LVM/e2fsck trouble?

Comment by Andreas Dilger [ 05/May/16 ]

It appears that we may be missing calls in osd-ldiskfs to check whether the block device is currently (being) frozen before submitting IO to the underlying storage (i.e. all modifications blocked at a barrier while in the middle of creating a snapshot).

In newer kernels (3.5+, e.g. RHEL7) this is handled with sb_start_write() and sb_end_write() at the upper layers, and sb_start_intwrite() and sb_end_intwrite() for filesystem internal writes, and should be taken before i_mutex (see e.g. kernel commit 8e8ad8a57c75f3b for examples).

In older kernels (e.g. RHEL6) it uses vfs_check_frozen() to see if the device is currently frozen, with level=SB_FREEZE_WRITE to block new modifications from starting, and level=SB_FREEZE_TRANS to only block internal threads if the freeze barrier is complete.

Comment by Nathan Dauchy (Inactive) [ 06/May/16 ]

At some point in the future we plan to migrate to RHEL/CentOS-7, but right now CentOS-6.7 is the priority. It looks like there is additional code that uses vfs_check_frozen in newer ext4/super.c sources, but it is not in our 2.6.32-504.30 kernel, nor is it mentioned in the changelog for the very recent CentOS-6.7 2.6.32-573.22 kernel. Are you saying that we just need to incorporate all the existing fixes, or that a new Lustre patch is needed to perform a check which is unique to ldiskfs?

If it would be more expedient to use some sort of external pause/resume commands before and after the lvm snapshot is created, then we could work that into the process as well.

Also, just curious if this should work for a running MDT or ldiskfs:

service322 ~ # fsfreeze -f /mnt/lustre/nbptest-mdt
fsfreeze: /mnt/lustre/nbptest-mdt: freeze failed: Operation not supported

Comment by Andreas Dilger [ 06/May/16 ]

The fsfreeze is the same as what LVM does internally, so it wouldn't change the problem being seen. We don't plumb that ioctl through the MDT mountpoint to the underlying filesystem, which is why there is an error reported.

As for fixing this problem, I suspect that some of the upstream ext4 patches need to be backported (in particular kernel commits 437f88cc031ffe7f and 6b0310fbf087ad6) but there is also likely a need to add checks into the osd-ldiskfs code to block new IO submissions when the device is freezing/frozen, since these do not have the VFS-level checks.

Comment by Gerrit Updater [ 09/May/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/20062
Subject: LU-8071 ldiskfs: handle system freeze protection
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b61bdfcaef1b03f8d0f082d57120681d71ab5e40

Comment by Jay Lan (Inactive) [ 10/May/16 ]

The last six patches in ldiskfs-2.6-rhel6.6.series of the master branch are:

rhel6.3/ext4-drop-inode-from-orphan-list-if-ext4_delete_inode-fails.patch
rhel6.6/ext4-remove-truncate-warning.patch
rhel6.6/ext4-corrupted-inode-block-bitmaps-handling-patches.patch
rhel6.3/ext4-notalloc_under_idatasem.patch
rhel6.5/ext4-give-warning-with-dir-htree-growing.patch
rhel6.6/ext4_s_max_ext_tree_depth.patch

Only the first two patches have already been picked into b2_7_fe; none of the six have been picked into b2_5_fe.

We are running CentOS 6.6, and it seems to me these patches are important to have as well. Some of our servers run 2.5.3 and the rest run 2.7.1. Is it safe for us to pick up those missing ldiskfs kernel patches? Please advise.

Comment by Andreas Dilger [ 18/May/16 ]

I would recommend against using the ext4-give-warning-with-dir-htree-growing.patch, as it also requires other changes to the Lustre code. The other patches are OK to use on other kernels.

Also, are the following patches from the upstream kernel already applied on your systems?

commit 437f88cc031ffe7f37f3e705367f4fe1f4be8b0f
Author:     Eric Sandeen <sandeen@sandeen.net>
AuthorDate: Sun Aug 1 17:33:29 2010 -0400
Commit:     Theodore Ts'o <tytso@mit.edu>
CommitDate: Sun Aug 1 17:33:29 2010 -0400

    ext4: fix freeze deadlock under IO
    
    Commit 6b0310fbf087ad6 caused a regression resulting in deadlocks
    when freezing a filesystem which had active IO; the vfs_check_frozen
    level (SB_FREEZE_WRITE) did not let the freeze-related IO syncing
    through.  Duh.
    
    Changing the test to FREEZE_TRANS should let the normal freeze
    syncing get through the fs, but still block any transactions from
    starting once the fs is completely frozen.
    
    I tested this by running fsstress in the background while periodically
    snapshotting the fs and running fsck on the result.  I ran into
    occasional deadlocks, but different ones.  I think this is a
    fine fix for the problem at hand, and the other deadlocky things
    will need more investigation.
    
    Reported-by: Phillip Susi <psusi@cfl.rr.com>
    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
commit 6b0310fbf087ad6e9e3b8392adca97cd77184084
Author:     Eric Sandeen <sandeen@redhat.com>
AuthorDate: Sun May 16 02:00:00 2010 -0400
Commit:     Theodore Ts'o <tytso@mit.edu>
CommitDate: Sun May 16 02:00:00 2010 -0400

    ext4: don't return to userspace after freezing the fs with a mutex held
    
    ext4_freeze() used jbd2_journal_lock_updates() which takes
    the j_barrier mutex, and then returns to userspace.  The
    kernel does not like this:
    
    ================================================
    [ BUG: lock held when returning to user space! ]
    ------------------------------------------------
    lvcreate/1075 is leaving the kernel with locks still held!
    1 lock held by lvcreate/1075:
     #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
    jbd2_journal_lock_updates+0xe1/0xf0
    
    Use vfs_check_frozen() added to ext4_journal_start_sb() and
    ext4_force_commit() instead.
    
    Addresses-Red-Hat-Bugzilla: #568503
    
    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

I would guess yes, since they originated from Red Hat, but just wanted to confirm.

Comment by Gerrit Updater [ 02/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20062/
Subject: LU-8071 ldiskfs: handle system freeze protection
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bd40ca206881eefeeb6ad7586f93afd685bb8120

Comment by Peter Jones [ 02/Jun/16 ]

Now landed for 2.9 and queued up for maintenance releases. Is there anything further you need on this ticket or can it be closed?

Comment by Jay Lan (Inactive) [ 02/Jun/16 ]

Yes, this ticket can be closed. Thanks!

Comment by Peter Jones [ 02/Jun/16 ]

Thanks Jay!
