[LU-10677] can't delete directory Created: 16/Feb/18  Updated: 06/Dec/18

Status: In Progress
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: SC Admin (Inactive) Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: None
Environment:

zfs 0.7.5, OPA, skylake, centos7


Attachments: File 10677-rm58-warble1.log     File 10677-rm58-warble2.log     File 10677-warble1.log     File 10677-warble2.log    
Issue Links:
Related
is related to LU-8990 Failback LBUG lod_device_free()) ASSE... Resolved
is related to LU-10028 Export the information about Lustre s... Resolved
Epic/Theme: dne
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hiya,

we have 2 MDS's, with 1 MDT on one of them and 2 MDTs on the other, so 3 MDT's in total. each MDT consists of 2 hardware raid1 mirrors, with a zfs mirror putting those together into one MDT in one zpool, so 4-way replication.

latest centos7.4 kernels 3.10.0-693.17.1.el7.x86_64 everywhere. nopti set on lustre servers. 8 OSS's if that matters. multipath on all lustre servers. purely software raidz3 on OSS's.

we are testing DNE2 with 3-way dir striping, and also with default striping inheritance to all sub-dirs.
the test below fails and seems repeatable.

# lfs setdirstripe -c 3 mdt0-2
# lfs setdirstripe -D -c 3 mdt0-2
# chown rhumble mdt0-2
[rhumble@farnarkle2 ~]$ for f in /dagg/old_stuff/rjh/mdtest/mdt*; do echo === $f === ; time ( cd $f ; for g in {0000..9999}; do mkdir $g; for h in {00..99}; do mkdir $g/$h; done; done ) ; time rm -rf $f/*; done
...
=== /dagg/old_stuff/rjh/mdtest/mdt0-2 ===

real    57m21.053s
user    8m36.378s
sys     18m25.963s
rm: cannot remove ‘/dagg/old_stuff/rjh/mdtest/mdt0-2/2556’: Directory not empty

real    72m52.257s
user    0m4.197s
sys     7m59.024s
...

[rhumble@farnarkle2 ~]$ ls -al /dagg/old_stuff/rjh/mdtest/mdt0-2/2556
total 894
drwxrwxr-x 3 rhumble hpcadmin  76800 Feb 16 03:33 .
drwxr-xr-x 3 rhumble hpcadmin 838656 Feb 16 15:46 ..
[rhumble@farnarkle2 ~]$ rmdir /dagg/old_stuff/rjh/mdtest/mdt0-2/2556
rmdir: failed to remove ‘/dagg/old_stuff/rjh/mdtest/mdt0-2/2556’: Directory not empty
[rhumble@farnarkle2 ~]$ rm -rf /dagg/old_stuff/rjh/mdtest/mdt0-2/2556
rm: cannot remove ‘/dagg/old_stuff/rjh/mdtest/mdt0-2/2556’: Directory not empty

there aren't any problems seen with the other 4 dirs tested.
the other 4 dirs are mdt0, mdt1, mdt2, which each have dir striping set to only that mdt and no default (-D) set, plus a directory with 3-way dir striping and no default (-D) set. ie.

[root@farnarkle1 ~]# lfs getdirstripe /dagg/old_stuff/rjh/mdtest/mdt0
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none

[root@farnarkle1 ~]# lfs getdirstripe /dagg/old_stuff/rjh/mdtest/mdt1
lmv_stripe_count: 0 lmv_stripe_offset: 1 lmv_hash_type: none

[root@farnarkle1 ~]# lfs getdirstripe /dagg/old_stuff/rjh/mdtest/mdt2
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none

[root@farnarkle1 ~]# lfs getdirstripe /dagg/old_stuff/rjh/mdtest/mdt0-2
lmv_stripe_count: 3 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     0           [0x20000b7bd:0x4a1b:0x0]
     1           [0x28001639c:0x4a58:0x0]
     2           [0x680016b6b:0x4a58:0x0]

[root@farnarkle1 ~]# lfs getdirstripe /dagg/old_stuff/rjh/mdtest/mdt0-2-no-inherit
lmv_stripe_count: 3 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     0           [0x20000bfa7:0xa63a:0x0]
     1           [0x2800182f7:0xa69f:0x0]
     2           [0x680018abd:0xa697:0x0]

[root@farnarkle1 ~]# lfs getdirstripe -D /dagg/old_stuff/rjh/mdtest/mdt0
lmv_stripe_count: 0 lmv_stripe_offset: -1 lmv_hash_type: none

[root@farnarkle1 ~]# lfs getdirstripe -D /dagg/old_stuff/rjh/mdtest/mdt1
lmv_stripe_count: 0 lmv_stripe_offset: -1 lmv_hash_type: none

[root@farnarkle1 ~]# lfs getdirstripe -D /dagg/old_stuff/rjh/mdtest/mdt2
lmv_stripe_count: 0 lmv_stripe_offset: -1 lmv_hash_type: none

[root@farnarkle1 ~]# lfs getdirstripe -D /dagg/old_stuff/rjh/mdtest/mdt0-2
lmv_stripe_count: 3 lmv_stripe_offset: -1 lmv_hash_type: fnv_1a_64

[root@farnarkle1 ~]# lfs getdirstripe -D /dagg/old_stuff/rjh/mdtest/mdt0-2-no-inherit/
lmv_stripe_count: 0 lmv_stripe_offset: -1 lmv_hash_type: none

the un-removable directories have only appeared on the 3-way -D directory, so I suspect the bug is to do with DNE2 and the -D inheritance stuff in particular.

I also re-ran the test with all 3 MDT's on one MDS, and the same thing happened - one directory was un-removable by any means.

there's nothing in dmesg or syslog.

cheers,
robin



 Comments   
Comment by Robin Humble [ 16/Feb/18 ]

oh, and our lustre (client and server) were patched with https://jira.hpdd.intel.com/browse/LU-10212 https://review.whamcloud.com/#/c/29992/ 'cos we were hitting ESTALE on clients.

Comment by Andreas Dilger [ 23/Feb/18 ]

Robin, after the "rm -r" fails, if you do "ls -lR" of the directory, does it show any entries left behind, or is the directory empty? It is possible there is some problem iterating the directory entries during the rm that leaves something behind.

Lai, are you able to reproduce this?

Comment by SC Admin (Inactive) [ 23/Feb/18 ]

Hi Andreas,

the directory 2556/ is empty, but can't be rmdir'd.

cheers,
robin

Comment by Lai Siyao [ 23/Feb/18 ]

I'm trying to reproduce this, but haven't seen it at smaller scale on master or 2.10 yet. In the meantime could you do the following to dump debug logs?
1. 'lctl set_param debug=-1' on both MDS's.
2. 'lctl clear' on both MDS's to clear previous debug logs.
3. 'rm /dagg/old_stuff/rjh/mdtest/mdt0-2/2556'
4. 'lctl dk /tmp/10677-`hostname`.log' to collect debug logs on both MDS's, and attach them here.
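
(For convenience, a copy-pasteable sketch of those steps; the path is the one from the original report, and rmdir stands in for the failing removal:)

# on each of the two MDSs
lctl set_param debug=-1
lctl clear
# on a client, retry the failing removal
rmdir /dagg/old_stuff/rjh/mdtest/mdt0-2/2556
# back on each MDS, dump the buffer and attach the resulting files here
lctl dk /tmp/10677-`hostname`.log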

Also could you run 'lfs getdirstripe /dagg/old_stuff/rjh/mdtest/mdt0-2/2556' and 'stat /dagg/old_stuff/rjh/mdtest/mdt0-2/2556', and post the output here?

Comment by SC Admin (Inactive) [ 24/Feb/18 ]

Hi Lai,

yeah, sorry, I should have mentioned - scale is important.
the above reproducer makes 10k dirs, each with 100 subdirs.
I also ran a bunch with 1k dirs, each with 100 subdirs, and they all worked fine.
so something about 10k (and presumably above) triggers it for us.

cheers,
robin

Comment by SC Admin (Inactive) [ 24/Feb/18 ]

Hi Lai,

I couldn't delete these dirs (there are 2 now), but I did mv them so I could do more mdtests, so in the output below and in the attached logs the paths differ slightly from those above, but they're the same dirs.

[root@farnarkle1 ~]# lfs getdirstripe /dagg/old_stuff/rjh/mdtest/corrupted/2556
lmv_stripe_count: 3 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     0           [0x20000c067:0x12e91:0x0]
     1           [0x2800183d4:0x12f1b:0x0]
     2           [0x680018bb6:0x12f24:0x0]

[root@farnarkle1 ~]# stat /dagg/old_stuff/rjh/mdtest/corrupted/2556
  File: ‘/dagg/old_stuff/rjh/mdtest/corrupted/2556’
  Size: 76800           Blocks: 150        IO Block: 131072 directory
Device: ef57e2ach/4015514284d   Inode: 144116014454481792  Links: 3
Access: (0775/drwxrwxr-x)  Uid: ( 1040/ rhumble)   Gid: (10190/hpcadmin)
Access: 2018-02-24 12:46:01.000000000 +1100
Modify: 2018-02-16 03:33:20.000000000 +1100
Change: 2018-02-16 20:30:03.000000000 +1100
 Birth: -

Comment by SC Admin (Inactive) [ 24/Feb/18 ]

and just in case it isn't obvious from the logs, the current arrangement of MDTs is

# cexec mds: df
************************* mds *************************
--------- warble1---------
Filesystem                    1K-blocks     Used Available Use% Mounted on
...
warble1-MGT-pool/MGT             245248     1920    241280   1% /lustre/MGT
warble1-dagg-MDT2-pool/MDT2   632920832 10578560 622340224   2% /lustre/dagg/MDT2
warble1-dagg-MDT1-pool/MDT1   632918016 11151360 621764608   2% /lustre/dagg/MDT1
warble1-images-MDT0-pool/MDT0 109846912  1504384 108340480   2% /lustre/images/MDT0
--------- warble2---------
Filesystem                  1K-blocks    Used Available Use% Mounted on
...
warble2-dagg-MDT0-pool/MDT0 632916480 8628480 624285952   2% /lustre/dagg/MDT0
warble2-home-MDT0-pool/MDT0 109847680  546944 109298688   1% /lustre/home/MDT0
warble2-apps-MDT0-pool/MDT0 109834880 3831680 106001152   4% /lustre/apps/MDT0

so dagg mdt0 is on warble2, and mdt1,2 on warble1. each has its own zpool.

Comment by Lai Siyao [ 24/Feb/18 ]

the 'stat' result shows a 'Links' count of 3, while the correct link count for an empty dir should be 2; I'm afraid some error in 'rm' left one link unreleased. You said there is nothing in dmesg or syslog - did you check them on the MDS's?
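
(For reference: on a POSIX filesystem an empty directory has link count 2 - its own "." plus the entry in its parent - and each subdirectory adds one more via its ".." entry, so Links: 3 suggests one leftover subdirectory. A quick local check:)

mkdir /tmp/t && stat -c %h /tmp/t        # prints 2 for an empty directory
mkdir /tmp/t/sub && stat -c %h /tmp/t    # prints 3 once one subdirectory exists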

Comment by SC Admin (Inactive) [ 27/Feb/18 ]

Hi Lai,

there is nothing in dmesg or syslog, either when this happens the first time or on subsequent attempts to remove the directory.
to check this again, I re-ran my reproducer above. 1 attempt with 10k dirs and 1 attempt with 20k dirs didn't produce any corrupted dirs, but the next run got 2 corrupted (unremovable) dirs, and again only on the -c3 -D directory ->

[rhumble@farnarkle2 ~]$ logger mdt0-2 start 20k ; for f in /dagg/old_stuff/rjh/mdtest/mdt*; do echo === $f === ; time ( cd $f ; for g in {00000..19999}; do mkdir $g; mkdir $g/{00..99}; done ) ; time rm -rf $f/*; done ; logger mdt0-2 end 20k
=== /dagg/old_stuff/rjh/mdtest/mdt0 ===

real    7m24.288s
user    0m26.297s
sys     1m35.283s

real    18m17.211s
user    0m5.137s
sys     4m41.977s
=== /dagg/old_stuff/rjh/mdtest/mdt0-2 ===

real    58m7.293s
user    0m26.627s
sys     2m8.602s
rm: cannot remove ‘/dagg/old_stuff/rjh/mdtest/mdt0-2/15892’: Directory not empty
rm: cannot remove ‘/dagg/old_stuff/rjh/mdtest/mdt0-2/18199’: Directory not empty

real    144m0.824s
user    0m9.201s
sys     16m53.631s
=== /dagg/old_stuff/rjh/mdtest/mdt0-2-no-inherit ===

real    6m36.442s
user    0m24.291s
sys     1m22.405s

real    18m10.672s
user    0m4.968s
sys     4m49.614s
=== /dagg/old_stuff/rjh/mdtest/mdt1 ===

real    6m38.597s
user    0m24.619s
sys     1m19.193s

real    17m1.638s
user    0m5.000s
sys     4m12.672s
=== /dagg/old_stuff/rjh/mdtest/mdt2 ===

real    6m34.660s
user    0m23.893s
sys     1m15.038s

real    16m46.268s
user    0m4.694s
sys     4m6.953s

looking at the broken dirs ->

[rhumble@farnarkle2 ~]$ ls -alR /dagg/old_stuff/rjh/mdtest/mdt0-2
/dagg/old_stuff/rjh/mdtest/mdt0-2:
total 1770
drwxr-xr-x 4 rhumble hpcadmin 1625088 Feb 27 04:15 ./
drwxr-xr-x 8 rhumble hpcadmin   33280 Feb 16 20:30 ../
drwxrwxr-x 3 rhumble hpcadmin   76800 Feb 27 03:46 15892/
drwxrwxr-x 3 rhumble hpcadmin   76800 Feb 27 04:02 18199/

/dagg/old_stuff/rjh/mdtest/mdt0-2/15892:
total 1662
drwxrwxr-x 3 rhumble hpcadmin   76800 Feb 27 03:46 ./
drwxr-xr-x 4 rhumble hpcadmin 1625088 Feb 27 04:15 ../

/dagg/old_stuff/rjh/mdtest/mdt0-2/18199:
total 1662
drwxrwxr-x 3 rhumble hpcadmin   76800 Feb 27 04:02 ./
drwxr-xr-x 4 rhumble hpcadmin 1625088 Feb 27 04:15 ../

[rhumble@farnarkle2 ~]$ lfs getdirstripe /dagg/old_stuff/rjh/mdtest/mdt0-2/15892
lmv_stripe_count: 3 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     0           [0x20000f650:0x29df:0x0]
     1           [0x28001b433:0x2a42:0x0]
     2           [0x68001bc55:0x2a42:0x0]

[rhumble@farnarkle2 ~]$ stat /dagg/old_stuff/rjh/mdtest/mdt0-2/15892
  File: ‘/dagg/old_stuff/rjh/mdtest/mdt0-2/15892’
  Size: 76800           Blocks: 150        IO Block: 131072 directory
Device: ef57e2ach/4015514284d   Inode: 144116245963231287  Links: 3
Access: (0775/drwxrwxr-x)  Uid: ( 1040/ rhumble)   Gid: (10190/hpcadmin)
Access: 2018-02-27 03:46:11.000000000 +1100
Modify: 2018-02-27 03:46:11.000000000 +1100
Change: 2018-02-27 03:46:11.000000000 +1100
 Birth: -

[rhumble@farnarkle2 ~]$ stat /dagg/old_stuff/rjh/mdtest/mdt0-2/18199
  File: ‘/dagg/old_stuff/rjh/mdtest/mdt0-2/18199’
  Size: 76800           Blocks: 150        IO Block: 131072 directory
Device: ef57e2ach/4015514284d   Inode: 468376269654884927  Links: 3
Access: (0775/drwxrwxr-x)  Uid: ( 1040/ rhumble)   Gid: (10190/hpcadmin)
Access: 2018-02-27 04:02:50.000000000 +1100
Modify: 2018-02-27 04:02:50.000000000 +1100
Change: 2018-02-27 04:02:50.000000000 +1100
 Birth: -

[rhumble@farnarkle2 ~]$ rmdir /dagg/old_stuff/rjh/mdtest/mdt0-2/18199 /dagg/old_stuff/rjh/mdtest/mdt0-2/15892
rmdir: failed to remove ‘/dagg/old_stuff/rjh/mdtest/mdt0-2/18199’: Directory not empty
rmdir: failed to remove ‘/dagg/old_stuff/rjh/mdtest/mdt0-2/15892’: Directory not empty

one difference between the 2 recent ok runs and this incorrect run is that I moved the 3 MDTs apart onto their (normal) 2 separate MDS's for this run. for the 2 previous runs all the MDTs had been on one MDS. however, as the first post in this ticket said, I've also been able to hit the bug when all the MDTs were on one MDS. it does seem slightly easier to trigger when they're spread out though.

please note we have also moved to zfs 0.7.6. the first corruption in this ticket was with 0.7.5. so ZFS version doesn't seem to matter.

cheers,
robin

Comment by Lai Siyao [ 27/Feb/18 ]

I see, it looks like there is a race in refcounting somewhere; I'll check the related code.

Comment by SC Admin (Inactive) [ 07/Mar/18 ]

Hi,

I'd also greatly appreciate advice on how to delete the residual directories created from the above testing in order to tidy up the filesystem.
thanks.

I presume I have to mount the MDTs as zfs and rm something by hand?

cheers,
robin

Comment by Lai Siyao [ 08/Mar/18 ]

yes, I think so.

Comment by Robin Humble [ 19/Mar/18 ]

can you be more specific please.
thanks.

cheers,
robin

Comment by nasf (Inactive) [ 25/Mar/18 ]

The 10677-warble2.log shows that:

00000004:00000001:13.0:1519436864.845186:0:332748:0:(mdd_dir.c:371:mdd_dir_is_empty()) Process entered
00000004:00000001:13.0:1519436864.845186:0:332748:0:(lod_object.c:382:lod_striped_it_init()) Process entered
00080000:00000001:13.0:1519436864.845187:0:332748:0:(osd_index.c:147:osd_index_it_init()) Process entered
00080000:00000010:13.0:1519436864.845187:0:332748:0:(osd_index.c:156:osd_index_it_init()) slab-alloced 'it': 280 at ffff88bdc4f16f08.
00080000:00000010:13.0:1519436864.845188:0:332748:0:(osd_index.c:106:osd_zap_cursor_init()) kmalloced 't': 56 at ffff88bb0640a940.
00080000:00000001:13.0:1519436864.845188:0:332748:0:(osd_index.c:170:osd_index_it_init()) Process leaving (rc=18446612947367194376 : -131126342357240 : ffff88bdc4f16f08)
00080000:00000001:13.0:1519436864.845188:0:332748:0:(osd_index.c:726:osd_dir_it_init()) Process leaving (rc=18446612947367194376 : -131126342357240 : ffff88bdc4f16f08)
00000004:00000001:13.0:1519436864.845189:0:332748:0:(lod_object.c:461:lod_striped_it_get()) Process entered
00080000:00000001:13.0:1519436864.845190:0:332748:0:(osd_index.c:746:osd_dir_it_get()) Process entered
00080000:00000001:13.0:1519436864.845190:0:332748:0:(osd_index.c:760:osd_dir_it_get()) Process leaving (rc=1 : 1 : 1)
00000004:00000001:13.0:1519436864.845190:0:332748:0:(lod_object.c:509:lod_striped_it_next()) Process entered
00080000:00000001:13.0:1519436864.845191:0:332748:0:(osd_index.c:832:osd_dir_it_next()) Process entered
00080000:00000001:13.0:1519436864.845191:0:332748:0:(osd_index.c:845:osd_dir_it_next()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:13.0:1519436864.845191:0:332748:0:(lod_object.c:522:lod_striped_it_next()) Process leaving (rc=0 : 0 : 0)

==> for "dot".

00000004:00000001:13.0:1519436864.845192:0:332748:0:(lod_object.c:509:lod_striped_it_next()) Process entered
00080000:00000001:13.0:1519436864.845192:0:332748:0:(osd_index.c:832:osd_dir_it_next()) Process entered
00080000:00000001:13.0:1519436864.845192:0:332748:0:(osd_index.c:845:osd_dir_it_next()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:13.0:1519436864.845192:0:332748:0:(lod_object.c:522:lod_striped_it_next()) Process leaving (rc=0 : 0 : 0)

==> for "dotdot".

00000004:00000001:13.0:1519436864.845193:0:332748:0:(lod_object.c:509:lod_striped_it_next()) Process entered
00080000:00000001:13.0:1519436864.845193:0:332748:0:(osd_index.c:832:osd_dir_it_next()) Process entered
00080000:00000001:13.0:1519436864.845198:0:332748:0:(osd_index.c:862:osd_dir_it_next()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:13.0:1519436864.845198:0:332748:0:(lod_object.c:522:lod_striped_it_next()) Process leaving (rc=0 : 0 : 0)

==> another unknown entry.

00080000:00000001:13.0:1519436864.845199:0:332748:0:(osd_index.c:177:osd_index_it_fini()) Process entered
00080000:00000010:13.0:1519436864.845199:0:332748:0:(osd_index.c:119:osd_zap_cursor_fini()) kfreed 'zc': 56 at ffff88bb0640a940.
00080000:00000010:13.0:1519436864.845200:0:332748:0:(osd_index.c:186:osd_index_it_fini()) slab-freed '(it)': 280 at ffff88bdc4f16f08.
00080000:00000001:13.0:1519436864.845200:0:332748:0:(osd_index.c:188:osd_index_it_fini()) Process leaving
00000004:00000001:13.0:1519436864.845200:0:332748:0:(mdd_dir.c:399:mdd_dir_is_empty()) Process leaving (rc=18446744073709551577 : -39 : ffffffffffffffd9)

If this log is from trying to remove the trouble directory, then it means the first stripe of the striped directory contains at least one unknown entry (neither "dot" nor "dotdot"). But according to your description, such an entry is invisible to the client via "ls -al". Since it is a test environment, I would suggest that if you can reproduce the trouble, please mount the related MDT as "zfs" and run "ls -alR" on the trouble directory (its first stripe) via ZPL. If that still shows nothing, then I suspect there are some dummy entries left in that stripe.

NOTE: Please do NOT remove the trouble directory via ZPL directly.
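
(A rough sketch of that inspection, using the pool/dataset name and stripe FID that Robin reports later in this ticket as examples - substitute the MDT hosting the stripe in question, and note the warning above about not deleting anything via ZPL:)

# on the MDS hosting that stripe, with the MDT taken out of Lustre service
zfs set canmount=on warble1-dagg-MDT1-pool/MDT1
umount /lustre/dagg/MDT1                      # unmount the Lustre MDT
zfs mount warble1-dagg-MDT1-pool/MDT1         # mount the same dataset via ZPL
# the stripe's directory object lives under oi.65/<stripe FID>
ls -alR /warble1-dagg-MDT1-pool/MDT1/oi.65/0x280019a41:0xa9dc:0x0
# when done, unmount the ZPL mount, set canmount back to off, and remount as Lustre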

Comment by Robin Humble [ 26/Mar/18 ]

Hi. thanks for looking into this.

actually it's not a test filesystem now, it's the main production filesystem.
it doesn't seem possible to mount a zfs snapshot whilst lustre is running (please let me know how to do this if it's possible), so I did a short cluster downtime.

we have (at least) 4 dirs so far ->

[root@john5 corrupted]# ls -alR
.:
total 365
drwxrwxr-x 6 root    root     33280 Feb 27 13:06 .
drwxr-xr-x 8 rhumble hpcadmin 33280 Feb 16 20:30 ..
drwxrwxr-x 3 rhumble hpcadmin 76800 Feb 16 20:07 0291
drwxrwxr-x 3 rhumble hpcadmin 76800 Feb 27 03:46 15892
drwxrwxr-x 3 rhumble hpcadmin 76800 Feb 27 04:02 18199
drwxrwxr-x 3 rhumble hpcadmin 76800 Feb 16 03:33 2556

./0291:
total 108
drwxrwxr-x 3 rhumble hpcadmin 76800 Feb 16 20:07 .
drwxrwxr-x 6 root    root     33280 Feb 27 13:06 ..

./15892:
total 108
drwxrwxr-x 3 rhumble hpcadmin 76800 Feb 27 03:46 .
drwxrwxr-x 6 root    root     33280 Feb 27 13:06 ..

./18199:
total 108
drwxrwxr-x 3 rhumble hpcadmin 76800 Feb 27 04:02 .
drwxrwxr-x 6 root    root     33280 Feb 27 13:06 ..

./2556:
total 108
drwxrwxr-x 3 rhumble hpcadmin 76800 Feb 16 03:33 .
drwxrwxr-x 6 root    root     33280 Feb 27 13:06 ..

which can't be deleted

[root@john5 corrupted]# rmdir *
rmdir: failed to remove ‘0291’: Directory not empty
rmdir: failed to remove ‘15892’: Directory not empty
rmdir: failed to remove ‘18199’: Directory not empty
rmdir: failed to remove ‘2556’: Directory not empty

with striping

[root@john5 corrupted]# lfs getdirstripe *
lmv_stripe_count: 3 lmv_stripe_offset: 1 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     1           [0x280019a41:0xa9dc:0x0]
     2           [0x68001a211:0xa9dd:0x0]
     0           [0x20000cf21:0xa9dd:0x0]
lmv_stripe_count: 3 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     0           [0x20000f650:0x29df:0x0]
     1           [0x28001b433:0x2a42:0x0]
     2           [0x68001bc55:0x2a42:0x0]
lmv_stripe_count: 3 lmv_stripe_offset: 2 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     2           [0x68001bc57:0x21e1:0x0]
     0           [0x20000f651:0x224b:0x0]
     1           [0x28001b435:0x2272:0x0]
lmv_stripe_count: 3 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     0           [0x20000c067:0x12e91:0x0]
     1           [0x2800183d4:0x12f1b:0x0]
     2           [0x680018bb6:0x12f24:0x0]

I picked the first one and mounted MDT1 as zfs

zfs set canmount=on warble1-dagg-MDT1-pool/MDT1
umount /lustre/dagg/MDT1
zfs mount warble1-dagg-MDT1-pool/MDT1
find /warble1-dagg-MDT1-pool/MDT1 -name 0x280019a41:0xa9dc:0x0

and there is indeed something in there. 2 levels of dirs in fact ->

[warble1]root: ls -alR /warble1-dagg-MDT1-pool/MDT1/oi.65/0x280019a41:0xa9dc:0x0
/warble1-dagg-MDT1-pool/MDT1/oi.65/0x280019a41:0xa9dc:0x0:
total 16571
drwxrwxr-x 3 rhumble hpcadmin 2 Feb 16 20:07 .
drwxr-xr-x 0 root    root     0 Jan  1  1970 ..
drwxrwxr-x 5 rhumble hpcadmin 2 Feb 16 19:28 58

/warble1-dagg-MDT1-pool/MDT1/oi.65/0x280019a41:0xa9dc:0x0/58:
total 75
drwxrwxr-x 5 rhumble hpcadmin 2 Feb 16 19:28 .
drwxrwxr-x 3 rhumble hpcadmin 2 Feb 16 20:07 ..
drwxrwxr-x 2 rhumble hpcadmin 2 Feb 16 19:28 [0x280019a41:0xa9f1:0x0]:0

/warble1-dagg-MDT1-pool/MDT1/oi.65/0x280019a41:0xa9dc:0x0/58/[0x280019a41:0xa9f1:0x0]:0:
total 50
drwxrwxr-x 2 rhumble hpcadmin 2 Feb 16 19:28 .
drwxrwxr-x 5 rhumble hpcadmin 2 Feb 16 19:28 ..

cheers,
robin

Comment by nasf (Inactive) [ 26/Mar/18 ]

/warble1-dagg-MDT1-pool/MDT1/oi.65/0x280019a41:0xa9dc:0x0/58:
total 75
drwxrwxr-x 5 rhumble hpcadmin 2 Feb 16 19:28 .
drwxrwxr-x 3 rhumble hpcadmin 2 Feb 16 20:07 ..
drwxrwxr-x 2 rhumble hpcadmin 2 Feb 16 19:28 [0x280019a41:0xa9f1:0x0]:0

This is not a plain directory; it is a shard of another striped directory, "58". Normally, if you did not set a stripe for the directory "58", it should have 3 shards by default, inherited from its parent. But as shown, there is only one shard. So I suspect something went wrong when the striped directory "58" was removed: the other two shards were destroyed, but removal of the local shard failed. With the corrupted striped directory "58" left behind, we cannot remove its parent directory - I assume that is "0291".

Please find a new client and run the following:

ls -ailR 0291
ls -ailR 0291/58

Comment by Robin Humble [ 26/Mar/18 ]
[root@john6 corrupted]# ls -ailR 0291
0291:
total 108
180145747155549248 drwxrwxr-x 3 rhumble hpcadmin 76800 Feb 16 20:07 .
144115977343280607 drwxrwxr-x 6 root    root     33280 Feb 27 13:06 ..
[root@john6 corrupted]# ls -ailR 0291/58
0291/58:
total 150
180145747155549269 drwxrwxr-x 2 rhumble hpcadmin 76800 Feb 16 19:28 .
180145747155549248 drwxrwxr-x 3 rhumble hpcadmin 76800 Feb 16 20:07 ..

Comment by nasf (Inactive) [ 26/Mar/18 ]

[root@john6 corrupted]# ls -ailR 0291/58
0291/58:
total 150
180145747155549269 drwxrwxr-x 2 rhumble hpcadmin 76800 Feb 16 19:28 .
180145747155549248 drwxrwxr-x 3 rhumble hpcadmin 76800 Feb 16 20:07 ..

So the corrupted directory "58" is still visible. Please try your luck with "rm -rf 0291/58", although I am afraid it will not succeed. Please collect -1 level debug logs on the MDTs during the rm. Thanks!

Comment by Robin Humble [ 26/Mar/18 ]

wow! that worked. well done!

[root@john6 corrupted]# rm -rf 0291/58
[root@john6 corrupted]# ls -ailR 0291/58
ls: cannot access 0291/58: No such file or directory

and I could remove the parent now too

[root@john6 corrupted]# rmdir 0291
[root@john6 corrupted]# ls -l
total 225
drwxrwxr-x 3 rhumble hpcadmin 76800 Feb 27 03:46 15892
drwxrwxr-x 3 rhumble hpcadmin 76800 Feb 27 04:02 18199
drwxrwxr-x 3 rhumble hpcadmin 76800 Feb 16 03:33 2556

so now 3 problem dirs instead of 4.

I suspect I can probably just

rmdir {00..99}

in each of those and they'll be ok again too... sweet. I won't do it just yet though in case you want to experiment with other things.

cheers,
robin

Comment by Robin Humble [ 26/Mar/18 ]

ok, this is crazy - each of the 3 remaining dirs has a '58' subdir ->

[root@john6 corrupted]# for f in *; do cd $f; echo $f; for g in {00..99}; do ls -ld $g; done; cd .. ; done 2>/dev/null
15892
drwxrwxr-x 2 rhumble hpcadmin 76800 Feb 27 01:39 58
18199
drwxrwxr-x 2 rhumble hpcadmin 76800 Feb 27 01:46 58
2556
drwxrwxr-x 2 rhumble hpcadmin 76800 Feb 16 02:32 58

and each of the '58' shards is on the same MDT as the parent dir ->

[root@john6 corrupted]# lfs getdirstripe 15892
lmv_stripe_count: 3 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     0           [0x20000f650:0x29df:0x0]
     1           [0x28001b433:0x2a42:0x0]
     2           [0x68001bc55:0x2a42:0x0]
[root@john6 corrupted]# lfs getdirstripe 15892/58
lmv_stripe_count: 3 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     0           [0x20000f650:0x29f4:0x0]
     1           [0x28001b433:0x2a57:0x0]
     2           [0x68001bc55:0x2a57:0x0]
[root@john6 corrupted]# lfs getdirstripe 18199
lmv_stripe_count: 3 lmv_stripe_offset: 2 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     2           [0x68001bc57:0x21e1:0x0]
     0           [0x20000f651:0x224b:0x0]
     1           [0x28001b435:0x2272:0x0]
[root@john6 corrupted]# lfs getdirstripe 18199/58
lmv_stripe_count: 3 lmv_stripe_offset: 2 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     2           [0x68001bc57:0x21f6:0x0]
     0           [0x20000f651:0x2260:0x0]
     1           [0x28001b435:0x2287:0x0]
[root@john6 corrupted]# lfs getdirstripe 2556
lmv_stripe_count: 3 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     0           [0x20000c067:0x12e91:0x0]
     1           [0x2800183d4:0x12f1b:0x0]
     2           [0x680018bb6:0x12f24:0x0]
[root@john6 corrupted]# lfs getdirstripe 2556/58
lmv_stripe_count: 3 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     0           [0x20000c067:0x12ea6:0x0]
     1           [0x2800183d4:0x12f30:0x0]
     2           [0x680018bb6:0x12f39:0x0]

I thought 42 was the magic number, not 58

cheers,
robin

Comment by nasf (Inactive) [ 26/Mar/18 ]

That is unbelievable... whether "58" or "42", it is just a name, nothing special. Maybe it is related to your test parameters? What parameters did you use to reproduce the trouble? Have you adjusted them between runs? If you change them, you may get different results.

Comment by Robin Humble [ 26/Mar/18 ]

the reproducer is the 1-liner in the first post. I've run it with 1k, 10k and 20k for the first level dirs, but always (as far as I can remember)

{00..99}

for the second level dirs. 10k and 20k don't always create problems, but eventually do. 1k seems too small to generate problems.

could there be something in Lustre directory randomisation for which 58 is special?
that's the only thing that really makes sense...
'58' is the 59th dir created in the parent, so that's 111011. is 4 special somehow because we have 3 MDT's to hash across? more buckets in a hash table? hash seed? dunno... I'm just guessing.

I'll run a few more tests the same and see if they also come up with '58', and then try a few variations.

cheers,
robin

Comment by nasf (Inactive) [ 26/Mar/18 ]

Please adjust the 2nd level sub-dir counts and try again.

Comment by Robin Humble [ 29/Mar/18 ]

I ran some more. they take a while, so apologies for the delay.
after a couple of runs through via the usual command line I got an invisible directory:

7149/55

so, darn, '58' isn't so special, that would have been awesome

I also tried new variants with more subdirs but didn't hit any problems. then I tried a variant with fewer subdirs

time for g in {00000..99999}; do mkdir $g; for h in {0..9}; do mkdir $g/$h; done; done ; time for g in {000..999}; do rm -rf ${g}* ; done

and got 4 busted dirs on the first run. as per usual these 4 dirs appear empty to ls -l

ls /dagg/old_stuff/rjh/mdtest/mdt0-2/*/
/dagg/old_stuff/rjh/mdtest/mdt0-2/24538/:
/dagg/old_stuff/rjh/mdtest/mdt0-2/46886/:
/dagg/old_stuff/rjh/mdtest/mdt0-2/59234/:
/dagg/old_stuff/rjh/mdtest/mdt0-2/93410/:

but actually have these dirs in them which I can see if I explicitly ls -l them

24538/9
46886/5
59234/5
93410/5

so oddly 5* subdirs still seem preferred, but there's one 9* now, so it's possibly just a preference for mid-located dirs for some reason.

cheers,
robin

Comment by nasf (Inactive) [ 29/Mar/18 ]

rjh, thanks for the further verification. The results are as expected. The remaining shard of the striped directory is not related to any special name; instead, I suspect it may be related to some incomplete rmdir of a striped directory. Although we try to guarantee atomicity for cross-MDT modifications, it is difficult to say (or prove) that all the corner cases are handled properly, especially the error-handling paths. Now we need the logs: please help collect the MDS-side Lustre kernel debug logs (the more detail the better) when reproducing the trouble. Thanks!

Comment by Robin Humble [ 04/Apr/18 ]

Hi,

the MDS logs for rm'ing the '58' dir have been attached to this ticket for a while now.

are you asking for logs whilst I create 1M dirs? surely that would overflow all the log buffers on the MDS's and be useless?

cheers,
robin

Comment by nasf (Inactive) [ 08/Apr/18 ]

What I need are the Lustre kernel debug logs from when the rmdir of a striped directory (such as "58") fails. It is difficult to know in advance which striped directory will be the one.

Comment by SC Admin (Inactive) [ 09/Apr/18 ]

ok, I understand now. thanks.
I'll try to find a shorter reproducer and run 'lctl debug_daemon' with debug set to '-1' whilst running the 'rm -rf' parts.
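
(A rough sketch of that capture on each MDS; the file name and size are just placeholders, and lctl debug_file converts the daemon's binary output to text:)

lctl set_param debug=-1                                # full debug mask
lctl debug_daemon start /tmp/10677-debug.bin 4096      # stream debug to a file, up to 4096 MB
# ... run the 'rm -rf' parts from a client ...
lctl debug_daemon stop
lctl debug_file /tmp/10677-debug.bin /tmp/10677-debug.txt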

cheers,
robin

Comment by SC Admin (Inactive) [ 12/Apr/18 ]

Hi Fan Yong,

just to let you know I've put this on the backburner and won't be doing any more on this bug until we have resolved most of LU-10887 and have repaired the current fs errors. basically I'm reluctant to make things worse by adding more errors to the fs at the moment.

it'll be interesting to see if a successful lfsck pass can repair the current 93410/5 59234/5 46886/5 24538/9 'hidden' directories.

cheers,
robin

Comment by nasf (Inactive) [ 12/Apr/18 ]

Hi Robin,

As you can see, we already have three patches for LU-10887: one for the LFSCK trouble (https://review.whamcloud.com/31915), another for the object leak (https://review.whamcloud.com/31929), and a third (https://review.whamcloud.com/29228) for showing mount options. Please keep updating that ticket if you need more help.

Comment by SC Admin (Inactive) [ 03/May/18 ]

just so you know, these directories that can't be deleted 'cos they have 'hidden' dirs in them are still there and behave the same after the namespace lfsck in LU-10988. so lfsck missed them somehow.

there are at least 9 directories like this at the moment.

cheers,
robin

Comment by nasf (Inactive) [ 03/May/18 ]

just so you know, these directories that can't be deleted 'cos they have 'hidden' dirs in them are still there and behave the same after the namespace lfsck in LU-10988. so lfsck missed them somehow.
there are at least 9 directories like this at the moment.

I am afraid these 9 directories are the ones skipped by the namespace LFSCK. I would suggest enabling LFSCK debug (lctl set_param debug=+lfsck) on the MDTs, then re-running the namespace LFSCK. Please collect the debug logs on the MDTs; they will tell us which directories were skipped and why. Please use a large debug buffer and mask unnecessary debug components to avoid debug buffer overflow.
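
(A rough sketch of that capture, assuming the namespace LFSCK is started per MDT with lctl lfsck_start; the buffer size is an example and the device names follow the dagg-MDT000* naming used below:)

# on each MDS, for each local dagg MDT
lctl set_param debug_mb=1024             # enlarge the debug buffer
lctl set_param debug=+lfsck              # add lfsck to the debug mask
lctl clear
lctl lfsck_start -M dagg-MDT0000 -t namespace
# wait for completion, checking progress with:
lctl get_param -n mdd.dagg-MDT0000.lfsck_namespace
# then dump and attach the logs
lctl dk /tmp/10677-lfsck-`hostname`.log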

Comment by Peter Jones [ 30/Jun/18 ]

scadmin when do you expect to be able to run this test?

Comment by SC Admin (Inactive) [ 02/Jul/18 ]

Hi,

sorry. yes, I should have got back to this, but it's been a pretty low priority for us. and high risk as it turns out 'cos the MDS crashed whilst running the lfsck (see LU-11111)

we did get as far as this though ->

# lctl get_param -n mdd.dagg-MDT000*.lfsck_namespace | grep striped_shards_skipped
striped_shards_skipped: 3
striped_shards_skipped: 4
striped_shards_skipped: 4

which perhaps is enough to help.

would you like me to upload the debug_file.txt.gz to somewhere, or grep for something? it's about 300M gzip'd.

cheers,
robin

Comment by Peter Jones [ 14/Jul/18 ]

Robin

There is a Whamcloud ftp site that you could upload large files to. I can give you the details directly (i.e. via email) if you wish to do this.

Peter

Comment by Lai Siyao [ 06/Dec/18 ]

I'm working on a patch to improve lfsck for this; I will provide it later.
