[LU-13492] lfs migrate -m returns Operation not permitted Created: 29/Apr/20  Updated: 04/Oct/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Hongchao Zhang
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.6 Kernel 3.10.0-957.27.2.el7_lustre.pl2.x86_64


Attachments: client-ALL.log (text file), fir-md1-s2-MDT0001_dlmtrace+rpctrace.log.gz, fir-md1-s4-MDT0003_dlmtrace+rpctrace.log.gz
Issue Links:
Related
is related to LU-13298 lfs migrate -m "migrate failed: Opera... Resolved
is related to LU-13425 "run 'lfs migrate -m 1 -c 1 -H 3 dir1... Resolved
is related to LU-15001 improve recovery of interrupted direc... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hello!

When using lfs migrate -m to migrate directories across MDTs, we sometimes hit LU-13298 (lfs migrate does not work yet with DoM files), for which we have a workaround (i.e., we restripe the files first without DoM). This time, however, I think we are facing a different problem.

We're trying to migrate files from MDT0003 to MDT0001. While running a migration of a full user directory as follows:

lfs migrate -m 1 /fir/users/apatel6

we hit "operation not permitted" errors on multiple directories, and retrying the migration leads to the same error:

[root@fir-rbh01 storage]# lfs migrate -m 1 /fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N
/fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N migrate failed: Operation not permitted (-1)

[root@fir-rbh01 storage]# lfs getdirstripe /fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
mdtidx           FID[seq:oid:ver]
     3           [0x2800394ad:0x3c7c:0x0]
     3           [0x280038894:0x124ee:0x0]

I also noticed when writing this ticket that something seems wrong here as there are two mdtidx = "3". Usually, when a directory is migrating from 3 to 1, we can see mdtidx 1 and 3.
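
To look for the same pattern in other directories, the `lfs getdirstripe` output can be parsed and checked for a directory that is flagged as migrating while all of its stripes sit on one MDT. The sketch below is illustrative only: the function names are hypothetical, and the parsing assumes the output layout shown in this ticket.

```python
import re

# Header and stripe lines, as they appear in the output pasted above.
HEADER_RE = re.compile(
    r"lmv_stripe_count:\s*(\d+)\s+lmv_stripe_offset:\s*(-?\d+)\s+lmv_hash_type:\s*(\S+)"
)
STRIPE_RE = re.compile(
    r"^\s*(\d+)\s+\[(0x[0-9a-fA-F]+:0x[0-9a-fA-F]+:0x[0-9a-fA-F]+)\]"
)


def parse_getdirstripe(text):
    """Parse one directory's `lfs getdirstripe` output into a dict."""
    info = {"stripe_count": None, "stripe_offset": None,
            "hash_flags": [], "stripes": []}
    for line in text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            info["stripe_count"] = int(header.group(1))
            info["stripe_offset"] = int(header.group(2))
            # e.g. "fnv_1a_64,migrating" -> ["fnv_1a_64", "migrating"]
            info["hash_flags"] = header.group(3).split(",")
            continue
        stripe = STRIPE_RE.match(line)
        if stripe:
            info["stripes"].append((int(stripe.group(1)), stripe.group(2)))
    return info


def same_mdt_while_migrating(info):
    """True if flagged 'migrating' but every stripe is on the same MDT."""
    mdts = {idx for idx, _fid in info["stripes"]}
    return ("migrating" in info["hash_flags"]
            and len(info["stripes"]) > 1
            and len(mdts) == 1)


SAMPLE = """\
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
mdtidx           FID[seq:oid:ver]
     3           [0x2800394ad:0x3c7c:0x0]
     3           [0x280038894:0x124ee:0x0]
"""

info = parse_getdirstripe(SAMPLE)
print(info["stripes"], same_mdt_while_migrating(info))
```

In practice the text would come from running `lfs getdirstripe` on each directory of interest; here it is hard-coded so the sketch runs without a Lustre mount.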

Quick check of the FIDs above:

[root@fir-rbh01 storage]# lfs fid2path /fir 0x2800394ad:0x3c7c:0x0
/fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N
[root@fir-rbh01 storage]# lfs fid2path /fir 0x280038894:0x124ee:0x0
/fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N

MDT0001 (not MDT0003!) shows this log message when attempting the failed command:

Apr 29 08:35:06 fir-md1-s2 kernel: LustreError: 22437:0:(mdd_dir.c:4496:mdd_migrate()) fir-MDD0001: '02-N' migration was interrupted, run 'lfs migrate -m 3 -c 1 -H 2 02-N' to finish migration.
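
When many directories are affected, the suggested command can be pulled out of such MDS log lines automatically. The following is a minimal sketch that assumes the message wording shown above; the regular expression and the function name are hypothetical, not part of any Lustre tool.

```python
import re

# Matches the "migration was interrupted" hint as printed in the log line above.
HINT_RE = re.compile(
    r"'(?P<name>[^']+)' migration was interrupted, "
    r"run 'lfs migrate -m (?P<mdt>-?\d+) -c (?P<count>-?\d+) "
    r"-H (?P<hash>-?\d+) [^']+'"
)


def parse_migrate_hint(log_line):
    """Return the parameters suggested by the MDS, or None if no hint is found."""
    m = HINT_RE.search(log_line)
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "mdt_index": int(m.group("mdt")),
        "stripe_count": int(m.group("count")),
        "hash_type": int(m.group("hash")),
    }


LINE = ("Apr 29 08:35:06 fir-md1-s2 kernel: LustreError: "
        "22437:0:(mdd_dir.c:4496:mdd_migrate()) fir-MDD0001: '02-N' migration "
        "was interrupted, run 'lfs migrate -m 3 -c 1 -H 2 02-N' to finish "
        "migration.")

hint = parse_migrate_hint(LINE)
print(hint)
```

Note that the hint only carries the last path component ('02-N'), so the full path still has to be recovered separately.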

I don't see anything else, but there might be debug flags that could be interesting?
In any case, let me know how we could help troubleshoot this issue. We're using Lustre 2.12.4 here, including on the client that performs the lfs migrate. Thanks!



 Comments   
Comment by Peter Jones [ 29/Apr/20 ]

Hongchao

Could you please advise?

Thanks

Peter

Comment by Andreas Dilger [ 29/Apr/20 ]

Stephane, are you able to collect debug logs from the client and MDS during the failed migration? Ideally, full debug in the client and MDS, but if the MDS is busy this would overflow the debug log, so if needed we could start with "debug=+dlmtrace+rpctrace".
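
The capture suggested above could be scripted roughly as follows. This is a sketch, not a tested procedure: it only builds the `lctl` command lines instead of running them, the dump path is a placeholder, and the optional `debug_mb` buffer size is an assumption about what a busy MDS might need.

```python
def debug_capture_commands(mask="+dlmtrace+rpctrace",
                           dump_path="/tmp/lustre-debug.log",
                           buffer_mb=None):
    """Return (before, after) lctl command lists around a reproduction.

    `before` is meant to run prior to the failing `lfs migrate`,
    `after` right after it. Nothing is executed here.
    """
    before = []
    if buffer_mb is not None:
        # Optionally enlarge the kernel debug buffer (size in MB).
        before.append(["lctl", "set_param", f"debug_mb={buffer_mb}"])
    # Add the requested flags to the current debug mask.
    before.append(["lctl", "set_param", f"debug={mask}"])
    # Clear the debug buffer so the dump only covers the reproduction.
    before.append(["lctl", "clear"])
    # Dump the kernel debug log to a file once the command has failed.
    after = [["lctl", "dk", dump_path]]
    return before, after


before, after = debug_capture_commands()
for cmd in before + after:
    print(" ".join(cmd))
```

The same sequence would apply on the client and on each MDS involved, with the failing `lfs migrate` run between the two lists.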

Comment by Stephane Thiell [ 30/Apr/20 ]

Thanks! Attached full debug (+ALL) from the client as client-ALL.log (client NID is 10.0.10.3@o2ib7) while running the following command (same as in the description above):

lfs migrate -m 1 /fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N

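and 2 seconds of debug logs from the two MDSes in question (attached as fir-md1-s2-MDT0001_dlmtrace+rpctrace.log.gz and fir-md1-s4-MDT0003_dlmtrace+rpctrace.log.gz).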

In the logs of MDT0001, I can see:

00000004:00020000:16.0:1588205266.457978:0:22469:0:(mdd_dir.c:4496:mdd_migrate()) fir-MDD0001: '02-N' migration was interrupted, run 'lfs migrate -m 3 -c 1 -H 2 02-N' to finish migration.

so I think I got this part at least.

Let me know if I should try full debug of the MDS. Perhaps I could increase the debug buffer size.

Comment by Stephane Thiell [ 30/Apr/20 ]

We also noticed something else, on another directory tree, that may be related to this ticket.

We were not able to migrate some "leaf" directories, and we noticed that all of them are actually empty.

But even an explicit lfs migrate on them doesn't work (tested from both 2.12.4 and 2.13 clients):

[root@fir-rbh01 storage]# lfs getdirstripe /fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03
lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none
[root@fir-rbh01 storage]# lfs migrate -m 1 /fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03
[root@fir-rbh01 storage]# echo $?
0
[root@fir-rbh01 storage]# lfs getdirstripe /fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03
lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none

This directory is empty:

[root@fir-rbh01 storage]# stat /fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03
  File: ‘/fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03’
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: e64e03a8h/3863872424d   Inode: 180148089774940567  Links: 2
Access: (2755/drwxr-sr-x)  Uid: (55081/  jbboin)   Gid: (24300/  bgirod)
Access: 2020-04-30 13:28:31.000000000 -0700
Modify: 2019-11-29 22:10:47.000000000 -0800
Change: 2019-11-29 22:10:47.000000000 -0800
 Birth: -
[root@fir-rbh01 storage]# ls -lisa /fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03
total 1840
180148089774940567    4 drwxr-sr-x     2 jbboin bgirod    4096 Nov 29 22:10 .
180148089774940559 1836 drwxr-sr-x 13322 jbboin bgirod 1871872 Apr 30 11:29 ..
[root@fir-rbh01 storage]# 
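
Because `lfs migrate` exits with status 0 here even though nothing moved, the exit status alone cannot be trusted. A hedged sketch of a post-migration check follows: it assumes that `lmv_stripe_offset` reports the MDT index holding the directory, as in the outputs above, and the helper names are hypothetical.

```python
import re

OFFSET_RE = re.compile(r"lmv_stripe_offset:\s*(-?\d+)")


def stripe_offset(getdirstripe_output):
    """Extract lmv_stripe_offset from `lfs getdirstripe` output."""
    m = OFFSET_RE.search(getdirstripe_output)
    if m is None:
        raise ValueError("no lmv_stripe_offset field found")
    return int(m.group(1))


def migration_took_effect(after_output, target_mdt):
    """True if the directory now reports the requested MDT index."""
    return stripe_offset(after_output) == target_mdt


# Output captured after `lfs migrate -m 1` returned 0 (copied from above).
AFTER = "lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none"

took_effect = migration_took_effect(AFTER, target_mdt=1)
print(took_effect)  # False: the directory is still on MDT index 3
```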

Originally, we ran lfs migrate -m 1 /fir/groups/bgirod, which is mostly done by now, apart from a few empty directories in /fir/groups/bgirod/action_recognition/frames/.

Now, if I try again, I get:

[root@fir-rbh01 storage]# lfs migrate -m 1 /fir/groups/bgirod
/fir/groups/bgirod/ migrate failed: Operation not permitted (-1)

And the same error appears on MDT0001:

fir-md1-s2: Apr 30 13:46:47 fir-md1-s2 kernel: LustreError: 22427:0:(mdd_dir.c:4496:mdd_migrate()) fir-MDD0001: 'bgirod' migration was interrupted, run 'lfs migrate -m 3 -c 1 -H 2 bgirod' to finish migration.

Current getdirstripe info for each path component:

[root@fir-rbh01 storage]# lfs getdirstripe /fir/groups/bgirod
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
mdtidx           FID[seq:oid:ver]
     3           [0x28003bb05:0x135:0x0]
     1           [0x2400576a9:0x1abb:0x0]
[root@fir-rbh01 storage]# lfs getdirstripe /fir/groups/bgirod/action_recognition
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
mdtidx           FID[seq:oid:ver]
     3           [0x28003bb05:0x136:0x0]
     1           [0x2400576a9:0x1af9:0x0]
[root@fir-rbh01 storage]# lfs getdirstripe /fir/groups/bgirod/action_recognition/frames
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
mdtidx           FID[seq:oid:ver]
     3           [0x28003bb05:0x138:0x0]
     1           [0x2400576a9:0x1ce2:0x0]

Comment by Hongchao Zhang [ 07/May/20 ]

Judging by the stripe information of "/fir/groups/bgirod", "/fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N", etc., the pending migration should be one to MDT0003 (if the original directory was already on MDT0003, there will be two mdtidx 3 entries in the stripes):

[root@zhanghc tests]# ../utils/lfs getdirstripe /mnt/lustre/pdir/cdir/
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
mdtidx		 FID[seq:oid:ver]
     3		 [0x2c0000400:0xb:0x0]		
     3		 [0x2c0000400:0x9:0x0]

Was there ever a migration to MDT0003 prior to this migration?

The "-EPERM" is triggered in "mdd_migrate" because of the pending migration:

static int mdd_migrate(const struct lu_env *env, struct md_object *md_pobj,
                       struct md_object *md_sobj, const struct lu_name *lname,
                       struct md_object *md_tobj, struct md_op_spec *spec,
                       struct md_attr *ma)
{
        if (S_ISDIR(attr->la_mode)) {
                                ...
                                if (lmv->lmv_migrate_offset !=
                                    lum_stripe_count ||
                                    lmv->lmv_master_mdt_index !=
                                    lmu->lum_stripe_offset ||
                                    (lmv_hash_type != 0 &&
                                     lmv_hash_type != lmu->lum_hash_type)) {
                                        CERROR("%s: \'"DNAME"\' migration was "
                                                "interrupted, run \'lfs migrate "
                                                "-m %d -c %d -H %d "DNAME"\' to "
                                                "finish migration.\n",
                                                mdd2obd_dev(mdd)->obd_name,
                                                PNAME(lname),
                                                le32_to_cpu(
                                                    lmv->lmv_master_mdt_index),
                                                le32_to_cpu(
                                                    lmv->lmv_migrate_offset),
                                                le32_to_cpu(lmv_hash_type),
                                                PNAME(lname));
                                        GOTO(out, rc = -EPERM);
                                }
                                ...
        }
        ...
}

The migration request is sent to the migration target MDT, which is why the above log message was printed on MDT0001.
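
To make the condition easier to reason about, here is a simplified Python model of the comparison in the excerpt above. It is a sketch under stated assumptions, not the kernel code: endianness conversion and hash-flag masking are ignored, the class names are hypothetical, and the stripe count and hash type used for a plain `lfs migrate -m 1` request are guesses.

```python
import errno
from dataclasses import dataclass


@dataclass
class PendingMigration:
    """State recorded in the directory's LMV (field names from the excerpt)."""
    lmv_migrate_offset: int
    lmv_master_mdt_index: int
    lmv_hash_type: int


@dataclass
class MigrateRequest:
    """Parameters of the new migrate request (field names from the excerpt)."""
    lum_stripe_count: int
    lum_stripe_offset: int
    lum_hash_type: int


def check_resume(lmv, lmu):
    """Return 0 if the request matches the pending migration, else -EPERM."""
    if (lmv.lmv_migrate_offset != lmu.lum_stripe_count
            or lmv.lmv_master_mdt_index != lmu.lum_stripe_offset
            or (lmv.lmv_hash_type != 0
                and lmv.lmv_hash_type != lmu.lum_hash_type)):
        return -errno.EPERM
    return 0


# Pending state implied by the hint "lfs migrate -m 3 -c 1 -H 2".
pending = PendingMigration(lmv_migrate_offset=1,
                           lmv_master_mdt_index=3,
                           lmv_hash_type=2)

# `lfs migrate -m 1`: target MDT 1 (count and hash type are assumed values).
to_mdt1 = MigrateRequest(lum_stripe_count=1, lum_stripe_offset=1, lum_hash_type=2)
# Request matching the hint printed by the MDS.
as_hinted = MigrateRequest(lum_stripe_count=1, lum_stripe_offset=3, lum_hash_type=2)

print(check_resume(pending, to_mdt1), check_resume(pending, as_hinted))
```

In this model a request for MDT 1 is rejected purely because its target index differs from the recorded one, regardless of the other two fields.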

For the empty directory issue with "/fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03", could you please collect similar debug logs during the migration?
Thanks!

Comment by Stephane Thiell [ 16/Jul/20 ]

Hi Hongchao,

Since my last message, we have upgraded to 2.12.5 and I cannot reproduce the problem with the empty directory. It has now been successfully migrated to MDT1.

However, we still have issues with EPERM errors even in 2.12.5.

For example, I tried again today, and it still doesn't work for this directory:

[root@fir-rbh02 ~]# lfs getdirstripe /fir/groups/astraigh/kousik
lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64,migrating,lost_lmv
mdtidx		 FID[seq:oid:ver]
     0		 [0x200042f8e:0x29:0x0]		
     3		 [0x2800393f0:0x417d:0x0]		
[root@fir-rbh02 ~]# lfs migrate -m 3 /fir/groups/astraigh/kousik
/fir/groups/astraigh/kousik migrate failed: Operation not permitted (-1)

It looks like you spotted the problem (a previous migration was running). Is there a way to fix the problem so that we can migrate this directory to MDT3 for example?

Thanks!
Stephane

Generated at Sat Feb 10 03:01:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.