[LU-13492] lfs migrate -m returns Operation not permitted Created: 29/Apr/20 Updated: 04/Oct/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | Hongchao Zhang |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.6 Kernel 3.10.0-957.27.2.el7_lustre.pl2.x86_64 |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
Hello! When using lfs migrate -m to migrate directories across MDTs, we sometimes hit "Operation not permitted" errors. We're trying to migrate files from MDT0003 to MDT0001. While running a migration of a full user directory as follows:
lfs migrate -m 1 /fir/users/apatel6
we hit "Operation not permitted" errors on multiple directories, and retrying the migration leads to the same error:
[root@fir-rbh01 storage]# lfs migrate -m 1 /fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N
/fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N migrate failed: Operation not permitted (-1)
[root@fir-rbh01 storage]# lfs getdirstripe /fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
mdtidx FID[seq:oid:ver]
3 [0x2800394ad:0x3c7c:0x0]
3 [0x280038894:0x124ee:0x0]
I also noticed when writing this ticket that something seems wrong here, as there are two entries with mdtidx = 3. Usually, when a directory is migrating from MDT 3 to MDT 1, we see both mdtidx 1 and 3. Quick check of the FIDs above:
[root@fir-rbh01 storage]# lfs fid2path /fir 0x2800394ad:0x3c7c:0x0
/fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N
[root@fir-rbh01 storage]# lfs fid2path /fir 0x280038894:0x124ee:0x0
/fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N
MDT0001 (not MDT0003!) shows this log message when attempting the failed command:
Apr 29 08:35:06 fir-md1-s2 kernel: LustreError: 22437:0:(mdd_dir.c:4496:mdd_migrate()) fir-MDD0001: '02-N' migration was interrupted, run 'lfs migrate -m 3 -c 1 -H 2 02-N' to finish migration.
I don't see anything else, but there might be debug flags that could be interesting? |
| Comments |
| Comment by Peter Jones [ 29/Apr/20 ] |
|
Hongchao, could you please advise? Thanks, Peter |
| Comment by Andreas Dilger [ 29/Apr/20 ] |
|
Stephane, are you able to collect debug logs from the client and MDS during the failed migration? Ideally, full debug in the client and MDS, but if the MDS is busy this would overflow the debug log, so if needed we could start with "debug=+dlmtrace+rpctrace". |
| Comment by Stephane Thiell [ 30/Apr/20 ] |
|
Thanks! Attached full debug (+ALL) from the client as client-ALL.log, captured while running:
lfs migrate -m 1 /fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N
and 2 seconds of debug logs from the two MDS in question:
In the logs of MDT0001, I can see:
00000004:00020000:16.0:1588205266.457978:0:22469:0:(mdd_dir.c:4496:mdd_migrate()) fir-MDD0001: '02-N' migration was interrupted, run 'lfs migrate -m 3 -c 1 -H 2 02-N' to finish migration.
so I think I got this part at least. Let me know if I should try full debug of the MDS. Perhaps I could increase the debug buffer size. |
| Comment by Stephane Thiell [ 30/Apr/20 ] |
|
We also noticed another thing on another directory tree that may be related to this ticket. We were not able to migrate some "leaf" directories, and we noticed that all of them are actually empty. But even an explicit lfs migrate on them doesn't work (tested from both 2.12.4 and 2.13 clients):
[root@fir-rbh01 storage]# lfs getdirstripe /fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03
lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none
[root@fir-rbh01 storage]# lfs migrate -m 1 /fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03
[root@fir-rbh01 storage]# echo $?
0
[root@fir-rbh01 storage]# lfs getdirstripe /fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03
lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none
This directory is empty:
[root@fir-rbh01 storage]# stat /fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03
File: ‘/fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03’
Size: 4096 Blocks: 8 IO Block: 4096 directory
Device: e64e03a8h/3863872424d Inode: 180148089774940567 Links: 2
Access: (2755/drwxr-sr-x) Uid: (55081/ jbboin) Gid: (24300/ bgirod)
Access: 2020-04-30 13:28:31.000000000 -0700
Modify: 2019-11-29 22:10:47.000000000 -0800
Change: 2019-11-29 22:10:47.000000000 -0800
Birth: -
[root@fir-rbh01 storage]# ls -lisa /fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03
total 1840
180148089774940567 4 drwxr-sr-x 2 jbboin bgirod 4096 Nov 29 22:10 .
180148089774940559 1836 drwxr-sr-x 13322 jbboin bgirod 1871872 Apr 30 11:29 ..
[root@fir-rbh01 storage]#
Originally, we ran lfs migrate -m 1 /fir/groups/bgirod, which is mostly done by now, apart from a few empty directories in /fir/groups/bgirod/action_recognition/frames/.
Now, if I try again, I get:
[root@fir-rbh01 storage]# lfs migrate -m 1 /fir/groups/bgirod
/fir/groups/bgirod/ migrate failed: Operation not permitted (-1)
And the same error on MDT0001:
fir-md1-s2: Apr 30 13:46:47 fir-md1-s2 kernel: LustreError: 22427:0:(mdd_dir.c:4496:mdd_migrate()) fir-MDD0001: 'bgirod' migration was interrupted, run 'lfs migrate -m 3 -c 1 -H 2 bgirod' to finish migration.
Current getdirstripe info of each component:
[root@fir-rbh01 storage]# lfs getdirstripe /fir/groups/bgirod
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
mdtidx FID[seq:oid:ver]
3 [0x28003bb05:0x135:0x0]
1 [0x2400576a9:0x1abb:0x0]
[root@fir-rbh01 storage]# lfs getdirstripe /fir/groups/bgirod/action_recognition
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
mdtidx FID[seq:oid:ver]
3 [0x28003bb05:0x136:0x0]
1 [0x2400576a9:0x1af9:0x0]
[root@fir-rbh01 storage]# lfs getdirstripe /fir/groups/bgirod/action_recognition/frames
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
mdtidx FID[seq:oid:ver]
3 [0x28003bb05:0x138:0x0]
1 [0x2400576a9:0x1ce2:0x0]
|
| Comment by Hongchao Zhang [ 07/May/20 ] |
|
As per the stripe information of "/fir/groups/bgirod", "/fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N", etc.:
[root@zhanghc tests]# ../utils/lfs getdirstripe /mnt/lustre/pdir/cdir/
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
mdtidx FID[seq:oid:ver]
3 [0x2c0000400:0xb:0x0]
3 [0x2c0000400:0x9:0x0]
Was there ever a migration to MDT0003 prior to this migration? The "-EPERM" is triggered in "mdd_migrate" because of the pending migration:
static int mdd_migrate(const struct lu_env *env, struct md_object *md_pobj,
struct md_object *md_sobj, const struct lu_name *lname,
struct md_object *md_tobj, struct md_op_spec *spec,
struct md_attr *ma)
{
if (S_ISDIR(attr->la_mode)) {
...
if (lmv->lmv_migrate_offset !=
lum_stripe_count ||
lmv->lmv_master_mdt_index !=
lmu->lum_stripe_offset ||
(lmv_hash_type != 0 &&
lmv_hash_type != lmu->lum_hash_type)) {
CERROR("%s: \'"DNAME"\' migration was "
"interrupted, run \'lfs migrate "
"-m %d -c %d -H %d "DNAME"\' to "
"finish migration.\n",
mdd2obd_dev(mdd)->obd_name,
PNAME(lname),
le32_to_cpu(
lmv->lmv_master_mdt_index),
le32_to_cpu(
lmv->lmv_migrate_offset),
le32_to_cpu(lmv_hash_type),
PNAME(lname));
GOTO(out, rc = -EPERM);
}
...
}
...
}
The migration request is sent to the migration target MDT, which is why the above log was printed on MDT0001.
For the empty directory issue of "/fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03" |
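To make the check in the mdd_migrate excerpt easier to follow, here is a minimal userspace sketch in C. It is an illustration only: the names lmv_state, migrate_request and check_resume are hypothetical stand-ins, not Lustre definitions, and byte-order conversion, locking and everything elided by "..." in the excerpt are left out. It models just the three-way comparison that ends in -EPERM.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the migration state stored with the directory. */
struct lmv_state {
    uint32_t migrate_offset;   /* modelled on lmv_migrate_offset */
    uint32_t master_mdt_index; /* modelled on lmv_master_mdt_index */
    uint32_t hash_type;        /* modelled on lmv_hash_type */
};

/* Hypothetical stand-in for what a new "lfs migrate" request asks for. */
struct migrate_request {
    uint32_t stripe_count;  /* value given with -c */
    uint32_t stripe_offset; /* value given with -m */
    uint32_t hash_type;     /* value given with -H */
};

/*
 * Mirrors the condition in the quoted excerpt: a directory with a pending
 * migration only accepts a request whose parameters match the stored ones.
 * Returns 0 when the request may resume the migration, -EPERM otherwise.
 */
static int check_resume(const struct lmv_state *lmv,
                        const struct migrate_request *req)
{
    if (lmv->migrate_offset != req->stripe_count ||
        lmv->master_mdt_index != req->stripe_offset ||
        (lmv->hash_type != 0 && lmv->hash_type != req->hash_type))
        return -EPERM;

    return 0;
}
```

Under this model, a directory whose stored state is master MDT 3, migrate offset 1 and hash type 2 (the values printed in the MDT0001 log message) rejects a request that targets MDT 1 and accepts one that repeats -m 3 -c 1 -H 2. How a plain "lfs migrate -m 1" fills in the stripe count and hash type is an assumption here, not something the excerpt shows.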
| Comment by Stephane Thiell [ 16/Jul/20 ] |
|
Hi Hongchao,
Since my last message, we have upgraded to 2.12.5 and I cannot reproduce the problem with the empty directory. It has now been successfully migrated to MDT1. However, we still have issues with EPERM errors even in 2.12.5. For example, I tried again today, and it still doesn't work for this directory:
[root@fir-rbh02 ~]# lfs getdirstripe /fir/groups/astraigh/kousik
lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64,migrating,lost_lmv
mdtidx FID[seq:oid:ver]
0 [0x200042f8e:0x29:0x0]
3 [0x2800393f0:0x417d:0x0]
[root@fir-rbh02 ~]# lfs migrate -m 3 /fir/groups/astraigh/kousik
/fir/groups/astraigh/kousik migrate failed: Operation not permitted (-1)
It looks like you spotted the problem (a previous migration was running). Is there a way to fix the problem so that we can migrate this directory to MDT3 for example? Thanks! |