[LU-13492] lfs migrate -m returns Operation not permitted

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.4
    • Labels: None
    • Environment: CentOS 7.6 Kernel 3.10.0-957.27.2.el7_lustre.pl2.x86_64
    • Severity: 3

    Description

      Hello!

      When using lfs migrate -m to migrate directories across MDTs, we sometimes hit LU-13298 (lfs migrate does not work yet with DoM files), for which we have a workaround (i.e., we first restripe the files without DoM). This time, however, I think we are facing a different problem.

      We're trying to migrate files from MDT0003 to MDT0001. While running a migration of a full user directory as follows:

      lfs migrate -m 1 /fir/users/apatel6
      

      we hit "Operation not permitted" errors on multiple directories, and retrying the migration leads to the same error:

      [root@fir-rbh01 storage]# lfs migrate -m 1 /fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N
      /fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N migrate failed: Operation not permitted (-1)
      
      [root@fir-rbh01 storage]# lfs getdirstripe /fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N
      lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
      mdtidx           FID[seq:oid:ver]
           3           [0x2800394ad:0x3c7c:0x0]
           3           [0x280038894:0x124ee:0x0]
      

      While writing this ticket, I also noticed that something seems wrong here: both stripes show mdtidx 3. Usually, when a directory is migrating from MDT 3 to MDT 1, we see mdtidx 1 and 3.
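
A small helper can flag this state directly from the command output. The sketch below parses text shaped like the lfs getdirstripe output shown above; the function name and the returned keys are made up for illustration, and the parsing assumes the output layout seen in this ticket rather than any documented format.

```python
import re

HEADER_RE = re.compile(
    r"lmv_stripe_count:\s*(\d+)\s+lmv_stripe_offset:\s*(-?\d+)\s+lmv_hash_type:\s*(\S+)"
)
# Stripe rows look like: "     3           [0x2800394ad:0x3c7c:0x0]"
STRIPE_RE = re.compile(r"^[ \t]*(\d+)[ \t]+\[([0-9a-fx:]+)\]", re.MULTILINE)


def parse_getdirstripe(text):
    """Parse `lfs getdirstripe` output for one directory (hypothetical helper)."""
    header = HEADER_RE.search(text)
    if header is None:
        raise ValueError("no lmv_stripe_count header found")
    # "fnv_1a_64,migrating" -> hash type "fnv_1a_64", flags ["migrating"]
    hash_type, *flags = header.group(3).split(",")
    return {
        "stripe_count": int(header.group(1)),
        "stripe_offset": int(header.group(2)),
        "hash_type": hash_type,
        "flags": flags,
        "mdt_indexes": [int(idx) for idx, _fid in STRIPE_RE.findall(text)],
    }


sample = """\
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
mdtidx           FID[seq:oid:ver]
     3           [0x2800394ad:0x3c7c:0x0]
     3           [0x280038894:0x124ee:0x0]
"""
info = parse_getdirstripe(sample)
if "migrating" in info["flags"]:
    print("migration pending, stripes on MDTs:", info["mdt_indexes"])
```

Feeding it the output of each directory that fails to migrate gives a quick list of which ones still carry the migrating flag and on which MDTs their stripes sit.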

      Quick check of the FIDs above:

      [root@fir-rbh01 storage]# lfs fid2path /fir 0x2800394ad:0x3c7c:0x0
      /fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N
      [root@fir-rbh01 storage]# lfs fid2path /fir 0x280038894:0x124ee:0x0
      /fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N
      

      MDT0001 (not MDT0003!) logs this message when we attempt the failing command:

      Apr 29 08:35:06 fir-md1-s2 kernel: LustreError: 22437:0:(mdd_dir.c:4496:mdd_migrate()) fir-MDD0001: '02-N' migration was interrupted, run 'lfs migrate -m 3 -c 1 -H 2 02-N' to finish migration.
      

      I don't see anything else, but there might be debug flags that could be interesting?
      In any case, let me know how we could help troubleshoot this issue. We're using Lustre 2.12.4 here even on the client that performs the lfs migrate. Thanks!

      Attachments

        1. client-ALL.log
          5.09 MB
          Stephane Thiell
        2. fir-md1-s2-MDT0001_dlmtrace+rpctrace.log.gz
          2.34 MB
          Stephane Thiell
        3. fir-md1-s4-MDT0003_dlmtrace+rpctrace.log.gz
          4.55 MB
          Stephane Thiell

          Activity

            Hi Hongchao,

            Since my last message, we have upgraded to 2.12.5 and I cannot reproduce the problem with the empty directory. It has now been successfully migrated to MDT1.

            However, we still have issues with EPERM errors even in 2.12.5.

            For example, I tried again today, and it still doesn't work for this directory:

            [root@fir-rbh02 ~]# lfs getdirstripe /fir/groups/astraigh/kousik
            lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64,migrating,lost_lmv
            mdtidx		 FID[seq:oid:ver]
                 0		 [0x200042f8e:0x29:0x0]		
                 3		 [0x2800393f0:0x417d:0x0]		
            [root@fir-rbh02 ~]# lfs migrate -m 3 /fir/groups/astraigh/kousik
            /fir/groups/astraigh/kousik migrate failed: Operation not permitted (-1)
            

            It looks like you spotted the problem (a previous migration was running). Is there a way to fix the problem so that we can migrate this directory to MDT3 for example?

            Thanks!
            Stephane

            sthiell Stephane Thiell added the above comment.

            Judging from the stripe information of "/fir/groups/bgirod", "/fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N", etc.,
            the pending migration should be a migration to MDT0003 (if the original directory was on MDT0003, there will be two stripes with mdtidx 3):

            [root@zhanghc tests]# ../utils/lfs getdirstripe /mnt/lustre/pdir/cdir/
            lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
            mdtidx		 FID[seq:oid:ver]
                 3		 [0x2c0000400:0xb:0x0]		
                 3		 [0x2c0000400:0x9:0x0]
            

            Was there ever a migration to MDT0003 prior to this one?

            The "-EPERM" is returned by "mdd_migrate" because of the pending migration:

            static int mdd_migrate(const struct lu_env *env, struct md_object *md_pobj,
                                   struct md_object *md_sobj, const struct lu_name *lname,
                                   struct md_object *md_tobj, struct md_op_spec *spec,
                                   struct md_attr *ma)
            {
                    if (S_ISDIR(attr->la_mode)) {
                                            ...
                                            if (lmv->lmv_migrate_offset !=
                                                lum_stripe_count ||
                                                lmv->lmv_master_mdt_index !=
                                                lmu->lum_stripe_offset ||
                                                (lmv_hash_type != 0 &&
                                                 lmv_hash_type != lmu->lum_hash_type)) {
                                                    CERROR("%s: \'"DNAME"\' migration was "
                                                            "interrupted, run \'lfs migrate "
                                                            "-m %d -c %d -H %d "DNAME"\' to "
                                                            "finish migration.\n",
                                                            mdd2obd_dev(mdd)->obd_name,
                                                            PNAME(lname),
                                                            le32_to_cpu(
                                                                lmv->lmv_master_mdt_index),
                                                            le32_to_cpu(
                                                                lmv->lmv_migrate_offset),
                                                            le32_to_cpu(lmv_hash_type),
                                                            PNAME(lname));
                                                    GOTO(out, rc = -EPERM);
                                            }
                                            ...
                    }
                    ...
            }
            

            The migration request is sent to the target MDT of the migration, which is why the message above was logged on MDT0001.
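
The check quoted above can be modelled outside the kernel to see which requests it rejects. This is only a sketch in Python: the class and function names are invented, the mapping of -m/-c/-H to the lum_* fields is inferred from the CERROR format string, and byte-order conversion and any masking of hash flags are ignored.

```python
import errno
from dataclasses import dataclass


@dataclass
class PendingMigration:
    """State recorded in the directory's LMV (field names follow the C snippet)."""
    master_mdt_index: int  # lmv->lmv_master_mdt_index
    migrate_offset: int    # lmv->lmv_migrate_offset
    hash_type: int         # lmv_hash_type, 0 is treated as "not set"


@dataclass
class MigrateRequest:
    """What the new `lfs migrate` call asks for (assumed mapping of the options)."""
    stripe_offset: int  # lmu->lum_stripe_offset, assumed to come from -m
    stripe_count: int   # lum_stripe_count, assumed to come from -c
    hash_type: int      # lmu->lum_hash_type, assumed to come from -H


def check_resume(pending, req):
    """Return 0 if the request matches the pending migration, else -EPERM."""
    mismatch = (
        pending.migrate_offset != req.stripe_count
        or pending.master_mdt_index != req.stripe_offset
        or (pending.hash_type != 0 and pending.hash_type != req.hash_type)
    )
    return -errno.EPERM if mismatch else 0


# Values taken from the logged hint "lfs migrate -m 3 -c 1 -H 2".
pending = PendingMigration(master_mdt_index=3, migrate_offset=1, hash_type=2)

# A request targeting MDT 1 (count and hash are illustrative values) is rejected.
print(check_resume(pending, MigrateRequest(stripe_offset=1, stripe_count=1, hash_type=2)))

# A request repeating the logged parameters passes this particular check.
print(check_resume(pending, MigrateRequest(stripe_offset=3, stripe_count=1, hash_type=2)))
```

The model only covers the condition shown in the snippet; whether re-running the suggested command actually completes the migration depends on the code elided by the "..." lines.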

            For the empty directory issue with "/fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03",
            could you please collect similar debug logs during the migration?
            Thanks!

            hongchao.zhang Hongchao Zhang added the above comment.

            We also noticed something else on another directory tree that may be related to this ticket.

            We were not able to migrate some "leaf" directories, and we noticed that all of them are actually empty.

            But even an explicit lfs migrate on them doesn't work (tested from both 2.12.4 and 2.13 clients):

            [root@fir-rbh01 storage]# lfs getdirstripe /fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03
            lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none
            [root@fir-rbh01 storage]# lfs migrate -m 1 /fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03
            [root@fir-rbh01 storage]# echo $?
            0
            [root@fir-rbh01 storage]# lfs getdirstripe /fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03
            lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none
            

            This directory is empty:

            [root@fir-rbh01 storage]# stat /fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03
              File: ‘/fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03’
              Size: 4096            Blocks: 8          IO Block: 4096   directory
            Device: e64e03a8h/3863872424d   Inode: 180148089774940567  Links: 2
            Access: (2755/drwxr-sr-x)  Uid: (55081/  jbboin)   Gid: (24300/  bgirod)
            Access: 2020-04-30 13:28:31.000000000 -0700
            Modify: 2019-11-29 22:10:47.000000000 -0800
            Change: 2019-11-29 22:10:47.000000000 -0800
             Birth: -
            [root@fir-rbh01 storage]# ls -lisa /fir/groups/bgirod/action_recognition/frames/v_ApplyEyeMakeup_g17_c03
            total 1840
            180148089774940567    4 drwxr-sr-x     2 jbboin bgirod    4096 Nov 29 22:10 .
            180148089774940559 1836 drwxr-sr-x 13322 jbboin bgirod 1871872 Apr 30 11:29 ..
            [root@fir-rbh01 storage]# 
            

            Originally, we ran lfs migrate -m 1 /fir/groups/bgirod, which is mostly done by now, apart from a few empty directories in /fir/groups/bgirod/action_recognition/frames/.

            Now, if I try again, I get:

            [root@fir-rbh01 storage]# lfs migrate -m 1 /fir/groups/bgirod
            /fir/groups/bgirod/ migrate failed: Operation not permitted (-1)
            

            And the same error is logged on MDT0001:

            fir-md1-s2: Apr 30 13:46:47 fir-md1-s2 kernel: LustreError: 22427:0:(mdd_dir.c:4496:mdd_migrate()) fir-MDD0001: 'bgirod' migration was interrupted, run 'lfs migrate -m 3 -c 1 -H 2 bgirod' to finish migration.
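
When many directories are affected, the hint can be pulled out of the MDS logs mechanically. The following is a rough sketch, assuming the message wording stays exactly as logged above; the helper name and the returned keys are hypothetical, and the name in the message is only the last path component, so it still has to be matched to a full path by hand.

```python
import re
import shlex

# Matches: <device>: '<name>' migration was interrupted, run '<command>' to finish migration
MSG_RE = re.compile(
    r"(?P<dev>\S+): '(?P<name>[^']+)' migration was interrupted, "
    r"run '(?P<cmd>lfs migrate [^']*)' to finish migration"
)


def suggested_resume(line):
    """Return the device, entry name and suggested argv, or None if no match."""
    m = MSG_RE.search(line)
    if m is None:
        return None
    return {
        "device": m.group("dev"),
        "name": m.group("name"),
        # shlex.split() breaks on whitespace, so names containing spaces
        # would not survive intact; this sketch does not handle them.
        "argv": shlex.split(m.group("cmd")),
    }


line = (
    "fir-md1-s2: Apr 30 13:46:47 fir-md1-s2 kernel: LustreError: "
    "22427:0:(mdd_dir.c:4496:mdd_migrate()) fir-MDD0001: 'bgirod' migration "
    "was interrupted, run 'lfs migrate -m 3 -c 1 -H 2 bgirod' to finish migration."
)
hit = suggested_resume(line)
print(hit["device"], hit["name"], hit["argv"])
```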
            

            Current getdirstripe info for each path component:

            [root@fir-rbh01 storage]# lfs getdirstripe /fir/groups/bgirod
            lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
            mdtidx           FID[seq:oid:ver]
                 3           [0x28003bb05:0x135:0x0]
                 1           [0x2400576a9:0x1abb:0x0]
            [root@fir-rbh01 storage]# lfs getdirstripe /fir/groups/bgirod/action_recognition
            lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
            mdtidx           FID[seq:oid:ver]
                 3           [0x28003bb05:0x136:0x0]
                 1           [0x2400576a9:0x1af9:0x0]
            [root@fir-rbh01 storage]# lfs getdirstripe /fir/groups/bgirod/action_recognition/frames
            lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
            mdtidx           FID[seq:oid:ver]
                 3           [0x28003bb05:0x138:0x0]
                 1           [0x2400576a9:0x1ce2:0x0]
            
            sthiell Stephane Thiell added the above comment.

            Thanks! Attached full debug (+ALL) from the client as client-ALL.log (client NID is 10.0.10.3@o2ib7) while running the following command (same as in the description above):

            lfs migrate -m 1 /fir/users/apatel6/data/10-scalingNEB/01-relaxwater/02-N
            

            and 2 seconds of debug logs from the two MDS in question:

              MDT0001 in fir-md1-s2-MDT0001_dlmtrace+rpctrace.log.gz
              MDT0003 in fir-md1-s4-MDT0003_dlmtrace+rpctrace.log.gz

            In the logs of MDT0001, I can see:

            00000004:00020000:16.0:1588205266.457978:0:22469:0:(mdd_dir.c:4496:mdd_migrate()) fir-MDD0001: '02-N' migration was interrupted, run 'lfs migrate -m 3 -c 1 -H 2 02-N' to finish migration.
            

            so I think I got this part at least.

            Let me know if I should try full debug of the MDS. Perhaps I could increase the debug buffer size.

            sthiell Stephane Thiell added the above comment.

            Stephane, are you able to collect debug logs from the client and MDS during the failed migration? Ideally, full debug in the client and MDS, but if the MDS is busy this would overflow the debug log, so if needed we could start with "debug=+dlmtrace+rpctrace".

            adilger Andreas Dilger added the above comment.
            pjones Peter Jones added a comment -

            Hongchao

            Could you please advise?

            Thanks

            Peter


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 6