"lfs migrate -m" creates broken agent inodes when target MDT full

    Description

      mds02 kernel: LustreError: 8471:0:(osd_handler.c:3892:osd_create_local_agent_inode()) lfs1-MDT0001: [0x200026991:0x560d:0x0] add dot dotdot error: rc = -28
      mds02 kernel: LustreError: 8471:0:(out_lib.c:1190:out_tx_index_delete_undo()) lfs1-MDT0001-osd: Oops, can not rollback index_delete yet: rc = -524
      mds02 kernel: LustreError: 8471:0:(out_handler.c:915:out_tx_end()) lfs1-MDT0001-osd: undo for lustre/target/out_handler.c:454: rc = -524
      mds02 kernel: LustreError: 22380:0:(osd_handler.c:3892:osd_create_local_agent_inode()) lfs1-MDT0001: [0x200026991:0x5613:0x0] add dot dotdot error: rc = -28
      mds02 kernel: LustreError: 22380:0:(osd_handler.c:3892:osd_create_local_agent_inode()) Skipped 2 previous similar messages
      mds02 kernel: LustreError: 22380:0:(out_lib.c:1190:out_tx_index_delete_undo()) lfs1-MDT0001-osd: Oops, can not rollback index_delete yet: rc = -524
      mds02 kernel: LustreError: 22380:0:(out_lib.c:1190:out_tx_index_delete_undo()) Skipped 5 previous similar messages
      mds02 kernel: LustreError: 22380:0:(out_handler.c:915:out_tx_end()) lfs1-MDT0001-osd: undo for lustre/target/out_handler.c:454: rc = -524
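
The return codes in these messages are negated Linux errno values: -28 is ENOSPC (the target MDT is out of space or inodes) and -524 is the kernel-internal ENOTSUPP (the index_delete undo is not supported, so the failed update cannot be rolled back). The following is a minimal sketch for pulling the function name and return code out of such console lines when triaging logs. The helper names (`parse_line`, `errno_name`) and the regular expressions are illustrative assumptions, not part of any Lustre tool.

```python
import errno
import re

# Kernel-internal codes that Python's errno module does not define.
KERNEL_ONLY = {524: "ENOTSUPP"}

# Matches "LustreError: <pid>:<n>:(<file>:<line>:<func>()) <message>".
LINE_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<msg>.*)"
)
RC_RE = re.compile(r"rc = (-?\d+)")


def errno_name(rc):
    """Map a negative kernel return code to its symbolic errno name."""
    code = abs(rc)
    return KERNEL_ONLY.get(code) or errno.errorcode.get(code, "UNKNOWN")


def parse_line(text):
    """Return pid, function, rc and errno name, or None if the line does not match."""
    m = LINE_RE.search(text)
    if m is None:
        return None
    rc_match = RC_RE.search(m.group("msg"))
    rc = int(rc_match.group(1)) if rc_match else None
    return {
        "pid": int(m.group("pid")),
        "func": m.group("func"),
        "rc": rc,
        "errno": errno_name(rc) if rc is not None else None,
    }
```

Lines without a return code, such as the "Skipped 2 previous similar messages" entries, still parse but carry `rc` and `errno` set to `None`.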
      

      When e2fsck is run on the filesystem (LU-14710) it reports that the "." and ".." entries are corrupted, and can also report that the HTree index is corrupted:

      Directory entry for '.' in ... (1032783) is big.
      Split? no
      Second entry '3.3.0' (inode=538027 fid=[0x380020941:0x4c38:0x0]) in directory inode 1032783 should be '..'
      Fix? no
      

      This should be handled better by the MDS:

      • check that the target MDT has enough space before starting migration
      • if ENOSPC is returned before the directory is migrated, clean up the unused agent inodes
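
The two suggestions above can be sketched in user-space Python to show the intended control flow. This is only an illustration under stated assumptions: `MdtUsage`, `can_accept_migration`, `migrate_with_cleanup`, the `create`/`remove` callbacks, and the `reserve_fraction` margin are all hypothetical names, and the actual fixes in this ticket were made inside the MDS code rather than in a script like this.

```python
import errno
from dataclasses import dataclass


@dataclass
class MdtUsage:
    """Free resources on one MDT (hypothetical container for illustration)."""
    index: int
    free_inodes: int
    free_kbytes: int


def can_accept_migration(target, needed_inodes, needed_kbytes,
                         reserve_fraction=0.1):
    """Return True if the target can take the new objects and keep a reserve.

    reserve_fraction is an assumed safety margin, not a Lustre tunable.
    """
    inode_budget = target.free_inodes * (1 - reserve_fraction)
    space_budget = target.free_kbytes * (1 - reserve_fraction)
    return needed_inodes <= inode_budget and needed_kbytes <= space_budget


def migrate_with_cleanup(target, entries, create, remove):
    """Create one object per entry; on ENOSPC, undo the ones already created.

    Returns True on success and False if the target ran out of space.
    Errors other than ENOSPC are re-raised unchanged.
    """
    created = []
    try:
        for entry in entries:
            create(target, entry)
            created.append(entry)
    except OSError as exc:
        if exc.errno != errno.ENOSPC:
            raise
        # Remove in reverse order so nothing half-created is left behind.
        for entry in reversed(created):
            remove(target, entry)
        return False
    return True
```

A caller would run `can_accept_migration` first and only start the migration when it returns True; `migrate_with_cleanup` covers the case where the target still fills up partway through.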

          Activity

            [LU-14719] "lfs migrate -m" creates broken agent inodes when target MDT full
            adilger Andreas Dilger added a comment - - edited

            I think it only affects Janitor testing, since patch https://review.whamcloud.com/49216 passed normal autotest but fails 100% of the time in Janitor testing.

            Note that test_111g already enables full debug logging so you should be able to see it in the latest test results:
            https://testing.whamcloud.com/gerrit-janitor/26658/testresults/replay-single-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

            laisiyao Lai Siyao added a comment -

            I can't reproduce this on my local test system. Does it fail only in Janitor? I will enable full debug and check the logs.

            adilger Andreas Dilger added a comment - - edited

            Lai, it looks like patch https://review.whamcloud.com/47867 "LU-14719 lod: distributed transaction check space" broke replay-single.sh test_111g. We kind of knew that based on Oleg's comments on an earlier version of the patch, but the broken version check caused the test to be permanently skipped.

            Could you please look at fixing this? This patch predates the abort_recov_mdt patch, so that can't be the cause of the test_111g failure.

            According to the console logs, the client is dropped from recovery, and the "rm -rf $DIR/$tdir/striped_dir" command is not replayed from the client:

            [  108.039795] LustreError: 13372:0:(ldlm_lockd.c:2569:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1669164818 with bad export cookie 14558261718912942160
            [  108.039842] LustreError: 11-0: lustre-MDT0000-lwp-MDT0001: operation mds_disconnect to node 0@lo failed: rc = -107
            [  134.093090] LustreError: 1851:0:(client.c:3253:ptlrpc_replay_interpret()) @@@ status -107, old was 0  req@ffff88010b421300 x1750246084321408/t4294967302(4294967302) o36->lustre-MDT0001-mdc-ffff8800d749c800@192.168.203.164@tcp:12/10 lens 496/456 e 1 to 0 dl 1669164853 ref 2 fl Interpret:RQU/4/0 rc -107/-107 job:'rm.0'
            [  134.108916] Lustre: lustre-MDT0001-mdc-ffff8800d749c800: Connection restored to  (at 192.168.203.164@tcp)
            

            Full debug logs are available, though this is also a 100% failure and can hopefully also be reproduced locally.
            https://testing.whamcloud.com/gerrit-janitor/26658/testresults/replay-single-special2-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

            gerrit Gerrit Updater added a comment - - edited

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49218
            Subject: LU-14719 tests: find replay-single/111g breakage
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 842c76da8ea4049c3b846ed1dd7a2b40e1bd877b

            gerrit Gerrit Updater added a comment - - edited

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49217
            Subject: LU-14719 tests: find replay-single/111g breakage
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2c47c196635aef7e911b1717332d38a321b7c1ac

            gerrit Gerrit Updater added a comment - - edited

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49216
            Subject: LU-14719 tests: fix replay-single/111g version check
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4065ec7afbed4f3aaf94a1c4a0026701267d2963


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/47039/
            Subject: LU-14719 lod: distributed transaction check space
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6aee406c84b6b8fddf08b560acfcdf7c13c97e63


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47128/
            Subject: LU-14719 osp: add inode watermark
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 336eb696299e1c9731bd1443f05e5d814314ed36


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47040/
            Subject: LU-14719 utils: dir migration stop on error
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 9ca348e8769d2c613082eeaeaf2775e22625e970


            "Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47128
            Subject: LU-14719 osp: add inode watermark
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e34b90897c1f47ec692b34c11b4d2482366fa90c


            "Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47040
            Subject: LU-14719 utils: dir migration stop on error
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5eff26f75e239fb9764cd9947d7f08deab001bbe


            People

              laisiyao Lai Siyao
              adilger Andreas Dilger