LU-14719: "lfs migrate -m" creates broken agent inodes when target MDT full

    Description

      mds02 kernel: LustreError: 8471:0:(osd_handler.c:3892:osd_create_local_agent_inode()) lfs1-MDT0001: [0x200026991:0x560d:0x0] add dot dotdot error: rc = -28
      mds02 kernel: LustreError: 8471:0:(out_lib.c:1190:out_tx_index_delete_undo()) lfs1-MDT0001-osd: Oops, can not rollback index_delete yet: rc = -524
      mds02 kernel: LustreError: 8471:0:(out_handler.c:915:out_tx_end()) lfs1-MDT0001-osd: undo for lustre/target/out_handler.c:454: rc = -524
      mds02 kernel: LustreError: 22380:0:(osd_handler.c:3892:osd_create_local_agent_inode()) lfs1-MDT0001: [0x200026991:0x5613:0x0] add dot dotdot error: rc = -28
      mds02 kernel: LustreError: 22380:0:(osd_handler.c:3892:osd_create_local_agent_inode()) Skipped 2 previous similar messages
      mds02 kernel: LustreError: 22380:0:(out_lib.c:1190:out_tx_index_delete_undo()) lfs1-MDT0001-osd: Oops, can not rollback index_delete yet: rc = -524
      mds02 kernel: LustreError: 22380:0:(out_lib.c:1190:out_tx_index_delete_undo()) Skipped 5 previous similar messages
      mds02 kernel: LustreError: 22380:0:(out_handler.c:915:out_tx_end()) lfs1-MDT0001-osd: undo for lustre/target/out_handler.c:454: rc = -524
      

      When e2fsck is run on the filesystem (LU-14710) it reports that the "." and ".." entries are corrupted, and can also report that the HTree index is corrupted:

      Directory entry for '.' in ... (1032783) is big.
      Split? no
      Second entry '3.3.0' (inode=538027 fid=[0x380020941:0x4c38:0x0]) in directory inode 1032783 should be '..'
      Fix? no
      

      This should be handled better by the MDS:

      • check that the target MDT has enough space before starting migration
      • if ENOSPC is returned before the directory is migrated, then clean up the unused agent inodes
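
The first suggestion above can be modelled roughly as follows. This is only an illustrative sketch in Python, not Lustre code: `MdtUsage`, `precheck_migration()` and their parameters are hypothetical names chosen for the example, and a real check would take its numbers from the target MDT's statfs data.

```python
import errno
from dataclasses import dataclass


@dataclass
class MdtUsage:
    """Free resources on the target MDT (hypothetical structure)."""
    free_inodes: int
    free_blocks: int


def precheck_migration(usage: MdtUsage, inodes_needed: int,
                       blocks_needed: int) -> int:
    """Return 0 if the target looks large enough, else -ENOSPC.

    Refusing up front avoids starting a migration that would run out
    of space partway through and leave partially created inodes behind.
    """
    if usage.free_inodes < inodes_needed:
        return -errno.ENOSPC
    if usage.free_blocks < blocks_needed:
        return -errno.ENOSPC
    return 0
```

A caller would run this before creating any objects on the target and abort the migration when the result is non-zero.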

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49249/
            Subject: LU-14719 lod: ignore space check error in recovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e44489f2f29a2e50883f9bbdec491b65ca92a692

            laisiyao Lai Siyao added a comment -

            The failure in replay-single test_111g happens because lod_trans_space_check() -> dt_statfs() -> osp_statfs() returns -ENOTCONN during replay. I'll push a patch to make lod_trans_space_check() fail only when space is low, and skip other failures, which will be detected during operation processing.
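
The behaviour proposed in this comment can be modelled as below. This is a hedged sketch in Python, not the code from the patch: `space_check()` and its arguments are hypothetical stand-ins, since the real signature of lod_trans_space_check() is not shown in this ticket.

```python
import errno


def space_check(statfs_rc: int, free_blocks: int, needed_blocks: int) -> int:
    """Model of a space check that fails only on low space.

    statfs_rc stands in for the result of the statfs call (0 on success,
    a negative errno such as -ENOTCONN on failure).
    """
    if statfs_rc != 0:
        # statfs itself failed (for example -ENOTCONN during replay).
        # Do not fail the transaction here; a real problem will surface
        # later, when the operation is actually processed.
        return 0
    if free_blocks < needed_blocks:
        # Only a confirmed shortage of space is reported as an error.
        return -errno.ENOSPC
    return 0
```

With this shape, a disconnected target during recovery no longer causes the replayed request to be rejected, while a full target is still refused.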

            gerrit Gerrit Updater added a comment - - edited

            "Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49249
            Subject: LU-14719 lod: skip space check error
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f2becc7a5b653e8da37a178c1757d5757d5c1e36

            adilger Andreas Dilger added a comment - - edited

            I think it only affects Janitor testing, since the patch https://review.whamcloud.com/49216 passed normal autotest. However, it fails 100% of the time in Janitor testing.

            Note that test_111g already enables full debug logging so you should be able to see it in the latest test results:
            https://testing.whamcloud.com/gerrit-janitor/26658/testresults/replay-single-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

            laisiyao Lai Siyao added a comment -

            I can't reproduce this on my local test system. Does it fail only in Janitor testing? I will enable full debug and check the logs.

            adilger Andreas Dilger added a comment - - edited

            Lai, it looks like patch https://review.whamcloud.com/47867 "LU-14719 lod: distributed transaction check space" broke replay-single.sh test_111g. We kind of knew that based on Oleg's comments on an earlier version of the patch, but the broken version check caused the test to be permanently skipped.

            Could you please look at fixing this? This patch predates the abort_recov_mdt patch, so that can't be the cause of the test_111g failure.

            According to the console logs, the client is dropped from recovery, and the "rm -rf $DIR/$tdir/striped_dir" command is not replayed from the client:

            [  108.039795] LustreError: 13372:0:(ldlm_lockd.c:2569:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1669164818 with bad export cookie 14558261718912942160
            [  108.039842] LustreError: 11-0: lustre-MDT0000-lwp-MDT0001: operation mds_disconnect to node 0@lo failed: rc = -107
            [  134.093090] LustreError: 1851:0:(client.c:3253:ptlrpc_replay_interpret()) @@@ status -107, old was 0  req@ffff88010b421300 x1750246084321408/t4294967302(4294967302) o36->lustre-MDT0001-mdc-ffff8800d749c800@192.168.203.164@tcp:12/10 lens 496/456 e 1 to 0 dl 1669164853 ref 2 fl Interpret:RQU/4/0 rc -107/-107 job:'rm.0'
            [  134.108916] Lustre: lustre-MDT0001-mdc-ffff8800d749c800: Connection restored to  (at 192.168.203.164@tcp)
            

            Full debug logs are available, though this is also a 100% failure and can hopefully also be reproduced locally.
            https://testing.whamcloud.com/gerrit-janitor/26658/testresults/replay-single-special2-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

            gerrit Gerrit Updater added a comment - - edited

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49218
            Subject: LU-14719 tests: find replay-single/111g breakage
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 842c76da8ea4049c3b846ed1dd7a2b40e1bd877b

            gerrit Gerrit Updater added a comment - - edited

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49217
            Subject: LU-14719 tests: find replay-single/111g breakage
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2c47c196635aef7e911b1717332d38a321b7c1ac

            gerrit Gerrit Updater added a comment - - edited

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49216
            Subject: LU-14719 tests: fix replay-single/111g version check
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4065ec7afbed4f3aaf94a1c4a0026701267d2963


            People

              laisiyao Lai Siyao
              adilger Andreas Dilger