Lustre / LU-17565

migrate racing with another op can corrupt filesystem

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Severity: 3

    Description

      mdt_reint_migrate() looks up names, then grabs LDLM locks without rechecking the names. Another operation, such as an unlink, can race with it and modify the directories within this lookup-to-locking window. After that, mdt_reint_migrate() generates a distributed transaction which fails on a remote node (because the looked-up name is already missing), but is not rolled back properly.
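
      Schematically, the window looks like this (pseudocode only; the helper names are illustrative placeholders, not the actual MDT entry points):

      /* current flow in mdt_reint_migrate(), roughly sketched */
      fid = lookup(parent, name);        /* (1) resolve the name; no locks held yet    */

      /* <-- race window: a concurrent unlink/rename can remove or replace
       *     "name" here, because nothing is locked */

      take_ldlm_locks(parent, fid);      /* (2) LDLM locks granted, but the lookup
                                          *     result is never re-verified            */
      run_distributed_txn(parent, fid);  /* (3) the remote part fails because the name
                                          *     is already gone, and the local part is
                                          *     not rolled back                        */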

    Attachments

    Issue Links

    Activity

            [LU-17565] migrate racing with another op can corrupt filesystem
            wamax Max Wang added a comment - - edited

            Ran the test Alex added in a loop with the fix, and did not see the corruption occur.

             

            One odd thing I noticed was a client crash that would occur an hour or so into running the loop

             

            [38518.943128] LustreError: 4405:0:(mgc_request.c:2107:mgc_process_log()) ASSERTION( !lustre_handle_is_used(&cld->cld_lockh) ) failed:
            [38518.944336] LustreError: 4405:0:(mgc_request.c:2107:mgc_process_log()) LBUG
            [38518.945009] CPU: 0 PID: 4405 Comm: ll_cfg_requeue Kdump: loaded Tainted: G OE 6.1.134-150.224.amzn2023.x86_64 #1
            [38518.946132] Hardware name: Amazon EC2 c5.4xlarge/, BIOS 1.0 10/16/2017
            [38518.946775] Call Trace:
            [38518.947054] <TASK>
            [38518.947300] dump_stack_lvl+0x34/0x48
            [38518.947693] lbug_with_loc.cold+0x5/0x43 [libcfs]
            [38518.948201] mgc_process_log+0x802/0x840 [mgc]
            [38518.948655] ? do_requeue+0x2f0/0x2f0 [mgc]
            [38518.949100] do_requeue+0x81/0x2f0 [mgc]
            [38518.949520] mgc_requeue_thread+0x155/0x450 [mgc]
            [38518.950030] ? membarrier_register_private_expedited+0x90/0x90
            [38518.950623] kthread+0xd9/0x100
            [38518.950952] ? kthread_complete_and_exit+0x20/0x20
            [38518.951446] ret_from_fork+0x22/0x30
            [38518.951827] </TASK>
            [38518.952075] Kernel panic - not syncing: LBUG
            [38518.952510] CPU: 0 PID: 4405 Comm: ll_cfg_requeue Kdump: loaded Tainted: G OE 6.1.134-150.224.amzn2023.x86_64 #1
            [38518.953578] Hardware name: Amazon EC2 c5.4xlarge/, BIOS 1.0 10/16/2017
            [38518.954212] Call Trace:
            [38518.954473] <TASK>
            [38518.954704] dump_stack_lvl+0x34/0x48
            [38518.955074] panic+0x107/0x2a7
            [38518.955415] lbug_with_loc.cold+0x1a/0x43 [libcfs]
            [38518.955915] mgc_process_log+0x802/0x840 [mgc]
            [38518.956401] ? do_requeue+0x2f0/0x2f0 [mgc]
            [38518.956840] do_requeue+0x81/0x2f0 [mgc]
            [38518.957244] mgc_requeue_thread+0x155/0x450 [mgc]
            [38518.957726] ? membarrier_register_private_expedited+0x90/0x90
            [38518.958328] kthread+0xd9/0x100
            [38518.958669] ? kthread_complete_and_exit+0x20/0x20
            [38518.959134] ret_from_fork+0x22/0x30
            [38518.959498] </TASK>

            The client is Ubuntu 22 running Lustre 2.15.6.

             

            So far, I can see that a lock handle is being updated during mgc_requeue_add:

            [Mon Jun 16 03:06:54 UTC 2025] Lustre: mgc_requeue_add 1341200906 enter log diyfsx-cliir-ffff961412a4d800 lockh 0x0: requeue (r=3 sp=0 st=3 handle_is_used=0 lostlock=0)
            [Mon Jun 16 03:06:54 UTC 2025] Lustre: mgc_requeue_add 1341200906 after decref log diyfsx-cliir-ffff961412a4d800 lockh 0x0: requeue (r=3 sp=0 st=3 handle_is_used=0 lostlock=0)
            [Mon Jun 16 03:06:54 UTC 2025] Lustre: mgc_requeue_add 1341200906 exit log diyfsx-cliir-ffff961412a4d800 lockh 0xeb85438d30e9f6ab: requeue (r=4 sp=0 st=3 handle_is_used=1 lostlock=1)

             

            https://github.com/lustre/lustre-release/blame/22f2344a13d38c8ceea9fba87bfb125ed19fa35e/lustre/mgc/mgc_request.c#L730-L740

            And after it is requeued, the crash happens.

            [Mon Jun 16 03:06:56 UTC 2025] Lustre: mgc_process_log 2103 log diyfsx-cliir-ffff961412a4d800 lockh 0xeb85438d30e9f6ab, rc 0, lustre_handle_is_used 1
            [Mon Jun 16 03:06:56 UTC 2025] Lustre: mgc_process_log 2110 log diyfsx-cliir-ffff961412a4d800 lockh 0xeb85438d30e9f6ab, rc 0, lustre_handle_is_used 1
            [Mon Jun 16 03:06:56 UTC 2025] LustreError: 2393:0:(mgc_request.c:2126:mgc_process_log()) ASSERTION( !lustre_handle_is_used(&cld->cld_lockh) ) failed: 
            [Mon Jun 16 03:06:56 UTC 2025] LustreError: 2393:0:(mgc_request.c:2126:mgc_process_log()) LBUG

             

            But I haven't yet found what's updating this lock, and I'm not sure whether it is caused by my change. Another data point: I do not see the crash when using an Ubuntu 20 client running 2.12.8. I would appreciate feedback on the fix, and tips for debugging the crash further.


            "Max Wang <wamax@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/59840
            Subject: LU-17565 mdt: add 2nd lookup in migrate to avoid unlink race
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 088de2535c3af7de696f5016ca0463e1c103fda2


            bzzz Alex Zhuravlev added a comment -

            I think we should grab the LDLM locks first, or recheck the data once the LDLM locks are granted.
            Rollback could be an option, but it looks more complicated to me.

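            For the recheck option, the shape of the change would roughly be the following (pseudocode with placeholder names, not the actual MDT/LDLM calls; -ESTALE is just an example errno):

            take_ldlm_locks(parent, fid);       /* take the LDLM locks first              */
            fid2 = lookup(parent, name);        /* repeat the lookup under the locks      */
            if (lookup_failed(fid2) || !fid_equal(fid, fid2)) {
                    /* the name is gone or now resolves to a different object:
                     * drop the locks and abort before starting the distributed txn */
                    return unlock_and_abort(-ESTALE);
            }
            run_distributed_txn(parent, fid);   /* safe: the name cannot change under us  */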

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54104
            Subject: LU-17565 tests: migrate vs unlink race
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4a01bbff1b519dbf5595346d2d0ba8cb40a0259d


            People

              Assignee: wc-triage WC Triage
              Reporter: bzzz Alex Zhuravlev
              Votes: 0
              Watchers: 4
