Ran the test Alex added in a loop.with the fix, and did not see the corruption occur.
One odd thing I noticed was a client crash that would occur an hour or so into running the loop
[38518.943128] LustreError: 4405:0:(mgc_request.c:2107:mgc_process_log()) ASSERTION( !lustre_handle_is_used(&cld->cld_lockh) ) failed:
[38518.944336] LustreError: 4405:0:(mgc_request.c:2107:mgc_process_log()) LBUG
[38518.945009] CPU: 0 PID: 4405 Comm: ll_cfg_requeue Kdump: loaded Tainted: G OE 6.1.134-150.224.amzn2023.x86_64 #1
[38518.946132] Hardware name: Amazon EC2 c5.4xlarge/, BIOS 1.0 10/16/2017
[38518.946775] Call Trace:
[38518.947054] <TASK>
[38518.947300] dump_stack_lvl+0x34/0x48
[38518.947693] lbug_with_loc.cold+0x5/0x43 [libcfs]
[38518.948201] mgc_process_log+0x802/0x840 [mgc]
[38518.948655] ? do_requeue+0x2f0/0x2f0 [mgc]
[38518.949100] do_requeue+0x81/0x2f0 [mgc]
[38518.949520] mgc_requeue_thread+0x155/0x450 [mgc]
[38518.950030] ? membarrier_register_private_expedited+0x90/0x90
[38518.950623] kthread+0xd9/0x100
[38518.950952] ? kthread_complete_and_exit+0x20/0x20
[38518.951446] ret_from_fork+0x22/0x30
[38518.951827] </TASK>
[38518.952075] Kernel panic - not syncing: LBUG
[38518.952510] CPU: 0 PID: 4405 Comm: ll_cfg_requeue Kdump: loaded Tainted: G OE 6.1.134-150.224.amzn2023.x86_64 #1
[38518.953578] Hardware name: Amazon EC2 c5.4xlarge/, BIOS 1.0 10/16/2017
[38518.954212] Call Trace:
[38518.954473] <TASK>
[38518.954704] dump_stack_lvl+0x34/0x48
[38518.955074] panic+0x107/0x2a7
[38518.955415] lbug_with_loc.cold+0x1a/0x43 [libcfs]
[38518.955915] mgc_process_log+0x802/0x840 [mgc]
[38518.956401] ? do_requeue+0x2f0/0x2f0 [mgc]
[38518.956840] do_requeue+0x81/0x2f0 [mgc]
[38518.957244] mgc_requeue_thread+0x155/0x450 [mgc]
[38518.957726] ? membarrier_register_private_expedited+0x90/0x90
[38518.958328] kthread+0xd9/0x100
[38518.958669] ? kthread_complete_and_exit+0x20/0x20
[38518.959134] ret_from_fork+0x22/0x30
[38518.959498] </TASK>
Client is using Ubuntu 22 on 2.15.6
So far, I can see there is a lock being updated during mgc_requeue_add
[Mon Jun 16 03:06:54 UTC 2025] Lustre: mgc_requeue_add 1341200906 enter log diyfsx-cliir-ffff961412a4d800 lockh 0x0: requeue (r=3 sp=0 st=3 handle_is_used=0 lostlock=0)
[Mon Jun 16 03:06:54 UTC 2025] Lustre: mgc_requeue_add 1341200906 after decref log diyfsx-cliir-ffff961412a4d800 lockh 0x0: requeue (r=3 sp=0 st=3 handle_is_used=0 lostlock=0)
[Mon Jun 16 03:06:54 UTC 2025] Lustre: mgc_requeue_add 1341200906 exit log diyfsx-cliir-ffff961412a4d800 lockh 0xeb85438d30e9f6ab: requeue (r=4 sp=0 st=3 handle_is_used=1 lostlock=1)
https://github.com/lustre/lustre-release/blame/22f2344a13d38c8ceea9fba87bfb125ed19fa35e/lustre/mgc/mgc_request.c#L730-L740
And after it is requeued, the crash happens.
[Mon Jun 16 03:06:56 UTC 2025] Lustre: mgc_process_log 2103 log diyfsx-cliir-ffff961412a4d800 lockh 0xeb85438d30e9f6ab, rc 0, lustre_handle_is_used 1
[Mon Jun 16 03:06:56 UTC 2025] Lustre: mgc_process_log 2110 log diyfsx-cliir-ffff961412a4d800 lockh 0xeb85438d30e9f6ab, rc 0, lustre_handle_is_used 1
[Mon Jun 16 03:06:56 UTC 2025] LustreError: 2393:0:(mgc_request.c:2126:mgc_process_log()) ASSERTION( !lustre_handle_is_used(&cld->cld_lockh) ) failed:
[Mon Jun 16 03:06:56 UTC 2025] LustreError: 2393:0:(mgc_request.c:2126:mgc_process_log()) LBUG
But I haven't yet found what's updating this lock. I'm not sure if this is caused by my change or not. Another data point is I do not see the crash when using a Ubuntu 20 2.12.8 client. Would appreciate feedback on the fix, and tips for debugging the crash further.
Ran the test Alex added in a loop.with the fix, and did not see the corruption occur.
One odd thing I noticed was a client crash that would occur an hour or so into running the loop
[38518.943128] LustreError: 4405:0:(mgc_request.c:2107:mgc_process_log()) ASSERTION( !lustre_handle_is_used(&cld->cld_lockh) ) failed:
[38518.944336] LustreError: 4405:0:(mgc_request.c:2107:mgc_process_log()) LBUG
[38518.945009] CPU: 0 PID: 4405 Comm: ll_cfg_requeue Kdump: loaded Tainted: G OE 6.1.134-150.224.amzn2023.x86_64 #1
[38518.946132] Hardware name: Amazon EC2 c5.4xlarge/, BIOS 1.0 10/16/2017
[38518.946775] Call Trace:
[38518.947054] <TASK>
[38518.947300] dump_stack_lvl+0x34/0x48
[38518.947693] lbug_with_loc.cold+0x5/0x43 [libcfs]
[38518.948201] mgc_process_log+0x802/0x840 [mgc]
[38518.948655] ? do_requeue+0x2f0/0x2f0 [mgc]
[38518.949100] do_requeue+0x81/0x2f0 [mgc]
[38518.949520] mgc_requeue_thread+0x155/0x450 [mgc]
[38518.950030] ? membarrier_register_private_expedited+0x90/0x90
[38518.950623] kthread+0xd9/0x100
[38518.950952] ? kthread_complete_and_exit+0x20/0x20
[38518.951446] ret_from_fork+0x22/0x30
[38518.951827] </TASK>
[38518.952075] Kernel panic - not syncing: LBUG
[38518.952510] CPU: 0 PID: 4405 Comm: ll_cfg_requeue Kdump: loaded Tainted: G OE 6.1.134-150.224.amzn2023.x86_64 #1
[38518.953578] Hardware name: Amazon EC2 c5.4xlarge/, BIOS 1.0 10/16/2017
[38518.954212] Call Trace:
[38518.954473] <TASK>
[38518.954704] dump_stack_lvl+0x34/0x48
[38518.955074] panic+0x107/0x2a7
[38518.955415] lbug_with_loc.cold+0x1a/0x43 [libcfs]
[38518.955915] mgc_process_log+0x802/0x840 [mgc]
[38518.956401] ? do_requeue+0x2f0/0x2f0 [mgc]
[38518.956840] do_requeue+0x81/0x2f0 [mgc]
[38518.957244] mgc_requeue_thread+0x155/0x450 [mgc]
[38518.957726] ? membarrier_register_private_expedited+0x90/0x90
[38518.958328] kthread+0xd9/0x100
[38518.958669] ? kthread_complete_and_exit+0x20/0x20
[38518.959134] ret_from_fork+0x22/0x30
[38518.959498] </TASK>
Client is using Ubuntu 22 on 2.15.6
So far, I can see there is a lock being updated during mgc_requeue_add
[Mon Jun 16 03:06:54 UTC 2025] Lustre: mgc_requeue_add 1341200906 enter log diyfsx-cliir-ffff961412a4d800 lockh 0x0: requeue (r=3 sp=0 st=3 handle_is_used=0 lostlock=0)
[Mon Jun 16 03:06:54 UTC 2025] Lustre: mgc_requeue_add 1341200906 after decref log diyfsx-cliir-ffff961412a4d800 lockh 0x0: requeue (r=3 sp=0 st=3 handle_is_used=0 lostlock=0)
[Mon Jun 16 03:06:54 UTC 2025] Lustre: mgc_requeue_add 1341200906 exit log diyfsx-cliir-ffff961412a4d800 lockh 0xeb85438d30e9f6ab: requeue (r=4 sp=0 st=3 handle_is_used=1 lostlock=1)
https://github.com/lustre/lustre-release/blame/22f2344a13d38c8ceea9fba87bfb125ed19fa35e/lustre/mgc/mgc_request.c#L730-L740
And after it is requeued, the crash happens.
[Mon Jun 16 03:06:56 UTC 2025] Lustre: mgc_process_log 2103 log diyfsx-cliir-ffff961412a4d800 lockh 0xeb85438d30e9f6ab, rc 0, lustre_handle_is_used 1
[Mon Jun 16 03:06:56 UTC 2025] Lustre: mgc_process_log 2110 log diyfsx-cliir-ffff961412a4d800 lockh 0xeb85438d30e9f6ab, rc 0, lustre_handle_is_used 1
[Mon Jun 16 03:06:56 UTC 2025] LustreError: 2393:0:(mgc_request.c:2126:mgc_process_log()) ASSERTION( !lustre_handle_is_used(&cld->cld_lockh) ) failed:
[Mon Jun 16 03:06:56 UTC 2025] LustreError: 2393:0:(mgc_request.c:2126:mgc_process_log()) LBUG
But I haven't yet found what's updating this lock. I'm not sure if this is caused by my change or not. Another data point is I do not see the crash when using a Ubuntu 20 2.12.8 client. Would appreciate feedback on the fix, and tips for debugging the crash further.