[LU-17040] LFSCK hang with resending request on EINPROGRESS Created: 18/Aug/23 Updated: 13/Nov/23 Resolved: 25/Oct/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Upstream |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Boyko | Assignee: | Alexander Boyko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
2MDTs cluster FOFB/IO tests |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
LFSCK hanged, no progress after 10+ hours. There was an attempt to stop it UID PID PPID C STIME TTY TIME CMD root 1241254 1241253 0 00:29 ? 00:00:00 lctl lfsck_stop --device snx11922-MDT0001 --all The stop did not help and node was crashed after stop+10h. Here is ps -m and stacks. crash> ps -m | grep "lfsck\|scrub\|lctl"
[0 00:00:01.186] [RU] PID: 164187 TASK: ffff90055862af80 CPU: 6 COMMAND: "OI_scrub"
[0 10:36:24.817] [ID] PID: 1241254 TASK: ffff9002ab60c740 CPU: 19 COMMAND: "lctl"
[0 23:45:12.588] [ID] PID: 164188 TASK: ffff90055862c740 CPU: 8 COMMAND: "lfsck_layout"
[0 23:45:12.623] [UN] PID: 164186 TASK: ffff90055862df00 CPU: 9 COMMAND: "lfsck"
[0 23:45:16.187] [ID] PID: 164189 TASK: ffff900558628000 CPU: 17 COMMAND: "lfsck_namespace"
PID: 164186 TASK: ffff90055862df00 CPU: 9 COMMAND: "lfsck"
#0 [ffffa414250f7d08] __schedule at ffffffffa374e1d4
#1 [ffffa414250f7d68] schedule at ffffffffa374e648
#2 [ffffa414250f7d78] osd_otable_it_next at ffffffffc1db3983 [osd_ldiskfs]
#3 [ffffa414250f7e08] lfsck_master_oit_engine at ffffffffc187f17e [lfsck]
#4 [ffffa414250f7e78] lfsck_master_engine at ffffffffc188073e [lfsck]
#5 [ffffa414250f7f10] kthread at ffffffffa2f043a6
#6 [ffffa414250f7f50] ret_from_fork at ffffffffa380023f
PID: 164187 TASK: ffff90055862af80 CPU: 6 COMMAND: "OI_scrub"
#0 [fffffe000013be48] crash_nmi_callback at ffffffffa2e54863
#1 [fffffe000013be50] nmi_handle at ffffffffa2e24c83
#2 [fffffe000013bea8] default_do_nmi at ffffffffa3741f89
#3 [fffffe000013bec8] do_nmi at ffffffffa2e2518e
#4 [fffffe000013bef0] end_repeat_nmi at ffffffffa38015c4
[exception RIP: kthread_should_stop+33]
RIP: ffffffffa2f04fc1 RSP: ffffa414250ffce0 RFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff900576b28000 RCX: ffffa414250ffdc8
RDX: ffff900576b2d880 RSI: ffff900576b28000 RDI: ffff900513717000
RBP: ffffa414250ffdc8 R8: 0000000000000001 R9: 0000000000000000
R10: ffffa414250ffc68 R11: 0000000000000000 R12: 0000000000000001
R13: ffff900576b2d880 R14: ffff900513717000 R15: ffff900576b2d820
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#5 [ffffa414250ffce0] kthread_should_stop at ffffffffa2f04fc1
#6 [ffffa414250ffce0] osd_scrub_next at ffffffffc1dadeb4 [osd_ldiskfs]
#7 [ffffa414250ffd78] osd_inode_iteration at ffffffffc1db27fb [osd_ldiskfs]
#8 [ffffa414250ffe40] osd_scrub_main at ffffffffc1db3319 [osd_ldiskfs]
#9 [ffffa414250fff10] kthread at ffffffffa2f043a6
#10 [ffffa414250fff50] ret_from_fork at ffffffffa380023f
PID: 164188 TASK: ffff90055862c740 CPU: 8 COMMAND: "lfsck_layout"
#0 [ffffa41425127da8] __schedule at ffffffffa374e1d4
#1 [ffffa41425127e08] schedule at ffffffffa374e648
#2 [ffffa41425127e18] lfsck_assistant_engine at ffffffffc18816e2 [lfsck]
#3 [ffffa41425127f10] kthread at ffffffffa2f043a6
#4 [ffffa41425127f50] ret_from_fork at ffffffffa380023f
PID: 164189 TASK: ffff900558628000 CPU: 17 COMMAND: "lfsck_namespace"
#0 [ffffa41425137da8] __schedule at ffffffffa374e1d4
#1 [ffffa41425137e08] schedule at ffffffffa374e648
#2 [ffffa41425137e18] lfsck_assistant_engine at ffffffffc18816e2 [lfsck]
#3 [ffffa41425137f10] kthread at ffffffffa2f043a6
#4 [ffffa41425137f50] ret_from_fork at ffffffffa380023f
PID: 1241254 TASK: ffff9002ab60c740 CPU: 19 COMMAND: "lctl"
#0 [ffffa4142502fc28] __schedule at ffffffffa374e1d4
#1 [ffffa4142502fc88] schedule at ffffffffa374e648
#2 [ffffa4142502fc98] lfsck_stop at ffffffffc1876fb5 [lfsck]
#3 [ffffa4142502fd08] mdd_iocontrol at ffffffffc1a2e265 [mdd]
#4 [ffffa4142502fd58] mdt_iocontrol at ffffffffc1bc5c2f [mdt]
#5 [ffffa4142502fdf0] class_handle_ioctl at ffffffffc12a86ff [obdclass]
#6 [ffffa4142502fe68] obd_class_ioctl at ffffffffc12a9047 [obdclass]
#7 [ffffa4142502fe80] do_vfs_ioctl at ffffffffa312e0c4
#8 [ffffa4142502fef8] ksys_ioctl at ffffffffa312e700
#9 [ffffa4142502ff30] __x64_sys_ioctl at ffffffffa312e746
#10 [ffffa4142502ff38] do_syscall_64 at ffffffffa2e0420b
I have found that scrub inconsistent list has entry with zero inode. And looks like scrub with LFSCK looped it. When LFSCK get EINPROGRESS it resend a request. After scrub process oii, it deletes oii from a list, and then LFSCK tries osp_attr_get again and adds it to scrub. So inconsistent list is not empty, scrub cannot finish inode iteration loop. crash> osd_inconsistent_item ffff9001fb355340 -x
struct osd_inconsistent_item {
oii_list = {
next = 0xffff9001fb355900,
prev = 0xffff900576b2d488
},
oii_cache = {
oic_fid = {
f_seq = 0x240019e29,
f_oid = 0x1bac,
f_ver = 0x0
},
oic_lid = {
oii_ino = 0x0,
oii_gen = 0x0
},
oic_dev = 0xffff900576b28000,
oic_remote = 0x0
},
oii_insert = 0x1
}
|
| Comments |
| Comment by Gerrit Updater [ 18/Aug/23 ] |
|
"Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51997 |
| Comment by Gerrit Updater [ 25/Oct/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51997/ |
| Comment by Peter Jones [ 25/Oct/23 ] |
|
Landed for 2.16 |