Details
-
Bug
-
Resolution: Fixed
-
Major
-
Upstream
-
2MDTs cluster FOFB/IO tests
-
3
-
9223372036854775807
Description
LFSCK hanged, no progress after 10+ hours. There was an attempt to stop it
UID PID PPID C STIME TTY TIME CMD root 1241254 1241253 0 00:29 ? 00:00:00 lctl lfsck_stop --device snx11922-MDT0001 --all
The stop did not help and node was crashed after stop+10h. Here is ps -m and stacks.
crash> ps -m | grep "lfsck\|scrub\|lctl" [0 00:00:01.186] [RU] PID: 164187 TASK: ffff90055862af80 CPU: 6 COMMAND: "OI_scrub" [0 10:36:24.817] [ID] PID: 1241254 TASK: ffff9002ab60c740 CPU: 19 COMMAND: "lctl" [0 23:45:12.588] [ID] PID: 164188 TASK: ffff90055862c740 CPU: 8 COMMAND: "lfsck_layout" [0 23:45:12.623] [UN] PID: 164186 TASK: ffff90055862df00 CPU: 9 COMMAND: "lfsck" [0 23:45:16.187] [ID] PID: 164189 TASK: ffff900558628000 CPU: 17 COMMAND: "lfsck_namespace" PID: 164186 TASK: ffff90055862df00 CPU: 9 COMMAND: "lfsck" #0 [ffffa414250f7d08] __schedule at ffffffffa374e1d4 #1 [ffffa414250f7d68] schedule at ffffffffa374e648 #2 [ffffa414250f7d78] osd_otable_it_next at ffffffffc1db3983 [osd_ldiskfs] #3 [ffffa414250f7e08] lfsck_master_oit_engine at ffffffffc187f17e [lfsck] #4 [ffffa414250f7e78] lfsck_master_engine at ffffffffc188073e [lfsck] #5 [ffffa414250f7f10] kthread at ffffffffa2f043a6 #6 [ffffa414250f7f50] ret_from_fork at ffffffffa380023f PID: 164187 TASK: ffff90055862af80 CPU: 6 COMMAND: "OI_scrub" #0 [fffffe000013be48] crash_nmi_callback at ffffffffa2e54863 #1 [fffffe000013be50] nmi_handle at ffffffffa2e24c83 #2 [fffffe000013bea8] default_do_nmi at ffffffffa3741f89 #3 [fffffe000013bec8] do_nmi at ffffffffa2e2518e #4 [fffffe000013bef0] end_repeat_nmi at ffffffffa38015c4 [exception RIP: kthread_should_stop+33] RIP: ffffffffa2f04fc1 RSP: ffffa414250ffce0 RFLAGS: 00000246 RAX: 0000000000000000 RBX: ffff900576b28000 RCX: ffffa414250ffdc8 RDX: ffff900576b2d880 RSI: ffff900576b28000 RDI: ffff900513717000 RBP: ffffa414250ffdc8 R8: 0000000000000001 R9: 0000000000000000 R10: ffffa414250ffc68 R11: 0000000000000000 R12: 0000000000000001 R13: ffff900576b2d880 R14: ffff900513717000 R15: ffff900576b2d820 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- <NMI exception stack> --- #5 [ffffa414250ffce0] kthread_should_stop at ffffffffa2f04fc1 #6 [ffffa414250ffce0] osd_scrub_next at ffffffffc1dadeb4 [osd_ldiskfs] #7 [ffffa414250ffd78] osd_inode_iteration at ffffffffc1db27fb [osd_ldiskfs] #8 [ffffa414250ffe40] osd_scrub_main at ffffffffc1db3319 [osd_ldiskfs] #9 [ffffa414250fff10] kthread at ffffffffa2f043a6 #10 [ffffa414250fff50] ret_from_fork at ffffffffa380023f PID: 164188 TASK: ffff90055862c740 CPU: 8 COMMAND: "lfsck_layout" #0 [ffffa41425127da8] __schedule at ffffffffa374e1d4 #1 [ffffa41425127e08] schedule at ffffffffa374e648 #2 [ffffa41425127e18] lfsck_assistant_engine at ffffffffc18816e2 [lfsck] #3 [ffffa41425127f10] kthread at ffffffffa2f043a6 #4 [ffffa41425127f50] ret_from_fork at ffffffffa380023f PID: 164189 TASK: ffff900558628000 CPU: 17 COMMAND: "lfsck_namespace" #0 [ffffa41425137da8] __schedule at ffffffffa374e1d4 #1 [ffffa41425137e08] schedule at ffffffffa374e648 #2 [ffffa41425137e18] lfsck_assistant_engine at ffffffffc18816e2 [lfsck] #3 [ffffa41425137f10] kthread at ffffffffa2f043a6 #4 [ffffa41425137f50] ret_from_fork at ffffffffa380023f PID: 1241254 TASK: ffff9002ab60c740 CPU: 19 COMMAND: "lctl" #0 [ffffa4142502fc28] __schedule at ffffffffa374e1d4 #1 [ffffa4142502fc88] schedule at ffffffffa374e648 #2 [ffffa4142502fc98] lfsck_stop at ffffffffc1876fb5 [lfsck] #3 [ffffa4142502fd08] mdd_iocontrol at ffffffffc1a2e265 [mdd] #4 [ffffa4142502fd58] mdt_iocontrol at ffffffffc1bc5c2f [mdt] #5 [ffffa4142502fdf0] class_handle_ioctl at ffffffffc12a86ff [obdclass] #6 [ffffa4142502fe68] obd_class_ioctl at ffffffffc12a9047 [obdclass] #7 [ffffa4142502fe80] do_vfs_ioctl at ffffffffa312e0c4 #8 [ffffa4142502fef8] ksys_ioctl at ffffffffa312e700 #9 [ffffa4142502ff30] __x64_sys_ioctl at ffffffffa312e746 #10 [ffffa4142502ff38] do_syscall_64 at ffffffffa2e0420b
I have found that scrub inconsistent list has entry with zero inode. And looks like scrub with LFSCK looped it. When LFSCK get EINPROGRESS it resend a request. After scrub process oii, it deletes oii from a list, and then LFSCK tries osp_attr_get again and adds it to scrub. So inconsistent list is not empty, scrub cannot finish inode iteration loop.
There is also lfsck stop bug, I guess LFSCK stop code doesn't have logic to stop scrub, but it requires additional ticket.
crash> osd_inconsistent_item ffff9001fb355340 -x struct osd_inconsistent_item { oii_list = { next = 0xffff9001fb355900, prev = 0xffff900576b2d488 }, oii_cache = { oic_fid = { f_seq = 0x240019e29, f_oid = 0x1bac, f_ver = 0x0 }, oic_lid = { oii_ino = 0x0, oii_gen = 0x0 }, oic_dev = 0xffff900576b28000, oic_remote = 0x0 }, oii_insert = 0x1 }
Attachments
Issue Links
- is related to
-
LU-17114 sanity test_56xb: migrated hard link has mismatched FID
- Resolved