[LU-10321] MDS - umount hangs during failback Created: 04/Dec/17  Updated: 02/Feb/18  Resolved: 04/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.2
Fix Version/s: Lustre 2.11.0, Lustre 2.10.4

Type: Bug Priority: Critical
Reporter: Cliff White (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: soak
Environment:

Soak stress cluster, MLNX networking stack.


Attachments: File soak-10.umount.hang.txt.gz     File soak-11.stacks.txt.gz     Text File soak-11.umount.hang.txt     File soak-3.umount.hang.txt.gz     File soak-6.umount.hang.txt.gz     Text File vmcore-dmesg.txt     Text File vmcore-dmesg.txt    
Issue Links:
Related
is related to LU-6684 lctl lfsck_stop hangs Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

MDT 2 (soak-10) fails over to soak-11, with errors

Dec  2 07:27:00 soak-11 kernel: LustreError: 2976:0:(llog_osd.c:960:llog_osd_next_block()) soaked-MDT0003-osp-MDT0002: missed desired record? 2 > 1
Dec  2 07:27:00 soak-11 kernel: LustreError: 2976:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0003-osp-MDT0002 getting update log failed: rc = -2
Dec  2 07:27:00 soak-11 kernel: LustreError: 2976:0:(lod_dev.c:419:lod_sub_recovery_thread()) Skipped 3 previous similar messages
Dec  2 07:27:01 soak-11 kernel: LustreError: 2381:0:(mdt_open.c:1167:mdt_cross_open()) soaked-MDT0002: [0x280002b4c:0xa44:0x0] doesn't exist!: rc = -14
Dec  2 07:27:02 soak-11 kernel: Lustre: 2977:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Dec  2 07:27:02 soak-11 kernel: Lustre: 2977:0:(ldlm_lib.c:2059:target_recovery_overseer()) Skipped 2 previous similar messages
Dec  2 07:27:02 soak-11 kernel: Lustre: soaked-MDT0002: disconnecting 31 stale clients

Soak attempts a umount, which hangs:

2017-12-02 07:27:16,430:fsmgmt.fsmgmt:INFO     Unmounting soaked-MDT0002 on soak-11 ...
soak-11
Dec  2 07:30:16 soak-11 kernel: INFO: task umount:3039 blocked for more than 120 seconds.
Dec  2 07:30:16 soak-11 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  2 07:30:16 soak-11 kernel: umount          D ffff8803c81f4008     0  3039   3037 0x00000080
Dec  2 07:30:16 soak-11 kernel: ffff8803ce3afa30 0000000000000086 ffff88081fa50000 ffff8803ce3affd8
Dec  2 07:30:16 soak-11 kernel: ffff8803ce3affd8 ffff8803ce3affd8 ffff88081fa50000 ffff8803c81f4000
Dec  2 07:30:16 soak-11 kernel: ffff8803c81f4004 ffff88081fa50000 00000000ffffffff ffff8803c81f4008
Dec  2 07:30:16 soak-11 kernel: Call Trace:
Dec  2 07:30:16 soak-11 kernel: [<ffffffff816aa489>] schedule_preempt_disabled+0x29/0x70
Dec  2 07:30:16 soak-11 kernel: [<ffffffff816a83b7>] __mutex_lock_slowpath+0xc7/0x1d0
Dec  2 07:30:16 soak-11 kernel: [<ffffffff816a77cf>] mutex_lock+0x1f/0x2f
Dec  2 07:30:16 soak-11 kernel: [<ffffffffc14560c7>] lfsck_stop+0x167/0x4e0 [lfsck]
Dec  2 07:30:16 soak-11 kernel: [<ffffffff810c4832>] ? default_wake_function+0x12/0x20
Dec  2 07:30:16 soak-11 kernel: [<ffffffff811e0593>] ? __kmalloc+0x1e3/0x230
Dec  2 07:30:16 soak-11 kernel: [<ffffffffc1625aa6>] mdd_iocontrol+0x96/0x16a0 [mdd]
Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0ec9619>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
Dec  2 07:30:17 soak-11 kernel: [<ffffffffc1500fc1>] mdt_device_fini+0x71/0x920 [mdt]
Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0ed6911>] class_cleanup+0x971/0xcd0 [obdclass]
Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0ed8cad>] class_process_config+0x19cd/0x23b0 [obdclass]
Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0dc6bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0ed9856>] class_manual_cleanup+0x1c6/0x710 [obdclass]
Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0f07fee>] server_put_super+0x8de/0xcd0 [obdclass]
Dec  2 07:30:17 soak-11 kernel: [<ffffffff81203692>] generic_shutdown_super+0x72/0x100
Dec  2 07:30:17 soak-11 kernel: [<ffffffff81203a62>] kill_anon_super+0x12/0x20
Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0edc152>] lustre_kill_super+0x32/0x50 [obdclass]
Dec  2 07:30:17 soak-11 kernel: [<ffffffff81203e19>] deactivate_locked_super+0x49/0x60
Dec  2 07:30:17 soak-11 kernel: [<ffffffff81204586>] deactivate_super+0x46/0x60
Dec  2 07:30:17 soak-11 kernel: [<ffffffff812217cf>] cleanup_mnt+0x3f/0x80
Dec  2 07:30:18 soak-11 kernel: [<ffffffff81221862>] __cleanup_mnt+0x12/0x20
Dec  2 07:30:18 soak-11 kernel: [<ffffffff810ad275>] task_work_run+0xc5/0xf0
Dec  2 07:30:18 soak-11 kernel: [<ffffffff8102ab62>] do_notify_resume+0x92/0xb0
Dec  2 07:30:18 soak-11 kernel: [<ffffffff816b533d>] int_signal+0x12/0x17
Dec  2 07:30:19 soak-11 kernel: LustreError: 11-0: soaked-OST0016-osc-MDT0002: operation ost_connect to node 192.168.1.106@o2ib failed: rc = -114

This wedges soak: no further faults are attempted, and jobs stop scheduling.
This happened over the weekend. Dumped Lustre logs, forced a crash dump.
Logs, crash info attached.
Full crash dump is available on Spirit.



 Comments   
Comment by John Hammond [ 04/Dec/17 ]

Maybe an lfsck start/recovery deadlock.

umount is waiting in lfsck_stop() on li_mutex. tgt_recover_2 is in lfsck_start() holding li_mutex, possibly waiting on lfsck_layout (pid 3013), which is sending notifications but is stuck waiting for import recovery.

00100000:10000000:17.0:1512199622.506187:0:3013:0:(lfsck_engine.c:1576:lfsck_assistant_engine()) soaked-MDT0002-osd: lfsck_layout LFSCK assistant thread start
00000100:00080000:17.0:1512199622.506589:0:3013:0:(client.c:1563:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != CONNECTING)  req@ffff8803ee66b900 x1585651536798656/t0(0) o1101->soaked-OST0016-osc-MDT0002@192.168.1.106@o2ib:28/4 lens 320/224 e 0 to 0 dl 0 ref 1 fl Rpc:W/0/ffffffff rc 0/-1

Relevant backtraces:

lfsck_namespace S ffff8803c81f4000     0  3012      2 0x00000080
 [<---------------->] schedule+0x29/0x70
 [<---------------->] lfsck_assistant_engine+0x118d/0x20c0 [lfsck]
 [<---------------->] ? dequeue_task_fair+0x41e/0x660
 [<---------------->] ? __schedule+0x39d/0x8b0
 [<---------------->] ? wake_up_state+0x20/0x20
 [<---------------->] ? lfsck_master_engine+0x12f0/0x12f0 [lfsck]
 [<---------------->] kthread+0xcf/0xe0
 [<---------------->] ? insert_kthread_work+0x40/0x40
 [<---------------->] ret_from_fork+0x58/0x90
 [<---------------->] ? insert_kthread_work+0x40/0x40

lfsck           S ffff8803f6dc9800     0  3010      2 0x00000080
 [<---------------->] schedule+0x29/0x70
 [<---------------->] lfsck_start_assistant+0x1d5/0x320 [lfsck]
 [<---------------->] ? dt_read+0x14/0x50 [obdclass]
 [<---------------->] ? lfsck_layout_load_bitmap+0x4bb/0x710 [lfsck]
 [<---------------->] ? wake_up_state+0x20/0x20
 [<---------------->] lfsck_layout_master_prep+0xfe/0x250 [lfsck]
 [<---------------->] lfsck_master_engine+0x176/0x12f0 [lfsck]
 [<---------------->] ? __switch_to+0xd7/0x510
 [<---------------->] ? __schedule+0x39d/0x8b0
 [<---------------->] ? lfsck_master_oit_engine+0x1150/0x1150 [lfsck]
 [<---------------->] kthread+0xcf/0xe0
 [<---------------->] ? insert_kthread_work+0x40/0x40
 [<---------------->] ret_from_fork+0x58/0x90
 [<---------------->] ? insert_kthread_work+0x40/0x40

lfsck_layout    S 000000000000f908     0  3013      2 0x00000080
 [<---------------->] schedule+0x29/0x70
 [<---------------->] schedule_timeout+0x174/0x2c0
 [<---------------->] ? internal_add_timer+0x70/0x70
 [<---------------->] ? ptlrpc_init_rq_pool+0x110/0x110 [ptlrpc]
 [<---------------->] ptlrpc_set_wait+0x4c0/0x910 [ptlrpc]
 [<---------------->] ? wake_up_state+0x20/0x20
 [<---------------->] lfsck_assistant_notify_others+0x110f/0x1390 [lfsck]
 [<---------------->] ? libcfs_debug_msg+0x57/0x80 [libcfs]
 [<---------------->] ? lfsck_master_engine+0x12f0/0x12f0 [lfsck]
 [<---------------->] lfsck_assistant_engine+0x140/0x20c0 [lfsck]
 [<---------------->] ? check_preempt_curr+0x78/0xa0
 [<---------------->] ? dequeue_task_fair+0x41e/0x660
 [<---------------->] ? __switch_to+0xd7/0x510
 [<---------------->] ? __schedule+0x39d/0x8b0
 [<---------------->] ? lfsck_master_engine+0x12f0/0x12f0 [lfsck]
 [<---------------->] kthread+0xcf/0xe0
 [<---------------->] ? insert_kthread_work+0x40/0x40
 [<---------------->] ret_from_fork+0x58/0x90
 [<---------------->] ? insert_kthread_work+0x40/0x40

tgt_recover_2   S ffff8808067d9fa0     0  2977      2 0x00000080
 [<---------------->] schedule+0x29/0x70
 [<---------------->] lfsck_start+0xf25/0x17c0 [lfsck]
 [<---------------->] ? llog_key_init+0x20/0xd0 [obdclass]
 [<---------------->] ? wake_up_state+0x20/0x20
 [<---------------->] mdd_iocontrol+0x3d3/0x16a0 [mdd]
 [<---------------->] ? osp_key_init+0x3b/0xd0 [osp]
 [<---------------->] mdt_postrecov+0x148/0x1d0 [mdt]
 [<---------------->] mdt_obd_postrecov+0x8d/0xb0 [mdt]
 [<---------------->] target_recovery_thread+0xaf2/0x1360 [ptlrpc]
 [<---------------->] ? __schedule+0x39d/0x8b0
 [<---------------->] ? replay_request_or_update.isra.21+0x8c0/0x8c0 [ptlrpc]
 [<---------------->] kthread+0xcf/0xe0
 [<---------------->] ? insert_kthread_work+0x40/0x40
 [<---------------->] ret_from_fork+0x58/0x90
 [<---------------->] ? insert_kthread_work+0x40/0x40

umount          D ffff8803c81f4008     0  3039   3037 0x00000080
 [<---------------->] schedule_preempt_disabled+0x29/0x70
 [<---------------->] __mutex_lock_slowpath+0xc7/0x1d0
 [<---------------->] mutex_lock+0x1f/0x2f
 [<---------------->] lfsck_stop+0x167/0x4e0 [lfsck]
 [<---------------->] ? default_wake_function+0x12/0x20
 [<---------------->] ? __kmalloc+0x1e3/0x230
 [<---------------->] mdd_iocontrol+0x96/0x16a0 [mdd]
 [<---------------->] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
 [<---------------->] mdt_device_fini+0x71/0x920 [mdt]
 [<---------------->] class_cleanup+0x971/0xcd0 [obdclass]
 [<---------------->] class_process_config+0x19cd/0x23b0 [obdclass]
 [<---------------->] ? libcfs_debug_msg+0x57/0x80 [libcfs]
 [<---------------->] class_manual_cleanup+0x1c6/0x710 [obdclass]
 [<---------------->] server_put_super+0x8de/0xcd0 [obdclass]
 [<---------------->] generic_shutdown_super+0x72/0x100
 [<---------------->] kill_anon_super+0x12/0x20
 [<---------------->] lustre_kill_super+0x32/0x50 [obdclass]
 [<---------------->] deactivate_locked_super+0x49/0x60
 [<---------------->] deactivate_super+0x46/0x60
 [<---------------->] cleanup_mnt+0x3f/0x80
 [<---------------->] __cleanup_mnt+0x12/0x20
 [<---------------->] task_work_run+0xc5/0xf0
 [<---------------->] do_notify_resume+0x92/0xb0
 [<---------------->] int_signal+0x12/0x17
Comment by John Hammond [ 04/Dec/17 ]

Fan Yong, does this look familiar?

Comment by nasf (Inactive) [ 05/Dec/17 ]

In the current (master) implementation the LFSCK notify RPC is interruptible, so would you please try "kill -9 3013" to check whether the layout LFSCK assistant engine can be interrupted? If it can, then lfsck_start can go ahead (exiting with failure), and lfsck_stop will not be blocked.

Comment by Cliff White (Inactive) [ 05/Dec/17 ]

Okay, so you want the lfsck_layout process killed if this happens again?

Comment by nasf (Inactive) [ 06/Dec/17 ]

Yes, please kill the LFSCK assistant thread that is blocked on recovery, such as lfsck_layout or lfsck_namespace. This is expected to work; if it does not, we need to find another solution.

Comment by Cliff White (Inactive) [ 06/Dec/17 ]

It does not work.

[root@soak-8 ~]# ps -ef |grep lfsck
root       6695   6693  0 04:35 ?        00:00:00 bash -c  ??lctl lfsck_start -M soaked-MDT0000 -s 1000 -t all -A ??if [[ $? != 0 ]]; then ???lctl lfsck_start -M soaked-MDT0000 -s 1000 -t namespace,layout -A ???if [[ $? != 0 ]]; then ????lctl lfsck_start -M soaked-MDT0000 -s 1000 -t namespace -A ???fi ??fi
root       6713   6695  0 04:35 ?        00:00:00 lctl lfsck_start -M soaked-MDT0000 -s 1000 -t all -A
root       6714      2  0 04:35 ?        00:00:00 [lfsck]
root       6716      2  0 04:35 ?        00:00:00 [lfsck_layout]
root      53709  53676  0 16:09 pts/0    00:00:00 lctl lfsck_stop -M soaked-MDT0000
root      53748  53717  0 16:10 pts/1    00:00:00 grep --color=auto lfsck
[root@soak-8 ~]# kill -9 6716
[root@soak-8 ~]# kill -9 6716
[root@soak-8 ~]# ps -ef |grep lfsck
root       6695   6693  0 04:35 ?        00:00:00 bash -c  ??lctl lfsck_start -M soaked-MDT0000 -s 1000 -t all -A ??if [[ $? != 0 ]]; then ???lctl lfsck_start -M soaked-MDT0000 -s 1000 -t namespace,layout -A ???if [[ $? != 0 ]]; then ????lctl lfsck_start -M soaked-MDT0000 -s 1000 -t namespace -A ???fi ??fi
root       6713   6695  0 04:35 ?        00:00:00 lctl lfsck_start -M soaked-MDT0000 -s 1000 -t all -A
root       6714      2  0 04:35 ?        00:00:00 [lfsck]
root       6716      2  0 04:35 ?        00:00:00 [lfsck_layout]
root      53709  53676  0 16:09 pts/0    00:00:00 lctl lfsck_stop -M soaked-MDT0000
root      53754  53717  0 16:11 pts/1    00:00:00 grep --color=auto lfsck

The process is unkillable and soak-8 is wedged. I can leave it like this if you want to log in.

Comment by nasf (Inactive) [ 06/Dec/17 ]

I just logged in to soak-8, but it seems to have been rebooted.
This situation is more complex than LU-6684. In the current (master) implementation, lfsck_stop sends a SIGINT signal to the LFSCK engines, so even if the engines are blocked on some RPCs, they can still be woken up and exit. But in this case lfsck_stop was itself blocked on the mutex, so it never had the chance to send the SIGINT signal.

Comment by Cliff White (Inactive) [ 06/Dec/17 ]

Sorry, I'll let you know when the next hang happens. I should have dumped stacks before the reboot.

Comment by Cliff White (Inactive) [ 06/Dec/17 ]

I think I have reproduced the hang; I'll leave it like this until you have a look.
Sigh, it finally exited. Trying again.

Comment by Gerrit Updater [ 07/Dec/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30420
Subject: LU-10321 lfsck: allow to stop the in-starting lfsck
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7817928a62c8a212f5c2b040587042c193bd5c1a

Comment by Gerrit Updater [ 07/Dec/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30422
Subject: LU-10321 lfsck: allow to stop the in-starting lfsck
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 162e2f5c04b32904053ac31795450eb9da521c00

Comment by Andreas Dilger [ 07/Dec/17 ]

I think there is already a "noscrub" mount option to prevent OI scrub from starting.

However, it would also be useful if it were possible to force LFSCK to abort the running scan at startup, in case there is a problem like this at a customer site. Is "noscrub" along with "lctl lfsck_stop --reset" enough to kill the running LFSCK and erase the saved state?

Comment by nasf (Inactive) [ 08/Dec/17 ]

In this case the LFSCK was NOT auto-resumed from the last checkpoint; it was started manually (or by scripts) after the MDT failover. So it is not related to the mount option.

Comment by Cliff White (Inactive) [ 08/Dec/17 ]

Hit the problem again last night. The system completed a failback, the MDTs are all mounted in the proper place, and lfsck is wedged. I will leave the system as-is if you wish to have a look.

Comment by Cliff White (Inactive) [ 08/Dec/17 ]

Dumped lustre logs from all MDS, output in /scratch/results/soak/soak-X-hung.lfsck.Dec08.txt
Dumped stacks on all MDS, output in console logs. Crash-dumped all MDS, results on Spirit.
Restarting.

Comment by Cliff White (Inactive) [ 11/Dec/17 ]

Had a similar hang on soak-3 and soak-6 (OSS) during umount. Dumped lustre logs, attached. Also dumped stack traces and crash-dumped both nodes; output available on Spirit.

Comment by nasf (Inactive) [ 12/Dec/17 ]

cliffw, as you said on Skype, there are multiple issues in the current Spirit tests. The original hang happened on the MDT because of the blocked, uninterruptible LFSCK; patch 30420 resolves that trouble. But that does not mean Spirit will not hang after applying the patch, because other issues may also block the system. The new hang happened on the OST side, which is different from the original LFSCK hang. So would you please check whether the original LFSCK hang issue is resolved or not? Thanks!

Comment by Cliff White (Inactive) [ 12/Dec/17 ]

We have not seen lfsck hangs with MDT failover after the patch. The two hangs occurred with OST failover. After the most recent hang, i was able to reboot/remount the system and then I could start and stop lfsck without the hang.

Comment by nasf (Inactive) [ 13/Dec/17 ]

According to the logs, there were new LFSCKs started just after lfsck_stop during the MDT umount. Then nobody stops the newly triggered LFSCK, so the MDT cannot umount. I will make a patch to resolve the race condition.

Comment by Gerrit Updater [ 13/Dec/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30513
Subject: LU-10321 lfsck: not start lfsck during umount
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6263064b369dd4fddbb0dfa9ab49013a0d791629

Comment by Gerrit Updater [ 13/Dec/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30514
Subject: LU-10321 lfsck: not start lfsck during umount
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 2022d417ddaf663dc7addb5389acade0390996e5

Comment by Cliff White (Inactive) [ 21/Dec/17 ]

We just hit this again on master, so we need a version of the patch for master.

Comment by Cliff White (Inactive) [ 21/Dec/17 ]

Okay, we have an IB build so we will test this.

Comment by Cliff White (Inactive) [ 26/Dec/17 ]

We are still hitting this issue with the patch applied. Soak-10 just hit it; crash dump, vmcore-dmesg, and lustre-log attached.

Comment by Gerrit Updater [ 04/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30420/
Subject: LU-10321 lfsck: allow to stop the in-starting lfsck
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9c9a05fee6c0fce557dfa578ff7116b905d4e00a

Comment by Gerrit Updater [ 04/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30513/
Subject: LU-10321 lfsck: not start lfsck during umount
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0cabe06be1fcc6c3ff690889b1e86bbe4690b854

Comment by nasf (Inactive) [ 04/Jan/18 ]

Patches have landed to master; other LFSCK start trouble will be handled via LU-10419.

Comment by Gerrit Updater [ 02/Feb/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/30422/
Subject: LU-10321 lfsck: allow to stop the in-starting lfsck
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 147b2b4177f9f8d3c4407cd74bb1b97a5c5d7f53

Comment by Gerrit Updater [ 02/Feb/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/30514/
Subject: LU-10321 lfsck: not start lfsck during umount
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: e82d8ad284fd37bbc4863723aa42c927d456fae6

Generated at Sat Feb 10 02:34:00 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.