Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version: Lustre 2.10.2
- Environment: Soak stress cluster, MLNX networking stack
- Severity: 3
Description
MDT 2 (soak-10) fails over to soak-11, with errors:
Dec 2 07:27:00 soak-11 kernel: LustreError: 2976:0:(llog_osd.c:960:llog_osd_next_block()) soaked-MDT0003-osp-MDT0002: missed desired record? 2 > 1
Dec 2 07:27:00 soak-11 kernel: LustreError: 2976:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0003-osp-MDT0002 getting update log failed: rc = -2
Dec 2 07:27:00 soak-11 kernel: LustreError: 2976:0:(lod_dev.c:419:lod_sub_recovery_thread()) Skipped 3 previous similar messages
Dec 2 07:27:01 soak-11 kernel: LustreError: 2381:0:(mdt_open.c:1167:mdt_cross_open()) soaked-MDT0002: [0x280002b4c:0xa44:0x0] doesn't exist!: rc = -14
Dec 2 07:27:02 soak-11 kernel: Lustre: 2977:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Dec 2 07:27:02 soak-11 kernel: Lustre: 2977:0:(ldlm_lib.c:2059:target_recovery_overseer()) Skipped 2 previous similar messages
Dec 2 07:27:02 soak-11 kernel: Lustre: soaked-MDT0002: disconnecting 31 stale clients
Soak then attempts a umount, which hangs:
2017-12-02 07:27:16,430:fsmgmt.fsmgmt:INFO Unmounting soaked-MDT0002 on soak-11 ... soak-11
Dec 2 07:30:16 soak-11 kernel: INFO: task umount:3039 blocked for more than 120 seconds.
Dec 2 07:30:16 soak-11 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 2 07:30:16 soak-11 kernel: umount D ffff8803c81f4008 0 3039 3037 0x00000080
Dec 2 07:30:16 soak-11 kernel: ffff8803ce3afa30 0000000000000086 ffff88081fa50000 ffff8803ce3affd8
Dec 2 07:30:16 soak-11 kernel: ffff8803ce3affd8 ffff8803ce3affd8 ffff88081fa50000 ffff8803c81f4000
Dec 2 07:30:16 soak-11 kernel: ffff8803c81f4004 ffff88081fa50000 00000000ffffffff ffff8803c81f4008
Dec 2 07:30:16 soak-11 kernel: Call Trace:
Dec 2 07:30:16 soak-11 kernel: [<ffffffff816aa489>] schedule_preempt_disabled+0x29/0x70
Dec 2 07:30:16 soak-11 kernel: [<ffffffff816a83b7>] __mutex_lock_slowpath+0xc7/0x1d0
Dec 2 07:30:16 soak-11 kernel: [<ffffffff816a77cf>] mutex_lock+0x1f/0x2f
Dec 2 07:30:16 soak-11 kernel: [<ffffffffc14560c7>] lfsck_stop+0x167/0x4e0 [lfsck]
Dec 2 07:30:16 soak-11 kernel: [<ffffffff810c4832>] ? default_wake_function+0x12/0x20
Dec 2 07:30:16 soak-11 kernel: [<ffffffff811e0593>] ? __kmalloc+0x1e3/0x230
Dec 2 07:30:16 soak-11 kernel: [<ffffffffc1625aa6>] mdd_iocontrol+0x96/0x16a0 [mdd]
Dec 2 07:30:17 soak-11 kernel: [<ffffffffc0ec9619>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
Dec 2 07:30:17 soak-11 kernel: [<ffffffffc1500fc1>] mdt_device_fini+0x71/0x920 [mdt]
Dec 2 07:30:17 soak-11 kernel: [<ffffffffc0ed6911>] class_cleanup+0x971/0xcd0 [obdclass]
Dec 2 07:30:17 soak-11 kernel: [<ffffffffc0ed8cad>] class_process_config+0x19cd/0x23b0 [obdclass]
Dec 2 07:30:17 soak-11 kernel: [<ffffffffc0dc6bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Dec 2 07:30:17 soak-11 kernel: [<ffffffffc0ed9856>] class_manual_cleanup+0x1c6/0x710 [obdclass]
Dec 2 07:30:17 soak-11 kernel: [<ffffffffc0f07fee>] server_put_super+0x8de/0xcd0 [obdclass]
Dec 2 07:30:17 soak-11 kernel: [<ffffffff81203692>] generic_shutdown_super+0x72/0x100
Dec 2 07:30:17 soak-11 kernel: [<ffffffff81203a62>] kill_anon_super+0x12/0x20
Dec 2 07:30:17 soak-11 kernel: [<ffffffffc0edc152>] lustre_kill_super+0x32/0x50 [obdclass]
Dec 2 07:30:17 soak-11 kernel: [<ffffffff81203e19>] deactivate_locked_super+0x49/0x60
Dec 2 07:30:17 soak-11 kernel: [<ffffffff81204586>] deactivate_super+0x46/0x60
Dec 2 07:30:17 soak-11 kernel: [<ffffffff812217cf>] cleanup_mnt+0x3f/0x80
Dec 2 07:30:18 soak-11 kernel: [<ffffffff81221862>] __cleanup_mnt+0x12/0x20
Dec 2 07:30:18 soak-11 kernel: [<ffffffff810ad275>] task_work_run+0xc5/0xf0
Dec 2 07:30:18 soak-11 kernel: [<ffffffff8102ab62>] do_notify_resume+0x92/0xb0
Dec 2 07:30:18 soak-11 kernel: [<ffffffff816b533d>] int_signal+0x12/0x17
Dec 2 07:30:19 soak-11 kernel: LustreError: 11-0: soaked-OST0016-osc-MDT0002: operation ost_connect to node 192.168.1.106@o2ib failed: rc = -114
This wedges soak: no further faults are attempted and jobs stop scheduling.
This happened over the weekend. Lustre debug logs were dumped and a crash dump was forced.
Logs and crash info are attached.
Full crash dump is available on Spirit.