Details
Bug
Resolution: Fixed
Blocker
Lustre 2.11.0
Soak stress cluster, lustre-master-next-ib build 1
Description
Mounts frequently hang on clients.
Mar 9 18:14:41 soak-36 kernel: INFO: task mount.lustre:2807 blocked for more than 120 seconds.
Mar 9 18:14:41 soak-36 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 9 18:14:41 soak-36 kernel: mount.lustre D ffff88085b7a0000 0 2807 2806 0x00000080
Mar 9 18:14:41 soak-36 kernel: Call Trace:
Mar 9 18:14:41 soak-36 kernel: [<ffffffff816ab8a9>] schedule+0x29/0x70
Mar 9 18:14:41 soak-36 kernel: [<ffffffff816a92b9>] schedule_timeout+0x239/0x2c0
Mar 9 18:14:41 soak-36 kernel: [<ffffffff81050b4c>] ? native_smp_send_reschedule+0x4c/0x70
Mar 9 18:14:41 soak-36 kernel: [<ffffffff810c2358>] ? resched_curr+0xa8/0xc0
Mar 9 18:14:41 soak-36 kernel: [<ffffffff810c30d8>] ? check_preempt_curr+0x78/0xa0
Mar 9 18:14:41 soak-36 kernel: [<ffffffff810c3119>] ? ttwu_do_wakeup+0x19/0xd0
Mar 9 18:14:41 soak-36 kernel: [<ffffffff816abc5d>] wait_for_completion+0xfd/0x140
Mar 9 18:14:42 soak-36 kernel: [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
Mar 9 18:14:42 soak-36 kernel: [<ffffffffc0b28854>] llog_process_or_fork+0x244/0x450 [obdclass]
Mar 9 18:14:42 soak-36 kernel: [<ffffffffc0b28a74>] llog_process+0x14/0x20 [obdclass]
Mar 9 18:14:42 soak-36 kernel: [<ffffffffc0b5b1c5>] class_config_parse_llog+0x125/0x350 [obdclass]
Mar 9 18:14:42 soak-36 kernel: [<ffffffffc06501c8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
Mar 9 18:14:42 soak-36 kernel: [<ffffffffc0652243>] mgc_process_log+0x3d3/0x8b0 [mgc]
Mar 9 18:14:42 soak-36 kernel: [<ffffffffc0b63240>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Mar 9 18:14:42 soak-36 kernel: [<ffffffffc0652968>] ? do_config_log_add+0x248/0x580 [mgc]
Mar 9 18:14:42 soak-36 kernel: [<ffffffffc0653840>] mgc_process_config+0x890/0x13f0 [mgc]
Mar 9 18:14:42 soak-36 kernel: [<ffffffffc0b66c85>] lustre_process_log+0x2d5/0xae0 [obdclass]
Mar 9 18:14:42 soak-36 kernel: [<ffffffffc0855e27>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Mar 9 18:14:43 soak-36 kernel: [<ffffffffc0f3e3bb>] ll_fill_super+0x45b/0x1100 [lustre]
Mar 9 18:14:43 soak-36 kernel: [<ffffffffc0b6caa6>] lustre_fill_super+0x286/0x910 [obdclass]
Mar 9 18:14:43 soak-36 kernel: [<ffffffffc0b6c820>] ? lustre_common_put_super+0x270/0x270 [obdclass]
Mar 9 18:14:43 soak-36 kernel: [<ffffffff81206abd>] mount_nodev+0x4d/0xb0
Mar 9 18:14:43 soak-36 kernel: [<ffffffffc0b64ab8>] lustre_mount+0x38/0x60 [obdclass]
Mar 9 18:14:43 soak-36 kernel: [<ffffffff81207549>] mount_fs+0x39/0x1b0
Mar 9 18:14:43 soak-36 kernel: [<ffffffff81224177>] vfs_kern_mount+0x67/0x110
Mar 9 18:14:43 soak-36 kernel: [<ffffffff81226683>] do_mount+0x233/0xaf0
Mar 9 18:14:43 soak-36 kernel: [<ffffffff811894ee>] ? __get_free_pages+0xe/0x40
Mar 9 18:14:43 soak-36 kernel: [<ffffffff812272c6>] SyS_mount+0x96/0xf0
Mar 9 18:14:43 soak-36 kernel: [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
Mar 9 18:16:43 soak-36 kernel: INFO: task mount.lustre:2807 blocked for more than 120 seconds.
Mar 9 18:16:43 soak-36 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
I dumped the Lustre log during the hang; it is attached. I also crash-dumped the client; the files are available on soak.