Lustre / LU-10123

Repeated timeouts during mount() - MDS and client


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version: Lustre 2.11.0
    • Fix Version: Lustre 2.11.0
    • Environment: Soak cluster, lustre-master build 3654, version=2.10.54_13_g84f690e
    • Severity: 3

    Description

      While attempting to mount the filesystem (configured with DNE), the
      MGT/MDS node hangs on the initial mount.
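      For context, a typical server-side mount on the MGT/MDS node looks like the
      sketch below; the device and mount-point paths are illustrative assumptions,
      not taken from this report (the logs show the target is device dm-0):

```shell
# Mount the combined MGT/MDT target (ldiskfs backend).
# Device and mount point are hypothetical examples.
mount -t lustre /dev/mapper/mdt0 /mnt/lustre/mdt0
```

      In this incident the mount.lustre process never returns from this call and
      is eventually flagged by the kernel hung-task detector, as shown below.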

      Oct 16 16:38:01 soak-8 kernel: LDISKFS-fs warning (device dm-0): ldiskfs_multi_mount_protect:322: MMP interval 42 higher than expected, please wait.
      Oct 16 16:38:44 soak-8 kernel: LDISKFS-fs (dm-0): recovery complete
      Oct 16 16:38:44 soak-8 kernel: LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
      Oct 16 16:38:45 soak-8 kernel: Lustre: MGS: Connection restored to e7ad9453-d165-e9e1-75ba-304460ab678f (at 0@lo)
      Oct 16 16:38:50 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1508171925/real 1508171925]  req@ffff88081c348c00 x1581432791105776/t0(0) o8->soaked-OST0001-osc-MDT0000@192.168.1.103@o2ib:28/4 lens 520/544 e 0 to 1 dl 1508171930 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Oct 16 16:39:10 soak-8 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      Oct 16 16:39:15 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1508171950/real 1508171950]  req@ffff880814820c00 x1581432791105856/t0(0) o8->soaked-OST0000-osc-MDT0000@192.168.1.103@o2ib:28/4 lens 520/544 e 0 to 1 dl 1508171955 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Oct 16 16:39:15 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
      Oct 16 16:39:24 soak-8 sssd_be: GSSAPI client step 1
      Oct 16 16:39:24 soak-8 sssd_be: GSSAPI client step 1
      Oct 16 16:39:24 soak-8 sssd_be: GSSAPI client step 1
      Oct 16 16:39:24 soak-8 sssd_be: GSSAPI client step 2
      Oct 16 16:39:45 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1508171975/real 1508171975]  req@ffff880814821800 x1581432791105936/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib:24/4 lens 520/544 e 0 to 1 dl 1508171985 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Oct 16 16:39:45 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      Oct 16 16:40:00 soak-8 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      Oct 16 16:40:01 soak-8 systemd: Started Session 7 of user root.
      Oct 16 16:40:01 soak-8 systemd: Starting Session 7 of user root.
      Oct 16 16:40:01 soak-8 CROND[13027]: (root) CMD (/usr/lib64/sa/sa1 1 1)
      Oct 16 16:40:10 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1508172000/real 1508172000]  req@ffff880814822d00 x1581432791106048/t0(0) o8->soaked-OST0000-osc-MDT0000@192.168.1.103@o2ib:28/4 lens 520/544 e 0 to 1 dl 1508172010 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Oct 16 16:40:10 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
      Oct 16 16:40:40 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1508172025/real 1508172025]  req@ffff880814823c00 x1581432791106128/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib:24/4 lens 520/544 e 0 to 1 dl 1508172040 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Oct 16 16:40:40 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      Oct 16 16:40:50 soak-8 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      Oct 16 16:40:53 soak-8 kernel: INFO: task mount.lustre:12918 blocked for more than 120 seconds.
      Oct 16 16:40:54 soak-8 kernel: mount.lustre    D 0000000000000001     0 12918  12917 0x00000082
      Oct 16 16:40:54 soak-8 kernel: ffff8804b9bbb7b0 0000000000000082 ffff88083f9e0000 ffff8804b9bbbfd8
      Oct 16 16:40:54 soak-8 kernel: ffff8804b9bbbfd8 ffff8804b9bbbfd8 ffff88083f9e0000 ffff8808203a8150
      Oct 16 16:40:54 soak-8 kernel: 7fffffffffffffff ffff8808203a8148 ffff88083f9e0000 0000000000000001
      Oct 16 16:40:54 soak-8 kernel: Call Trace:
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff816a94c9>] schedule+0x29/0x70
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff816a6fd9>] schedule_timeout+0x239/0x2c0
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff810c9ef9>] ? select_task_rq_fair+0x549/0x700
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff816a987d>] wait_for_completion+0xfd/0x140
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a2f6b4>] llog_process_or_fork+0x244/0x450 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a2f8d4>] llog_process+0x14/0x20 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a5fe45>] class_config_parse_llog+0x125/0x350 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc10235a8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc1026413>] mgc_process_log+0x3d3/0x890 [mgc]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a68550>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc1026b18>] ? do_config_log_add+0x248/0x580 [mgc]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc10279f0>] mgc_process_config+0x890/0x13f0 [mgc]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a6bfee>] lustre_process_log+0x2de/0xaf0 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc090dba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a57339>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a98c32>] server_start_targets+0x1352/0x2a70 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a6ef5d>] ? lustre_start_mgc+0x20d/0x2800 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a68550>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a9b3dd>] server_fill_super+0x108d/0x184c [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a71ff8>] lustre_fill_super+0x328/0x950 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a71cd0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a69e18>] lustre_mount+0x38/0x60 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff81205599>] mount_fs+0x39/0x1b0
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff81222077>] vfs_kern_mount+0x67/0x110
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff81224583>] do_mount+0x233/0xaf0
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff812251c6>] SyS_mount+0x96/0xf0
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b
      

      Power-cycled soak-8; the mount completed after the power cycle. Continued
      bringing up the filesystem, and soak-10 (MDT0003) then hung during mount.

      Oct 16 16:57:37 soak-10 kernel: INFO: task mount.lustre:13350 blocked for more than 120 seconds.
      Oct 16 16:57:37 soak-10 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Oct 16 16:57:38 soak-10 kernel: mount.lustre    D 0000000000000001     0 13350  13349 0x00000082
      Oct 16 16:57:38 soak-10 kernel: ffff880817bdf7b0 0000000000000086 ffff880817783f40 ffff880817bdffd8
      Oct 16 16:57:38 soak-10 kernel: ffff880817bdffd8 ffff880817bdffd8 ffff880817783f40 ffff88082ce87c30
      Oct 16 16:57:38 soak-10 kernel: 7fffffffffffffff ffff88082ce87c28 ffff880817783f40 0000000000000001
      Oct 16 16:57:38 soak-10 kernel: Call Trace:
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff816a94c9>] schedule+0x29/0x70
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff816a6fd9>] schedule_timeout+0x239/0x2c0
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff816a987d>] wait_for_completion+0xfd/0x140
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a406b4>] llog_process_or_fork+0x244/0x450 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a408d4>] llog_process+0x14/0x20 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a70e45>] class_config_parse_llog+0x125/0x350 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc14505a8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc1453413>] mgc_process_log+0x3d3/0x890 [mgc]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc1453b77>] ? do_config_log_add+0x2a7/0x580 [mgc]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc14549f0>] mgc_process_config+0x890/0x13f0 [mgc]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a7cfee>] lustre_process_log+0x2de/0xaf0 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc091b4e4>] ? libcfs_log_return+0x24/0x30 [libcfs]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0aa9c32>] server_start_targets+0x1352/0x2a70 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a80259>] ? lustre_start_mgc+0x509/0x2800 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a79550>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0aac3dd>] server_fill_super+0x108d/0x184c [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a82ff8>] lustre_fill_super+0x328/0x950 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a82cd0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a7ae18>] lustre_mount+0x38/0x60 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff81205599>] mount_fs+0x39/0x1b0
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff81222077>] vfs_kern_mount+0x67/0x110
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff81224583>] do_mount+0x233/0xaf0
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff812251c6>] SyS_mount+0x96/0xf0
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b
      

      Power-cycled that node and continued with the mounts; the second mount
      attempt on soak-10 worked. While mounting the clients, soak-40 hung.

      Oct 16 17:17:56 soak-40 kernel: INFO: task mount.lustre:2933 blocked for more than 120 seconds.
      Oct 16 17:17:56 soak-40 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Oct 16 17:17:56 soak-40 kernel: mount.lustre    D 0000000000000001     0  2933   2932 0x00000080
      Oct 16 17:17:56 soak-40 kernel: ffff88107f827940 0000000000000086 ffff88085c2e5ee0 ffff88107f827fd8
      Oct 16 17:17:56 soak-40 kernel: ffff88107f827fd8 ffff88107f827fd8 ffff88085c2e5ee0 ffff88085cb5c5d0
      Oct 16 17:17:57 soak-40 kernel: 7fffffffffffffff ffff88085cb5c5c8 ffff88085c2e5ee0 0000000000000001
      Oct 16 17:17:57 soak-40 kernel: Call Trace:
      Oct 16 17:17:57 soak-40 kernel: [<ffffffff816a94e9>] schedule+0x29/0x70
      Oct 16 17:17:57 soak-40 kernel: [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
      Oct 16 17:17:57 soak-40 kernel: [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0
      Oct 16 17:17:57 soak-40 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0
      Oct 16 17:17:57 soak-40 kernel: [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0
      Oct 16 17:17:57 soak-40 kernel: [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
      Oct 16 17:17:57 soak-40 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
      Oct 16 17:17:57 soak-40 kernel: [<ffffffffc0ba66b4>] llog_process_or_fork+0x244/0x450 [obdclass]
      Oct 16 17:17:57 soak-40 kernel: [<ffffffffc0ba68d4>] llog_process+0x14/0x20 [obdclass]
      Oct 16 17:17:57 soak-40 kernel: [<ffffffffc0bd6d65>] class_config_parse_llog+0x125/0x350 [obdclass]
      Oct 16 17:17:57 soak-40 kernel: [<ffffffffc0844368>] mgc_process_cfg_log+0x788/0xc40 [mgc]
      Oct 16 17:17:57 soak-40 kernel: [<ffffffffc08471e3>] mgc_process_log+0x3d3/0x890 [mgc]
      Oct 16 17:17:57 soak-40 kernel: [<ffffffffc0847947>] ? do_config_log_add+0x2a7/0x580 [mgc]
      Oct 16 17:17:57 soak-40 kernel: [<ffffffffc08487c0>] mgc_process_config+0x890/0x13f0 [mgc]
      Oct 16 17:17:58 soak-40 kernel: [<ffffffffc0be2d7e>] lustre_process_log+0x2de/0xaf0 [obdclass]
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff816aba0e>] ? _raw_spin_unlock_bh+0x1e/0x20
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff8132318b>] ? fprop_local_init_percpu+0x1b/0x30
      Oct 16 17:17:58 soak-40 kernel: [<ffffffffc0d7b608>] ll_fill_super+0xaf8/0x1220 [lustre]
      Oct 16 17:17:58 soak-40 kernel: [<ffffffffc0be8ce6>] lustre_fill_super+0x286/0x910 [obdclass]
      Oct 16 17:17:58 soak-40 kernel: [<ffffffffc0be8a60>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0
      Oct 16 17:17:58 soak-40 kernel: [<ffffffffc0be0ba8>] lustre_mount+0x38/0x60 [obdclass]
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff81205599>] mount_fs+0x39/0x1b0
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff81222067>] vfs_kern_mount+0x67/0x110
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff81224573>] do_mount+0x233/0xaf0
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff812251b6>] SyS_mount+0x96/0xf0
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b
      Oct 16 17:19:59 soak-40 kernel: INFO: task mount.lustre:2933 blocked for more than 120 seconds.
      Oct 16 17:19:59 soak-40 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      

      Dumped a full (-1 debug mask) lustre-log from the client; attached.
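      The debug dump above was presumably captured with the usual lctl
      procedure, sketched below; the exact output path is a hypothetical
      example, not the filename used here:

```shell
# Enable all Lustre debug flags (-1 = full mask), then dump and clear
# the kernel debug buffer to a file. The path is illustrative.
lctl set_param debug=-1
lctl dk /tmp/lustre-log.soak-40.txt
```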


            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 9
