Lustre / LU-6659

mounting OSTs hangs during upgrade to Lustre 2.7.0


Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Component/s: None
    • Environment: RHEL6, ldiskfs
    • Severity: 3

    Description

      During my ongoing upgrade tests from Lustre 2.5.3 to Lustre 2.7.0 on our test file system, every time I try to mount the OSTs on the new version some of them just hang (I've waited more than 30 minutes). Mounting them again on a server running 2.5.3 works, although even then some OSTs take minutes to mount.
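
      For reference, this is roughly the step that hangs (a sketch; the device path and mount point are placeholders, not our real ones):

        # on the OSS upgraded to 2.7.0
        mount -t lustre /dev/mapper/ost0002 /mnt/lustre/ost0002   # never returns
        # kernel log watched from a second shell while the mount is stuck
        tail -f /var/log/messages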

      This is in the syslog on the OSS where it hangs.

      May 28 14:10:42 cs04r-sc-oss01-08 kernel: Lustre: 24492:0:(ofd_dev.c:344:ofd_process_config()) For interoperability, skip this ost.quota_type. It is obsolete.
      May 28 14:10:47 cs04r-sc-oss01-08 kernel: Lustre: 24249:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432818642/real 1432818642]  req@ffff88040c137cc0 x1502419185827916/t0(0) o38->play01-MDT0000-lwp-OST0002@172.23.144.5@tcp:12/10 lens 400/544 e 0 to 1 dl 1432818647 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      May 28 14:10:48 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Will be in recovery for at least 5:00, or until 157 clients reconnect
      May 28 14:10:50 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Denying connection for new client play01-MDT0000-mdtlov_UUID (at 172.23.144.6@tcp), waiting for all 157 known clients (11 recovered, 0 in progress, and 0 evicted) to recover in 4:58
      May 28 14:10:58 cs04r-sc-oss01-08 kernel: Lustre: 24553:0:(ofd_dev.c:344:ofd_process_config()) For interoperability, skip this ost.quota_type. It is obsolete.
      May 28 14:11:00 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0005: Will be in recovery for at least 5:00, or until 157 clients reconnect
      May 28 14:11:00 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0005: Denying connection for new client play01-MDT0000-mdtlov_UUID (at 172.23.144.6@tcp), waiting for all 157 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:59
      May 28 14:11:03 cs04r-sc-oss01-08 kernel: Lustre: 24249:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432818658/real 1432818658]  req@ffff88040c137cc0 x1502419185827988/t0(0) o38->play01-MDT0000-lwp-OST0005@172.23.144.5@tcp:12/10 lens 400/544 e 0 to 1 dl 1432818663 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      May 28 14:11:06 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0005: Denying connection for new client play01-MDT0000-mdtlov_UUID (at 172.23.144.6@tcp), waiting for all 157 known clients (66 recovered, 1 in progress, and 0 evicted) to recover in 4:54
      May 28 14:11:07 cs04r-sc-oss01-08 kernel: LustreError: 137-5: play01-OST0004_UUID: not available for connect from 172.23.140.56@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 28 14:11:07 cs04r-sc-oss01-08 kernel: LustreError: Skipped 252 previous similar messages
      May 28 14:11:10 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Client 10603fa0-9f4e-8360-f89f-b42f2cde1a62 (at 172.23.138.67@tcp) reconnecting, waiting for 157 clients in recovery for 4:37
      May 28 14:11:21 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Client 839ba272-6df2-ab36-9f92-fd8cbe53011a (at 172.23.140.14@tcp) reconnecting, waiting for 157 clients in recovery for 4:26
      May 28 14:11:22 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Client f11b080c-8725-8baf-ebf6-052c0150927e (at 172.23.130.3@tcp) reconnecting, waiting for 157 clients in recovery for 4:25
      May 28 14:11:22 cs04r-sc-oss01-08 kernel: Lustre: Skipped 3 previous similar messages
      May 28 14:11:25 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Client f9d06edb-fd05-1d36-5e43-caf7cc1c18c3 (at 172.23.130.17@tcp) reconnecting, waiting for 157 clients in recovery for 4:23
      May 28 14:11:25 cs04r-sc-oss01-08 kernel: Lustre: Skipped 11 previous similar messages
      May 28 14:11:25 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Denying connection for new client play01-MDT0000-mdtlov_UUID (at 172.23.144.6@tcp), waiting for all 157 known clients (140 recovered, 1 in progress, and 0 evicted) to recover in 4:23
      May 28 14:11:28 cs04r-sc-oss01-08 kernel: Lustre: 24626:0:(ofd_dev.c:344:ofd_process_config()) For interoperability, skip this ost.quota_type. It is obsolete.
      May 28 14:11:29 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Client 7e80a2c6-72a7-97f8-f817-5b00e2e09f28 (at 172.23.152.16@tcp) reconnecting, waiting for 157 clients in recovery for 4:19
      May 28 14:11:29 cs04r-sc-oss01-08 kernel: Lustre: Skipped 27 previous similar messages
      May 28 14:11:31 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0003: Will be in recovery for at least 5:00, or until 157 clients reconnect
      May 28 14:11:33 cs04r-sc-oss01-08 kernel: Lustre: 24641:0:(ofd_dev.c:344:ofd_process_config()) For interoperability, skip this ost.quota_type. It is obsolete.
      May 28 14:11:34 cs04r-sc-oss01-08 kernel: Lustre: 24249:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432818688/real 1432818688]  req@ffff88040c1379c0 x1502419185828068/t0(0) o38->play01-MDT0000-lwp-OST0003@172.23.144.5@tcp:12/10 lens 400/544 e 0 to 1 dl 1432818693 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      May 28 14:11:37 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0005: Client 5cf83874-9f99-4ff6-5dd8-1b9995c0c7b7 (at 172.23.152.29@tcp) reconnecting, waiting for 157 clients in recovery for 4:23
      May 28 14:11:37 cs04r-sc-oss01-08 kernel: Lustre: Skipped 39 previous similar messages
      May 28 14:11:38 cs04r-sc-oss01-08 kernel: Lustre: 24249:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432818693/real 1432818693]  req@ffff88020aa093c0 x1502419185828140/t0(0) o38->play01-MDT0000-lwp-OST0004@172.23.144.5@tcp:12/10 lens 400/544 e 0 to 1 dl 1432818698 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      May 28 14:11:50 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0005: Denying connection for new client play01-MDT0000-mdtlov_UUID (at 172.23.144.6@tcp), waiting for all 157 known clients (141 recovered, 2 in progress, and 0 evicted) to recover in 4:09
      May 28 14:11:53 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Client 786c45ae-9a30-e8b3-c300-d2f3149b07d6 (at 172.23.152.40@tcp) reconnecting, waiting for 157 clients in recovery for 3:55
      May 28 14:11:53 cs04r-sc-oss01-08 kernel: Lustre: Skipped 142 previous similar messages
      May 28 14:11:54 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0005: Recovery over after 0:54, of 157 clients 157 recovered and 0 were evicted.
      May 28 14:12:08 cs04r-sc-oss01-08 kernel: Lustre: 24249:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432818718/real 1432818718]  req@ffff8801f36b26c0 x1502419185828164/t0(0) o38->play01-MDT0000-lwp-OST0004@172.23.144.5@tcp:12/10 lens 400/544 e 0 to 1 dl 1432818728 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      May 28 14:12:11 cs04r-sc-oss01-08 kernel: LustreError: 137-5: play01-OST0000_UUID: not available for connect from 172.23.140.12@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 28 14:12:11 cs04r-sc-oss01-08 kernel: LustreError: Skipped 473 previous similar messages
      May 28 14:12:16 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0003: Denying connection for new client play01-MDT0000-mdtlov_UUID (at 172.23.144.6@tcp), waiting for all 157 known clients (137 recovered, 2 in progress, and 0 evicted) to recover in 4:14
      May 28 14:12:16 cs04r-sc-oss01-08 kernel: Lustre: Skipped 3 previous similar messages
      May 28 14:12:16 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0005: deleting orphan objects from 0x0:38179859 to 0x0:38179889
      May 28 14:12:25 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Client e07faa5e-0051-f6ed-49cf-bbf10dc58acd (at 172.23.148.2@tcp) reconnecting, waiting for 157 clients in recovery for 3:23
      May 28 14:12:25 cs04r-sc-oss01-08 kernel: Lustre: Skipped 227 previous similar messages
      May 28 14:12:27 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0003: Recovery over after 0:56, of 157 clients 157 recovered and 0 were evicted.
      May 28 14:12:28 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Recovery over after 1:40, of 157 clients 157 recovered and 0 were evicted.
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: INFO: task mount.lustre:24121 blocked for more than 120 seconds.
      May 28 14:12:37 cs04r-sc-oss01-08 kernel:      Not tainted 2.6.32-504.8.1.el6_lustre.x86_64 #1
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: mount.lustre  D 0000000000000001     0 24121  24117 0x00000080
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: ffff880223131988 0000000000000086 0000000000000000 0000000000000000
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: 0000000000000001 0d657735810a9bea 000003abe95cee6e ffff88040d657735
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: 0000000000000000 000000010039053a ffff88022d0e1af8 ffff880223131fd8
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: Call Trace:
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffff8152bc06>] __mutex_lock_slowpath+0x96/0x210
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffff8152b72b>] mutex_lock+0x2b/0x50
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa2f8e150>] mgc_set_info_async+0x450/0x1940 [mgc]
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa08051c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa16c2395>] server_mgc_set_fs+0x115/0x4e0 [obdclass]
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa16c5d6f>] server_start_targets+0xaf/0x1af0 [obdclass]
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa169bbfb>] ? lustre_start_mgc+0x48b/0x1e00 [obdclass]
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa16cbd35>] server_fill_super+0xbe5/0x1690 [obdclass]
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa169dad0>] lustre_fill_super+0x560/0xa80 [obdclass]
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa169d570>] ? lustre_fill_super+0x0/0xa80 [obdclass]
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffff811917af>] get_sb_nodev+0x5f/0xa0
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa1694b25>] lustre_get_sb+0x25/0x30 [obdclass]
      May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffff81190deb>] vfs_kern_mount+0x7b/0x1b0
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff81190f92>] do_kern_mount+0x52/0x130
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff811b2b9b>] do_mount+0x2fb/0x930
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff811b3260>] sys_mount+0x90/0xe0
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: INFO: task mount.lustre:24125 blocked for more than 120 seconds.
      May 28 14:12:38 cs04r-sc-oss01-08 kernel:      Not tainted 2.6.32-504.8.1.el6_lustre.x86_64 #1
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: mount.lustre  D 0000000000000001     0 24125  24123 0x00000080
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: ffff880439a9b988 0000000000000086 0000000000000000 0000000000000000
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: 0000000000000001 0d657b2d810a9bea 000003abeac37613 ffff88040d657b2d
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: 0000000000000000 0000000100390542 ffff88043b1565f8 ffff880439a9bfd8
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: Call Trace:
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff8152bc06>] __mutex_lock_slowpath+0x96/0x210
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff8152b72b>] mutex_lock+0x2b/0x50
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa2f8e150>] mgc_set_info_async+0x450/0x1940 [mgc]
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa08051c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa16c2395>] server_mgc_set_fs+0x115/0x4e0 [obdclass]
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa16c5d6f>] server_start_targets+0xaf/0x1af0 [obdclass]
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa169bbfb>] ? lustre_start_mgc+0x48b/0x1e00 [obdclass]
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa16cbd35>] server_fill_super+0xbe5/0x1690 [obdclass]
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa169dad0>] lustre_fill_super+0x560/0xa80 [obdclass]
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa169d570>] ? lustre_fill_super+0x0/0xa80 [obdclass]
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff811917af>] get_sb_nodev+0x5f/0xa0
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa1694b25>] lustre_get_sb+0x25/0x30 [obdclass]
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff81190deb>] vfs_kern_mount+0x7b/0x1b0
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff81190f92>] do_kern_mount+0x52/0x130
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff8152d99e>] ? lock_kernel+0x2e/0x50
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: Lustre: 24249:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432818743/real 1432818743]  req@ffff8801f36b2cc0 x1502419185828184/t0(0) o38->play01-MDT0000-lwp-OST0004@172.23.144.5@tcp:12/10 lens 400/544 e 0 to 1 dl 1432818758 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff811b2b9b>] do_mount+0x2fb/0x930
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff811b3260>] sys_mount+0x90/0xe0
      May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      May 28 14:12:41 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: deleting orphan objects from 0x0:29862626 to 0x0:29862801
      May 28 14:12:41 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0003: deleting orphan objects from 0x0:71578434 to 0x0:71578464
      May 28 14:13:08 cs04r-sc-oss01-08 kernel: Lustre: 24249:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432818768/real 1432818768]  req@ffff8801f8154c80 x1502419185828204/t0(0) o38->play01-MDT0000-lwp-OST0004@172.23.144.5@tcp:12/10 lens 400/544 e 0 to 1 dl 1432818788 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      May 28 14:14:08 cs04r-sc-oss01-08 kernel: Lustre: 24249:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432818818/real 1432818818]  req@ffff8801f81540c0 x1502419185828244/t0(0) o38->play01-MDT0000-lwp-OST0004@172.23.144.5@tcp:12/10 lens 400/544 e 0 to 1 dl 1432818848 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      

      3 out of 6 OSTs mounted quickly; the other 3 hang while trying to mount. I've checked the network connectivity between the MDT and the OSS and can't find anything wrong. The checks I ran are sketched below.
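
      The connectivity checks were along these lines (a sketch; the MDS NID is the one from the log above, everything else is generic):

        # NIDs configured on this OSS
        lctl list_nids
        # LNET-level ping of the MDS NID seen in the log
        lctl ping 172.23.144.5@tcp
        # plain TCP-level reachability check as well
        ping -c 3 172.23.144.5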

      To confirm, this is what I've done on our test file system:

      The file system consists of 2 MDS servers (active-passive) and 2 OSSes (normally active-active). We have upgraded the failover MDS and one of the OSSes. Mounting the MGS and MDT on the failover MDS works; mounting all OSTs on the upgraded OSS results in some of them hanging, and retrying after rebooting the server results in different OSTs hanging. Mounting everything on the not-yet-upgraded servers again just works. This is reproducible; the sequence is sketched below.
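
      In outline (hostnames and device paths below are placeholders):

        # on the OSS upgraded to 2.7.0: some of these never return
        mount -t lustre /dev/mapper/ostN /mnt/lustre/ostN
        # after rebooting the upgraded OSS, retrying hangs on different OSTs
        # on the OSS still running 2.5.3, the same targets mount fine:
        mount -t lustre /dev/mapper/ostN /mnt/lustre/ostN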

      I've seen LU-4565 and LU-5797, but both of those are on ZFS and the stack traces look different.

      This is currently stopping us from further testing of Lustre 2.7 prior to a planned maintenance early next week.

      Attachments

        Activity

          People

            Assignee: pjones Peter Jones
            Reporter: ferner Frederik Ferner (Inactive)
            Votes: 1
            Watchers: 7
