[LU-6659] mounting OSTs hangs during upgrade to Lustre 2.7.0 Created: 28/May/15  Updated: 21/Mar/18  Resolved: 21/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Frederik Ferner (Inactive) Assignee: Peter Jones
Resolution: Won't Fix Votes: 1
Labels: None
Environment:

RHEL6, ldiskfs


Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

During my ongoing upgrade tests from Lustre 2.5.3 to Lustre 2.7.0 on our test file system, every time I try to mount the OSTs on the new version some of them just hang (I've waited more than 30 minutes). Mounting them on a server still running 2.5.3 works again, although even then some OSTs take minutes to mount.

This is in the syslog on the OSS where it hangs.

May 28 14:10:42 cs04r-sc-oss01-08 kernel: Lustre: 24492:0:(ofd_dev.c:344:ofd_process_config()) For interoperability, skip this ost.quota_type. It is obsolete.
May 28 14:10:47 cs04r-sc-oss01-08 kernel: Lustre: 24249:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432818642/real 1432818642]  req@ffff88040c137cc0 x1502419185827916/t0(0) o38->play01-MDT0000-lwp-OST0002@172.23.144.5@tcp:12/10 lens 400/544 e 0 to 1 dl 1432818647 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 28 14:10:48 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Will be in recovery for at least 5:00, or until 157 clients reconnect
May 28 14:10:50 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Denying connection for new client play01-MDT0000-mdtlov_UUID (at 172.23.144.6@tcp), waiting for all 157 known clients (11 recovered, 0 in progress, and 0 evicted) to recover in 4:58
May 28 14:10:58 cs04r-sc-oss01-08 kernel: Lustre: 24553:0:(ofd_dev.c:344:ofd_process_config()) For interoperability, skip this ost.quota_type. It is obsolete.
May 28 14:11:00 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0005: Will be in recovery for at least 5:00, or until 157 clients reconnect
May 28 14:11:00 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0005: Denying connection for new client play01-MDT0000-mdtlov_UUID (at 172.23.144.6@tcp), waiting for all 157 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:59
May 28 14:11:03 cs04r-sc-oss01-08 kernel: Lustre: 24249:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432818658/real 1432818658]  req@ffff88040c137cc0 x1502419185827988/t0(0) o38->play01-MDT0000-lwp-OST0005@172.23.144.5@tcp:12/10 lens 400/544 e 0 to 1 dl 1432818663 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 28 14:11:06 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0005: Denying connection for new client play01-MDT0000-mdtlov_UUID (at 172.23.144.6@tcp), waiting for all 157 known clients (66 recovered, 1 in progress, and 0 evicted) to recover in 4:54
May 28 14:11:07 cs04r-sc-oss01-08 kernel: LustreError: 137-5: play01-OST0004_UUID: not available for connect from 172.23.140.56@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
May 28 14:11:07 cs04r-sc-oss01-08 kernel: LustreError: Skipped 252 previous similar messages
May 28 14:11:10 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Client 10603fa0-9f4e-8360-f89f-b42f2cde1a62 (at 172.23.138.67@tcp) reconnecting, waiting for 157 clients in recovery for 4:37
May 28 14:11:21 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Client 839ba272-6df2-ab36-9f92-fd8cbe53011a (at 172.23.140.14@tcp) reconnecting, waiting for 157 clients in recovery for 4:26
May 28 14:11:22 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Client f11b080c-8725-8baf-ebf6-052c0150927e (at 172.23.130.3@tcp) reconnecting, waiting for 157 clients in recovery for 4:25
May 28 14:11:22 cs04r-sc-oss01-08 kernel: Lustre: Skipped 3 previous similar messages
May 28 14:11:25 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Client f9d06edb-fd05-1d36-5e43-caf7cc1c18c3 (at 172.23.130.17@tcp) reconnecting, waiting for 157 clients in recovery for 4:23
May 28 14:11:25 cs04r-sc-oss01-08 kernel: Lustre: Skipped 11 previous similar messages
May 28 14:11:25 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Denying connection for new client play01-MDT0000-mdtlov_UUID (at 172.23.144.6@tcp), waiting for all 157 known clients (140 recovered, 1 in progress, and 0 evicted) to recover in 4:23
May 28 14:11:28 cs04r-sc-oss01-08 kernel: Lustre: 24626:0:(ofd_dev.c:344:ofd_process_config()) For interoperability, skip this ost.quota_type. It is obsolete.
May 28 14:11:29 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Client 7e80a2c6-72a7-97f8-f817-5b00e2e09f28 (at 172.23.152.16@tcp) reconnecting, waiting for 157 clients in recovery for 4:19
May 28 14:11:29 cs04r-sc-oss01-08 kernel: Lustre: Skipped 27 previous similar messages
May 28 14:11:31 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0003: Will be in recovery for at least 5:00, or until 157 clients reconnect
May 28 14:11:33 cs04r-sc-oss01-08 kernel: Lustre: 24641:0:(ofd_dev.c:344:ofd_process_config()) For interoperability, skip this ost.quota_type. It is obsolete.
May 28 14:11:34 cs04r-sc-oss01-08 kernel: Lustre: 24249:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432818688/real 1432818688]  req@ffff88040c1379c0 x1502419185828068/t0(0) o38->play01-MDT0000-lwp-OST0003@172.23.144.5@tcp:12/10 lens 400/544 e 0 to 1 dl 1432818693 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 28 14:11:37 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0005: Client 5cf83874-9f99-4ff6-5dd8-1b9995c0c7b7 (at 172.23.152.29@tcp) reconnecting, waiting for 157 clients in recovery for 4:23
May 28 14:11:37 cs04r-sc-oss01-08 kernel: Lustre: Skipped 39 previous similar messages
May 28 14:11:38 cs04r-sc-oss01-08 kernel: Lustre: 24249:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432818693/real 1432818693]  req@ffff88020aa093c0 x1502419185828140/t0(0) o38->play01-MDT0000-lwp-OST0004@172.23.144.5@tcp:12/10 lens 400/544 e 0 to 1 dl 1432818698 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 28 14:11:50 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0005: Denying connection for new client play01-MDT0000-mdtlov_UUID (at 172.23.144.6@tcp), waiting for all 157 known clients (141 recovered, 2 in progress, and 0 evicted) to recover in 4:09
May 28 14:11:53 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Client 786c45ae-9a30-e8b3-c300-d2f3149b07d6 (at 172.23.152.40@tcp) reconnecting, waiting for 157 clients in recovery for 3:55
May 28 14:11:53 cs04r-sc-oss01-08 kernel: Lustre: Skipped 142 previous similar messages
May 28 14:11:54 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0005: Recovery over after 0:54, of 157 clients 157 recovered and 0 were evicted.
May 28 14:12:08 cs04r-sc-oss01-08 kernel: Lustre: 24249:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432818718/real 1432818718]  req@ffff8801f36b26c0 x1502419185828164/t0(0) o38->play01-MDT0000-lwp-OST0004@172.23.144.5@tcp:12/10 lens 400/544 e 0 to 1 dl 1432818728 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 28 14:12:11 cs04r-sc-oss01-08 kernel: LustreError: 137-5: play01-OST0000_UUID: not available for connect from 172.23.140.12@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
May 28 14:12:11 cs04r-sc-oss01-08 kernel: LustreError: Skipped 473 previous similar messages
May 28 14:12:16 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0003: Denying connection for new client play01-MDT0000-mdtlov_UUID (at 172.23.144.6@tcp), waiting for all 157 known clients (137 recovered, 2 in progress, and 0 evicted) to recover in 4:14
May 28 14:12:16 cs04r-sc-oss01-08 kernel: Lustre: Skipped 3 previous similar messages
May 28 14:12:16 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0005: deleting orphan objects from 0x0:38179859 to 0x0:38179889
May 28 14:12:25 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Client e07faa5e-0051-f6ed-49cf-bbf10dc58acd (at 172.23.148.2@tcp) reconnecting, waiting for 157 clients in recovery for 3:23
May 28 14:12:25 cs04r-sc-oss01-08 kernel: Lustre: Skipped 227 previous similar messages
May 28 14:12:27 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0003: Recovery over after 0:56, of 157 clients 157 recovered and 0 were evicted.
May 28 14:12:28 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: Recovery over after 1:40, of 157 clients 157 recovered and 0 were evicted.
May 28 14:12:37 cs04r-sc-oss01-08 kernel: INFO: task mount.lustre:24121 blocked for more than 120 seconds.
May 28 14:12:37 cs04r-sc-oss01-08 kernel:      Not tainted 2.6.32-504.8.1.el6_lustre.x86_64 #1
May 28 14:12:37 cs04r-sc-oss01-08 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 28 14:12:37 cs04r-sc-oss01-08 kernel: mount.lustre  D 0000000000000001     0 24121  24117 0x00000080
May 28 14:12:37 cs04r-sc-oss01-08 kernel: ffff880223131988 0000000000000086 0000000000000000 0000000000000000
May 28 14:12:37 cs04r-sc-oss01-08 kernel: 0000000000000001 0d657735810a9bea 000003abe95cee6e ffff88040d657735
May 28 14:12:37 cs04r-sc-oss01-08 kernel: 0000000000000000 000000010039053a ffff88022d0e1af8 ffff880223131fd8
May 28 14:12:37 cs04r-sc-oss01-08 kernel: Call Trace:
May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffff8152bc06>] __mutex_lock_slowpath+0x96/0x210
May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffff8152b72b>] mutex_lock+0x2b/0x50
May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa2f8e150>] mgc_set_info_async+0x450/0x1940 [mgc]
May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa08051c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa16c2395>] server_mgc_set_fs+0x115/0x4e0 [obdclass]
May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa16c5d6f>] server_start_targets+0xaf/0x1af0 [obdclass]
May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa169bbfb>] ? lustre_start_mgc+0x48b/0x1e00 [obdclass]
May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa16cbd35>] server_fill_super+0xbe5/0x1690 [obdclass]
May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa169dad0>] lustre_fill_super+0x560/0xa80 [obdclass]
May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa169d570>] ? lustre_fill_super+0x0/0xa80 [obdclass]
May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffff811917af>] get_sb_nodev+0x5f/0xa0
May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffffa1694b25>] lustre_get_sb+0x25/0x30 [obdclass]
May 28 14:12:37 cs04r-sc-oss01-08 kernel: [<ffffffff81190deb>] vfs_kern_mount+0x7b/0x1b0
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff81190f92>] do_kern_mount+0x52/0x130
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff811b2b9b>] do_mount+0x2fb/0x930
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff811b3260>] sys_mount+0x90/0xe0
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
May 28 14:12:38 cs04r-sc-oss01-08 kernel: INFO: task mount.lustre:24125 blocked for more than 120 seconds.
May 28 14:12:38 cs04r-sc-oss01-08 kernel:      Not tainted 2.6.32-504.8.1.el6_lustre.x86_64 #1
May 28 14:12:38 cs04r-sc-oss01-08 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 28 14:12:38 cs04r-sc-oss01-08 kernel: mount.lustre  D 0000000000000001     0 24125  24123 0x00000080
May 28 14:12:38 cs04r-sc-oss01-08 kernel: ffff880439a9b988 0000000000000086 0000000000000000 0000000000000000
May 28 14:12:38 cs04r-sc-oss01-08 kernel: 0000000000000001 0d657b2d810a9bea 000003abeac37613 ffff88040d657b2d
May 28 14:12:38 cs04r-sc-oss01-08 kernel: 0000000000000000 0000000100390542 ffff88043b1565f8 ffff880439a9bfd8
May 28 14:12:38 cs04r-sc-oss01-08 kernel: Call Trace:
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff8152bc06>] __mutex_lock_slowpath+0x96/0x210
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff8152b72b>] mutex_lock+0x2b/0x50
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa2f8e150>] mgc_set_info_async+0x450/0x1940 [mgc]
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa08051c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa16c2395>] server_mgc_set_fs+0x115/0x4e0 [obdclass]
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa16c5d6f>] server_start_targets+0xaf/0x1af0 [obdclass]
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa169bbfb>] ? lustre_start_mgc+0x48b/0x1e00 [obdclass]
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa16cbd35>] server_fill_super+0xbe5/0x1690 [obdclass]
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa169dad0>] lustre_fill_super+0x560/0xa80 [obdclass]
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa169d570>] ? lustre_fill_super+0x0/0xa80 [obdclass]
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff811917af>] get_sb_nodev+0x5f/0xa0
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffffa1694b25>] lustre_get_sb+0x25/0x30 [obdclass]
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff81190deb>] vfs_kern_mount+0x7b/0x1b0
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff81190f92>] do_kern_mount+0x52/0x130
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff8152d99e>] ? lock_kernel+0x2e/0x50
May 28 14:12:38 cs04r-sc-oss01-08 kernel: Lustre: 24249:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432818743/real 1432818743]  req@ffff8801f36b2cc0 x1502419185828184/t0(0) o38->play01-MDT0000-lwp-OST0004@172.23.144.5@tcp:12/10 lens 400/544 e 0 to 1 dl 1432818758 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff811b2b9b>] do_mount+0x2fb/0x930
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff811b3260>] sys_mount+0x90/0xe0
May 28 14:12:38 cs04r-sc-oss01-08 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
May 28 14:12:41 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0002: deleting orphan objects from 0x0:29862626 to 0x0:29862801
May 28 14:12:41 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0003: deleting orphan objects from 0x0:71578434 to 0x0:71578464
May 28 14:13:08 cs04r-sc-oss01-08 kernel: Lustre: 24249:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432818768/real 1432818768]  req@ffff8801f8154c80 x1502419185828204/t0(0) o38->play01-MDT0000-lwp-OST0004@172.23.144.5@tcp:12/10 lens 400/544 e 0 to 1 dl 1432818788 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 28 14:14:08 cs04r-sc-oss01-08 kernel: Lustre: 24249:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432818818/real 1432818818]  req@ffff8801f81540c0 x1502419185828244/t0(0) o38->play01-MDT0000-lwp-OST0004@172.23.144.5@tcp:12/10 lens 400/544 e 0 to 1 dl 1432818848 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1

3 out of 6 OSTs mounted quickly; the other 3 hang while trying to mount. I've checked the network connectivity between MDT and OSS and can't find anything wrong...
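
For reference, the connectivity check mentioned above amounts to verifying LNet reachability between the servers with the standard lctl tools; a minimal sketch, run on the OSS and using the MDS/MGS NID that appears in the logs (172.23.144.5@tcp), would be something like:

# show the NIDs configured on this OSS
lctl list_nids
# verify LNet reachability of the MDS/MGS node seen in the logs
lctl ping 172.23.144.5@tcp
# list local obd devices and their setup state
lctl dl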

To confirm, this is what I've done on our test file system:

The file system consists of two MDS servers (active-passive) and two OSSes (normally active-active). We have upgraded the failover MDS and one of the OSSes. Mounting the MGS and MDT on the failover MDS works; mounting all OSTs on the upgraded OSS results in some of them hanging, and retrying after rebooting the server results in different ones hanging. Mounting everything on the not-yet-upgraded servers again just works. This is reproducible.
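
For context, the mount sequence being tested looks roughly like the following; a minimal sketch, where the MGS/MDT device paths are taken from the tunefs.lustre output further down and the OST device path and all mount points are invented for illustration:

# on the upgraded (failover) MDS: mount the MGS first, then the MDT
mount -t lustre /dev/mapper/vg_play01-mgs /mnt/lustre/mgs
mount -t lustre /dev/mapper/vg_play01-mdt /mnt/lustre/mdt
# on the upgraded OSS: mount each OST in turn; this is where some mounts hang
mount -t lustre /dev/mapper/vg_play01-ost0000 /mnt/lustre/ost0000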

I've seen LU-4565 and LU-5797, but both of those involve ZFS and the stack traces look different.

This is currently stopping us from further testing on Lustre 2.7 prior to a planned maintenance early next week.



 Comments   
Comment by Oleg Drokin [ 28/May/15 ]

I don't see it mentioned: have you tried upgrading the primary MDS and running with that? Does it still hang when you mount the upgraded OSTs?
And the other way around: if you run with the non-upgraded failover MDS, does mounting the upgraded OSTs still hang?

Comment by Frederik Ferner (Inactive) [ 28/May/15 ]

No, I've not tried upgrading the primary MDS yet to keep a simpler fallback path initially. I can try this.

Mounting upgraded OSTs with a non-upgraded MDS: is this supported? My interpretation of the upgrade instructions in the manual is that to go from 2.x to 2.(x+n) all servers need to be upgraded at the same time. If running a 2.7 OSS with a 2.5 MDT (during the upgrade phase) is possible, I can certainly try this as well. FWIW, I know that in the past (with 2.5.x everywhere) we have successfully used each of the MDS servers.

Comment by Oleg Drokin [ 28/May/15 ]

It's OK to upgrade MDTs and OSTs separately, so it's fine to run an older MDS with newer OSSes.

I am just trying to see if the problem you are having is due to some 2.7 MGS changes that make the new server not work well enough when it is not on the primary node.

Comment by Frederik Ferner (Inactive) [ 28/May/15 ]

Thanks for confirming that older MDS with newer OSS should be fine.

As you mentioned "not on primary node": note that we have our systems configured without a primary node; see below for the tunefs.lustre output for the MDT and MGS (the OSTs are similar).

[bnh65367@cs04r-sc-mds02-04 tmp]$ sudo tunefs.lustre --print /dev/mapper/vg_play01-mdt 
checking for existing Lustre data: found
Reading CONFIGS/mountdata

   Read previous values:
Target:     play01-MDT0000
Index:      0
Lustre FS:  play01
Mount type: ldiskfs
Flags:      0x1001
              (MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro,acl
Parameters: mgsnode=172.23.144.5@tcp mgsnode=172.23.144.6@tcp failover.node=172.23.144.5@tcp failover.node=172.23.144.6@tcp mdt.quota_type=ug mdt.group_upcall=/usr/sbin/l_getgroups


   Permanent disk data:
Target:     play01-MDT0000
Index:      0
Lustre FS:  play01
Mount type: ldiskfs
Flags:      0x1001
              (MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro,acl
Parameters: mgsnode=172.23.144.5@tcp mgsnode=172.23.144.6@tcp failover.node=172.23.144.5@tcp failover.node=172.23.144.6@tcp mdt.quota_type=ug mdt.group_upcall=/usr/sbin/l_getgroups

exiting before disk write.
[bnh65367@cs04r-sc-mds02-04 tmp]$ sudo tunefs.lustre --print /dev/mapper/vg_play01-mgs 
checking for existing Lustre data: found
Reading CONFIGS/mountdata

   Read previous values:
Target:     MGS
Index:      unassigned
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x4
              (MGS )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:


   Permanent disk data:
Target:     MGS
Index:      unassigned
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x4
              (MGS )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:

exiting before disk write.
[bnh65367@cs04r-sc-mds02-04 tmp]$ 
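
For reference, the no_primnode flag shown above is what mkfs.lustre/tunefs.lustre record when a target's failover NIDs are given with --servicenode (no node designated as primary) rather than --failnode. A minimal sketch of how such a configuration is typically written, using the MDT device and NIDs from the output above and intended purely as illustration (not a command that was run here):

# declare both MDS NIDs as equal service nodes; no primary is recorded
tunefs.lustre \
  --servicenode=172.23.144.5@tcp \
  --servicenode=172.23.144.6@tcp \
  --mgsnode=172.23.144.5@tcp --mgsnode=172.23.144.6@tcp \
  /dev/mapper/vg_play01-mdt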

I've upgraded at least Lustre on the second MDS and mounted the MGS and MDT there. In this combination I can mount all OSTs on the one upgraded OSS (including the ones failing over to it), so the test file system is now up and running on Lustre 2.7.

I'll leave this running overnight with some tests and will attempt to fail over to the other MDS tomorrow.

Comment by Frederik Ferner (Inactive) [ 29/May/15 ]

In all my tests today, including switching back and forth between MDS nodes with full remounts of everything, I have not been able to reproduce this issue. It seems to have been a problem only when first mounting with the new version...

Comment by Frederik Ferner (Inactive) [ 03/Jun/15 ]

As another update, I've now upgraded our two production file systems without running into this issue. (On both file systems, I initially mounted on the primary MDS, and it stayed there while bringing the OSTs up for the first time.)

Comment by Peter Jones [ 03/Jun/15 ]

Frederik

That's great! So you are in full production using 2.7 - servers and clients?

Peter

Comment by Eric Kolb [ 14/Aug/15 ]

For what it's worth, we recently encountered this same issue during an attempted production upgrade from 2.5.3 to 2.7.0. Counter-intuitively, our development upgrade proceeded without a hitch.

Should add this was with ZFS.

Comment by Peter Jones [ 21/Mar/18 ]

I don't think that this is something that affects current versions.
