Details
- Type: Bug
- Resolution: Duplicate
- Priority: Blocker
- Fix Version/s: None
- Affects Version/s: Lustre 2.4.1
- Labels: None
- Environment: MDS and OSS on Lustre 2.4.1, clients Lustre 1.8.9, all Red Hat Enterprise Linux.
- Severity: 3
- Rank: 11756
Description
As indicated in LU-4242, I now have a problem on our preproduction file system that stops users from accessing their data, prevents the servers from rebooting cleanly, and blocks any further testing.
After upgrading the servers from 2.3 to 2.4.1 (MDT build #51 of b2_4 from Jenkins), our clients can no longer fully access this file system. The clients can mount the file system and can access one OST on each of the two OSSes, but the remaining OSTs are not accessible: they are shown as inactive in the lfs df output and in /proc/fs/lustre/lov/*/target_obd, yet appear as UP in lctl dl.
[bnh65367@cs04r-sc-serv-07 ~]$ lctl dl |grep play01
 91 UP lov play01-clilov-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 4
 92 UP mdc play01-MDT0000-mdc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
 93 UP osc play01-OST0000-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
 94 UP osc play01-OST0001-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
 95 UP osc play01-OST0002-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
 96 UP osc play01-OST0003-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
 97 UP osc play01-OST0004-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
 98 UP osc play01-OST0005-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5

[bnh65367@cs04r-sc-serv-07 ~]$ lfs df /mnt/play01
UUID                  1K-blocks         Used    Available Use% Mounted on
play01-MDT0000_UUID    78636320      3502948     75133372   4% /mnt/play01[MDT:0]
play01-OST0000_UUID  7691221300   4506865920   3184355380  59% /mnt/play01[OST:0]
play01-OST0001_UUID  7691221300   3765688064   3925533236  49% /mnt/play01[OST:1]
play01-OST0002_UUID : inactive device
play01-OST0003_UUID : inactive device
play01-OST0004_UUID : inactive device
play01-OST0005_UUID : inactive device
filesystem summary: 15382442600   8272553984   7109888616  54% /mnt/play01

[bnh65367@cs04r-sc-serv-07 ~]$ cat /proc/fs/lustre/lov/play01-clilov-ffff810076ae2000/target_obd
0: play01-OST0000_UUID ACTIVE
1: play01-OST0001_UUID ACTIVE
2: play01-OST0002_UUID INACTIVE
3: play01-OST0003_UUID INACTIVE
4: play01-OST0004_UUID INACTIVE
5: play01-OST0005_UUID INACTIVE
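For what it is worth, this is roughly what I have been looking at on the client for one of the inactive OSCs; it is a sketch rather than pasted output, the exact /proc entry names are from memory and may differ slightly on the 1.8.9 clients, and the device number 95 is the one reported for play01-OST0002-osc in the lctl dl output above:

# connection and target UUID the client currently has for an inactive OST
cat /proc/fs/lustre/osc/play01-OST0002-osc-*/ost_conn_uuid
cat /proc/fs/lustre/osc/play01-OST0002-osc-*/ost_server_uuid

# attempt to re-activate the OSC device by hand (95 = play01-OST0002-osc above)
lctl --device 95 activate

The ost_conn_uuid output at least shows which server NID the client is trying to reach, which seemed worth checking given the IP address change mentioned further down.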
As expected, the fail-over OSS for each OST does see connection attempts and correctly reports that the OST in question is not available on that OSS.
I have confirmed that the OSTs are mounted on the OSSes correctly.
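For reference, the check I mean is essentially the following, run on each OSS; no output is pasted here and the grep pattern is just illustrative:

# on each OSS: confirm the OST targets are mounted and registered
mount -t lustre
lctl dl | grep play01-OST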
On the other client that I have tried to bring back, the situation is similar, but the set of inactive OSTs is slightly different:
[bnh65367@cs04r-sc-serv-06 ~]$ lfs df /mnt/play01
UUID                  1K-blocks         Used    Available Use% Mounted on
play01-MDT0000_UUID    78636320      3502948     75133372   4% /mnt/play01[MDT:0]
play01-OST0000_UUID : inactive device
play01-OST0001_UUID  7691221300   3765688064   3925533236  49% /mnt/play01[OST:1]
play01-OST0002_UUID  7691221300   1763305508   5927915792  23% /mnt/play01[OST:2]
play01-OST0003_UUID : inactive device
play01-OST0004_UUID : inactive device
play01-OST0005_UUID : inactive device
filesystem summary: 15382442600   5528993572   9853449028  36% /mnt/play01
[bnh65367@cs04r-sc-serv-06 ~]$
play01-OST0000, play01-OST0002 and play01-OST0004 are on one OSS;
play01-OST0001, play01-OST0003 and play01-OST0005 are on a different OSS (but all three on that same one).
I have tested the network and do not see any errors. lnet_selftest between the clients and the OSSes runs at line rate, at least for the first client (a 1GigE client); nothing obvious shows up for the second client either.
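The lnet_selftest run was essentially the standard bulk read test from the manual, along the lines of the session below; the NIDs are placeholders, not our real addresses:

# on all nodes involved (console, client, OSS): load the selftest module first
modprobe lnet_selftest

export LST_SESSION=$$
lst new_session read_test
lst add_group clients 192.168.1.10@tcp      # placeholder client NID
lst add_group servers 192.168.1.20@tcp      # placeholder OSS NID
lst add_batch bulk_read
lst add_test --batch bulk_read --from clients --to servers brw read size=1M
lst run bulk_read
lst stat clients servers                    # interrupt with ctrl-c once the rates look stable
lst stop bulk_read
lst end_session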
For completeness I should mention that all the servers (MDS and OSSes) changed IP addresses at the same time as the upgrade. I have verified that this information was correctly updated on the targets, and both clients have been rebooted multiple times since the IP address change, without any change in behaviour.
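For what it is worth, the verification on the targets amounts to something like the following, with the target unmounted; the device path here is an example, not our real one:

# on the server, with the target unmounted: dump the parameters stored on the target
tunefs.lustre --dryrun /dev/mapper/ost0000      # example device path

The --dryrun output should list the current mgsnode and failover NIDs; if they were still wrong, my understanding is they could be rewritten with tunefs.lustre --erase-params --mgsnode=<NID> --writeconf, but that has not been necessary here.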
Attachments
Issue Links
- is related to: LU-4242 "(mdt_open.c:1685:mdt_reint_open()) LBUG" (Resolved)
Comments
Frederick – my error. I see this is already resolved, so no action required. ~ jfc.