LU-4282

some OSTs reported as inactive in lfs df, UP with lctl dl, data not accessible

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • None
    • Affects Version/s: Lustre 2.4.1
    • None
    • Environment: MDS and OSS on Lustre 2.4.1, clients Lustre 1.8.9, all Red Hat Enterprise Linux.
    • Severity: 3
    • 11756

    Description

      As indicated in LU-4242, I now have a problem on our preproduction file system that stops users from accessing the data, prevents the servers from cleanly rebooting, etc., and blocks any further testing.

      After upgrading the servers from 2.3 to 2.4.1 (MDT running build #51 of b2_4 from Jenkins) our clients can no longer fully access this file system. The clients can mount the file system and can access one OST on each of the two OSSes, but the other OSTs are not accessible: they are shown as inactive in the lfs df output and in /proc/fs/lustre/lov/*/target_obd, but are shown as UP in lctl dl.

      [bnh65367@cs04r-sc-serv-07 ~]$ lctl dl |grep play01
       91 UP lov play01-clilov-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 4
       92 UP mdc play01-MDT0000-mdc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
       93 UP osc play01-OST0000-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
       94 UP osc play01-OST0001-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
       95 UP osc play01-OST0002-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
       96 UP osc play01-OST0003-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
       97 UP osc play01-OST0004-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
       98 UP osc play01-OST0005-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
      [bnh65367@cs04r-sc-serv-07 ~]$ lfs df /mnt/play01
      UUID                   1K-blocks        Used   Available Use% Mounted on
      play01-MDT0000_UUID     78636320     3502948    75133372   4% /mnt/play01[MDT:0]
      play01-OST0000_UUID   7691221300  4506865920  3184355380  59% /mnt/play01[OST:0]
      play01-OST0001_UUID   7691221300  3765688064  3925533236  49% /mnt/play01[OST:1]
      play01-OST0002_UUID : inactive device
      play01-OST0003_UUID : inactive device
      play01-OST0004_UUID : inactive device
      play01-OST0005_UUID : inactive device
      
      filesystem summary:  15382442600  8272553984  7109888616  54% /mnt/play01
      
      [bnh65367@cs04r-sc-serv-07 ~]$ cat /proc/fs/lustre/lov/play01-clilov-ffff810076ae2000/target_obd 
      0: play01-OST0000_UUID ACTIVE
      1: play01-OST0001_UUID ACTIVE
      2: play01-OST0002_UUID INACTIVE
      3: play01-OST0003_UUID INACTIVE
      4: play01-OST0004_UUID INACTIVE
      5: play01-OST0005_UUID INACTIVE
      

      As expected, the fail-over OSS for each OST does see connection attempts and (correctly) reports that the OST is not available on that OSS.

      I have confirmed that the OSTs are mounted on the OSSes correctly.
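
      For reference, a check of this kind on an OSS is roughly the following sketch (device names and output are illustrative, not the exact ones used here):

      # On each OSS: list the Lustre targets currently mounted
      mount -t lustre
      # The OST devices should also show as UP in the local device list
      lctl dl | grep obdfilter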

      For the other client that I have tried to bring back, the situation is similar, but the set of OSTs that are inactive is slightly different:

      [bnh65367@cs04r-sc-serv-06 ~]$ lfs df /mnt/play01
      UUID                   1K-blocks        Used   Available Use% Mounted on
      play01-MDT0000_UUID     78636320     3502948    75133372   4% /mnt/play01[MDT:0]
      play01-OST0000_UUID : inactive device
      play01-OST0001_UUID   7691221300  3765688064  3925533236  49% /mnt/play01[OST:1]
      play01-OST0002_UUID   7691221300  1763305508  5927915792  23% /mnt/play01[OST:2]
      play01-OST0003_UUID : inactive device
      play01-OST0004_UUID : inactive device
      play01-OST0005_UUID : inactive device
      
      filesystem summary:  15382442600  5528993572  9853449028  36% /mnt/play01
      
      [bnh65367@cs04r-sc-serv-06 ~]$ 
      

      play01-OST0000, play01-OST0002 and play01-OST0004 are on one OSS;
      play01-OST0001, play01-OST0003 and play01-OST0005 are on a different OSS (but all three on that same OSS).

      I have tested the network and don't see any errors; lnet_selftest between the clients and the OSSes works at line rate, at least for the first client (a 1GigE client...), and nothing obvious shows up on the second client either.
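
      For anyone wanting to repeat the lnet_selftest check, the usual pattern from the Lustre manual is roughly the following sketch (the NIDs are illustrative placeholders, not the real addresses):

      # On a test console node with the lnet_selftest module loaded
      modprobe lnet_selftest
      export LST_SESSION=$$
      lst new_session read_test
      lst add_group clients 172.23.144.101@tcp   # client NID (placeholder)
      lst add_group servers 172.23.144.11@tcp    # OSS NID (placeholder)
      lst add_batch bulk_read
      lst add_test --batch bulk_read --from clients --to servers brw read size=1M
      lst run bulk_read
      lst stat clients servers                   # watch throughput, Ctrl-C to stop
      lst end_session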

      For completeness I should probably mention that all the servers (MDS and OSSes) changed IP addresses at the same time as the upgrade. I have verified that this information is correctly changed on the targets, and both clients have been rebooted multiple times since the IP address change, without any change in behaviour.
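
      The NID information stored on a target can be checked non-destructively with tunefs.lustre; the device path below is illustrative:

      # Print the stored parameters (mgsnode, failover.node, ...) without modifying the target
      tunefs.lustre --dryrun /dev/mapper/play01-ost0000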

      Attachments

        Issue Links

          Activity

            [LU-4282] some OSTs reported as inactive in lfs df, UP with lctl dl, data not accessible

            jfc John Fuchs-Chesney (Inactive) added a comment -

            Frederick – my error. I see this is already resolved, so no action required. ~ jfc.

            jfc John Fuchs-Chesney (Inactive) added a comment -

            Frederick – can I check if this is now resolved? If so, I will mark it as such. Thanks ~ jfc.
            mdiep Minh Diep added a comment -

            dup of LU-4243


            hongchao.zhang Hongchao Zhang added a comment -

            Yes, it should be a duplicate of LU-4243.

            Hi Frederik, could you please try the patch at http://review.whamcloud.com/#/c/8372/? Thanks.
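
            For reference, a Gerrit change like this can normally be fetched for building with something along the following lines (the repository path and patchset number <N> are assumptions, not taken from this ticket):

            # Fetch change 8372 onto a b2_4 checkout; the ref follows the standard
            # Gerrit layout refs/changes/<last two digits>/<change number>/<patchset>
            git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/72/8372/<N>
            git cherry-pick FETCH_HEAD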

            mdiep Minh Diep added a comment -

            I looked at the client log:

            Header size : 8192
            Time : Thu Nov 21 20:51:25 2013
            Number of records: 21
            Target uuid : play01-client
            -----------------------
            #01 (224)marker 4 (flags=0x01, v2.4.1.0) play01-clilov 'lov setup' Thu Nov 21 20:51:25 2013-
            #02 (120)attach 0:play01-clilov 1:lov 2:play01-clilov_UUID
            #03 (168)lov_setup 0:play01-clilov 1:(struct lov_desc)
            uuid=play01-clilov_UUID stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1
            #04 (224)marker 4 (flags=0x02, v2.4.1.0) play01-clilov 'lov setup' Thu Nov 21 20:51:25 2013-
            #05 (224)marker 5 (flags=0x01, v2.4.1.0) play01-clilmv 'lmv setup' Thu Nov 21 20:51:25 2013-
            #06 (120)attach 0:play01-clilmv 1:lmv 2:play01-clilmv_UUID
            #07 (168)lov_setup 0:play01-clilmv 1:(struct lov_desc)
            uuid=play01-clilmv_UUID stripe:cnt=0 size=0 offset=0 pattern=0
            #08 (224)marker 5 (flags=0x02, v2.4.1.0) play01-clilmv 'lmv setup' Thu Nov 21 20:51:25 2013-
            #09 (224)marker 6 (flags=0x01, v2.4.1.0) play01-MDT0000 'add mdc' Thu Nov 21 20:51:25 2013-
            #10 (088)add_uuid nid=172.23.144.5@tcp(0x20000ac179005) 0: 1:172.23.144.5@tcp
            #11 (128)attach 0:play01-MDT0000-mdc 1:mdc 2:play01-clilmv_UUID
            #12 (144)setup 0:play01-MDT0000-mdc 1:play01-MDT0000_UUID 2:172.23.144.5@tcp
            #13 (088)add_uuid nid=172.23.144.5@tcp(0x20000ac179005) 0: 1:172.23.144.5@tcp
            #14 (112)add_conn 0:play01-MDT0000-mdc 1:172.23.144.5@tcp
            #15 (088)add_uuid nid=172.23.144.6@tcp(0x20000ac179006) 0: 1:172.23.144.5@tcp
            #16 (112)add_conn 0:play01-MDT0000-mdc 1:172.23.144.5@tcp
            #17 (160)modify_mdc_tgts add 0:play01-clilmv 1:play01-MDT0000_UUID 2:0 3:1 4:play01-MDT0000-mdc_UUID
            #18 (224)marker 6 (flags=0x02, v2.4.1.0) play01-MDT0000 'add mdc' Thu Nov 21 20:51:25 2013-
            #19 (224)marker 7 (flags=0x01, v2.4.1.0) play01-client 'mount opts' Thu Nov 21 20:51:25 2013-
            #20 (120)mount_option 0: 1:play01-client 2:play01-clilov 3:play01-clilmv
            #21 (224)marker 7 (flags=0x02, v2.4.1.0) play01-client 'mount opts' Thu Nov 21 20:51:25 2013-

            Record #15 (add_uuid) should have 172.23.144.6 as the second NID (the 1: field) instead of *.5.
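
            For comparison, a correctly generated record #15 would be expected to carry the second NID in both fields, roughly (illustrative, based on the NIDs above):

            #15 (088)add_uuid nid=172.23.144.6@tcp(0x20000ac179006) 0: 1:172.23.144.6@tcp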

            This is a dup of LU-4243 IMHO

            mdiep Minh Diep added a comment -

            Hongchao,

            Could you check if this is a dup of LU-4243? Thanks


            ferner Frederik Ferner (Inactive) added a comment -

            Minh,

            I wasn't quite sure which logs you wanted, so I remounted both the MDT and MGS with ldiskfs, copied all files in the CONFIGS directories to a different location and ran llog_reader over them. The result is in the attached file.

            Thanks,
            Frederik


            ferner Frederik Ferner (Inactive) added a comment -

            CONFIGS directories for MDT and MGS, including llog_reader output
            mdiep Minh Diep added a comment -

            This seems to be related to LU-4243. Could you remount the MDT with ldiskfs and dump the config log?
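
            Something along these lines should work (device path and mount point are illustrative):

            # On the server, with the target not mounted as Lustre
            mount -t ldiskfs -o ro /dev/mdtdev /mnt/ldiskfs
            cp -a /mnt/ldiskfs/CONFIGS /tmp/CONFIGS.copy
            umount /mnt/ldiskfs
            # llog_reader prints a config log in readable form, e.g. the client log:
            llog_reader /tmp/CONFIGS.copy/play01-client > /tmp/play01-client.txt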


            ferner Frederik Ferner (Inactive) added a comment -

            Minh,

            It is sort of working. I have one configuration/setup where all OSTs can be accessed by all clients I've tried to bring up. However, if I try to bring any of the OSTs up on a different OSS than the one they are on now, none of my clients even tries to contact that OSS. Recovery doesn't even start...

            So I'd not say everything is working, but the urgency is lower as we have a workaround (which is valid until one of the servers fails...).

            I would appreciate help in fully resolving this. Let me know if there are any diagnostics that I should provide...

            Kind regards,
            Frederik


            People

              Assignee: mdiep Minh Diep
              Reporter: ferner Frederik Ferner (Inactive)
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: