LU-14695: New OST not visible by MDTs. MGS problem or corrupt catalog llog?

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.6
    • Component/s: None
    • Environment: CentOS 7.9
    • Severity: 3

    Description

      On our Oak filesystem, running Lustre 2.12.6, we have a problem with either the MGS or a corrupt catalog llog somewhere.

      Active OSTs on this filesystem range from OST000c (12) to OST0137 (311). Today, we tried to add OST index 312, oak-OST0138. The new OST is visible from the clients, but not from the MDTs: we have 6 MDTs (oak-MDT0000 to oak-MDT0005).

      Full disclosure: older OSTs 0-11 were previously removed with the experimental lctl del_ost command from LU-7668, roughly as sketched below.
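
      For reference, the removal was done along these lines (a sketch from memory; del_ost is the experimental command from the LU-7668 patch, run on the MGS node, and oak-OST0000 stands in for each of the removed indexes):

      # cancel the configuration llog records for one removed OST
      # (the patch also has a dry-run mode to preview the affected records)
      lctl del_ost --target oak-OST0000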

      The server logs from when I started the new OST are attached as servers-logs.txt.

      What is weird is the following:

      May 20 14:06:05 oak-md1-s2 kernel: Lustre: 108193:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
      

      It also complains about other OSTs (not OST0138):

      May 20 14:06:05 oak-md1-s2 kernel: LustreError: 108193:0:(genops.c:556:class_register_device()) oak-OST0134-osc-MDT0003: already exists, won't add
      May 20 14:06:05 oak-md1-s2 kernel: LustreError: 108193:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
      May 20 14:06:05 oak-md1-s2 kernel: Lustre:    cmd=cf001 0:oak-OST0134-osc-MDT0003  1:osp  2:oak-MDT0003-mdtlov_UUID  
      May 20 14:06:05 oak-md1-s2 kernel: LustreError: 4061:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
      May 20 14:06:05 oak-md1-s2 kernel: Lustre:    cmd=cf001 0:oak-OST0134-osc-MDT0000  1:osp  2:oak-MDT0000-mdtlov_UUID  
      May 20 14:06:07 oak-md2-s2 kernel: Lustre: 14846:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
      May 20 14:06:07 oak-md2-s2 kernel: LustreError: 14846:0:(genops.c:556:class_register_device()) oak-OST0136-osc-MDT0005: already exists, won't add
      May 20 14:06:07 oak-md2-s2 kernel: LustreError: 14846:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
      May 20 14:06:07 oak-md2-s2 kernel: Lustre:    cmd=cf001 0:oak-OST0136-osc-MDT0005  1:osp  2:oak-MDT0005-mdtlov_UUID  
      May 20 14:06:07 oak-md2-s2 kernel: LustreError: 4291:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
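
      Note that rc = -17 is -EEXIST. My (unverified) reading is that the MDT stops processing the MGS configuration log at the first failing record, so everything after that point, including the oak-OST0138 records, would never be applied. To eyeball the records around the failing entries, llog_print should also accept positional [from] [to] record indices, something like:

      # the record range below is hypothetical; adjust it to bracket the
      # failing oak-OST0134 records in the oak-MDT0003 catalog
      lctl --device MGS llog_print oak-MDT0003 3070 3090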
      

      If I check the llog catalogs on the MGS, the new OST oak-OST0138 seems to have been added though:

      Client catalog on MGS:

      [root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-client | grep OST0138
      - { index: 2716, event: attach, device: oak-OST0138-osc, type: osc, UUID: oak-clilov_UUID }
      - { index: 2717, event: setup, device: oak-OST0138-osc, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
      - { index: 2719, event: add_conn, device: oak-OST0138-osc, node: 10.0.2.104@o2ib5 }
      - { index: 2720, event: add_osc, device: oak-clilov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
      

      MDS catalogs on MGS:

      [root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0000 | grep OST0138
      - { index: 2785, event: attach, device: oak-OST0138-osc-MDT0000, type: osc, UUID: oak-MDT0000-mdtlov_UUID }
      - { index: 2786, event: setup, device: oak-OST0138-osc-MDT0000, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
      - { index: 2788, event: add_conn, device: oak-OST0138-osc-MDT0000, node: 10.0.2.104@o2ib5 }
      - { index: 2789, event: add_osc, device: oak-MDT0000-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
      
      [root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0001 | grep OST0138
      - { index: 2930, event: attach, device: oak-OST0138-osc-MDT0001, type: osc, UUID: oak-MDT0001-mdtlov_UUID }
      - { index: 2931, event: setup, device: oak-OST0138-osc-MDT0001, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
      - { index: 2933, event: add_conn, device: oak-OST0138-osc-MDT0001, node: 10.0.2.104@o2ib5 }
      - { index: 2934, event: add_osc, device: oak-MDT0001-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
      
      [root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0002 | grep OST0138
      - { index: 3063, event: attach, device: oak-OST0138-osc-MDT0002, type: osc, UUID: oak-MDT0002-mdtlov_UUID }
      - { index: 3064, event: setup, device: oak-OST0138-osc-MDT0002, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
      - { index: 3066, event: add_conn, device: oak-OST0138-osc-MDT0002, node: 10.0.2.104@o2ib5 }
      - { index: 3067, event: add_osc, device: oak-MDT0002-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
      
      [root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0003 | grep OST0138
      - { index: 3079, event: attach, device: oak-OST0138-osc-MDT0003, type: osc, UUID: oak-MDT0003-mdtlov_UUID }
      - { index: 3080, event: setup, device: oak-OST0138-osc-MDT0003, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
      - { index: 3082, event: add_conn, device: oak-OST0138-osc-MDT0003, node: 10.0.2.104@o2ib5 }
      - { index: 3083, event: add_osc, device: oak-MDT0003-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
      
      [root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0004 | grep OST0138
      - { index: 3255, event: attach, device: oak-OST0138-osc-MDT0004, type: osc, UUID: oak-MDT0004-mdtlov_UUID }
      - { index: 3256, event: setup, device: oak-OST0138-osc-MDT0004, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
      - { index: 3258, event: add_conn, device: oak-OST0138-osc-MDT0004, node: 10.0.2.104@o2ib5 }
      - { index: 3259, event: add_osc, device: oak-MDT0004-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
      
      [root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0005 | grep OST0138
      - { index: 3255, event: attach, device: oak-OST0138-osc-MDT0005, type: osc, UUID: oak-MDT0005-mdtlov_UUID }
      - { index: 3256, event: setup, device: oak-OST0138-osc-MDT0005, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
      - { index: 3258, event: add_conn, device: oak-OST0138-osc-MDT0005, node: 10.0.2.104@o2ib5 }
      - { index: 3259, event: add_osc, device: oak-MDT0005-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
      

      However, this new OST is NOT visible from the MDTs:

      [root@oak-md1-s2 CONFIGS]# llog_reader /mnt/ldiskfs/mdt/0/CONFIGS/oak-MDT0000 | grep 0138
      [root@oak-md1-s2 CONFIGS]# 
      
      [root@oak-md1-s2 ~]# lctl dl | grep OST0138
      [root@oak-md1-s2 ~]# 
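
      Another way to confirm each MDT's view might be the LOV target list (assuming the target_obd parameter is exposed for the mdtlov the same way it is for the client lov):

      # list the OST targets known to MDT0000's LOV layer;
      # index 312 / oak-OST0138 is expected to be missing here
      lctl get_param lov.oak-MDT0000-mdtlov.target_obd | grep 312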
      

       

      From a client, we can see the new OST but it's not filling up, which makes sense if the MDTs are not aware of it:

      oak-OST0133_UUID     108461852548 37418203104 69949699416  35% /oak[OST:307]
      oak-OST0134_UUID     108461852548 38597230784 68770659804  36% /oak[OST:308]
      oak-OST0135_UUID     108461852548 38483562644 68884328272  36% /oak[OST:309]
      oak-OST0136_UUID     108461852548 41312045604 66055819468  39% /oak[OST:310]
      oak-OST0137_UUID     108461852548 43196874132 64170973596  41% /oak[OST:311]
      oak-OST0138_UUID     108461852548        1828 107368054308   1% /oak[OST:312]
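
      A quick client-side check (not tried yet) could be to force a file onto the new OST explicitly; if the MDTs really do not know about index 312, the create should fail or the objects should land elsewhere:

      # ask the MDT to place the file's objects on OST index 312 (oak-OST0138)
      lfs setstripe -o 312 /oak/ost0138-test
      # verify where the objects actually ended up
      lfs getstripe /oak/ost0138-test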
      
      

      Right now, we're up and running in that weird situation... not ideal.

      I'm attaching the catalogs found on the 6 MDTs as oak-MDT-CONFIGS-llog.tar, and a tarball of the CONFIGS directory on the MGS as oak-MGS-CONFIGS.tar.gz.

      Any idea what is wrong or corrupt? We would really appreciate any help to avoid doing a full writeconf.
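
      For reference, the full writeconf we are trying to avoid would go roughly like this (per the Lustre Operations Manual; device paths are placeholders):

      # with all clients unmounted and every target stopped, mark each
      # target so its configuration llogs are regenerated at next mount
      tunefs.lustre --writeconf /dev/<mgs-device>   # MGS first
      tunefs.lustre --writeconf /dev/<mdt-device>   # each of the 6 MDTs
      tunefs.lustre --writeconf /dev/<ost-device>   # each of the ~300 OSTs
      # then restart in order: MGS, MDTs, OSTs, and finally remount clients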

      Attachments

        1. servers-logs.txt
          6 kB
        2. oak-MDT-CONFIGS-llog.tar
          2.71 MB
        3. oak-MGS-CONFIGS.tar.gz
          465 kB
        4. oak-md1-s2_dk_config+info-1.log
          10.35 MB
        5. oak-md1-s2_dk_config+info-2.log
          1.64 MB

People

    Assignee: Mikhail Pershin (tappro)
    Reporter: Stephane Thiell (sthiell)
    Votes: 0
    Watchers: 4
