Lustre - LU-14695

New OST not visible by MDTs. MGS problem or corrupt catalog llog?

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.6
    • Labels: None
    • Environment: CentOS 7.9
    • Severity: 3

    Description

      On our Oak filesystem, running 2.12.6, we have a problem with either the MGS or a corrupt catalog somewhere.

      Active OSTs on this filesystem range from OST000c (12) to OST0137 (311). Today, we tried to add OST index 312, oak-OST0138. The new OST is visible from clients, but not from the MDTs: we have 6 MDTs (oak-MDT0000 to oak-MDT0005).
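      Note that the four digits in a target name are the OST index in hexadecimal, which is why OST000c is index 12 and OST0137 is index 311. A tiny sketch (the `ost_index` helper is hypothetical, just for illustration) confirms the mapping:

      ```python
      # Lustre target names encode the OST index as 4 hex digits:
      # oak-OST0138 -> 0x138 = 312 decimal.
      def ost_index(target: str) -> int:
          """Return the decimal index from a name like 'oak-OST0138'."""
          return int(target.rsplit("OST", 1)[1], 16)

      print(ost_index("oak-OST000c"))  # 12
      print(ost_index("oak-OST0137"))  # 311
      print(ost_index("oak-OST0138"))  # 312
      ```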

      Full disclosure: older OSTs 0-11 were previously removed with the experimental lctl del_ost command from LU-7668.

      The server logs from when I started the new OST are available in servers-logs.txt.

      What is weird is the following:

      May 20 14:06:05 oak-md1-s2 kernel: Lustre: 108193:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
      

      and that it complains about other OSTs (not OST0138):

      May 20 14:06:05 oak-md1-s2 kernel: LustreError: 108193:0:(genops.c:556:class_register_device()) oak-OST0134-osc-MDT0003: already exists, won't add
      May 20 14:06:05 oak-md1-s2 kernel: LustreError: 108193:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
      May 20 14:06:05 oak-md1-s2 kernel: Lustre:    cmd=cf001 0:oak-OST0134-osc-MDT0003  1:osp  2:oak-MDT0003-mdtlov_UUID  
      May 20 14:06:05 oak-md1-s2 kernel: LustreError: 4061:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
      May 20 14:06:05 oak-md1-s2 kernel: Lustre:    cmd=cf001 0:oak-OST0134-osc-MDT0000  1:osp  2:oak-MDT0000-mdtlov_UUID  
      May 20 14:06:07 oak-md2-s2 kernel: Lustre: 14846:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
      May 20 14:06:07 oak-md2-s2 kernel: LustreError: 14846:0:(genops.c:556:class_register_device()) oak-OST0136-osc-MDT0005: already exists, won't add
      May 20 14:06:07 oak-md2-s2 kernel: LustreError: 14846:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
      May 20 14:06:07 oak-md2-s2 kernel: Lustre:    cmd=cf001 0:oak-OST0136-osc-MDT0005  1:osp  2:oak-MDT0005-mdtlov_UUID  
      May 20 14:06:07 oak-md2-s2 kernel: LustreError: 4291:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
      

      If I check the llog catalogs on the MGS, the new OST oak-OST0138 seems to have been added though:

      Client catalog on MGS:

      [root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-client | grep OST0138
      - { index: 2716, event: attach, device: oak-OST0138-osc, type: osc, UUID: oak-clilov_UUID }
      - { index: 2717, event: setup, device: oak-OST0138-osc, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
      - { index: 2719, event: add_conn, device: oak-OST0138-osc, node: 10.0.2.104@o2ib5 }
      - { index: 2720, event: add_osc, device: oak-clilov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
      

      MDS catalogs on MGS:

      [root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0000 | grep OST0138
      - { index: 2785, event: attach, device: oak-OST0138-osc-MDT0000, type: osc, UUID: oak-MDT0000-mdtlov_UUID }
      - { index: 2786, event: setup, device: oak-OST0138-osc-MDT0000, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
      - { index: 2788, event: add_conn, device: oak-OST0138-osc-MDT0000, node: 10.0.2.104@o2ib5 }
      - { index: 2789, event: add_osc, device: oak-MDT0000-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
      
      [root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0001 | grep OST0138
      - { index: 2930, event: attach, device: oak-OST0138-osc-MDT0001, type: osc, UUID: oak-MDT0001-mdtlov_UUID }
      - { index: 2931, event: setup, device: oak-OST0138-osc-MDT0001, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
      - { index: 2933, event: add_conn, device: oak-OST0138-osc-MDT0001, node: 10.0.2.104@o2ib5 }
      - { index: 2934, event: add_osc, device: oak-MDT0001-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
      
      [root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0002 | grep OST0138
      - { index: 3063, event: attach, device: oak-OST0138-osc-MDT0002, type: osc, UUID: oak-MDT0002-mdtlov_UUID }
      - { index: 3064, event: setup, device: oak-OST0138-osc-MDT0002, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
      - { index: 3066, event: add_conn, device: oak-OST0138-osc-MDT0002, node: 10.0.2.104@o2ib5 }
      - { index: 3067, event: add_osc, device: oak-MDT0002-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
      
      [root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0003 | grep OST0138
      - { index: 3079, event: attach, device: oak-OST0138-osc-MDT0003, type: osc, UUID: oak-MDT0003-mdtlov_UUID }
      - { index: 3080, event: setup, device: oak-OST0138-osc-MDT0003, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
      - { index: 3082, event: add_conn, device: oak-OST0138-osc-MDT0003, node: 10.0.2.104@o2ib5 }
      - { index: 3083, event: add_osc, device: oak-MDT0003-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
      
      [root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0004 | grep OST0138
      - { index: 3255, event: attach, device: oak-OST0138-osc-MDT0004, type: osc, UUID: oak-MDT0004-mdtlov_UUID }
      - { index: 3256, event: setup, device: oak-OST0138-osc-MDT0004, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
      - { index: 3258, event: add_conn, device: oak-OST0138-osc-MDT0004, node: 10.0.2.104@o2ib5 }
      - { index: 3259, event: add_osc, device: oak-MDT0004-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
      
      [root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0005 | grep OST0138
      - { index: 3255, event: attach, device: oak-OST0138-osc-MDT0005, type: osc, UUID: oak-MDT0005-mdtlov_UUID }
      - { index: 3256, event: setup, device: oak-OST0138-osc-MDT0005, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
      - { index: 3258, event: add_conn, device: oak-OST0138-osc-MDT0005, node: 10.0.2.104@o2ib5 }
      - { index: 3259, event: add_osc, device: oak-MDT0005-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
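      The per-MDT checks above can be scripted. Here is a minimal sketch (hypothetical helpers, not a Lustre tool) that parses the flat `- { key: value, ... }` records printed by `lctl llog_print` and checks whether a catalog contains an `add_osc` entry for a given OST UUID; it simply splits on commas, which is enough for these records:

      ```python
      # Parse one "- { index: ..., event: ..., ... }" line into a dict.
      # Note: add_osc records carry "index" twice (record index and OST
      # index); this naive parser keeps the last occurrence.
      def parse_record(line: str) -> dict:
          body = line.strip().lstrip("- ").strip("{} ")
          rec = {}
          for field in body.split(","):
              key, _, val = field.partition(":")
              rec[key.strip()] = val.strip()
          return rec

      def has_target(records, uuid: str) -> bool:
          """True if any add_osc record references the given OST UUID."""
          return any(r.get("event") == "add_osc" and r.get("ost") == uuid
                     for r in records)

      sample = [
          "- { index: 2785, event: attach, device: oak-OST0138-osc-MDT0000, type: osc, UUID: oak-MDT0000-mdtlov_UUID }",
          "- { index: 2789, event: add_osc, device: oak-MDT0000-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }",
      ]
      records = [parse_record(l) for l in sample]
      print(has_target(records, "oak-OST0138_UUID"))  # True
      ```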
      

      However, this new OST is NOT visible from the MDTs:

      [root@oak-md1-s2 CONFIGS]# llog_reader /mnt/ldiskfs/mdt/0/CONFIGS/oak-MDT0000 | grep 0138
      [root@oak-md1-s2 CONFIGS]# 
      
      [root@oak-md1-s2 ~]# lctl dl | grep OST0138
      [root@oak-md1-s2 ~]# 
      

       

      From a client, we can see the new OST but it's not filling up, which makes sense if the MDTs are not aware of it:

      oak-OST0133_UUID     108461852548 37418203104 69949699416  35% /oak[OST:307]
      oak-OST0134_UUID     108461852548 38597230784 68770659804  36% /oak[OST:308]
      oak-OST0135_UUID     108461852548 38483562644 68884328272  36% /oak[OST:309]
      oak-OST0136_UUID     108461852548 41312045604 66055819468  39% /oak[OST:310]
      oak-OST0137_UUID     108461852548 43196874132 64170973596  41% /oak[OST:311]
      oak-OST0138_UUID     108461852548        1828 107368054308   1% /oak[OST:312]
      
      

      Right now, we're up and running in that weird situation... not ideal.

      I'm attaching the catalogs found on the 6 MDTs as oak-MDT-CONFIGS-llog.tar, and a tarball of the CONFIGS directory on the MGS as oak-MGS-CONFIGS.tar.gz.

      Any idea what is wrong or corrupted? We would really appreciate any help to avoid doing a full writeconf.

      Attachments

        1. oak-md1-s2_dk_config+info-1.log
          10.35 MB
        2. oak-md1-s2_dk_config+info-2.log
          1.64 MB
        3. oak-MDT-CONFIGS-llog.tar
          2.71 MB
        4. oak-MGS-CONFIGS.tar.gz
          465 kB
        5. servers-logs.txt
          6 kB

          Activity

            [LU-14695] New OST not visible by MDTs. MGS problem or corrupt catalog llog?

            We had an opportunity to reboot the MDS in question, so both MDT0000 and MDT0003 restarted, which is a bit confusing in the log. I renamed the config for MDT0003 prior to mounting, but unfortunately I was only able to capture the config for MDT0000, I think, with an error on a duplicate OST, this time OST0135 (super weird...). Anyway, we can see the part where the config is loaded (see oak-md1-s2_dk_config+info-1.log for the full logs); this is just for OST0135:

            00000020:00000080:0.0:1623351596.339966:0:57227:0:(obd_config.c:1128:class_process_config()) processing cmd: cf010
            00000020:00000080:0.0:1623351596.339966:0:57227:0:(obd_config.c:1198:class_process_config()) marker 4694 (0x1) oak-OST0135 add osc
            00000020:00000080:0.0:1623351596.339967:0:57227:0:(obd_config.c:1128:class_process_config()) processing cmd: cf005
            00000020:00000080:0.0:1623351596.339968:0:57227:0:(obd_config.c:1139:class_process_config()) adding mapping from uuid 10.0.2.104@o2ib5 to nid 0x500050a000268 (10.0.2.104@o2ib5)
            00000100:00000040:0.0:1623351596.339969:0:57227:0:(lustre_peer.c:122:class_add_uuid()) found uuid 10.0.2.104@o2ib5 10.0.2.104@o2ib5 cnt=1
            00000020:01000000:0.0:1623351596.339969:0:57227:0:(obd_config.c:1695:class_config_llog_handler()) For 2.x interoperability, rename obd type from osc to osp (oak-MDT0000)
            00000020:00000080:0.0:1623351596.339970:0:57227:0:(obd_config.c:1128:class_process_config()) processing cmd: cf001
            00000020:00000080:0.0:1623351596.339972:0:57227:0:(genops.c:451:class_newdev()) Allocate new device oak-OST0135-osc-MDT0000 (ffffa0ba056920f0)
            00000020:00000040:0.0:1623351596.339972:0:57227:0:(lustre_handles.c:99:class_handle_hash()) added object ffffa0ab7c744c00 with handle 0x60ebddc04fb89991 to hash
            00000020:00000040:0.0:1623351596.339973:0:57227:0:(genops.c:1018:class_export_put()) PUTting export ffffa0ab7c744c00 : new refcount 1
            00000100:00000040:3.0:1623351596.339975:0:57124:0:(niobuf.c:905:ptl_send_rpc()) @@@ send flg=0  req@ffffa0ba05689b00 x1702207520042944/t0(0) o8->oak-OST0134-osc-MDT0000@10.0.2.103@o2ib5:28/4 lens 520/544 e 0 to 0 dl 1623351601 ref 2 fl Rpc:N/0/ffffffff rc 0/-1
            00000100:00000040:3.0:1623351596.339978:0:57124:0:(niobuf.c:57:ptl_send_buf()) peer_id 12345-10.0.2.103@o2ib5
            00000020:00000080:0.0:1623351596.339993:0:57227:0:(obd_config.c:431:class_attach()) OBD: dev 307 attached type osp with refcount 1
            

            and later when it tries to register the duplicate OST (this time, OST0135); see oak-md1-s2_dk_config+info-2.log:

            00000020:00000400:14.0:1623352197.724522:0:58995:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
            00000020:00000400:14.0:1623352197.739473:0:58995:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x4)
            00000020:00000400:14.0:1623352197.739474:0:58995:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x4)
            00000020:00020000:14.0:1623352197.739539:0:58995:0:(genops.c:556:class_register_device()) oak-OST0135-osc-MDT0000: already exists, won't add
            00000020:00020000:14.0:1623352197.751872:0:58995:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
            00000020:02000400:14.0:1623352197.764886:0:58995:0:(obd_config.c:2068:class_config_dump_handler())    cmd=cf001 0:oak-OST0135-osc-MDT0000  1:osp  2:oak-MDT0000-mdtlov_UUID  
            
            10000000:00020000:23.0:1623352197.776197:0:57194:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
            
            sthiell Stephane Thiell added a comment -

            Hi Mike,

            Thanks! I will try to gather config llog processing in more details after disabling local MDT config log, at a next scheduled maintenance so I can restart MGS and MDTs. (I think the MGS might have a problem somehow, so better to restart it too). It might not be before 2 weeks though.

             
            As for OST 0136, I think it was added with Lustre 2.12.6.

            I can see the version of Lustre in the MGS's oak-MDT0000 config llog, for example:

            #2735 (224)marker 4700 (flags=0x01, v2.12.6.0) oak-OST0136     'add osc' Thu Feb 18 11:22:03 2021-
            

             
            And you're right that we added 2 new MDTs recently, MDT0004 and MDT0005. Perhaps this is the source of the issue here.

            sthiell Stephane Thiell added a comment -

            Stephane, as I see from the config logs, the local copies on the MDTs were not updated from the main config on the MGS. I am not sure why, so it would still be valuable to get a server log during mount. It can be related somehow to the order of servers in the log: MDT0004 and MDT0005 were added after the last OST0137, so that is probably a log processing/copying bug. I am checking that.

            As for a solution, you could just try to remove (better, move to another location just in case) the local MDT config log of one MDT, say MDT0003, and remount it. The config log should then be copied from the MGS, and MDT0003 might see OST0138. I worry that the -17 error during log processing may interfere, but the config log on the MGS looks OK and has OST0138.

            tappro Mikhail Pershin added a comment -

            Stephane, just a couple of questions: you mentioned before that you have added other OSTs previously, e.g. 0136, and those additions went well, right? Any chance you know what Lustre version was used at that time? My proposal right now is to collect the MDT Lustre debug log on server start to see config llog processing in more detail; is that possible? Please add 'config' and 'info' levels to debug.

            tappro Mikhail Pershin added a comment -
            sthiell Stephane Thiell added a comment - edited

            Hi Mike! Yes, Lustre 2.12.6 is used here on Oak, on all servers, including newly added OSTs. But yes, older OSTs were added using previous versions of Lustre. We started this filesystem with 2.9 in early 2017, then 2.10 for several years and upgraded to 2.12.x in October 2020. Then we have been upgrading Oak to the latest 2.12.x.

            I've started to see random weird behaviors when adding the previous OSTs (like oak-OST0136). One other thing: we have seen a crash similar to LU-9699 "ASSERTION( osp->opd_connects == 1 ) failed" once or twice. I guess some llog corruption and/or bad llog buffer handling could be the cause, but I can't find what. I wonder if there is a way to simulate the llog config processing.

            Otherwise, a drastic solution would be to do a full writeconf and remount all targets to regenerate a clean config, but I guess we would also need to stop all clients, which means a long downtime as Oak is mounted on several clusters.


            Stephane, this looks like a bug to me too, though I can't say for sure whether it is a corrupted llog or something else. I am still checking logs and existing tickets for something similar. You said that Lustre 2.12.6 is used; is that also the case for the newly added OST? Also, I suppose the older servers were updated to 2.12.6 from older versions, am I right?

            tappro Mikhail Pershin added a comment -

            Also... lctl dk from a MDS (oak-md2-s1 serving oak-MDT0004):

            00000100:02000000:4.0:1621544759.322774:0:5321:0:(import.c:1597:ptlrpc_import_recovery_state_machine()) oak-MDT0004: Connection restored to oak-MDT0004-lwp-OST0136_UUID (at 10.0.2.103@o2ib5)
            00000020:00000400:51.0:1621544769.080003:0:37195:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
            00000020:00000400:51.0:1621544769.093563:0:37195:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x4)
            00000020:00020000:51.0:1621544769.093605:0:37195:0:(genops.c:556:class_register_device()) oak-OST0136-osc-MDT0004: already exists, won't add
            00000020:00020000:51.0:1621544769.104827:0:37195:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
            00000020:02000400:51.0:1621544769.116657:0:37195:0:(obd_config.c:2068:class_config_dump_handler())    cmd=cf001 0:oak-OST0136-osc-MDT0004  1:osp  2:oak-MDT0004-mdtlov_UUID
            
            10000000:00020000:19.0:1621544769.127093:0:4304:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
            00010000:02000400:22.0:1621544973.442080:0:4310:0:(ldlm_lib.c:816:target_handle_reconnect()) oak-MDT0004: Client 9458049c-ca8d-335b-3531-2606964e11c0 (at 10.51.2.31@o2ib3) reconnecting
            

            What generates this message is the following code in class_config_llog_handler() (obd_config.c):

                            if (!(cfg->cfg_flags & CFG_F_MARKER) &&
                                (lcfg->lcfg_command != LCFG_MARKER)) {
                                    CWARN("Skip config outside markers, (inst: %016lx, uuid: %s, flags: %#x)\n",
                                          cfg->cfg_instance,
                                          cfg->cfg_uuid.uuid, cfg->cfg_flags);
                                    cfg->cfg_flags |= CFG_F_SKIP;
                            }
            

            but cfg->cfg_instance is NULL and cfg->cfg_uuid.uuid empty. Bug?
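            For intuition, here is a deliberately simplified toy model of that marker handling (the real flag semantics live in lustre/obdclass/obd_config.c; the flag names below mirror the CM_START/CM_END marker flags, but everything else is an illustrative assumption, not Lustre code). Records between a start marker and an end marker are processed; a non-marker record seen outside any marker pair is skipped, which is what the "Skip config outside markers" message reports:

            ```python
            # Toy model of marker tracking in config llog processing.
            CM_START, CM_END = 0x1, 0x2

            def process_log(records):
                """records: list of (command, marker_flags) tuples.
                Returns (processed, skipped) command lists."""
                in_marker = False
                processed, skipped = [], []
                for cmd, flags in records:
                    if cmd == "marker":
                        if flags & CM_START:
                            in_marker = True
                        elif flags & CM_END:
                            in_marker = False
                        continue
                    if not in_marker:
                        skipped.append(cmd)   # "Skip config outside markers"
                        continue
                    processed.append(cmd)
                return processed, skipped

            log = [
                ("attach", 0),                # outside markers -> skipped
                ("marker", CM_START),
                ("attach", 0), ("setup", 0),  # inside markers -> processed
                ("marker", CM_END),
                ("add_conn", 0),              # outside again -> skipped
            ]
            print(process_log(log))  # (['attach', 'setup'], ['attach', 'add_conn'])
            ```

            In the logs above the handler starts with flags: 0x0, i.e. no marker context at all, which is why even ordinary records are being skipped before the duplicate-registration -17 (EEXIST) failure.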

            sthiell Stephane Thiell added a comment -
            pjones Peter Jones added a comment -

            Mike

            Could you please advise?

            Thanks

            Peter

            pjones Peter Jones added a comment -

            Sorry- wrong ticket

            pjones Peter Jones added a comment -

            Serguei

            Can you please advise?

            Thanks

            Peter


            People

              Assignee: tappro Mikhail Pershin
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 4