Details
- Bug
- Resolution: Unresolved
- Critical
- Lustre 2.12.6
- CentOS 7.9
Description
On our Oak filesystem, running 2.12.6, we have a problem with either the MGS or a corrupt catalog somewhere.
Active OSTs on this filesystem range from OST000c (12) to OST0137 (311). Today, we tried to add OST index 312 (oak-OST0138). The new OST is visible from clients, but not from the MDTs (we have 6 MDTs, oak-MDT0000 to oak-MDT0005).
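For reference, the target names above encode the OST index as four hexadecimal digits, which is why index 312 shows up as oak-OST0138 and index 12 as OST000c. A minimal Python sketch of that mapping follows; the helper names are hypothetical and not part of any Lustre tool.

```python
def ost_name(fsname: str, index: int) -> str:
    """Build the target name for an OST index (four lowercase hex digits)."""
    return f"{fsname}-OST{index:04x}"


def ost_index(name: str) -> int:
    """Recover the decimal index from a name such as 'oak-OST0138' or 'oak-OST0138_UUID'."""
    hex_part = name.rsplit("OST", 1)[1][:4]
    return int(hex_part, 16)


print(ost_name("oak", 312))      # oak-OST0138
print(ost_index("oak-OST000c"))  # 12
```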
Full disclosure... older OSTs 0-11 were previously removed with the experimental command lctl del_ost from LU-7668.
The server logs from when I started the new OST are available in servers-logs.txt.
What is weird is the following:
May 20 14:06:05 oak-md1-s2 kernel: Lustre: 108193:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
and that it complains about other OSTs (not OST0138):
May 20 14:06:05 oak-md1-s2 kernel: LustreError: 108193:0:(genops.c:556:class_register_device()) oak-OST0134-osc-MDT0003: already exists, won't add
May 20 14:06:05 oak-md1-s2 kernel: LustreError: 108193:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
May 20 14:06:05 oak-md1-s2 kernel: Lustre: cmd=cf001 0:oak-OST0134-osc-MDT0003 1:osp 2:oak-MDT0003-mdtlov_UUID
May 20 14:06:05 oak-md1-s2 kernel: LustreError: 4061:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
May 20 14:06:05 oak-md1-s2 kernel: Lustre: cmd=cf001 0:oak-OST0134-osc-MDT0000 1:osp 2:oak-MDT0000-mdtlov_UUID
May 20 14:06:07 oak-md2-s2 kernel: Lustre: 14846:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
May 20 14:06:07 oak-md2-s2 kernel: LustreError: 14846:0:(genops.c:556:class_register_device()) oak-OST0136-osc-MDT0005: already exists, won't add
May 20 14:06:07 oak-md2-s2 kernel: LustreError: 14846:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
May 20 14:06:07 oak-md2-s2 kernel: Lustre: cmd=cf001 0:oak-OST0136-osc-MDT0005 1:osp 2:oak-MDT0005-mdtlov_UUID
May 20 14:06:07 oak-md2-s2 kernel: LustreError: 4291:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
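The rc = -17 in these messages is a negated Linux errno; 17 is EEXIST, which matches the "already exists, won't add" text. As a rough sketch (not an official tool), the following Python pulls the duplicate device names and return codes out of such kernel log lines. The regular expressions are assumptions based only on the lines quoted in this report.

```python
import errno
import re

# Three of the log lines quoted above, used as sample input.
LOG = """\
May 20 14:06:05 oak-md1-s2 kernel: LustreError: 108193:0:(genops.c:556:class_register_device()) oak-OST0134-osc-MDT0003: already exists, won't add
May 20 14:06:05 oak-md1-s2 kernel: LustreError: 108193:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
May 20 14:06:07 oak-md2-s2 kernel: LustreError: 14846:0:(genops.c:556:class_register_device()) oak-OST0136-osc-MDT0005: already exists, won't add
"""

# Patterns inferred from the quoted messages; other Lustre versions may word them differently.
DUP_RE = re.compile(r"class_register_device\(\)\) (\S+): already exists")
RC_RE = re.compile(r"rc = (-?\d+)")


def duplicate_devices(text):
    """Return device names reported as already registered, in log order."""
    return DUP_RE.findall(text)


def return_codes(text):
    """Return the distinct 'rc = N' values found in the text."""
    return sorted({int(value) for value in RC_RE.findall(text)})


def errno_name(rc):
    """Map a negated kernel return code to its symbolic errno name."""
    return errno.errorcode.get(abs(rc), "unknown")


for rc in return_codes(LOG):
    print(rc, errno_name(rc))
print(duplicate_devices(LOG))
```

Run against the full servers-logs.txt, this would list every OSC device the MDTs refused to register a second time, which makes it easier to see that the complaints concern existing OSTs rather than OST0138.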
If I check the llog catalogs on the MGS, the new OST oak-OST0138 seems to have been added though:
Client catalog on MGS:
[root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-client | grep OST0138
- { index: 2716, event: attach, device: oak-OST0138-osc, type: osc, UUID: oak-clilov_UUID }
- { index: 2717, event: setup, device: oak-OST0138-osc, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
- { index: 2719, event: add_conn, device: oak-OST0138-osc, node: 10.0.2.104@o2ib5 }
- { index: 2720, event: add_osc, device: oak-clilov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
MDS catalogs on MGS:
[root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0000 | grep OST0138
- { index: 2785, event: attach, device: oak-OST0138-osc-MDT0000, type: osc, UUID: oak-MDT0000-mdtlov_UUID }
- { index: 2786, event: setup, device: oak-OST0138-osc-MDT0000, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
- { index: 2788, event: add_conn, device: oak-OST0138-osc-MDT0000, node: 10.0.2.104@o2ib5 }
- { index: 2789, event: add_osc, device: oak-MDT0000-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
[root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0001 | grep OST0138
- { index: 2930, event: attach, device: oak-OST0138-osc-MDT0001, type: osc, UUID: oak-MDT0001-mdtlov_UUID }
- { index: 2931, event: setup, device: oak-OST0138-osc-MDT0001, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
- { index: 2933, event: add_conn, device: oak-OST0138-osc-MDT0001, node: 10.0.2.104@o2ib5 }
- { index: 2934, event: add_osc, device: oak-MDT0001-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
[root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0002 | grep OST0138
- { index: 3063, event: attach, device: oak-OST0138-osc-MDT0002, type: osc, UUID: oak-MDT0002-mdtlov_UUID }
- { index: 3064, event: setup, device: oak-OST0138-osc-MDT0002, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
- { index: 3066, event: add_conn, device: oak-OST0138-osc-MDT0002, node: 10.0.2.104@o2ib5 }
- { index: 3067, event: add_osc, device: oak-MDT0002-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
[root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0003 | grep OST0138
- { index: 3079, event: attach, device: oak-OST0138-osc-MDT0003, type: osc, UUID: oak-MDT0003-mdtlov_UUID }
- { index: 3080, event: setup, device: oak-OST0138-osc-MDT0003, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
- { index: 3082, event: add_conn, device: oak-OST0138-osc-MDT0003, node: 10.0.2.104@o2ib5 }
- { index: 3083, event: add_osc, device: oak-MDT0003-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
[root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0004 | grep OST0138
- { index: 3255, event: attach, device: oak-OST0138-osc-MDT0004, type: osc, UUID: oak-MDT0004-mdtlov_UUID }
- { index: 3256, event: setup, device: oak-OST0138-osc-MDT0004, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
- { index: 3258, event: add_conn, device: oak-OST0138-osc-MDT0004, node: 10.0.2.104@o2ib5 }
- { index: 3259, event: add_osc, device: oak-MDT0004-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
[root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0005 | grep OST0138
- { index: 3255, event: attach, device: oak-OST0138-osc-MDT0005, type: osc, UUID: oak-MDT0005-mdtlov_UUID }
- { index: 3256, event: setup, device: oak-OST0138-osc-MDT0005, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
- { index: 3258, event: add_conn, device: oak-OST0138-osc-MDT0005, node: 10.0.2.104@o2ib5 }
- { index: 3259, event: add_osc, device: oak-MDT0005-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
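Checking six catalogs by eye is error-prone, so here is a small Python sketch that scans saved llog_print output for one OST and reports which record types are missing. The set of expected events (attach, setup, add_conn, add_osc) is an assumption taken from the output shown in this report, not from Lustre documentation, and the function names are hypothetical.

```python
import re

# llog_print records for oak-MDT0000 as quoted above.
SAMPLE = """\
- { index: 2785, event: attach, device: oak-OST0138-osc-MDT0000, type: osc, UUID: oak-MDT0000-mdtlov_UUID }
- { index: 2786, event: setup, device: oak-OST0138-osc-MDT0000, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
- { index: 2788, event: add_conn, device: oak-OST0138-osc-MDT0000, node: 10.0.2.104@o2ib5 }
- { index: 2789, event: add_osc, device: oak-MDT0000-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
"""

# Assumed from the records observed in this report.
EXPECTED_EVENTS = ("attach", "setup", "add_conn", "add_osc")
EVENT_RE = re.compile(r"event: (\w+)")


def events_for(text, ost):
    """Return the event names of all records that mention the given OST."""
    events = []
    for line in text.splitlines():
        if ost not in line:
            continue
        match = EVENT_RE.search(line)
        if match:
            events.append(match.group(1))
    return events


def missing_events(text, ost):
    """Return the expected events that have no record for the given OST."""
    seen = set(events_for(text, ost))
    return [event for event in EXPECTED_EVENTS if event not in seen]


print(events_for(SAMPLE, "OST0138"))
print(missing_events(SAMPLE, "OST0138"))
```

The same function could be pointed at llog_reader output from a local MDT copy, where an OST with no records at all would come back with every expected event listed as missing.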
However, this new OST is NOT visible from the MDTs:
[root@oak-md1-s2 CONFIGS]# llog_reader /mnt/ldiskfs/mdt/0/CONFIGS/oak-MDT0000 | grep 0138
[root@oak-md1-s2 CONFIGS]#
[root@oak-md1-s2 ~]# lctl dl | grep OST0138
[root@oak-md1-s2 ~]#
From a client, we can see the new OST but it's not filling up, which makes sense if the MDTs are not aware of it:
oak-OST0133_UUID 108461852548 37418203104  69949699416 35% /oak[OST:307]
oak-OST0134_UUID 108461852548 38597230784  68770659804 36% /oak[OST:308]
oak-OST0135_UUID 108461852548 38483562644  68884328272 36% /oak[OST:309]
oak-OST0136_UUID 108461852548 41312045604  66055819468 39% /oak[OST:310]
oak-OST0137_UUID 108461852548 43196874132  64170973596 41% /oak[OST:311]
oak-OST0138_UUID 108461852548        1828 107368054308  1% /oak[OST:312]
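To spot OSTs that are not receiving objects, a listing like the one above can be filtered for targets whose used space is close to zero. The sketch below is illustrative only: it assumes the columns are total, used and available space in 1K blocks (as in default lfs df output), and the 1 GiB threshold is an arbitrary choice.

```python
# Two rows from the listing above, used as sample input.
ROWS = """\
oak-OST0137_UUID 108461852548 43196874132 64170973596 41% /oak[OST:311]
oak-OST0138_UUID 108461852548 1828 107368054308 1% /oak[OST:312]
"""


def parse_rows(text):
    """Parse whitespace-separated rows into dicts; lines with another shape are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue
        uuid, total, used, avail, _pct, _mount = parts
        rows.append({
            "uuid": uuid,
            "total_kb": int(total),
            "used_kb": int(used),
            "avail_kb": int(avail),
        })
    return rows


def nearly_empty(rows, max_used_kb=1024 * 1024):
    """Return UUIDs of targets using less than max_used_kb (default 1 GiB, an arbitrary cutoff)."""
    return [row["uuid"] for row in rows if row["used_kb"] < max_used_kb]


print(nearly_empty(parse_rows(ROWS)))
```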
Right now, we're up and running in that weird situation... not ideal.
I'm attaching the catalogs found on the 6 MDTs as oak-MDT-CONFIGS-llog.tar and a tarball of the CONFIGS directory on the MGS as oak-MGS-CONFIGS.tar.gz.
Any idea of what is wrong or corrupt? We would really appreciate any help to avoid doing a full writeconf.
Issue Links
- is related to LU-15000 MDS crashes with (osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed (Resolved)
Stephane, from the config logs I can see that the local copies on the MDTs were not updated from the main config on the MGS. I am not sure why, so it would still be valuable to get the server log during mount. It may be related to the order of the servers in the log: MDT0004 and MDT0005 were added after the last OST (OST0137), so this is probably a log processing/copying bug. I am checking that.
As for a solution, you could try removing the local config log of one MDT, say MDT0003 (better to move it to another location, just in case), and then remounting that MDT. The config log should then be copied from the MGS, and MDT0003 might see OST0138. I am worried about the -17 error during log processing, which may interfere, but the config log on the MGS looks OK and contains OST0138.
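The "move rather than remove" step can be sketched as follows. This is only an illustration of keeping a timestamped backup: it assumes the MDT has been stopped and its backend is mounted as ldiskfs so that the CONFIGS directory is reachable (as with /mnt/ldiskfs/mdt/0/CONFIGS in the report), and the function name, arguments and backup naming are all hypothetical.

```python
import shutil
import time
from pathlib import Path


def move_config_log_aside(configs_dir, log_name, backup_dir):
    """Move CONFIGS/<log_name> into backup_dir under a timestamped name.

    Returns the backup path. Raises FileNotFoundError if the log is absent,
    so nothing is touched when the path is wrong.
    """
    src = Path(configs_dir) / log_name
    if not src.is_file():
        raise FileNotFoundError(src)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{log_name}.{stamp}.bak"
    shutil.move(str(src), str(dest))
    return dest
```

A call might look like move_config_log_aside("<ldiskfs mount>/CONFIGS", "oak-MDT0003", "/root/llog-backup"), where both paths are placeholders. If the remount does not bring back a usable log, the backup can be copied back under its original name.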