Details
Bug
Resolution: Unresolved
Critical
None
Lustre 2.12.6
None
CentOS 7.9
3
Description
On our Oak filesystem, running 2.12.6, we have a problem with either the MGS or a corrupt catalog somewhere.
Active OSTs on this filesystem range from OST000c (12) to OST0137 (311). Today, we tried to add OST index 312 (oak-OST0138). The new OST is visible from clients, but not from the MDTs; we have 6 MDTs (oak-MDT0000 to oak-MDT0005).
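For reference, the OST name carries the index as four hexadecimal digits, which is how OST000c maps to 12 and OST0138 to 312. A minimal Python sketch of that conversion follows; the helper names `ost_index` and `ost_name` are hypothetical and not part of any Lustre tool.

```python
import re

# Matches the four hex digits after "OST" in names such as "oak-OST0138"
# or "oak-OST0134-osc-MDT0003".
_OST_RE = re.compile(r"OST([0-9a-fA-F]{4})")


def ost_index(name: str) -> int:
    """Return the decimal OST index encoded in a target or device name."""
    match = _OST_RE.search(name)
    if match is None:
        raise ValueError(f"no OST index found in {name!r}")
    return int(match.group(1), 16)


def ost_name(fsname: str, index: int) -> str:
    """Build the target name for a filesystem name and decimal index."""
    return f"{fsname}-OST{index:04x}"


if __name__ == "__main__":
    for name in ("oak-OST000c", "oak-OST0137", "oak-OST0138"):
        print(name, ost_index(name))
```

The same conversion shows that the devices named in the errors (OST0134, OST0136) are indices 308 and 310, that is, existing OSTs rather than the new one.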
Full disclosure... older OSTs 0-11 were previously removed with the experimental command lctl del_ost from LU-7668.
The server logs from when I started the new OST are available in servers-logs.txt.
What is weird is the following:
May 20 14:06:05 oak-md1-s2 kernel: Lustre: 108193:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
and that it complains about other OSTs (not OST0138):
May 20 14:06:05 oak-md1-s2 kernel: LustreError: 108193:0:(genops.c:556:class_register_device()) oak-OST0134-osc-MDT0003: already exists, won't add
May 20 14:06:05 oak-md1-s2 kernel: LustreError: 108193:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
May 20 14:06:05 oak-md1-s2 kernel: Lustre: cmd=cf001 0:oak-OST0134-osc-MDT0003 1:osp 2:oak-MDT0003-mdtlov_UUID
May 20 14:06:05 oak-md1-s2 kernel: LustreError: 4061:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
May 20 14:06:05 oak-md1-s2 kernel: Lustre: cmd=cf001 0:oak-OST0134-osc-MDT0000 1:osp 2:oak-MDT0000-mdtlov_UUID
May 20 14:06:07 oak-md2-s2 kernel: Lustre: 14846:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
May 20 14:06:07 oak-md2-s2 kernel: LustreError: 14846:0:(genops.c:556:class_register_device()) oak-OST0136-osc-MDT0005: already exists, won't add
May 20 14:06:07 oak-md2-s2 kernel: LustreError: 14846:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
May 20 14:06:07 oak-md2-s2 kernel: Lustre: cmd=cf001 0:oak-OST0136-osc-MDT0005 1:osp 2:oak-MDT0005-mdtlov_UUID
May 20 14:06:07 oak-md2-s2 kernel: LustreError: 4291:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
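The rc = -17 in these messages is the negated Linux errno EEXIST ("File exists"), which matches the "already exists, won't add" text. A small Python sketch, assuming the log format shown above, pulls the duplicated device names and decodes the return code. The function names are hypothetical, and the sample lines are copied from the excerpt above.

```python
import errno
import re

# Device name reported by class_register_device() as already registered.
ALREADY_EXISTS = re.compile(r"class_register_device\(\)\) (\S+): already exists")
# Return code reported by the config llog handler.
RC = re.compile(r"rc = (-?\d+)")


def duplicate_devices(log_text: str) -> list:
    """Return device names flagged as 'already exists', in log order."""
    return ALREADY_EXISTS.findall(log_text)


def decode_rc(line: str):
    """Return the symbolic errno name for the 'rc = N' value, or None."""
    match = RC.search(line)
    if match is None:
        return None
    return errno.errorcode.get(abs(int(match.group(1))), "UNKNOWN")


SAMPLE = """\
May 20 14:06:05 oak-md1-s2 kernel: LustreError: 108193:0:(genops.c:556:class_register_device()) oak-OST0134-osc-MDT0003: already exists, won't add
May 20 14:06:05 oak-md1-s2 kernel: LustreError: 108193:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
May 20 14:06:07 oak-md2-s2 kernel: LustreError: 14846:0:(genops.c:556:class_register_device()) oak-OST0136-osc-MDT0005: already exists, won't add
"""

if __name__ == "__main__":
    print(duplicate_devices(SAMPLE))
    print(decode_rc(SAMPLE.splitlines()[1]))
```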
If I check the llog catalogs on the MGS, the new OST oak-OST0138 seems to have been added though:
Client catalog on MGS:
[root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-client | grep OST0138
- { index: 2716, event: attach, device: oak-OST0138-osc, type: osc, UUID: oak-clilov_UUID }
- { index: 2717, event: setup, device: oak-OST0138-osc, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
- { index: 2719, event: add_conn, device: oak-OST0138-osc, node: 10.0.2.104@o2ib5 }
- { index: 2720, event: add_osc, device: oak-clilov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
MDS catalogs on MGS:
[root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0000 | grep OST0138
- { index: 2785, event: attach, device: oak-OST0138-osc-MDT0000, type: osc, UUID: oak-MDT0000-mdtlov_UUID }
- { index: 2786, event: setup, device: oak-OST0138-osc-MDT0000, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
- { index: 2788, event: add_conn, device: oak-OST0138-osc-MDT0000, node: 10.0.2.104@o2ib5 }
- { index: 2789, event: add_osc, device: oak-MDT0000-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
[root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0001 | grep OST0138
- { index: 2930, event: attach, device: oak-OST0138-osc-MDT0001, type: osc, UUID: oak-MDT0001-mdtlov_UUID }
- { index: 2931, event: setup, device: oak-OST0138-osc-MDT0001, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
- { index: 2933, event: add_conn, device: oak-OST0138-osc-MDT0001, node: 10.0.2.104@o2ib5 }
- { index: 2934, event: add_osc, device: oak-MDT0001-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
[root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0002 | grep OST0138
- { index: 3063, event: attach, device: oak-OST0138-osc-MDT0002, type: osc, UUID: oak-MDT0002-mdtlov_UUID }
- { index: 3064, event: setup, device: oak-OST0138-osc-MDT0002, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
- { index: 3066, event: add_conn, device: oak-OST0138-osc-MDT0002, node: 10.0.2.104@o2ib5 }
- { index: 3067, event: add_osc, device: oak-MDT0002-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
[root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0003 | grep OST0138
- { index: 3079, event: attach, device: oak-OST0138-osc-MDT0003, type: osc, UUID: oak-MDT0003-mdtlov_UUID }
- { index: 3080, event: setup, device: oak-OST0138-osc-MDT0003, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
- { index: 3082, event: add_conn, device: oak-OST0138-osc-MDT0003, node: 10.0.2.104@o2ib5 }
- { index: 3083, event: add_osc, device: oak-MDT0003-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
[root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0004 | grep OST0138
- { index: 3255, event: attach, device: oak-OST0138-osc-MDT0004, type: osc, UUID: oak-MDT0004-mdtlov_UUID }
- { index: 3256, event: setup, device: oak-OST0138-osc-MDT0004, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
- { index: 3258, event: add_conn, device: oak-OST0138-osc-MDT0004, node: 10.0.2.104@o2ib5 }
- { index: 3259, event: add_osc, device: oak-MDT0004-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
[root@oak-md1-s1 ~]# lctl --device MGS llog_print oak-MDT0005 | grep OST0138
- { index: 3255, event: attach, device: oak-OST0138-osc-MDT0005, type: osc, UUID: oak-MDT0005-mdtlov_UUID }
- { index: 3256, event: setup, device: oak-OST0138-osc-MDT0005, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
- { index: 3258, event: add_conn, device: oak-OST0138-osc-MDT0005, node: 10.0.2.104@o2ib5 }
- { index: 3259, event: add_osc, device: oak-MDT0005-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
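To compare the six MDT logs mechanically rather than by eye, the grep output above can be parsed and checked for the same four events per log. This is only a sketch: the function names are hypothetical, and the "expected" list is just the four events seen in the output above, not a statement about every record Lustre writes for a new OST. Note that the add_osc record uses the key `index` twice, so the records are kept as key/value pairs rather than a dict.

```python
import re

# One "- { key: value, ... }" record as printed by llog_print.
RECORD = re.compile(r"-\s*\{\s*(.*?)\s*\}")
# Events observed for OST0138 in the output above (an assumption, not a spec).
EXPECTED = ("attach", "setup", "add_conn", "add_osc")


def parse_records(text):
    """Yield each record as a list of (key, value) pairs, preserving order."""
    for body in RECORD.findall(text):
        pairs = []
        for field in body.split(", "):
            key, _, value = field.partition(": ")
            pairs.append((key.strip(), value.strip()))
        yield pairs


def events_for(text, ost):
    """Return the events of all records that mention the given OST name."""
    events = []
    for pairs in parse_records(text):
        if any(ost in value for _, value in pairs):
            events.append(next((v for k, v in pairs if k == "event"), None))
    return events


def missing_events(text, ost):
    """Return the expected events that have no record mentioning the OST."""
    seen = events_for(text, ost)
    return [event for event in EXPECTED if event not in seen]


SAMPLE = """\
- { index: 2785, event: attach, device: oak-OST0138-osc-MDT0000, type: osc, UUID: oak-MDT0000-mdtlov_UUID }
- { index: 2786, event: setup, device: oak-OST0138-osc-MDT0000, UUID: oak-OST0138_UUID, node: 10.0.2.103@o2ib5 }
- { index: 2788, event: add_conn, device: oak-OST0138-osc-MDT0000, node: 10.0.2.104@o2ib5 }
- { index: 2789, event: add_osc, device: oak-MDT0000-mdtlov, ost: oak-OST0138_UUID, index: 312, gen: 1 }
"""

if __name__ == "__main__":
    print(events_for(SAMPLE, "OST0138"))
    print(missing_events(SAMPLE, "OST0138"))
```

Running the same check against the llog_reader output of the local CONFIGS copy on each MDT would report all four events as missing, which is the discrepancy described next.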
However, this new OST is NOT visible from the MDTs:
[root@oak-md1-s2 CONFIGS]# llog_reader /mnt/ldiskfs/mdt/0/CONFIGS/oak-MDT0000 | grep 0138
[root@oak-md1-s2 CONFIGS]#
[root@oak-md1-s2 ~]# lctl dl | grep OST0138
[root@oak-md1-s2 ~]#
From a client, we can see the new OST but it's not filling up, which makes sense if the MDTs are not aware of it:
oak-OST0133_UUID 108461852548 37418203104 69949699416 35% /oak[OST:307]
oak-OST0134_UUID 108461852548 38597230784 68770659804 36% /oak[OST:308]
oak-OST0135_UUID 108461852548 38483562644 68884328272 36% /oak[OST:309]
oak-OST0136_UUID 108461852548 41312045604 66055819468 39% /oak[OST:310]
oak-OST0137_UUID 108461852548 43196874132 64170973596 41% /oak[OST:311]
oak-OST0138_UUID 108461852548 1828 107368054308 1% /oak[OST:312]
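A quick way to spot an OST that is mounted but receiving no new objects is to compare its used space with the median of its peers. The Python sketch below assumes the columns in the listing above are UUID, 1K-blocks, used, available, use% and mount point (the usual `lfs df` layout; the original does not name the command). The 1% threshold and the function names are arbitrary, hypothetical choices.

```python
import re
from statistics import median

# UUID, total, used, available, use%, mount point and OST index of one row.
ROW = re.compile(
    r"(\S+_UUID)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)%\s+(\S+)\[OST:(\d+)\]"
)


def parse_rows(text):
    """Parse the usage rows into dicts with integer sizes (in 1K blocks)."""
    rows = []
    for uuid, total, used, avail, _pct, _mount, index in ROW.findall(text):
        rows.append({
            "uuid": uuid,
            "index": int(index),
            "total": int(total),
            "used": int(used),
            "available": int(avail),
        })
    return rows


def nearly_empty(rows, fraction=0.01):
    """Return UUIDs whose used space is below `fraction` of the median."""
    if not rows:
        return []
    typical = median(row["used"] for row in rows)
    return [row["uuid"] for row in rows if row["used"] < fraction * typical]


SAMPLE = """\
oak-OST0133_UUID 108461852548 37418203104 69949699416 35% /oak[OST:307]
oak-OST0134_UUID 108461852548 38597230784 68770659804 36% /oak[OST:308]
oak-OST0135_UUID 108461852548 38483562644 68884328272 36% /oak[OST:309]
oak-OST0136_UUID 108461852548 41312045604 66055819468 39% /oak[OST:310]
oak-OST0137_UUID 108461852548 43196874132 64170973596 41% /oak[OST:311]
oak-OST0138_UUID 108461852548 1828 107368054308 1% /oak[OST:312]
"""

if __name__ == "__main__":
    print(nearly_empty(parse_rows(SAMPLE)))
```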
Right now, we're up and running in that weird situation... not ideal.
I'm attaching the catalogs found on the 6 MDTs as oak-MDT-CONFIGS-llog.tar and a tarball of the CONFIGS directory on the MGS as oak-MGS-CONFIGS.tar.gz.
Any idea of what is wrong or corrupt? We would really appreciate any help to avoid doing a full writeconf.
Issue Links
- is related to LU-15000 MDS crashes with (osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed (Resolved)