[LU-15000] MDS crashes with (osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed Created: 10/Sep/21  Updated: 01/May/23  Resolved: 30/May/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.7
Fix Version/s: Lustre 2.16.0, Lustre 2.15.3

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 7.9


Attachments: HTML File llogdiff     File oak-MDT0000.after-ost331     File oak-MDT0000.backup-20210923     Text File oak-MDT0000.llog.txt     Text File oak-MDT0001.llog.txt     Text File oak-MDT0002.llog.txt     Text File oak-MDT0003.llog.txt     Text File oak-MDT0004.llog.txt     Text File oak-MDT0005.llog.txt     Text File oak-client.llog.txt    
Issue Links:
Blocker
is blocking LU-7668 permanently remove deactivated OSTs f... Resolved
Related
is related to LU-14695 New OST not visible by MDTs. MGS prob... Open
is related to LU-9699 osp_obd_connect()) ASSERTION( osp->op... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I previously reported some issues when adding OSTs in LU-14695, and I also suspect we were hit by LU-13356, which left the MGS unable to process new OSTs. We recently upgraded all Oak servers to 2.12.7 plus the single patch from LU-13356. I'm opening this ticket in this new context, where the MGS seems to work better so far, but the issue now appears to be on the MDS side.

We have 300+ OSTs. When adding new OSTs, the first ones succeed, but after 3 or 4 new OSTs, all MDTs crash with the following backtrace (you can see the new OST being initialized on an OSS, followed by an MDS crash):

Sep 09 07:59:33 oak-io6-s2 kernel: md/raid:md45: raid level 6 active with 10 out of 10 devices, algorithm 2
Sep 09 07:59:33 oak-io6-s2 kernel: md45: detected capacity change from 0 to 112003075014656
Sep 09 07:59:33 oak-io6-s2 kernel: LDISKFS-fs (md45): file extents enabled, maximum tree depth=5
Sep 09 07:59:34 oak-io6-s2 kernel: LDISKFS-fs (md45): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Sep 09 07:59:34 oak-io6-s2 kernel: LDISKFS-fs (md45): file extents enabled, maximum tree depth=5
Sep 09 07:59:34 oak-io6-s2 kernel: LDISKFS-fs (md45): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Sep 09 07:59:38 oak-io6-s2 kernel: Lustre: oak-OST013d: new disk, initializing
Sep 09 07:59:38 oak-io6-s2 kernel: Lustre: srv-oak-OST013d: No data found on store. Initialize space
Sep 09 07:59:38 oak-io6-s2 kernel: Lustre: oak-OST013d: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Sep 09 07:59:43 oak-md1-s1 kernel: LustreError: 24251:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
Sep 09 07:59:43 oak-md1-s1 kernel: LustreError: 24251:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
Sep 09 07:59:43 oak-md1-s1 kernel: Pid: 24251, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
Sep 09 07:59:43 oak-md1-s1 kernel: Call Trace:
Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0cac7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0cac87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc1a2beb6>] osp_obd_connect+0x3c6/0x400 [osp]
Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc194b04e>] lod_add_device+0xa8e/0x19a0 [lod]
Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc1946895>] lod_process_config+0x13b5/0x1510 [lod]
Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0ef18b2>] class_process_config+0x2142/0x2830 [obdclass]
Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0ef3b79>] class_config_llog_handler+0x819/0x1520 [obdclass]
Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0eb65af>] llog_process_thread+0x85f/0x1a10 [obdclass]
Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0eb8174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffa62c5c21>] kthread+0xd1/0xe0
Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffa6994ddd>] ret_from_fork_nospec_begin+0x7/0x21
Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff

Here is another example where llog errors are visible for OSTs that actually were added (they are properly registered on the MGS), before the MDS crash:

Sep  9 08:21:17 oak-md1-s1 kernel: Lustre: oak-MDT0001: Client 59a1cc5a-bdc6-102a-1ccb-b551d993bacc (at 10.51.3.32@o2ib3) reconnecting
Sep  9 08:21:18 oak-md1-s1 kernel: Lustre: oak-MDT0001: Client ea26edda-b4c7-1625-cd1a-c65d4487d437 (at 10.51.3.68@o2ib3) reconnecting
Sep  9 08:21:18 oak-md1-s1 kernel: Lustre: Skipped 5 previous similar messages
Sep  9 08:21:20 oak-md1-s1 kernel: Lustre: oak-MDT0001: Client 38d27508-3ea0-383f-8501-005bb0842ebd (at 10.51.4.12@o2ib3) reconnecting
Sep  9 08:21:20 oak-md1-s1 kernel: Lustre: Skipped 2 previous similar messages
Sep  9 08:21:35 oak-md1-s1 kernel: Lustre: 7555:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
Sep  9 08:21:35 oak-md1-s1 kernel: LustreError: 7555:0:(genops.c:556:class_register_device()) oak-OST0144-osc-MDT0001: already exists, won't add
Sep  9 08:21:35 oak-md1-s1 kernel: LustreError: 7555:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
Sep  9 08:21:35 oak-md1-s1 kernel: Lustre:    cmd=cf001 0:oak-OST0144-osc-MDT0001  1:osp  2:oak-MDT0001-mdtlov_UUID  
Sep  9 08:21:35 oak-md1-s1 kernel: LustreError: 3775:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
Sep  9 08:21:35 oak-md1-s1 kernel: Lustre:    cmd=cf001 0:oak-OST0144-osc-MDT0002  1:osp  2:oak-MDT0002-mdtlov_UUID  
Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 7569:0:(obd_config.c:461:class_setup()) Device 635 already setup (type osp)
Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 7569:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 7569:0:(obd_config.c:1835:class_config_llog_handler()) Skipped 1 previous similar message
Sep  9 08:21:44 oak-md1-s1 kernel: Lustre:    cmd=cf003 0:oak-OST0144-osc-MDT0001  1:oak-OST0144_UUID  2:10.0.2.103@o2ib5  
Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 3775:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 3775:0:(mgc_request.c:599:do_requeue()) Skipped 1 previous similar message
Sep  9 08:21:48 oak-md1-s1 kernel: LustreError: 7571:0:(obd_config.c:461:class_setup()) Device 316 already setup (type osp)
Sep  9 08:21:48 oak-md1-s1 kernel: LustreError: 7571:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
Sep  9 08:21:48 oak-md1-s1 kernel: Lustre:    cmd=cf003 0:oak-OST0144-osc-MDT0002  1:oak-OST0144_UUID  2:10.0.2.103@o2ib5  
Sep  9 08:21:48 oak-md1-s1 kernel: LustreError: 3775:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
Sep  9 08:21:55 oak-md1-s1 kernel: LustreError: 7574:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
Sep  9 08:21:55 oak-md1-s1 kernel: LustreError: 7574:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
Sep  9 08:21:55 oak-md1-s1 kernel: Pid: 7574, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
Sep  9 08:21:56 oak-md1-s1 kernel: Call Trace:
Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0b057cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0b0587c>] lbug_with_loc+0x4c/0xa0 [libcfs]
Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc18baeb6>] osp_obd_connect+0x3c6/0x400 [osp]
Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc17da04e>] lod_add_device+0xa8e/0x19a0 [lod]
Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc17d5895>] lod_process_config+0x13b5/0x1510 [lod]
Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0d4f8b2>] class_process_config+0x2142/0x2830 [obdclass]
Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0d51b79>] class_config_llog_handler+0x819/0x1520 [obdclass]
Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0d145af>] llog_process_thread+0x85f/0x1a10 [obdclass]
Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0d16174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffff9dcc5c21>] kthread+0xd1/0xe0
Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffff9e394ddd>] ret_from_fork_nospec_begin+0x7/0x21
Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff

In the trace below, we can see that the crashes happen on several MDS nodes at the same time:

oak-io6-s2: Sep 09 08:37:59 oak-io6-s2 kernel: md/raid:md53: device dm-494 operational as raid disk 1
oak-io6-s2: Sep 09 08:37:59 oak-io6-s2 kernel: md/raid:md53: raid level 6 active with 10 out of 10 devices, algorithm 2
oak-io6-s2: Sep 09 08:37:59 oak-io6-s2 kernel: md53: detected capacity change from 0 to 112003075014656
oak-io6-s2: Sep 09 08:38:00 oak-io6-s2 kernel: LDISKFS-fs (md53): file extents enabled, maximum tree depth=5
oak-io6-s2: Sep 09 08:38:00 oak-io6-s2 kernel: LDISKFS-fs (md53): mounted filesystem with ordered data mode. Opts: errors=remount-ro
oak-io6-s2: Sep 09 08:38:01 oak-io6-s2 kernel: LDISKFS-fs (md53): file extents enabled, maximum tree depth=5
oak-io6-s2: Sep 09 08:38:01 oak-io6-s2 kernel: LDISKFS-fs (md53): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
oak-io6-s2: Sep 09 08:38:06 oak-io6-s2 kernel: Lustre: oak-OST0145: new disk, initializing
oak-io6-s2: Sep 09 08:38:06 oak-io6-s2 kernel: Lustre: srv-oak-OST0145: No data found on store. Initialize space
oak-io6-s2: Sep 09 08:38:06 oak-io6-s2 kernel: Lustre: oak-OST0145: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel: LustreError: 7744:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel: LustreError: 7744:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel: Pid: 7744, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel: LustreError: 5232:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel: LustreError: 5232:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel: Pid: 5232, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel: Call Trace:
oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc097b7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel: Call Trace:
oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc097b87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc06667cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc066687c>] lbug_with_loc+0x4c/0xa0 [libcfs]
oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc172deb6>] osp_obd_connect+0x3c6/0x400 [osp]
oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc1311eb6>] osp_obd_connect+0x3c6/0x400 [osp]
oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc1bd404e>] lod_add_device+0xa8e/0x19a0 [lod]
oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc164d04e>] lod_add_device+0xa8e/0x19a0 [lod]
oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc1648895>] lod_process_config+0x13b5/0x1510 [lod]
oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc1bcf895>] lod_process_config+0x13b5/0x1510 [lod]
oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc13df8b2>] class_process_config+0x2142/0x2830 [obdclass]
oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc0c3a8b2>] class_process_config+0x2142/0x2830 [obdclass]
oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc0c3cb79>] class_config_llog_handler+0x819/0x1520 [obdclass]
oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc13e1b79>] class_config_llog_handler+0x819/0x1520 [obdclass]
oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc13a45af>] llog_process_thread+0x85f/0x1a10 [obdclass]
oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc0bff5af>] llog_process_thread+0x85f/0x1a10 [obdclass]
oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc0c01174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc13a6174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffff9bec5c21>] kthread+0xd1/0xe0
oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffbd8c5c21>] kthread+0xd1/0xe0
oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffbdf94ddd>] ret_from_fork_nospec_begin+0x7/0x21
oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffff9c594ddd>] ret_from_fork_nospec_begin+0x7/0x21
oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel: LustreError: 5224:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel: LustreError: 5224:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel: Pid: 5224, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel: Call Trace:
oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc09f07cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc09f087c>] lbug_with_loc+0x4c/0xa0 [libcfs]
oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc1b38eb6>] osp_obd_connect+0x3c6/0x400 [osp]
oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc1a5804e>] lod_add_device+0xa8e/0x19a0 [lod]
oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc1a53895>] lod_process_config+0x13b5/0x1510 [lod]
oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc13228b2>] class_process_config+0x2142/0x2830 [obdclass]
oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc1324b79>] class_config_llog_handler+0x819/0x1520 [obdclass]
oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc12e75af>] llog_process_thread+0x85f/0x1a10 [obdclass]
oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc12e9174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffae4c5c21>] kthread+0xd1/0xe0
oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffaeb94ddd>] ret_from_fork_nospec_begin+0x7/0x21
oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff

After rebooting the crashed MDS nodes, the last added OSTs work fine (after MDS recovery, they allocate their "super-sequence" and are then operational). But every time I add more OSTs to the filesystem, all MDS nodes crash with this backtrace. This looks similar to LU-9699.
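For reference, the assertion that trips is the connect-count check in osp_obd_connect(); the following is a trimmed paraphrase of the relevant code in osp_dev.c, not the exact source (details vary by version):

static int osp_obd_connect(const struct lu_env *env, struct obd_export **exp,
                           struct obd_device *obd, struct obd_uuid *cluuid,
                           struct obd_connect_data *data, void *localdata)
{
        struct osp_device *osp = lu2osp_dev(obd->obd_lu_dev);
        struct lustre_handle conn;
        int rc;

        rc = class_connect(&conn, obd, cluuid);
        if (rc)
                return rc;
        *exp = class_conn2export(&conn);

        /* only the local LOD is expected to connect to this OSP device,
         * and exactly once; a second connect means the "add osc" config
         * record was processed twice for the same device */
        osp->opd_connects++;
        LASSERT(osp->opd_connects == 1);
        ...
}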

I'm attaching the output of

lctl --device MGS llog_print <FILE>

as <FILE>.llog.txt for FILE = oak-client and oak-MDT0000 to oak-MDT0005
in case anyone can see if there is a problem somewhere.



 Comments   
Comment by Peter Jones [ 10/Sep/21 ]

Mike

It looks like some work was started but not completed relating to this area under LU-9699

Peter

Comment by Mikhail Pershin [ 14/Sep/21 ]

I've just updated the patch under LU-9699 and am preparing its port to 2.12.

Comment by Stephane Thiell [ 14/Sep/21 ]

Thank you!

Comment by Stephane Thiell [ 23/Sep/21 ]

Hi Mike... so I have some update!
We have applied the backported patch on MGS and all MDS, on top of 2.12.7.
Then this morning we tried to add an OST. The MGS propagated the config to the clients, but not to the MDTs. The local config llogs for oak-MDT* on the MGS all seem corrupt now (but I do have backups, and they are not corrupt on the MDTs). This is new.

This is the log from the MGS when we started the new OST with index 331 (oak-OST014b):

00000020:00000400:7.0:1632410256.389569:0:127696:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x4)
00000020:00020000:19.0:1632410256.389727:0:127696:0:(genops.c:556:class_register_device()) oak-OST0141-osc-MDT0001: already exists, won't add
00000020:00020000:19.0:1632410256.402255:0:127696:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
00000020:02000400:19.0:1632410256.415459:0:127696:0:(obd_config.c:2068:class_config_dump_handler())    cmd=cf001 0:oak-OST0141-osc-MDT0001  1:osp  2:oak-MDT0001-mdtlov_UUID

10000000:00020000:9.0:1632410256.426947:0:3915:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
00000020:00000400:14.0F:1632410256.437628:0:127698:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
00000020:00000400:14.0:1632410256.437631:0:127698:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x4)
00000020:00020000:14.0:1632410256.437734:0:127698:0:(genops.c:556:class_register_device()) oak-OST0141-osc-MDT0002: already exists, won't add
00000020:00020000:14.0:1632410256.437736:0:127698:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
00000020:02000400:14.0:1632410256.437739:0:127698:0:(obd_config.c:2068:class_config_dump_handler())    cmd=cf001 0:oak-OST0141-osc-MDT0002  1:osp  2:oak-MDT0002-mdtlov_UUID

10000000:00020000:21.0:1632410256.449208:0:3915:0:(mgc_request.c:599:do_requeue()) failed processing log: -17

The new OST is not visible from the MDTs (no log messages) and not shown by lctl dl:

# clush -w@mds 'lctl dl | grep oak-OST014b'
clush: oak-md2-s2: exited with exit code 1
clush: oak-md1-s1: exited with exit code 1
clush: oak-md2-s1: exited with exit code 1
clush: oak-md1-s2: exited with exit code 1

The new OST is visible from clients but is not filling up (which makes sense):

[root@oak-h01v19 ~]# lfs df /oak | grep OST:331
oak-OST014b_UUID     108461852548        1868 107368054268   1% /oak[OST:331]

The system is still running fine, with no crash, so there is that. But now the problem is that the config llogs seem corrupt, so I'm worried we will need to restore them from backups:

llog_reader oak-MDT0000
rec #1 type=10620000 len=224 offset 8192
rec #2 type=10620000 len=128 offset 8416
rec #3 type=10620000 len=176 offset 8544
...<skip>...
rec #2943 type=10620000 len=112 offset 435072
rec #2944 type=10620000 len=136 offset 435184
rec #2945 type=10620000 len=224 offset 435320
off 435544 skip 6824 to next chunk.
Previous index is 2945, current 0, offset 435544
The log is corrupt (too big at 2826)
llog_reader: Could not pack buffer.: Invalid argument (22)

The same is true for CONFIGS/oak-MDT0001 to oak-MDT0005 on the MGS.

If I run llog_print on the MGS, we can see the new OST (which explains why clients can see it):

[root@oak-md1-s1 CONFIGS]# lctl --device MGS llog_print oak-client | grep oak-OST014b
- { index: 2879, event: attach, device: oak-OST014b-osc, type: osc, UUID: oak-clilov_UUID }
- { index: 2880, event: setup, device: oak-OST014b-osc, UUID: oak-OST014b_UUID, node: 10.0.2.104@o2ib5 }
- { index: 2882, event: add_conn, device: oak-OST014b-osc, node: 10.0.2.103@o2ib5 }
- { index: 2883, event: add_osc, device: oak-clilov, ost: oak-OST014b_UUID, index: 331, gen: 1 }

But it seems to be in memory only, as it is not found in the MGS oak-client file:

[root@oak-md1-s1 CONFIGS]# llog_reader oak-client | grep 014b
[root@oak-md1-s1 CONFIGS]# 

I'm attaching oak-MDT0000 before (ok but without OST 331) as oak-MDT0000.backup-20210923 and after (corrupted) as oak-MDT0000.after-ost331. Do you see what could have caused this (and how to fix it)? Thanks!

Comment by Stephane Thiell [ 23/Sep/21 ]

Hi Mike,
My bad. I should have used debugfs to check those config files, as properly documented in man llog_reader! They are not corrupted when I extract them from the MGS via debugfs:

[root@oak-md1-s1 ~]# debugfs -c -R 'dump CONFIGS/oak-MDT0000 /tmp/oak-MDT0000' /dev/mapper/md1-rbod1-mgt
debugfs 1.45.6.wc5 (09-Feb-2021)
/dev/mapper/md1-rbod1-mgt: catastrophic mode - not reading inode or group bitmaps
[root@oak-md1-s1 ~]# llog_reader /tmp/oak-MDT0000  | grep 014b
#2946 (224)marker 6183 (flags=0x01, v2.12.7.0) oak-OST014b     'add osc' Thu Sep 23 08:15:05 2021-
#2948 (128)attach    0:oak-OST014b-osc-MDT0000  1:osc  2:oak-MDT0000-mdtlov_UUID  
#2949 (144)setup     0:oak-OST014b-osc-MDT0000  1:oak-OST014b_UUID  2:10.0.2.104@o2ib5  
#2951 (112)add_conn  0:oak-OST014b-osc-MDT0000  1:10.0.2.103@o2ib5  
#2952 (136)lov_modify_tgts add 0:oak-MDT0000-mdtlov  1:oak-OST014b_UUID  2:331  3:1  
#2953 (224)END   marker 6183 (flags=0x02, v2.12.7.0) oak-OST014b     'add osc' Thu Sep 23 08:15:05 2021-

And the OST has been properly added to it; this is a diff of before/after for oak-MDT0000:

--- /tmp/before 2021-09-23 12:59:43.058307733 -0700
+++ /tmp/after  2021-09-23 13:00:15.115469245 -0700
@@ -2825,9 +2825,17 @@
 rec #2943 type=10620000 len=112 offset 435072
 rec #2944 type=10620000 len=136 offset 435184
 rec #2945 type=10620000 len=224 offset 435320
+rec #2946 type=10620000 len=224 offset 435544
+rec #2947 type=10620000 len=88 offset 435768
+rec #2948 type=10620000 len=128 offset 435856
+rec #2949 type=10620000 len=144 offset 435984
+rec #2950 type=10620000 len=88 offset 436128
+rec #2951 type=10620000 len=112 offset 436216
+rec #2952 type=10620000 len=136 offset 436328
+rec #2953 type=10620000 len=224 offset 436464
 Header size : 8192  llh_size : 80
 Time : Mon Feb 13 12:37:27 2017
-Number of records: 2827    cat_idx: 0  last_idx: 2945
+Number of records: 2835    cat_idx: 0  last_idx: 2953
 Target uuid : config_uuid
 -----------------------
 #01 (224)marker   2 (flags=0x01, v2.9.0.0) oak-MDT0000-mdtlov 'lov setup' Mon Feb 13 12:37:27 2017-
@@ -5658,3 +5666,11 @@
 #2943 (112)add_conn  0:oak-OST014a-osc-MDT0000  1:10.0.2.104@o2ib5
 #2944 (136)lov_modify_tgts add 0:oak-MDT0000-mdtlov  1:oak-OST014a_UUID  2:330  3:1
 #2945 (224)END   marker 6175 (flags=0x02, v2.12.7.0) oak-OST014a     'add osc' Thu Sep  9 08:54:11 2021-
+#2946 (224)marker 6183 (flags=0x01, v2.12.7.0) oak-OST014b     'add osc' Thu Sep 23 08:15:05 2021-
+#2947 (088)add_uuid  nid=10.0.2.104@o2ib5(0x500050a000268)  0:  1:10.0.2.104@o2ib5  
+#2948 (128)attach    0:oak-OST014b-osc-MDT0000  1:osc  2:oak-MDT0000-mdtlov_UUID  
+#2949 (144)setup     0:oak-OST014b-osc-MDT0000  1:oak-OST014b_UUID  2:10.0.2.104@o2ib5  
+#2950 (088)add_uuid  nid=10.0.2.103@o2ib5(0x500050a000267)  0:  1:10.0.2.103@o2ib5  
+#2951 (112)add_conn  0:oak-OST014b-osc-MDT0000  1:10.0.2.103@o2ib5  
+#2952 (136)lov_modify_tgts add 0:oak-MDT0000-mdtlov  1:oak-OST014b_UUID  2:331  3:1  
+#2953 (224)END   marker 6183 (flags=0x02, v2.12.7.0) oak-OST014b     'add osc' Thu Sep 23 08:15:05 2021-

So the main question remains: why are the MDTs not aware of this new OST? How can I troubleshoot that? Thanks, and sorry for the confusion.

Comment by Mikhail Pershin [ 27/Sep/21 ]

Stephane, such a problem can be a result of the LU-13356 issue. Could you check if the related patch is in your tree?

Comment by Etienne Aujames [ 27/Sep/21 ]

Hello,
This ticket could be related to LU-14802, which is also a configuration issue with LU-13356 in the background.

Comment by Stephane Thiell [ 27/Sep/21 ]

Thank you both! We do have the patch from LU-13356 on the Lustre servers but not on all clients yet. It is being deployed.

Comment by Stephane Thiell [ 11/Oct/21 ]

Hi Mike and Etienne,

We still think that in our case there is another problem when adding new OSTs. With the patch from LU-9699, the MDS nodes no longer crash on this assertion, but we still can't add OSTs without restarting all of them. I believe the following messages might give us a hint:

---------------
oak-md1-s1
---------------
Oct 04 00:53:44 oak-md1-s1 kernel: Lustre: 6206:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
Oct 04 00:53:44 oak-md1-s1 kernel: LustreError: 6206:0:(genops.c:556:class_register_device()) oak-OST0142-osc-MDT0001: already exists, won't add
---------------
oak-md1-s2
---------------
Oct 04 00:53:42 oak-md1-s2 kernel: Lustre: 6253:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
Oct 04 00:53:42 oak-md1-s2 kernel: LustreError: 6253:0:(genops.c:556:class_register_device()) oak-OST0142-osc-MDT0003: already exists, won't add
---------------
oak-md2-s1
---------------
Oct 04 00:53:41 oak-md2-s1 kernel: Lustre: 5413:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
Oct 04 00:53:41 oak-md2-s1 kernel: LustreError: 5413:0:(genops.c:556:class_register_device()) oak-OST0145-osc-MDT0004: already exists, won't add
---------------
oak-md2-s2
---------------
Oct 04 00:33:18 oak-md2-s2 kernel: Lustre: 48951:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
Oct 04 00:33:18 oak-md2-s2 kernel: LustreError: 48951:0:(genops.c:556:class_register_device()) oak-OST0145-osc-MDT0005: already exists, won't add

Reminder: these OSTs are not the ones being added.

When a new OST is added, the MGS is updated (that is OK) and revokes the config lock, which triggers an update on the clients (OK too) and also on the MDTs; class_config_llog_handler(), where this message is produced, is then called with cfg->cfg_instance == NULL. In my understanding, cfg_instance should be set so that new llog records can be appended and the MDTs can add the new OST at the end without re-processing the full config. My current guess is that this works up to a certain config llog length (until OST0142 / OST0145 depending on the MDT; this is reproducible. Do we reach a buffer limit here?), and when the second part of the config buffer is processed we lose the instance (and thus also cfg_last_idx?), so records are processed from the beginning of the buffer, which here happens to be OST0142 / OST0145 depending on the MDT, as their config llogs are not exactly the same. I've spent some time looking for a possible defect but haven't found anything yet.
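For reference, the re-processing path I'm describing looks roughly like this (a trimmed paraphrase of class_config_parse_llog() in obd_config.c; the exact code differs between versions). The saved cfg_last_idx is used as the starting index, so if it is lost or stale, already-applied records are replayed:

int class_config_parse_llog(const struct lu_env *env, struct llog_ctxt *ctxt,
                            char *name, struct config_llog_instance *cfg)
{
        struct llog_process_cat_data cd = { .lpcd_first_idx = 0 };
        struct llog_handle *llh;
        int rc;

        rc = llog_open(env, ctxt, &llh, NULL, name, LLOG_OPEN_EXISTS);
        if (rc)
                return rc;
        rc = llog_init_handle(env, llh, LLOG_F_IS_PLAIN, NULL);
        if (rc == 0) {
                /* continue from where we stopped last time; a stale or reset
                 * cfg_last_idx means already-applied records are replayed */
                if (cfg)
                        cd.lpcd_first_idx = cfg->cfg_last_idx;
                cd.lpcd_last_idx = 0;   /* process to the end of the log */
                rc = llog_process(env, llh, class_config_llog_handler,
                                  cfg, &cd);
        }
        llog_close(env, llh);
        return rc;
}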

Another thing I suspect is that this could have started when we removed a few OSTs: we previously disabled 12 OSTs on this system using the experimental del_ost, which behaves like llog_cancel on specific indexes, and I noticed they are marked as EXCLUDE and not SKIP. So even an lctl clear_conf wouldn't help. This is also an issue if we want to permanently get rid of them.

#28 (224)marker 323 (flags=0x01, v2.10.4.0) oak-MDT0001     'add osc(copied)' Thu Oct 18 11:45:30 2018-
#29 (224)EXCLUDE START marker 324 (flags=0x11, v2.10.4.0) oak-OST0001     'add osc' Thu Oct 18 11:45:30 2018-Fri Mar 19 15:29:22 2021
#30 (088)add_uuid  nid=10.0.2.102@o2ib5(0x500050a000266)  0:  1:10.0.2.102@o2ib5
#33 (088)add_uuid  nid=10.0.2.101@o2ib5(0x500050a000265)  0:  1:10.0.2.101@o2ib5
#36 (224)END   EXCLUDE END   marker 324 (flags=0x12, v2.10.4.0) oak-OST0001     'add osc' Thu Oct 18 11:45:30 2018-Fri Mar 19 15:29:22 2021

To come back to our 'Skip config outside markers' issue, perhaps there is an llog buffer issue when EXCLUDE markers have previously been used. I will try to reproduce this by creating a large config like ours on a test system, but I wanted to share it with you in case you have some insights. Thanks!

Comment by Dominique Martinet [ 14/Feb/22 ]

Hello,

It's been a while, but adding new OSTs is still a problem. I've taken a fresh look and, after some effort, I was able to reproduce the issue using the catalogs from the MGS on a dummy test filesystem; this should make it easy to get more information out of it.

 

First, for the symptoms we were chasing, Stéphane was correct: the cld_cfg.cfg_last_idx values for the oak-MDTxxxx catalogs are wrong on the MDS, so old events are being processed even though they had already been applied.

The occurrence I reproduced had the following dmesg:

[ 3486.702628] Lustre: 4657:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
[ 3486.710037] LustreError: 4657:0:(genops.c:556:class_register_device()) oak-OST0147-osc-MDT0002: already exists, won't add
[ 3486.714672] LustreError: 4657:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.42.17.51@tcp: cfg command failed: rc = -17
[ 3486.719065] Lustre:    cmd=cf001 0:oak-OST0147-osc-MDT0002  1:osp  2:oak-MDT0002-mdtlov_UUID  
[ 3486.722883] LustreError: 2907:0:(mgc_request.c:599:do_requeue()) failed processing log: -17

And with the debug log levels enabled, we can see this in the debug log (lctl dk):

mgc_request.c:2058:mgc_process_log()) Process log oak-MDT0002-0000000000000000 from 3203

With the catalog as follows on the MGS:

... last_idx: 3259
...
#3203 (144)setup     0:oak-OST0147-osc-MDT0002  1:oak-OST0147_UUID
...
#3259 (last line)

So: the last_idx here was 3202 when it should have been 3259 (so that processing would start from 3260). The discrepancy made the MDS reprocess oak-OST0147-osc-MDT0002's setup, leading to the error message, but the real problem is that last_idx doesn't match.

This can also be confirmed for manual inspection with crash (I assume cfg_last_idx got bumped by one while parsing that erroneous record):

crash> mod -s mgc
crash> list -H config_llog_list -s config_llog_data.cld_cfg,cld_logname -l config_llog_data3.cld_list_chain
...
ffff8bdcb6d3fa70
  cld_cfg = {
    cfg_instance = 0, 
    cfg_sb = 0xffff8bdc48767000, 
    cfg_uuid = {
      uuid = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
    }, 
    cfg_callback = 0xffffffffc0949980 <class_config_llog_handler>, 
    cfg_last_idx = 3203, 
    cfg_flags = 2, 
    cfg_lwp_idx = 327, 
    cfg_sub_clds = 31
  }
  cld_logname = 0xffff8bdcb6d3fae5 "oak-MDT0002"
...

The previous fix was also targeting side effects, but what really needs fixing is the index, so we don't reprocess events.

Now, why is that index off?
This is just a supposition for now, but upon closer look it turns out the CONFIGS/oak-MDT0002 catalog offsets do not stay in sync between the original on the MGT and the copy on the MDT.

Here's what it looks like on MGT:

...
rec #3258 type=10620000 len=88 offset 508600
rec #3259 type=10620000 len=224 offset 508688
Header size : 8192       llh_size : 80
Time : Thu Sep 26 05:36:36 2019
Number of records: 3132 cat_idx: 0      last_idx: 3259
Target uuid : config_uuid
-----------------------
#01 (224)marker 2015 (flags=0x01, v2.10.8.0) oak-MDT0002-mdtlov 'lov setup' Thu Sep 26 05:36:36 2019-
#02 (128)attach    0:oak-MDT0002-mdtlov  1:lov  2:oak-MDT0002-mdtlov_UUID  
#03 (176)lov_setup 0:oak-MDT0002-mdtlov  1:(struct lov_desc)
                uuid=oak-MDT0002-mdtlov_UUID  stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1
...
#18 (224)marker 2018 (flags=0x01, v2.10.8.0) oak-MDT0002     'add osc(copied)' Thu Sep 26 05:36:36 2019-
#19 (224)EXCLUDE START marker 2019 (flags=0x11, v2.10.8.0) oak-OST0003     'add osc' Thu Sep 26 05:36:36 2019-Fri Mar 19 22:29:33 2021
#20 (088)add_uuid  nid=10.0.2.102@o2ib5(0x500050a000266)  0:  1:10.42.17.114@tcp  
#23 (088)add_uuid  nid=10.0.2.101@o2ib5(0x500050a000265)  0:  1:10.42.17.114@tcp  
#26 (224)END   EXCLUDE END   marker 2019 (flags=0x12, v2.10.8.0) oak-OST0003     'add osc' Thu Sep 26 05:36:36 2019-Fri Mar 19 22:29:33 2021
#27 (224)END   marker 2019 (flags=0x02, v2.10.8.0) oak-MDT0002     'add osc(copied)' Thu Sep 26 05:36:36 2019-
#28 (224)marker 2020 (flags=0x01, v2.10.8.0) oak-MDT0002     'add osc(copied)' Thu Sep 26 05:36:36 2019-
#29 (224)EXCLUDE START marker 2021 (flags=0x11, v2.10.8.0) oak-OST0001     'add osc' Thu Sep 26 05:36:36 2019-Fri Mar 19 22:29:22 2021
...
#3172 (112)add_conn  0:oak-OST0142-osc-MDT0002  1:10.42.17.114@tcp  
#3173 (136)lov_modify_tgts add 0:oak-MDT0002-mdtlov  1:oak-OST0142_UUID  2:322  3:1  
#3174 (224)END   marker 6129 (flags=0x02, v2.12.7.0) oak-OST0142     'add osc' Thu Sep  9 15:37:12 2021-
#3175 (224)marker 6137 (flags=0x01, v2.12.7.0) oak-OST0143     'add osc' Thu Sep  9 15:37:38 2021-
#3176 (088)add_uuid  nid=10.0.2.104@o2ib5(0x500050a000268)  0:  1:10.42.17.114@tcp  
#3177 (128)attach    0:oak-OST0143-osc-MDT0002  1:osc  2:oak-MDT0002-mdtlov_UUID  
#3178 (144)setup     0:oak-OST0143-osc-MDT0002  1:oak-OST0143_UUID  2:10.42.17.114@tcp  
#3179 (088)add_uuid  nid=10.0.2.103@o2ib5(0x500050a000267)  0:  1:10.42.17.114@tcp  
#3180 (112)add_conn  0:oak-OST0143-osc-MDT0002  1:10.42.17.114@tcp  
#3181 (136)lov_modify_tgts add 0:oak-MDT0002-mdtlov  1:oak-OST0143_UUID  2:323  3:1  
#3182 (224)END   marker 6137 (flags=0x02, v2.12.7.0) oak-OST0143     'add osc' Thu Sep  9 15:37:38 2021-
#3183 (224)marker 6145 (flags=0x01, v2.12.7.0) oak-OST0145     'add osc' Thu Sep  9 15:38:03 2021-
#3184 (088)add_uuid  nid=10.0.2.104@o2ib5(0x500050a000268)  0:  1:10.42.17.114@tcp  
#3185 (128)attach    0:oak-OST0145-osc-MDT0002  1:osc  2:oak-MDT0002-mdtlov_UUID  
#3186 (144)setup     0:oak-OST0145-osc-MDT0002  1:oak-OST0145_UUID  2:10.42.17.114@tcp  
#3187 (088)add_uuid  nid=10.0.2.103@o2ib5(0x500050a000267)  0:  1:10.42.17.114@tcp  
#3188 (112)add_conn  0:oak-OST0145-osc-MDT0002  1:10.42.17.114@tcp  
#3189 (136)lov_modify_tgts add 0:oak-MDT0002-mdtlov  1:oak-OST0145_UUID  2:325  3:1  
#3190 (224)END   marker 6145 (flags=0x02, v2.12.7.0) oak-OST0145     'add osc' Thu Sep  9 15:38:03 2021-
#3191 (224)marker 6153 (flags=0x01, v2.12.7.0) oak-OST0146     'add osc' Thu Sep  9 15:53:39 2021-
...
#3240 (224)marker 6201 (flags=0x01, v2.12.7.0) oak-OST014d     'add osc' Mon Oct  4 08:10:30 2021-
#3241 (088)add_uuid  nid=10.0.2.104@o2ib5(0x500050a000268)  0:  1:10.42.17.114@tcp  
#3242 (128)attach    0:oak-OST014d-osc-MDT0002  1:osc  2:oak-MDT0002-mdtlov_UUID  
#3243 (144)setup     0:oak-OST014d-osc-MDT0002  1:oak-OST014d_UUID  2:10.42.17.114@tcp  
#3244 (088)add_uuid  nid=10.0.2.103@o2ib5(0x500050a000267)  0:  1:10.42.17.114@tcp  
#3245 (112)add_conn  0:oak-OST014d-osc-MDT0002  1:10.42.17.114@tcp  
#3246 (136)lov_modify_tgts add 0:oak-MDT0002-mdtlov  1:oak-OST014d_UUID  2:333  3:1  
#3247 (224)END   marker 6201 (flags=0x02, v2.12.7.0) oak-OST014d     'add osc' Mon Oct  4 08:10:30 2021-
#3248 (224)marker 6209 (flags=0x01, v2.12.7.0) oak-OST014e     'add osc' Mon Oct  4 08:13:45 2021-
#3249 (088)add_uuid  nid=10.0.2.103@o2ib5(0x500050a000267)  0:  1:10.42.17.114@tcp  
#3250 (128)attach    0:oak-OST014e-osc-MDT0002  1:osc  2:oak-MDT0002-mdtlov_UUID  
#3251 (144)setup     0:oak-OST014e-osc-MDT0002  1:oak-OST014e_UUID  2:10.42.17.114@tcp  
#3252 (088)add_uuid  nid=10.0.2.104@o2ib5(0x500050a000268)  0:  1:10.42.17.114@tcp  
#3254 (112)add_conn  0:oak-OST014e-osc-MDT0002  1:10.42.17.114@tcp  
#3255 (136)lov_modify_tgts add 0:oak-MDT0002-mdtlov  1:oak-OST014e_UUID  2:334  3:1  
#3256 (224)END   marker 6209 (flags=0x02, v2.12.7.0) oak-OST014e     'add osc' Mon Oct  4 08:13:45 2021-
#3257 (224)marker 6362 (flags=0x01, v2.12.7.0) oak             'quota.mdt' Thu Jan 20 21:23:29 2022-
#3258 (088)param 0:oak  1:quota.mdt=gp  
#3259 (224)END   marker 6362 (flags=0x02, v2.12.7.0) oak             'quota.mdt' Thu Jan 20 21:23:29 2022-

And here's the MDT copy:

rec #3188 type=10620000 len=224 offset 500632
rec #3189 type=10620000 len=224 offset 500856
rec #3190 type=10620000 len=88 offset 501080
rec #3191 type=10620000 len=224 offset 501168
Header size : 8192       llh_size : 80
Time : Fri Jan 21 18:36:56 2022
Number of records: 3132 cat_idx: 0      last_idx: 3191
Target uuid : 
-----------------------
#01 (224)marker 2015 (flags=0x01, v2.10.8.0) oak-MDT0002-mdtlov 'lov setup' Thu Sep 26 05:36:36 2019-
#02 (128)attach    0:oak-MDT0002-mdtlov  1:lov  2:oak-MDT0002-mdtlov_UUID  
#03 (176)lov_setup 0:oak-MDT0002-mdtlov  1:(struct lov_desc)
                uuid=oak-MDT0002-mdtlov_UUID  stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1
...
#18 (224)marker 2018 (flags=0x01, v2.10.8.0) oak-MDT0002     'add osc(copied)' Thu Sep 26 05:36:36 2019-
#19 (224)EXCLUDE START marker 2019 (flags=0x11, v2.10.8.0) oak-OST0003     'add osc' Thu Sep 26 05:36:36 2019-Fri Mar 19 22:29:33 2021
#20 (088)add_uuid  nid=10.0.2.102@o2ib5(0x500050a000266)  0:  1:10.42.17.114@tcp  
#21 (088)add_uuid  nid=10.0.2.101@o2ib5(0x500050a000265)  0:  1:10.42.17.114@tcp  
#22 (224)END   EXCLUDE END   marker 2019 (flags=0x12, v2.10.8.0) oak-OST0003     'add osc' Thu Sep 26 05:36:36 2019-Fri Mar 19 22:29:33 2021
#23 (224)END   marker 2019 (flags=0x02, v2.10.8.0) oak-MDT0002     'add osc(copied)' Thu Sep 26 05:36:36 2019-
#24 (224)marker 2020 (flags=0x01, v2.10.8.0) oak-MDT0002     'add osc(copied)' Thu Sep 26 05:36:36 2019-
#25 (224)EXCLUDE START marker 2021 (flags=0x11, v2.10.8.0) oak-OST0001     'add osc' Thu Sep 26 05:36:36 2019-Fri Mar 19 22:29:22 2021
..
#3172 (224)marker 6201 (flags=0x01, v2.12.7.0) oak-OST014d     'add osc' Mon Oct  4 08:10:30 2021-
#3173 (088)add_uuid  nid=10.0.2.104@o2ib5(0x500050a000268)  0:  1:10.42.17.114@tcp  
#3174 (128)attach    0:oak-OST014d-osc-MDT0002  1:osc  2:oak-MDT0002-mdtlov_UUID  
#3175 (144)setup     0:oak-OST014d-osc-MDT0002  1:oak-OST014d_UUID  2:10.42.17.114@tcp  
#3176 (088)add_uuid  nid=10.0.2.103@o2ib5(0x500050a000267)  0:  1:10.42.17.114@tcp  
#3177 (112)add_conn  0:oak-OST014d-osc-MDT0002  1:10.42.17.114@tcp  
#3178 (136)lov_modify_tgts add 0:oak-MDT0002-mdtlov  1:oak-OST014d_UUID  2:333  3:1  
#3179 (224)END   marker 6201 (flags=0x02, v2.12.7.0) oak-OST014d     'add osc' Mon Oct  4 08:10:30 2021-
#3181 (224)marker 6209 (flags=0x01, v2.12.7.0) oak-OST014e     'add osc' Mon Oct  4 08:13:45 2021-
#3182 (088)add_uuid  nid=10.0.2.103@o2ib5(0x500050a000267)  0:  1:10.42.17.114@tcp  
#3183 (128)attach    0:oak-OST014e-osc-MDT0002  1:osc  2:oak-MDT0002-mdtlov_UUID  
#3184 (144)setup     0:oak-OST014e-osc-MDT0002  1:oak-OST014e_UUID  2:10.42.17.114@tcp  
#3185 (088)add_uuid  nid=10.0.2.104@o2ib5(0x500050a000268)  0:  1:10.42.17.114@tcp  
#3186 (112)add_conn  0:oak-OST014e-osc-MDT0002  1:10.42.17.114@tcp  
#3187 (136)lov_modify_tgts add 0:oak-MDT0002-mdtlov  1:oak-OST014e_UUID  2:334  3:1  
#3188 (224)END   marker 6209 (flags=0x02, v2.12.7.0) oak-OST014e     'add osc' Mon Oct  4 08:13:45 2021-
#3189 (224)marker 6362 (flags=0x01, v2.12.7.0) oak             'quota.mdt' Thu Jan 20 21:23:29 2022-
#3190 (088)param 0:oak  1:quota.mdt=gp  
#3191 (224)END   marker 6362 (flags=0x02, v2.12.7.0) oak             'quota.mdt' Thu Jan 20 21:23:29 2022-

In particular:

  • the last_idx doesn't match
  • from record 20 onwards, the copied llog content is the same but the index drifts: the difference grows across EXCLUDE blocks, then oscillates between 48/49 (some holes on the MDT that aren't on the MGS), then climbs again when records pertaining to OST[0000-000b], which had been excluded earlier, show up again later. I've attached a post-processed catalog with the diff/index on MGT/index on MDT: llogdiff

From my understanding, the catalog should be identical on both nodes, as the local copy is used during MDS start, so differences there will likely cause problems. It's also safe to say the difference is probably due to the EXCLUDE blocks, as suspected from the start.

So we now have two ways forward:

  • try to fix the copy so that excluded indexes don't affect the overall index (e.g. copy the holes too)
  • remove the EXCLUDED rows from the catalogs and pretend this all never happened.

Since exclude support has never been merged, I'd be tempted to sweep this under the rug and just try to fix the catalogs, but unfortunately I can't think of any way of fixing them nicely without shutting down the whole filesystem (e.g. nuking the MGS and having targets re-register is most certainly not safe with thousands of clients mounted).
Do you have any suggestions?

Thanks!

Comment by Etienne Aujames [ 16/Feb/22 ]

Unmounting/mounting the MDT target should sync the local configuration with the MGS if the MDT successfully gets a lock on the llog config resource.
The job is done by mgc_llog_local_copy().

If the MDT fails to sync the configuration, the following message should appear (with debug=+mgc):

Failed to get MGS log %s, using local copy for now, will try to update later.

You can get more information from https://review.whamcloud.com/40448/ ("LU-14090 mgs: no local logs flag").

If the MDT local configuration is different from the MGS configuration after the copy, it can mean that there is a bug in llog_backup().

Comment by Dominique Martinet [ 16/Feb/22 ]

Hi Etienne! Thanks for the reply.

Yes, in my case the copy was successful, so there definitely is a bug in the copy as I pointed out.
I'm not sure it's worth fixing this bug for a feature that was never released (there probably isn't anyone else with EXCLUDE catalog entries...), but it might actually be easier to fix the copy than to try to fix the catalogs. I'll admit I haven't looked yet, and it's worth a shot.

Comment by Dominique Martinet [ 17/Feb/22 ]

Well, it looks like we can have EXCLUDE entries, so it's probably worth a check anyway...

  • first, a confirmation that this is a problem:
    we use the same cld->cld_cfg (which stores the last index processed) whether we look at the remote catalog (rctxt in mgc_process_cfg_log) or the local one (ctxt), so if the indexes don't match (at least the last index) we get the bad behaviour observed: the copy needs to preserve the indices.
  • looking at llog_backup:
    probing llog_osd_write_rec I can see that rec->lrh_index is correct (the MGS index), but the idx argument passed for writing is -1, so we are just appending, incrementing the record number by 1 every time and not skipping entries like the original does. So, as observed, it copies the right content but doesn't care about the index (see the sketch just below).
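For reference, here is roughly what that copy path looks like (a trimmed paraphrase of llog_copy_handler()/llog_backup() in obdclass/llog.c; details differ between versions):

static int llog_copy_handler(const struct lu_env *env,
                             struct llog_handle *llh,
                             struct llog_rec_hdr *rec, void *data)
{
        struct llog_handle *copy_llh = data;

        /* rec->lrh_index still carries the source (MGS) index, but
         * LLOG_NEXT_IDX (-1) just appends at the next free slot, so
         * holes left by cancelled records are silently compacted away */
        return llog_write(env, copy_llh, rec, LLOG_NEXT_IDX);
}

int llog_backup(const struct lu_env *env, struct obd_device *obd,
                struct llog_ctxt *ctxt, struct llog_ctxt *bctxt,
                char *name, char *backup)
{
        struct llog_handle *llh, *bllh;
        int rc;

        /* open source and destination logs, then walk the source;
         * llog_process() only visits records still set in the bitmap */
        rc = llog_open(env, ctxt, &llh, NULL, name, LLOG_OPEN_EXISTS);
        ...
        rc = llog_open_create(env, bctxt, &bllh, NULL, backup);
        ...
        rc = llog_process(env, llh, llog_copy_handler, bllh, NULL);
        ...
        return rc;
}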

Unfortunately, the code doesn't let us modify an unset index, so we would need to keep track of the last written index, compare it with the one we want to write, and create fake records and cancel them to make the index match artificially?

FWIW I've kludged llog_reader to take a peek at what the "holes" are about: they are cancelled records from the excluded block, as well as padding log entries (which are also present in the copy, at different offsets, when they are needed). Unfortunately llog_osd_pad is not exposed so we can't abuse it, but we could just write entries of LLOG_PAD_MAGIC type, empty or of minimal size, and immediately cancel them to unset the header bit...? Looking back I don't see any clean way of clearing the bit actually, so it might not be so easy :|

Anyway, ETIMEDOUT for today; I will keep digging. Please speak up if you have better ideas than what I'm rambling about.

Comment by Etienne Aujames [ 17/Feb/22 ]

Hi Dominique,

I think the issue is that mgc_llog_local_copy()/llog_backup() do not reproduce the gaps from the original config llog. The copy is done via llog_process(), which ignores cancelled records (per the llog header bitmap).

So your original:

#23 (088)add_uuid  nid=10.0.2.101@o2ib5(0x500050a000265)  0:  1:10.42.17.114@tcp  
#26 (224)END   EXCLUDE END   marker 2019 (flags=0x12, v2.10.8.0) oak-OST0003     'add osc' Thu Sep 26 05:36:36 2019-Fri Mar 19 22:29:33 2021

is replaced in the local copy by:

#21 (088)add_uuid  nid=10.0.2.101@o2ib5(0x500050a000265)  0:  1:10.42.17.114@tcp  
#22 (224)END   EXCLUDE END   marker 2019 (flags=0x12, v2.10.8.0) oak-OST0003     'add osc' Thu Sep 26 05:36:36 2019-Fri Mar 19 22:29:33 2021

The copy doesn't keep the "holes" in the indexes created by "del_ost"; that's why you do not have the same cfg_last_idx.
So it messes up the config update mechanism, because the MDT relies on the local last index to fetch new records from the MGS config.

So if you try to remove the EXCLUDE records, it will increase the index difference between the local copy and the original. This will not help.

I am working on a patch to reproduce the index gaps in the local copy.
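For what it's worth, here is the general shape of such a gap-aware copy, as a purely hypothetical sketch. This is NOT the code of the patches on this ticket: the handler name and the dummy-record-then-cancel trick are illustrative assumptions building on Dominique's idea above.

/* Hypothetical sketch only, not the actual patch: while copying, detect
 * holes (indices cancelled in the source) and re-create them in the
 * destination by appending a throwaway record and cancelling it, so that
 * real records keep the same index on both sides. */
static int llog_copy_handler_keep_gaps(const struct lu_env *env,
                                       struct llog_handle *llh,
                                       struct llog_rec_hdr *rec, void *data)
{
        struct llog_handle *copy_llh = data;
        int rc;

        while (copy_llh->lgh_last_idx + 1 < rec->lrh_index) {
                struct {
                        struct llog_rec_hdr  hdr;
                        struct llog_rec_tail tail;
                } dummy = {
                        .hdr.lrh_len  = sizeof(dummy),
                        .hdr.lrh_type = OBD_CFG_REC,
                };

                /* append a dummy record, then cancel it immediately so the
                 * copy ends up with the same "hole" as the original */
                rc = llog_write(env, copy_llh, &dummy.hdr, LLOG_NEXT_IDX);
                if (rc == 0)
                        rc = llog_cancel_rec(env, copy_llh,
                                             copy_llh->lgh_last_idx);
                if (rc)
                        return rc;
        }
        /* the real record now lands at the index it had in the source */
        return llog_write(env, copy_llh, rec, LLOG_NEXT_IDX);
}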

Comment by Gerrit Updater [ 17/Feb/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46545
Subject: LU-15000 llog: check for index gaps in llog_backup()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fa9c281196d7683e76096aa20b3f08b7ccaca0be

Comment by Etienne Aujames [ 17/Feb/22 ]

The patch above implements what you have described (without LLOG_PAD_MAGIC). But as you said, this is not totally accurate because of the chunk padding.
Maybe the easiest way is to take the local backup contents (without "holes") and correct the MGS original with them.

Comment by Stephane Thiell [ 17/Feb/22 ]

Hi Etienne,
Thank you!
Re: your last comment, we were thinking of doing that (copying each MDT's local backup to the MGS), which could fix the immediate issue on the MDTs when adding new OSTs, but wouldn't that mess up mounted clients, especially after we add more OSTs? Are the clients' configs based on indexes? Our main goal is to avoid re-mounting all clients while still being able to add new OSTs, which is tricky due to this index problem.

Comment by Dominique Martinet [ 17/Feb/22 ]

Hi Etienne, thanks for the patch!

That's pretty much what I had in mind, yes. This came back to haunt me a bit in my sleep, and yes, it's not perfect because of padding: we can't control when each side will add its padding (because of length differences), so the last index written might be 1 too big, and we might then have the opposite problem (a bigger last index means we'll skip one record we should have processed).

So I've been thinking it's safer to keep the holes and just append small dummy records at the end until we get the right last index to fake it up. It's safer because the dummy records being small would reduce the need to add padding, but unfortunately it's still not perfect: if padding is required at the last dummy record appended, we'll overrun the last index again.

I can't think of a safe copy mechanism that preserves indices right now, as we don't know the size of skipped records nor the offset within the llog of the entry immediately after a hole, and unless we can reproduce that we won't be able to match the automatic padding. I plan to look further in this direction this weekend.

 

Stéphane, this is not a problem: clients don't keep a copy of the MDT config logs, so we can just restart the MDTs and they'll fetch a new copy, which will fix all our problems. If we want to play it safe we can also backport the tunefs --nolocallogs patch, but even without it the servers will try to get a new copy and throw away the old one if the copy worked, so it should work as far as I understand. That won't impact clients more than a normal failover, so if we can get this right it is a good solution.

Comment by Dominique Martinet [ 17/Feb/22 ]

Hi again, sorry for the spam. Talking with Stéphane gave me another idea: if we can't copy the offsets reliably, we can just get rid of them.

In mgc_llog_local_copy() the MDS will first make a local copy with llog_backup(env, obd, lctxt, lctxt, logname, temp_log); (note it uses lctxt twice, which is LLOG_CONFIG_ORIG_CTXT).

This process is identical and will get rid of any cancelled records, so if we just do the same on the MGS for any log that has holes when it starts, we won't need to worry about these anymore.

It's not perfect either, though: we would still have the problem while the MGS is running, immediately after cancel events, so I guess we need a better time to do that... asynchronously, a while after any cancel event? Hmm... Well, just a wild idea in case we can't get the copy right. I still intend to look further this weekend.
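For reference, the local-copy sequence mentioned above lives in mgc_llog_local_copy() in mgc_request.c; a trimmed paraphrase follows (argument order and error handling differ between versions):

static int mgc_llog_local_copy(const struct lu_env *env,
                               struct obd_device *obd,
                               struct llog_ctxt *rctxt,
                               struct llog_ctxt *lctxt, char *logname)
{
        char *temp_log;
        int rc;

        OBD_ALLOC(temp_log, strlen(logname) + 2);
        if (!temp_log)
                return -ENOMEM;
        sprintf(temp_log, "%sT", logname);

        /* 1. save the current local config log under a temporary name */
        rc = llog_backup(env, obd, lctxt, lctxt, logname, temp_log);
        if (rc == 0 || rc == -ENOENT)
                /* 2. pull the MGS copy over the local one; this is the
                 * llog_backup() whose index handling is discussed above */
                rc = llog_backup(env, obd, rctxt, lctxt, logname, logname);
        if (rc < 0 && rc != -ENOENT)
                /* 3. on failure, restore the saved local copy */
                llog_backup(env, obd, lctxt, lctxt, temp_log, logname);

        llog_erase(env, lctxt, NULL, temp_log);
        OBD_FREE(temp_log, strlen(logname) + 2);
        return rc;
}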

Comment by Etienne Aujames [ 18/Feb/22 ]

@Stephane Thiell the issue with correcting the MGS configs is that the target copies have last indexes lower than the original ones. Clients/targets are notified only of "new" records while mounted, so they will ignore indexes between the copy and the original.

One solution could be to artificially generate records after correcting the MGS (with "lctl conf_param <param>" and "lctl conf_param -d <param>") until the corrected configs' index reaches the old configs' "last indexes". Then clients/targets should take new records into account (e.g. adding new OSTs).

Comment by Gerrit Updater [ 18/Feb/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46552
Subject: LU-15000 llog: read canceled records in llog_backup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0c5471748b18d40893309aad3943b8aed5e91f1a

Comment by Stephane Thiell [ 22/Feb/22 ]

Hi Etienne,
Today we applied your patch on the MGS/MDS nodes (on top of 2.12.7) and were then able to add 16 OSTs without issue, so your patch worked for us. This will also allow us to consider using llog_cancel/del_ost again in the future.
Thank you!

Comment by Etienne Aujames [ 28/Feb/22 ]

Hi Stephane,
Glad it worked for you.
I will add some reviewers to land the patch.

Comment by Gerrit Updater [ 30/May/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46552/
Subject: LU-15000 llog: read canceled records in llog_backup
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d8e2723b4e9409954846939026c599b0b1170e6e

Comment by Peter Jones [ 30/May/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 02/Jun/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47515
Subject: LU-15000 llog: read canceled records in llog_backup
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 57b085cb1068c1a7a7a35376ccf1db10f489c6a9

Comment by Stephane Thiell [ 24/Jun/22 ]

@Peter Jones could you please consider this patch for inclusion in LTS releases (at least for 2.15)?

Comment by Gerrit Updater [ 17/Oct/22 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48898
Subject: LU-15000 llog: read canceled records in llog_backup
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 51bf31c545e2bc67b214de7ae8fd415c1776c3bf

Comment by Gerrit Updater [ 19/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48898/
Subject: LU-15000 llog: read canceled records in llog_backup
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 7ba5dc8e895c693f68d49e7ffc46483710d67beb
