LU-15000: MDS crashes with (osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed


    Description

      I previously reported some issues when adding OSTs in LU-14695, and I suspect we were also hit by LU-13356, which left the MGS unable to process new OSTs. We recently upgraded all Oak servers to 2.12.7 plus the single patch from LU-13356. I'm opening this ticket in this new context, where the MGS seems to work better so far, but now the issue appears to be on the MDS side.

      We have 300+ OSTs. When adding new OSTs, the first ones succeed, but after 3 or 4 new OSTs, all MDSs crash with the following backtrace (you can see the new OST being initialized on an OSS, followed by an MDS crash):

      Sep 09 07:59:33 oak-io6-s2 kernel: md/raid:md45: raid level 6 active with 10 out of 10 devices, algorithm 2
      Sep 09 07:59:33 oak-io6-s2 kernel: md45: detected capacity change from 0 to 112003075014656
      Sep 09 07:59:33 oak-io6-s2 kernel: LDISKFS-fs (md45): file extents enabled, maximum tree depth=5
      Sep 09 07:59:34 oak-io6-s2 kernel: LDISKFS-fs (md45): mounted filesystem with ordered data mode. Opts: errors=remount-ro
      Sep 09 07:59:34 oak-io6-s2 kernel: LDISKFS-fs (md45): file extents enabled, maximum tree depth=5
      Sep 09 07:59:34 oak-io6-s2 kernel: LDISKFS-fs (md45): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
      Sep 09 07:59:38 oak-io6-s2 kernel: Lustre: oak-OST013d: new disk, initializing
      Sep 09 07:59:38 oak-io6-s2 kernel: Lustre: srv-oak-OST013d: No data found on store. Initialize space
      Sep 09 07:59:38 oak-io6-s2 kernel: Lustre: oak-OST013d: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      Sep 09 07:59:43 oak-md1-s1 kernel: LustreError: 24251:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
      Sep 09 07:59:43 oak-md1-s1 kernel: LustreError: 24251:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
      Sep 09 07:59:43 oak-md1-s1 kernel: Pid: 24251, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
      Sep 09 07:59:43 oak-md1-s1 kernel: Call Trace:
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0cac7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0cac87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc1a2beb6>] osp_obd_connect+0x3c6/0x400 [osp]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc194b04e>] lod_add_device+0xa8e/0x19a0 [lod]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc1946895>] lod_process_config+0x13b5/0x1510 [lod]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0ef18b2>] class_process_config+0x2142/0x2830 [obdclass]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0ef3b79>] class_config_llog_handler+0x819/0x1520 [obdclass]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0eb65af>] llog_process_thread+0x85f/0x1a10 [obdclass]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0eb8174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffa62c5c21>] kthread+0xd1/0xe0
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffa6994ddd>] ret_from_fork_nospec_begin+0x7/0x21
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      

      Here is another example where we can see llog errors for OSTs that are actually added (they are properly registered on the MGS) before the MDS crashes:

      Sep  9 08:21:17 oak-md1-s1 kernel: Lustre: oak-MDT0001: Client 59a1cc5a-bdc6-102a-1ccb-b551d993bacc (at 10.51.3.32@o2ib3) reconnecting
      Sep  9 08:21:18 oak-md1-s1 kernel: Lustre: oak-MDT0001: Client ea26edda-b4c7-1625-cd1a-c65d4487d437 (at 10.51.3.68@o2ib3) reconnecting
      Sep  9 08:21:18 oak-md1-s1 kernel: Lustre: Skipped 5 previous similar messages
      Sep  9 08:21:20 oak-md1-s1 kernel: Lustre: oak-MDT0001: Client 38d27508-3ea0-383f-8501-005bb0842ebd (at 10.51.4.12@o2ib3) reconnecting
      Sep  9 08:21:20 oak-md1-s1 kernel: Lustre: Skipped 2 previous similar messages
      Sep  9 08:21:35 oak-md1-s1 kernel: Lustre: 7555:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
      Sep  9 08:21:35 oak-md1-s1 kernel: LustreError: 7555:0:(genops.c:556:class_register_device()) oak-OST0144-osc-MDT0001: already exists, won't add
      Sep  9 08:21:35 oak-md1-s1 kernel: LustreError: 7555:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
      Sep  9 08:21:35 oak-md1-s1 kernel: Lustre:    cmd=cf001 0:oak-OST0144-osc-MDT0001  1:osp  2:oak-MDT0001-mdtlov_UUID  
      Sep  9 08:21:35 oak-md1-s1 kernel: LustreError: 3775:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
      Sep  9 08:21:35 oak-md1-s1 kernel: Lustre:    cmd=cf001 0:oak-OST0144-osc-MDT0002  1:osp  2:oak-MDT0002-mdtlov_UUID  
      Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 7569:0:(obd_config.c:461:class_setup()) Device 635 already setup (type osp)
      Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 7569:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
      Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 7569:0:(obd_config.c:1835:class_config_llog_handler()) Skipped 1 previous similar message
      Sep  9 08:21:44 oak-md1-s1 kernel: Lustre:    cmd=cf003 0:oak-OST0144-osc-MDT0001  1:oak-OST0144_UUID  2:10.0.2.103@o2ib5  
      Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 3775:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
      Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 3775:0:(mgc_request.c:599:do_requeue()) Skipped 1 previous similar message
      Sep  9 08:21:48 oak-md1-s1 kernel: LustreError: 7571:0:(obd_config.c:461:class_setup()) Device 316 already setup (type osp)
      Sep  9 08:21:48 oak-md1-s1 kernel: LustreError: 7571:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
      Sep  9 08:21:48 oak-md1-s1 kernel: Lustre:    cmd=cf003 0:oak-OST0144-osc-MDT0002  1:oak-OST0144_UUID  2:10.0.2.103@o2ib5  
      Sep  9 08:21:48 oak-md1-s1 kernel: LustreError: 3775:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
      Sep  9 08:21:55 oak-md1-s1 kernel: LustreError: 7574:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
      Sep  9 08:21:55 oak-md1-s1 kernel: LustreError: 7574:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
      Sep  9 08:21:55 oak-md1-s1 kernel: Pid: 7574, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
      Sep  9 08:21:56 oak-md1-s1 kernel: Call Trace:
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0b057cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0b0587c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc18baeb6>] osp_obd_connect+0x3c6/0x400 [osp]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc17da04e>] lod_add_device+0xa8e/0x19a0 [lod]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc17d5895>] lod_process_config+0x13b5/0x1510 [lod]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0d4f8b2>] class_process_config+0x2142/0x2830 [obdclass]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0d51b79>] class_config_llog_handler+0x819/0x1520 [obdclass]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0d145af>] llog_process_thread+0x85f/0x1a10 [obdclass]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0d16174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffff9dcc5c21>] kthread+0xd1/0xe0
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffff9e394ddd>] ret_from_fork_nospec_begin+0x7/0x21
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
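
      As a side note, even though the MDS crashes, the new OST really is registered on the MGS; this can be double-checked from the MGS node by searching the client configuration log for the new OST, e.g. (same llog_print command as used for the attached dumps below):

      lctl --device MGS llog_print oak-client | grep OST0144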
      

      In the trace below, we can see that the MDS crashes happen at the same time on multiple MDS nodes:

      oak-io6-s2: Sep 09 08:37:59 oak-io6-s2 kernel: md/raid:md53: device dm-494 operational as raid disk 1
      oak-io6-s2: Sep 09 08:37:59 oak-io6-s2 kernel: md/raid:md53: raid level 6 active with 10 out of 10 devices, algorithm 2
      oak-io6-s2: Sep 09 08:37:59 oak-io6-s2 kernel: md53: detected capacity change from 0 to 112003075014656
      oak-io6-s2: Sep 09 08:38:00 oak-io6-s2 kernel: LDISKFS-fs (md53): file extents enabled, maximum tree depth=5
      oak-io6-s2: Sep 09 08:38:00 oak-io6-s2 kernel: LDISKFS-fs (md53): mounted filesystem with ordered data mode. Opts: errors=remount-ro
      oak-io6-s2: Sep 09 08:38:01 oak-io6-s2 kernel: LDISKFS-fs (md53): file extents enabled, maximum tree depth=5
      oak-io6-s2: Sep 09 08:38:01 oak-io6-s2 kernel: LDISKFS-fs (md53): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
      oak-io6-s2: Sep 09 08:38:06 oak-io6-s2 kernel: Lustre: oak-OST0145: new disk, initializing
      oak-io6-s2: Sep 09 08:38:06 oak-io6-s2 kernel: Lustre: srv-oak-OST0145: No data found on store. Initialize space
      oak-io6-s2: Sep 09 08:38:06 oak-io6-s2 kernel: Lustre: oak-OST0145: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel: LustreError: 7744:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel: LustreError: 7744:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel: Pid: 7744, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel: LustreError: 5232:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel: LustreError: 5232:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel: Pid: 5232, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel: Call Trace:
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc097b7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel: Call Trace:
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc097b87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc06667cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc066687c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc172deb6>] osp_obd_connect+0x3c6/0x400 [osp]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc1311eb6>] osp_obd_connect+0x3c6/0x400 [osp]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc1bd404e>] lod_add_device+0xa8e/0x19a0 [lod]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc164d04e>] lod_add_device+0xa8e/0x19a0 [lod]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc1648895>] lod_process_config+0x13b5/0x1510 [lod]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc1bcf895>] lod_process_config+0x13b5/0x1510 [lod]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc13df8b2>] class_process_config+0x2142/0x2830 [obdclass]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc0c3a8b2>] class_process_config+0x2142/0x2830 [obdclass]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc0c3cb79>] class_config_llog_handler+0x819/0x1520 [obdclass]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc13e1b79>] class_config_llog_handler+0x819/0x1520 [obdclass]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc13a45af>] llog_process_thread+0x85f/0x1a10 [obdclass]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc0bff5af>] llog_process_thread+0x85f/0x1a10 [obdclass]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc0c01174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc13a6174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffff9bec5c21>] kthread+0xd1/0xe0
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffbd8c5c21>] kthread+0xd1/0xe0
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffbdf94ddd>] ret_from_fork_nospec_begin+0x7/0x21
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffff9c594ddd>] ret_from_fork_nospec_begin+0x7/0x21
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel: LustreError: 5224:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel: LustreError: 5224:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel: Pid: 5224, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel: Call Trace:
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc09f07cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc09f087c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc1b38eb6>] osp_obd_connect+0x3c6/0x400 [osp]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc1a5804e>] lod_add_device+0xa8e/0x19a0 [lod]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc1a53895>] lod_process_config+0x13b5/0x1510 [lod]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc13228b2>] class_process_config+0x2142/0x2830 [obdclass]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc1324b79>] class_config_llog_handler+0x819/0x1520 [obdclass]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc12e75af>] llog_process_thread+0x85f/0x1a10 [obdclass]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc12e9174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffae4c5c21>] kthread+0xd1/0xe0
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffaeb94ddd>] ret_from_fork_nospec_begin+0x7/0x21
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      

      After rebooting the crashed MDSs, the last added OSTs work fine (after MDS recovery, they allocate their "super-sequence" and are then operational). But every time I add more OSTs to the filesystem, all MDSs crash with this backtrace. This looks similar to LU-9699.

      I'm attaching the output of

      lctl --device MGS llog_print <FILE>
      

      as <FILE>.llog.txt, for FILE = oak-client and oak-MDT0000 through oak-MDT0005,
      in case anyone can spot a problem somewhere.
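
      For reference, the dumps were produced with a loop roughly like this, run on the MGS node:

      # dump each configuration llog from the MGS to a text file
      for f in oak-client oak-MDT0000 oak-MDT0001 oak-MDT0002 oak-MDT0003 oak-MDT0004 oak-MDT0005; do
          lctl --device MGS llog_print $f > $f.llog.txt
      done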

      Attachments

        1. llogdiff
          297 kB
        2. oak-client.llog.txt
          187 kB
        3. oak-MDT0000.after-ost331
          426 kB
        4. oak-MDT0000.backup-20210923
          425 kB
        5. oak-MDT0000.llog.txt
          201 kB
        6. oak-MDT0001.llog.txt
          199 kB
        7. oak-MDT0002.llog.txt
          199 kB
        8. oak-MDT0003.llog.txt
          198 kB
        9. oak-MDT0004.llog.txt
          198 kB
        10. oak-MDT0005.llog.txt
          198 kB

          Activity

            sthiell Stephane Thiell added a comment -

            @Peter Jones could you please consider this patch for inclusion in LTS releases (at least for 2.15)?

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47515
            Subject: LU-15000 llog: read canceled records in llog_backup
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 57b085cb1068c1a7a7a35376ccf1db10f489c6a9

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47515 Subject: LU-15000 llog: read canceled records in llog_backup Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: 57b085cb1068c1a7a7a35376ccf1db10f489c6a9
            pjones Peter Jones added a comment -

            Landed for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46552/
            Subject: LU-15000 llog: read canceled records in llog_backup
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d8e2723b4e9409954846939026c599b0b1170e6e

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46552/ Subject: LU-15000 llog: read canceled records in llog_backup Project: fs/lustre-release Branch: master Current Patch Set: Commit: d8e2723b4e9409954846939026c599b0b1170e6e
            eaujames Etienne Aujames added a comment - - edited

            Hi Stephane,
            That's great news.
            I will add some reviewers to land the patch.


            sthiell Stephane Thiell added a comment -

            Hi Etienne,
            Today, we applied your patch on the MGS/MDSs (on top of 2.12.7) and we were then able to add 16 OSTs without issue, so your patch worked for us. This will also allow us to consider using llog_cancel/del_ost again in the future.
            Thank you!

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46552
            Subject: LU-15000 llog: read canceled records in llog_backup
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0c5471748b18d40893309aad3943b8aed5e91f1a

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46552 Subject: LU-15000 llog: read canceled records in llog_backup Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 0c5471748b18d40893309aad3943b8aed5e91f1a

            eaujames Etienne Aujames added a comment -

            @Stephane Thiell the issue with correcting the MGS configs is that the target copies have last indexes lower than the original ones. Clients/targets are only notified of "new" records while mounted, so they will ignore indexes between the original and the copy.

            One solution could be to artificially generate records after correcting the MGS (with "lctl conf_param <param>" and "lctl conf_param -d <param>") until the corrected configs' index reaches the old configs' last index. Then clients/targets should take the new records into account (e.g. adding new OSTs).
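
            For illustration, a rough sketch of that idea (assuming "lctl conf_param"/"lctl conf_param -d" are available on this MGS and that sys.timeout is an acceptable throwaway parameter; this is not a vetted procedure):

            # run on the MGS: append records to the config llogs by setting and
            # then deleting a harmless permanent parameter, repeating until the
            # last index catches up with the old one
            lctl conf_param oak.sys.timeout=100
            lctl conf_param -d oak.sys.timeout
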
            asmadeus Dominique Martinet added a comment - - edited

            Hi again, sorry for the spam. Talking with Stéphane gave me another idea: if we can't copy the offsets reliably, we can just get rid of them.

            In mgc_llog_local_copy() the MDS first makes a local copy with llog_backup(env, obd, lctxt, lctxt, logname, temp_log); (note it uses lctxt twice, which is LLOG_CONFIG_ORIG_CTXT).

            This process is identical and will get rid of any cancelled record, so if we just do the same on the MGS for any log that has holes when it starts, we won't need to worry about these anymore.

            It's not perfect either, though: we still have the problem while the MGS is running, immediately after cancel events, so I guess we need a better time to do that... asynchronously, a while after any cancel event? Hmm. Well, just a wild idea in case we can't get the copy right. I still intend to look further this weekend.


            asmadeus Dominique Martinet added a comment -

            Hi Etienne, thanks for the patch!

            That's pretty much what I had in mind, yes. This came back to haunt me a bit while I slept, and it's not perfect because of padding: we can't control when each side will add its padding (because of length differences), so the last index written might be 1 too big and we might now have the opposite problem (a bigger last index means we'll skip one record we should have processed).

            So I've been thinking it's safer to keep the holes and just append small dummy records at the end until we get the right last index, to fake it up. It's safer because the dummy records, being small, reduce the need to add padding, but unfortunately it's still not perfect: if padding is required at the last dummy record appended, we will overrun the last index again.

            I can't think of any safe copy mechanism that preserves indices right now, as we don't know the size of the skipped records nor the offset within the log of the entry immediately after a hole, and unless we can reproduce that we won't be able to match the automatic padding. I plan to look further this way this weekend.

            Stéphane, this is not a problem: clients don't have a copy of the MDTs' local config logs, so we can just restart the MDTs and they'll get a new copy, which will fix all our problems. If we want to play it safe we can also backport the tunefs --nolocallogs patch, but even without it the servers will try to get a new copy and throw away the old one if the copy worked, so it should work as far as I understand. That won't impact clients more than a normal failover, so if we can get this right it is a good solution.
            sthiell Stephane Thiell added a comment - - edited

            Hi Etienne,
            Thank you!
            Re: your last comment, we were thinking of doing that (copying each MDT's local backup to the MGS), which could fix the immediate issue on the MDTs when adding new OSTs, but wouldn't that mess up mounted clients in that case, especially after we add more OSTs? Are clients' configs based on indexes? Our main goal is to avoid re-mounting all clients while still being able to add new OSTs, which is tricky because of this index problem.


            People

              Assignee: tappro Mikhail Pershin
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 8
