Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-15000

MDS crashes with (osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.16.0, Lustre 2.15.3
    • Lustre 2.12.7
    • None
    • CentOS 7.9
    • 3
    • 9223372036854775807

    Description

      I previously reported some issues when adding OSTs in LU-14695, and I also guess we got hit by LU-13356 and the MGS was not unable to process new OSTs. So we recently upgraded all Oak servers to 2.12.7 + single patch from LU-13356. I'm opening this ticket in this new context where the MGS seems to work better so far. But now the issue is more on the MDS side apparently.

      We have 300+ OSTs. After adding new OSTs, the first ones succeed, but after 3 or 4 new OSTs, all MDTs crash with the following backtrace (you can see the new OST being initialized on an OSS and then a MDS crash:

      Sep 09 07:59:33 oak-io6-s2 kernel: md/raid:md45: raid level 6 active with 10 out of 10 devices, algorithm 2
      Sep 09 07:59:33 oak-io6-s2 kernel: md45: detected capacity change from 0 to 112003075014656
      Sep 09 07:59:33 oak-io6-s2 kernel: LDISKFS-fs (md45): file extents enabled, maximum tree depth=5
      Sep 09 07:59:34 oak-io6-s2 kernel: LDISKFS-fs (md45): mounted filesystem with ordered data mode. Opts: errors=remount-ro
      Sep 09 07:59:34 oak-io6-s2 kernel: LDISKFS-fs (md45): file extents enabled, maximum tree depth=5
      Sep 09 07:59:34 oak-io6-s2 kernel: LDISKFS-fs (md45): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
      Sep 09 07:59:38 oak-io6-s2 kernel: Lustre: oak-OST013d: new disk, initializing
      Sep 09 07:59:38 oak-io6-s2 kernel: Lustre: srv-oak-OST013d: No data found on store. Initialize space
      Sep 09 07:59:38 oak-io6-s2 kernel: Lustre: oak-OST013d: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      Sep 09 07:59:43 oak-md1-s1 kernel: LustreError: 24251:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
      Sep 09 07:59:43 oak-md1-s1 kernel: LustreError: 24251:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
      Sep 09 07:59:43 oak-md1-s1 kernel: Pid: 24251, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
      Sep 09 07:59:43 oak-md1-s1 kernel: Call Trace:
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0cac7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0cac87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc1a2beb6>] osp_obd_connect+0x3c6/0x400 [osp]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc194b04e>] lod_add_device+0xa8e/0x19a0 [lod]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc1946895>] lod_process_config+0x13b5/0x1510 [lod]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0ef18b2>] class_process_config+0x2142/0x2830 [obdclass]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0ef3b79>] class_config_llog_handler+0x819/0x1520 [obdclass]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0eb65af>] llog_process_thread+0x85f/0x1a10 [obdclass]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0eb8174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffa62c5c21>] kthread+0xd1/0xe0
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffa6994ddd>] ret_from_fork_nospec_begin+0x7/0x21
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      

      This is another example when we can see llog errors, for OSTs that actually are added (they are properly registered on the MGS), before the crash of the MDS:

      Sep  9 08:21:17 oak-md1-s1 kernel: Lustre: oak-MDT0001: Client 59a1cc5a-bdc6-102a-1ccb-b551d993bacc (at 10.51.3.32@o2ib3) reconnecting
      Sep  9 08:21:18 oak-md1-s1 kernel: Lustre: oak-MDT0001: Client ea26edda-b4c7-1625-cd1a-c65d4487d437 (at 10.51.3.68@o2ib3) reconnecting
      Sep  9 08:21:18 oak-md1-s1 kernel: Lustre: Skipped 5 previous similar messages
      Sep  9 08:21:20 oak-md1-s1 kernel: Lustre: oak-MDT0001: Client 38d27508-3ea0-383f-8501-005bb0842ebd (at 10.51.4.12@o2ib3) reconnecting
      Sep  9 08:21:20 oak-md1-s1 kernel: Lustre: Skipped 2 previous similar messages
      Sep  9 08:21:35 oak-md1-s1 kernel: Lustre: 7555:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
      Sep  9 08:21:35 oak-md1-s1 kernel: LustreError: 7555:0:(genops.c:556:class_register_device()) oak-OST0144-osc-MDT0001: already exists, won't add
      Sep  9 08:21:35 oak-md1-s1 kernel: LustreError: 7555:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
      Sep  9 08:21:35 oak-md1-s1 kernel: Lustre:    cmd=cf001 0:oak-OST0144-osc-MDT0001  1:osp  2:oak-MDT0001-mdtlov_UUID  
      Sep  9 08:21:35 oak-md1-s1 kernel: LustreError: 3775:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
      Sep  9 08:21:35 oak-md1-s1 kernel: Lustre:    cmd=cf001 0:oak-OST0144-osc-MDT0002  1:osp  2:oak-MDT0002-mdtlov_UUID  
      Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 7569:0:(obd_config.c:461:class_setup()) Device 635 already setup (type osp)
      Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 7569:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
      Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 7569:0:(obd_config.c:1835:class_config_llog_handler()) Skipped 1 previous similar message
      Sep  9 08:21:44 oak-md1-s1 kernel: Lustre:    cmd=cf003 0:oak-OST0144-osc-MDT0001  1:oak-OST0144_UUID  2:10.0.2.103@o2ib5  
      Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 3775:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
      Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 3775:0:(mgc_request.c:599:do_requeue()) Skipped 1 previous similar message
      Sep  9 08:21:48 oak-md1-s1 kernel: LustreError: 7571:0:(obd_config.c:461:class_setup()) Device 316 already setup (type osp)
      Sep  9 08:21:48 oak-md1-s1 kernel: LustreError: 7571:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
      Sep  9 08:21:48 oak-md1-s1 kernel: Lustre:    cmd=cf003 0:oak-OST0144-osc-MDT0002  1:oak-OST0144_UUID  2:10.0.2.103@o2ib5  
      Sep  9 08:21:48 oak-md1-s1 kernel: LustreError: 3775:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
      Sep  9 08:21:55 oak-md1-s1 kernel: LustreError: 7574:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
      Sep  9 08:21:55 oak-md1-s1 kernel: LustreError: 7574:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
      Sep  9 08:21:55 oak-md1-s1 kernel: Pid: 7574, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
      Sep  9 08:21:56 oak-md1-s1 kernel: Call Trace:
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0b057cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0b0587c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc18baeb6>] osp_obd_connect+0x3c6/0x400 [osp]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc17da04e>] lod_add_device+0xa8e/0x19a0 [lod]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc17d5895>] lod_process_config+0x13b5/0x1510 [lod]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0d4f8b2>] class_process_config+0x2142/0x2830 [obdclass]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0d51b79>] class_config_llog_handler+0x819/0x1520 [obdclass]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0d145af>] llog_process_thread+0x85f/0x1a10 [obdclass]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0d16174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffff9dcc5c21>] kthread+0xd1/0xe0
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffff9e394ddd>] ret_from_fork_nospec_begin+0x7/0x21
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      

      In the trace below, we can see that the MDS crashes are happening at the same time:

      oak-io6-s2: Sep 09 08:37:59 oak-io6-s2 kernel: md/raid:md53: device dm-494 operational as raid disk 1
      oak-io6-s2: Sep 09 08:37:59 oak-io6-s2 kernel: md/raid:md53: raid level 6 active with 10 out of 10 devices, algorithm 2
      oak-io6-s2: Sep 09 08:37:59 oak-io6-s2 kernel: md53: detected capacity change from 0 to 112003075014656
      oak-io6-s2: Sep 09 08:38:00 oak-io6-s2 kernel: LDISKFS-fs (md53): file extents enabled, maximum tree depth=5
      oak-io6-s2: Sep 09 08:38:00 oak-io6-s2 kernel: LDISKFS-fs (md53): mounted filesystem with ordered data mode. Opts: errors=remount-ro
      oak-io6-s2: Sep 09 08:38:01 oak-io6-s2 kernel: LDISKFS-fs (md53): file extents enabled, maximum tree depth=5
      oak-io6-s2: Sep 09 08:38:01 oak-io6-s2 kernel: LDISKFS-fs (md53): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
      oak-io6-s2: Sep 09 08:38:06 oak-io6-s2 kernel: Lustre: oak-OST0145: new disk, initializing
      oak-io6-s2: Sep 09 08:38:06 oak-io6-s2 kernel: Lustre: srv-oak-OST0145: No data found on store. Initialize space
      oak-io6-s2: Sep 09 08:38:06 oak-io6-s2 kernel: Lustre: oak-OST0145: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel: LustreError: 7744:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel: LustreError: 7744:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel: Pid: 7744, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel: LustreError: 5232:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel: LustreError: 5232:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel: Pid: 5232, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel: Call Trace:
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc097b7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel: Call Trace:
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc097b87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc06667cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc066687c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc172deb6>] osp_obd_connect+0x3c6/0x400 [osp]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc1311eb6>] osp_obd_connect+0x3c6/0x400 [osp]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc1bd404e>] lod_add_device+0xa8e/0x19a0 [lod]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc164d04e>] lod_add_device+0xa8e/0x19a0 [lod]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc1648895>] lod_process_config+0x13b5/0x1510 [lod]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc1bcf895>] lod_process_config+0x13b5/0x1510 [lod]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc13df8b2>] class_process_config+0x2142/0x2830 [obdclass]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc0c3a8b2>] class_process_config+0x2142/0x2830 [obdclass]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc0c3cb79>] class_config_llog_handler+0x819/0x1520 [obdclass]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc13e1b79>] class_config_llog_handler+0x819/0x1520 [obdclass]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc13a45af>] llog_process_thread+0x85f/0x1a10 [obdclass]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc0bff5af>] llog_process_thread+0x85f/0x1a10 [obdclass]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc0c01174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc13a6174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffff9bec5c21>] kthread+0xd1/0xe0
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffbd8c5c21>] kthread+0xd1/0xe0
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffbdf94ddd>] ret_from_fork_nospec_begin+0x7/0x21
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffff9c594ddd>] ret_from_fork_nospec_begin+0x7/0x21
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel: LustreError: 5224:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel: LustreError: 5224:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel: Pid: 5224, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel: Call Trace:
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc09f07cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc09f087c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc1b38eb6>] osp_obd_connect+0x3c6/0x400 [osp]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc1a5804e>] lod_add_device+0xa8e/0x19a0 [lod]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc1a53895>] lod_process_config+0x13b5/0x1510 [lod]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc13228b2>] class_process_config+0x2142/0x2830 [obdclass]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc1324b79>] class_config_llog_handler+0x819/0x1520 [obdclass]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc12e75af>] llog_process_thread+0x85f/0x1a10 [obdclass]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc12e9174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffae4c5c21>] kthread+0xd1/0xe0
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffaeb94ddd>] ret_from_fork_nospec_begin+0x7/0x21
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      

      After rebooting the crashed MDS, the last added OSTs are working fine (after MDS recovery, they allocate their "super-sequence" and are then operational. But every time I add more OSTs to the filesystem, all MDS crash with this backtrace. This looks similar to LU-9699.

      I'm attaching the output of

      lctl --device MGS llog_print <FILE>
      

      as <FILE>.llog.txt for FILE = oak-client and oak-MDT0000 to oak-MDT0005
      in case anyone can see if there is a problem somewhere.

      Attachments

        1. llogdiff
          297 kB
          Dominique Martinet
        2. oak-client.llog.txt
          187 kB
          Stephane Thiell
        3. oak-MDT0000.after-ost331
          426 kB
          Stephane Thiell
        4. oak-MDT0000.backup-20210923
          425 kB
          Stephane Thiell
        5. oak-MDT0000.llog.txt
          201 kB
          Stephane Thiell
        6. oak-MDT0001.llog.txt
          199 kB
          Stephane Thiell
        7. oak-MDT0002.llog.txt
          199 kB
          Stephane Thiell
        8. oak-MDT0003.llog.txt
          198 kB
          Stephane Thiell
        9. oak-MDT0004.llog.txt
          198 kB
          Stephane Thiell
        10. oak-MDT0005.llog.txt
          198 kB
          Stephane Thiell

        Issue Links

          Activity

            People

              tappro Mikhail Pershin
              sthiell Stephane Thiell
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: