Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-15000

MDS crashes with (osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed

Details

    • 3
    • 9223372036854775807

    Description

      I previously reported some issues when adding OSTs in LU-14695, and I also guess we got hit by LU-13356 and the MGS was not unable to process new OSTs. So we recently upgraded all Oak servers to 2.12.7 + single patch from LU-13356. I'm opening this ticket in this new context where the MGS seems to work better so far. But now the issue is more on the MDS side apparently.

      We have 300+ OSTs. After adding new OSTs, the first ones succeed, but after 3 or 4 new OSTs, all MDTs crash with the following backtrace (you can see the new OST being initialized on an OSS and then a MDS crash:

      Sep 09 07:59:33 oak-io6-s2 kernel: md/raid:md45: raid level 6 active with 10 out of 10 devices, algorithm 2
      Sep 09 07:59:33 oak-io6-s2 kernel: md45: detected capacity change from 0 to 112003075014656
      Sep 09 07:59:33 oak-io6-s2 kernel: LDISKFS-fs (md45): file extents enabled, maximum tree depth=5
      Sep 09 07:59:34 oak-io6-s2 kernel: LDISKFS-fs (md45): mounted filesystem with ordered data mode. Opts: errors=remount-ro
      Sep 09 07:59:34 oak-io6-s2 kernel: LDISKFS-fs (md45): file extents enabled, maximum tree depth=5
      Sep 09 07:59:34 oak-io6-s2 kernel: LDISKFS-fs (md45): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
      Sep 09 07:59:38 oak-io6-s2 kernel: Lustre: oak-OST013d: new disk, initializing
      Sep 09 07:59:38 oak-io6-s2 kernel: Lustre: srv-oak-OST013d: No data found on store. Initialize space
      Sep 09 07:59:38 oak-io6-s2 kernel: Lustre: oak-OST013d: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      Sep 09 07:59:43 oak-md1-s1 kernel: LustreError: 24251:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
      Sep 09 07:59:43 oak-md1-s1 kernel: LustreError: 24251:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
      Sep 09 07:59:43 oak-md1-s1 kernel: Pid: 24251, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
      Sep 09 07:59:43 oak-md1-s1 kernel: Call Trace:
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0cac7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0cac87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc1a2beb6>] osp_obd_connect+0x3c6/0x400 [osp]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc194b04e>] lod_add_device+0xa8e/0x19a0 [lod]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc1946895>] lod_process_config+0x13b5/0x1510 [lod]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0ef18b2>] class_process_config+0x2142/0x2830 [obdclass]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0ef3b79>] class_config_llog_handler+0x819/0x1520 [obdclass]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0eb65af>] llog_process_thread+0x85f/0x1a10 [obdclass]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffc0eb8174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffa62c5c21>] kthread+0xd1/0xe0
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffa6994ddd>] ret_from_fork_nospec_begin+0x7/0x21
      Sep 09 07:59:43 oak-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      

      This is another example when we can see llog errors, for OSTs that actually are added (they are properly registered on the MGS), before the crash of the MDS:

      Sep  9 08:21:17 oak-md1-s1 kernel: Lustre: oak-MDT0001: Client 59a1cc5a-bdc6-102a-1ccb-b551d993bacc (at 10.51.3.32@o2ib3) reconnecting
      Sep  9 08:21:18 oak-md1-s1 kernel: Lustre: oak-MDT0001: Client ea26edda-b4c7-1625-cd1a-c65d4487d437 (at 10.51.3.68@o2ib3) reconnecting
      Sep  9 08:21:18 oak-md1-s1 kernel: Lustre: Skipped 5 previous similar messages
      Sep  9 08:21:20 oak-md1-s1 kernel: Lustre: oak-MDT0001: Client 38d27508-3ea0-383f-8501-005bb0842ebd (at 10.51.4.12@o2ib3) reconnecting
      Sep  9 08:21:20 oak-md1-s1 kernel: Lustre: Skipped 2 previous similar messages
      Sep  9 08:21:35 oak-md1-s1 kernel: Lustre: 7555:0:(obd_config.c:1641:class_config_llog_handler()) Skip config outside markers, (inst: 0000000000000000, uuid: , flags: 0x0)
      Sep  9 08:21:35 oak-md1-s1 kernel: LustreError: 7555:0:(genops.c:556:class_register_device()) oak-OST0144-osc-MDT0001: already exists, won't add
      Sep  9 08:21:35 oak-md1-s1 kernel: LustreError: 7555:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
      Sep  9 08:21:35 oak-md1-s1 kernel: Lustre:    cmd=cf001 0:oak-OST0144-osc-MDT0001  1:osp  2:oak-MDT0001-mdtlov_UUID  
      Sep  9 08:21:35 oak-md1-s1 kernel: LustreError: 3775:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
      Sep  9 08:21:35 oak-md1-s1 kernel: Lustre:    cmd=cf001 0:oak-OST0144-osc-MDT0002  1:osp  2:oak-MDT0002-mdtlov_UUID  
      Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 7569:0:(obd_config.c:461:class_setup()) Device 635 already setup (type osp)
      Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 7569:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
      Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 7569:0:(obd_config.c:1835:class_config_llog_handler()) Skipped 1 previous similar message
      Sep  9 08:21:44 oak-md1-s1 kernel: Lustre:    cmd=cf003 0:oak-OST0144-osc-MDT0001  1:oak-OST0144_UUID  2:10.0.2.103@o2ib5  
      Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 3775:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
      Sep  9 08:21:44 oak-md1-s1 kernel: LustreError: 3775:0:(mgc_request.c:599:do_requeue()) Skipped 1 previous similar message
      Sep  9 08:21:48 oak-md1-s1 kernel: LustreError: 7571:0:(obd_config.c:461:class_setup()) Device 316 already setup (type osp)
      Sep  9 08:21:48 oak-md1-s1 kernel: LustreError: 7571:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.0.2.51@o2ib5: cfg command failed: rc = -17
      Sep  9 08:21:48 oak-md1-s1 kernel: Lustre:    cmd=cf003 0:oak-OST0144-osc-MDT0002  1:oak-OST0144_UUID  2:10.0.2.103@o2ib5  
      Sep  9 08:21:48 oak-md1-s1 kernel: LustreError: 3775:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
      Sep  9 08:21:55 oak-md1-s1 kernel: LustreError: 7574:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
      Sep  9 08:21:55 oak-md1-s1 kernel: LustreError: 7574:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
      Sep  9 08:21:55 oak-md1-s1 kernel: Pid: 7574, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
      Sep  9 08:21:56 oak-md1-s1 kernel: Call Trace:
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0b057cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0b0587c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc18baeb6>] osp_obd_connect+0x3c6/0x400 [osp]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc17da04e>] lod_add_device+0xa8e/0x19a0 [lod]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc17d5895>] lod_process_config+0x13b5/0x1510 [lod]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0d4f8b2>] class_process_config+0x2142/0x2830 [obdclass]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0d51b79>] class_config_llog_handler+0x819/0x1520 [obdclass]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0d145af>] llog_process_thread+0x85f/0x1a10 [obdclass]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffc0d16174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffff9dcc5c21>] kthread+0xd1/0xe0
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffff9e394ddd>] ret_from_fork_nospec_begin+0x7/0x21
      Sep  9 08:21:56 oak-md1-s1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      

      In the trace below, we can see that the MDS crashes are happening at the same time:

      oak-io6-s2: Sep 09 08:37:59 oak-io6-s2 kernel: md/raid:md53: device dm-494 operational as raid disk 1
      oak-io6-s2: Sep 09 08:37:59 oak-io6-s2 kernel: md/raid:md53: raid level 6 active with 10 out of 10 devices, algorithm 2
      oak-io6-s2: Sep 09 08:37:59 oak-io6-s2 kernel: md53: detected capacity change from 0 to 112003075014656
      oak-io6-s2: Sep 09 08:38:00 oak-io6-s2 kernel: LDISKFS-fs (md53): file extents enabled, maximum tree depth=5
      oak-io6-s2: Sep 09 08:38:00 oak-io6-s2 kernel: LDISKFS-fs (md53): mounted filesystem with ordered data mode. Opts: errors=remount-ro
      oak-io6-s2: Sep 09 08:38:01 oak-io6-s2 kernel: LDISKFS-fs (md53): file extents enabled, maximum tree depth=5
      oak-io6-s2: Sep 09 08:38:01 oak-io6-s2 kernel: LDISKFS-fs (md53): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
      oak-io6-s2: Sep 09 08:38:06 oak-io6-s2 kernel: Lustre: oak-OST0145: new disk, initializing
      oak-io6-s2: Sep 09 08:38:06 oak-io6-s2 kernel: Lustre: srv-oak-OST0145: No data found on store. Initialize space
      oak-io6-s2: Sep 09 08:38:06 oak-io6-s2 kernel: Lustre: oak-OST0145: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel: LustreError: 7744:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel: LustreError: 7744:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel: Pid: 7744, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel: LustreError: 5232:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel: LustreError: 5232:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel: Pid: 5232, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel: Call Trace:
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc097b7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel: Call Trace:
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc097b87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc06667cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc066687c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc172deb6>] osp_obd_connect+0x3c6/0x400 [osp]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc1311eb6>] osp_obd_connect+0x3c6/0x400 [osp]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc1bd404e>] lod_add_device+0xa8e/0x19a0 [lod]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc164d04e>] lod_add_device+0xa8e/0x19a0 [lod]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc1648895>] lod_process_config+0x13b5/0x1510 [lod]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc1bcf895>] lod_process_config+0x13b5/0x1510 [lod]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc13df8b2>] class_process_config+0x2142/0x2830 [obdclass]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc0c3a8b2>] class_process_config+0x2142/0x2830 [obdclass]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc0c3cb79>] class_config_llog_handler+0x819/0x1520 [obdclass]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc13e1b79>] class_config_llog_handler+0x819/0x1520 [obdclass]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc13a45af>] llog_process_thread+0x85f/0x1a10 [obdclass]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc0bff5af>] llog_process_thread+0x85f/0x1a10 [obdclass]
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffc0c01174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffc13a6174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffff9bec5c21>] kthread+0xd1/0xe0
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffbd8c5c21>] kthread+0xd1/0xe0
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffbdf94ddd>] ret_from_fork_nospec_begin+0x7/0x21
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffff9c594ddd>] ret_from_fork_nospec_begin+0x7/0x21
      oak-md2-s2: Sep 09 08:38:11 oak-md2-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      oak-md1-s2: Sep 09 08:38:11 oak-md1-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel: LustreError: 5224:0:(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed: 
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel: LustreError: 5224:0:(osp_dev.c:1404:osp_obd_connect()) LBUG
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel: Pid: 5224, comm: llog_process_th 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel: Call Trace:
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc09f07cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc09f087c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc1b38eb6>] osp_obd_connect+0x3c6/0x400 [osp]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc1a5804e>] lod_add_device+0xa8e/0x19a0 [lod]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc1a53895>] lod_process_config+0x13b5/0x1510 [lod]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc13228b2>] class_process_config+0x2142/0x2830 [obdclass]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc1324b79>] class_config_llog_handler+0x819/0x1520 [obdclass]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc12e75af>] llog_process_thread+0x85f/0x1a10 [obdclass]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffc12e9174>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffae4c5c21>] kthread+0xd1/0xe0
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffaeb94ddd>] ret_from_fork_nospec_begin+0x7/0x21
      oak-md2-s1: Sep 09 08:38:12 oak-md2-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      

      After rebooting the crashed MDS, the last added OSTs are working fine (after MDS recovery, they allocate their "super-sequence" and are then operational. But every time I add more OSTs to the filesystem, all MDS crash with this backtrace. This looks similar to LU-9699.

      I'm attaching the output of

      lctl --device MGS llog_print <FILE>
      

      as <FILE>.llog.txt for FILE = oak-client and oak-MDT0000 to oak-MDT0005
      in case anyone can see if there is a problem somewhere.

      Attachments

        1. llogdiff
          297 kB
        2. oak-client.llog.txt
          187 kB
        3. oak-MDT0000.after-ost331
          426 kB
        4. oak-MDT0000.backup-20210923
          425 kB
        5. oak-MDT0000.llog.txt
          201 kB
        6. oak-MDT0001.llog.txt
          199 kB
        7. oak-MDT0002.llog.txt
          199 kB
        8. oak-MDT0003.llog.txt
          198 kB
        9. oak-MDT0004.llog.txt
          198 kB
        10. oak-MDT0005.llog.txt
          198 kB

        Issue Links

          Activity

            [LU-15000] MDS crashes with (osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48898/
            Subject: LU-15000 llog: read canceled records in llog_backup
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 7ba5dc8e895c693f68d49e7ffc46483710d67beb

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48898/ Subject: LU-15000 llog: read canceled records in llog_backup Project: fs/lustre-release Branch: b2_15 Current Patch Set: Commit: 7ba5dc8e895c693f68d49e7ffc46483710d67beb

            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48898
            Subject: LU-15000 llog: read canceled records in llog_backup
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 51bf31c545e2bc67b214de7ae8fd415c1776c3bf

            gerrit Gerrit Updater added a comment - "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48898 Subject: LU-15000 llog: read canceled records in llog_backup Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 51bf31c545e2bc67b214de7ae8fd415c1776c3bf

            @Peter Jones could you please consider this patch for inclusion in LTS releases (at least for 2.15)?

            sthiell Stephane Thiell added a comment - @Peter Jones could you please consider this patch for inclusion in LTS releases (at least for 2.15)?

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47515
            Subject: LU-15000 llog: read canceled records in llog_backup
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 57b085cb1068c1a7a7a35376ccf1db10f489c6a9

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47515 Subject: LU-15000 llog: read canceled records in llog_backup Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: 57b085cb1068c1a7a7a35376ccf1db10f489c6a9
            pjones Peter Jones added a comment -

            Landed for 2.16

            pjones Peter Jones added a comment - Landed for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46552/
            Subject: LU-15000 llog: read canceled records in llog_backup
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d8e2723b4e9409954846939026c599b0b1170e6e

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46552/ Subject: LU-15000 llog: read canceled records in llog_backup Project: fs/lustre-release Branch: master Current Patch Set: Commit: d8e2723b4e9409954846939026c599b0b1170e6e
            eaujames Etienne Aujames added a comment - - edited

            Hi Stephane,
            Great for you.
            I will add some reviewers to land the patch.

            eaujames Etienne Aujames added a comment - - edited Hi Stephane, Great for you. I will add some reviewers to land the patch.

            Hi Etienne,
            Today, we applied your patch on MGS/MDS (on top of 2.12.7) and we were then able to add 16 OSTs without issue. So your patch worked for us. This will also allow us to consider using llog_cancel/del_ost in the future again.
            Thank you!

            sthiell Stephane Thiell added a comment - Hi Etienne, Today, we applied your patch on MGS/MDS (on top of 2.12.7) and we were then able to add 16 OSTs without issue. So your patch worked for us. This will also allow us to consider using llog_cancel/del_ost in the future again. Thank you!

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46552
            Subject: LU-15000 llog: read canceled records in llog_backup
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0c5471748b18d40893309aad3943b8aed5e91f1a

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46552 Subject: LU-15000 llog: read canceled records in llog_backup Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 0c5471748b18d40893309aad3943b8aed5e91f1a

            @Stephane Thiell the issue with correcting the MGS configs is that the target copies have last indexes lower that the original ones. Clients/targets are notified only for "new" records in mounted state, so they will ignore indexes between original and the copy.

            One solution could be to artificially generate rec after correcting MGS (with "lctl config_param <param>" and "lctl config_param -d <param>") until the correct configs index reach the old configs "last indexes". Then clients/targets should take into account new records (e.g adding new OSTs).

            eaujames Etienne Aujames added a comment - @Stephane Thiell the issue with correcting the MGS configs is that the target copies have last indexes lower that the original ones. Clients/targets are notified only for "new" records in mounted state, so they will ignore indexes between original and the copy. One solution could be to artificially generate rec after correcting MGS (with "lctl config_param <param>" and "lctl config_param -d <param>") until the correct configs index reach the old configs "last indexes". Then clients/targets should take into account new records (e.g adding new OSTs).

            People

              tappro Mikhail Pershin
              sthiell Stephane Thiell
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: