Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-17060

conf-sanity test_21c: Unable to start OST2 - MGS refuses to allow this server to start

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/57f0b610-268d-4d1e-be2b-d1898e8431b1

      test_21c failed with the following error in the test logs:

      trevis-98vm3 mkdir -p /mnt/lustre-ost2; mount -t lustre -o localrecov  /dev/mapper/ost2_flakey /mnt/lustre-ost2
      trevis-98vm3: mount.lustre: mount /dev/mapper/ost2_flakey at /mnt/lustre-ost2 failed: Invalid argument
      trevis-98vm3: This may have multiple causes.
      trevis-98vm3: Are the mount options correct?
      trevis-98vm3: Check the syslog for more info.
      

      The OST console log shows:

      Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost2; mount -t lustre -o localrecov  /dev/mapper/ost2_flakey /mnt/lustre-ost2
      LDISKFS-fs (dm-11): file extents enabled, maximum tree depth=5
      LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro
      LDISKFS-fs (dm-11): file extents enabled, maximum tree depth=5
      LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
      LustreError: 160-7: lustre-OST0001: the MGS refuses to allow this server to start: rc = -22. Please see messages on the MGS.
      LustreError: 30487:0:(tgt_mount.c:2216:server_fill_super()) Unable to start targets: -22
      LustreError: 30487:0:(tgt_mount.c:1752:server_put_super()) no obd lustre-OST0001
      LustreError: 30487:0:(tgt_mount.c:132:server_deregister_mount()) lustre-OST0001 not registered
      Lustre: server umount lustre-OST0001 complete
      LustreError: 30487:0:(super25.c:188:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -22
      

      The MGS console log shows:

      LustreError: 14228:0:(mgs_llog.c:4541:mgs_write_log_target()) Can't write logs for lustre-OST0001 (-22)
      LustreError: 14228:0:(mgs_handler.c:518:mgs_target_reg()) Failed to write lustre-OST0001 log (-22)
      Lustre: DEBUG MARKER: /usr/sbin/lctl mark  conf-sanity test_21c: @@@@@@ FAIL: Unable to start OST2 
      Lustre: DEBUG MARKER: conf-sanity test_21c: @@@@@@ FAIL: Unable to start OST2
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master/4455 - 4.18.0-477.15.1.el8_8.x86_64
      servers: https://build.whamcloud.com/job/lustre-master/4455 - 4.18.0-477.15.1.el8_lustre.x86_64

      It appears that the "Unable to start OST2" problem started on 2023-08-19. There were previous intermittent failures with "Unable to start OST1" but these were all fallout from earlier failures.

      https://testing.whamcloud.com/search?status%5B%5D=FAIL&test_set_script_id=7f66aa20-3db2-11e0-80c0-52540025f9af&sub_test_script_id=8104001c-3db2-11e0-80c0-52540025f9af&start_date=2023-05-22&end_date=2023-08-29&source=sub_tests#redirect

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      conf-sanity test_21c - Unable to start OST2

      Attachments

        Issue Links

          Activity

            [LU-17060] conf-sanity test_21c: Unable to start OST2 - MGS refuses to allow this server to start
            simmonsja James A Simmons added a comment - - edited

            Still failing. I looked at the logs and I think its some LNet issue on the MGS. I see this in the logs on the MGS:

            00000400:00000200:0.0:1693491996.281509:0:9414:0:(lib-move.c:4757:lnet_parse()) TRACE: 10.240.42.123@tcp(10.240.42.123@tcp) <- 10.240.42.122@tcp : PUT - for me
            00000400:00000200:0.0:1693491996.281517:0:9414:0:(lib-ptl.c:573:lnet_ptl_match_md()) Request from 12345-10.240.42.122@tcp of length 520 into portal 26 MB=0x64f0a19a0d780 
            00000400:00000100:0.0:1693491996.281532:0:9414:0:(lib-move.c:4455:lnet_parse_put()) Dropping PUT from 12345-10.240.42.122@tcp portal 26 match 1775754658502528 offset 0 length 520: 4 
            00000400:00000200:0.0:1693491996.281534:0:9414:0:(lib-msg.c:1041:lnet_is_health_check()) Msg 00000000eb75fdc0 is in inconsistent state, don't perform health checking (-2, 0) 
            00000400:00000200:0.0:1693491996.281536:0:9414:0:(lib-msg.c:1046:lnet_is_health_check()) health check = 0, status = -2, hstatus = 0
            

            this leads to a timeout (-110) for the OST registration with the MGS. On the OST its a ptlrpc timeout issue.

            simmonsja James A Simmons added a comment - - edited Still failing. I looked at the logs and I think its some LNet issue on the MGS. I see this in the logs on the MGS: 00000400:00000200:0.0:1693491996.281509:0:9414:0:(lib-move.c:4757:lnet_parse()) TRACE: 10.240.42.123@tcp(10.240.42.123@tcp) <- 10.240.42.122@tcp : PUT - for me 00000400:00000200:0.0:1693491996.281517:0:9414:0:(lib-ptl.c:573:lnet_ptl_match_md()) Request from 12345-10.240.42.122@tcp of length 520 into portal 26 MB=0x64f0a19a0d780 00000400:00000100:0.0:1693491996.281532:0:9414:0:(lib-move.c:4455:lnet_parse_put()) Dropping PUT from 12345-10.240.42.122@tcp portal 26 match 1775754658502528 offset 0 length 520: 4 00000400:00000200:0.0:1693491996.281534:0:9414:0:(lib-msg.c:1041:lnet_is_health_check()) Msg 00000000eb75fdc0 is in inconsistent state, don't perform health checking (-2, 0) 00000400:00000200:0.0:1693491996.281536:0:9414:0:(lib-msg.c:1046:lnet_is_health_check()) health check = 0, status = -2, hstatus = 0 this leads to a timeout (-110) for the OST registration with the MGS. On the OST its a ptlrpc timeout issue.

            "James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52199
            Subject: LU-17060 tests: verify conf-sanity 21c passes
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 99f82aa8c22870cda50da1b5a19fe55f41666dda

            gerrit Gerrit Updater added a comment - "James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52199 Subject: LU-17060 tests: verify conf-sanity 21c passes Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 99f82aa8c22870cda50da1b5a19fe55f41666dda

            I agree, patch 52053 should fix these issues.

            simmonsja James A Simmons added a comment - I agree, patch 52053 should fix these issues.

            It looks like this is also causing conf-sanity test_21c failures in non-interop testing. I'm not sure if it is random, but it looks like it is mainly failing for full test runs and also fairly regularly for e2fsprogs. I think the LU-13306 patch should fix it.

            adilger Andreas Dilger added a comment - It looks like this is also causing conf-sanity test_21c failures in non-interop testing. I'm not sure if it is random, but it looks like it is mainly failing for full test runs and also fairly regularly for e2fsprogs. I think the LU-13306 patch should fix it.
            adilger Andreas Dilger added a comment - - edited

            It looks like this may already be fixed by the following patch:

            "James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/52053
            Subject: LU-13306 mgs: support large NID for mgs_write_log_osc_to_lov
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e28695e90051d33bc5129d3512cdd9e01297f4d8

            adilger Andreas Dilger added a comment - - edited It looks like this may already be fixed by the following patch: "James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/52053 Subject: LU-13306 mgs: support large NID for mgs_write_log_osc_to_lov Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: e28695e90051d33bc5129d3512cdd9e01297f4d8

            It looks like the test is failing every few days in batches, likely when full testing is run after patch landings to master. However, it is also failing regularly (though not every run) for e2fsprogs patches since they run conf-sanity

            Patches landed before the first reported failure:

            $ git log --oneline --after 2023-08-16 --before 2023-08-20
            673ff86a84 LU-16393 o2iblnd: add IBLND_REJECT_EARLY reject reason
            9183fa655a LU-17021 socklnd: fix late ksnr_max_conns set
            441902fa3d LU-17010 lfsck: don't create trans in dryrun mode
            a28ae45d77 LU-17000 lnet: fix use-after-free in lnet_startup_lndnet
            be131d125a LU-16974 utils: lfs mirror resync to show progress
            d9cd9992b9 LU-8191 lnet: convert functions in utils to static
            0163acf4f3 LU-16883 ldiskfs: update for ext4-delayed-iput for SUSE 15
            64535355de LU-16862 rpm: set kmod-lustre-tests requires kmod-lustre explicitly
            6481328c1c LU-9859 libcfs: remove cfs_size_round()
            c08795f952 LU-9859 lnet: simplify cfs_parse_nidlist()
            c0e881a8e1 LU-9859 llite: simplify pcc_fname_list_parse()
            a1dc44aef0 LU-9859 ptlrpc: simplify nrs_tbf_opcode_list_parse()
            b7b55534b9 LU-16673 tests: add checking for user USER0
            ac522557b1 LU-16374 enc: rename O_FILE_ENC to O_CIPHERTEXT
            d285b41b86 LU-16851 llite: use right predicate in ll_atomic_open
            1e9d48625b LU-16962 build: parallel configure cleanup
            42bf5f78ba LU-8130 nrs: convert NRS ORR/TRR to rhashtable
            778791dd7d LU-8130 libcfs: don't use radix tree for xarray
            2feb4a7bb0 LU-16973 ptlrpc: flush delayed file desc if idle
            d0be255ae3 LU-16847 ldiskfs: refactor bio submit.
            c0cb747ebe LU-13306 mgs: use large NIDS in the nid table on the MGS
            7576d29458 LU-15978 osp: fix striped directory deletion fails for 64K PAGE_SIZE
            48ccf2db84 LU-15722 osd-ldiskfs: fix write stuck for 64K PAGE_SIZE
            7bb1685048 LU-16872 tests: exercise sanity test_27M more fully
            766b35a970 LU-16872 lod: reset llc_ostlist when using O_APPEND stripes
            

            I would suspect that LU-13306 is the source of the problem, since it landed on 2023-08-18 and is one of the few patches affecting the MGS/llog code.

            The debug log (edited for brevity) for the OST registration looks like:
            https://testing.whamcloud.com/test_logs/174f3129-f466-47a5-93ba-e43f0cd32a5f/show_text

            0000000:00000001:0.0:1692961293.191655:0:14228:0:(mgs_llog.c:3078:mgs_write_log_osc_to_lov()) Process entered
            20000000:00000040:0.0:1692961293.191655:0:14228:0:(mgs_llog.c:3080:mgs_write_log_osc_to_lov()) adding osc for lustre-OST0001 to log lustre-MDT0000
            00000040:00000001:0.0:1692961293.191657:0:14228:0:(llog.c:1381:llog_open()) Process entered
            00000040:00000010:0.0:1692961293.191657:0:14228:0:(llog.c:62:llog_alloc_handle()) kmalloced '(loghandle)': 280 at ffff8fd080e03400.
            00000040:00000001:0.0:1692961293.191658:0:14228:0:(llog_osd.c:1280:llog_osd_open()) Process entered
            00000020:00000001:0.0:1692961293.191658:0:14228:0:(local_storage.c:150:ls_device_find_or_init()) Process entered
            00000020:00000001:0.0:1692961293.191659:0:14228:0:(local_storage.c:163:ls_device_find_or_init()) Process leaving via out_ls (rc=18446620724537471488 : -123349172080128 : 0xffff8fd0886a3200)
            00080000:00000001:0.0:1692961293.191661:0:14228:0:(osd_handler.c:7959:osd_index_ea_lookup()) Process entered
            00080000:00000002:0.0:1692961293.191671:0:14228:0:(osd_handler.c:6211:osd_add_oi_cache()) add [0xa:0x4:0x0] 87:831759535 to info ffff8fd08d818000
            00080000:00000002:0.0:1692961293.191672:0:14228:0:(osd_handler.c:6428:osd_ea_lookup_rec()) [0x200000001:0x1017:0x0]/lustre-MDT0000 => [0xa:0x4:0x0]
            00080000:00000001:0.0:1692961293.191676:0:14228:0:(osd_handler.c:7986:osd_index_ea_lookup()) Process leaving (rc=1 : 1 : 1)
            00000040:00000010:0.0:1692961293.191677:0:14228:0:(llog_osd.c:1342:llog_osd_open()) kmalloced '(handle->lgh_name)': 15 at ffff8fd078c33aa0.
            00000040:00000001:0.0:1692961293.191679:0:14228:0:(llog_osd.c:1392:llog_osd_open()) Process leaving (rc=0 : 0 : 0)
            00000040:00000001:0.0:1692961293.191680:0:14228:0:(llog.c:1404:llog_open()) Process leaving (rc=0 : 0 : 0)
            00000040:00000001:0.0:1692961293.191681:0:14228:0:(llog.c:401:llog_init_handle()) Process entered
            00000040:00000001:0.0:1692961293.191687:0:14228:0:(llog.c:464:llog_init_handle()) Process leaving (rc=0 : 0 : 0)
            00000040:00000001:0.0:1692961293.191688:0:14228:0:(llog_osd.c:1688:llog_osd_close()) Process entered
            00000020:00000002:0.0:1692961293.191688:0:14228:0:(lu_object.c:224:lu_object_put()) Add ffff8fd06ae40000/ffff8fd08bbd0c00 to site lru. bkt: ffff8fd0b9968d20
            00000040:00000010:0.0:1692961293.191689:0:14228:0:(llog_osd.c:1706:llog_osd_close()) kfreed 'handle->lgh_name': 15 at ffff8fd078c33aa0.
            00000040:00000001:0.0:1692961293.191690:0:14228:0:(llog_osd.c:1708:llog_osd_close()) Process leaving (rc=0 : 0 : 0)
            20000000:00000040:0.0:1692961293.191692:0:14228:0:(lustre_log.h:410:llog_ctxt_put()) PUTting ctxt ffff8fd0bad1f480 : new refcount 1
            20000000:00000010:0.0:1692961293.191694:0:14228:0:(mgs_llog.c:156:name_create()) kmalloced '(*newname)': 1 at ffff8fd066ff6f38.
            20000000:00000010:0.0:1692961293.191695:0:14228:0:(mgs_llog.c:156:name_create()) kmalloced '(*newname)': 19 at ffff8fd081693bc0.
            20000000:00000010:0.0:1692961293.191714:0:14228:0:(mgs_llog.c:156:name_create()) kmalloced '(*newname)': 27 at ffff8fd081693820.
            20000000:00000010:0.0:1692961293.191715:0:14228:0:(mgs_llog.c:156:name_create()) kmalloced '(*newname)': 32 at ffff8fd081693100.
            20000000:00000010:0.0:1692961293.191716:0:14228:0:(mgs_llog.c:156:name_create()) kmalloced '(*newname)': 27 at ffff8fd081693600.
            20000000:00000001:0.0:1692961293.191717:0:14228:0:(mgs_llog.c:1953:record_start_log()) Process entered
            20000000:00000040:0.0:1692961293.191717:0:14228:0:(lustre_log.h:395:llog_ctxt_get()) GETting ctxt ffff8fd0bad1f480 : new refcount 2
            00000040:00000001:0.0:1692961293.191718:0:14228:0:(llog.c:1242:llog_open_create()) Process entered
            00000040:00000001:0.0:1692961293.191728:0:14228:0:(llog.c:1070:llog_exist()) Process entered
            00000040:00000001:0.0:1692961293.191728:0:14228:0:(llog.c:1079:llog_exist()) Process leaving (rc=1 : 1 : 1)
            00000040:00000001:0.0:1692961293.191729:0:14228:0:(llog.c:1249:llog_open_create()) Process leaving (rc=0 : 0 : 0)
            00000040:00000001:0.0:1692961293.191729:0:14228:0:(llog.c:401:llog_init_handle()) Process entered
            00000040:00000010:0.0:1692961293.191730:0:14228:0:(llog.c:406:llog_init_handle()) kmalloced '(llh)': 8192 at ffff8fd07e134000.
            00000040:00000001:0.0:1692961293.191734:0:14228:0:(llog.c:464:llog_init_handle()) Process leaving (rc=0 : 0 : 0)
            20000000:00000040:0.0:1692961293.191735:0:14228:0:(lustre_log.h:410:llog_ctxt_put()) PUTting ctxt ffff8fd0bad1f480 : new refcount 1
            20000000:00000001:0.0:1692961293.191735:0:14228:0:(mgs_llog.c:1977:record_start_log()) Process leaving (rc=0 : 0 : 0)
            00000020:00000001:0.0:1692961293.191736:0:14228:0:(obd_config.c:1742:lustre_cfg_rec_new()) Process entered
            00000020:00000010:0.0:1692961293.191737:0:14228:0:(obd_config.c:1748:lustre_cfg_rec_new()) kmalloced '(lcr)': 224 at ffff8fd0bad1b000.
            00000020:00000001:0.0:1692961293.191737:0:14228:0:(obd_config.c:1757:lustre_cfg_rec_new()) Process leaving (rc=18446620725383114752 : -123348326436864 : ffff8fd0bad1b000)
            00000040:00000001:0.0:1692961293.191738:0:14228:0:(llog.c:1328:llog_write()) Process entered
            00000040:00000001:0.0:1692961293.191748:0:14228:0:(llog.c:1157:llog_write_rec()) Process entered
            00000040:00000001:0.0:1692961293.191748:0:14228:0:(llog_osd.c:397:llog_osd_write_rec()) Process entered
            00000040:00001000:0.0:1692961293.191748:0:14228:0:(llog_osd.c:404:llog_osd_write_rec()) new record 10620000 to [0xa:0x4:0x0]
            00000040:00080000:0.0:1692961293.191755:0:14228:0:(llog_osd.c:735:llog_osd_write_rec()) added record [0xa:0x4:0x0].34, 224 off14112
            00000040:00000001:0.0:1692961293.191757:0:14228:0:(llog_osd.c:748:llog_osd_write_rec()) Process leaving (rc=0 : 0 : 0)
            00000040:00000001:0.0:1692961293.191757:0:14228:0:(llog.c:1190:llog_write_rec()) Process leaving (rc=0 : 0 : 0)
            00000040:00000001:0.0:1692961293.191761:0:14228:0:(llog.c:1370:llog_write()) Process leaving (rc=0 : 0 : 0)
            00000020:00000001:0.0:1692961293.191762:0:14228:0:(obd_config.c:1763:lustre_cfg_rec_free()) Process entered
            00000020:00000010:0.0:1692961293.191762:0:14228:0:(obd_config.c:1764:lustre_cfg_rec_free()) kfreed 'lcr': 224 at ffff8fd0bad1b000.
            00000020:00000001:0.0:1692961293.191763:0:14228:0:(obd_config.c:1765:lustre_cfg_rec_free()) Process leaving
            20000000:00000001:0.0:1692961293.191764:0:14228:0:(mgs_llog.c:3149:mgs_write_log_osc_to_lov()) Process leaving via out_end (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
            00000040:00000001:0.0:1692961293.191765:0:14228:0:(llog_osd.c:1688:llog_osd_close()) Process entered
            00000020:00000002:0.0:1692961293.191765:0:14228:0:(lu_object.c:224:lu_object_put()) Add ffff8fd06ae40000/ffff8fd08bbd0c00 to site lru. bkt: ffff8fd0b9968d20
            00000040:00000010:0.0:1692961293.191766:0:14228:0:(llog_osd.c:1706:llog_osd_close()) kfreed 'handle->lgh_name': 15 at ffff8fd078c33aa0.
            00000040:00000001:0.0:1692961293.191767:0:14228:0:(llog_osd.c:1708:llog_osd_close()) Process leaving (rc=0 : 0 : 0)
            00000040:00000010:0.0:1692961293.191767:0:14228:0:(llog.c:90:llog_free_handle()) kfreed 'loghandle->lgh_hdr': 8192 at ffff8fd07e134000.
            00000040:00000010:0.0:1692961293.191768:0:14228:0:(llog.c:92:llog_free_handle()) kfreed 'loghandle': 280 at ffff8fd080e03400.
            20000000:00000010:0.0:1692961293.191769:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 27 at ffff8fd081693600.
            20000000:00000010:0.0:1692961293.191770:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 32 at ffff8fd081693100.
            20000000:00000010:0.0:1692961293.191770:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 27 at ffff8fd081693820.
            20000000:00000010:0.0:1692961293.191771:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 19 at ffff8fd081693bc0.
            20000000:00000010:0.0:1692961293.191772:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 1 at ffff8fd066ff6f38.
            20000000:00000001:0.0:1692961293.191773:0:14228:0:(mgs_llog.c:3185:mgs_write_log_osc_to_lov()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
            20000000:00000010:0.0:1692961293.191774:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 15 at ffff8fd078c33e10.
            20000000:00000010:0.0:1692961293.191774:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 22 at ffff8fd081693ee0.
            20000000:00000001:0.0:1692961293.191775:0:14228:0:(mgs_llog.c:3283:mgs_write_log_ost()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
            20000000:00020000:0.0:1692961293.191776:0:14228:0:(mgs_llog.c:4541:mgs_write_log_target()) Can't write logs for lustre-OST0001 (-22)
            20000000:00000001:0.0:1692961293.193427:0:14228:0:(mgs_llog.c:4542:mgs_write_log_target()) Process leaving via out_up (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
            20000000:00000001:0.0:1692961293.193429:0:14228:0:(mgs_llog.c:4578:mgs_write_log_target()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
            20000000:00020000:0.0:1692961293.193430:0:14228:0:(mgs_handler.c:518:mgs_target_reg()) Failed to write lustre-OST0001 log (-22)
            20000000:00000001:0.0:1692961293.194962:0:14228:0:(mgs_handler.c:519:mgs_target_reg()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
            
            adilger Andreas Dilger added a comment - It looks like the test is failing every few days in batches, likely when full testing is run after patch landings to master. However, it is also failing regularly (though not every run) for e2fsprogs patches since they run conf-sanity Patches landed before the first reported failure: $ git log --oneline --after 2023-08-16 --before 2023-08-20 673ff86a84 LU-16393 o2iblnd: add IBLND_REJECT_EARLY reject reason 9183fa655a LU-17021 socklnd: fix late ksnr_max_conns set 441902fa3d LU-17010 lfsck: don't create trans in dryrun mode a28ae45d77 LU-17000 lnet: fix use-after-free in lnet_startup_lndnet be131d125a LU-16974 utils: lfs mirror resync to show progress d9cd9992b9 LU-8191 lnet: convert functions in utils to static 0163acf4f3 LU-16883 ldiskfs: update for ext4-delayed-iput for SUSE 15 64535355de LU-16862 rpm: set kmod-lustre-tests requires kmod-lustre explicitly 6481328c1c LU-9859 libcfs: remove cfs_size_round() c08795f952 LU-9859 lnet: simplify cfs_parse_nidlist() c0e881a8e1 LU-9859 llite: simplify pcc_fname_list_parse() a1dc44aef0 LU-9859 ptlrpc: simplify nrs_tbf_opcode_list_parse() b7b55534b9 LU-16673 tests: add checking for user USER0 ac522557b1 LU-16374 enc: rename O_FILE_ENC to O_CIPHERTEXT d285b41b86 LU-16851 llite: use right predicate in ll_atomic_open 1e9d48625b LU-16962 build: parallel configure cleanup 42bf5f78ba LU-8130 nrs: convert NRS ORR/TRR to rhashtable 778791dd7d LU-8130 libcfs: don't use radix tree for xarray 2feb4a7bb0 LU-16973 ptlrpc: flush delayed file desc if idle d0be255ae3 LU-16847 ldiskfs: refactor bio submit. c0cb747ebe LU-13306 mgs: use large NIDS in the nid table on the MGS 7576d29458 LU-15978 osp: fix striped directory deletion fails for 64K PAGE_SIZE 48ccf2db84 LU-15722 osd-ldiskfs: fix write stuck for 64K PAGE_SIZE 7bb1685048 LU-16872 tests: exercise sanity test_27M more fully 766b35a970 LU-16872 lod: reset llc_ostlist when using O_APPEND stripes I would suspect that LU-13306 is the source of the problem, since it landed on 2023-08-18 and is one of the few patches affecting the MGS/llog code. The debug log (edited for brevity) for the OST registration looks like: https://testing.whamcloud.com/test_logs/174f3129-f466-47a5-93ba-e43f0cd32a5f/show_text 0000000:00000001:0.0:1692961293.191655:0:14228:0:(mgs_llog.c:3078:mgs_write_log_osc_to_lov()) Process entered 20000000:00000040:0.0:1692961293.191655:0:14228:0:(mgs_llog.c:3080:mgs_write_log_osc_to_lov()) adding osc for lustre-OST0001 to log lustre-MDT0000 00000040:00000001:0.0:1692961293.191657:0:14228:0:(llog.c:1381:llog_open()) Process entered 00000040:00000010:0.0:1692961293.191657:0:14228:0:(llog.c:62:llog_alloc_handle()) kmalloced '(loghandle)': 280 at ffff8fd080e03400. 00000040:00000001:0.0:1692961293.191658:0:14228:0:(llog_osd.c:1280:llog_osd_open()) Process entered 00000020:00000001:0.0:1692961293.191658:0:14228:0:(local_storage.c:150:ls_device_find_or_init()) Process entered 00000020:00000001:0.0:1692961293.191659:0:14228:0:(local_storage.c:163:ls_device_find_or_init()) Process leaving via out_ls (rc=18446620724537471488 : -123349172080128 : 0xffff8fd0886a3200) 00080000:00000001:0.0:1692961293.191661:0:14228:0:(osd_handler.c:7959:osd_index_ea_lookup()) Process entered 00080000:00000002:0.0:1692961293.191671:0:14228:0:(osd_handler.c:6211:osd_add_oi_cache()) add [0xa:0x4:0x0] 87:831759535 to info ffff8fd08d818000 00080000:00000002:0.0:1692961293.191672:0:14228:0:(osd_handler.c:6428:osd_ea_lookup_rec()) [0x200000001:0x1017:0x0]/lustre-MDT0000 => [0xa:0x4:0x0] 00080000:00000001:0.0:1692961293.191676:0:14228:0:(osd_handler.c:7986:osd_index_ea_lookup()) Process leaving (rc=1 : 1 : 1) 00000040:00000010:0.0:1692961293.191677:0:14228:0:(llog_osd.c:1342:llog_osd_open()) kmalloced '(handle->lgh_name)': 15 at ffff8fd078c33aa0. 00000040:00000001:0.0:1692961293.191679:0:14228:0:(llog_osd.c:1392:llog_osd_open()) Process leaving (rc=0 : 0 : 0) 00000040:00000001:0.0:1692961293.191680:0:14228:0:(llog.c:1404:llog_open()) Process leaving (rc=0 : 0 : 0) 00000040:00000001:0.0:1692961293.191681:0:14228:0:(llog.c:401:llog_init_handle()) Process entered 00000040:00000001:0.0:1692961293.191687:0:14228:0:(llog.c:464:llog_init_handle()) Process leaving (rc=0 : 0 : 0) 00000040:00000001:0.0:1692961293.191688:0:14228:0:(llog_osd.c:1688:llog_osd_close()) Process entered 00000020:00000002:0.0:1692961293.191688:0:14228:0:(lu_object.c:224:lu_object_put()) Add ffff8fd06ae40000/ffff8fd08bbd0c00 to site lru. bkt: ffff8fd0b9968d20 00000040:00000010:0.0:1692961293.191689:0:14228:0:(llog_osd.c:1706:llog_osd_close()) kfreed 'handle->lgh_name': 15 at ffff8fd078c33aa0. 00000040:00000001:0.0:1692961293.191690:0:14228:0:(llog_osd.c:1708:llog_osd_close()) Process leaving (rc=0 : 0 : 0) 20000000:00000040:0.0:1692961293.191692:0:14228:0:(lustre_log.h:410:llog_ctxt_put()) PUTting ctxt ffff8fd0bad1f480 : new refcount 1 20000000:00000010:0.0:1692961293.191694:0:14228:0:(mgs_llog.c:156:name_create()) kmalloced '(*newname)': 1 at ffff8fd066ff6f38. 20000000:00000010:0.0:1692961293.191695:0:14228:0:(mgs_llog.c:156:name_create()) kmalloced '(*newname)': 19 at ffff8fd081693bc0. 20000000:00000010:0.0:1692961293.191714:0:14228:0:(mgs_llog.c:156:name_create()) kmalloced '(*newname)': 27 at ffff8fd081693820. 20000000:00000010:0.0:1692961293.191715:0:14228:0:(mgs_llog.c:156:name_create()) kmalloced '(*newname)': 32 at ffff8fd081693100. 20000000:00000010:0.0:1692961293.191716:0:14228:0:(mgs_llog.c:156:name_create()) kmalloced '(*newname)': 27 at ffff8fd081693600. 20000000:00000001:0.0:1692961293.191717:0:14228:0:(mgs_llog.c:1953:record_start_log()) Process entered 20000000:00000040:0.0:1692961293.191717:0:14228:0:(lustre_log.h:395:llog_ctxt_get()) GETting ctxt ffff8fd0bad1f480 : new refcount 2 00000040:00000001:0.0:1692961293.191718:0:14228:0:(llog.c:1242:llog_open_create()) Process entered 00000040:00000001:0.0:1692961293.191728:0:14228:0:(llog.c:1070:llog_exist()) Process entered 00000040:00000001:0.0:1692961293.191728:0:14228:0:(llog.c:1079:llog_exist()) Process leaving (rc=1 : 1 : 1) 00000040:00000001:0.0:1692961293.191729:0:14228:0:(llog.c:1249:llog_open_create()) Process leaving (rc=0 : 0 : 0) 00000040:00000001:0.0:1692961293.191729:0:14228:0:(llog.c:401:llog_init_handle()) Process entered 00000040:00000010:0.0:1692961293.191730:0:14228:0:(llog.c:406:llog_init_handle()) kmalloced '(llh)': 8192 at ffff8fd07e134000. 00000040:00000001:0.0:1692961293.191734:0:14228:0:(llog.c:464:llog_init_handle()) Process leaving (rc=0 : 0 : 0) 20000000:00000040:0.0:1692961293.191735:0:14228:0:(lustre_log.h:410:llog_ctxt_put()) PUTting ctxt ffff8fd0bad1f480 : new refcount 1 20000000:00000001:0.0:1692961293.191735:0:14228:0:(mgs_llog.c:1977:record_start_log()) Process leaving (rc=0 : 0 : 0) 00000020:00000001:0.0:1692961293.191736:0:14228:0:(obd_config.c:1742:lustre_cfg_rec_new()) Process entered 00000020:00000010:0.0:1692961293.191737:0:14228:0:(obd_config.c:1748:lustre_cfg_rec_new()) kmalloced '(lcr)': 224 at ffff8fd0bad1b000. 00000020:00000001:0.0:1692961293.191737:0:14228:0:(obd_config.c:1757:lustre_cfg_rec_new()) Process leaving (rc=18446620725383114752 : -123348326436864 : ffff8fd0bad1b000) 00000040:00000001:0.0:1692961293.191738:0:14228:0:(llog.c:1328:llog_write()) Process entered 00000040:00000001:0.0:1692961293.191748:0:14228:0:(llog.c:1157:llog_write_rec()) Process entered 00000040:00000001:0.0:1692961293.191748:0:14228:0:(llog_osd.c:397:llog_osd_write_rec()) Process entered 00000040:00001000:0.0:1692961293.191748:0:14228:0:(llog_osd.c:404:llog_osd_write_rec()) new record 10620000 to [0xa:0x4:0x0] 00000040:00080000:0.0:1692961293.191755:0:14228:0:(llog_osd.c:735:llog_osd_write_rec()) added record [0xa:0x4:0x0].34, 224 off14112 00000040:00000001:0.0:1692961293.191757:0:14228:0:(llog_osd.c:748:llog_osd_write_rec()) Process leaving (rc=0 : 0 : 0) 00000040:00000001:0.0:1692961293.191757:0:14228:0:(llog.c:1190:llog_write_rec()) Process leaving (rc=0 : 0 : 0) 00000040:00000001:0.0:1692961293.191761:0:14228:0:(llog.c:1370:llog_write()) Process leaving (rc=0 : 0 : 0) 00000020:00000001:0.0:1692961293.191762:0:14228:0:(obd_config.c:1763:lustre_cfg_rec_free()) Process entered 00000020:00000010:0.0:1692961293.191762:0:14228:0:(obd_config.c:1764:lustre_cfg_rec_free()) kfreed 'lcr': 224 at ffff8fd0bad1b000. 00000020:00000001:0.0:1692961293.191763:0:14228:0:(obd_config.c:1765:lustre_cfg_rec_free()) Process leaving 20000000:00000001:0.0:1692961293.191764:0:14228:0:(mgs_llog.c:3149:mgs_write_log_osc_to_lov()) Process leaving via out_end (rc=18446744073709551594 : -22 : 0xffffffffffffffea) 00000040:00000001:0.0:1692961293.191765:0:14228:0:(llog_osd.c:1688:llog_osd_close()) Process entered 00000020:00000002:0.0:1692961293.191765:0:14228:0:(lu_object.c:224:lu_object_put()) Add ffff8fd06ae40000/ffff8fd08bbd0c00 to site lru. bkt: ffff8fd0b9968d20 00000040:00000010:0.0:1692961293.191766:0:14228:0:(llog_osd.c:1706:llog_osd_close()) kfreed 'handle->lgh_name': 15 at ffff8fd078c33aa0. 00000040:00000001:0.0:1692961293.191767:0:14228:0:(llog_osd.c:1708:llog_osd_close()) Process leaving (rc=0 : 0 : 0) 00000040:00000010:0.0:1692961293.191767:0:14228:0:(llog.c:90:llog_free_handle()) kfreed 'loghandle->lgh_hdr': 8192 at ffff8fd07e134000. 00000040:00000010:0.0:1692961293.191768:0:14228:0:(llog.c:92:llog_free_handle()) kfreed 'loghandle': 280 at ffff8fd080e03400. 20000000:00000010:0.0:1692961293.191769:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 27 at ffff8fd081693600. 20000000:00000010:0.0:1692961293.191770:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 32 at ffff8fd081693100. 20000000:00000010:0.0:1692961293.191770:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 27 at ffff8fd081693820. 20000000:00000010:0.0:1692961293.191771:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 19 at ffff8fd081693bc0. 20000000:00000010:0.0:1692961293.191772:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 1 at ffff8fd066ff6f38. 20000000:00000001:0.0:1692961293.191773:0:14228:0:(mgs_llog.c:3185:mgs_write_log_osc_to_lov()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea) 20000000:00000010:0.0:1692961293.191774:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 15 at ffff8fd078c33e10. 20000000:00000010:0.0:1692961293.191774:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 22 at ffff8fd081693ee0. 20000000:00000001:0.0:1692961293.191775:0:14228:0:(mgs_llog.c:3283:mgs_write_log_ost()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea) 20000000:00020000:0.0:1692961293.191776:0:14228:0:(mgs_llog.c:4541:mgs_write_log_target()) Can't write logs for lustre-OST0001 (-22) 20000000:00000001:0.0:1692961293.193427:0:14228:0:(mgs_llog.c:4542:mgs_write_log_target()) Process leaving via out_up (rc=18446744073709551594 : -22 : 0xffffffffffffffea) 20000000:00000001:0.0:1692961293.193429:0:14228:0:(mgs_llog.c:4578:mgs_write_log_target()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea) 20000000:00020000:0.0:1692961293.193430:0:14228:0:(mgs_handler.c:518:mgs_target_reg()) Failed to write lustre-OST0001 log (-22) 20000000:00000001:0.0:1692961293.194962:0:14228:0:(mgs_handler.c:519:mgs_target_reg()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)

            People

              simmonsja James A Simmons
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated: