[LU-17060] conf-sanity test_21c: Unable to start OST2 - MGS refuses to allow this server to start Created: 29/Aug/23 Updated: 02/Sep/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | James A Simmons |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run:
https://testing.whamcloud.com/test_sets/57f0b610-268d-4d1e-be2b-d1898e8431b1

test_21c failed with the following error in the test logs:

trevis-98vm3 mkdir -p /mnt/lustre-ost2; mount -t lustre -o localrecov /dev/mapper/ost2_flakey /mnt/lustre-ost2
trevis-98vm3: mount.lustre: mount /dev/mapper/ost2_flakey at /mnt/lustre-ost2 failed: Invalid argument
trevis-98vm3: This may have multiple causes.
trevis-98vm3: Are the mount options correct?
trevis-98vm3: Check the syslog for more info.

The OST console log shows:

Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost2; mount -t lustre -o localrecov /dev/mapper/ost2_flakey /mnt/lustre-ost2
LDISKFS-fs (dm-11): file extents enabled, maximum tree depth=5
LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro
LDISKFS-fs (dm-11): file extents enabled, maximum tree depth=5
LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
LustreError: 160-7: lustre-OST0001: the MGS refuses to allow this server to start: rc = -22. Please see messages on the MGS.
LustreError: 30487:0:(tgt_mount.c:2216:server_fill_super()) Unable to start targets: -22
LustreError: 30487:0:(tgt_mount.c:1752:server_put_super()) no obd lustre-OST0001
LustreError: 30487:0:(tgt_mount.c:132:server_deregister_mount()) lustre-OST0001 not registered
Lustre: server umount lustre-OST0001 complete
LustreError: 30487:0:(super25.c:188:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -22

The MGS console log shows:

LustreError: 14228:0:(mgs_llog.c:4541:mgs_write_log_target()) Can't write logs for lustre-OST0001 (-22)
LustreError: 14228:0:(mgs_handler.c:518:mgs_target_reg()) Failed to write lustre-OST0001 log (-22)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark conf-sanity test_21c: @@@@@@ FAIL: Unable to start OST2
Lustre: DEBUG MARKER: conf-sanity test_21c: @@@@@@ FAIL: Unable to start OST2

Test session details:

It appears that the "Unable to start OST2" problem started on 2023-08-19. There were previous intermittent failures with "Unable to start OST1" but these were all fallout from earlier failures.

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Andreas Dilger [ 29/Aug/23 ] |
|
It looks like the test is failing every few days in batches, likely when full testing is run after patch landings to master. However, it is also failing regularly (though not every run) for e2fsprogs patches, since they run conf-sanity.

Patches landed before the first reported failure:

$ git log --oneline --after 2023-08-16 --before 2023-08-20
673ff86a84 LU-16393 o2iblnd: add IBLND_REJECT_EARLY reject reason
9183fa655a LU-17021 socklnd: fix late ksnr_max_conns set
441902fa3d LU-17010 lfsck: don't create trans in dryrun mode
a28ae45d77 LU-17000 lnet: fix use-after-free in lnet_startup_lndnet
be131d125a LU-16974 utils: lfs mirror resync to show progress
d9cd9992b9 LU-8191 lnet: convert functions in utils to static
0163acf4f3 LU-16883 ldiskfs: update for ext4-delayed-iput for SUSE 15
64535355de LU-16862 rpm: set kmod-lustre-tests requires kmod-lustre explicitly
6481328c1c LU-9859 libcfs: remove cfs_size_round()
c08795f952 LU-9859 lnet: simplify cfs_parse_nidlist()
c0e881a8e1 LU-9859 llite: simplify pcc_fname_list_parse()
a1dc44aef0 LU-9859 ptlrpc: simplify nrs_tbf_opcode_list_parse()
b7b55534b9 LU-16673 tests: add checking for user USER0
ac522557b1 LU-16374 enc: rename O_FILE_ENC to O_CIPHERTEXT
d285b41b86 LU-16851 llite: use right predicate in ll_atomic_open
1e9d48625b LU-16962 build: parallel configure cleanup
42bf5f78ba LU-8130 nrs: convert NRS ORR/TRR to rhashtable
778791dd7d LU-8130 libcfs: don't use radix tree for xarray
2feb4a7bb0 LU-16973 ptlrpc: flush delayed file desc if idle
d0be255ae3 LU-16847 ldiskfs: refactor bio submit.
c0cb747ebe LU-13306 mgs: use large NIDS in the nid table on the MGS
7576d29458 LU-15978 osp: fix striped directory deletion fails for 64K PAGE_SIZE
48ccf2db84 LU-15722 osd-ldiskfs: fix write stuck for 64K PAGE_SIZE
7bb1685048 LU-16872 tests: exercise sanity test_27M more fully
766b35a970 LU-16872 lod: reset llc_ostlist when using O_APPEND stripes

I would suspect that

The debug log (edited for brevity) for the OST registration looks like:

0000000:00000001:0.0:1692961293.191655:0:14228:0:(mgs_llog.c:3078:mgs_write_log_osc_to_lov()) Process entered
20000000:00000040:0.0:1692961293.191655:0:14228:0:(mgs_llog.c:3080:mgs_write_log_osc_to_lov()) adding osc for lustre-OST0001 to log lustre-MDT0000
00000040:00000001:0.0:1692961293.191657:0:14228:0:(llog.c:1381:llog_open()) Process entered
00000040:00000010:0.0:1692961293.191657:0:14228:0:(llog.c:62:llog_alloc_handle()) kmalloced '(loghandle)': 280 at ffff8fd080e03400.
00000040:00000001:0.0:1692961293.191658:0:14228:0:(llog_osd.c:1280:llog_osd_open()) Process entered
00000020:00000001:0.0:1692961293.191658:0:14228:0:(local_storage.c:150:ls_device_find_or_init()) Process entered
00000020:00000001:0.0:1692961293.191659:0:14228:0:(local_storage.c:163:ls_device_find_or_init()) Process leaving via out_ls (rc=18446620724537471488 : -123349172080128 : 0xffff8fd0886a3200)
00080000:00000001:0.0:1692961293.191661:0:14228:0:(osd_handler.c:7959:osd_index_ea_lookup()) Process entered
00080000:00000002:0.0:1692961293.191671:0:14228:0:(osd_handler.c:6211:osd_add_oi_cache()) add [0xa:0x4:0x0] 87:831759535 to info ffff8fd08d818000
00080000:00000002:0.0:1692961293.191672:0:14228:0:(osd_handler.c:6428:osd_ea_lookup_rec()) [0x200000001:0x1017:0x0]/lustre-MDT0000 => [0xa:0x4:0x0]
00080000:00000001:0.0:1692961293.191676:0:14228:0:(osd_handler.c:7986:osd_index_ea_lookup()) Process leaving (rc=1 : 1 : 1)
00000040:00000010:0.0:1692961293.191677:0:14228:0:(llog_osd.c:1342:llog_osd_open()) kmalloced '(handle->lgh_name)': 15 at ffff8fd078c33aa0.
00000040:00000001:0.0:1692961293.191679:0:14228:0:(llog_osd.c:1392:llog_osd_open()) Process leaving (rc=0 : 0 : 0)
00000040:00000001:0.0:1692961293.191680:0:14228:0:(llog.c:1404:llog_open()) Process leaving (rc=0 : 0 : 0)
00000040:00000001:0.0:1692961293.191681:0:14228:0:(llog.c:401:llog_init_handle()) Process entered
00000040:00000001:0.0:1692961293.191687:0:14228:0:(llog.c:464:llog_init_handle()) Process leaving (rc=0 : 0 : 0)
00000040:00000001:0.0:1692961293.191688:0:14228:0:(llog_osd.c:1688:llog_osd_close()) Process entered
00000020:00000002:0.0:1692961293.191688:0:14228:0:(lu_object.c:224:lu_object_put()) Add ffff8fd06ae40000/ffff8fd08bbd0c00 to site lru. bkt: ffff8fd0b9968d20
00000040:00000010:0.0:1692961293.191689:0:14228:0:(llog_osd.c:1706:llog_osd_close()) kfreed 'handle->lgh_name': 15 at ffff8fd078c33aa0.
00000040:00000001:0.0:1692961293.191690:0:14228:0:(llog_osd.c:1708:llog_osd_close()) Process leaving (rc=0 : 0 : 0)
20000000:00000040:0.0:1692961293.191692:0:14228:0:(lustre_log.h:410:llog_ctxt_put()) PUTting ctxt ffff8fd0bad1f480 : new refcount 1
20000000:00000010:0.0:1692961293.191694:0:14228:0:(mgs_llog.c:156:name_create()) kmalloced '(*newname)': 1 at ffff8fd066ff6f38.
20000000:00000010:0.0:1692961293.191695:0:14228:0:(mgs_llog.c:156:name_create()) kmalloced '(*newname)': 19 at ffff8fd081693bc0.
20000000:00000010:0.0:1692961293.191714:0:14228:0:(mgs_llog.c:156:name_create()) kmalloced '(*newname)': 27 at ffff8fd081693820.
20000000:00000010:0.0:1692961293.191715:0:14228:0:(mgs_llog.c:156:name_create()) kmalloced '(*newname)': 32 at ffff8fd081693100.
20000000:00000010:0.0:1692961293.191716:0:14228:0:(mgs_llog.c:156:name_create()) kmalloced '(*newname)': 27 at ffff8fd081693600.
20000000:00000001:0.0:1692961293.191717:0:14228:0:(mgs_llog.c:1953:record_start_log()) Process entered
20000000:00000040:0.0:1692961293.191717:0:14228:0:(lustre_log.h:395:llog_ctxt_get()) GETting ctxt ffff8fd0bad1f480 : new refcount 2
00000040:00000001:0.0:1692961293.191718:0:14228:0:(llog.c:1242:llog_open_create()) Process entered
00000040:00000001:0.0:1692961293.191728:0:14228:0:(llog.c:1070:llog_exist()) Process entered
00000040:00000001:0.0:1692961293.191728:0:14228:0:(llog.c:1079:llog_exist()) Process leaving (rc=1 : 1 : 1)
00000040:00000001:0.0:1692961293.191729:0:14228:0:(llog.c:1249:llog_open_create()) Process leaving (rc=0 : 0 : 0)
00000040:00000001:0.0:1692961293.191729:0:14228:0:(llog.c:401:llog_init_handle()) Process entered
00000040:00000010:0.0:1692961293.191730:0:14228:0:(llog.c:406:llog_init_handle()) kmalloced '(llh)': 8192 at ffff8fd07e134000.
00000040:00000001:0.0:1692961293.191734:0:14228:0:(llog.c:464:llog_init_handle()) Process leaving (rc=0 : 0 : 0)
20000000:00000040:0.0:1692961293.191735:0:14228:0:(lustre_log.h:410:llog_ctxt_put()) PUTting ctxt ffff8fd0bad1f480 : new refcount 1
20000000:00000001:0.0:1692961293.191735:0:14228:0:(mgs_llog.c:1977:record_start_log()) Process leaving (rc=0 : 0 : 0)
00000020:00000001:0.0:1692961293.191736:0:14228:0:(obd_config.c:1742:lustre_cfg_rec_new()) Process entered
00000020:00000010:0.0:1692961293.191737:0:14228:0:(obd_config.c:1748:lustre_cfg_rec_new()) kmalloced '(lcr)': 224 at ffff8fd0bad1b000.
00000020:00000001:0.0:1692961293.191737:0:14228:0:(obd_config.c:1757:lustre_cfg_rec_new()) Process leaving (rc=18446620725383114752 : -123348326436864 : ffff8fd0bad1b000)
00000040:00000001:0.0:1692961293.191738:0:14228:0:(llog.c:1328:llog_write()) Process entered
00000040:00000001:0.0:1692961293.191748:0:14228:0:(llog.c:1157:llog_write_rec()) Process entered
00000040:00000001:0.0:1692961293.191748:0:14228:0:(llog_osd.c:397:llog_osd_write_rec()) Process entered
00000040:00001000:0.0:1692961293.191748:0:14228:0:(llog_osd.c:404:llog_osd_write_rec()) new record 10620000 to [0xa:0x4:0x0]
00000040:00080000:0.0:1692961293.191755:0:14228:0:(llog_osd.c:735:llog_osd_write_rec()) added record [0xa:0x4:0x0].34, 224 off14112
00000040:00000001:0.0:1692961293.191757:0:14228:0:(llog_osd.c:748:llog_osd_write_rec()) Process leaving (rc=0 : 0 : 0)
00000040:00000001:0.0:1692961293.191757:0:14228:0:(llog.c:1190:llog_write_rec()) Process leaving (rc=0 : 0 : 0)
00000040:00000001:0.0:1692961293.191761:0:14228:0:(llog.c:1370:llog_write()) Process leaving (rc=0 : 0 : 0)
00000020:00000001:0.0:1692961293.191762:0:14228:0:(obd_config.c:1763:lustre_cfg_rec_free()) Process entered
00000020:00000010:0.0:1692961293.191762:0:14228:0:(obd_config.c:1764:lustre_cfg_rec_free()) kfreed 'lcr': 224 at ffff8fd0bad1b000.
00000020:00000001:0.0:1692961293.191763:0:14228:0:(obd_config.c:1765:lustre_cfg_rec_free()) Process leaving
20000000:00000001:0.0:1692961293.191764:0:14228:0:(mgs_llog.c:3149:mgs_write_log_osc_to_lov()) Process leaving via out_end (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
00000040:00000001:0.0:1692961293.191765:0:14228:0:(llog_osd.c:1688:llog_osd_close()) Process entered
00000020:00000002:0.0:1692961293.191765:0:14228:0:(lu_object.c:224:lu_object_put()) Add ffff8fd06ae40000/ffff8fd08bbd0c00 to site lru. bkt: ffff8fd0b9968d20
00000040:00000010:0.0:1692961293.191766:0:14228:0:(llog_osd.c:1706:llog_osd_close()) kfreed 'handle->lgh_name': 15 at ffff8fd078c33aa0.
00000040:00000001:0.0:1692961293.191767:0:14228:0:(llog_osd.c:1708:llog_osd_close()) Process leaving (rc=0 : 0 : 0)
00000040:00000010:0.0:1692961293.191767:0:14228:0:(llog.c:90:llog_free_handle()) kfreed 'loghandle->lgh_hdr': 8192 at ffff8fd07e134000.
00000040:00000010:0.0:1692961293.191768:0:14228:0:(llog.c:92:llog_free_handle()) kfreed 'loghandle': 280 at ffff8fd080e03400.
20000000:00000010:0.0:1692961293.191769:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 27 at ffff8fd081693600.
20000000:00000010:0.0:1692961293.191770:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 32 at ffff8fd081693100.
20000000:00000010:0.0:1692961293.191770:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 27 at ffff8fd081693820.
20000000:00000010:0.0:1692961293.191771:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 19 at ffff8fd081693bc0.
20000000:00000010:0.0:1692961293.191772:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 1 at ffff8fd066ff6f38.
20000000:00000001:0.0:1692961293.191773:0:14228:0:(mgs_llog.c:3185:mgs_write_log_osc_to_lov()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
20000000:00000010:0.0:1692961293.191774:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 15 at ffff8fd078c33e10.
20000000:00000010:0.0:1692961293.191774:0:14228:0:(mgs_llog.c:166:name_destroy()) kfreed '*name': 22 at ffff8fd081693ee0.
20000000:00000001:0.0:1692961293.191775:0:14228:0:(mgs_llog.c:3283:mgs_write_log_ost()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
20000000:00020000:0.0:1692961293.191776:0:14228:0:(mgs_llog.c:4541:mgs_write_log_target()) Can't write logs for lustre-OST0001 (-22)
20000000:00000001:0.0:1692961293.193427:0:14228:0:(mgs_llog.c:4542:mgs_write_log_target()) Process leaving via out_up (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
20000000:00000001:0.0:1692961293.193429:0:14228:0:(mgs_llog.c:4578:mgs_write_log_target()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
20000000:00020000:0.0:1692961293.193430:0:14228:0:(mgs_handler.c:518:mgs_target_reg()) Failed to write lustre-OST0001 log (-22)
20000000:00000001:0.0:1692961293.194962:0:14228:0:(mgs_handler.c:519:mgs_target_reg()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea) |
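Tracing where the -22 first appears in a debug log like the one above can be done mechanically by scanning the "Process leaving" records for a small negative rc (real errnos fall in -4095..-1, which distinguishes them from pointer values that also print as huge unsigned numbers). The helper below is not part of Lustre; it is just a sketch of that grep, with a hypothetical name `error_exits`:

```python
import re

# Matches lctl debug-log exit records such as:
#   ...(mgs_llog.c:3149:mgs_write_log_osc_to_lov()) Process leaving via out_end (rc=... : -22 : ...)
LEAVE_RE = re.compile(
    r'\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) '
    r'Process leaving(?: via \w+)? \(rc=\d+ : (?P<rc>-?\d+) :'
)

def error_exits(log_text):
    """Return (function, source line, rc) for every exit with an errno-range rc.

    Pointer return values also print as negative numbers, so only rc values
    in the kernel errno range (-4095..-1) are treated as errors.
    """
    hits = []
    for m in LEAVE_RE.finditer(log_text):
        rc = int(m.group('rc'))
        if -4096 < rc < 0:
            hits.append((m.group('func'), int(m.group('line')), rc))
    return hits
```

Run against the trace above, the first hit is mgs_write_log_osc_to_lov() at mgs_llog.c:3149, i.e. the out_end exit, even though the llog_write() call just before it returned 0.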
| Comment by Andreas Dilger [ 29/Aug/23 ] |
|
It looks like this may already be fixed by the following patch:
"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/52053 |
| Comment by Andreas Dilger [ 29/Aug/23 ] |
|
It looks like this is also causing conf-sanity test_21c failures in non-interop testing. I'm not sure if it is random, but it looks like it is mainly failing for full test runs and also fairly regularly for e2fsprogs. I think the |
| Comment by James A Simmons [ 30/Aug/23 ] |
|
I agree, patch 52053 should fix these issues. |
| Comment by Gerrit Updater [ 31/Aug/23 ] |
|
"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52199 |
| Comment by James A Simmons [ 01/Sep/23 ] |
|
Still failing. I looked at the logs and I think it's some LNet issue on the MGS. I see this in the logs on the MGS:

00000400:00000200:0.0:1693491996.281509:0:9414:0:(lib-move.c:4757:lnet_parse()) TRACE: 10.240.42.123@tcp(10.240.42.123@tcp) <- 10.240.42.122@tcp : PUT - for me
00000400:00000200:0.0:1693491996.281517:0:9414:0:(lib-ptl.c:573:lnet_ptl_match_md()) Request from 12345-10.240.42.122@tcp of length 520 into portal 26 MB=0x64f0a19a0d780
00000400:00000100:0.0:1693491996.281532:0:9414:0:(lib-move.c:4455:lnet_parse_put()) Dropping PUT from 12345-10.240.42.122@tcp portal 26 match 1775754658502528 offset 0 length 520: 4
00000400:00000200:0.0:1693491996.281534:0:9414:0:(lib-msg.c:1041:lnet_is_health_check()) Msg 00000000eb75fdc0 is in inconsistent state, don't perform health checking (-2, 0)
00000400:00000200:0.0:1693491996.281536:0:9414:0:(lib-msg.c:1046:lnet_is_health_check()) health check = 0, status = -2, hstatus = 0

This leads to a timeout (-110) for the OST registration with the MGS. On the OST it's a ptlrpc timeout issue. |
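When hunting for these dropped PUTs across a large lctl debug dump, it helps to tally them per source NID and portal (portal 26 here is the MGS request portal). The snippet below is only an illustrative sketch, not a Lustre utility; the helper name `count_put_drops` is made up:

```python
import re
from collections import Counter

# Matches lnet_parse_put() drop records such as:
#   ... Dropping PUT from 12345-10.240.42.122@tcp portal 26 match ... length 520: 4
DROP_RE = re.compile(
    r'Dropping PUT from \d+-(?P<nid>\S+) portal (?P<portal>\d+) '
    r'match \d+ offset \d+ length \d+: (?P<reason>-?\d+)'
)

def count_put_drops(log_text):
    """Tally dropped PUTs per (source NID, portal) in lctl dk output."""
    drops = Counter()
    for m in DROP_RE.finditer(log_text):
        drops[(m.group('nid'), int(m.group('portal')))] += 1
    return drops
```

For the trace above this would report a single drop from 10.240.42.122@tcp (the OST) on portal 26, which is consistent with the registration RPC never reaching the MGS service threads and the client side then hitting the -110 timeout.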