Details
Bug
Resolution: Duplicate
Major
None
Lustre 2.15.5
None
el8_10
3
9223372036854775807
Description
Hi Team,
We recently hit a problem while adding OSTs to an existing file system. The file system has 800+ OSTs. The first few hundred new OSTs were added successfully, but while adding a further batch of 32 OSTs, as those OSTs were being mounted on the OSS hosts, the MDT crashed with the following backtrace.
[Tue Oct 1 16:45:36 2024] Lustre: 2124796:0:(osd_io.c:2104:osd_ldiskfs_write_record()) fslustre-MDT0000: adding bh without locking off 197728 (block 48, size 32, offs 197728)
[Tue Oct 1 16:45:36 2024] WARNING: CPU: 130 PID: 2124796 at fs/jbd2/transaction.c:1526 jbd2_journal_dirty_metadata+0x247/0x260 [jbd2]
[Tue Oct 1 16:45:36 2024] Modules linked in: mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) mbcache jbd2 lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) binfmt_misc 8021q garp mrp stp llc macvlan dm_queue_length cuse fuse vfat fat intel_rapl_msr intel_rapl_common amd64_edac_mod edac_mce_amd kvm_amd ast i2c_algo_bit kvm mlx5_ib drm_shmem_helper drm_kms_helper irqbypass ib_uverbs syscopyarea sysfillrect sysimgblt rapl sp5100_tco drm ib_core pcspkr acpi_cpufreq k10temp i2c_piix4 xfs libcrc32c sd_mod sg nvme_tcp(X) nvme_fabrics nvme nvme_core t10_pi crct10dif_pclmul crc32_pclmul crc32c_intel mlx5_core ghash_clmulni_intel ahci libahci libata mlxfw psample ccp pci_hyperv_intf dm_multipath sunrpc dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls libcxgbi libcxgb qla4xxx
[Tue Oct 1 16:45:36 2024] iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi [last unloaded: libcfs]
[Tue Oct 1 16:45:37 2024] CPU: 130 PID: 2124796 Comm: llog_process_th Kdump: loaded Tainted: G OE X -------- - - 4.18.0-553.5.1.el8_10_lustre_20240808.x86_64 #1
[Tue Oct 1 16:45:37 2024] Hardware name: SERVER E5-2c/Asm,MB+Tray,E5-2c, BIOS 83070100 06/10/2024
[Tue Oct 1 16:45:37 2024] RIP: 0010:jbd2_journal_dirty_metadata+0x247/0x260 [jbd2]
[Tue Oct 1 16:45:37 2024] Code: 80 00 75 f4 e9 26 ff ff ff 41 bd 8b ff ff ff e9 32 fe ff ff 4c 8b 4e 70 4c 8d 73 02 4d 39 cc 0f 84 e1 fe ff ff e9 42 9c 00 00 <0f> 0b 41 bd e4 ff ff ff 4c 8d 73 02 e9 cb fe ff ff 0f 0b 66 0f 1f
[Tue Oct 1 16:45:37 2024] RSP: 0018:ff681bd7a0297850 EFLAGS: 00010246
[Tue Oct 1 16:45:37 2024] RAX: 0000000000000001 RBX: ff4e3735d2360208 RCX: 0000000000000000
[Tue Oct 1 16:45:37 2024] RDX: 0000000000000007 RSI: ff4e372c9dace000 RDI: ff4e3606eac291f8
[Tue Oct 1 16:45:37 2024] RBP: ff4e373889478ca8 R08: 0000000000000000 R09: ff4e3728c4792000
[Tue Oct 1 16:45:37 2024] R10: 0000000000000000 R11: 0000000000000000 R12: ff4e360608420400
[Tue Oct 1 16:45:37 2024] R13: 0000000000000000 R14: ffffffffc1acfc50 R15: 00000000000003f6
[Tue Oct 1 16:45:37 2024] FS: 0000000000000000(0000) GS:ff4e3843ba480000(0000) knlGS:0000000000000000
[Tue Oct 1 16:45:37 2024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Oct 1 16:45:37 2024] CR2: 00007fd8dd8c2484 CR3: 000001f9f0e10002 CR4: 0000000000771ee0
[Tue Oct 1 16:45:37 2024] PKRU: 55555554
[Tue Oct 1 16:45:37 2024] Call Trace:
[Tue Oct 1 16:45:37 2024] ? __warn+0x94/0xe0
[Tue Oct 1 16:45:37 2024] ? jbd2_journal_dirty_metadata+0x247/0x260 [jbd2]
[Tue Oct 1 16:45:37 2024] ? jbd2_journal_dirty_metadata+0x247/0x260 [jbd2]
[Tue Oct 1 16:45:37 2024] ? report_bug+0xb1/0xe0
[Tue Oct 1 16:45:37 2024] ? srso_alias_return_thunk+0x5/0xfcdfd
[Tue Oct 1 16:45:37 2024] ? do_error_trap+0x9e/0xd0
[Tue Oct 1 16:45:37 2024] ? do_invalid_op+0x36/0x40
[Tue Oct 1 16:45:37 2024] ? jbd2_journal_dirty_metadata+0x247/0x260 [jbd2]
[Tue Oct 1 16:45:37 2024] ? invalid_op+0x14/0x20
[Tue Oct 1 16:45:37 2024] ? jbd2_journal_dirty_metadata+0x247/0x260 [jbd2]
[Tue Oct 1 16:45:37 2024] __ldiskfs_handle_dirty_metadata+0x4f/0x190 [ldiskfs]
[Tue Oct 1 16:45:37 2024] ldiskfs_getblk+0x112/0x190 [ldiskfs]
[Tue Oct 1 16:45:37 2024] ldiskfs_bread+0x1f/0xc0 [ldiskfs]
[Tue Oct 1 16:45:37 2024] ? osd_ldiskfs_write_record+0x3e0/0x6c0 [osd_ldiskfs]
[Tue Oct 1 16:45:37 2024] osd_ldiskfs_write_record+0x515/0x6c0 [osd_ldiskfs]
[Tue Oct 1 16:45:37 2024] ? __irqentry_text_end+0x101463/0x101467
[Tue Oct 1 16:45:37 2024] osd_write+0x12e/0x670 [osd_ldiskfs]
[Tue Oct 1 16:45:37 2024] dt_record_write+0x32/0x110 [obdclass]
[Tue Oct 1 16:45:37 2024] llog_osd_put_cat_list+0x79d/0x930 [obdclass]
[Tue Oct 1 16:45:37 2024] osp_sync_llog_init+0x66f/0xb20 [osp]
[Tue Oct 1 16:45:37 2024] ? osp_sync_init+0x262/0x770 [osp]
[Tue Oct 1 16:45:37 2024] ? srso_alias_return_thunk+0x5/0xfcdfd
[Tue Oct 1 16:45:37 2024] osp_sync_init+0x262/0x770 [osp]
[Tue Oct 1 16:45:37 2024] ? osp_init_precreate+0x35/0x2b0 [osp]
[Tue Oct 1 16:45:37 2024] ? srso_alias_return_thunk+0x5/0xfcdfd
[Tue Oct 1 16:45:37 2024] osp_init0.isra.19+0x16ad/0x19f0 [osp]
[Tue Oct 1 16:45:37 2024] osp_device_alloc+0xcb/0x180 [osp]
[Tue Oct 1 16:45:37 2024] obd_setup+0x119/0x2e0 [obdclass]
[Tue Oct 1 16:45:37 2024] class_setup+0x587/0x790 [obdclass]
[Tue Oct 1 16:45:37 2024] class_process_config+0xfc8/0x2080 [obdclass]
[Tue Oct 1 16:45:37 2024] ? class_config_llog_handler+0x6b1/0x1250 [obdclass]
[Tue Oct 1 16:45:37 2024] ? srso_alias_return_thunk+0x5/0xfcdfd
[Tue Oct 1 16:45:37 2024] ? __kmalloc+0x15f/0x2d0
[Tue Oct 1 16:45:37 2024] ? srso_alias_return_thunk+0x5/0xfcdfd
[Tue Oct 1 16:45:37 2024] class_config_llog_handler+0x846/0x1250 [obdclass]
[Tue Oct 1 16:45:37 2024] llog_process_thread+0xf99/0x1a30 [obdclass]
[Tue Oct 1 16:45:37 2024] ? srso_alias_return_thunk+0x5/0xfcdfd
[Tue Oct 1 16:45:37 2024] ? lu_context_init+0xa5/0x1b0 [obdclass]
[Tue Oct 1 16:45:37 2024] ? llog_backup+0x540/0x540 [obdclass]
[Tue Oct 1 16:45:37 2024] llog_process_thread_daemonize+0x9b/0xe0 [obdclass]
[Tue Oct 1 16:45:37 2024] kthread+0x134/0x150
[Tue Oct 1 16:45:37 2024] ? set_kthread_struct+0x50/0x50
[Tue Oct 1 16:45:37 2024] ret_from_fork+0x1f/0x40
[Tue Oct 1 16:45:37 2024] ---[ end trace 712fcac813961656 ]---
[Tue Oct 1 16:45:37 2024] WARNING: CPU: 130 PID: 2124796 at /tmp/rpmbuild-lustre-root-jrXeJ2yL/BUILD/lustre-2.15.5_0/ldiskfs/ext4_jbd2.c:288 __ldiskfs_handle_dirty_metadata+0x106/0x190 [ldiskfs]
[Tue Oct 1 16:45:37 2024] Modules linked in: mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) mbcache jbd2 lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) binfmt_misc 8021q garp mrp stp llc macvlan dm_queue_length cuse fuse vfat fat intel_rapl_msr intel_rapl_common amd64_edac_mod edac_mce_amd kvm_amd ast i2c_algo_bit kvm mlx5_ib drm_shmem_helper drm_kms_helper irqbypass ib_uverbs syscopyarea sysfillrect sysimgblt rapl sp5100_tco drm ib_core pcspkr acpi_cpufreq k10temp i2c_piix4 xfs libcrc32c sd_mod sg nvme_tcp(X) nvme_fabrics nvme nvme_core t10_pi crct10dif_pclmul crc32_pclmul crc32c_intel mlx5_core ghash_clmulni_intel ahci libahci libata mlxfw psample ccp pci_hyperv_intf dm_multipath sunrpc dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls libcxgbi libcxgb qla4xxx
[Tue Oct 1 16:45:37 2024] iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi [last unloaded: libcfs]
[Tue Oct 1 16:45:37 2024] CPU: 130 PID: 2124796 Comm: llog_process_th Kdump: loaded Tainted: G W OE X -------- - - 4.18.0-553.5.1.el8_10_lustre_20240808.x86_64 #1
[Tue Oct 1 16:45:37 2024] Hardware name: SERVER E5-2c/Asm,MB+Tray,E5-2c, BIOS 83070100 06/10/2024
[Tue Oct 1 16:45:37 2024] RIP: 0010:__ldiskfs_handle_dirty_metadata+0x106/0x190 [ldiskfs]
[Tue Oct 1 16:45:37 2024] Code: 80 81 ff ff eb 93 f0 80 4b 01 80 e9 4f ff ff ff f0 80 4b 01 40 e9 39 ff ff ff 48 89 df 45 31 e4 e8 ff b4 0f e8 e9 6f ff ff ff <0f> 0b 48 c7 c2 80 1c ad c1 45 89 e0 48 89 e9 44 89 fe 4c 89 f7 e8
[Tue Oct 1 16:45:37 2024] RSP: 0018:ff681bd7a0297880 EFLAGS: 00010286
[Tue Oct 1 16:45:37 2024] RAX: ff4e372c9dace000 RBX: ff4e3735d2360208 RCX: 0000000000000000
[Tue Oct 1 16:45:37 2024] RDX: 0000000000000007 RSI: ff4e372c9dace000 RDI: ff4e3606eac291f8
[Tue Oct 1 16:45:37 2024] RBP: ff4e3606eac291f8 R08: 0000000000000000 R09: ff4e3728c4792000
[Tue Oct 1 16:45:37 2024] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4
[Tue Oct 1 16:45:37 2024] R13: ff4e372df68dab78 R14: ffffffffc1acfc50 R15: 00000000000003f6
[Tue Oct 1 16:45:37 2024] FS: 0000000000000000(0000) GS:ff4e3843ba480000(0000) knlGS:0000000000000000
[Tue Oct 1 16:45:37 2024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Oct 1 16:45:37 2024] CR2: 00007fd8dd8c2484 CR3: 000001f9f0e10002 CR4: 0000000000771ee0
[Tue Oct 1 16:45:37 2024] PKRU: 55555554
[Tue Oct 1 16:45:37 2024] Call Trace:
[Tue Oct 1 16:45:37 2024] ? __warn+0x94/0xe0
[Tue Oct 1 16:45:37 2024] ? __ldiskfs_handle_dirty_metadata+0x106/0x190 [ldiskfs]
[Tue Oct 1 16:45:37 2024] ? __ldiskfs_handle_dirty_metadata+0x106/0x190 [ldiskfs]
[Tue Oct 1 16:45:37 2024] ? report_bug+0xb1/0xe0
[Tue Oct 1 16:45:37 2024] ? do_error_trap+0x9e/0xd0
[Tue Oct 1 16:45:37 2024] ? do_invalid_op+0x36/0x40
[Tue Oct 1 16:45:37 2024] ? __ldiskfs_handle_dirty_metadata+0x106/0x190 [ldiskfs]
[Tue Oct 1 16:45:37 2024] ? invalid_op+0x14/0x20
[Tue Oct 1 16:45:37 2024] ? __ldiskfs_handle_dirty_metadata+0x106/0x190 [ldiskfs]
[Tue Oct 1 16:45:37 2024] ldiskfs_getblk+0x112/0x190 [ldiskfs]
[Tue Oct 1 16:45:37 2024] ldiskfs_bread+0x1f/0xc0 [ldiskfs]
[Tue Oct 1 16:45:37 2024] ? osd_ldiskfs_write_record+0x3e0/0x6c0 [osd_ldiskfs]
[Tue Oct 1 16:45:37 2024] osd_ldiskfs_write_record+0x515/0x6c0 [osd_ldiskfs]
[Tue Oct 1 16:45:37 2024] ? __irqentry_text_end+0x101463/0x101467
[Tue Oct 1 16:45:37 2024] osd_write+0x12e/0x670 [osd_ldiskfs]
[Tue Oct 1 16:45:37 2024] dt_record_write+0x32/0x110 [obdclass]
[Tue Oct 1 16:45:37 2024] llog_osd_put_cat_list+0x79d/0x930 [obdclass]
[Tue Oct 1 16:45:37 2024] osp_sync_llog_init+0x66f/0xb20 [osp]
[Tue Oct 1 16:45:37 2024] ? osp_sync_init+0x262/0x770 [osp]
[Tue Oct 1 16:45:37 2024] ? srso_alias_return_thunk+0x5/0xfcdfd
[Tue Oct 1 16:45:37 2024] osp_sync_init+0x262/0x770 [osp]
[Tue Oct 1 16:45:37 2024] ? osp_init_precreate+0x35/0x2b0 [osp]
[Tue Oct 1 16:45:37 2024] ? srso_alias_return_thunk+0x5/0xfcdfd
[Tue Oct 1 16:45:37 2024] osp_init0.isra.19+0x16ad/0x19f0 [osp]
[Tue Oct 1 16:45:37 2024] osp_device_alloc+0xcb/0x180 [osp]
[Tue Oct 1 16:45:37 2024] obd_setup+0x119/0x2e0 [obdclass]
[Tue Oct 1 16:45:37 2024] class_setup+0x587/0x790 [obdclass]
[Tue Oct 1 16:45:37 2024] class_process_config+0xfc8/0x2080 [obdclass]
[Tue Oct 1 16:45:37 2024] ? class_config_llog_handler+0x6b1/0x1250 [obdclass]
[Tue Oct 1 16:45:37 2024] ? srso_alias_return_thunk+0x5/0xfcdfd
[Tue Oct 1 16:45:37 2024] ? __kmalloc+0x15f/0x2d0
[Tue Oct 1 16:45:37 2024] ? srso_alias_return_thunk+0x5/0xfcdfd
[Tue Oct 1 16:45:37 2024] class_config_llog_handler+0x846/0x1250 [obdclass]
[Tue Oct 1 16:45:37 2024] llog_process_thread+0xf99/0x1a30 [obdclass]
[Tue Oct 1 16:45:37 2024] ? srso_alias_return_thunk+0x5/0xfcdfd
[Tue Oct 1 16:45:37 2024] ? lu_context_init+0xa5/0x1b0 [obdclass]
[Tue Oct 1 16:45:37 2024] ? llog_backup+0x540/0x540 [obdclass]
[Tue Oct 1 16:45:37 2024] llog_process_thread_daemonize+0x9b/0xe0 [obdclass]
[Tue Oct 1 16:45:37 2024] kthread+0x134/0x150
[Tue Oct 1 16:45:37 2024] ? set_kthread_struct+0x50/0x50
[Tue Oct 1 16:45:37 2024] ret_from_fork+0x1f/0x40
[Tue Oct 1 16:45:37 2024] ---[ end trace 712fcac813961657 ]---
[Tue Oct 1 16:45:37 2024] LDISKFS-fs: ldiskfs_getblk:1014: aborting transaction: error 28 in __ldiskfs_handle_dirty_metadata
[Tue Oct 1 16:45:37 2024] LDISKFS-fs error (device dm-2): ldiskfs_getblk:1014: inode #95: block 5895972: comm llog_process_th: journal_dirty_metadata failed: handle type 0 started at line 1990, credits 5/0, errcode -28
[Tue Oct 1 16:45:37 2024] Aborting journal on device dm-2-8.
[Tue Oct 1 16:45:38 2024] LDISKFS-fs (dm-2): Remounting filesystem read-only
[Tue Oct 1 16:45:38 2024] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:61: Detected aborted journal
[Tue Oct 1 16:45:38 2024] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:61: Detected aborted journal
[Tue Oct 1 16:45:38 2024] LDISKFS-fs (dm-2): Remounting filesystem read-only
[Tue Oct 1 16:45:38 2024] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:61: Detected aborted journal
[Tue Oct 1 16:45:38 2024] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:61: Detected aborted journal
[Tue Oct 1 16:45:38 2024] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:61: Detected aborted journal
[Tue Oct 1 16:45:38 2024] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:61: Detected aborted journal
[Tue Oct 1 16:45:38 2024] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:61: Detected aborted journal
[Tue Oct 1 16:45:38 2024] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:61: Detected aborted journal
[Tue Oct 1 16:45:38 2024] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:61: Detected aborted journal
[Tue Oct 1 16:45:38 2024] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:61: Detected aborted journal
[Tue Oct 1 16:45:38 2024] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:61: Detected aborted journal
[Tue Oct 1 16:45:38 2024] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:61: Detected aborted journal
[Tue Oct 1 16:45:38 2024] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:61: Detected aborted journal
[Tue Oct 1 16:45:38 2024] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:61: Detected aborted journal
[Tue Oct 1 16:45:38 2024] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:61: Detected aborted journal
[Tue Oct 1 16:45:38 2024] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:61: Detected aborted journal
[Tue Oct 1 16:45:38 2024] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:61: Detected aborted journal
[Tue Oct 1 16:45:38 2024] LustreError: 2124796:0:(osd_io.c:2138:osd_ldiskfs_write_record()) fslustre-MDT0000: error reading offset 197728 (block 48, size 32, offs 197728), credits 5/1: rc = -28
After that, failover of the file system also failed, and the MDT could not be mounted on the MDS server.
[Tue Oct 1 16:50:10 2024] LDISKFS-fs warning (device dm-2): ldiskfs_clear_journal_err:5253: Filesystem error recorded from previous mount: IO failure
[Tue Oct 1 16:50:10 2024] LDISKFS-fs warning (device dm-2): ldiskfs_clear_journal_err:5254: Marking fs in need of filesystem check.
[Tue Oct 1 16:50:10 2024] LDISKFS-fs (dm-2): warning: mounting fs with errors, running e2fsck is recommended
[Tue Oct 1 16:50:11 2024] LDISKFS-fs (dm-2): recovery complete
[Tue Oct 1 16:50:11 2024] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[Tue Oct 1 16:50:14 2024] LustreError: 2126348:0:(genops.c:522:class_register_device()) fslustre-OST1823-osc-MDT0000: already exists, won't add
[Tue Oct 1 16:50:14 2024] LustreError: 2126348:0:(obd_config.c:1999:class_config_llog_handler()) MGC10.10.208.6@tcp1: cfg command failed: rc = -17
[Tue Oct 1 16:50:14 2024] Lustre: cmd=cf001 0:fslustre-OST1823-osc-MDT0000 1:osp 2:fslustre-MDT0000-mdtlov_UUID
[Tue Oct 1 16:50:14 2024] LustreError: 15c-8: MGC10.10.208.6@tcp1: Confguration from log fslustre-MDT0000 failed from MGS -17. Communication error between node & MGS, a bad configuration, or other errors. See syslog for more info
[Tue Oct 1 16:50:14 2024] LustreError: 2126089:0:(obd_mount_server.c:1425:server_start_targets()) failed to start server fslustre-MDT0000: -17
[Tue Oct 1 16:50:15 2024] LustreError: 2126089:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -17
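The numbers in the two logs above can be cross-checked with a little arithmetic. The sketch below is only an illustration and rests on assumptions not stated in the report: a 4096-byte ldiskfs block size, fixed 32-byte records (taken from "size 32" in the message) indexed by OST index, and the usual Lustre convention of naming targets with a four-digit hex index. Under those assumptions, the failing offset 197728 maps to block 48 and to the same OST that later reports "already exists".

```python
import errno

BLOCK_SIZE = 4096   # assumption: ldiskfs block size on the MDT
ENTRY_SIZE = 32     # "size 32" in the osd_ldiskfs_write_record() message
offset = 197728     # "offs 197728" in the same message

block = offset // BLOCK_SIZE    # logical block that holds the record
slot = offset // ENTRY_SIZE     # record index, assuming fixed-size entries
ost_name = f"OST{slot:04x}"     # assumption: slot index equals the OST index

print(block, slot, ost_name)

# Symbolic names for the two return codes seen in the logs (rc = -28, rc = -17).
print(errno.errorcode[28], errno.errorcode[17])
```

If the assumptions hold, the write that failed during the crash and the `fslustre-OST1823-osc-MDT0000` device that cannot be re-registered on the next mount refer to the same target.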
The e2fsck run did not show any block corruption:
e2fsck 1.47.1-wc1 (28-May-2024)
MMP interval is 5 seconds and total wait time is 22 seconds. Please wait...
fslustre-MDT0000 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
151231300 inodes used (8.80%, out of 1717993472)
7 non-contiguous files (0.0%)
44972 non-contiguous directories (0.0%)
# of inodes with ind/dind/tind blocks: 17456/419/0
460752157 blocks used (42.91%, out of 1073741824)
0 bad blocks
2 large files
125051258 regular files
26156689 directories
0 character device files
0 block device files
0 fifos
0 links
23343 symbolic links (3500 fast symbolic links)
0 sockets
------------
151231290 files
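As a quick sanity check on the e2fsck figures above, the usage percentages can be recomputed from the raw counts. This is a minimal sketch using only the numbers printed by e2fsck. It suggests, but does not prove, that the errcode -28 in the crash log did not come from the device running out of blocks or inodes; the "credits 5/0" in the same message may instead point at the journal handle.

```python
# Counts copied from the e2fsck summary.
inodes_used, inodes_total = 151231300, 1717993472
blocks_used, blocks_total = 460752157, 1073741824

inode_pct = 100 * inodes_used / inodes_total
block_pct = 100 * blocks_used / blocks_total

print(f"inodes used: {inode_pct:.2f}%")
print(f"blocks used: {block_pct:.2f}%")
```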
1. The volume on which the MDT resides is healthy. This was verified by the storage provider. There were no I/O errors or other errors before or after the problem.
2. To mitigate the issue, we had to run e2fsck to clear the filesystem check flag and then regenerate the config logs per the following guide: https://doc.lustre.org/lustre_manual.xhtml#lustremaint.regenerateConfigLogs.
3. We attempted to reproduce the issue by adding up to 3000 OSTs in a test lab running the same version of Lustre, but were unable to reproduce it.
The MDT config llog dump is attached below:
Thanks for your time!
Attachments
Issue Links
- is related to: LU-18763 LBUG on multiple MDS while adding OSTs - journal_dirty_metadata failed: handle type 0 started at line 1994, credits 5/0, errcode -28 (Resolved)