Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.3.0, Lustre 2.4.0
-
None
-
SWL - Hyperion/LLNL
-
3
-
4217
Description
Running SWL test, after ~6 hours seeing ldiskfs failure and EROFS on the MDT.
Sep 27 20:13:20 hyperion-rst6 kernel: Lustre: 4918:0:(mdt_recovery.c:622:mdt_steal_ack_locks()) Stealing 1 locks from rs ffff880108e55000 x1414295525519816.t4537566185 o0 NID 192.168.116.112@o2ib1 Sep 27 20:13:20 hyperion-rst6 kernel: Lustre: 4233:0:(service.c:2105:ptlrpc_handle_rs()) All locks stolen from rs ffff880108e55000 x1414295525519816.t4537566185 o0 NID 192.168.116.112@o2ib1 Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry: Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #127928380: rec_len is smaller than minimal - block=64030811offset=536(536), inode=127946027, rec_len=0, name_len=4 Sep 27 20:16:39 hyperion-rst6 kernel: Aborting journal on device md1-8. Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30 Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30 Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30 Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30 Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30 Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30 Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs (md1): bad entry in directory #127928380: rec_len is smaller than minimal - block=64030811offset=536(536), inode=127946027, rec_len=0, name_len=4 Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30 Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30 Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30 Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30 Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30 Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30 Sep 27 20:16:39 hyperion-rst6 kernel: Remounting filesystem read-only Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted
Ran fsck on device, recovered
LDISKFS-fs (md1): mounted filesystem with ordered data mode. quota=off. Opts: LDISKFS-fs (md1): mounted filesystem with ordered data mode. quota=off. Opts: Lustre: MGC192.168.127.6@o2ib: Reactivating import Lustre: lustre-MDT0000: used disk, loading Lustre: 5311:0:(ldlm_lib.c:2139:target_recovery_init()) RECOVERY: service lustre-MDT0000, 91 recoverable clients, last_transno 4541171590 Lustre: lustre-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450 Lustre: 4326:0:(mgc_request.c:1534:mgc_process_recover_log()) Process recover log lustre-mdtir error -22 Lustre: 4326:0:(mgc_request.c:1534:mgc_process_recover_log()) Skipped 2 previous similar messages Lustre: lustre-MDT0000: Will be in recovery for at least 2:30, or until 91 clients reconnect LustreError: 5315:0:(ldlm_lib.c:1717:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4541171591, ql: 62, comp: 29, conn: 91, next: 4541177085, last_committed: 4541171590) LustreError: 5315:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids LustreError: 5315:0:(ldlm_lib.c:1717:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4541177086, ql: 62, comp: 29, conn: 91, next: 4541181324, last_committed: 4541171590) Lustre: lustre-MDT0000: disconnecting 1 stale clients Lustre: lustre-MDT0000: Recovery over after 1:24, of 91 clients 90 recovered and 1 was evicted. Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0028_UUID now active, resetting orphans
Will continue to run, see if we get repeat.
Attachments
Issue Links
- is duplicated by
-
LU-2080 SWL -ldiskfs_add_entry: bad entry in directory #: rec_len is smaller than minimal
-
- Resolved
-
Crashed.
Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs Dumping htree_lock ffff88011e8db200, mode 3, head ffff880139a39180 Oct 5 18:27:58 hyperion-rst6 kernel: granted mode: 3:1 Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs error (device md1): add_dirent(de1): dir(134759485, dx: 1, nblks: 33) de(mdtest.16.41 : 12), off 2376 rlen 40 nlen 40 Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_dx_find_entry: bad entry in directory #134759485: rec_len is smaller than minimal - block=67436580offset=2416(59760), inode=134767633, rec_len=0, name_len=5 Oct 5 18:27:58 hyperion-rst6 kernel: Aborting journal on device md1-8. Oct 5 18:27:58 hyperion-rst6 kernel: LustreError: 4331:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30 Oct 5 18:27:58 hyperion-rst6 kernel: LustreError: 4331:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30 Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in osd_trans_stop: Journal has aborted Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only Oct 5 18:27:58 hyperion-rst6 kernel: LustreError: 4331:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30 Oct 5 18:27:58 hyperion-rst6 kernel: Oct 5 18:27:58 hyperion-rst6 kernel: ------------[ cut here ]------------ Oct 5 18:27:58 hyperion-rst6 kernel: kernel BUG at /var/lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/namei.c:1757! Oct 5 18:27:58 hyperion-rst6 kernel: invalid opcode: 0000 [#1] SMP Oct 5 18:27:58 hyperion-rst6 kernel: last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map Oct 5 18:27:58 hyperion-rst6 kernel: CPU 5 Oct 5 18:27:59 hyperion-rst6 kernel: Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ldiskfs(U) mbcache jbd2 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan] Oct 5 18:27:59 hyperion-rst6 kernel: Oct 5 18:27:59 hyperion-rst6 kernel: Pid: 6331, comm: mdt03_016 Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.gb16fe80.x86_64 #1 Dell Inc. PowerEdge R610/0K399H Oct 5 18:27:59 hyperion-rst6 kernel: RIP: 0010:[<ffffffffa0726502>] [<ffffffffa0726502>] add_dirent_to_buf+0x552/0x6a0 [ldiskfs] Oct 5 18:27:59 hyperion-rst6 kernel: RSP: 0018:ffff880159a17480 EFLAGS: 00010246 Oct 5 18:27:59 hyperion-rst6 kernel: RAX: ffff8801eea83800 RBX: ffff880313470948 RCX: ffff880313470970 Oct 5 18:27:59 hyperion-rst6 kernel: RDX: ffff88031833f400 RSI: 0000000000000046 RDI: ffff8801eea83c00 Oct 5 18:27:59 hyperion-rst6 kernel: RBP: ffff880159a17550 R08: 0000000000000000 R09: 0000000000000000 Oct 5 18:27:59 hyperion-rst6 kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff880146a44250 Oct 5 18:27:59 hyperion-rst6 kernel: R13: 0000000000000948 R14: 000000000000000c R15: 0000000000000028 Oct 5 18:27:59 hyperion-rst6 kernel: FS: 00002aaaab47e700(0000) GS:ffff880028240000(0000) knlGS:0000000000000000 Oct 5 18:27:59 hyperion-rst6 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Oct 5 18:27:59 hyperion-rst6 kernel: CR2: 00002aaaab34230c CR3: 0000000001a85000 CR4: 00000000000006e0 Oct 5 18:27:59 hyperion-rst6 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Oct 5 18:27:59 hyperion-rst6 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Oct 5 18:27:59 hyperion-rst6 kernel: Stack: Oct 5 18:27:59 hyperion-rst6 kernel: ffffc9002ae7e1f0 000000000000000c ffff880100000948 0000000000000028 Oct 5 18:27:59 hyperion-rst6 kernel: <d> 0000000000000028 00000000ecc23e5e ffff8801abec6080 ffffffff080863e4 Oct 5 18:27:59 hyperion-rst6 kernel: <d> 0000000000021000 ffffffffa074720c ffff88011e8db200 ffff880113f7cc18 Oct 5 18:27:59 hyperion-rst6 kernel: Call Trace: Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0728999>] ldiskfs_dx_add_entry+0x169/0x7e0 [ldiskfs] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa07433f1>] ? ldiskfs_xattr_set_handle+0x151/0x560 [ldiskfs] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa072929d>] ldiskfs_add_entry+0x28d/0x500 [ldiskfs] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0f13689>] __osd_ea_add_rec+0xb9/0x190 [osd_ldiskfs] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0f1fc1b>] osd_index_ea_insert+0x21b/0x5e0 [osd_ldiskfs] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e21d2d>] ? __mdd_xattr_set+0x25d/0x380 [mdd] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e3a767>] __mdd_index_insert_only+0x147/0x150 [mdd] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e3b901>] __mdd_index_insert+0x51/0x1f0 [mdd] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e3bdf9>] ? mdd_object_capa+0x29/0x1b0 [mdd] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e416eb>] mdd_create+0x1a0b/0x2180 [mdd] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0f19f9f>] ? osd_xattr_get+0x9f/0x360 [osd_ldiskfs] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0f6d637>] cml_create+0x97/0x250 [cmm] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0eaaddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0ebeb9f>] mdt_reint_open+0x108f/0x18a0 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e4728e>] ? md_ucred+0x1e/0x60 [mdd] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e8c235>] ? mdt_ucred+0x15/0x20 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0ea8151>] mdt_reint_rec+0x41/0xe0 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0ea19aa>] mdt_reint_internal+0x50a/0x810 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0ea1f7d>] mdt_intent_reint+0x1ed/0x500 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e9e191>] mdt_intent_policy+0x371/0x6a0 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa090d881>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa09359bf>] ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e9e506>] mdt_enqueue+0x46/0x130 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e95802>] mdt_handle_common+0x922/0x1740 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e966f5>] mdt_regular_handle+0x15/0x20 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0965b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa03a514f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa095cf37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffff810533f3>] ? __wake_up+0x53/0x70 Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0967111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0966520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20 Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0966520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0966520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20 Oct 5 18:27:59 hyperion-rst6 kernel: Code: 54 24 18 48 8b 55 b8 48 89 c1 31 c0 48 89 14 24 48 c7 c2 38 c3 74 a0 e8 0d 55 01 00 48 8b 4d a8 66 83 79 04 00 0f 85 13 fd ff ff <0f> 0b eb fe 48 8b 7d 80 e8 d1 e7 01 00 49 8b bc 24 08 01 00 00 Oct 5 18:27:59 hyperion-rst6 kernel: RIP [<ffffffffa0726502>] add_dirent_to_buf+0x552/0x6a0 [ldiskfs] Oct 5 18:27:59 hyperion-rst6 kernel: RSP <ffff880159a17480>
I don't think we dropped a vmcore on this one, still looking