Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2041

SWL ldiskfs_add_entry: bad entry in directory #127928380: rec_len is smaller than minimal

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.3.0, Lustre 2.4.0
    • Lustre 2.3.0, Lustre 2.4.0
    • None
    • SWL - Hyperion/LLNL
    • 3
    • 4217

    Description

      Running SWL test, after ~6 hours seeing ldiskfs failure and EROFS on the MDT.

       
      Sep 27 20:13:20 hyperion-rst6 kernel: Lustre: 4918:0:(mdt_recovery.c:622:mdt_steal_ack_locks()) Stealing 1 locks from rs ffff880108e55000 x1414295525519816.t4537566185 o0 NID 192.168.116.112@o2ib1
      Sep 27 20:13:20 hyperion-rst6 kernel: Lustre: 4233:0:(service.c:2105:ptlrpc_handle_rs()) All locks stolen from rs ffff880108e55000 x1414295525519816.t4537566185 o0 NID 192.168.116.112@o2ib1
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry:
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #127928380: rec_len is smaller than minimal - block=64030811offset=536(536), inode=127946027, rec_len=0, name_len=4
      Sep 27 20:16:39 hyperion-rst6 kernel: Aborting journal on device md1-8.
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs (md1): bad entry in directory #127928380: rec_len is smaller than minimal - block=64030811offset=536(536), inode=127946027, rec_len=0, name_len=4
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: Remounting filesystem read-only
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted
      

      Ran fsck on device, recovered

      LDISKFS-fs (md1): mounted filesystem with ordered data mode. quota=off. Opts:
      LDISKFS-fs (md1): mounted filesystem with ordered data mode. quota=off. Opts:
      Lustre: MGC192.168.127.6@o2ib: Reactivating import
      Lustre: lustre-MDT0000: used disk, loading
      Lustre: 5311:0:(ldlm_lib.c:2139:target_recovery_init()) RECOVERY: service lustre-MDT0000, 91 recoverable clients, last_transno 4541171590
      Lustre: lustre-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
      Lustre: 4326:0:(mgc_request.c:1534:mgc_process_recover_log()) Process recover log lustre-mdtir error -22
      Lustre: 4326:0:(mgc_request.c:1534:mgc_process_recover_log()) Skipped 2 previous similar messages
      Lustre: lustre-MDT0000: Will be in recovery for at least 2:30, or until 91 clients reconnect
      LustreError: 5315:0:(ldlm_lib.c:1717:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4541171591, ql: 62, comp: 29, conn: 91, next: 4541177085, last_committed: 4541171590)
      LustreError: 5315:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids
      LustreError: 5315:0:(ldlm_lib.c:1717:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4541177086, ql: 62, comp: 29, conn: 91, next: 4541181324, last_committed: 4541171590)
      Lustre: lustre-MDT0000: disconnecting 1 stale clients
      Lustre: lustre-MDT0000: Recovery over after 1:24, of 91 clients 90 recovered and 1 was evicted.
      Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0028_UUID now active, resetting orphans
      
      

      Will continue to run, see if we get repeat.

      Attachments

        Issue Links

          Activity

            [LU-2041] SWL ldiskfs_add_entry: bad entry in directory #127928380: rec_len is smaller than minimal

            Please let me know if this needs to be reopened for more work.

            jlevi Jodi Levi (Inactive) added a comment - Please let me know if this needs to be reopened for more work.
            jlevi Jodi Levi (Inactive) added a comment - Patch for master here: http://review.whamcloud.com/#change,4208
            pjones Peter Jones added a comment -

            Dropping priority because landed for 2.3

            pjones Peter Jones added a comment - Dropping priority because landed for 2.3

            The latest build is here: http://build.whamcloud.com/job/lustre-reviews/9758

            I merged fanyong's fix into this build.

            liang Liang Zhen (Inactive) added a comment - The latest build is here: http://build.whamcloud.com/job/lustre-reviews/9758 I merged fanyong's fix into this build.

            This patch fixes some potential OI mapping cache inconsistency issues:

            http://review.whamcloud.com/#change,4207

            yong.fan nasf (Inactive) added a comment - This patch fixes some potential OI mapping cache inconsistency issues: http://review.whamcloud.com/#change,4207

            latest build is here: http://build.whamcloud.com/job/lustre-reviews/9744/
            major change in this build is disabled OI cache

            liang Liang Zhen (Inactive) added a comment - latest build is here: http://build.whamcloud.com/job/lustre-reviews/9744/ major change in this build is disabled OI cache
            liang Liang Zhen (Inactive) added a comment - - edited

            this time we crashed in mdtest, and the directory is a directory with indices, so it's more like a PDO issue, however:

            • we can see in the log (granted mode: 3:1), which means the directory only granted one thread to enter the directory, but there are two threads hit different errors in different places, under the same directory (with same ino: 134759485)
            • there is a entry has name_len=5, but mdtest will never create file with name_len=5
              Oct  5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_dx_find_entry: bad entry in directory #134759485: rec_len is smaller than minimal - block=67436580offset=2416(59760), inode=134767633, rec_len=0, name_len=5
              

            Based on previous two things, it could be either:
            a) PDO lock totally screwed up locking;

            or

            b) two OSD objects with different FIDs point to a same inode (directory) for some unknown reason.

            I think the only choice now is disable several things one by one:

            • OI cache
            • OI shrinker
            • PDO

            So I have to create different builds to try...

            liang Liang Zhen (Inactive) added a comment - - edited this time we crashed in mdtest, and the directory is a directory with indices, so it's more like a PDO issue, however: we can see in the log (granted mode: 3:1), which means the directory only granted one thread to enter the directory, but there are two threads hit different errors in different places, under the same directory (with same ino: 134759485) there is a entry has name_len=5, but mdtest will never create file with name_len=5 Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_dx_find_entry: bad entry in directory #134759485: rec_len is smaller than minimal - block=67436580offset=2416(59760), inode=134767633, rec_len=0, name_len=5 Based on previous two things, it could be either: a) PDO lock totally screwed up locking; or b) two OSD objects with different FIDs point to a same inode (directory) for some unknown reason. I think the only choice now is disable several things one by one: OI cache OI shrinker PDO So I have to create different builds to try...

            Crashed.

            Oct  5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs Dumping htree_lock ffff88011e8db200, mode 3, head ffff880139a39180
            Oct  5 18:27:58 hyperion-rst6 kernel:   granted mode: 3:1
            Oct  5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs error (device md1): add_dirent(de1): dir(134759485, dx: 1, nblks: 33) de(mdtest.16.41 : 12), off 2376 rlen 40 nlen 40
            Oct  5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_dx_find_entry: bad entry in directory #134759485: rec_len is smaller than minimal - block=67436580offset=2416(59760), inode=134767633, rec_len=0, name_len=5
            Oct  5 18:27:58 hyperion-rst6 kernel: Aborting journal on device md1-8.
            Oct  5 18:27:58 hyperion-rst6 kernel: LustreError: 4331:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
            Oct  5 18:27:58 hyperion-rst6 kernel: LustreError: 4331:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30
            Oct  5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in osd_trans_stop: Journal has aborted
            Oct  5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
            Oct  5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
            Oct  5 18:27:58 hyperion-rst6 kernel: LustreError: 4331:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30
            Oct  5 18:27:58 hyperion-rst6 kernel:
            Oct  5 18:27:58 hyperion-rst6 kernel: ------------[ cut here ]------------
            Oct  5 18:27:58 hyperion-rst6 kernel: kernel BUG at /var/lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/namei.c:1757!
            Oct  5 18:27:58 hyperion-rst6 kernel: invalid opcode: 0000 [#1] SMP
            Oct  5 18:27:58 hyperion-rst6 kernel: last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
            Oct  5 18:27:58 hyperion-rst6 kernel: CPU 5  
            Oct  5 18:27:59 hyperion-rst6 kernel: Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ldiskfs(U) mbcache jbd2 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan]
            Oct  5 18:27:59 hyperion-rst6 kernel:
            Oct  5 18:27:59 hyperion-rst6 kernel: Pid: 6331, comm: mdt03_016 Tainted: P           ---------------    2.6.32-279.5.1.el6_lustre.gb16fe80.x86_64 #1 Dell Inc. PowerEdge R610/0K399H
            Oct  5 18:27:59 hyperion-rst6 kernel: RIP: 0010:[<ffffffffa0726502>]  [<ffffffffa0726502>] add_dirent_to_buf+0x552/0x6a0 [ldiskfs]
            Oct  5 18:27:59 hyperion-rst6 kernel: RSP: 0018:ffff880159a17480  EFLAGS: 00010246
            Oct  5 18:27:59 hyperion-rst6 kernel: RAX: ffff8801eea83800 RBX: ffff880313470948 RCX: ffff880313470970
            Oct  5 18:27:59 hyperion-rst6 kernel: RDX: ffff88031833f400 RSI: 0000000000000046 RDI: ffff8801eea83c00
            Oct  5 18:27:59 hyperion-rst6 kernel: RBP: ffff880159a17550 R08: 0000000000000000 R09: 0000000000000000
            Oct  5 18:27:59 hyperion-rst6 kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff880146a44250
            Oct  5 18:27:59 hyperion-rst6 kernel: R13: 0000000000000948 R14: 000000000000000c R15: 0000000000000028
            Oct  5 18:27:59 hyperion-rst6 kernel: FS:  00002aaaab47e700(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
            Oct  5 18:27:59 hyperion-rst6 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
            Oct  5 18:27:59 hyperion-rst6 kernel: CR2: 00002aaaab34230c CR3: 0000000001a85000 CR4: 00000000000006e0
            Oct  5 18:27:59 hyperion-rst6 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            Oct  5 18:27:59 hyperion-rst6 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            
            Oct  5 18:27:59 hyperion-rst6 kernel: Stack:
            Oct  5 18:27:59 hyperion-rst6 kernel: ffffc9002ae7e1f0 000000000000000c ffff880100000948 0000000000000028
            Oct  5 18:27:59 hyperion-rst6 kernel: <d> 0000000000000028 00000000ecc23e5e ffff8801abec6080 ffffffff080863e4
            Oct  5 18:27:59 hyperion-rst6 kernel: <d> 0000000000021000 ffffffffa074720c ffff88011e8db200 ffff880113f7cc18
            Oct  5 18:27:59 hyperion-rst6 kernel: Call Trace:
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0728999>] ldiskfs_dx_add_entry+0x169/0x7e0 [ldiskfs]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa07433f1>] ? ldiskfs_xattr_set_handle+0x151/0x560 [ldiskfs]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa072929d>] ldiskfs_add_entry+0x28d/0x500 [ldiskfs]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0f13689>] __osd_ea_add_rec+0xb9/0x190 [osd_ldiskfs]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0f1fc1b>] osd_index_ea_insert+0x21b/0x5e0 [osd_ldiskfs]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e21d2d>] ? __mdd_xattr_set+0x25d/0x380 [mdd]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e3a767>] __mdd_index_insert_only+0x147/0x150 [mdd]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e3b901>] __mdd_index_insert+0x51/0x1f0 [mdd]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e3bdf9>] ? mdd_object_capa+0x29/0x1b0 [mdd]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e416eb>] mdd_create+0x1a0b/0x2180 [mdd]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0f19f9f>] ? osd_xattr_get+0x9f/0x360 [osd_ldiskfs]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0f6d637>] cml_create+0x97/0x250 [cmm]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0eaaddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0ebeb9f>] mdt_reint_open+0x108f/0x18a0 [mdt]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e4728e>] ? md_ucred+0x1e/0x60 [mdd]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e8c235>] ? mdt_ucred+0x15/0x20 [mdt]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0ea8151>] mdt_reint_rec+0x41/0xe0 [mdt]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0ea19aa>] mdt_reint_internal+0x50a/0x810 [mdt]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0ea1f7d>] mdt_intent_reint+0x1ed/0x500 [mdt]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e9e191>] mdt_intent_policy+0x371/0x6a0 [mdt]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa090d881>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa09359bf>] ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e9e506>] mdt_enqueue+0x46/0x130 [mdt]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e95802>] mdt_handle_common+0x922/0x1740 [mdt]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e966f5>] mdt_regular_handle+0x15/0x20 [mdt]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0965b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa03a514f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa095cf37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffff810533f3>] ? __wake_up+0x53/0x70
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0967111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0966520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0966520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0966520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
            Oct  5 18:27:59 hyperion-rst6 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
            Oct  5 18:27:59 hyperion-rst6 kernel: Code: 54 24 18 48 8b 55 b8 48 89 c1 31 c0 48 89 14 24 48 c7 c2 38 c3 74 a0 e8 0d 55 01 00 48 8b 4d a8 66 83 79 04 00 0f 85 13 fd ff ff <0f> 0b eb fe 48 8b 7d 80 e8 d1 e7 01 00 49 8b bc 24 08 01 00 00
            Oct  5 18:27:59 hyperion-rst6 kernel: RIP  [<ffffffffa0726502>] add_dirent_to_buf+0x552/0x6a0 [ldiskfs]
            Oct  5 18:27:59 hyperion-rst6 kernel: RSP <ffff880159a17480>
            

            I don't think we dropped a vmcore on this one, still looking

            cliffw Cliff White (Inactive) added a comment - Crashed. Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs Dumping htree_lock ffff88011e8db200, mode 3, head ffff880139a39180 Oct 5 18:27:58 hyperion-rst6 kernel: granted mode: 3:1 Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs error (device md1): add_dirent(de1): dir(134759485, dx: 1, nblks: 33) de(mdtest.16.41 : 12), off 2376 rlen 40 nlen 40 Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_dx_find_entry: bad entry in directory #134759485: rec_len is smaller than minimal - block=67436580offset=2416(59760), inode=134767633, rec_len=0, name_len=5 Oct 5 18:27:58 hyperion-rst6 kernel: Aborting journal on device md1-8. Oct 5 18:27:58 hyperion-rst6 kernel: LustreError: 4331:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30 Oct 5 18:27:58 hyperion-rst6 kernel: LustreError: 4331:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30 Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in osd_trans_stop: Journal has aborted Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only Oct 5 18:27:58 hyperion-rst6 kernel: LustreError: 4331:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30 Oct 5 18:27:58 hyperion-rst6 kernel: Oct 5 18:27:58 hyperion-rst6 kernel: ------------[ cut here ]------------ Oct 5 18:27:58 hyperion-rst6 kernel: kernel BUG at / var /lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/namei.c:1757! Oct 5 18:27:58 hyperion-rst6 kernel: invalid opcode: 0000 [#1] SMP Oct 5 18:27:58 hyperion-rst6 kernel: last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map Oct 5 18:27:58 hyperion-rst6 kernel: CPU 5 Oct 5 18:27:59 hyperion-rst6 kernel: Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ldiskfs(U) mbcache jbd2 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan] Oct 5 18:27:59 hyperion-rst6 kernel: Oct 5 18:27:59 hyperion-rst6 kernel: Pid: 6331, comm: mdt03_016 Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.gb16fe80.x86_64 #1 Dell Inc. PowerEdge R610/0K399H Oct 5 18:27:59 hyperion-rst6 kernel: RIP: 0010:[<ffffffffa0726502>] [<ffffffffa0726502>] add_dirent_to_buf+0x552/0x6a0 [ldiskfs] Oct 5 18:27:59 hyperion-rst6 kernel: RSP: 0018:ffff880159a17480 EFLAGS: 00010246 Oct 5 18:27:59 hyperion-rst6 kernel: RAX: ffff8801eea83800 RBX: ffff880313470948 RCX: ffff880313470970 Oct 5 18:27:59 hyperion-rst6 kernel: RDX: ffff88031833f400 RSI: 0000000000000046 RDI: ffff8801eea83c00 Oct 5 18:27:59 hyperion-rst6 kernel: RBP: ffff880159a17550 R08: 0000000000000000 R09: 0000000000000000 Oct 5 18:27:59 hyperion-rst6 kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff880146a44250 Oct 5 18:27:59 hyperion-rst6 kernel: R13: 0000000000000948 R14: 000000000000000c R15: 0000000000000028 Oct 5 18:27:59 hyperion-rst6 kernel: FS: 00002aaaab47e700(0000) GS:ffff880028240000(0000) knlGS:0000000000000000 Oct 5 18:27:59 hyperion-rst6 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Oct 5 18:27:59 hyperion-rst6 kernel: CR2: 00002aaaab34230c CR3: 0000000001a85000 CR4: 00000000000006e0 Oct 5 18:27:59 hyperion-rst6 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Oct 5 18:27:59 hyperion-rst6 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Oct 5 18:27:59 hyperion-rst6 kernel: Stack: Oct 5 18:27:59 hyperion-rst6 kernel: ffffc9002ae7e1f0 000000000000000c ffff880100000948 0000000000000028 Oct 5 18:27:59 hyperion-rst6 kernel: <d> 0000000000000028 00000000ecc23e5e ffff8801abec6080 ffffffff080863e4 Oct 5 18:27:59 hyperion-rst6 kernel: <d> 0000000000021000 ffffffffa074720c ffff88011e8db200 ffff880113f7cc18 Oct 5 18:27:59 hyperion-rst6 kernel: Call Trace: Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0728999>] ldiskfs_dx_add_entry+0x169/0x7e0 [ldiskfs] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa07433f1>] ? ldiskfs_xattr_set_handle+0x151/0x560 [ldiskfs] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa072929d>] ldiskfs_add_entry+0x28d/0x500 [ldiskfs] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0f13689>] __osd_ea_add_rec+0xb9/0x190 [osd_ldiskfs] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0f1fc1b>] osd_index_ea_insert+0x21b/0x5e0 [osd_ldiskfs] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e21d2d>] ? __mdd_xattr_set+0x25d/0x380 [mdd] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e3a767>] __mdd_index_insert_only+0x147/0x150 [mdd] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e3b901>] __mdd_index_insert+0x51/0x1f0 [mdd] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e3bdf9>] ? mdd_object_capa+0x29/0x1b0 [mdd] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e416eb>] mdd_create+0x1a0b/0x2180 [mdd] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0f19f9f>] ? osd_xattr_get+0x9f/0x360 [osd_ldiskfs] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0f6d637>] cml_create+0x97/0x250 [cmm] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0eaaddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0ebeb9f>] mdt_reint_open+0x108f/0x18a0 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e4728e>] ? md_ucred+0x1e/0x60 [mdd] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e8c235>] ? mdt_ucred+0x15/0x20 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0ea8151>] mdt_reint_rec+0x41/0xe0 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0ea19aa>] mdt_reint_internal+0x50a/0x810 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0ea1f7d>] mdt_intent_reint+0x1ed/0x500 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e9e191>] mdt_intent_policy+0x371/0x6a0 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa090d881>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa09359bf>] ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e9e506>] mdt_enqueue+0x46/0x130 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e95802>] mdt_handle_common+0x922/0x1740 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e966f5>] mdt_regular_handle+0x15/0x20 [mdt] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0965b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa03a514f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa095cf37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffff810533f3>] ? __wake_up+0x53/0x70 Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0967111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0966520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20 Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0966520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0966520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc] Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20 Oct 5 18:27:59 hyperion-rst6 kernel: Code: 54 24 18 48 8b 55 b8 48 89 c1 31 c0 48 89 14 24 48 c7 c2 38 c3 74 a0 e8 0d 55 01 00 48 8b 4d a8 66 83 79 04 00 0f 85 13 fd ff ff <0f> 0b eb fe 48 8b 7d 80 e8 d1 e7 01 00 49 8b bc 24 08 01 00 00 Oct 5 18:27:59 hyperion-rst6 kernel: RIP [<ffffffffa0726502>] add_dirent_to_buf+0x552/0x6a0 [ldiskfs] Oct 5 18:27:59 hyperion-rst6 kernel: RSP <ffff880159a17480> I don't think we dropped a vmcore on this one, still looking

            just posted a patch : http://review.whamcloud.com/#change,4186
            It fixed a corner case in osd_it_ea_rec() which might cause inconsistent OI cache, I'm not 100% sure it's fix of this bug, but at least last time SWL ran much longer than before w/o seeing this bug after applying this patch.

            liang Liang Zhen (Inactive) added a comment - just posted a patch : http://review.whamcloud.com/#change,4186 It fixed a corner case in osd_it_ea_rec() which might cause inconsistent OI cache, I'm not 100% sure it's fix of this bug, but at least last time SWL ran much longer than before w/o seeing this bug after applying this patch.

            also add another fix from wangdi to the debug patch, the latest build is: http://build.whamcloud.com/job/lustre-reviews/9641/

            liang Liang Zhen (Inactive) added a comment - also add another fix from wangdi to the debug patch, the latest build is: http://build.whamcloud.com/job/lustre-reviews/9641/

            People

              yong.fan nasf (Inactive)
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: