Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.3.0
    • None
    • SWL - Hyperion/LLNL RHEL6 servers and clients
    • 3
    • 6323

    Description

      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry:
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #88606751: rec_len is smaller than minimal - block=44369457offset=536(536), inode=88627711, rec_len=0, name_len=4
      Sep 16 01:28:12 hyperion-rst6 kernel: Aborting journal on device md1-8.
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_new_inode: Journal has aborted
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_delete_inode: Journal has aborted
      Sep 16 01:28:12 hyperion-rst6 kernel: LustreError: 4489:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      Sep 16 01:28:12 hyperion-rst6 kernel: LustreError: 4186:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30

      Disk appeared to be quite messed up with fsck -fy. Ran the data capture script from lu-1015, results attached.

      Attachments

        Activity

          [LU-1948] ldiskfs - MDS goes read-only (SWL)

          According to test result the corrupt directory comes from fdtree test. " 78644697: name L0F7, lck 0000000000000004, name_len=568931904". Hmm during fdtree test, each thread will only create files in its own directory, and only 10 files for each dir.

           
           \\srun -N $NODES -n $PROCS --wait=7200 -o fdtree.%t.out \$SWL/IO/fdtree.bash -l 3 -d 10 >& tmp_file
          

          So this problem should be unrelated with ldiskfs pdir patch. Now I suspect it might be related with mballoc, because we saw this during the test

           
          "2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: pa free mismatch: [pa ffff88031438e748] [phy 190752] [logic 608] [len 16] [free 13] [error 0] [inode 7340126] [freed 16]
          2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: free 16, pa_free 13 "
          

          though it happened after the fs turned to read-only.

          di.wang Di Wang (Inactive) added a comment - According to test result the corrupt directory comes from fdtree test. " 78644697: name L0F7, lck 0000000000000004, name_len=568931904". Hmm during fdtree test, each thread will only create files in its own directory, and only 10 files for each dir. \\srun -N $NODES -n $PROCS --wait=7200 -o fdtree.%t.out \$SWL/IO/fdtree.bash -l 3 -d 10 >& tmp_file So this problem should be unrelated with ldiskfs pdir patch. Now I suspect it might be related with mballoc, because we saw this during the test "2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: pa free mismatch: [pa ffff88031438e748] [phy 190752] [logic 608] [len 16] [free 13] [error 0] [inode 7340126] [freed 16] 2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: free 16, pa_free 13 " though it happened after the fs turned to read-only.

          Okay, the MDS just 'sploded.

          2012-09-18 14:13:37 LDISKFS-fs error (device md1): add_dirent:
          2012-09-18 14:13:37 LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #78644697: rec_len is smaller than minimal - block=39336951offset=152(152), inode=78655820, rec_len=0, name_len=4
          2012-09-18 14:13:37 Aborting journal on device md1-8.
          2012-09-18 14:13:37 LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal
          2012-09-18 14:13:37 LDISKFS-fs (md1): Remounting filesystem read-only
          2012-09-18 14:13:37 LDISKFS-fs (md1): Remounting filesystem read-only
          2012-09-18 14:13:37 LDISKFS-fs error (device md1): add_dirent: 78644697: lck ffff8801304ad000 off=152(78655820),inode=32, rec_len=4, name_len=0
          2012-09-18 14:13:37
          2012-09-18 14:13:37 LustreError: 8052:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
          2012-09-18 14:13:37 LustreError: 8052:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30
          2012-09-18 14:13:37 LustreError: 8052:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30
          2012-09-18 14:13:37 78644697: name L0F7, lck 0000000000000004, name_len=568931904
          2012-09-18 14:13:37
          2012-09-18 14:13:37 ------------[ cut here ]------------
          2012-09-18 14:13:37 kernel BUG at /var/lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/namei.c:1757!
          2012-09-18 14:13:37 invalid opcode: 0000 [#1] SMP
          2012-09-18 14:13:37 last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/0000:05:00.0/infiniband_mad/umad0/port
          2012-09-18 14:13:37 CPU 7
          2012-09-18 14:13:37 Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) ldiskfs(U) mbcache jbd2 lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan]
          2012-09-18 14:13:37
          2012-09-18 14:13:37 Pid: 4898, comm: mdt03_008 Tainted: P           ---------------    2.6.32-279.5.1.el6_lustre.gb4cc145.x86_64 #1 Dell Inc. PowerEdge R610/0K399H
          
          2012-09-18 14:13:37 RIP: 0010:[<ffffffffa0db21d0>]  [<ffffffffa0db21d0>] add_dirent_to_buf+0x4c0/0x530 [ldiskfs]
          2012-09-18 14:13:37 RSP: 0018:ffff880121e935f0  EFLAGS: 00010246
          2012-09-18 14:13:37 RAX: ffff880181248000 RBX: ffff880268654078 RCX: 00000000000014b5
          2012-09-18 14:13:37 RDX: ffff880268654098 RSI: 0000000000000046 RDI: ffff88016ff93c00
          2012-09-18 14:13:37 RBP: ffff880121e936b0 R08: 0000000000000000 R09: 0000000000000000
          2012-09-18 14:13:37 R10: 0000000000000001 R11: 0000000000000000 R12: ffff880130e16ad0
          2012-09-18 14:13:37 R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000020
          2012-09-18 14:13:37 FS:  00002aaaab47e700(0000) GS:ffff880028260000(0000) knlGS:0000000000000000
          2012-09-18 14:13:37 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          2012-09-18 14:13:37 CR2: 00000000006d3a80 CR3: 0000000001a85000 CR4: 00000000000006e0
          2012-09-18 14:13:37 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          2012-09-18 14:13:37 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          2012-09-18 14:13:37 Process mdt03_008 (pid: 4898, threadinfo ffff880121e92000, task ffff88012df62080)
          2012-09-18 14:13:37 LustreError: 4099:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30
          2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30
          2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30
          2012-09-18 14:13:37 Stack:
          2012-09-18 14:13:37  ffff880121e93640 0000000002583bf7 0000000000001000 0000000000000000
          2012-09-18 14:13:37 <d> ffff8801abdad5c0 ffff880200000004 ffff880104b03141 0000000000000004
          2012-09-18 14:13:37 <d> 00000020e2f30800 ffff8801258d9e00 ffff8801064c3288 ffff880268654fe0
          2012-09-18 14:13:37 Call Trace:
          2012-09-18 14:13:37  [<ffffffffa0db5c5d>] ldiskfs_add_entry+0xcd/0x500 [ldiskfs]
          2012-09-18 14:13:37  [<ffffffffa1001689>] __osd_ea_add_rec+0xb9/0x190 [osd_ldiskfs]
          2012-09-18 14:13:37  [<ffffffffa100dbeb>] osd_index_ea_insert+0x21b/0x5e0 [osd_ldiskfs]
          2012-09-18 14:13:37  [<ffffffffa0eee977>] __mdd_index_insert_only+0x147/0x150 [mdd]
          2012-09-18 14:13:37  [<ffffffffa0eef9b1>] __mdd_index_insert+0x51/0x1f0 [mdd]
          2012-09-18 14:13:37  [<ffffffffa0ef55e3>] mdd_create+0x19a3/0x20c0 [mdd]
          2012-09-18 14:13:37  [<ffffffffa03aa5b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
          2012-09-18 14:13:37  [<ffffffffa06a4637>] cml_create+0x97/0x250 [cmm]
          2012-09-18 14:13:37  [<ffffffffa0f8bb9f>] mdt_reint_open+0x108f/0x18a0 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f75151>] mdt_reint_rec+0x41/0xe0 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f6e9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f6ef7d>] mdt_intent_reint+0x1ed/0x500 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f6b191>] mdt_intent_policy+0x371/0x6a0 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0859881>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa08819bf>] ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa0f6b506>] mdt_enqueue+0x46/0x130 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f62802>] mdt_handle_common+0x922/0x1740 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f636f5>] mdt_regular_handle+0x15/0x20 [mdt]
          2012-09-18 14:13:37  [<ffffffffa08b199d>] ptlrpc_server_handle_request+0x40d/0xea0 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa08a8f37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa03aa5b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
          2012-09-18 14:13:37  [<ffffffff810533f3>] ? __wake_up+0x53/0x70
          2012-09-18 14:13:37  [<ffffffffa08b2f89>] ptlrpc_main+0xb59/0x1860 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffff8100c14a>] child_rip+0xa/0x20
          2012-09-18 14:13:37  [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffff8100c140>] ? child_rip+0x0/0x20
          2012-09-18 14:13:37 Code: 00 00 48 c7 c2 20 81 dd a0 4c 8b 45 a8 45 89 f1 48 c7 c6 e5 d2 dd a0 e8 9f 56 01 00 48 8b 55 c8 66 83 7a 04 00 0f 85 a9 fd ff ff <0f> 0b eb fe 49 8b
          2012-09-18 14:13:37 LustreError: 4099:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30
          2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30
          2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30
          2012-09-18 14:13:37 4c 24 40 49 8b bc 24 08 01 00 00 31 c0 4c 8b
          2012-09-18 14:13:37 RIP  [<ffffffffa0db21d0>] add_dirent_to_buf+0x4c0/0x530 [ldiskfs]
          2012-09-18 14:13:37  RSP <ffff880121e935f0>
          2012-09-18 14:13:38 Initializing cgroup subsys cpuset
          2012-09-18 14:13:38 Initializing cgroup subsys cpu
          2012-09-18 14:13:38 Linux version 2.6.32-279.5.1.el6_lustre.gb4cc145.x86
          

          vmcore has been captured, is on brent.whamcloud.com ~/cliffw/lu1948/vmcore-rst6-c2.gz

          cliffw Cliff White (Inactive) added a comment - Okay, the MDS just 'sploded. 2012-09-18 14:13:37 LDISKFS-fs error (device md1): add_dirent: 2012-09-18 14:13:37 LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #78644697: rec_len is smaller than minimal - block=39336951offset=152(152), inode=78655820, rec_len=0, name_len=4 2012-09-18 14:13:37 Aborting journal on device md1-8. 2012-09-18 14:13:37 LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal 2012-09-18 14:13:37 LDISKFS-fs (md1): Remounting filesystem read-only 2012-09-18 14:13:37 LDISKFS-fs (md1): Remounting filesystem read-only 2012-09-18 14:13:37 LDISKFS-fs error (device md1): add_dirent: 78644697: lck ffff8801304ad000 off=152(78655820),inode=32, rec_len=4, name_len=0 2012-09-18 14:13:37 2012-09-18 14:13:37 LustreError: 8052:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30 2012-09-18 14:13:37 LustreError: 8052:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30 2012-09-18 14:13:37 LustreError: 8052:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30 2012-09-18 14:13:37 78644697: name L0F7, lck 0000000000000004, name_len=568931904 2012-09-18 14:13:37 2012-09-18 14:13:37 ------------[ cut here ]------------ 2012-09-18 14:13:37 kernel BUG at / var /lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/namei.c:1757! 2012-09-18 14:13:37 invalid opcode: 0000 [#1] SMP 2012-09-18 14:13:37 last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/0000:05:00.0/infiniband_mad/umad0/port 2012-09-18 14:13:37 CPU 7 2012-09-18 14:13:37 Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) ldiskfs(U) mbcache jbd2 lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan] 2012-09-18 14:13:37 2012-09-18 14:13:37 Pid: 4898, comm: mdt03_008 Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.gb4cc145.x86_64 #1 Dell Inc. PowerEdge R610/0K399H 2012-09-18 14:13:37 RIP: 0010:[<ffffffffa0db21d0>] [<ffffffffa0db21d0>] add_dirent_to_buf+0x4c0/0x530 [ldiskfs] 2012-09-18 14:13:37 RSP: 0018:ffff880121e935f0 EFLAGS: 00010246 2012-09-18 14:13:37 RAX: ffff880181248000 RBX: ffff880268654078 RCX: 00000000000014b5 2012-09-18 14:13:37 RDX: ffff880268654098 RSI: 0000000000000046 RDI: ffff88016ff93c00 2012-09-18 14:13:37 RBP: ffff880121e936b0 R08: 0000000000000000 R09: 0000000000000000 2012-09-18 14:13:37 R10: 0000000000000001 R11: 0000000000000000 R12: ffff880130e16ad0 2012-09-18 14:13:37 R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000020 2012-09-18 14:13:37 FS: 00002aaaab47e700(0000) GS:ffff880028260000(0000) knlGS:0000000000000000 2012-09-18 14:13:37 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 2012-09-18 14:13:37 CR2: 00000000006d3a80 CR3: 0000000001a85000 CR4: 00000000000006e0 2012-09-18 14:13:37 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2012-09-18 14:13:37 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2012-09-18 14:13:37 Process mdt03_008 (pid: 4898, threadinfo ffff880121e92000, task ffff88012df62080) 2012-09-18 14:13:37 LustreError: 4099:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30 2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30 2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30 2012-09-18 14:13:37 Stack: 2012-09-18 14:13:37 ffff880121e93640 0000000002583bf7 0000000000001000 0000000000000000 2012-09-18 14:13:37 <d> ffff8801abdad5c0 ffff880200000004 ffff880104b03141 0000000000000004 2012-09-18 14:13:37 <d> 00000020e2f30800 ffff8801258d9e00 ffff8801064c3288 ffff880268654fe0 2012-09-18 14:13:37 Call Trace: 2012-09-18 14:13:37 [<ffffffffa0db5c5d>] ldiskfs_add_entry+0xcd/0x500 [ldiskfs] 2012-09-18 14:13:37 [<ffffffffa1001689>] __osd_ea_add_rec+0xb9/0x190 [osd_ldiskfs] 2012-09-18 14:13:37 [<ffffffffa100dbeb>] osd_index_ea_insert+0x21b/0x5e0 [osd_ldiskfs] 2012-09-18 14:13:37 [<ffffffffa0eee977>] __mdd_index_insert_only+0x147/0x150 [mdd] 2012-09-18 14:13:37 [<ffffffffa0eef9b1>] __mdd_index_insert+0x51/0x1f0 [mdd] 2012-09-18 14:13:37 [<ffffffffa0ef55e3>] mdd_create+0x19a3/0x20c0 [mdd] 2012-09-18 14:13:37 [<ffffffffa03aa5b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2012-09-18 14:13:37 [<ffffffffa06a4637>] cml_create+0x97/0x250 [cmm] 2012-09-18 14:13:37 [<ffffffffa0f8bb9f>] mdt_reint_open+0x108f/0x18a0 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f75151>] mdt_reint_rec+0x41/0xe0 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f6e9aa>] mdt_reint_internal+0x50a/0x810 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f6ef7d>] mdt_intent_reint+0x1ed/0x500 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f6b191>] mdt_intent_policy+0x371/0x6a0 [mdt] 2012-09-18 14:13:37 [<ffffffffa0859881>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa08819bf>] ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa0f6b506>] mdt_enqueue+0x46/0x130 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f62802>] mdt_handle_common+0x922/0x1740 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f636f5>] mdt_regular_handle+0x15/0x20 [mdt] 2012-09-18 14:13:37 [<ffffffffa08b199d>] ptlrpc_server_handle_request+0x40d/0xea0 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa08a8f37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa03aa5b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2012-09-18 14:13:37 [<ffffffff810533f3>] ? __wake_up+0x53/0x70 2012-09-18 14:13:37 [<ffffffffa08b2f89>] ptlrpc_main+0xb59/0x1860 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc] 2012-09-18 14:13:37 [<ffffffff8100c14a>] child_rip+0xa/0x20 2012-09-18 14:13:37 [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc] 2012-09-18 14:13:37 [<ffffffff8100c140>] ? child_rip+0x0/0x20 2012-09-18 14:13:37 Code: 00 00 48 c7 c2 20 81 dd a0 4c 8b 45 a8 45 89 f1 48 c7 c6 e5 d2 dd a0 e8 9f 56 01 00 48 8b 55 c8 66 83 7a 04 00 0f 85 a9 fd ff ff <0f> 0b eb fe 49 8b 2012-09-18 14:13:37 LustreError: 4099:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30 2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30 2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30 2012-09-18 14:13:37 4c 24 40 49 8b bc 24 08 01 00 00 31 c0 4c 8b 2012-09-18 14:13:37 RIP [<ffffffffa0db21d0>] add_dirent_to_buf+0x4c0/0x530 [ldiskfs] 2012-09-18 14:13:37 RSP <ffff880121e935f0> 2012-09-18 14:13:38 Initializing cgroup subsys cpuset 2012-09-18 14:13:38 Initializing cgroup subsys cpu 2012-09-18 14:13:38 Linux version 2.6.32-279.5.1.el6_lustre.gb4cc145.x86 vmcore has been captured, is on brent.whamcloud.com ~/cliffw/lu1948/vmcore-rst6-c2.gz

          I add some debug patch based on b2_3 (http://review.whamcloud.com/#change,4020 only patch on el6 ldiskfs series). build http://build.whamcloud.com/job/lustre-reviews/9250/ Cliff, could you please test with this rpm. And I suspect this problem might be related with ldiskfs htree, so Liang is looking at this problem as well.

          di.wang Di Wang (Inactive) added a comment - I add some debug patch based on b2_3 ( http://review.whamcloud.com/#change,4020 only patch on el6 ldiskfs series). build http://build.whamcloud.com/job/lustre-reviews/9250/ Cliff, could you please test with this rpm. And I suspect this problem might be related with ldiskfs htree, so Liang is looking at this problem as well.

          MDS just went read-only, then crashed when I attempted a umount -f

          2012-09-17 17:44:00 Lustre: 4130:0:(service.c:2101:ptlrpc_handle_rs()) All locks stolen from rs ffff880132a5d000 x1413385599017471.t4342523245 o0 NID 192.168.119.37@o2ib1
          2012-09-17 17:44:00 Lustre: 4130:0:(service.c:2101:ptlrpc_handle_rs()) Skipped 402 previous similar messages
          2012-09-17 17:44:00 LustreError: 5729:0:(mdt_recovery.c:611:mdt_steal_ack_locks()) Skipped 426 previous similar messages
          2012-09-17 17:47:49 LDISKFS-fs error (device md1): ldiskfs_add_entry:
          2012-09-17 17:47:49 LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #116917774: rec_len is smaller than minimal - block=58525219offset=312(312), inode=117441145, rec_len=0, name_len=4
          2012-09-17 17:47:49 Aborting journal on device md1-8.  
          2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted
          2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted
          2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted
          2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted
          2012-09-17 17:47:49 LDISKFS-fs (md1):
          2012-09-17 17:47:49 LustreError: 5731:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
          2012-09-17 17:47:49 LDISKFS-fs (md1):
          2012-09-17 17:47:49 LDISKFS-fs (md1):
          2012-09-17 17:47:49 LDISKFS-fs (md1): Remounting filesystem read-onlyRemounting filesystem read-onlyRemounting filesystem read-only
          

          Kernel BUG

          2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: pa free mismatch: [pa ffff88031438e748] [phy 190752] [logic 608] [len 16] [free 13] [error 0] [inode 7340126] [freed 16]
          2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: free 16, pa_free 13
          2012-09-17 17:49:57 ------------[ cut here ]------------
          2012-09-17 17:49:57 kernel BUG at /var/lib/jenkins/workspace/lustre-b2_3/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/mballoc.c:3784!
          2012-09-17 17:49:57 invalid opcode: 0000 [#1] SMP
          2012-09-17 17:49:57 last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
          2012-09-17 17:49:57 CPU 14
          2012-09-17 17:49:57 Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ldiskfs(U) mbcache jbd2 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan]
          2012-09-17 17:49:57
          2012-09-17 17:49:57 Pid: 10207, comm: umount Tainted: P           ---------------    2.6.32-279.5.1.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R610/0K399H
          2012-09-17 17:49:57 RIP: 0010:[<ffffffffa07214b6>]  [<ffffffffa07214b6>] ldiskfs_mb_release_inode_pa+0x346/0x360 [ldiskfs]
          2012-09-17 17:49:57 RSP: 0018:ffff88026e7152b8  EFLAGS: 00010216
          2012-09-17 17:49:57 RAX: 000000000000000d RBX: 0000000000000010 RCX: ffff88032a805800
          2012-09-17 17:49:57 RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffff880316f3a2c0
          2012-09-17 17:49:57 RBP: ffff88026e715368 R08: 0000000000000000 R09: 0000000000000000
          2012-09-17 17:49:57 R10: 00000000001 R11: 0000000000000000 R12: ffff88025db65898
          2012-09-17 17:49:57 R13: ffff88023f45d3a0 R14: 0000000000002931 R15: ffff88031438e748
          2012-09-17 17:49:57 FS:  00002aaaab690740(0000) GS:ffff8801b58e0000(0000) knlGS:0000000000000000
          2012-09-17 17:49:57 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          2012-09-17 17:49:57 CR2: 00002aaaaace3387 CR3: 00000002b6ee5000 CR4: 00000000000006e0
          2012-09-17 17:49:57 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          2012-09-17 17:49:57 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          2012-09-17 17:49:57 Process umount (pid: 10207, threadinfo ffff88026e714000, task ffff88007fa10080)
          2012-09-17 17:49:57 Stack:
          2012-09-17 17:49:57  ffff880200000010 000000000000000d ffff880200000000 000000000070005e
          2012-09-17 17:49:57 <d> 0000000000000010 000000000000000b ffff88026e7152f8 ffffffff811ae146
          2012-09-17 17:49:57 <d> ffff88032a805800 ffff88032a3b8c00 ffff88031438e748 000000000002c000
          2012-09-17 17:49:57 Call Trace:
          
          2012-09-17 17:49:57  [<ffffffff811ae146>] ? __wait_on_buffer+0x26/0x30
          2012-09-17 17:49:57  [<ffffffffa072701e>] ldiskfs_discard_preallocations+0x1fe/0x490 [ldiskfs]
          2012-09-17 17:49:57  [<ffffffffa0711077>] ldiskfs_release_file+0xb7/0xd0 [ldiskfs]
          2012-09-17 17:49:57  [<ffffffff8117ca65>] __fput+0xf5/0x210
          2012-09-17 17:49:57  [<ffffffff8117cba5>] fput+0x25/0x30
          2012-09-17 17:49:57  [<ffffffff811785cd>] filp_close+0x5d/0x90
          2012-09-17 17:49:57  [<ffffffffa07a1dbe>] llog_lvfs_close+0x2e/0x130 [obdclass]
          2012-09-17 17:49:57  [<ffffffffa079afff>] llog_close+0x5f/0x1b0 [obdclass]
          2012-09-17 17:49:57  [<ffffffffa079e92d>] llog_cat_put+0x5d/0x160 [obdclass]
          2012-09-17 17:49:57  [<ffffffffa07a76bc>] llog_obd_origin_cleanup+0x1bc/0x460 [obdclass]
          2012-09-17 17:49:57  [<ffffffffa07a7366>] __llog_ctxt_put+0xc6/0x260 [obdclass]
          2012-09-17 17:49:57  [<ffffffffa07a7a0c>] llog_cleanup+0xac/0x490 [obdclass]
          2012-09-17 17:49:57  [<ffffffffa0b7b6ef>] osc_llog_finish+0x7f/0x250 [osc]
          2012-09-17 17:49:57  [<ffffffffa07a6438>] obd_llog_finish+0x88/0x1a0 [obdclass]
          2012-09-17 17:49:57  [<ffffffffa0b76dca>] osc_precleanup+0x2ea/0x3b0 [osc]
          2012-09-17 17:49:57  [<ffffffffa07d00c7>] class_cleanup+0x1f7/0xdc0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa07b12b6>] ? class_name2dev+0x56/0xe0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa07d243b>] class_process_config+0x102b/0x1c30 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa0398be0>] ? cfs_alloc+0x30/0x60 [libcfs]
          2012-09-17 17:49:58  [<ffffffffa07cbe43>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa07d31b9>] class_manual_cleanup+0x179/0x6f0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa0c09743>] lov_putref+0x393/0xb30 [lov]
          2012-09-17 17:49:58  [<ffffffff814ff18e>] ? mutex_lock+0x1e/0x50
          2012-09-17 17:49:58  [<ffffffffa0c1335f>] lov_disconnect+0x19f/0x4c0 [lov]
          2012-09-17 17:49:58  [<ffffffffa0eb1c72>] mds_lov_clean+0x262/0x6f0 [mds]
          2012-09-17 17:49:58  [<ffffffffa0eb22c6>] mds_precleanup+0x1c6/0x4b0 [mds]
          2012-09-17 17:49:58  [<ffffffffa07d00c7>] class_cleanup+0x1f7/0xdc0 [obdclass]
          2012-09-17 17:49:58  [<ffffffff81271fe5>] ? _atomic_dec_and_lock+0x55/0x80
          2012-09-17 17:49:58  [<ffffffffa07e9827>] ? lu_object_put+0x157/0x290 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa0eee8c4>] mdd_fini_obd+0x54/0x240 [mdd]
          2012-09-17 17:49:58  [<ffffffffa0f0d777>] mdd_process_config+0x2d7/0xa20 [mdd]
          2012-09-17 17:49:58  [<ffffffff814fdca0>] ? thread_return+0x4e/0x76e
          2012-09-17 17:49:58  [<ffffffffa06a3b6b>] cmm_process_config+0x7b/0xd10 [cmm]
          2012-09-17 17:49:58  [<ffffffffa0f6fd8a>] mdt_stack_fini+0x5da/0xd40 [mdt]
          2012-09-17 17:49:58  [<ffffffffa06a2886>] ? cmm_init_capa_ctxt+0x46/0x140 [cmm]
          2012-09-17 17:49:58  [<ffffffffa0f707c5>] mdt_device_fini+0x2d5/0x5f0 [mdt]
          2012-09-17 17:49:58  [<ffffffffa07d0447>] class_cleanup+0x577/0xdc0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa07b12b6>] ? class_name2dev+0x56/0xe0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa07d243b>] class_process_config+0x102b/0x1c30 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa0398be0>] ? cfs_alloc+0x30/0x60 [libcfs]
          2012-09-17 17:49:58  [<ffffffffa07cbe43>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa07d31b9>] class_manual_cleanup+0x179/0x6f0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa07b12b6>] ? class_name2dev+0x56/0xe0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa07dd3c9>] server_put_super+0x6f9/0xcf0 [obdclass]
          2012-09-17 17:49:58  [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0
          2012-09-17 17:49:58  [<ffffffff8117d436>] kill_anon_super+0x16/0x60
          2012-09-17 17:49:58  [<ffffffffa07d4df6>] lustre_kill_super+0x36/0x60 [obdclass]
          2012-09-17 17:49:58  [<ffffffff8117e4b0>] deactivate_super+0x70/0x90
          2012-09-17 17:49:58  [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110
          2012-09-17 17:49:58  [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0
          2012-09-17 17:49:58  [<ffffffff81082de1>] ? sigprocmask+0x71/0x110
          2012-09-17 17:49:58  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
          2012-09-17 17:49:58 Code: 55 c8 e9 39 fe ff ff 31 db 41 83 7f 4c 00 0f 84 7e fd ff ff 0f 0b eb fe 0f 0b eb fe 0f 0b 0f 1f 80 00 00 00 00 eb f7 0f 0b eb fe <0f> 0b 0f 1f 84 00 00 00 00 00 eb f6 66 66 66 66 66 2e 0f 1f 84
          2012-09-17 17:49:58 RIP  [<ffffffffa07214b6>] ldiskfs_mb_release_inode_pa+0x346/0x360 [ldiskfs]
          2012-09-17 17:49:58  RSP <ffff88026e7152b8>
          2012-09-17 17:49:58 Initializing cgroup subsys cpuset
          

          am rebooting the node now, will run fsck script.

          cliffw Cliff White (Inactive) added a comment - MDS just went read-only, then crashed when I attempted a umount -f 2012-09-17 17:44:00 Lustre: 4130:0:(service.c:2101:ptlrpc_handle_rs()) All locks stolen from rs ffff880132a5d000 x1413385599017471.t4342523245 o0 NID 192.168.119.37@o2ib1 2012-09-17 17:44:00 Lustre: 4130:0:(service.c:2101:ptlrpc_handle_rs()) Skipped 402 previous similar messages 2012-09-17 17:44:00 LustreError: 5729:0:(mdt_recovery.c:611:mdt_steal_ack_locks()) Skipped 426 previous similar messages 2012-09-17 17:47:49 LDISKFS-fs error (device md1): ldiskfs_add_entry: 2012-09-17 17:47:49 LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #116917774: rec_len is smaller than minimal - block=58525219offset=312(312), inode=117441145, rec_len=0, name_len=4 2012-09-17 17:47:49 Aborting journal on device md1-8. 2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted 2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted 2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted 2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted 2012-09-17 17:47:49 LDISKFS-fs (md1): 2012-09-17 17:47:49 LustreError: 5731:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30 2012-09-17 17:47:49 LDISKFS-fs (md1): 2012-09-17 17:47:49 LDISKFS-fs (md1): 2012-09-17 17:47:49 LDISKFS-fs (md1): Remounting filesystem read-onlyRemounting filesystem read-onlyRemounting filesystem read-only Kernel BUG 2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: pa free mismatch: [pa ffff88031438e748] [phy 190752] [logic 608] [len 16] [free 13] [error 0] [inode 7340126] [freed 16] 2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: free 16, pa_free 13 2012-09-17 17:49:57 ------------[ cut here ]------------ 2012-09-17 17:49:57 kernel BUG at / var /lib/jenkins/workspace/lustre-b2_3/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/mballoc.c:3784! 2012-09-17 17:49:57 invalid opcode: 0000 [#1] SMP 2012-09-17 17:49:57 last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map 2012-09-17 17:49:57 CPU 14 2012-09-17 17:49:57 Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ldiskfs(U) mbcache jbd2 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan] 2012-09-17 17:49:57 2012-09-17 17:49:57 Pid: 10207, comm: umount Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R610/0K399H 2012-09-17 17:49:57 RIP: 0010:[<ffffffffa07214b6>] [<ffffffffa07214b6>] ldiskfs_mb_release_inode_pa+0x346/0x360 [ldiskfs] 2012-09-17 17:49:57 RSP: 0018:ffff88026e7152b8 EFLAGS: 00010216 2012-09-17 17:49:57 RAX: 000000000000000d RBX: 0000000000000010 RCX: ffff88032a805800 2012-09-17 17:49:57 RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffff880316f3a2c0 2012-09-17 17:49:57 RBP: ffff88026e715368 R08: 0000000000000000 R09: 0000000000000000 2012-09-17 17:49:57 R10: 00000000001 R11: 0000000000000000 R12: ffff88025db65898 2012-09-17 17:49:57 R13: ffff88023f45d3a0 R14: 0000000000002931 R15: ffff88031438e748 2012-09-17 17:49:57 FS: 00002aaaab690740(0000) GS:ffff8801b58e0000(0000) knlGS:0000000000000000 2012-09-17 17:49:57 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 2012-09-17 17:49:57 CR2: 00002aaaaace3387 CR3: 00000002b6ee5000 CR4: 00000000000006e0 2012-09-17 17:49:57 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2012-09-17 17:49:57 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2012-09-17 17:49:57 Process umount (pid: 10207, threadinfo ffff88026e714000, task ffff88007fa10080) 2012-09-17 17:49:57 Stack: 2012-09-17 17:49:57 ffff880200000010 000000000000000d ffff880200000000 000000000070005e 2012-09-17 17:49:57 <d> 0000000000000010 000000000000000b ffff88026e7152f8 ffffffff811ae146 2012-09-17 17:49:57 <d> ffff88032a805800 ffff88032a3b8c00 ffff88031438e748 000000000002c000 2012-09-17 17:49:57 Call Trace: 2012-09-17 17:49:57 [<ffffffff811ae146>] ? __wait_on_buffer+0x26/0x30 2012-09-17 17:49:57 [<ffffffffa072701e>] ldiskfs_discard_preallocations+0x1fe/0x490 [ldiskfs] 2012-09-17 17:49:57 [<ffffffffa0711077>] ldiskfs_release_file+0xb7/0xd0 [ldiskfs] 2012-09-17 17:49:57 [<ffffffff8117ca65>] __fput+0xf5/0x210 2012-09-17 17:49:57 [<ffffffff8117cba5>] fput+0x25/0x30 2012-09-17 17:49:57 [<ffffffff811785cd>] filp_close+0x5d/0x90 2012-09-17 17:49:57 [<ffffffffa07a1dbe>] llog_lvfs_close+0x2e/0x130 [obdclass] 2012-09-17 17:49:57 [<ffffffffa079afff>] llog_close+0x5f/0x1b0 [obdclass] 2012-09-17 17:49:57 [<ffffffffa079e92d>] llog_cat_put+0x5d/0x160 [obdclass] 2012-09-17 17:49:57 [<ffffffffa07a76bc>] llog_obd_origin_cleanup+0x1bc/0x460 [obdclass] 2012-09-17 17:49:57 [<ffffffffa07a7366>] __llog_ctxt_put+0xc6/0x260 [obdclass] 2012-09-17 17:49:57 [<ffffffffa07a7a0c>] llog_cleanup+0xac/0x490 [obdclass] 2012-09-17 17:49:57 [<ffffffffa0b7b6ef>] osc_llog_finish+0x7f/0x250 [osc] 2012-09-17 17:49:57 [<ffffffffa07a6438>] obd_llog_finish+0x88/0x1a0 [obdclass] 2012-09-17 17:49:57 [<ffffffffa0b76dca>] osc_precleanup+0x2ea/0x3b0 [osc] 2012-09-17 17:49:57 [<ffffffffa07d00c7>] class_cleanup+0x1f7/0xdc0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07b12b6>] ? class_name2dev+0x56/0xe0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07d243b>] class_process_config+0x102b/0x1c30 [obdclass] 2012-09-17 17:49:58 [<ffffffffa0398be0>] ? cfs_alloc+0x30/0x60 [libcfs] 2012-09-17 17:49:58 [<ffffffffa07cbe43>] ? lustre_cfg_new+0x353/0x7e0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07d31b9>] class_manual_cleanup+0x179/0x6f0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa0c09743>] lov_putref+0x393/0xb30 [lov] 2012-09-17 17:49:58 [<ffffffff814ff18e>] ? mutex_lock+0x1e/0x50 2012-09-17 17:49:58 [<ffffffffa0c1335f>] lov_disconnect+0x19f/0x4c0 [lov] 2012-09-17 17:49:58 [<ffffffffa0eb1c72>] mds_lov_clean+0x262/0x6f0 [mds] 2012-09-17 17:49:58 [<ffffffffa0eb22c6>] mds_precleanup+0x1c6/0x4b0 [mds] 2012-09-17 17:49:58 [<ffffffffa07d00c7>] class_cleanup+0x1f7/0xdc0 [obdclass] 2012-09-17 17:49:58 [<ffffffff81271fe5>] ? _atomic_dec_and_lock+0x55/0x80 2012-09-17 17:49:58 [<ffffffffa07e9827>] ? lu_object_put+0x157/0x290 [obdclass] 2012-09-17 17:49:58 [<ffffffffa0eee8c4>] mdd_fini_obd+0x54/0x240 [mdd] 2012-09-17 17:49:58 [<ffffffffa0f0d777>] mdd_process_config+0x2d7/0xa20 [mdd] 2012-09-17 17:49:58 [<ffffffff814fdca0>] ? thread_return+0x4e/0x76e 2012-09-17 17:49:58 [<ffffffffa06a3b6b>] cmm_process_config+0x7b/0xd10 [cmm] 2012-09-17 17:49:58 [<ffffffffa0f6fd8a>] mdt_stack_fini+0x5da/0xd40 [mdt] 2012-09-17 17:49:58 [<ffffffffa06a2886>] ? cmm_init_capa_ctxt+0x46/0x140 [cmm] 2012-09-17 17:49:58 [<ffffffffa0f707c5>] mdt_device_fini+0x2d5/0x5f0 [mdt] 2012-09-17 17:49:58 [<ffffffffa07d0447>] class_cleanup+0x577/0xdc0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07b12b6>] ? class_name2dev+0x56/0xe0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07d243b>] class_process_config+0x102b/0x1c30 [obdclass] 2012-09-17 17:49:58 [<ffffffffa0398be0>] ? cfs_alloc+0x30/0x60 [libcfs] 2012-09-17 17:49:58 [<ffffffffa07cbe43>] ? lustre_cfg_new+0x353/0x7e0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07d31b9>] class_manual_cleanup+0x179/0x6f0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07b12b6>] ? class_name2dev+0x56/0xe0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07dd3c9>] server_put_super+0x6f9/0xcf0 [obdclass] 2012-09-17 17:49:58 [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0 2012-09-17 17:49:58 [<ffffffff8117d436>] kill_anon_super+0x16/0x60 2012-09-17 17:49:58 [<ffffffffa07d4df6>] lustre_kill_super+0x36/0x60 [obdclass] 2012-09-17 17:49:58 [<ffffffff8117e4b0>] deactivate_super+0x70/0x90 2012-09-17 17:49:58 [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110 2012-09-17 17:49:58 [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0 2012-09-17 17:49:58 [<ffffffff81082de1>] ? sigprocmask+0x71/0x110 2012-09-17 17:49:58 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b 2012-09-17 17:49:58 Code: 55 c8 e9 39 fe ff ff 31 db 41 83 7f 4c 00 0f 84 7e fd ff ff 0f 0b eb fe 0f 0b eb fe 0f 0b 0f 1f 80 00 00 00 00 eb f7 0f 0b eb fe <0f> 0b 0f 1f 84 00 00 00 00 00 eb f6 66 66 66 66 66 2e 0f 1f 84 2012-09-17 17:49:58 RIP [<ffffffffa07214b6>] ldiskfs_mb_release_inode_pa+0x346/0x360 [ldiskfs] 2012-09-17 17:49:58 RSP <ffff88026e7152b8> 2012-09-17 17:49:58 Initializing cgroup subsys cpuset am rebooting the node now, will run fsck script.

          Thanks, I see. Is the test started from a fresh reformatted MDS? or the MDS was just been fixed by e2fsck -fy before the run? If the MDS has not been changed yet, could you find out the directory name of 88606751? I guess you can just mount it as ldiskfs, and get it by find /mnt/mds -inum 88606751 Thanks.

          di.wang Di Wang (Inactive) added a comment - Thanks, I see. Is the test started from a fresh reformatted MDS? or the MDS was just been fixed by e2fsck -fy before the run? If the MDS has not been changed yet, could you find out the directory name of 88606751? I guess you can just mount it as ldiskfs, and get it by find /mnt/mds -inum 88606751 Thanks.

          current script:

          #!/bin/bash
          ostdev=$1
          logdir=/home/white215/wham/lu1015/logs/
          stamp=`date +%Y%m%d.%M%S`
          dumpe2fs /dev/$ostdev > $logdir/$ostdev.$stamp.stats
          debugfs -c -R "dump <8>" /dev/$ostdev > $logdir/$ostdev.$stamp.journal
          debugfs -c -R "logdump -a" /dev/$ostdev > $logdir/$ostdev.$stamp.logdump
          e2fsck -fp /dev/$ostdev 2>&1 | tee $logdir/$ostdev.$stamp.e2fsck
          dumpe2fs /dev/$ostdev > $logdir/$ostdev.$stamp.stats.post
          
          cliffw Cliff White (Inactive) added a comment - current script: #!/bin/bash ostdev=$1 logdir=/home/white215/wham/lu1015/logs/ stamp=`date +%Y%m%d.%M%S` dumpe2fs /dev/$ostdev > $logdir/$ostdev.$stamp.stats debugfs -c -R "dump <8>" /dev/$ostdev > $logdir/$ostdev.$stamp.journal debugfs -c -R "logdump -a" /dev/$ostdev > $logdir/$ostdev.$stamp.logdump e2fsck -fp /dev/$ostdev 2>&1 | tee $logdir/$ostdev.$stamp.e2fsck dumpe2fs /dev/$ostdev > $logdir/$ostdev.$stamp.stats.post

          The test being run is SWL, which is essentially a large mix of multiple tasks (ior,fdtree,mib,etc) In this case, the MDS went read-only in the
          middle of the test. I halted the MDS and ran the data-dump script created for LU-1015. e2fsprogs was 1.41.90-wc4-7, i will reinstall newer.

          cliffw Cliff White (Inactive) added a comment - The test being run is SWL, which is essentially a large mix of multiple tasks (ior,fdtree,mib,etc) In this case, the MDS went read-only in the middle of the test. I halted the MDS and ran the data-dump script created for LU-1015 . e2fsprogs was 1.41.90-wc4-7, i will reinstall newer.

          Cliff, Could you tell me more about your test? on lu-1015, I saw most discussion is about OST fails and run fsck.
          Did you do the same thing about MDS? run tests, uncleanup power off, fsck -p (found problem), then run fsck -fy, start MDS again, found these problem? What test did you run? Could you please post the script here? And I assume you use e2fsprogs-1.42.3.wc3?

          di.wang Di Wang (Inactive) added a comment - Cliff, Could you tell me more about your test? on lu-1015, I saw most discussion is about OST fails and run fsck. Did you do the same thing about MDS? run tests, uncleanup power off, fsck -p (found problem), then run fsck -fy, start MDS again, found these problem? What test did you run? Could you please post the script here? And I assume you use e2fsprogs-1.42.3.wc3?
          pjones Peter Jones added a comment -

          Bobijam

          Is there anything that you can determine from this report?

          Peter

          pjones Peter Jones added a comment - Bobijam Is there anything that you can determine from this report? Peter

          People

            di.wang Di Wang (Inactive)
            cliffw Cliff White (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: