Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.3.0
    • None
    • SWL - Hyperion/LLNL RHEL6 servers and clients
    • 3
    • 6323

    Description

      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry:
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #88606751: rec_len is smaller than minimal - block=44369457offset=536(536), inode=88627711, rec_len=0, name_len=4
      Sep 16 01:28:12 hyperion-rst6 kernel: Aborting journal on device md1-8.
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_new_inode: Journal has aborted
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_delete_inode: Journal has aborted
      Sep 16 01:28:12 hyperion-rst6 kernel: LustreError: 4489:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      Sep 16 01:28:12 hyperion-rst6 kernel: LustreError: 4186:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30

      Disk appeared to be quite messed up with fsck -fy. Ran the data capture script from lu-1015, results attached.

      Attachments

        Activity

          [LU-1948] ldiskfs - MDS goes read-only (SWL)

          fdtree does not create symlinks, which only includes mkdir, create, dd, unlink, rmdir. But SWL includes 5 tests, fdtree, simul, IOR, mirIO, mdtest. Simul definitely include create symlinks here.

          di.wang Di Wang (Inactive) added a comment - fdtree does not create symlinks, which only includes mkdir, create, dd, unlink, rmdir. But SWL includes 5 tests, fdtree, simul, IOR, mirIO, mdtest. Simul definitely include create symlinks here.

          Isn't osd_ldiskfs_write_record() writing one-byte off the buffer limit if write_NUL is true?

          liwei Li Wei (Inactive) added a comment - Isn't osd_ldiskfs_write_record() writing one-byte off the buffer limit if write_NUL is true?

          There haven't been changes to mballoc, but there was a change to the symlink NUL termination recently. Does this workload create symlinks?

          adilger Andreas Dilger added a comment - There haven't been changes to mballoc, but there was a change to the symlink NUL termination recently. Does this workload create symlinks?

          Add andreas to the ticket, in case there are some mballoc changes recently for ext4.

          di.wang Di Wang (Inactive) added a comment - Add andreas to the ticket, in case there are some mballoc changes recently for ext4.

          According to test result the corrupt directory comes from fdtree test. " 78644697: name L0F7, lck 0000000000000004, name_len=568931904". Hmm during fdtree test, each thread will only create files in its own directory, and only 10 files for each dir.

           
           \\srun -N $NODES -n $PROCS --wait=7200 -o fdtree.%t.out \$SWL/IO/fdtree.bash -l 3 -d 10 >& tmp_file
          

          So this problem should be unrelated with ldiskfs pdir patch. Now I suspect it might be related with mballoc, because we saw this during the test

           
          "2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: pa free mismatch: [pa ffff88031438e748] [phy 190752] [logic 608] [len 16] [free 13] [error 0] [inode 7340126] [freed 16]
          2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: free 16, pa_free 13 "
          

          though it happened after the fs turned to read-only.

          di.wang Di Wang (Inactive) added a comment - According to test result the corrupt directory comes from fdtree test. " 78644697: name L0F7, lck 0000000000000004, name_len=568931904". Hmm during fdtree test, each thread will only create files in its own directory, and only 10 files for each dir. \\srun -N $NODES -n $PROCS --wait=7200 -o fdtree.%t.out \$SWL/IO/fdtree.bash -l 3 -d 10 >& tmp_file So this problem should be unrelated with ldiskfs pdir patch. Now I suspect it might be related with mballoc, because we saw this during the test "2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: pa free mismatch: [pa ffff88031438e748] [phy 190752] [logic 608] [len 16] [free 13] [error 0] [inode 7340126] [freed 16] 2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: free 16, pa_free 13 " though it happened after the fs turned to read-only.

          Okay, the MDS just 'sploded.

          2012-09-18 14:13:37 LDISKFS-fs error (device md1): add_dirent:
          2012-09-18 14:13:37 LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #78644697: rec_len is smaller than minimal - block=39336951offset=152(152), inode=78655820, rec_len=0, name_len=4
          2012-09-18 14:13:37 Aborting journal on device md1-8.
          2012-09-18 14:13:37 LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal
          2012-09-18 14:13:37 LDISKFS-fs (md1): Remounting filesystem read-only
          2012-09-18 14:13:37 LDISKFS-fs (md1): Remounting filesystem read-only
          2012-09-18 14:13:37 LDISKFS-fs error (device md1): add_dirent: 78644697: lck ffff8801304ad000 off=152(78655820),inode=32, rec_len=4, name_len=0
          2012-09-18 14:13:37
          2012-09-18 14:13:37 LustreError: 8052:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
          2012-09-18 14:13:37 LustreError: 8052:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30
          2012-09-18 14:13:37 LustreError: 8052:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30
          2012-09-18 14:13:37 78644697: name L0F7, lck 0000000000000004, name_len=568931904
          2012-09-18 14:13:37
          2012-09-18 14:13:37 ------------[ cut here ]------------
          2012-09-18 14:13:37 kernel BUG at /var/lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/namei.c:1757!
          2012-09-18 14:13:37 invalid opcode: 0000 [#1] SMP
          2012-09-18 14:13:37 last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/0000:05:00.0/infiniband_mad/umad0/port
          2012-09-18 14:13:37 CPU 7
          2012-09-18 14:13:37 Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) ldiskfs(U) mbcache jbd2 lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan]
          2012-09-18 14:13:37
          2012-09-18 14:13:37 Pid: 4898, comm: mdt03_008 Tainted: P           ---------------    2.6.32-279.5.1.el6_lustre.gb4cc145.x86_64 #1 Dell Inc. PowerEdge R610/0K399H
          
          2012-09-18 14:13:37 RIP: 0010:[<ffffffffa0db21d0>]  [<ffffffffa0db21d0>] add_dirent_to_buf+0x4c0/0x530 [ldiskfs]
          2012-09-18 14:13:37 RSP: 0018:ffff880121e935f0  EFLAGS: 00010246
          2012-09-18 14:13:37 RAX: ffff880181248000 RBX: ffff880268654078 RCX: 00000000000014b5
          2012-09-18 14:13:37 RDX: ffff880268654098 RSI: 0000000000000046 RDI: ffff88016ff93c00
          2012-09-18 14:13:37 RBP: ffff880121e936b0 R08: 0000000000000000 R09: 0000000000000000
          2012-09-18 14:13:37 R10: 0000000000000001 R11: 0000000000000000 R12: ffff880130e16ad0
          2012-09-18 14:13:37 R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000020
          2012-09-18 14:13:37 FS:  00002aaaab47e700(0000) GS:ffff880028260000(0000) knlGS:0000000000000000
          2012-09-18 14:13:37 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          2012-09-18 14:13:37 CR2: 00000000006d3a80 CR3: 0000000001a85000 CR4: 00000000000006e0
          2012-09-18 14:13:37 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          2012-09-18 14:13:37 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          2012-09-18 14:13:37 Process mdt03_008 (pid: 4898, threadinfo ffff880121e92000, task ffff88012df62080)
          2012-09-18 14:13:37 LustreError: 4099:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30
          2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30
          2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30
          2012-09-18 14:13:37 Stack:
          2012-09-18 14:13:37  ffff880121e93640 0000000002583bf7 0000000000001000 0000000000000000
          2012-09-18 14:13:37 <d> ffff8801abdad5c0 ffff880200000004 ffff880104b03141 0000000000000004
          2012-09-18 14:13:37 <d> 00000020e2f30800 ffff8801258d9e00 ffff8801064c3288 ffff880268654fe0
          2012-09-18 14:13:37 Call Trace:
          2012-09-18 14:13:37  [<ffffffffa0db5c5d>] ldiskfs_add_entry+0xcd/0x500 [ldiskfs]
          2012-09-18 14:13:37  [<ffffffffa1001689>] __osd_ea_add_rec+0xb9/0x190 [osd_ldiskfs]
          2012-09-18 14:13:37  [<ffffffffa100dbeb>] osd_index_ea_insert+0x21b/0x5e0 [osd_ldiskfs]
          2012-09-18 14:13:37  [<ffffffffa0eee977>] __mdd_index_insert_only+0x147/0x150 [mdd]
          2012-09-18 14:13:37  [<ffffffffa0eef9b1>] __mdd_index_insert+0x51/0x1f0 [mdd]
          2012-09-18 14:13:37  [<ffffffffa0ef55e3>] mdd_create+0x19a3/0x20c0 [mdd]
          2012-09-18 14:13:37  [<ffffffffa03aa5b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
          2012-09-18 14:13:37  [<ffffffffa06a4637>] cml_create+0x97/0x250 [cmm]
          2012-09-18 14:13:37  [<ffffffffa0f8bb9f>] mdt_reint_open+0x108f/0x18a0 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f75151>] mdt_reint_rec+0x41/0xe0 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f6e9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f6ef7d>] mdt_intent_reint+0x1ed/0x500 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f6b191>] mdt_intent_policy+0x371/0x6a0 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0859881>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa08819bf>] ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa0f6b506>] mdt_enqueue+0x46/0x130 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f62802>] mdt_handle_common+0x922/0x1740 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f636f5>] mdt_regular_handle+0x15/0x20 [mdt]
          2012-09-18 14:13:37  [<ffffffffa08b199d>] ptlrpc_server_handle_request+0x40d/0xea0 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa08a8f37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa03aa5b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
          2012-09-18 14:13:37  [<ffffffff810533f3>] ? __wake_up+0x53/0x70
          2012-09-18 14:13:37  [<ffffffffa08b2f89>] ptlrpc_main+0xb59/0x1860 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffff8100c14a>] child_rip+0xa/0x20
          2012-09-18 14:13:37  [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffff8100c140>] ? child_rip+0x0/0x20
          2012-09-18 14:13:37 Code: 00 00 48 c7 c2 20 81 dd a0 4c 8b 45 a8 45 89 f1 48 c7 c6 e5 d2 dd a0 e8 9f 56 01 00 48 8b 55 c8 66 83 7a 04 00 0f 85 a9 fd ff ff <0f> 0b eb fe 49 8b
          2012-09-18 14:13:37 LustreError: 4099:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30
          2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30
          2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30
          2012-09-18 14:13:37 4c 24 40 49 8b bc 24 08 01 00 00 31 c0 4c 8b
          2012-09-18 14:13:37 RIP  [<ffffffffa0db21d0>] add_dirent_to_buf+0x4c0/0x530 [ldiskfs]
          2012-09-18 14:13:37  RSP <ffff880121e935f0>
          2012-09-18 14:13:38 Initializing cgroup subsys cpuset
          2012-09-18 14:13:38 Initializing cgroup subsys cpu
          2012-09-18 14:13:38 Linux version 2.6.32-279.5.1.el6_lustre.gb4cc145.x86
          

          vmcore has been captured, is on brent.whamcloud.com ~/cliffw/lu1948/vmcore-rst6-c2.gz

          cliffw Cliff White (Inactive) added a comment - Okay, the MDS just 'sploded. 2012-09-18 14:13:37 LDISKFS-fs error (device md1): add_dirent: 2012-09-18 14:13:37 LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #78644697: rec_len is smaller than minimal - block=39336951offset=152(152), inode=78655820, rec_len=0, name_len=4 2012-09-18 14:13:37 Aborting journal on device md1-8. 2012-09-18 14:13:37 LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal 2012-09-18 14:13:37 LDISKFS-fs (md1): Remounting filesystem read-only 2012-09-18 14:13:37 LDISKFS-fs (md1): Remounting filesystem read-only 2012-09-18 14:13:37 LDISKFS-fs error (device md1): add_dirent: 78644697: lck ffff8801304ad000 off=152(78655820),inode=32, rec_len=4, name_len=0 2012-09-18 14:13:37 2012-09-18 14:13:37 LustreError: 8052:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30 2012-09-18 14:13:37 LustreError: 8052:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30 2012-09-18 14:13:37 LustreError: 8052:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30 2012-09-18 14:13:37 78644697: name L0F7, lck 0000000000000004, name_len=568931904 2012-09-18 14:13:37 2012-09-18 14:13:37 ------------[ cut here ]------------ 2012-09-18 14:13:37 kernel BUG at / var /lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/namei.c:1757! 2012-09-18 14:13:37 invalid opcode: 0000 [#1] SMP 2012-09-18 14:13:37 last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/0000:05:00.0/infiniband_mad/umad0/port 2012-09-18 14:13:37 CPU 7 2012-09-18 14:13:37 Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) ldiskfs(U) mbcache jbd2 lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan] 2012-09-18 14:13:37 2012-09-18 14:13:37 Pid: 4898, comm: mdt03_008 Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.gb4cc145.x86_64 #1 Dell Inc. PowerEdge R610/0K399H 2012-09-18 14:13:37 RIP: 0010:[<ffffffffa0db21d0>] [<ffffffffa0db21d0>] add_dirent_to_buf+0x4c0/0x530 [ldiskfs] 2012-09-18 14:13:37 RSP: 0018:ffff880121e935f0 EFLAGS: 00010246 2012-09-18 14:13:37 RAX: ffff880181248000 RBX: ffff880268654078 RCX: 00000000000014b5 2012-09-18 14:13:37 RDX: ffff880268654098 RSI: 0000000000000046 RDI: ffff88016ff93c00 2012-09-18 14:13:37 RBP: ffff880121e936b0 R08: 0000000000000000 R09: 0000000000000000 2012-09-18 14:13:37 R10: 0000000000000001 R11: 0000000000000000 R12: ffff880130e16ad0 2012-09-18 14:13:37 R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000020 2012-09-18 14:13:37 FS: 00002aaaab47e700(0000) GS:ffff880028260000(0000) knlGS:0000000000000000 2012-09-18 14:13:37 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 2012-09-18 14:13:37 CR2: 00000000006d3a80 CR3: 0000000001a85000 CR4: 00000000000006e0 2012-09-18 14:13:37 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2012-09-18 14:13:37 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2012-09-18 14:13:37 Process mdt03_008 (pid: 4898, threadinfo ffff880121e92000, task ffff88012df62080) 2012-09-18 14:13:37 LustreError: 4099:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30 2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30 2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30 2012-09-18 14:13:37 Stack: 2012-09-18 14:13:37 ffff880121e93640 0000000002583bf7 0000000000001000 0000000000000000 2012-09-18 14:13:37 <d> ffff8801abdad5c0 ffff880200000004 ffff880104b03141 0000000000000004 2012-09-18 14:13:37 <d> 00000020e2f30800 ffff8801258d9e00 ffff8801064c3288 ffff880268654fe0 2012-09-18 14:13:37 Call Trace: 2012-09-18 14:13:37 [<ffffffffa0db5c5d>] ldiskfs_add_entry+0xcd/0x500 [ldiskfs] 2012-09-18 14:13:37 [<ffffffffa1001689>] __osd_ea_add_rec+0xb9/0x190 [osd_ldiskfs] 2012-09-18 14:13:37 [<ffffffffa100dbeb>] osd_index_ea_insert+0x21b/0x5e0 [osd_ldiskfs] 2012-09-18 14:13:37 [<ffffffffa0eee977>] __mdd_index_insert_only+0x147/0x150 [mdd] 2012-09-18 14:13:37 [<ffffffffa0eef9b1>] __mdd_index_insert+0x51/0x1f0 [mdd] 2012-09-18 14:13:37 [<ffffffffa0ef55e3>] mdd_create+0x19a3/0x20c0 [mdd] 2012-09-18 14:13:37 [<ffffffffa03aa5b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2012-09-18 14:13:37 [<ffffffffa06a4637>] cml_create+0x97/0x250 [cmm] 2012-09-18 14:13:37 [<ffffffffa0f8bb9f>] mdt_reint_open+0x108f/0x18a0 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f75151>] mdt_reint_rec+0x41/0xe0 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f6e9aa>] mdt_reint_internal+0x50a/0x810 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f6ef7d>] mdt_intent_reint+0x1ed/0x500 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f6b191>] mdt_intent_policy+0x371/0x6a0 [mdt] 2012-09-18 14:13:37 [<ffffffffa0859881>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa08819bf>] ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa0f6b506>] mdt_enqueue+0x46/0x130 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f62802>] mdt_handle_common+0x922/0x1740 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f636f5>] mdt_regular_handle+0x15/0x20 [mdt] 2012-09-18 14:13:37 [<ffffffffa08b199d>] ptlrpc_server_handle_request+0x40d/0xea0 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa08a8f37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa03aa5b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2012-09-18 14:13:37 [<ffffffff810533f3>] ? __wake_up+0x53/0x70 2012-09-18 14:13:37 [<ffffffffa08b2f89>] ptlrpc_main+0xb59/0x1860 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc] 2012-09-18 14:13:37 [<ffffffff8100c14a>] child_rip+0xa/0x20 2012-09-18 14:13:37 [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc] 2012-09-18 14:13:37 [<ffffffff8100c140>] ? child_rip+0x0/0x20 2012-09-18 14:13:37 Code: 00 00 48 c7 c2 20 81 dd a0 4c 8b 45 a8 45 89 f1 48 c7 c6 e5 d2 dd a0 e8 9f 56 01 00 48 8b 55 c8 66 83 7a 04 00 0f 85 a9 fd ff ff <0f> 0b eb fe 49 8b 2012-09-18 14:13:37 LustreError: 4099:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30 2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30 2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30 2012-09-18 14:13:37 4c 24 40 49 8b bc 24 08 01 00 00 31 c0 4c 8b 2012-09-18 14:13:37 RIP [<ffffffffa0db21d0>] add_dirent_to_buf+0x4c0/0x530 [ldiskfs] 2012-09-18 14:13:37 RSP <ffff880121e935f0> 2012-09-18 14:13:38 Initializing cgroup subsys cpuset 2012-09-18 14:13:38 Initializing cgroup subsys cpu 2012-09-18 14:13:38 Linux version 2.6.32-279.5.1.el6_lustre.gb4cc145.x86 vmcore has been captured, is on brent.whamcloud.com ~/cliffw/lu1948/vmcore-rst6-c2.gz

          I add some debug patch based on b2_3 (http://review.whamcloud.com/#change,4020 only patch on el6 ldiskfs series). build http://build.whamcloud.com/job/lustre-reviews/9250/ Cliff, could you please test with this rpm. And I suspect this problem might be related with ldiskfs htree, so Liang is looking at this problem as well.

          di.wang Di Wang (Inactive) added a comment - I add some debug patch based on b2_3 ( http://review.whamcloud.com/#change,4020 only patch on el6 ldiskfs series). build http://build.whamcloud.com/job/lustre-reviews/9250/ Cliff, could you please test with this rpm. And I suspect this problem might be related with ldiskfs htree, so Liang is looking at this problem as well.

          MDS just went read-only, then crashed when I attempted a umount -f

          2012-09-17 17:44:00 Lustre: 4130:0:(service.c:2101:ptlrpc_handle_rs()) All locks stolen from rs ffff880132a5d000 x1413385599017471.t4342523245 o0 NID 192.168.119.37@o2ib1
          2012-09-17 17:44:00 Lustre: 4130:0:(service.c:2101:ptlrpc_handle_rs()) Skipped 402 previous similar messages
          2012-09-17 17:44:00 LustreError: 5729:0:(mdt_recovery.c:611:mdt_steal_ack_locks()) Skipped 426 previous similar messages
          2012-09-17 17:47:49 LDISKFS-fs error (device md1): ldiskfs_add_entry:
          2012-09-17 17:47:49 LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #116917774: rec_len is smaller than minimal - block=58525219offset=312(312), inode=117441145, rec_len=0, name_len=4
          2012-09-17 17:47:49 Aborting journal on device md1-8.  
          2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted
          2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted
          2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted
          2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted
          2012-09-17 17:47:49 LDISKFS-fs (md1):
          2012-09-17 17:47:49 LustreError: 5731:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
          2012-09-17 17:47:49 LDISKFS-fs (md1):
          2012-09-17 17:47:49 LDISKFS-fs (md1):
          2012-09-17 17:47:49 LDISKFS-fs (md1): Remounting filesystem read-onlyRemounting filesystem read-onlyRemounting filesystem read-only
          

          Kernel BUG

          2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: pa free mismatch: [pa ffff88031438e748] [phy 190752] [logic 608] [len 16] [free 13] [error 0] [inode 7340126] [freed 16]
          2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: free 16, pa_free 13
          2012-09-17 17:49:57 ------------[ cut here ]------------
          2012-09-17 17:49:57 kernel BUG at /var/lib/jenkins/workspace/lustre-b2_3/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/mballoc.c:3784!
          2012-09-17 17:49:57 invalid opcode: 0000 [#1] SMP
          2012-09-17 17:49:57 last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
          2012-09-17 17:49:57 CPU 14
          2012-09-17 17:49:57 Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ldiskfs(U) mbcache jbd2 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan]
          2012-09-17 17:49:57
          2012-09-17 17:49:57 Pid: 10207, comm: umount Tainted: P           ---------------    2.6.32-279.5.1.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R610/0K399H
          2012-09-17 17:49:57 RIP: 0010:[<ffffffffa07214b6>]  [<ffffffffa07214b6>] ldiskfs_mb_release_inode_pa+0x346/0x360 [ldiskfs]
          2012-09-17 17:49:57 RSP: 0018:ffff88026e7152b8  EFLAGS: 00010216
          2012-09-17 17:49:57 RAX: 000000000000000d RBX: 0000000000000010 RCX: ffff88032a805800
          2012-09-17 17:49:57 RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffff880316f3a2c0
          2012-09-17 17:49:57 RBP: ffff88026e715368 R08: 0000000000000000 R09: 0000000000000000
          2012-09-17 17:49:57 R10: 00000000001 R11: 0000000000000000 R12: ffff88025db65898
          2012-09-17 17:49:57 R13: ffff88023f45d3a0 R14: 0000000000002931 R15: ffff88031438e748
          2012-09-17 17:49:57 FS:  00002aaaab690740(0000) GS:ffff8801b58e0000(0000) knlGS:0000000000000000
          2012-09-17 17:49:57 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          2012-09-17 17:49:57 CR2: 00002aaaaace3387 CR3: 00000002b6ee5000 CR4: 00000000000006e0
          2012-09-17 17:49:57 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          2012-09-17 17:49:57 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          2012-09-17 17:49:57 Process umount (pid: 10207, threadinfo ffff88026e714000, task ffff88007fa10080)
          2012-09-17 17:49:57 Stack:
          2012-09-17 17:49:57  ffff880200000010 000000000000000d ffff880200000000 000000000070005e
          2012-09-17 17:49:57 <d> 0000000000000010 000000000000000b ffff88026e7152f8 ffffffff811ae146
          2012-09-17 17:49:57 <d> ffff88032a805800 ffff88032a3b8c00 ffff88031438e748 000000000002c000
          2012-09-17 17:49:57 Call Trace:
          
          2012-09-17 17:49:57  [<ffffffff811ae146>] ? __wait_on_buffer+0x26/0x30
          2012-09-17 17:49:57  [<ffffffffa072701e>] ldiskfs_discard_preallocations+0x1fe/0x490 [ldiskfs]
          2012-09-17 17:49:57  [<ffffffffa0711077>] ldiskfs_release_file+0xb7/0xd0 [ldiskfs]
          2012-09-17 17:49:57  [<ffffffff8117ca65>] __fput+0xf5/0x210
          2012-09-17 17:49:57  [<ffffffff8117cba5>] fput+0x25/0x30
          2012-09-17 17:49:57  [<ffffffff811785cd>] filp_close+0x5d/0x90
          2012-09-17 17:49:57  [<ffffffffa07a1dbe>] llog_lvfs_close+0x2e/0x130 [obdclass]
          2012-09-17 17:49:57  [<ffffffffa079afff>] llog_close+0x5f/0x1b0 [obdclass]
          2012-09-17 17:49:57  [<ffffffffa079e92d>] llog_cat_put+0x5d/0x160 [obdclass]
          2012-09-17 17:49:57  [<ffffffffa07a76bc>] llog_obd_origin_cleanup+0x1bc/0x460 [obdclass]
          2012-09-17 17:49:57  [<ffffffffa07a7366>] __llog_ctxt_put+0xc6/0x260 [obdclass]
          2012-09-17 17:49:57  [<ffffffffa07a7a0c>] llog_cleanup+0xac/0x490 [obdclass]
          2012-09-17 17:49:57  [<ffffffffa0b7b6ef>] osc_llog_finish+0x7f/0x250 [osc]
          2012-09-17 17:49:57  [<ffffffffa07a6438>] obd_llog_finish+0x88/0x1a0 [obdclass]
          2012-09-17 17:49:57  [<ffffffffa0b76dca>] osc_precleanup+0x2ea/0x3b0 [osc]
          2012-09-17 17:49:57  [<ffffffffa07d00c7>] class_cleanup+0x1f7/0xdc0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa07b12b6>] ? class_name2dev+0x56/0xe0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa07d243b>] class_process_config+0x102b/0x1c30 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa0398be0>] ? cfs_alloc+0x30/0x60 [libcfs]
          2012-09-17 17:49:58  [<ffffffffa07cbe43>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa07d31b9>] class_manual_cleanup+0x179/0x6f0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa0c09743>] lov_putref+0x393/0xb30 [lov]
          2012-09-17 17:49:58  [<ffffffff814ff18e>] ? mutex_lock+0x1e/0x50
          2012-09-17 17:49:58  [<ffffffffa0c1335f>] lov_disconnect+0x19f/0x4c0 [lov]
          2012-09-17 17:49:58  [<ffffffffa0eb1c72>] mds_lov_clean+0x262/0x6f0 [mds]
          2012-09-17 17:49:58  [<ffffffffa0eb22c6>] mds_precleanup+0x1c6/0x4b0 [mds]
          2012-09-17 17:49:58  [<ffffffffa07d00c7>] class_cleanup+0x1f7/0xdc0 [obdclass]
          2012-09-17 17:49:58  [<ffffffff81271fe5>] ? _atomic_dec_and_lock+0x55/0x80
          2012-09-17 17:49:58  [<ffffffffa07e9827>] ? lu_object_put+0x157/0x290 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa0eee8c4>] mdd_fini_obd+0x54/0x240 [mdd]
          2012-09-17 17:49:58  [<ffffffffa0f0d777>] mdd_process_config+0x2d7/0xa20 [mdd]
          2012-09-17 17:49:58  [<ffffffff814fdca0>] ? thread_return+0x4e/0x76e
          2012-09-17 17:49:58  [<ffffffffa06a3b6b>] cmm_process_config+0x7b/0xd10 [cmm]
          2012-09-17 17:49:58  [<ffffffffa0f6fd8a>] mdt_stack_fini+0x5da/0xd40 [mdt]
          2012-09-17 17:49:58  [<ffffffffa06a2886>] ? cmm_init_capa_ctxt+0x46/0x140 [cmm]
          2012-09-17 17:49:58  [<ffffffffa0f707c5>] mdt_device_fini+0x2d5/0x5f0 [mdt]
          2012-09-17 17:49:58  [<ffffffffa07d0447>] class_cleanup+0x577/0xdc0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa07b12b6>] ? class_name2dev+0x56/0xe0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa07d243b>] class_process_config+0x102b/0x1c30 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa0398be0>] ? cfs_alloc+0x30/0x60 [libcfs]
          2012-09-17 17:49:58  [<ffffffffa07cbe43>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa07d31b9>] class_manual_cleanup+0x179/0x6f0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa07b12b6>] ? class_name2dev+0x56/0xe0 [obdclass]
          2012-09-17 17:49:58  [<ffffffffa07dd3c9>] server_put_super+0x6f9/0xcf0 [obdclass]
          2012-09-17 17:49:58  [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0
          2012-09-17 17:49:58  [<ffffffff8117d436>] kill_anon_super+0x16/0x60
          2012-09-17 17:49:58  [<ffffffffa07d4df6>] lustre_kill_super+0x36/0x60 [obdclass]
          2012-09-17 17:49:58  [<ffffffff8117e4b0>] deactivate_super+0x70/0x90
          2012-09-17 17:49:58  [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110
          2012-09-17 17:49:58  [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0
          2012-09-17 17:49:58  [<ffffffff81082de1>] ? sigprocmask+0x71/0x110
          2012-09-17 17:49:58  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
          2012-09-17 17:49:58 Code: 55 c8 e9 39 fe ff ff 31 db 41 83 7f 4c 00 0f 84 7e fd ff ff 0f 0b eb fe 0f 0b eb fe 0f 0b 0f 1f 80 00 00 00 00 eb f7 0f 0b eb fe <0f> 0b 0f 1f 84 00 00 00 00 00 eb f6 66 66 66 66 66 2e 0f 1f 84
          2012-09-17 17:49:58 RIP  [<ffffffffa07214b6>] ldiskfs_mb_release_inode_pa+0x346/0x360 [ldiskfs]
          2012-09-17 17:49:58  RSP <ffff88026e7152b8>
          2012-09-17 17:49:58 Initializing cgroup subsys cpuset
          

          am rebooting the node now, will run fsck script.

          cliffw Cliff White (Inactive) added a comment - MDS just went read-only, then crashed when I attempted a umount -f 2012-09-17 17:44:00 Lustre: 4130:0:(service.c:2101:ptlrpc_handle_rs()) All locks stolen from rs ffff880132a5d000 x1413385599017471.t4342523245 o0 NID 192.168.119.37@o2ib1 2012-09-17 17:44:00 Lustre: 4130:0:(service.c:2101:ptlrpc_handle_rs()) Skipped 402 previous similar messages 2012-09-17 17:44:00 LustreError: 5729:0:(mdt_recovery.c:611:mdt_steal_ack_locks()) Skipped 426 previous similar messages 2012-09-17 17:47:49 LDISKFS-fs error (device md1): ldiskfs_add_entry: 2012-09-17 17:47:49 LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #116917774: rec_len is smaller than minimal - block=58525219offset=312(312), inode=117441145, rec_len=0, name_len=4 2012-09-17 17:47:49 Aborting journal on device md1-8. 2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted 2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted 2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted 2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted 2012-09-17 17:47:49 LDISKFS-fs (md1): 2012-09-17 17:47:49 LustreError: 5731:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30 2012-09-17 17:47:49 LDISKFS-fs (md1): 2012-09-17 17:47:49 LDISKFS-fs (md1): 2012-09-17 17:47:49 LDISKFS-fs (md1): Remounting filesystem read-onlyRemounting filesystem read-onlyRemounting filesystem read-only Kernel BUG 2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: pa free mismatch: [pa ffff88031438e748] [phy 190752] [logic 608] [len 16] [free 13] [error 0] [inode 7340126] [freed 16] 2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: free 16, pa_free 13 2012-09-17 17:49:57 ------------[ cut here ]------------ 2012-09-17 17:49:57 kernel BUG at / var /lib/jenkins/workspace/lustre-b2_3/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/mballoc.c:3784! 2012-09-17 17:49:57 invalid opcode: 0000 [#1] SMP 2012-09-17 17:49:57 last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map 2012-09-17 17:49:57 CPU 14 2012-09-17 17:49:57 Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ldiskfs(U) mbcache jbd2 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan] 2012-09-17 17:49:57 2012-09-17 17:49:57 Pid: 10207, comm: umount Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R610/0K399H 2012-09-17 17:49:57 RIP: 0010:[<ffffffffa07214b6>] [<ffffffffa07214b6>] ldiskfs_mb_release_inode_pa+0x346/0x360 [ldiskfs] 2012-09-17 17:49:57 RSP: 0018:ffff88026e7152b8 EFLAGS: 00010216 2012-09-17 17:49:57 RAX: 000000000000000d RBX: 0000000000000010 RCX: ffff88032a805800 2012-09-17 17:49:57 RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffff880316f3a2c0 2012-09-17 17:49:57 RBP: ffff88026e715368 R08: 0000000000000000 R09: 0000000000000000 2012-09-17 17:49:57 R10: 00000000001 R11: 0000000000000000 R12: ffff88025db65898 2012-09-17 17:49:57 R13: ffff88023f45d3a0 R14: 0000000000002931 R15: ffff88031438e748 2012-09-17 17:49:57 FS: 00002aaaab690740(0000) GS:ffff8801b58e0000(0000) knlGS:0000000000000000 2012-09-17 17:49:57 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 2012-09-17 17:49:57 CR2: 00002aaaaace3387 CR3: 00000002b6ee5000 CR4: 00000000000006e0 2012-09-17 17:49:57 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2012-09-17 17:49:57 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2012-09-17 17:49:57 Process umount (pid: 10207, threadinfo ffff88026e714000, task ffff88007fa10080) 2012-09-17 17:49:57 Stack: 2012-09-17 17:49:57 ffff880200000010 000000000000000d ffff880200000000 000000000070005e 2012-09-17 17:49:57 <d> 0000000000000010 000000000000000b ffff88026e7152f8 ffffffff811ae146 2012-09-17 17:49:57 <d> ffff88032a805800 ffff88032a3b8c00 ffff88031438e748 000000000002c000 2012-09-17 17:49:57 Call Trace: 2012-09-17 17:49:57 [<ffffffff811ae146>] ? __wait_on_buffer+0x26/0x30 2012-09-17 17:49:57 [<ffffffffa072701e>] ldiskfs_discard_preallocations+0x1fe/0x490 [ldiskfs] 2012-09-17 17:49:57 [<ffffffffa0711077>] ldiskfs_release_file+0xb7/0xd0 [ldiskfs] 2012-09-17 17:49:57 [<ffffffff8117ca65>] __fput+0xf5/0x210 2012-09-17 17:49:57 [<ffffffff8117cba5>] fput+0x25/0x30 2012-09-17 17:49:57 [<ffffffff811785cd>] filp_close+0x5d/0x90 2012-09-17 17:49:57 [<ffffffffa07a1dbe>] llog_lvfs_close+0x2e/0x130 [obdclass] 2012-09-17 17:49:57 [<ffffffffa079afff>] llog_close+0x5f/0x1b0 [obdclass] 2012-09-17 17:49:57 [<ffffffffa079e92d>] llog_cat_put+0x5d/0x160 [obdclass] 2012-09-17 17:49:57 [<ffffffffa07a76bc>] llog_obd_origin_cleanup+0x1bc/0x460 [obdclass] 2012-09-17 17:49:57 [<ffffffffa07a7366>] __llog_ctxt_put+0xc6/0x260 [obdclass] 2012-09-17 17:49:57 [<ffffffffa07a7a0c>] llog_cleanup+0xac/0x490 [obdclass] 2012-09-17 17:49:57 [<ffffffffa0b7b6ef>] osc_llog_finish+0x7f/0x250 [osc] 2012-09-17 17:49:57 [<ffffffffa07a6438>] obd_llog_finish+0x88/0x1a0 [obdclass] 2012-09-17 17:49:57 [<ffffffffa0b76dca>] osc_precleanup+0x2ea/0x3b0 [osc] 2012-09-17 17:49:57 [<ffffffffa07d00c7>] class_cleanup+0x1f7/0xdc0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07b12b6>] ? class_name2dev+0x56/0xe0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07d243b>] class_process_config+0x102b/0x1c30 [obdclass] 2012-09-17 17:49:58 [<ffffffffa0398be0>] ? cfs_alloc+0x30/0x60 [libcfs] 2012-09-17 17:49:58 [<ffffffffa07cbe43>] ? lustre_cfg_new+0x353/0x7e0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07d31b9>] class_manual_cleanup+0x179/0x6f0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa0c09743>] lov_putref+0x393/0xb30 [lov] 2012-09-17 17:49:58 [<ffffffff814ff18e>] ? mutex_lock+0x1e/0x50 2012-09-17 17:49:58 [<ffffffffa0c1335f>] lov_disconnect+0x19f/0x4c0 [lov] 2012-09-17 17:49:58 [<ffffffffa0eb1c72>] mds_lov_clean+0x262/0x6f0 [mds] 2012-09-17 17:49:58 [<ffffffffa0eb22c6>] mds_precleanup+0x1c6/0x4b0 [mds] 2012-09-17 17:49:58 [<ffffffffa07d00c7>] class_cleanup+0x1f7/0xdc0 [obdclass] 2012-09-17 17:49:58 [<ffffffff81271fe5>] ? _atomic_dec_and_lock+0x55/0x80 2012-09-17 17:49:58 [<ffffffffa07e9827>] ? lu_object_put+0x157/0x290 [obdclass] 2012-09-17 17:49:58 [<ffffffffa0eee8c4>] mdd_fini_obd+0x54/0x240 [mdd] 2012-09-17 17:49:58 [<ffffffffa0f0d777>] mdd_process_config+0x2d7/0xa20 [mdd] 2012-09-17 17:49:58 [<ffffffff814fdca0>] ? thread_return+0x4e/0x76e 2012-09-17 17:49:58 [<ffffffffa06a3b6b>] cmm_process_config+0x7b/0xd10 [cmm] 2012-09-17 17:49:58 [<ffffffffa0f6fd8a>] mdt_stack_fini+0x5da/0xd40 [mdt] 2012-09-17 17:49:58 [<ffffffffa06a2886>] ? cmm_init_capa_ctxt+0x46/0x140 [cmm] 2012-09-17 17:49:58 [<ffffffffa0f707c5>] mdt_device_fini+0x2d5/0x5f0 [mdt] 2012-09-17 17:49:58 [<ffffffffa07d0447>] class_cleanup+0x577/0xdc0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07b12b6>] ? class_name2dev+0x56/0xe0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07d243b>] class_process_config+0x102b/0x1c30 [obdclass] 2012-09-17 17:49:58 [<ffffffffa0398be0>] ? cfs_alloc+0x30/0x60 [libcfs] 2012-09-17 17:49:58 [<ffffffffa07cbe43>] ? lustre_cfg_new+0x353/0x7e0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07d31b9>] class_manual_cleanup+0x179/0x6f0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07b12b6>] ? class_name2dev+0x56/0xe0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07dd3c9>] server_put_super+0x6f9/0xcf0 [obdclass] 2012-09-17 17:49:58 [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0 2012-09-17 17:49:58 [<ffffffff8117d436>] kill_anon_super+0x16/0x60 2012-09-17 17:49:58 [<ffffffffa07d4df6>] lustre_kill_super+0x36/0x60 [obdclass] 2012-09-17 17:49:58 [<ffffffff8117e4b0>] deactivate_super+0x70/0x90 2012-09-17 17:49:58 [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110 2012-09-17 17:49:58 [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0 2012-09-17 17:49:58 [<ffffffff81082de1>] ? sigprocmask+0x71/0x110 2012-09-17 17:49:58 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b 2012-09-17 17:49:58 Code: 55 c8 e9 39 fe ff ff 31 db 41 83 7f 4c 00 0f 84 7e fd ff ff 0f 0b eb fe 0f 0b eb fe 0f 0b 0f 1f 80 00 00 00 00 eb f7 0f 0b eb fe <0f> 0b 0f 1f 84 00 00 00 00 00 eb f6 66 66 66 66 66 2e 0f 1f 84 2012-09-17 17:49:58 RIP [<ffffffffa07214b6>] ldiskfs_mb_release_inode_pa+0x346/0x360 [ldiskfs] 2012-09-17 17:49:58 RSP <ffff88026e7152b8> 2012-09-17 17:49:58 Initializing cgroup subsys cpuset am rebooting the node now, will run fsck script.

          Thanks, I see. Is the test started from a fresh reformatted MDS? or the MDS was just been fixed by e2fsck -fy before the run? If the MDS has not been changed yet, could you find out the directory name of 88606751? I guess you can just mount it as ldiskfs, and get it by find /mnt/mds -inum 88606751 Thanks.

          di.wang Di Wang (Inactive) added a comment - Thanks, I see. Is the test started from a fresh reformatted MDS? or the MDS was just been fixed by e2fsck -fy before the run? If the MDS has not been changed yet, could you find out the directory name of 88606751? I guess you can just mount it as ldiskfs, and get it by find /mnt/mds -inum 88606751 Thanks.

          current script:

          #!/bin/bash
          ostdev=$1
          logdir=/home/white215/wham/lu1015/logs/
          stamp=`date +%Y%m%d.%M%S`
          dumpe2fs /dev/$ostdev > $logdir/$ostdev.$stamp.stats
          debugfs -c -R "dump <8>" /dev/$ostdev > $logdir/$ostdev.$stamp.journal
          debugfs -c -R "logdump -a" /dev/$ostdev > $logdir/$ostdev.$stamp.logdump
          e2fsck -fp /dev/$ostdev 2>&1 | tee $logdir/$ostdev.$stamp.e2fsck
          dumpe2fs /dev/$ostdev > $logdir/$ostdev.$stamp.stats.post
          
          cliffw Cliff White (Inactive) added a comment - current script: #!/bin/bash ostdev=$1 logdir=/home/white215/wham/lu1015/logs/ stamp=`date +%Y%m%d.%M%S` dumpe2fs /dev/$ostdev > $logdir/$ostdev.$stamp.stats debugfs -c -R "dump <8>" /dev/$ostdev > $logdir/$ostdev.$stamp.journal debugfs -c -R "logdump -a" /dev/$ostdev > $logdir/$ostdev.$stamp.logdump e2fsck -fp /dev/$ostdev 2>&1 | tee $logdir/$ostdev.$stamp.e2fsck dumpe2fs /dev/$ostdev > $logdir/$ostdev.$stamp.stats.post

          The test being run is SWL, which is essentially a large mix of multiple tasks (ior,fdtree,mib,etc) In this case, the MDS went read-only in the
          middle of the test. I halted the MDS and ran the data-dump script created for LU-1015. e2fsprogs was 1.41.90-wc4-7, i will reinstall newer.

          cliffw Cliff White (Inactive) added a comment - The test being run is SWL, which is essentially a large mix of multiple tasks (ior,fdtree,mib,etc) In this case, the MDS went read-only in the middle of the test. I halted the MDS and ran the data-dump script created for LU-1015 . e2fsprogs was 1.41.90-wc4-7, i will reinstall newer.

          People

            di.wang Di Wang (Inactive)
            cliffw Cliff White (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: