Details
-
Bug
-
Resolution: Duplicate
-
Blocker
-
None
-
Lustre 2.3.0
-
None
-
SWL - Hyperion/LLNL RHEL6 servers and clients
-
3
-
6323
Description
Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry:
Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #88606751: rec_len is smaller than minimal - block=44369457offset=536(536), inode=88627711, rec_len=0, name_len=4
Sep 16 01:28:12 hyperion-rst6 kernel: Aborting journal on device md1-8.
Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted
Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_new_inode: Journal has aborted
Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_delete_inode: Journal has aborted
Sep 16 01:28:12 hyperion-rst6 kernel: LustreError: 4489:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
Sep 16 01:28:12 hyperion-rst6 kernel: LustreError: 4186:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30
Disk appeared to be quite messed up with fsck -fy. Ran the data capture script from lu-1015, results attached.
Attachments
Activity
Isn't osd_ldiskfs_write_record() writing one-byte off the buffer limit if write_NUL is true?
There haven't been changes to mballoc, but there was a change to the symlink NUL termination recently. Does this workload create symlinks?
Add andreas to the ticket, in case there are some mballoc changes recently for ext4.
According to test result the corrupt directory comes from fdtree test. " 78644697: name L0F7, lck 0000000000000004, name_len=568931904". Hmm during fdtree test, each thread will only create files in its own directory, and only 10 files for each dir.
\\srun -N $NODES -n $PROCS --wait=7200 -o fdtree.%t.out \$SWL/IO/fdtree.bash -l 3 -d 10 >& tmp_file
So this problem should be unrelated with ldiskfs pdir patch. Now I suspect it might be related with mballoc, because we saw this during the test
"2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: pa free mismatch: [pa ffff88031438e748] [phy 190752] [logic 608] [len 16] [free 13] [error 0] [inode 7340126] [freed 16] 2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: free 16, pa_free 13 "
though it happened after the fs turned to read-only.
Okay, the MDS just 'sploded.
2012-09-18 14:13:37 LDISKFS-fs error (device md1): add_dirent: 2012-09-18 14:13:37 LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #78644697: rec_len is smaller than minimal - block=39336951offset=152(152), inode=78655820, rec_len=0, name_len=4 2012-09-18 14:13:37 Aborting journal on device md1-8. 2012-09-18 14:13:37 LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal 2012-09-18 14:13:37 LDISKFS-fs (md1): Remounting filesystem read-only 2012-09-18 14:13:37 LDISKFS-fs (md1): Remounting filesystem read-only 2012-09-18 14:13:37 LDISKFS-fs error (device md1): add_dirent: 78644697: lck ffff8801304ad000 off=152(78655820),inode=32, rec_len=4, name_len=0 2012-09-18 14:13:37 2012-09-18 14:13:37 LustreError: 8052:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30 2012-09-18 14:13:37 LustreError: 8052:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30 2012-09-18 14:13:37 LustreError: 8052:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30 2012-09-18 14:13:37 78644697: name L0F7, lck 0000000000000004, name_len=568931904 2012-09-18 14:13:37 2012-09-18 14:13:37 ------------[ cut here ]------------ 2012-09-18 14:13:37 kernel BUG at /var/lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/namei.c:1757! 2012-09-18 14:13:37 invalid opcode: 0000 [#1] SMP 2012-09-18 14:13:37 last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/0000:05:00.0/infiniband_mad/umad0/port 2012-09-18 14:13:37 CPU 7 2012-09-18 14:13:37 Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) ldiskfs(U) mbcache jbd2 lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan] 2012-09-18 14:13:37 2012-09-18 14:13:37 Pid: 4898, comm: mdt03_008 Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.gb4cc145.x86_64 #1 Dell Inc. PowerEdge R610/0K399H 2012-09-18 14:13:37 RIP: 0010:[<ffffffffa0db21d0>] [<ffffffffa0db21d0>] add_dirent_to_buf+0x4c0/0x530 [ldiskfs] 2012-09-18 14:13:37 RSP: 0018:ffff880121e935f0 EFLAGS: 00010246 2012-09-18 14:13:37 RAX: ffff880181248000 RBX: ffff880268654078 RCX: 00000000000014b5 2012-09-18 14:13:37 RDX: ffff880268654098 RSI: 0000000000000046 RDI: ffff88016ff93c00 2012-09-18 14:13:37 RBP: ffff880121e936b0 R08: 0000000000000000 R09: 0000000000000000 2012-09-18 14:13:37 R10: 0000000000000001 R11: 0000000000000000 R12: ffff880130e16ad0 2012-09-18 14:13:37 R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000020 2012-09-18 14:13:37 FS: 00002aaaab47e700(0000) GS:ffff880028260000(0000) knlGS:0000000000000000 2012-09-18 14:13:37 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 2012-09-18 14:13:37 CR2: 00000000006d3a80 CR3: 0000000001a85000 CR4: 00000000000006e0 2012-09-18 14:13:37 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2012-09-18 14:13:37 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2012-09-18 14:13:37 Process mdt03_008 (pid: 4898, threadinfo ffff880121e92000, task ffff88012df62080) 2012-09-18 14:13:37 LustreError: 4099:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30 2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30 2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30 2012-09-18 14:13:37 Stack: 2012-09-18 14:13:37 ffff880121e93640 0000000002583bf7 0000000000001000 0000000000000000 2012-09-18 14:13:37 <d> ffff8801abdad5c0 ffff880200000004 ffff880104b03141 0000000000000004 2012-09-18 14:13:37 <d> 00000020e2f30800 ffff8801258d9e00 ffff8801064c3288 ffff880268654fe0 2012-09-18 14:13:37 Call Trace: 2012-09-18 14:13:37 [<ffffffffa0db5c5d>] ldiskfs_add_entry+0xcd/0x500 [ldiskfs] 2012-09-18 14:13:37 [<ffffffffa1001689>] __osd_ea_add_rec+0xb9/0x190 [osd_ldiskfs] 2012-09-18 14:13:37 [<ffffffffa100dbeb>] osd_index_ea_insert+0x21b/0x5e0 [osd_ldiskfs] 2012-09-18 14:13:37 [<ffffffffa0eee977>] __mdd_index_insert_only+0x147/0x150 [mdd] 2012-09-18 14:13:37 [<ffffffffa0eef9b1>] __mdd_index_insert+0x51/0x1f0 [mdd] 2012-09-18 14:13:37 [<ffffffffa0ef55e3>] mdd_create+0x19a3/0x20c0 [mdd] 2012-09-18 14:13:37 [<ffffffffa03aa5b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2012-09-18 14:13:37 [<ffffffffa06a4637>] cml_create+0x97/0x250 [cmm] 2012-09-18 14:13:37 [<ffffffffa0f8bb9f>] mdt_reint_open+0x108f/0x18a0 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f75151>] mdt_reint_rec+0x41/0xe0 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f6e9aa>] mdt_reint_internal+0x50a/0x810 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f6ef7d>] mdt_intent_reint+0x1ed/0x500 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f6b191>] mdt_intent_policy+0x371/0x6a0 [mdt] 2012-09-18 14:13:37 [<ffffffffa0859881>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa08819bf>] ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa0f6b506>] mdt_enqueue+0x46/0x130 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f62802>] mdt_handle_common+0x922/0x1740 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f636f5>] mdt_regular_handle+0x15/0x20 [mdt] 2012-09-18 14:13:37 [<ffffffffa08b199d>] ptlrpc_server_handle_request+0x40d/0xea0 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa08a8f37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa03aa5b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2012-09-18 14:13:37 [<ffffffff810533f3>] ? __wake_up+0x53/0x70 2012-09-18 14:13:37 [<ffffffffa08b2f89>] ptlrpc_main+0xb59/0x1860 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc] 2012-09-18 14:13:37 [<ffffffff8100c14a>] child_rip+0xa/0x20 2012-09-18 14:13:37 [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc] 2012-09-18 14:13:37 [<ffffffff8100c140>] ? child_rip+0x0/0x20 2012-09-18 14:13:37 Code: 00 00 48 c7 c2 20 81 dd a0 4c 8b 45 a8 45 89 f1 48 c7 c6 e5 d2 dd a0 e8 9f 56 01 00 48 8b 55 c8 66 83 7a 04 00 0f 85 a9 fd ff ff <0f> 0b eb fe 49 8b 2012-09-18 14:13:37 LustreError: 4099:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30 2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30 2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30 2012-09-18 14:13:37 4c 24 40 49 8b bc 24 08 01 00 00 31 c0 4c 8b 2012-09-18 14:13:37 RIP [<ffffffffa0db21d0>] add_dirent_to_buf+0x4c0/0x530 [ldiskfs] 2012-09-18 14:13:37 RSP <ffff880121e935f0> 2012-09-18 14:13:38 Initializing cgroup subsys cpuset 2012-09-18 14:13:38 Initializing cgroup subsys cpu 2012-09-18 14:13:38 Linux version 2.6.32-279.5.1.el6_lustre.gb4cc145.x86
vmcore has been captured, is on brent.whamcloud.com ~/cliffw/lu1948/vmcore-rst6-c2.gz
I add some debug patch based on b2_3 (http://review.whamcloud.com/#change,4020 only patch on el6 ldiskfs series). build http://build.whamcloud.com/job/lustre-reviews/9250/ Cliff, could you please test with this rpm. And I suspect this problem might be related with ldiskfs htree, so Liang is looking at this problem as well.
MDS just went read-only, then crashed when I attempted a umount -f
2012-09-17 17:44:00 Lustre: 4130:0:(service.c:2101:ptlrpc_handle_rs()) All locks stolen from rs ffff880132a5d000 x1413385599017471.t4342523245 o0 NID 192.168.119.37@o2ib1 2012-09-17 17:44:00 Lustre: 4130:0:(service.c:2101:ptlrpc_handle_rs()) Skipped 402 previous similar messages 2012-09-17 17:44:00 LustreError: 5729:0:(mdt_recovery.c:611:mdt_steal_ack_locks()) Skipped 426 previous similar messages 2012-09-17 17:47:49 LDISKFS-fs error (device md1): ldiskfs_add_entry: 2012-09-17 17:47:49 LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #116917774: rec_len is smaller than minimal - block=58525219offset=312(312), inode=117441145, rec_len=0, name_len=4 2012-09-17 17:47:49 Aborting journal on device md1-8. 2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted 2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted 2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted 2012-09-17 17:47:49 LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted 2012-09-17 17:47:49 LDISKFS-fs (md1): 2012-09-17 17:47:49 LustreError: 5731:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30 2012-09-17 17:47:49 LDISKFS-fs (md1): 2012-09-17 17:47:49 LDISKFS-fs (md1): 2012-09-17 17:47:49 LDISKFS-fs (md1): Remounting filesystem read-onlyRemounting filesystem read-onlyRemounting filesystem read-only
Kernel BUG
2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: pa free mismatch: [pa ffff88031438e748] [phy 190752] [logic 608] [len 16] [free 13] [error 0] [inode 7340126] [freed 16] 2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: free 16, pa_free 13 2012-09-17 17:49:57 ------------[ cut here ]------------ 2012-09-17 17:49:57 kernel BUG at /var/lib/jenkins/workspace/lustre-b2_3/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/mballoc.c:3784! 2012-09-17 17:49:57 invalid opcode: 0000 [#1] SMP 2012-09-17 17:49:57 last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map 2012-09-17 17:49:57 CPU 14 2012-09-17 17:49:57 Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ldiskfs(U) mbcache jbd2 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan] 2012-09-17 17:49:57 2012-09-17 17:49:57 Pid: 10207, comm: umount Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R610/0K399H 2012-09-17 17:49:57 RIP: 0010:[<ffffffffa07214b6>] [<ffffffffa07214b6>] ldiskfs_mb_release_inode_pa+0x346/0x360 [ldiskfs] 2012-09-17 17:49:57 RSP: 0018:ffff88026e7152b8 EFLAGS: 00010216 2012-09-17 17:49:57 RAX: 000000000000000d RBX: 0000000000000010 RCX: ffff88032a805800 2012-09-17 17:49:57 RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffff880316f3a2c0 2012-09-17 17:49:57 RBP: ffff88026e715368 R08: 0000000000000000 R09: 0000000000000000 2012-09-17 17:49:57 R10: 00000000001 R11: 0000000000000000 R12: ffff88025db65898 2012-09-17 17:49:57 R13: ffff88023f45d3a0 R14: 0000000000002931 R15: ffff88031438e748 2012-09-17 17:49:57 FS: 00002aaaab690740(0000) GS:ffff8801b58e0000(0000) knlGS:0000000000000000 2012-09-17 17:49:57 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 2012-09-17 17:49:57 CR2: 00002aaaaace3387 CR3: 00000002b6ee5000 CR4: 00000000000006e0 2012-09-17 17:49:57 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2012-09-17 17:49:57 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2012-09-17 17:49:57 Process umount (pid: 10207, threadinfo ffff88026e714000, task ffff88007fa10080) 2012-09-17 17:49:57 Stack: 2012-09-17 17:49:57 ffff880200000010 000000000000000d ffff880200000000 000000000070005e 2012-09-17 17:49:57 <d> 0000000000000010 000000000000000b ffff88026e7152f8 ffffffff811ae146 2012-09-17 17:49:57 <d> ffff88032a805800 ffff88032a3b8c00 ffff88031438e748 000000000002c000 2012-09-17 17:49:57 Call Trace: 2012-09-17 17:49:57 [<ffffffff811ae146>] ? __wait_on_buffer+0x26/0x30 2012-09-17 17:49:57 [<ffffffffa072701e>] ldiskfs_discard_preallocations+0x1fe/0x490 [ldiskfs] 2012-09-17 17:49:57 [<ffffffffa0711077>] ldiskfs_release_file+0xb7/0xd0 [ldiskfs] 2012-09-17 17:49:57 [<ffffffff8117ca65>] __fput+0xf5/0x210 2012-09-17 17:49:57 [<ffffffff8117cba5>] fput+0x25/0x30 2012-09-17 17:49:57 [<ffffffff811785cd>] filp_close+0x5d/0x90 2012-09-17 17:49:57 [<ffffffffa07a1dbe>] llog_lvfs_close+0x2e/0x130 [obdclass] 2012-09-17 17:49:57 [<ffffffffa079afff>] llog_close+0x5f/0x1b0 [obdclass] 2012-09-17 17:49:57 [<ffffffffa079e92d>] llog_cat_put+0x5d/0x160 [obdclass] 2012-09-17 17:49:57 [<ffffffffa07a76bc>] llog_obd_origin_cleanup+0x1bc/0x460 [obdclass] 2012-09-17 17:49:57 [<ffffffffa07a7366>] __llog_ctxt_put+0xc6/0x260 [obdclass] 2012-09-17 17:49:57 [<ffffffffa07a7a0c>] llog_cleanup+0xac/0x490 [obdclass] 2012-09-17 17:49:57 [<ffffffffa0b7b6ef>] osc_llog_finish+0x7f/0x250 [osc] 2012-09-17 17:49:57 [<ffffffffa07a6438>] obd_llog_finish+0x88/0x1a0 [obdclass] 2012-09-17 17:49:57 [<ffffffffa0b76dca>] osc_precleanup+0x2ea/0x3b0 [osc] 2012-09-17 17:49:57 [<ffffffffa07d00c7>] class_cleanup+0x1f7/0xdc0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07b12b6>] ? class_name2dev+0x56/0xe0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07d243b>] class_process_config+0x102b/0x1c30 [obdclass] 2012-09-17 17:49:58 [<ffffffffa0398be0>] ? cfs_alloc+0x30/0x60 [libcfs] 2012-09-17 17:49:58 [<ffffffffa07cbe43>] ? lustre_cfg_new+0x353/0x7e0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07d31b9>] class_manual_cleanup+0x179/0x6f0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa0c09743>] lov_putref+0x393/0xb30 [lov] 2012-09-17 17:49:58 [<ffffffff814ff18e>] ? mutex_lock+0x1e/0x50 2012-09-17 17:49:58 [<ffffffffa0c1335f>] lov_disconnect+0x19f/0x4c0 [lov] 2012-09-17 17:49:58 [<ffffffffa0eb1c72>] mds_lov_clean+0x262/0x6f0 [mds] 2012-09-17 17:49:58 [<ffffffffa0eb22c6>] mds_precleanup+0x1c6/0x4b0 [mds] 2012-09-17 17:49:58 [<ffffffffa07d00c7>] class_cleanup+0x1f7/0xdc0 [obdclass] 2012-09-17 17:49:58 [<ffffffff81271fe5>] ? _atomic_dec_and_lock+0x55/0x80 2012-09-17 17:49:58 [<ffffffffa07e9827>] ? lu_object_put+0x157/0x290 [obdclass] 2012-09-17 17:49:58 [<ffffffffa0eee8c4>] mdd_fini_obd+0x54/0x240 [mdd] 2012-09-17 17:49:58 [<ffffffffa0f0d777>] mdd_process_config+0x2d7/0xa20 [mdd] 2012-09-17 17:49:58 [<ffffffff814fdca0>] ? thread_return+0x4e/0x76e 2012-09-17 17:49:58 [<ffffffffa06a3b6b>] cmm_process_config+0x7b/0xd10 [cmm] 2012-09-17 17:49:58 [<ffffffffa0f6fd8a>] mdt_stack_fini+0x5da/0xd40 [mdt] 2012-09-17 17:49:58 [<ffffffffa06a2886>] ? cmm_init_capa_ctxt+0x46/0x140 [cmm] 2012-09-17 17:49:58 [<ffffffffa0f707c5>] mdt_device_fini+0x2d5/0x5f0 [mdt] 2012-09-17 17:49:58 [<ffffffffa07d0447>] class_cleanup+0x577/0xdc0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07b12b6>] ? class_name2dev+0x56/0xe0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07d243b>] class_process_config+0x102b/0x1c30 [obdclass] 2012-09-17 17:49:58 [<ffffffffa0398be0>] ? cfs_alloc+0x30/0x60 [libcfs] 2012-09-17 17:49:58 [<ffffffffa07cbe43>] ? lustre_cfg_new+0x353/0x7e0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07d31b9>] class_manual_cleanup+0x179/0x6f0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07b12b6>] ? class_name2dev+0x56/0xe0 [obdclass] 2012-09-17 17:49:58 [<ffffffffa07dd3c9>] server_put_super+0x6f9/0xcf0 [obdclass] 2012-09-17 17:49:58 [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0 2012-09-17 17:49:58 [<ffffffff8117d436>] kill_anon_super+0x16/0x60 2012-09-17 17:49:58 [<ffffffffa07d4df6>] lustre_kill_super+0x36/0x60 [obdclass] 2012-09-17 17:49:58 [<ffffffff8117e4b0>] deactivate_super+0x70/0x90 2012-09-17 17:49:58 [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110 2012-09-17 17:49:58 [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0 2012-09-17 17:49:58 [<ffffffff81082de1>] ? sigprocmask+0x71/0x110 2012-09-17 17:49:58 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b 2012-09-17 17:49:58 Code: 55 c8 e9 39 fe ff ff 31 db 41 83 7f 4c 00 0f 84 7e fd ff ff 0f 0b eb fe 0f 0b eb fe 0f 0b 0f 1f 80 00 00 00 00 eb f7 0f 0b eb fe <0f> 0b 0f 1f 84 00 00 00 00 00 eb f6 66 66 66 66 66 2e 0f 1f 84 2012-09-17 17:49:58 RIP [<ffffffffa07214b6>] ldiskfs_mb_release_inode_pa+0x346/0x360 [ldiskfs] 2012-09-17 17:49:58 RSP <ffff88026e7152b8> 2012-09-17 17:49:58 Initializing cgroup subsys cpuset
am rebooting the node now, will run fsck script.
Thanks, I see. Is the test started from a fresh reformatted MDS? or the MDS was just been fixed by e2fsck -fy before the run? If the MDS has not been changed yet, could you find out the directory name of 88606751? I guess you can just mount it as ldiskfs, and get it by find /mnt/mds -inum 88606751 Thanks.
current script:
#!/bin/bash ostdev=$1 logdir=/home/white215/wham/lu1015/logs/ stamp=`date +%Y%m%d.%M%S` dumpe2fs /dev/$ostdev > $logdir/$ostdev.$stamp.stats debugfs -c -R "dump <8>" /dev/$ostdev > $logdir/$ostdev.$stamp.journal debugfs -c -R "logdump -a" /dev/$ostdev > $logdir/$ostdev.$stamp.logdump e2fsck -fp /dev/$ostdev 2>&1 | tee $logdir/$ostdev.$stamp.e2fsck dumpe2fs /dev/$ostdev > $logdir/$ostdev.$stamp.stats.post
The test being run is SWL, which is essentially a large mix of multiple tasks (ior,fdtree,mib,etc) In this case, the MDS went read-only in the
middle of the test. I halted the MDS and ran the data-dump script created for LU-1015. e2fsprogs was 1.41.90-wc4-7, i will reinstall newer.
fdtree does not create symlinks, which only includes mkdir, create, dd, unlink, rmdir. But SWL includes 5 tests, fdtree, simul, IOR, mirIO, mdtest. Simul definitely include create symlinks here.