[LU-2041] SWL ldiskfs_add_entry: bad entry in directory #127928380: rec_len is smaller than minimal Created: 28/Sep/12 Updated: 18/Oct/12 Resolved: 18/Oct/12 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.4.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Cliff White (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
SWL - Hyperion/LLNL |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 4217 | ||||||||
| Description |
|
Running SWL test, after ~6 hours seeing ldiskfs failure and EROFS on the MDT.

Sep 27 20:13:20 hyperion-rst6 kernel: Lustre: 4918:0:(mdt_recovery.c:622:mdt_steal_ack_locks()) Stealing 1 locks from rs ffff880108e55000 x1414295525519816.t4537566185 o0 NID 192.168.116.112@o2ib1
Sep 27 20:13:20 hyperion-rst6 kernel: Lustre: 4233:0:(service.c:2105:ptlrpc_handle_rs()) All locks stolen from rs ffff880108e55000 x1414295525519816.t4537566185 o0 NID 192.168.116.112@o2ib1
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry:
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #127928380: rec_len is smaller than minimal - block=64030811offset=536(536), inode=127946027, rec_len=0, name_len=4
Sep 27 20:16:39 hyperion-rst6 kernel: Aborting journal on device md1-8.
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs (md1): bad entry in directory #127928380: rec_len is smaller than minimal - block=64030811offset=536(536), inode=127946027, rec_len=0, name_len=4
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30
Sep 27 20:16:39 hyperion-rst6 kernel: Remounting filesystem read-only
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted

Ran fsck on device, recovered:

LDISKFS-fs (md1): mounted filesystem with ordered data mode. quota=off. Opts:
LDISKFS-fs (md1): mounted filesystem with ordered data mode. quota=off. Opts:
Lustre: MGC192.168.127.6@o2ib: Reactivating import
Lustre: lustre-MDT0000: used disk, loading
Lustre: 5311:0:(ldlm_lib.c:2139:target_recovery_init()) RECOVERY: service lustre-MDT0000, 91 recoverable clients, last_transno 4541171590
Lustre: lustre-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Lustre: 4326:0:(mgc_request.c:1534:mgc_process_recover_log()) Process recover log lustre-mdtir error -22
Lustre: 4326:0:(mgc_request.c:1534:mgc_process_recover_log()) Skipped 2 previous similar messages
Lustre: lustre-MDT0000: Will be in recovery for at least 2:30, or until 91 clients reconnect
LustreError: 5315:0:(ldlm_lib.c:1717:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4541171591, ql: 62, comp: 29, conn: 91, next: 4541177085, last_committed: 4541171590)
LustreError: 5315:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids
LustreError: 5315:0:(ldlm_lib.c:1717:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4541177086, ql: 62, comp: 29, conn: 91, next: 4541181324, last_committed: 4541171590)
Lustre: lustre-MDT0000: disconnecting 1 stale clients
Lustre: lustre-MDT0000: Recovery over after 1:24, of 91 clients 90 recovered and 1 was evicted.
Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0028_UUID now active, resetting orphans

Will continue to run, see if we get repeat. |
| Comments |
| Comment by Peter Jones [ 28/Sep/12 ] |
|
Fanyong, could you please look into this one? Thanks, Peter |
| Comment by nasf (Inactive) [ 28/Sep/12 ] |
|
This is a really strange failure. Several possible causes:

1) The block was reused by something else and the dentry was rewritten. Recently, only the OI shrink patch can affect block allocation, but that patch only affects blocks belonging to OI files, and those blocks can never be reused by non-OI directories, whether or not the OI shrink patch is applied. So it should not be related to that patch. On the other hand, if the block had been rewritten by something else, then quite possibly the whole entry would have been rewritten, not just the "rec_len" field. But from the log, the "ino#" and "name_len" are reasonable values, so the block was probably not rewritten.

2) An ldiskfs internal bug.

3) Hardware issues.

So if we can reproduce the failure, it may be caused by 2); otherwise, it may be 3).
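
For reference, here is a simplified user-space sketch of the on-disk record layout and of the sanity check that emits this message, modeled on the upstream ext4 code (an illustration, not the actual ldiskfs source):

#include <stdint.h>
#include <stdio.h>

struct dir_entry_2 {            /* mirrors ext4_dir_entry_2 */
        uint32_t inode;         /* inode number, 0 == unused slot */
        uint16_t rec_len;       /* distance to the next entry in the block */
        uint8_t  name_len;
        uint8_t  file_type;
        char     name[];        /* name_len bytes, not NUL-terminated */
};

/* 8-byte header + name, rounded up to a 4-byte boundary */
#define DIR_REC_LEN(name_len)   (((name_len) + 8 + 3) & ~3)

/* Returns the error string printed by the check in ldiskfs_check_dir_entry(),
 * or NULL if the entry looks sane.  rec_len == 0, as in this ticket,
 * trips the first test. */
static const char *check_dir_entry(const struct dir_entry_2 *de,
                                   unsigned offset, unsigned blocksize)
{
        if (de->rec_len < DIR_REC_LEN(1))
                return "rec_len is smaller than minimal";
        if (de->rec_len % 4 != 0)
                return "rec_len % 4 != 0";
        if (de->rec_len < DIR_REC_LEN(de->name_len))
                return "rec_len is too small for name_len";
        if (offset + de->rec_len > blocksize)
                return "directory entry across blocks";
        return NULL;
}

int main(void)
{
        /* the values from the Sep 27 log */
        struct dir_entry_2 de = { .inode = 127946027, .rec_len = 0, .name_len = 4 };

        puts(check_dir_entry(&de, 536, 4096)); /* "rec_len is smaller than minimal" */
        return 0;
}

Since rec_len is the only way to step from one record to the next, a zero value makes the block unwalkable; aborting the journal and remounting read-only, as the log shows, is the only safe reaction. |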
| Comment by nasf (Inactive) [ 28/Sep/12 ] |
|
This is the patch for 2): |
| Comment by Di Wang [ 28/Sep/12 ] |
|
Sorry for stepping in here. Clearly 1948 is not fixed by the patch for 1976. I just re-investigated the results Cliff got on 1948. It seems to me that two threads concurrently accessed the dir. Thread A added an entry to dir 78644697 and met a zero rec_len here:

2012-09-18 14:13:37 LDISKFS-fs error (device md1): add_dirent:
2012-09-18 14:13:37 LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #78644697: rec_len is smaller than minimal - block=39336951offset=152(152), inode=78655820, rec_len=0, name_len=4
2012-09-18 14:13:37 Aborting journal on device md1-8.

Thread B hit the BUG_ON at a later point, which I had added for debugging the problem:
@@ -1387,7 +1751,17 @@
 	if (de->inode) {
 		struct ext4_dir_entry_2 *de1 = (struct ext4_dir_entry_2 *)((char *)de + nlen);
 		de1->rec_len = ext4_rec_len_to_disk(rlen - nlen, blocksize);
+		if (de1->rec_len == 0) {
+			__ext4_error(dir->i_sb, function, "%lu: name %s, name_len=%d\n",
+				dir->i_ino, name, name_len);
+			BUG_ON(de1->rec_len == 0);
+		}
 		de->rec_len = ext4_rec_len_to_disk(nlen, blocksize);
+		if (de->rec_len == 0) {
+			__ext4_error(dir->i_sb, function, "%lu: name %s, name_len=%d\n",
+				dir->i_ino, name, name_len);
+			BUG_ON(de->rec_len == 0);
+		}
 		de = de1;
 	}
 	de->file_type = EXT4_FT_UNKNOWN;
The console message for hitting the BUG:

2012-09-18 14:13:37 78644697: name L0F7, name_len=4
2012-09-18 14:13:37
2012-09-18 14:13:37 ------------[ cut here ]------------
2012-09-18 14:13:37 kernel BUG at /var/lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/namei.c:1757!
2012-09-18 14:13:37 invalid opcode: 0000 [#1] SMP
2012-09-18 14:13:37 last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/0000:05:00.0/infiniband_mad/umad0/port
2012-09-18 14:13:37 CPU 7
......

If threads A and B were serialized through this function, both of them should have failed at ldiskfs_check_dir_entry(). So it still looks like an ldiskfs pdir patch problem. Liang, could you please have another look at this one? Also, in this test each dir only includes 10 files, so it should not involve index splitting IMHO; all dirs are just single-block dirs.
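
To make the suspected race concrete, here is a hypothetical user-space model of the entry split performed by add_dirent_to_buf() (the names mirror the kernel code, but this is an illustration, not the ldiskfs source). The split leaves the old record just enough room for its own name (nlen) and hands the remainder (rlen - nlen) to the new record; if a second, unserialized split replays against an entry the first split has already shrunk to nlen, it computes rlen - nlen == 0 and writes exactly the rec_len=0 record seen in the logs:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct dirent2 {
        uint32_t inode;
        uint16_t rec_len;
        uint8_t  name_len;
        uint8_t  file_type;
        char     name[32];
};

#define REC_LEN(nl)     (((nl) + 8 + 3) & ~3)

/* The split from add_dirent_to_buf(), reduced to its arithmetic. */
static void split_entry(struct dirent2 *de, struct dirent2 **de1_out)
{
        uint16_t nlen = REC_LEN(de->name_len);
        uint16_t rlen = de->rec_len;                 /* racy read */
        struct dirent2 *de1 = (struct dirent2 *)((char *)de + nlen);

        de1->rec_len = rlen - nlen;                  /* 0 if rlen == nlen */
        de->rec_len  = nlen;
        *de1_out = de1;
}

int main(void)
{
        static char block[4096];
        struct dirent2 *de = (struct dirent2 *)block, *de1;

        de->inode = 78655820;
        de->name_len = 4;
        memcpy(de->name, "L0F7", 4);
        de->rec_len = 4096;                  /* last entry owns the block */

        split_entry(de, &de1);               /* "thread A": de1 gets 4084 */
        split_entry(de, &de1);               /* "thread B" replays on the
                                              * shrunk entry: de1 gets 0  */
        printf("de1->rec_len = %u\n", de1->rec_len); /* prints 0 */
        return 0;
}

This is only the arithmetic of the race, but it reproduces the observed on-disk value. |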
| Comment by Liang Zhen (Inactive) [ 28/Sep/12 ] |
|
For a directory w/o indices, the PDO lock degenerates to a RW semaphore which is exclusive for writing, which means PDO is effectively disabled for a small directory like this. But I will check my code again...
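
Roughly speaking (a loose pthread analogy, not the actual htree_lock code), this degenerate mode behaves like the sketch below: writers always take the lock exclusively, so two add-entry threads can never be inside the block at once, and the split race described above should be impossible for a non-indexed directory:

#include <pthread.h>

static pthread_rwlock_t dir_lock = PTHREAD_RWLOCK_INITIALIZER;

/* With no htree index there is nothing to lock at finer grain,
 * so every modification is fully exclusive... */
static void add_entry(void)
{
        pthread_rwlock_wrlock(&dir_lock);       /* exclusive */
        /* ... find a slot and split an entry ... */
        pthread_rwlock_unlock(&dir_lock);
}

/* ...and only lookups may run in parallel with each other. */
static void lookup_entry(void)
{
        pthread_rwlock_rdlock(&dir_lock);       /* shared */
        /* ... scan the single directory block ... */
        pthread_rwlock_unlock(&dir_lock);
}

int main(void)
{
        add_entry();
        lookup_entry();
        return 0;
}
 |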
| Comment by Liang Zhen (Inactive) [ 28/Sep/12 ] |
|
Nasf, I would suggest running with wangdi's patch on |
| Comment by nasf (Inactive) [ 28/Sep/12 ] |
|
Di, why do you think that "if thread A and B are serialized for this function, both of them should fail at ldiskfs_check_dir_entry"? If thread A calls add_dirent_to_buf() with a NULL "de", it will call ldiskfs_check_dir_entry(); if thread B calls add_dirent_to_buf() with a non-NULL "de", it will not call ldiskfs_check_dir_entry(). So thread A and thread B may fail at different positions. Maybe I am missing something; please correct me. |
| Comment by Di Wang [ 29/Sep/12 ] |
|
Please see the trace on 2012-09-18 14:13:37:

[<ffffffffa0db5c5d>] ldiskfs_add_entry+0xcd/0x500 [ldiskfs]

Then in ldiskfs_add_entry, because there are only about 10 entries per dir, it should not be a dx dir, and the size should be == 4096, so it goes through the linear path:

	blocks = dir->i_size >> sb->s_blocksize_bits;
	for (block = 0; block < blocks; block++) {
		...
		retval = add_dirent_to_buf(handle, dentry, inode, NULL, bh);
		if (retval != -ENOSPC) {
			brelse(bh);
			return retval;
		}
		if (blocks == 1 && !dx_fallback &&
		...
	}
 |
| Comment by Liang Zhen (Inactive) [ 29/Sep/12 ] |
|
fdtree will actually create 10 subdirs and 10 files, so the directory has 20 entries, but it is still a plain directory w/o index; I tried this on my local workstation and verified it.
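
A quick size check backs this up (assuming short fdtree-style names of around 10 characters, which is an estimate): each entry occupies EXT4_DIR_REC_LEN(name_len) = (name_len + 8 + 3) & ~3 bytes, i.e. about 20 bytes, so 20 entries plus "." and ".." come to roughly 450 bytes, far below one 4096-byte block. The directory therefore never grows a second block, and the htree/dx code never gets involved. |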
| Comment by Liang Zhen (Inactive) [ 29/Sep/12 ] |
|
I checked the PDO patch again but found nothing. A directory created by fdtree has only 20 entries and is still a plain directory w/o index, and I didn't change this part at all. Also, a directory created by fdtree has hostname+pid as part of the directory name, which means each directory should be unique to its thread; if there were a bug messing up file creation within a single thread, we would have seen it months ago, because PDO is a 2.2 feature. However, I can't explain why we see two failures in different places (wangdi's previous comment). The only possibility that comes to my mind is that the OSD created two instances of htree_lock_head for the directory, or even created two lu_objects for the same directory in some rare case. The patch I posted in a previous comment should give information to verify this (http://review.whamcloud.com/#change,4020, build is http://build.whamcloud.com/job/lustre-reviews/9547/), so we might have to wait for another try with this patch.
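
The failure mode this hypothesis implies is easy to state in miniature: mutual exclusion only works if every thread locks the same lock instance, so if the OSD hands out two htree_lock_head objects (or two lu_objects) for one inode, both threads "hold the lock" while mutating the same block. A toy illustration (hypothetical, not the OSD code):

#include <pthread.h>
#include <stdio.h>

static int shared_rec_len = 4096;       /* stands in for the dir block */

/* Correct behaviour needs exactly one lock per directory.  Here the
 * "lookup layer" has accidentally produced two, so each thread locks
 * its own instance and the updates still race. */
static pthread_mutex_t head_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t head_b = PTHREAD_MUTEX_INITIALIZER;

static void *writer(void *arg)
{
        pthread_mutex_t *my_head = arg;         /* the instance I was given */

        pthread_mutex_lock(my_head);
        shared_rec_len -= 12;                   /* unprotected w.r.t. the peer */
        pthread_mutex_unlock(my_head);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, writer, &head_a);
        pthread_create(&t2, NULL, writer, &head_b);    /* different head! */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("rec_len = %d\n", shared_rec_len);
        return 0;
}

(The Oct 5 crash output below indeed dumps the htree_lock and its head when the corruption is caught.) |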
| Comment by Liang Zhen (Inactive) [ 30/Sep/12 ] |
|
I rebased my debug patch and made it depend on the patch of |
| Comment by Liang Zhen (Inactive) [ 02/Oct/12 ] |
|
I merged the fix of |
| Comment by Liang Zhen (Inactive) [ 03/Oct/12 ] |
|
Also added another fix from wangdi to the debug patch; the latest build is: http://build.whamcloud.com/job/lustre-reviews/9641/ |
| Comment by Liang Zhen (Inactive) [ 04/Oct/12 ] |
|
just posted a patch : http://review.whamcloud.com/#change,4186 |
| Comment by Cliff White (Inactive) [ 05/Oct/12 ] |
|
Crashed.
Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs Dumping htree_lock ffff88011e8db200, mode 3, head ffff880139a39180
Oct 5 18:27:58 hyperion-rst6 kernel: granted mode: 3:1
Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs error (device md1): add_dirent(de1): dir(134759485, dx: 1, nblks: 33) de(mdtest.16.41 : 12), off 2376 rlen 40 nlen 40
Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_dx_find_entry: bad entry in directory #134759485: rec_len is smaller than minimal - block=67436580offset=2416(59760), inode=134767633, rec_len=0, name_len=5
Oct 5 18:27:58 hyperion-rst6 kernel: Aborting journal on device md1-8.
Oct 5 18:27:58 hyperion-rst6 kernel: LustreError: 4331:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
Oct 5 18:27:58 hyperion-rst6 kernel: LustreError: 4331:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30
Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in osd_trans_stop: Journal has aborted
Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
Oct 5 18:27:58 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
Oct 5 18:27:58 hyperion-rst6 kernel: LustreError: 4331:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30
Oct 5 18:27:58 hyperion-rst6 kernel:
Oct 5 18:27:58 hyperion-rst6 kernel: ------------[ cut here ]------------
Oct 5 18:27:58 hyperion-rst6 kernel: kernel BUG at /var/lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/namei.c:1757!
Oct 5 18:27:58 hyperion-rst6 kernel: invalid opcode: 0000 [#1] SMP
Oct 5 18:27:58 hyperion-rst6 kernel: last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
Oct 5 18:27:58 hyperion-rst6 kernel: CPU 5
Oct 5 18:27:59 hyperion-rst6 kernel: Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ldiskfs(U) mbcache jbd2 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan]
Oct 5 18:27:59 hyperion-rst6 kernel:
Oct 5 18:27:59 hyperion-rst6 kernel: Pid: 6331, comm: mdt03_016 Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.gb16fe80.x86_64 #1 Dell Inc. PowerEdge R610/0K399H
Oct 5 18:27:59 hyperion-rst6 kernel: RIP: 0010:[<ffffffffa0726502>] [<ffffffffa0726502>] add_dirent_to_buf+0x552/0x6a0 [ldiskfs]
Oct 5 18:27:59 hyperion-rst6 kernel: RSP: 0018:ffff880159a17480 EFLAGS: 00010246
Oct 5 18:27:59 hyperion-rst6 kernel: RAX: ffff8801eea83800 RBX: ffff880313470948 RCX: ffff880313470970
Oct 5 18:27:59 hyperion-rst6 kernel: RDX: ffff88031833f400 RSI: 0000000000000046 RDI: ffff8801eea83c00
Oct 5 18:27:59 hyperion-rst6 kernel: RBP: ffff880159a17550 R08: 0000000000000000 R09: 0000000000000000
Oct 5 18:27:59 hyperion-rst6 kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff880146a44250
Oct 5 18:27:59 hyperion-rst6 kernel: R13: 0000000000000948 R14: 000000000000000c R15: 0000000000000028
Oct 5 18:27:59 hyperion-rst6 kernel: FS: 00002aaaab47e700(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
Oct 5 18:27:59 hyperion-rst6 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct 5 18:27:59 hyperion-rst6 kernel: CR2: 00002aaaab34230c CR3: 0000000001a85000 CR4: 00000000000006e0
Oct 5 18:27:59 hyperion-rst6 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 5 18:27:59 hyperion-rst6 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 5 18:27:59 hyperion-rst6 kernel: Stack:
Oct 5 18:27:59 hyperion-rst6 kernel: ffffc9002ae7e1f0 000000000000000c ffff880100000948 0000000000000028
Oct 5 18:27:59 hyperion-rst6 kernel: <d> 0000000000000028 00000000ecc23e5e ffff8801abec6080 ffffffff080863e4
Oct 5 18:27:59 hyperion-rst6 kernel: <d> 0000000000021000 ffffffffa074720c ffff88011e8db200 ffff880113f7cc18
Oct 5 18:27:59 hyperion-rst6 kernel: Call Trace:
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0728999>] ldiskfs_dx_add_entry+0x169/0x7e0 [ldiskfs]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa07433f1>] ? ldiskfs_xattr_set_handle+0x151/0x560 [ldiskfs]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa072929d>] ldiskfs_add_entry+0x28d/0x500 [ldiskfs]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0f13689>] __osd_ea_add_rec+0xb9/0x190 [osd_ldiskfs]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0f1fc1b>] osd_index_ea_insert+0x21b/0x5e0 [osd_ldiskfs]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e21d2d>] ? __mdd_xattr_set+0x25d/0x380 [mdd]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e3a767>] __mdd_index_insert_only+0x147/0x150 [mdd]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e3b901>] __mdd_index_insert+0x51/0x1f0 [mdd]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e3bdf9>] ? mdd_object_capa+0x29/0x1b0 [mdd]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e416eb>] mdd_create+0x1a0b/0x2180 [mdd]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0f19f9f>] ? osd_xattr_get+0x9f/0x360 [osd_ldiskfs]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0f6d637>] cml_create+0x97/0x250 [cmm]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0eaaddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0ebeb9f>] mdt_reint_open+0x108f/0x18a0 [mdt]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e4728e>] ? md_ucred+0x1e/0x60 [mdd]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e8c235>] ? mdt_ucred+0x15/0x20 [mdt]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0ea8151>] mdt_reint_rec+0x41/0xe0 [mdt]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0ea19aa>] mdt_reint_internal+0x50a/0x810 [mdt]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0ea1f7d>] mdt_intent_reint+0x1ed/0x500 [mdt]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e9e191>] mdt_intent_policy+0x371/0x6a0 [mdt]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa090d881>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa09359bf>] ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e9e506>] mdt_enqueue+0x46/0x130 [mdt]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e95802>] mdt_handle_common+0x922/0x1740 [mdt]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0e966f5>] mdt_regular_handle+0x15/0x20 [mdt]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0965b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa03a514f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa095cf37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffff810533f3>] ? __wake_up+0x53/0x70
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0967111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0966520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0966520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffffa0966520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Oct 5 18:27:59 hyperion-rst6 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Oct 5 18:27:59 hyperion-rst6 kernel: Code: 54 24 18 48 8b 55 b8 48 89 c1 31 c0 48 89 14 24 48 c7 c2 38 c3 74 a0 e8 0d 55 01 00 48 8b 4d a8 66 83 79 04 00 0f 85 13 fd ff ff <0f> 0b eb fe 48 8b 7d 80 e8 d1 e7 01 00 49 8b bc 24 08 01 00 00
Oct 5 18:27:59 hyperion-rst6 kernel: RIP [<ffffffffa0726502>] add_dirent_to_buf+0x552/0x6a0 [ldiskfs]
Oct 5 18:27:59 hyperion-rst6 kernel: RSP <ffff880159a17480>
I don't think we dropped a vmcore on this one, still looking |
| Comment by Liang Zhen (Inactive) [ 05/Oct/12 ] |
|
This time we crashed in mdtest, and the directory is one with indices, so it looks more like a PDO issue; however:
Based on the previous two things, it could be either: or b) two OSD objects with different FIDs pointing to the same inode (directory) for some unknown reason. I think the only choice now is to disable several things one by one:
So I have to create different builds to try... |
| Comment by Liang Zhen (Inactive) [ 06/Oct/12 ] |
|
latest build is here: http://build.whamcloud.com/job/lustre-reviews/9744/ |
| Comment by nasf (Inactive) [ 06/Oct/12 ] |
|
This patch fixes some potential OI mapping cache inconsistency issues: |
| Comment by Liang Zhen (Inactive) [ 07/Oct/12 ] |
|
The latest build is here: http://build.whamcloud.com/job/lustre-reviews/9758 I merged fanyong's fix into this build. |
| Comment by Peter Jones [ 08/Oct/12 ] |
|
Dropping priority because the fix landed for 2.3 |
| Comment by Jodi Levi (Inactive) [ 17/Oct/12 ] |
|
Patch for master here: http://review.whamcloud.com/#change,4208 |
| Comment by Jodi Levi (Inactive) [ 18/Oct/12 ] |
|
Please let me know if this needs to be reopened for more work. |