[LU-1080] mds-survey crash Created: 08/Feb/12  Updated: 21/Feb/12  Resolved: 21/Feb/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: Lustre 2.2.0

Type: Bug Priority: Blocker
Reporter: Richard Henwood (Inactive) Assignee: Alex Zhuravlev
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Lustre Master


Issue Links:
Related
is related to LU-1082 add mds-survey to lustre-tests Resolved
Severity: 3
Rank (Obsolete): 6468

 Description   

Running on a real machine:

$ mkfs.lustre --fsname=survey --mdt --index=0 /dev/sda3
$ mount -t lustre /dev/sda3 /mnt
$ thrhi=64 file_count=200000 sh mds-survey

then crash:

Build Version: jenkins-arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel-4610-g614
Lustre: Added LNI 10.45.1.8@tcp [8/256/0/180]
Lustre: Accept all, port 988
LDISKFS-fs (sda3): recovery complete
LDISKFS-fs (sda3): mounted filesystem with ordered data mode. Opts:
LDISKFS-fs (sda3): mounted filesystem with ordered data mode. Opts:
Lustre: MGC10.45.1.8@tcp: Reactivating import
Lustre: survey-MDT0000: used disk, loading
Lustre: Echo OBD driver; http://www.lustre.org/
LustreError: 1821:0:(echo_client.c:1810:echo_md_destroy_internal()) Can not unlink child tests: rc = -39
LustreError: 1823:0:(echo_client.c:1810:echo_md_destroy_internal()) Can not unlink child tests1: rc = -39
LustreError: 1831:0:(osd_handler.c:2294:osd_object_ref_del()) ASSERTION((oh)->ot_declare_ref_del > 0) failed
LustreError: 1831:0:(osd_handler.c:2294:osd_object_ref_del()) LBUG
Pid: 1831, comm: lctl

Call Trace:
 [<ffffffffa038e855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa038ee95>] lbug_with_loc+0x75/0xe0 [libcfs]
 [<ffffffffa0399d96>] libcfs_assertion_failed+0x66/0x70 [libcfs]
 [<ffffffffa0a1781a>] osd_object_ref_del+0x14a/0x180 [osd_ldiskfs]
 [<ffffffffa096ecbb>] __mdd_ref_del+0x5b/0xa0 [mdd]
 [<ffffffffa09777a2>] mdd_create+0x1ae2/0x2470 [mdd]
 [<ffffffffa051190d>] ? htable_lookup+0xed/0x190 [obdclass]
 [<ffffffffa041b5a9>] ? cfs_hash_bd_add_locked+0x29/0x90 [libcfs]
 [<ffffffff81275894>] ? vsnprintf+0x484/0x5f0
 [<ffffffffa0a6822b>] echo_md_create_internal+0xab/0x4b0 [obdecho]
 [<ffffffff81275a40>] ? sprintf+0x40/0x50
 [<ffffffffa0a6ff40>] echo_md_handler+0x1380/0x1dd0 [obdecho]
 [<ffffffffa040d87e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
 [<ffffffffa0a75ae6>] echo_client_iocontrol+0x1c86/0x2a30 [obdecho]
 [<ffffffff81127e77>] ? ____pagevec_lru_add+0x167/0x180
 [<ffffffffa040da13>] ? cfs_alloc+0x63/0x90 [libcfs]
 [<ffffffffa04c0f52>] ? obd_ioctl_getdata+0x172/0x1060 [obdclass]
 [<ffffffffa04d6264>] class_handle_ioctl+0x14d4/0x2340 [obdclass]
 [<ffffffff8120d5df>] ? security_inode_permission+0x1f/0x30
 [<ffffffffa04c0313>] obd_class_ioctl+0x53/0x240 [obdclass]
 [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff81189342>] vfs_ioctl+0x22/0xa0
 [<ffffffff811894c9>] ? do_vfs_ioctl+0x69/0x580
 [<ffffffff811894e4>] do_vfs_ioctl+0x84/0x580
 [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff81189a61>] sys_ioctl+0x81/0xa0
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

Kernel panic - not syncing: LBUG
Pid: 7014, comm: lctl Not tainted 2.6.32-220.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff814ec701>] ? panic+0x78/0x143
 [<ffffffffa040ceeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
 [<ffffffffa0417d96>] ? libcfs_assertion_failed+0x66/0x70 [libcfs]
 [<ffffffffa0a1781a>] ? osd_object_ref_del+0x14a/0x180 [osd_ldiskfs]
 [<ffffffffa096ecbb>] ? __mdd_ref_del+0x5b/0xa0 [mdd]
 [<ffffffffa09777a2>] ? mdd_create+0x1ae2/0x2470 [mdd]
 [<ffffffffa051190d>] ? htable_lookup+0xed/0x190 [obdclass]
 [<ffffffffa041b5a9>] ? cfs_hash_bd_add_locked+0x29/0x90 [libcfs]
 [<ffffffff81275894>] ? vsnprintf+0x484/0x5f0
 [<ffffffffa0a6822b>] ? echo_md_create_internal+0xab/0x4b0 [obdecho]
 [<ffffffff81275a40>] ? sprintf+0x40/0x50
 [<ffffffffa0a6ff40>] ? echo_md_handler+0x1380/0x1dd0 [obdecho]
 [<ffffffffa040d87e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
 [<ffffffffa0a75ae6>] ? echo_client_iocontrol+0x1c86/0x2a30 [obdecho]
 [<ffffffff81127e77>] ? ____pagevec_lru_add+0x167/0x180
 [<ffffffffa040da13>] ? cfs_alloc+0x63/0x90 [libcfs]
 [<ffffffffa04c0f52>] ? obd_ioctl_getdata+0x172/0x1060 [obdclass]
 [<ffffffffa04d6264>] ? class_handle_ioctl+0x14d4/0x2340 [obdclass]
 [<ffffffff8120d5df>] ? security_inode_permission+0x1f/0x30
 [<ffffffffa04c0313>] ? obd_class_ioctl+0x53/0x240 [obdclass]
 [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff81189342>] ? vfs_ioctl+0x22/0xa0
 [<ffffffff811894c9>] ? do_vfs_ioctl+0x69/0x580
 [<ffffffff811894e4>] ? do_vfs_ioctl+0x84/0x580
 [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff81189a61>] ? sys_ioctl+0x81/0xa0
 [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b


 Comments   
Comment by Richard Henwood (Inactive) [ 08/Feb/12 ]
# rpm -qa | grep lustre
lustre-ldiskfs-3.3.0-2.6.32_220.el6_lustre.x86_64_g61f62a1.x86_64
kernel-2.6.32-131.17.1.el6_lustre.g60f4e35.x86_64
kernel-2.6.32-220.el6_lustre.x86_64
lustre-modules-2.1.55-2.6.32_220.el6_lustre.x86_64_g61f62a1.x86_64
lustre-tests-2.1.55-2.6.32_220.el6_lustre.x86_64_g61f62a1.x86_64
kernel-firmware-2.6.32-220.el6_lustre.x86_64
lustre-2.1.55-2.6.32_220.el6_lustre.x86_64_g61f62a1.x86_64

mds-survey and libecho are from:
http://review.whamcloud.com/#change,1969,patchset=7

Comment by Di Wang [ 09/Feb/12 ]

It seems this is related to the recent OSD API change rather than an mds-survey bug. Richard, it looks like mdd_create() failed somewhere; could you find the debug log from the LBUG?

Comment by Di Wang [ 09/Feb/12 ]

Assigning this to Alex.

Comment by Andreas Dilger [ 13/Feb/12 ]

Alex, do you have any ideas on how this might be fixed?

Comment by Alex Zhuravlev [ 13/Feb/12 ]

sorry, still thinking about how to solve this easily ...

Comment by Niu Yawei (Inactive) [ 16/Feb/12 ]

Since we can't declare undo operations, I removed this LASSERT in LU-993 (see commit ec20be97b9f977d3f4944523baaffb1bf95cf76c, "LU-993 osd: code cleanup for directory nlink count"). But I'm not sure why the echo create failed; is it a normal failure during the test?
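
For illustration, a standalone toy model of the accounting behind the failed assertion; the struct and field names mirror the trace above, but this is a sketch, not the Lustre source:

#include <assert.h>

/* Toy model of the OSD transaction-handle accounting behind
 * ASSERTION((oh)->ot_declare_ref_del > 0) in the trace above.
 * ot_declare_ref_del counts ref-del operations declared before
 * the transaction starts. */
struct osd_thandle {
        int ot_declare_ref_del;
};

static void osd_declare_ref_del(struct osd_thandle *oh)
{
        oh->ot_declare_ref_del++;  /* reserve one nlink-drop op */
}

static void osd_object_ref_del(struct osd_thandle *oh)
{
        /* The LBUG: executing more ref-del ops than were declared. */
        assert(oh->ot_declare_ref_del > 0);
        oh->ot_declare_ref_del--;
}

int main(void)
{
        struct osd_thandle oh = { 0 };

        osd_declare_ref_del(&oh);  /* forward path declared ... */
        osd_object_ref_del(&oh);   /* ... and executed: fine */
        osd_object_ref_del(&oh);   /* undeclared undo op: asserts */
        return 0;
}

The crash scenario matches the third call: an error-cleanup (undo) path drops a reference that was never declared, driving the counter below zero.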

Comment by Alex Zhuravlev [ 17/Feb/12 ]

well, we can declare undo ops; that could be the easiest solution, but it results in more credits.

btw, any idea why [DTO_INDEX_DELETE] = 16? ldiskfs never shrinks a directory, never updates neighbor blocks during entry removal, and never changes quota usage. I'd think 1 should be enough.
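
For illustration, a standalone sketch of the kind of per-operation credit table under discussion; the enum value DTO_INDEX_DELETE comes from the comment above, while the array name, the other entry, and its value are placeholders, not Lustre's source:

#include <stdio.h>

/* Toy per-operation transaction-credit table; only the
 * DTO_INDEX_DELETE entry reflects the discussion above. */
enum dto_op { DTO_INDEX_INSERT, DTO_INDEX_DELETE, DTO_NR };

static const int dto_credits[DTO_NR] = {
        [DTO_INDEX_INSERT] = 0,   /* placeholder */
        /* Entry removal in ldiskfs modifies only the directory block
         * holding the entry (no tree shrink, no neighbor-block update,
         * no quota change), hence the suggestion to use 1 instead: */
        [DTO_INDEX_DELETE] = 16,
};

int main(void)
{
        printf("DTO_INDEX_DELETE credits: %d (proposed: 1)\n",
               dto_credits[DTO_INDEX_DELETE]);
        return 0;
}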

Comment by Niu Yawei (Inactive) [ 17/Feb/12 ]

btw, any idea why [DTO_INDEX_DELETE] = 16? ldiskfs never shrinks a directory, never updates neighbor blocks during entry removal, and never changes quota usage. I'd think 1 should be enough.

I have no idea why it was 16. You are the ext3/4 expert; I believe you are right, 1 block should be enough.

Comment by Alex Zhuravlev [ 17/Feb/12 ]

I'm beginning to think that if deletion costs 1 credit, then setting nlink takes 1..2 more credits,
so 2..3 additional credits for the undo path probably won't hurt us, at least in the mdd_create() case.
another (enormous) case is mdd_rename() - it'll take more credits for undo, but probably still
acceptable. and at some point we're going to change the approach to be more object-based than
just summing ops.
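
As a back-of-envelope check of that estimate (all names here are illustrative, not Lustre's):

#include <stdio.h>

/* Extra credits needed to declare the undo path in mdd_create(),
 * following the reasoning above: 1 to remove the just-created
 * entry, plus 1..2 to reset nlink, giving 2..3 in total. */
int main(void)
{
        int undo_entry_delete = 1; /* remove the new directory entry */
        int undo_nlink_min = 1;    /* reset nlink: at least 1 credit */
        int undo_nlink_max = 2;    /* ... at most 2 credits */

        printf("extra undo credits: %d..%d\n",
               undo_entry_delete + undo_nlink_min,
               undo_entry_delete + undo_nlink_max);
        return 0;
}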

Comment by Peter Jones [ 19/Feb/12 ]

Andreas/Johann

Could one of you please comment on this 2.2 blocker?

Thanks

Peter

Comment by Richard Henwood (Inactive) [ 20/Feb/12 ]

FYI: I have been running a more recent Lustre build and have not been able to reproduce this issue.

# rpm -qa | grep lustre
lustre-ldiskfs-3.3.0-2.6.32_220.el6_lustre.gfd1c51d.x86_64_g0204171.x86_64
kernel-2.6.32-220.el6_lustre.gfd1c51d.x86_64
lustre-modules-2.1.55-2.6.32_220.el6_lustre.gfd1c51d.x86_64_g0204171.x86_64
kernel-firmware-2.6.32-220.el6_lustre.gfd1c51d.x86_64
lustre-2.1.55-2.6.32_220.el6_lustre.gfd1c51d.x86_64_g0204171.x86_64

These rpms are from build 480:
http://build.whamcloud.com/job/lustre-master/480/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/

Comment by Andreas Dilger [ 21/Feb/12 ]

LU-1082 is tracking the test that will run mds-survey during normal testing to ensure it keeps working.
