[LU-3505] Failure on test suite racer test_1: ASSERTION( !mdd_object_exists(c) ) failed Created: 25/Jun/13  Updated: 01/Nov/17

Status: In Progress
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1, Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: yuc2
Environment:

server and client: lustre-master build #1536
client is running FC18


Severity: 3
Rank (Obsolete): 8823

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/4a81e8ac-dcc9-11e2-8c97-52540035b04c.

The sub-test test_1 failed with the following error:

test failed to respond and timed out

MDS console:

16:14:21:Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-14vm1,client-14vm2.lab.whamcloud.com DURATION=900 ========== 16:17:28 (1372029448)
16:14:21:Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt
16:14:43:LustreError: 8932:0:(osd_oi.c:630:osd_oi_insert()) lustre-MDT0000: the FID [0x200000400:0x1fb:0x0] is used by two objects: 178/3292964813 201/3292964829
16:18:26:LustreError: 8914:0:(osd_oi.c:630:osd_oi_insert()) lustre-MDT0000: the FID [0x200000402:0x1618:0x0] is used by two objects: 409/3292973200 464/3292973217
16:22:10:LustreError: 8916:0:(mdd_object.c:380:mdd_object_create_internal()) ASSERTION( !mdd_object_exists(c) ) failed: 
16:22:10:LustreError: 8916:0:(mdd_object.c:380:mdd_object_create_internal()) LBUG
16:22:10:Pid: 8916, comm: mdt00_014
16:22:10:
16:22:10:Call Trace:
16:22:10: [<ffffffffa0ba7895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
16:22:10: [<ffffffffa0ba7e97>] lbug_with_loc+0x47/0xb0 [libcfs]
16:22:10: [<ffffffffa1588810>] mdd_declare_object_create_internal+0x0/0x1f0 [mdd]
16:22:10: [<ffffffffa159925c>] mdd_create+0xa3c/0x1740 [mdd]
16:22:10: [<ffffffffa05fcc20>] ? lod_index_lookup+0x0/0x30 [lod]
16:22:10: [<ffffffffa055ffdf>] mdt_reint_open+0x135f/0x20c0 [mdt]
16:22:10: [<ffffffffa0bc45ee>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
16:22:10: [<ffffffffa12f6ffc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
16:22:10: [<ffffffffa0ceedc0>] ? lu_ucred+0x20/0x30 [obdclass]
16:22:10: [<ffffffffa054a951>] mdt_reint_rec+0x41/0xe0 [mdt]
16:22:10: [<ffffffffa052fb73>] mdt_reint_internal+0x4c3/0x780 [mdt]
16:22:10: [<ffffffffa05300fd>] mdt_intent_reint+0x1ed/0x520 [mdt]
16:22:10: [<ffffffffa052df6e>] mdt_intent_policy+0x3ae/0x770 [mdt]
16:22:10: [<ffffffffa12ae441>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
16:22:10: [<ffffffffa12d53bf>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
16:22:10: [<ffffffffa052e436>] mdt_enqueue+0x46/0xe0 [mdt]
16:22:10: [<ffffffffa0534b88>] mdt_handle_common+0x648/0x1660 [mdt]
16:22:10: [<ffffffffa056e335>] mds_regular_handle+0x15/0x20 [mdt]
16:22:10: [<ffffffffa1307598>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
16:22:10: [<ffffffffa0ba854e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
16:22:10: [<ffffffffa0bb9c2f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
16:22:10: [<ffffffffa12fe959>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
16:22:10: [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
16:22:10: [<ffffffffa130892e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
16:22:10: [<ffffffffa1307e60>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
16:22:10: [<ffffffff81096936>] kthread+0x96/0xa0
16:22:10: [<ffffffff8100c0ca>] child_rip+0xa/0x20
16:22:10: [<ffffffff810968a0>] ? kthread+0x0/0xa0
16:22:10: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
16:22:10:
16:22:10:Kernel panic - not syncing: LBUG
16:22:10:Pid: 8916, comm: mdt00_014 Not tainted 2.6.32-358.11.1.el6_lustre.x86_64 #1
16:22:10:Call Trace:
16:22:10: [<ffffffff8150d8f8>] ? panic+0xa7/0x16f
16:22:10: [<ffffffffa0ba7eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
16:22:10: [<ffffffffa1588810>] ? mdd_declare_object_create_internal+0x0/0x1f0 [mdd]
16:22:10: [<ffffffffa159925c>] ? mdd_create+0xa3c/0x1740 [mdd]
16:22:10: [<ffffffffa05fcc20>] ? lod_index_lookup+0x0/0x30 [lod]
16:22:10: [<ffffffffa055ffdf>] ? mdt_reint_open+0x135f/0x20c0 [mdt]
16:22:11: [<ffffffffa0bc45ee>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
16:22:11: [<ffffffffa12f6ffc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
16:22:11: [<ffffffffa0ceedc0>] ? lu_ucred+0x20/0x30 [obdclass]
16:22:11: [<ffffffffa054a951>] ? mdt_reint_rec+0x41/0xe0 [mdt]
16:22:11: [<ffffffffa052fb73>] ? mdt_reint_internal+0x4c3/0x780 [mdt]
16:22:11: [<ffffffffa05300fd>] ? mdt_intent_reint+0x1ed/0x520 [mdt]
16:22:11: [<ffffffffa052df6e>] ? mdt_intent_policy+0x3ae/0x770 [mdt]
16:22:11: [<ffffffffa12ae441>] ? ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
16:22:11: [<ffffffffa12d53bf>] ? ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
16:22:11: [<ffffffffa052e436>] ? mdt_enqueue+0x46/0xe0 [mdt]
16:22:11: [<ffffffffa0534b88>] ? mdt_handle_common+0x648/0x1660 [mdt]
16:22:11: [<ffffffffa056e335>] ? mds_regular_handle+0x15/0x20 [mdt]
16:22:11: [<ffffffffa1307598>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
16:22:11: [<ffffffffa0ba854e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
16:22:11: [<ffffffffa0bb9c2f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
16:22:11: [<ffffffffa12fe959>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
16:22:11: [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
16:22:11: [<ffffffffa130892e>] ? ptlrpc_main+0xace/0x1700 [ptlrpc]
16:22:11: [<ffffffffa1307e60>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
16:22:11: [<ffffffff81096936>] ? kthread+0x96/0xa0
16:22:11: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
16:22:11: [<ffffffff810968a0>] ? kthread+0x0/0xa0
16:22:11: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
16:22:11:Initializing cgroup subsys cpuset
16:22:11:Initializing cgroup subsys cpu


 Comments   
Comment by Peter Jones [ 03/Jul/13 ]

Lai

Could you please look into this one?

Thanks

Peter

Comment by Lai Siyao [ 09/Jul/13 ]

It's worth noting that the test client is running FC18; I suspect this is a race in atomic_open. I'll review the related code first.

Comment by Lai Siyao [ 15/Jul/13 ]

Sarah, could you help verify whether this is an FC18-client-specific issue? That is, could you try to reproduce it repeatedly on both FC18 and RHEL6 clients and compare the test results?

Comment by Sarah Liu [ 15/Jul/13 ]

ok, will get back to you when I have a result.

Comment by Sarah Liu [ 16/Jul/13 ]

I ran racer 4 times on a RHEL6 client and could not reproduce it.

Comment by Lai Siyao [ 22/Jul/13 ]

The client side doesn't enforce open-by-FID (and disable open-by-name) in ll_intent_file_open() and ll_revalidate_it(), which may cause two objects to be created with the same FID. A patch is at http://review.whamcloud.com/#/c/7068/.

Sarah, could you help verify it on an FC18 client?
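
As an aside, the failure mode described above -- a name-based create path that skips the FID-existence check, so two racing creates can both land on one FID -- can be modeled with a toy index. This is an illustrative Python sketch, not Lustre code; ObjectIndex and its methods are invented for this example, and the real check lives in the kernel-side OSD/MDD layers.

```python
import threading

# Toy model of an MDS-side object index keyed by FID (illustrative
# only, not Lustre code). The buggy path inserts without checking
# whether the FID is already in use, mirroring how an open-by-name
# that bypasses the FID lookup can duplicate a FID -- the condition
# the osd_oi_insert() "FID ... is used by two objects" messages report.

class ObjectIndex:
    def __init__(self):
        self._by_fid = {}            # FID -> list of inode numbers
        self._lock = threading.Lock()

    def create_by_name(self, fid, ino):
        # Buggy path: no existence check, so a racing second create
        # silently attaches another object to the same FID.
        with self._lock:
            self._by_fid.setdefault(fid, []).append(ino)

    def create_by_fid(self, fid, ino):
        # Fixed path: the FID lookup and the insert happen as one
        # atomic step, so the second create observes the first and
        # fails cleanly instead of duplicating the FID.
        with self._lock:
            if fid in self._by_fid:
                return False         # FID already in use
            self._by_fid[fid] = [ino]
            return True

oi = ObjectIndex()
oi.create_by_name("[0x200000400:0x1fb:0x0]", 178)
oi.create_by_name("[0x200000400:0x1fb:0x0]", 201)
print(len(oi._by_fid["[0x200000400:0x1fb:0x0]"]))   # 2: one FID, two objects

oi2 = ObjectIndex()
print(oi2.create_by_fid("[0x200000400:0x1fb:0x0]", 178))  # True
print(oi2.create_by_fid("[0x200000400:0x1fb:0x0]", 201))  # False
```

Enforcing open-by-FID on the client closes the name-based window, which is what the patch above is intended to verify.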

Comment by Sarah Liu [ 22/Jul/13 ]

It looks like this issue cannot be reproduced on FC18; here is the latest result on an FC18 client:

https://maloo.whamcloud.com/test_sets/e2540f8a-f01e-11e2-b957-52540035b04c

Comment by Lai Siyao [ 23/Jul/13 ]

IMO this is not easy to reproduce, so we may need to run racer repeatedly for hours. Could you do that, or tell me how to do it via commit parameters?

Comment by Sarah Liu [ 29/Jul/13 ]

Hello Lai,

I will try again

Comment by Sarah Liu [ 04/Aug/13 ]

Lai,

I hit this error when running your patch on an FC18 client:

client console:

Logging to local directory: /tmp/test_logs/2013-08-03/231154
running: racer 
run_suite racer /usr/lib64/lustre/tests/racer.sh
[  761.198075] LustreError: 152-6: Ignoring deprecated mount option 'acl'.
[ 1133.933370] LustreError: 27150:0:(file.c:2728:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000403:0x4335:0x0] error: rc = -116
[ 1146.981086] LustreError: 1285:0:(file.c:2728:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000402:0x46e9:0x0] error: rc = -116
[ 1736.870529] LustreError: 30716:0:(lu_object.c:1141:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
[ 1736.883706] LustreError: 30716:0:(lu_object.c:1141:lu_device_fini()) LBUG
[ 1736.891369] Kernel panic - not syncing: LBUG
[ 1736.895878] Pid: 30716, comm: umount Tainted: GF          O 3.6.10-4.fc18.x86_64 #1
[ 1736.903953] Call Trace:
[ 1736.906535]  [<ffffffff816198db>] panic+0xc1/0x1d0
[ 1736.911597]  [<ffffffffa0294e5b>] lbug_with_loc+0xab/0xc0 [libcfs]
[ 1736.918142]  [<ffffffffa03fa4da>] lu_device_fini+0xba/0xc0 [obdclass]
[ 1736.924948]  [<ffffffffa0743b06>] lovsub_device_free+0x66/0x1b0 [lov]
[ 1736.931766]  [<ffffffffa03fe22e>] lu_stack_fini+0x7e/0xc0 [obdclass]
[ 1736.938491]  [<ffffffffa040384e>] cl_stack_fini+0xe/0x10 [obdclass]
[ 1736.945110]  [<ffffffffa0733df9>] lov_device_fini+0x59/0x120 [lov]
[ 1736.951651]  [<ffffffffa03fe1f9>] lu_stack_fini+0x49/0xc0 [obdclass]
[ 1736.958374]  [<ffffffffa040384e>] cl_stack_fini+0xe/0x10 [obdclass]
[ 1736.965008]  [<ffffffffa0822921>] cl_sb_fini+0x61/0x1a0 [lustre]
[ 1736.971358]  [<ffffffffa07e51d4>] client_common_put_super+0x54/0xb20 [lustre]
[ 1736.978901]  [<ffffffffa07e6624>] ll_put_super+0xd4/0x370 [lustre]
[ 1736.985418]  [<ffffffff811aa7ee>] ? dispose_list+0x3e/0x60
[ 1736.991209]  [<ffffffff81192761>] generic_shutdown_super+0x61/0xe0
[ 1736.997730]  [<ffffffff81192876>] kill_anon_super+0x16/0x30
[ 1737.003636]  [<ffffffffa03efd6a>] lustre_kill_super+0x4a/0x60 [obdclass]
[ 1737.010705]  [<ffffffff81192bf7>] deactivate_locked_super+0x57/0x90
[ 1737.017317]  [<ffffffff811937ae>] deactivate_super+0x4e/0x70
[ 1737.023284]  [<ffffffff811ae167>] mntput_no_expire+0xd7/0x130
[ 1737.029348]  [<ffffffff811af146>] sys_umount+0x76/0x390
[ 1737.034864]  [<ffffffff8162bae9>] system_call_fastpath+0x16/0x1b
[ 1737.041338] panic occurred, switching back to text console
Comment by Lai Siyao [ 06/Aug/13 ]

Hey Sarah, IMO that is a different bug. Could you redo this test (possibly on both RHEL6 and FC18 clients)?

Comment by Sarah Liu [ 06/Aug/13 ]

OK, I will update the ticket when I have a result.

Comment by Jian Yu [ 09/Aug/13 ]

Lustre Branch: b2_4
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/27/
Distro/Arch: RHEL6.4/x86_64 + FC18/x86_64 (Server + Client)

The racer test hit the same ASSERTION failure on MDS:

15:32:50:Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-16vm1,client-16vm2.lab.whamcloud.com DURATION=900 ========== 15:32:27 (1375655547)
15:32:50:Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt
15:35:17:LustreError: 10737:0:(osd_oi.c:630:osd_oi_insert()) lustre-MDT0000: the FID [0x200000402:0xeb0:0x0] is used by two objects: 359/4210961870 422/4210961920
15:38:22:LustreError: 10750:0:(mdd_object.c:380:mdd_object_create_internal()) ASSERTION( !mdd_object_exists(c) ) failed: 
15:38:22:LustreError: 10750:0:(mdd_object.c:380:mdd_object_create_internal()) LBUG
15:38:22:Pid: 10750, comm: mdt00_029
15:38:22:
15:38:22:Call Trace:
15:38:22: [<ffffffffa06e5895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
15:38:22: [<ffffffffa06e5e97>] lbug_with_loc+0x47/0xb0 [libcfs]
15:38:22: [<ffffffffa0d9e810>] mdd_declare_object_create_internal+0x0/0x1f0 [mdd]
15:38:22: [<ffffffffa0daf25c>] mdd_create+0xa3c/0x1740 [mdd]
15:38:22: [<ffffffffa1038c00>] ? lod_index_lookup+0x0/0x30 [lod]
15:38:22: [<ffffffffa0f9be4f>] mdt_reint_open+0x135f/0x20c0 [mdt]
15:38:22: [<ffffffffa07027be>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
15:38:22: [<ffffffffa0997d8c>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
15:38:22: [<ffffffffa082ce60>] ? lu_ucred+0x20/0x30 [obdclass]
15:38:22: [<ffffffffa0f868f1>] mdt_reint_rec+0x41/0xe0 [mdt]
15:38:22: [<ffffffffa0f6bb03>] mdt_reint_internal+0x4c3/0x780 [mdt]
15:38:22: [<ffffffffa0f6c08d>] mdt_intent_reint+0x1ed/0x520 [mdt]
15:38:22: [<ffffffffa0f69f3e>] mdt_intent_policy+0x39e/0x720 [mdt]
15:38:22: [<ffffffffa094f7e1>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
15:38:22: [<ffffffffa09761bf>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
15:38:22: [<ffffffffa0f6a3c6>] mdt_enqueue+0x46/0xe0 [mdt]
15:38:22: [<ffffffffa0f70ab7>] mdt_handle_common+0x647/0x16d0 [mdt]
15:38:22: [<ffffffffa0faa1a5>] mds_regular_handle+0x15/0x20 [mdt]
15:38:22: [<ffffffffa09a8388>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
15:38:22: [<ffffffffa06e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
15:38:22: [<ffffffffa06f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
15:38:22: [<ffffffffa099f6e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
15:38:22: [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
15:38:22: [<ffffffffa09a971e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
15:38:22: [<ffffffffa09a8c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
15:38:22: [<ffffffff8100c0ca>] child_rip+0xa/0x20
15:38:22: [<ffffffffa09a8c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
15:38:22: [<ffffffffa09a8c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
15:38:22: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
15:38:22:
15:38:22:Kernel panic - not syncing: LBUG

Maloo report: https://maloo.whamcloud.com/test_sets/2587377a-fd7e-11e2-9fdb-52540035b04c

Comment by Lai Siyao [ 12/Aug/13 ]

what's the test result for RHEL6 client?

Comment by Jian Yu [ 12/Aug/13 ]

what's the test result for RHEL6 client?

For Lustre b2_4 build #27, the racer test passed on RHEL6 and SLES11SP2 clients.

Comment by Oleg Drokin [ 23/Aug/13 ]

So, I see there's a patch, but it has not landed anywhere and has no inspections. Do we need it? What's the current status of this ticket?

Comment by Lai Siyao [ 02/Sep/13 ]

That patch is a verification patch, but it hasn't been seriously tested against both FC18 and RHEL6 clients. All test results so far lead to the conclusion that this is an FC18-client-only issue.

Comment by Peter Jones [ 03/Sep/13 ]

Dropping priority to reflect that this is a hard issue to hit and affects FC18 clients only.

Comment by Jian Yu [ 04/Sep/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)
Distro/Arch: RHEL6.4/x86_64 + FC18/x86_64 (Server + Client)

The racer test hit the same failure:
https://maloo.whamcloud.com/test_sets/bbce5356-14f0-11e3-ac48-52540035b04c

Generated at Sat Feb 10 01:34:29 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.