[LU-4252] Failure on test suite racer test_1 Created: 14/Nov/13  Updated: 15/Nov/19  Resolved: 20/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.5.1
Fix Version/s: Lustre 2.6.0, Lustre 2.5.2

Type: Bug
Priority: Blocker
Reporter: Maloo
Assignee: Yang Sheng
Resolution: Fixed
Votes: 0
Labels: None
Environment:

lustre-master build # 1753 ldiskfs
client is running SLES11 SP3


Issue Links:
Related
is related to LU-6446 Warn-on in ldiskfs_orphan_add/del Resolved
is related to LU-12977 fix i_mutex for ldiskfs_truncate() in... Resolved
Severity: 3
Rank (Obsolete): 11590

Description

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/03808e92-49af-11e3-9f02-52540035b04c.

The sub-test test_1 failed with the following error:

test failed to respond and timed out

test log shows:

== racer test 1: racer on clients: client-16vm1,client-16vm2 DURATION=900 ============================ 04:13:57 (1383999237)
racers pids: 28586 28587
CMD: client-16vm1,client-16vm2 DURATION=900 MDSCOUNT=1 				   /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer 
CMD: client-16vm1,client-16vm2 DURATION=900 MDSCOUNT=1 				   /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer 
client-16vm2: file_create.sh: no process found
client-16vm2: dir_create.sh: no process found
client-16vm2: file_rm.sh: no process found
client-16vm2: file_rename.sh: no process found
client-16vm2: file_link.sh: no process found
client-16vm2: file_symlink.sh: no process found
client-16vm2: file_list.sh: no process found
client-16vm2: file_create.sh: no process found
client-16vm2: file_create.sh: no process found
client-16vm2: dir_create.sh: no process found
client-16vm2: file_rm.sh: no process found
client-16vm2: dir_create.sh: no process found
client-16vm2: file_rename.sh: no process found
client-16vm2: file_rm.sh: no process found
client-16vm2: file_link.sh: no process found
client-16vm2: file_rename.sh: no process found
client-16vm2: file_symlink.sh: no process found
client-16vm2: file_link.sh: no process found
client-16vm2: file_list.sh: no process found
client-16vm2: file_symlink.sh: no process found
client-16vm2: file_concat.sh: no process found
client-16vm2: file_list.sh: no process found
client-16vm2: file_exec.sh: no process found
client-16vm2: file_concat.sh: no process found
client-16vm2: file_exec.sh: no process found


Comments
Comment by Andreas Dilger [ 10/Dec/13 ]

John,
Could you please take a look at whether this is different from the many other racer bugs we have? There are quite a few and I don't know the subtleties of each one.

Comment by John Hammond [ 10/Dec/13 ]

Sorry, I can't tell from what's on maloo. It looks like a client rebooted but there are no console files for either client.

Comment by Jian Yu [ 06/Jan/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5/
Distro/Arch: RHEL6.4/x86_64(server), SLES11SP3/x86_64(client)

The same failure occurred:
https://maloo.whamcloud.com/test_sets/2dbe5d7a-7640-11e3-b3c0-52540035b04c

Comment by Oleg Drokin [ 12/Mar/14 ]

Two threads that appear to be deadlocked:

[34058.700432] ldlm_elt        D ffff880062ffe010     0 32149      2 0x00000000
[34058.700432]  ffff880062fff9f8 0000000000000046 ffff880062ffe010 0000000000011800
[34058.700432]  0000000000011800 0000000000011800 0000000000011800 ffff880062ffffd8
[34058.700432]  ffff880062ffffd8 0000000000011800 ffff88006b7103c0 ffff8800772e8140
[34058.700432] Call Trace:
[34058.700432]  [<ffffffff8145fda8>] __mutex_lock_slowpath+0xf8/0x150
[34058.700432]  [<ffffffff8145f83a>] mutex_lock+0x1a/0x40
[34058.700432]  [<ffffffffa062cfc5>] ldiskfs_sync_file+0xc5/0x360 [ldiskfs]
[34058.700432]  [<ffffffffa0e66b9c>] osd_object_sync+0x11c/0x150 [osd_ldiskfs]
[34058.700432]  [<ffffffffa0ee8590>] ofd_sync+0x540/0x930 [ofd]
[34058.700432]  [<ffffffffa0dd802f>] obd_sync+0xdf/0x240 [ost]
[34058.700432]  [<ffffffffa0de72aa>] ost_blocking_ast+0x48a/0xee0 [ost]
[34058.700432]  [<ffffffffa09d910f>] ldlm_cancel_callback+0x5f/0x180 [ptlrpc]
[34058.700432]  [<ffffffffa09d9284>] ldlm_lock_cancel+0x54/0x1e0 [ptlrpc]
[34058.700432]  [<ffffffffa09da1ef>] ldlm_cancel_locks_for_export_cb+0x8f/0x200 [ptlrpc]
[34058.700432]  [<ffffffffa06b3ba8>] cfs_hash_for_each_relax+0x178/0x340 [libcfs]
[34058.700432]  [<ffffffffa06b5cae>] cfs_hash_for_each_empty+0x10e/0x1f0 [libcfs]
[34058.700432]  [<ffffffffa09d59f6>] ldlm_cancel_locks_for_export+0x26/0x30 [ptlrpc]
[34058.700432]  [<ffffffffa09e6bad>] server_disconnect_export+0x5d/0x1a0 [ptlrpc]
[34058.700432]  [<ffffffffa0ede6c0>] ofd_obd_disconnect+0x60/0x1f0 [ofd]
[34058.700432]  [<ffffffffa07b2528>] class_fail_export+0x118/0x580 [obdclass]
[34058.700432]  [<ffffffffa09ff52f>] expired_lock_main+0x41f/0x880 [ptlrpc]
[34058.700432]  [<ffffffff81082966>] kthread+0x96/0xa0
[34058.700432]  [<ffffffff81469ee4>] kernel_thread_helper+0x4/0x10

vs

[34058.700432] ldlm_cn00_003   D ffff8800691ffa88     0  1252      2 0x00000000
[34058.700432]  ffff8800691ffa50 0000000000000046 ffff8800691fe010 0000000000011800
[34058.700432]  0000000000011800 0000000000011800 0000000000011800 ffff8800691fffd8
[34058.700432]  ffff8800691fffd8 0000000000011800 ffff88006e08a5c0 ffff88007bcae300
[34058.700432] Call Trace:
[34058.700432]  [<ffffffffa0612205>] jbd2_log_wait_commit+0xb5/0x130 [jbd2]
[34058.700432]  [<ffffffffa062d0dd>] ldiskfs_sync_file+0x1dd/0x360 [ldiskfs]
[34058.700432]  [<ffffffffa0e66b9c>] osd_object_sync+0x11c/0x150 [osd_ldiskfs]
[34058.700432]  [<ffffffffa0ee8590>] ofd_sync+0x540/0x930 [ofd]
[34058.700432]  [<ffffffffa0dd802f>] obd_sync+0xdf/0x240 [ost]
[34058.700432]  [<ffffffffa0de72aa>] ost_blocking_ast+0x48a/0xee0 [ost]
[34058.700432]  [<ffffffffa09d910f>] ldlm_cancel_callback+0x5f/0x180 [ptlrpc]
[34058.700432]  [<ffffffffa09d9284>] ldlm_lock_cancel+0x54/0x1e0 [ptlrpc]
[34058.700432]  [<ffffffffa09fcdcb>] ldlm_request_cancel+0x25b/0x420 [ptlrpc]
[34058.700432]  [<ffffffffa09fd0d5>] ldlm_handle_cancel+0x145/0x250 [ptlrpc]
[34058.700432]  [<ffffffffa0a02329>] ldlm_cancel_handler+0x1e9/0x510 [ptlrpc]
[34058.700432]  [<ffffffffa0a32fcb>] ptlrpc_server_handle_request+0x37b/0xcd0 [ptlrpc]
[34058.700432]  [<ffffffffa0a35627>] ptlrpc_main+0x917/0xd70 [ptlrpc]
[34058.700432]  [<ffffffff81082966>] kthread+0x96/0xa0
[34058.700432]  [<ffffffff81469ee4>] kernel_thread_helper+0x4/0x10
Comment by Peter Jones [ 12/Mar/14 ]

Hongchao

Could you please look into this one?

Thanks

Peter

Comment by Hongchao Zhang [ 13/Mar/14 ]

thread "ll_ost_io00_037" has started journal but can't get the "i_mutex" lock.

[34058.700432] ll_ost_io00_037 D ffff880048206010     0  1448      2 0x00000000
[34058.700432]  ffff880048207a38 0000000000000046 ffff880048206010 0000000000011800
[34058.700432]  0000000000011800 0000000000011800 0000000000011800 ffff880048207fd8
[34058.700432]  ffff880048207fd8 0000000000011800 ffff880048204080 ffff880068ea6200
[34058.700432] Call Trace:
[34058.700432]  [<ffffffff8145fda8>] __mutex_lock_slowpath+0xf8/0x150
[34058.700432]  [<ffffffff8145f83a>] mutex_lock+0x1a/0x40
[34058.700432]  [<ffffffffa0e83649>] osd_punch+0x479/0x610 [osd_ldiskfs]
[34058.700432]  [<ffffffffa0ef1bd1>] ofd_object_punch+0x741/0x960 [ofd]
[34058.700432]  [<ffffffffa0ee71ed>] ofd_punch+0x5dd/0xb20 [ofd]
[34058.700432]  [<ffffffffa0de493f>] ost_punch+0x3af/0xa30 [ost]
[34058.700432]  [<ffffffffa0de6b1e>] ost_handle+0x1b5e/0x1e60 [ost]
[34058.700432]  [<ffffffffa0a32fcb>] ptlrpc_server_handle_request+0x37b/0xcd0 [ptlrpc]
[34058.700432]  [<ffffffffa0a35627>] ptlrpc_main+0x917/0xd70 [ptlrpc]
[34058.700432]  [<ffffffff81082966>] kthread+0x96/0xa0
[34058.700432]  [<ffffffff81469ee4>] kernel_thread_helper+0x4/0x10

Thread "ldlm_cn00_003" got the "i_mutex" lock and is waiting for the journal to commit, but the above thread is still holding a journal handle, so the commit can never complete.

[34058.700432] ldlm_cn00_003   D ffff8800691ffa88     0  1252      2 0x00000000
[34058.700432]  ffff8800691ffa50 0000000000000046 ffff8800691fe010 0000000000011800
[34058.700432]  0000000000011800 0000000000011800 0000000000011800 ffff8800691fffd8
[34058.700432]  ffff8800691fffd8 0000000000011800 ffff88006e08a5c0 ffff88007bcae300
[34058.700432] Call Trace:
[34058.700432]  [<ffffffffa0612205>] jbd2_log_wait_commit+0xb5/0x130 [jbd2]
[34058.700432]  [<ffffffffa062d0dd>] ldiskfs_sync_file+0x1dd/0x360 [ldiskfs]
[34058.700432]  [<ffffffffa0e66b9c>] osd_object_sync+0x11c/0x150 [osd_ldiskfs]
[34058.700432]  [<ffffffffa0ee8590>] ofd_sync+0x540/0x930 [ofd]
[34058.700432]  [<ffffffffa0dd802f>] obd_sync+0xdf/0x240 [ost]
[34058.700432]  [<ffffffffa0de72aa>] ost_blocking_ast+0x48a/0xee0 [ost]
[34058.700432]  [<ffffffffa09d910f>] ldlm_cancel_callback+0x5f/0x180 [ptlrpc]
[34058.700432]  [<ffffffffa09d9284>] ldlm_lock_cancel+0x54/0x1e0 [ptlrpc]
[34058.700432]  [<ffffffffa09fcdcb>] ldlm_request_cancel+0x25b/0x420 [ptlrpc]
[34058.700432]  [<ffffffffa09fd0d5>] ldlm_handle_cancel+0x145/0x250 [ptlrpc]
[34058.700432]  [<ffffffffa0a02329>] ldlm_cancel_handler+0x1e9/0x510 [ptlrpc]
[34058.700432]  [<ffffffffa0a32fcb>] ptlrpc_server_handle_request+0x37b/0xcd0 [ptlrpc]
[34058.700432]  [<ffffffffa0a35627>] ptlrpc_main+0x917/0xd70 [ptlrpc]
[34058.700432]  [<ffffffff81082966>] kthread+0x96/0xa0
[34058.700432]  [<ffffffff81469ee4>] kernel_thread_helper+0x4/0x10

The patch is tracked at http://review.whamcloud.com/#/c/9644/.

Comment by Hongchao Zhang [ 18/Mar/14 ]

After discussing with Yang Sheng and Niu, we believe this issue was introduced by patch http://review.whamcloud.com/#/c/7794/ from LU-3373.

Reassigning to Yang Sheng to fix it.

Comment by Peter Jones [ 20/Mar/14 ]

Landed for 2.6

Comment by Yang Sheng [ 21/Mar/14 ]

Ported to b2_5: http://review.whamcloud.com/9742
