[LU-4252] Failure on test suite racer test_1 Created: 14/Nov/13 Updated: 15/Nov/19 Resolved: 20/Mar/14
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.5.1 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.2 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Yang Sheng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | lustre-master build # 1753 ldiskfs |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 11590 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/03808e92-49af-11e3-9f02-52540035b04c. The sub-test test_1 failed with the following error:
test log shows:
== racer test 1: racer on clients: client-16vm1,client-16vm2 DURATION=900 ============================ 04:13:57 (1383999237)
racers pids: 28586 28587
CMD: client-16vm1,client-16vm2 DURATION=900 MDSCOUNT=1 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer
CMD: client-16vm1,client-16vm2 DURATION=900 MDSCOUNT=1 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer
client-16vm2: file_create.sh: no process found
client-16vm2: dir_create.sh: no process found
client-16vm2: file_rm.sh: no process found
client-16vm2: file_rename.sh: no process found
client-16vm2: file_link.sh: no process found
client-16vm2: file_symlink.sh: no process found
client-16vm2: file_list.sh: no process found
client-16vm2: file_create.sh: no process found
client-16vm2: file_create.sh: no process found
client-16vm2: dir_create.sh: no process found
client-16vm2: file_rm.sh: no process found
client-16vm2: dir_create.sh: no process found
client-16vm2: file_rename.sh: no process found
client-16vm2: file_rm.sh: no process found
client-16vm2: file_link.sh: no process found
client-16vm2: file_rename.sh: no process found
client-16vm2: file_symlink.sh: no process found
client-16vm2: file_link.sh: no process found
client-16vm2: file_list.sh: no process found
client-16vm2: file_symlink.sh: no process found
client-16vm2: file_concat.sh: no process found
client-16vm2: file_list.sh: no process found
client-16vm2: file_exec.sh: no process found
client-16vm2: file_concat.sh: no process found
client-16vm2: file_exec.sh: no process found |
| Comments |
| Comment by Andreas Dilger [ 10/Dec/13 ] |
|
John, |
| Comment by John Hammond [ 10/Dec/13 ] |
|
Sorry, I can't tell from what's on maloo. It looks like a client rebooted but there are no console files for either client. |
| Comment by Jian Yu [ 06/Jan/14 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5/
The same failure occurred: |
| Comment by Oleg Drokin [ 12/Mar/14 ] |
|
Two threads that appear to be deadlocked:

[34058.700432] ldlm_elt      D ffff880062ffe010     0 32149      2 0x00000000
[34058.700432]  ffff880062fff9f8 0000000000000046 ffff880062ffe010 0000000000011800
[34058.700432]  0000000000011800 0000000000011800 0000000000011800 ffff880062ffffd8
[34058.700432]  ffff880062ffffd8 0000000000011800 ffff88006b7103c0 ffff8800772e8140
[34058.700432] Call Trace:
[34058.700432]  [<ffffffff8145fda8>] __mutex_lock_slowpath+0xf8/0x150
[34058.700432]  [<ffffffff8145f83a>] mutex_lock+0x1a/0x40
[34058.700432]  [<ffffffffa062cfc5>] ldiskfs_sync_file+0xc5/0x360 [ldiskfs]
[34058.700432]  [<ffffffffa0e66b9c>] osd_object_sync+0x11c/0x150 [osd_ldiskfs]
[34058.700432]  [<ffffffffa0ee8590>] ofd_sync+0x540/0x930 [ofd]
[34058.700432]  [<ffffffffa0dd802f>] obd_sync+0xdf/0x240 [ost]
[34058.700432]  [<ffffffffa0de72aa>] ost_blocking_ast+0x48a/0xee0 [ost]
[34058.700432]  [<ffffffffa09d910f>] ldlm_cancel_callback+0x5f/0x180 [ptlrpc]
[34058.700432]  [<ffffffffa09d9284>] ldlm_lock_cancel+0x54/0x1e0 [ptlrpc]
[34058.700432]  [<ffffffffa09da1ef>] ldlm_cancel_locks_for_export_cb+0x8f/0x200 [ptlrpc]
[34058.700432]  [<ffffffffa06b3ba8>] cfs_hash_for_each_relax+0x178/0x340 [libcfs]
[34058.700432]  [<ffffffffa06b5cae>] cfs_hash_for_each_empty+0x10e/0x1f0 [libcfs]
[34058.700432]  [<ffffffffa09d59f6>] ldlm_cancel_locks_for_export+0x26/0x30 [ptlrpc]
[34058.700432]  [<ffffffffa09e6bad>] server_disconnect_export+0x5d/0x1a0 [ptlrpc]
[34058.700432]  [<ffffffffa0ede6c0>] ofd_obd_disconnect+0x60/0x1f0 [ofd]
[34058.700432]  [<ffffffffa07b2528>] class_fail_export+0x118/0x580 [obdclass]
[34058.700432]  [<ffffffffa09ff52f>] expired_lock_main+0x41f/0x880 [ptlrpc]
[34058.700432]  [<ffffffff81082966>] kthread+0x96/0xa0
[34058.700432]  [<ffffffff81469ee4>] kernel_thread_helper+0x4/0x10

vs

[34058.700432] ldlm_cn00_003 D ffff8800691ffa88     0  1252      2 0x00000000
[34058.700432]  ffff8800691ffa50 0000000000000046 ffff8800691fe010 0000000000011800
[34058.700432]  0000000000011800 0000000000011800 0000000000011800 ffff8800691fffd8
[34058.700432]  ffff8800691fffd8 0000000000011800 ffff88006e08a5c0 ffff88007bcae300
[34058.700432] Call Trace:
[34058.700432]  [<ffffffffa0612205>] jbd2_log_wait_commit+0xb5/0x130 [jbd2]
[34058.700432]  [<ffffffffa062d0dd>] ldiskfs_sync_file+0x1dd/0x360 [ldiskfs]
[34058.700432]  [<ffffffffa0e66b9c>] osd_object_sync+0x11c/0x150 [osd_ldiskfs]
[34058.700432]  [<ffffffffa0ee8590>] ofd_sync+0x540/0x930 [ofd]
[34058.700432]  [<ffffffffa0dd802f>] obd_sync+0xdf/0x240 [ost]
[34058.700432]  [<ffffffffa0de72aa>] ost_blocking_ast+0x48a/0xee0 [ost]
[34058.700432]  [<ffffffffa09d910f>] ldlm_cancel_callback+0x5f/0x180 [ptlrpc]
[34058.700432]  [<ffffffffa09d9284>] ldlm_lock_cancel+0x54/0x1e0 [ptlrpc]
[34058.700432]  [<ffffffffa09fcdcb>] ldlm_request_cancel+0x25b/0x420 [ptlrpc]
[34058.700432]  [<ffffffffa09fd0d5>] ldlm_handle_cancel+0x145/0x250 [ptlrpc]
[34058.700432]  [<ffffffffa0a02329>] ldlm_cancel_handler+0x1e9/0x510 [ptlrpc]
[34058.700432]  [<ffffffffa0a32fcb>] ptlrpc_server_handle_request+0x37b/0xcd0 [ptlrpc]
[34058.700432]  [<ffffffffa0a35627>] ptlrpc_main+0x917/0xd70 [ptlrpc]
[34058.700432]  [<ffffffff81082966>] kthread+0x96/0xa0
[34058.700432]  [<ffffffff81469ee4>] kernel_thread_helper+0x4/0x10 |
| Comment by Peter Jones [ 12/Mar/14 ] |
|
Hongchao, could you please look into this one? Thanks. Peter |
| Comment by Hongchao Zhang [ 13/Mar/14 ] |
|
thread "ll_ost_io00_037" has started journal but can't get the "i_mutex" lock. ll_ost_io00_037 D ffff880048206010 0 1448 2 0x00000000 [34058.700432] ffff880048207a38 0000000000000046 ffff880048206010 0000000000011800 [34058.700432] 0000000000011800 0000000000011800 0000000000011800 ffff880048207fd8 [34058.700432] ffff880048207fd8 0000000000011800 ffff880048204080 ffff880068ea6200 [34058.700432] Call Trace: [34058.700432] [<ffffffff8145fda8>] __mutex_lock_slowpath+0xf8/0x150 [34058.700432] [<ffffffff8145f83a>] mutex_lock+0x1a/0x40 [34058.700432] [<ffffffffa0e83649>] osd_punch+0x479/0x610 [osd_ldiskfs] [34058.700432] [<ffffffffa0ef1bd1>] ofd_object_punch+0x741/0x960 [ofd] [34058.700432] [<ffffffffa0ee71ed>] ofd_punch+0x5dd/0xb20 [ofd] [34058.700432] [<ffffffffa0de493f>] ost_punch+0x3af/0xa30 [ost] [34058.700432] [<ffffffffa0de6b1e>] ost_handle+0x1b5e/0x1e60 [ost] [34058.700432] [<ffffffffa0a32fcb>] ptlrpc_server_handle_request+0x37b/0xcd0 [ptlrpc] [34058.700432] [<ffffffffa0a35627>] ptlrpc_main+0x917/0xd70 [ptlrpc] [34058.700432] [<ffffffff81082966>] kthread+0x96/0xa0 [34058.700432] [<ffffffff81469ee4>] kernel_thread_helper+0x4/0x10 thread ldlm_cn00_003 got the "i_mutex" lock and start to commit the journal, but the above thread holding one journal handle, [34058.700432] ldlm_cn00_003 D ffff8800691ffa88 0 1252 2 0x00000000 [34058.700432] ffff8800691ffa50 0000000000000046 ffff8800691fe010 0000000000011800 [34058.700432] 0000000000011800 0000000000011800 0000000000011800 ffff8800691fffd8 [34058.700432] ffff8800691fffd8 0000000000011800 ffff88006e08a5c0 ffff88007bcae300 [34058.700432] Call Trace: [34058.700432] [<ffffffffa0612205>] jbd2_log_wait_commit+0xb5/0x130 [jbd2] [34058.700432] [<ffffffffa062d0dd>] ldiskfs_sync_file+0x1dd/0x360 [ldiskfs] [34058.700432] [<ffffffffa0e66b9c>] osd_object_sync+0x11c/0x150 [osd_ldiskfs] [34058.700432] [<ffffffffa0ee8590>] ofd_sync+0x540/0x930 [ofd] [34058.700432] [<ffffffffa0dd802f>] obd_sync+0xdf/0x240 [ost] [34058.700432] [<ffffffffa0de72aa>] ost_blocking_ast+0x48a/0xee0 [ost] [34058.700432] [<ffffffffa09d910f>] ldlm_cancel_callback+0x5f/0x180 [ptlrpc] [34058.700432] [<ffffffffa09d9284>] ldlm_lock_cancel+0x54/0x1e0 [ptlrpc] [34058.700432] [<ffffffffa09fcdcb>] ldlm_request_cancel+0x25b/0x420 [ptlrpc] [34058.700432] [<ffffffffa09fd0d5>] ldlm_handle_cancel+0x145/0x250 [ptlrpc] [34058.700432] [<ffffffffa0a02329>] ldlm_cancel_handler+0x1e9/0x510 [ptlrpc] [34058.700432] [<ffffffffa0a32fcb>] ptlrpc_server_handle_request+0x37b/0xcd0 [ptlrpc] [34058.700432] [<ffffffffa0a35627>] ptlrpc_main+0x917/0xd70 [ptlrpc] [34058.700432] [<ffffffff81082966>] kthread+0x96/0xa0 [34058.700432] [<ffffffff81469ee4>] kernel_thread_helper+0x4/0x10 the patch is tracked at http://review.whamcloud.com/#/c/9644/ |
| Comment by Hongchao Zhang [ 18/Mar/14 ] |
|
After discussing with Yangsheng and Niu, this issue appears to have been introduced by the patch http://review.whamcloud.com/#/c/7794/. Reassigning to Yang Sheng to fix it. |
| Comment by Peter Jones [ 20/Mar/14 ] |
|
Landed for 2.6 |
| Comment by Yang Sheng [ 21/Mar/14 ] |
|
Port to b2_5: http://review.whamcloud.com/9742 |