[LU-5915] racer test_1: FAIL: test_1 failed with 4 Created: 12/Nov/14  Updated: 22/Nov/18  Resolved: 22/Nov/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0, Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: dne
Environment:

OpenSFS cluster with two MDSs (one MDT each), three OSSs (two OSTs each), and three clients. Lustre master tag 2.6.90, build #2734


Issue Links:
Related
is related to LU-4840 Deadlock when truncating file during... Resolved
Severity: 3
Rank (Obsolete): 16514

Description

Racer test_1 failed with 'test_1 failed with 4'. Logs at https://testing.hpdd.intel.com/test_sets/525741e4-6a21-11e4-aeb4-5254006e85c2

From the client test_log, there are many lines like the following:

c13: /usr/lib64/lustre/tests/racer/racer.sh: line 70: ./file_delxattr.sh: No such file or directory
c13: /usr/lib64/lustre/tests/racer/racer.sh: line 70: ./file_truncate.sh: No such file or directory
c13: /usr/lib64/lustre/tests/racer/racer.sh: line 70: ./file_chmod.sh: No such file or directory
c13: /usr/lib64/lustre/tests/racer/racer.sh: line 70: ./file_chown.sh: No such file or directory
c13: /usr/lib64/lustre/tests/racer/racer.sh: line 70: ./file_truncate.sh: No such file or directory
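The repeated "No such file or directory" lines indicate the helper scripts were invoked by relative path (`./file_chmod.sh`, etc.) but were not present in the working directory on client c13. A minimal, self-contained sketch of that failure mode (hypothetical temp directory, not the actual racer environment):

```shell
# Reproduce the failure mode: invoking a helper by relative path from a
# directory where it was never staged. Hypothetical scratch dir only.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# No file_chmod.sh was copied here, so the relative invocation fails
# exactly like the "No such file or directory" lines in the test_log:
./file_chmod.sh 2>err.txt || status=$?
echo "exit=$status"        # exit=127 (command not found)
cat err.txt

cd / && rm -rf "$workdir"  # clean up the scratch dir
```

This suggests checking that the racer helper scripts are distributed to (or reachable from) the working directory on every client before the test launches them.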

From the client dmesg log:

LustreError: 29864:0:(file.c:3040:ll_migrate()) scratch: migrate 1 , but fid [0x0:0x0:0x0] is insane
LustreError: 29864:0:(file.c:3040:ll_migrate()) Skipped 1 previous similar message
LustreError: 29864:0:(file.c:3040:ll_migrate()) scratch: migrate 9 , but fid [0x0:0x0:0x0] is insane
LustreError: 29864:0:(file.c:3040:ll_migrate()) Skipped 2 previous similar messages
LustreError: 1562:0:(file.c:3040:ll_migrate()) scratch: migrate 2 , but fid [0x0:0x0:0x0] is insane
LustreError: 1562:0:(file.c:3040:ll_migrate()) Skipped 3 previous similar messages
LustreError: 3290:0:(file.c:3040:ll_migrate()) scratch: migrate 2 , but fid [0x0:0x0:0x0] is insane
LustreError: 3290:0:(file.c:3040:ll_migrate()) Skipped 12 previous similar messages
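The "fid [0x0:0x0:0x0] is insane" messages mean `ll_migrate()` was handed an all-zero FID, which cannot name a real object. A hedged sketch of that kind of sanity check in shell (deliberately simplified; the real `fid_is_sane()` in the Lustre sources also validates reserved sequence ranges):

```shell
# Simplified illustration (bash) of the FID sanity test behind the
# "fid ... is insane" message: a FID whose sequence and object ID are
# both zero cannot refer to a real file.
fid_is_sane() {
    fid=${1#\[}; fid=${fid%\]}   # strip the surrounding brackets
    seq=${fid%%:*}               # first field:  sequence
    rest=${fid#*:}
    oid=${rest%%:*}              # second field: object ID
    # sane only if sequence or object ID is nonzero (simplified check)
    [ $((seq)) -ne 0 ] || [ $((oid)) -ne 0 ]
}

fid_is_sane "[0x0:0x0:0x0]" && echo sane || echo insane   # prints "insane"
fid_is_sane "[0x200000401:0x1:0x0]" && echo sane || echo insane  # prints "sane"
```

The FID string format `[seq:oid:ver]` matches what the log lines print; the example FIDs are illustrative, not taken from this run.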
INFO: task dir_create.sh:27528 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dir_create.sh D 0000000000000004     0 27528  27501 0x00000080
 ffff8805c6993b98 0000000000000086 0000004b00000000 ffffffffa130f683
 000000000000009b 0020000000000080 545dced000000004 00000000000e4a17
 ffff88081325bab8 ffff8805c6993fd8 000000000000fbc8 ffff88081325bab8
Call Trace:
 [<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff811a4148>] ? __d_lookup+0xd8/0x150
 [<ffffffff8152a45b>] mutex_lock+0x2b/0x50
 [<ffffffff811989ab>] do_lookup+0x11b/0x230
 [<ffffffff81199100>] __link_path_walk+0x200/0x1000
 [<ffffffff8119a1ba>] path_walk+0x6a/0xe0
 [<ffffffff8119b99a>] do_filp_open+0x1fa/0xd20
 [<ffffffff8109b39c>] ? remove_wait_queue+0x3c/0x50
 [<ffffffff81016c71>] ? fpu_finit+0x21/0x40
 [<ffffffff8128f83a>] ? strncpy_from_user+0x4a/0x90
 [<ffffffff811a8b82>] ? alloc_fd+0x92/0x160
 [<ffffffff81185be9>] do_sys_open+0x69/0x140
 [<ffffffff81185d00>] sys_open+0x20/0x30
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
...
INFO: task ls:662 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ls            D 0000000000000005     0   662  27725 0x00000080
 ffff880507b91a78 0000000000000082 0000029600000000 ffff880800aff0b8
 0000000000000000 0000000000000001 ffff880812126000 ffff8807fe90b088
 ffff880544ddd058 ffff880507b91fd8 000000000000fbc8 ffff880544ddd058
Call Trace:
 [<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff811a28ef>] ? d_free+0x3f/0x60
 [<ffffffff8152a45b>] mutex_lock+0x2b/0x50
 [<ffffffff811989ab>] do_lookup+0x11b/0x230
 [<ffffffff811996a4>] __link_path_walk+0x7a4/0x1000
 [<ffffffffa12e22d0>] ? ll_follow_link+0x350/0xdb0 [lustre]
 [<ffffffff81199bdf>] __link_path_walk+0xcdf/0x1000
 [<ffffffff8119a1ba>] path_walk+0x6a/0xe0
 [<ffffffff8119a3cb>] filename_lookup+0x6b/0xc0
 [<ffffffff81226d56>] ? security_file_alloc+0x16/0x20
 [<ffffffff8119b8a4>] do_filp_open+0x104/0xd20
 [<ffffffff811a28ef>] ? d_free+0x3f/0x60
 [<ffffffff8128f83a>] ? strncpy_from_user+0x4a/0x90
 [<ffffffff811a8b82>] ? alloc_fd+0x92/0x160
 [<ffffffff81185be9>] do_sys_open+0x69/0x140
 [<ffffffff81185d00>] sys_open+0x20/0x30
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
...
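The INFO blocks above come from the kernel's hung-task watchdog, and the log itself shows how to silence it. For reference, the relevant sysctls (a config sketch only; requires root, and the values shown are illustrative, not recommendations):

```shell
# Hung-task watchdog tunables (run as root on the affected node):
sysctl kernel.hung_task_timeout_secs        # current threshold (120s in this run)
sysctl -w kernel.hung_task_timeout_secs=0   # 0 disables the warning entirely
sysctl -w kernel.hung_task_panic=1          # or panic on a hung task to force a crash dump
```

Disabling the warning does not fix the underlying stuck mutex, of course; the traces themselves (both tasks blocked in `mutex_lock` under `do_lookup`) are the useful signal here.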


Comments
Comment by Jian Yu [ 04/Dec/14 ]

The same failure occurred consistently on the master branch with a DNE configuration:
https://testing.hpdd.intel.com/test_sets/aab4bc48-773e-11e4-b1ab-5254006e85c2
https://testing.hpdd.intel.com/test_sets/ec11e802-7705-11e4-8068-5254006e85c2
https://testing.hpdd.intel.com/test_sets/a719387e-76fe-11e4-8068-5254006e85c2
https://testing.hpdd.intel.com/test_sets/7110661c-757d-11e4-817b-5254006e85c2
https://testing.hpdd.intel.com/test_sets/ce4421a8-736a-11e4-9f3d-5254006e85c2

On client node:

18:42:51:LustreError: 5242:0:(file.c:3036:ll_migrate()) lustre: migrate 15 , but fid [0x0:0x0:0x0] is insane
18:42:51:LustreError: 5657:0:(file.c:3036:ll_migrate()) lustre: migrate 13 , but fid [0x0:0x0:0x0] is insane
<~snip~>
18:56:26:INFO: task dir_create.sh:3672 blocked for more than 120 seconds.
18:56:26:      Not tainted 2.6.32-431.29.2.el6.x86_64 #1
18:56:26:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
18:56:26:dir_create.sh D 0000000000000001     0  3672   3603 0x00000084
18:56:26: ffff88006a787b98 0000000000000082 0000004b00000000 ffffffffa0b0397b
18:56:26: 0000000000000098 0020000000000080 5477e3cb00000001 0000000000050cbc
18:56:26: ffff88006823e638 ffff88006a787fd8 000000000000fbc8 ffff88006823e638
18:56:26:Call Trace:
18:56:26: [<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180
18:56:26: [<ffffffffa03e01a1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
18:56:26: [<ffffffff8152a45b>] mutex_lock+0x2b/0x50
18:56:26: [<ffffffff811989ab>] do_lookup+0x11b/0x230
18:56:26: [<ffffffff81199100>] __link_path_walk+0x200/0x1000
18:56:26: [<ffffffff8119a1ba>] path_walk+0x6a/0xe0
18:56:26: [<ffffffff8119b99a>] do_filp_open+0x1fa/0xd20
18:56:26: [<ffffffff8109b39c>] ? remove_wait_queue+0x3c/0x50
18:56:26: [<ffffffff81016c71>] ? fpu_finit+0x21/0x40
18:56:26: [<ffffffff8128f83a>] ? strncpy_from_user+0x4a/0x90
18:56:26: [<ffffffff811a8b82>] ? alloc_fd+0x92/0x160
18:56:26: [<ffffffff81185be9>] do_sys_open+0x69/0x140
18:56:26: [<ffffffff81185d00>] sys_open+0x20/0x30
18:56:26: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

On MDS node:

18:42:34:LustreError: 4812:0:(mdd_dir.c:4021:mdd_migrate()) lustre-MDD0000: [0x3c0000403:0x7f4:0x0]18 is already opened count 1: rc = -16
18:48:57:LustreError: 4812:0:(mdd_dir.c:4021:mdd_migrate()) Skipped 14 previous similar messages
18:48:57:LustreError: 4820:0:(mdt_open.c:1580:mdt_cross_open()) lustre-MDT0000: [0x3c0000400:0x8b9:0x0] doesn't exist!: rc = -14
18:48:57:LustreError: 4820:0:(mdt_open.c:1580:mdt_cross_open()) Skipped 3 previous similar messages
18:48:57:LustreError: 4811:0:(mdt_reint.c:1509:mdt_reint_migrate_internal()) lustre-MDT0000: source [0x400000401:0xbc5:0x0] is on the remote MDT
18:48:57:LustreError: 4807:0:(mdt_reint.c:1509:mdt_reint_migrate_internal()) lustre-MDT0000: source [0x400000400:0xc51:0x0] is on the remote MDT
18:48:57:LustreError: 4807:0:(mdt_reint.c:1509:mdt_reint_migrate_internal()) Skipped 1 previous similar message
18:48:57:LustreError: 3879:0:(mdt_reint.c:1155:mdt_reint_link()) lustre-MDT0000: source inode [0x400000403:0xe13:0x0] on remote MDT from [0x3c0000400:0xddd:0x0]
18:48:57:LustreError: 3879:0:(mdt_reint.c:1155:mdt_reint_link()) Skipped 62 previous similar messages
18:48:57:LustreError: 4846:0:(mdt_reint.c:1509:mdt_reint_migrate_internal()) lustre-MDT0000: source [0x400000402:0xf24:0x0] is on the remote MDT
18:48:57:LustreError: 4846:0:(mdt_reint.c:1509:mdt_reint_migrate_internal()) Skipped 6 previous similar messages
18:48:57:LustreError: 4860:0:(mdt_reint.c:1518:mdt_reint_migrate_internal()) lustre-MDT0000: parent [0x3c0000400:0x1:0x0] is still on the same MDT, which should be migrated first: rc = -1
18:48:57:LustreError: 4860:0:(mdt_reint.c:1518:mdt_reint_migrate_internal()) Skipped 45 previous similar messages
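The `mdt_reint_migrate_internal()` errors ("parent ... should be migrated first", "source ... is on the remote MDT") are produced while racer randomly migrates directories between MDTs, so ordering violations are expected under this workload. For reference, a hedged sketch of the ordering the message asks for, using the standard `lfs migrate -m` interface (a usage sketch only; the mount point, paths, and MDT index are hypothetical and it requires a live DNE filesystem):

```shell
# DNE directory migration ordering sketch: move the parent's metadata to
# the target MDT before migrating its descendants, which is the ordering
# the "should be migrated first" error enforces.
lfs migrate -m 1 /mnt/lustre/parent          # migrate the parent first
lfs migrate -m 1 /mnt/lustre/parent/child    # then its descendants
lfs getdirstripe /mnt/lustre/parent          # verify the new MDT index
```

Racer issues such migrations concurrently with unlinks and renames, which is what makes these -16 (EBUSY) and ordering errors, and the blocked `mdt00_013` thread below, plausible outcomes of the race rather than isolated faults.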
<~snip~>
18:56:13:INFO: task mdt00_013:4815 blocked for more than 120 seconds.
18:56:13:      Not tainted 2.6.32-431.29.2.el6_lustre.g2856d31.x86_64 #1
18:56:13:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
18:56:13:mdt00_013     D 0000000000000001     0  4815      2 0x00000080
18:56:13: ffff88005cbcfa90 0000000000000046 0000000000000000 ffff880059fe1308
18:56:13: ffff880059fe1308 ffff88007a5df000 ffff88005cbcfa90 ffffffffa0d66e99
18:56:13: ffff8800674b5058 ffff88005cbcffd8 000000000000fbc8 ffff8800674b5058
18:56:13:Call Trace:
18:56:13: [<ffffffffa0d66e99>] ? lu_object_find_try+0x99/0x2b0 [obdclass]
19:00:17: [<ffffffffa0d670e5>] lu_object_find_at+0x35/0x100 [obdclass]
19:00:17: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
19:00:17: [<ffffffffa1b841c9>] ? lod_xattr_get+0x229/0x760 [lod]
19:00:17: [<ffffffffa0d671c6>] lu_object_find+0x16/0x20 [obdclass]
19:00:17: [<ffffffffa1ab2066>] mdt_object_find+0x56/0x170 [mdt]
19:00:17: [<ffffffffa1abb272>] mdt_object_find_lock+0x42/0x170 [mdt]
19:00:17: [<ffffffffa1ad90f8>] mdt_lock_slaves+0x228/0x520 [mdt]
19:00:17: [<ffffffffa1adafd3>] mdt_reint_unlink+0x8c3/0x10c0 [mdt]
19:00:17: [<ffffffffa0d83c10>] ? lu_ucred+0x20/0x30 [obdclass]
19:00:17: [<ffffffffa1ab0ed5>] ? mdt_ucred+0x15/0x20 [mdt]
19:00:17: [<ffffffffa1ad10bd>] mdt_reint_rec+0x5d/0x200 [mdt]
19:00:17: [<ffffffffa1ab518b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
19:00:17: [<ffffffffa1ab59eb>] mdt_reint+0x6b/0x120 [mdt]
19:00:17: [<ffffffffa0f979ce>] tgt_request_handle+0x6fe/0xaf0 [ptlrpc]
19:00:17: [<ffffffffa0f47331>] ptlrpc_main+0xe41/0x1950 [ptlrpc]
19:00:17: [<ffffffffa0f464f0>] ? ptlrpc_main+0x0/0x1950 [ptlrpc]
19:00:17: [<ffffffff8109abf6>] kthread+0x96/0xa0
19:00:17: [<ffffffff8100c20a>] child_rip+0xa/0x20
19:00:17: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
19:00:17: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Comment by Sarah Liu [ 02/Jul/15 ]

Also hit with
server: lustre-master build #3072 EL7
client: SLES11 SP3

https://testing.hpdd.intel.com/test_sets/552b29d0-15a7-11e5-8d31-5254006e85c2

Comment by Oleg Drokin [ 31/Aug/15 ]

The last racer failure reported here is a different issue on EL7: the MDS crashed due to journaling changes.

Comment by Sarah Liu [ 23/Sep/15 ]

Another instance on master:
https://testing.hpdd.intel.com/test_sets/439d9980-60b3-11e5-ba37-5254006e85c2

client and server: lustre-master build #3192, RHEL 6.7, DNE

Generated at Sat Feb 10 01:55:37 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.