
[LU-5915] racer test_1: FAIL: test_1 failed with 4

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.7.0, Lustre 2.8.0, Lustre 2.9.0
    • Environment: OpenSFS cluster with two MDSs (one MDT each), three OSSs (two OSTs each), and three clients; Lustre master tag 2.6.90, build #2734
    • 3
    • 16514

    Description

      Racer test_1 failed with 'test_1 failed with 4'. Logs at https://testing.hpdd.intel.com/test_sets/525741e4-6a21-11e4-aeb4-5254006e85c2

      From the client test_log, there are many lines like the following:

      c13: /usr/lib64/lustre/tests/racer/racer.sh: line 70: ./file_delxattr.sh: No such file or directory
      c13: /usr/lib64/lustre/tests/racer/racer.sh: line 70: ./file_truncate.sh: No such file or directory
      c13: /usr/lib64/lustre/tests/racer/racer.sh: line 70: ./file_chmod.sh: No such file or directory
      c13: /usr/lib64/lustre/tests/racer/racer.sh: line 70: ./file_chown.sh: No such file or directory
      c13: /usr/lib64/lustre/tests/racer/racer.sh: line 70: ./file_truncate.sh: No such file or directory
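      For reference, racer drives its workload by launching the small file_*.sh scripts by relative path from the racer directory, so a script that is missing on the client fails at the launch line itself rather than inside a test. The sketch below only illustrates that launch pattern; it is not the upstream racer.sh, and the paths, script list, and variables are assumptions.

      #!/bin/bash
      # Illustrative sketch of a racer-style launcher (not the upstream racer.sh).
      # RACER_DIR and TESTDIR are assumed values for this example.
      RACER_DIR=${RACER_DIR:-/usr/lib64/lustre/tests/racer}
      TESTDIR=${TESTDIR:-/mnt/lustre/racer}
      DURATION=${DURATION:-300}

      cd "$RACER_DIR" || exit 1
      for prog in dir_create file_create file_rename file_link file_symlink \
                  file_list file_rm file_chmod file_chown file_truncate \
                  file_delxattr file_getxattr file_setxattr; do
          # If ./file_xxx.sh is absent on the client, this line prints
          # "./file_xxx.sh: No such file or directory", as in the excerpt above.
          ./"${prog}.sh" "$TESTDIR" &
      done

      sleep "$DURATION"
      # Stop all background workload scripts after the run window.
      kill $(jobs -p) 2>/dev/null
      wait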
      

      From the client dmesg log:

      LustreError: 29864:0:(file.c:3040:ll_migrate()) scratch: migrate 1 , but fid [0x0:0x0:0x0] is insane
      LustreError: 29864:0:(file.c:3040:ll_migrate()) Skipped 1 previous similar message
      LustreError: 29864:0:(file.c:3040:ll_migrate()) scratch: migrate 9 , but fid [0x0:0x0:0x0] is insane
      LustreError: 29864:0:(file.c:3040:ll_migrate()) Skipped 2 previous similar messages
      LustreError: 1562:0:(file.c:3040:ll_migrate()) scratch: migrate 2 , but fid [0x0:0x0:0x0] is insane
      LustreError: 1562:0:(file.c:3040:ll_migrate()) Skipped 3 previous similar messages
      LustreError: 3290:0:(file.c:3040:ll_migrate()) scratch: migrate 2 , but fid [0x0:0x0:0x0] is insane
      LustreError: 3290:0:(file.c:3040:ll_migrate()) Skipped 12 previous similar messages
      INFO: task dir_create.sh:27528 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      dir_create.sh D 0000000000000004     0 27528  27501 0x00000080
       ffff8805c6993b98 0000000000000086 0000004b00000000 ffffffffa130f683
       000000000000009b 0020000000000080 545dced000000004 00000000000e4a17
       ffff88081325bab8 ffff8805c6993fd8 000000000000fbc8 ffff88081325bab8
      Call Trace:
       [<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180
       [<ffffffff811a4148>] ? __d_lookup+0xd8/0x150
       [<ffffffff8152a45b>] mutex_lock+0x2b/0x50
       [<ffffffff811989ab>] do_lookup+0x11b/0x230
       [<ffffffff81199100>] __link_path_walk+0x200/0x1000
       [<ffffffff8119a1ba>] path_walk+0x6a/0xe0
       [<ffffffff8119b99a>] do_filp_open+0x1fa/0xd20
       [<ffffffff8109b39c>] ? remove_wait_queue+0x3c/0x50
       [<ffffffff81016c71>] ? fpu_finit+0x21/0x40
       [<ffffffff8128f83a>] ? strncpy_from_user+0x4a/0x90
       [<ffffffff811a8b82>] ? alloc_fd+0x92/0x160
       [<ffffffff81185be9>] do_sys_open+0x69/0x140
       [<ffffffff81185d00>] sys_open+0x20/0x30
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      ...
      INFO: task ls:662 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      ls            D 0000000000000005     0   662  27725 0x00000080
       ffff880507b91a78 0000000000000082 0000029600000000 ffff880800aff0b8
       0000000000000000 0000000000000001 ffff880812126000 ffff8807fe90b088
       ffff880544ddd058 ffff880507b91fd8 000000000000fbc8 ffff880544ddd058
      Call Trace:
       [<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180
       [<ffffffff811a28ef>] ? d_free+0x3f/0x60
       [<ffffffff8152a45b>] mutex_lock+0x2b/0x50
       [<ffffffff811989ab>] do_lookup+0x11b/0x230
       [<ffffffff811996a4>] __link_path_walk+0x7a4/0x1000
       [<ffffffffa12e22d0>] ? ll_follow_link+0x350/0xdb0 [lustre]
       [<ffffffff81199bdf>] __link_path_walk+0xcdf/0x1000
       [<ffffffff8119a1ba>] path_walk+0x6a/0xe0
       [<ffffffff8119a3cb>] filename_lookup+0x6b/0xc0
       [<ffffffff81226d56>] ? security_file_alloc+0x16/0x20
       [<ffffffff8119b8a4>] do_filp_open+0x104/0xd20
       [<ffffffff811a28ef>] ? d_free+0x3f/0x60
       [<ffffffff8128f83a>] ? strncpy_from_user+0x4a/0x90
       [<ffffffff811a8b82>] ? alloc_fd+0x92/0x160
       [<ffffffff81185be9>] do_sys_open+0x69/0x140
       [<ffffffff81185d00>] sys_open+0x20/0x30
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      ...
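      The "blocked for more than 120 seconds" traces come from the kernel hung-task watchdog referenced in the message above. When reproducing, the standard Linux knobs below can be used to lower the reporting threshold and to dump all D-state tasks on demand (the values shown are illustrative):

      # Inspect and lower the hung-task watchdog threshold (default 120s above).
      cat /proc/sys/kernel/hung_task_timeout_secs
      sysctl -w kernel.hung_task_timeout_secs=60
      # Dump stacks of all blocked (D-state) tasks via sysrq, then collect them.
      echo w > /proc/sysrq-trigger
      dmesg | grep -A 30 'blocked for more than'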
      

Attachments

Issue Links

Activity

            [LU-5915] racer test_1: FAIL: test_1 failed with 4
            sarah Sarah Liu added a comment -

            another instance on master
            https://testing.hpdd.intel.com/test_sets/439d9980-60b3-11e5-ba37-5254006e85c2

            client and server: lustre-master build #3192, RHEL 6.7, DNE

            green Oleg Drokin added a comment -

            The last reported racer failure is a different issue on EL7: the MDS crashed due to journaling changes.

            sarah Sarah Liu added a comment -

            Also hit with
            server: lustre-master build #3072 EL7
            client: SLES11 SP3

            https://testing.hpdd.intel.com/test_sets/552b29d0-15a7-11e5-8d31-5254006e85c2

            yujian Jian Yu added a comment -

            The same failure occurred consistently on the master branch with a DNE configuration:
            https://testing.hpdd.intel.com/test_sets/aab4bc48-773e-11e4-b1ab-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/ec11e802-7705-11e4-8068-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/a719387e-76fe-11e4-8068-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/7110661c-757d-11e4-817b-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/ce4421a8-736a-11e4-9f3d-5254006e85c2

            On client node:

            18:42:51:LustreError: 5242:0:(file.c:3036:ll_migrate()) lustre: migrate 15 , but fid [0x0:0x0:0x0] is insane
            18:42:51:LustreError: 5657:0:(file.c:3036:ll_migrate()) lustre: migrate 13 , but fid [0x0:0x0:0x0] is insane
            <~snip~>
            18:56:26:INFO: task dir_create.sh:3672 blocked for more than 120 seconds.
            18:56:26:      Not tainted 2.6.32-431.29.2.el6.x86_64 #1
            18:56:26:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            18:56:26:dir_create.sh D 0000000000000001     0  3672   3603 0x00000084
            18:56:26: ffff88006a787b98 0000000000000082 0000004b00000000 ffffffffa0b0397b
            18:56:26: 0000000000000098 0020000000000080 5477e3cb00000001 0000000000050cbc
            18:56:26: ffff88006823e638 ffff88006a787fd8 000000000000fbc8 ffff88006823e638
            18:56:26:Call Trace:
            18:56:26: [<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180
            18:56:26: [<ffffffffa03e01a1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            18:56:26: [<ffffffff8152a45b>] mutex_lock+0x2b/0x50
            18:56:26: [<ffffffff811989ab>] do_lookup+0x11b/0x230
            18:56:26: [<ffffffff81199100>] __link_path_walk+0x200/0x1000
            18:56:26: [<ffffffff8119a1ba>] path_walk+0x6a/0xe0
            18:56:26: [<ffffffff8119b99a>] do_filp_open+0x1fa/0xd20
            18:56:26: [<ffffffff8109b39c>] ? remove_wait_queue+0x3c/0x50
            18:56:26: [<ffffffff81016c71>] ? fpu_finit+0x21/0x40
            18:56:26: [<ffffffff8128f83a>] ? strncpy_from_user+0x4a/0x90
            18:56:26: [<ffffffff811a8b82>] ? alloc_fd+0x92/0x160
            18:56:26: [<ffffffff81185be9>] do_sys_open+0x69/0x140
            18:56:26: [<ffffffff81185d00>] sys_open+0x20/0x30
            18:56:26: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            

            On MDS node:

            18:42:34:LustreError: 4812:0:(mdd_dir.c:4021:mdd_migrate()) lustre-MDD0000: [0x3c0000403:0x7f4:0x0]18 is already opened count 1: rc = -16
            18:48:57:LustreError: 4812:0:(mdd_dir.c:4021:mdd_migrate()) Skipped 14 previous similar messages
            18:48:57:LustreError: 4820:0:(mdt_open.c:1580:mdt_cross_open()) lustre-MDT0000: [0x3c0000400:0x8b9:0x0] doesn't exist!: rc = -14
            18:48:57:LustreError: 4820:0:(mdt_open.c:1580:mdt_cross_open()) Skipped 3 previous similar messages
            18:48:57:LustreError: 4811:0:(mdt_reint.c:1509:mdt_reint_migrate_internal()) lustre-MDT0000: source [0x400000401:0xbc5:0x0] is on the remote MDT
            18:48:57:LustreError: 4807:0:(mdt_reint.c:1509:mdt_reint_migrate_internal()) lustre-MDT0000: source [0x400000400:0xc51:0x0] is on the remote MDT
            18:48:57:LustreError: 4807:0:(mdt_reint.c:1509:mdt_reint_migrate_internal()) Skipped 1 previous similar message
            18:48:57:LustreError: 3879:0:(mdt_reint.c:1155:mdt_reint_link()) lustre-MDT0000: source inode [0x400000403:0xe13:0x0] on remote MDT from [0x3c0000400:0xddd:0x0]
            18:48:57:LustreError: 3879:0:(mdt_reint.c:1155:mdt_reint_link()) Skipped 62 previous similar messages
            18:48:57:LustreError: 4846:0:(mdt_reint.c:1509:mdt_reint_migrate_internal()) lustre-MDT0000: source [0x400000402:0xf24:0x0] is on the remote MDT
            18:48:57:LustreError: 4846:0:(mdt_reint.c:1509:mdt_reint_migrate_internal()) Skipped 6 previous similar messages
            18:48:57:LustreError: 4860:0:(mdt_reint.c:1518:mdt_reint_migrate_internal()) lustre-MDT0000: parent [0x3c0000400:0x1:0x0] is still on the same MDT, which should be migrated first: rc = -1
            18:48:57:LustreError: 4860:0:(mdt_reint.c:1518:mdt_reint_migrate_internal()) Skipped 45 previous similar messages
            <~snip~>
            18:56:13:INFO: task mdt00_013:4815 blocked for more than 120 seconds.
            18:56:13:      Not tainted 2.6.32-431.29.2.el6_lustre.g2856d31.x86_64 #1
            18:56:13:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            18:56:13:mdt00_013     D 0000000000000001     0  4815      2 0x00000080
            18:56:13: ffff88005cbcfa90 0000000000000046 0000000000000000 ffff880059fe1308
            18:56:13: ffff880059fe1308 ffff88007a5df000 ffff88005cbcfa90 ffffffffa0d66e99
            18:56:13: ffff8800674b5058 ffff88005cbcffd8 000000000000fbc8 ffff8800674b5058
            18:56:13:Call Trace:
            18:56:13: [<ffffffffa0d66e99>] ? lu_object_find_try+0x99/0x2b0 [obdclass]
            19:00:17: [<ffffffffa0d670e5>] lu_object_find_at+0x35/0x100 [obdclass]
            19:00:17: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
            19:00:17: [<ffffffffa1b841c9>] ? lod_xattr_get+0x229/0x760 [lod]
            19:00:17: [<ffffffffa0d671c6>] lu_object_find+0x16/0x20 [obdclass]
            19:00:17: [<ffffffffa1ab2066>] mdt_object_find+0x56/0x170 [mdt]
            19:00:17: [<ffffffffa1abb272>] mdt_object_find_lock+0x42/0x170 [mdt]
            19:00:17: [<ffffffffa1ad90f8>] mdt_lock_slaves+0x228/0x520 [mdt]
            19:00:17: [<ffffffffa1adafd3>] mdt_reint_unlink+0x8c3/0x10c0 [mdt]
            19:00:17: [<ffffffffa0d83c10>] ? lu_ucred+0x20/0x30 [obdclass]
            19:00:17: [<ffffffffa1ab0ed5>] ? mdt_ucred+0x15/0x20 [mdt]
            19:00:17: [<ffffffffa1ad10bd>] mdt_reint_rec+0x5d/0x200 [mdt]
            19:00:17: [<ffffffffa1ab518b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
            19:00:17: [<ffffffffa1ab59eb>] mdt_reint+0x6b/0x120 [mdt]
            19:00:17: [<ffffffffa0f979ce>] tgt_request_handle+0x6fe/0xaf0 [ptlrpc]
            19:00:17: [<ffffffffa0f47331>] ptlrpc_main+0xe41/0x1950 [ptlrpc]
            19:00:17: [<ffffffffa0f464f0>] ? ptlrpc_main+0x0/0x1950 [ptlrpc]
            19:00:17: [<ffffffff8109abf6>] kthread+0x96/0xa0
            19:00:17: [<ffffffff8100c20a>] child_rip+0xa/0x20
            19:00:17: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
            19:00:17: [<ffffffff8100c200>] ? child_rip+0x0/0x20
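            The mdt_reint_migrate_internal() and "is already opened" errors above appear to come from DNE directory migration racing the other racer operations (link, unlink, rename) on the same tree. For reference, the kind of migration being exercised looks like the following; the path and MDT index are assumed for illustration, and racer's own dir_migrate script may differ:

            # Illustrative only (assumed path and MDT index, not from the failed run):
            lfs getdirstripe /mnt/lustre/racer/dir0    # show which MDT holds the directory
            lfs migrate -m 1 /mnt/lustre/racer/dir0    # migrate the directory to MDT0001
            # Racer issues such migrations concurrently with link/unlink/rename on the
            # same subtree, which is consistent with the -EBUSY ("already opened") and
            # "source ... is on the remote MDT" errors in the MDS log above.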
            

            People

              Assignee: wc-triage WC Triage
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: