Lustre / LU-6758

racer test_1: test failed to respond and timed out

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • None
    • Affects Versions: Lustre 2.8.0, Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0
    • None
    • Environment:
      server: lustre-master build # 3071 EL7 ldiskfs
      client: lustre-master build # 3071 EL6 ldiskfs
    • Severity: 3

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/8303c8ec-13d8-11e5-b4b0-5254006e85c2.

      The sub-test test_1 failed with the following error:

      test failed to respond and timed out
      

      There are no MDS logs.
      Client console log:

      16:40:41:Lustre: DEBUG MARKER: == racer test 1: racer on clients: onyx-40vm5,onyx-40vm6.onyx.hpdd.intel.com DURATION=900 == 09:30:44 (1434299444)
      16:40:41:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 				   LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer 
      16:40:41:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 				   LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer 
      16:40:41:LustreError: 21155:0:(llite_lib.c:1497:ll_md_setattr()) md_setattr fails: rc = -30
      16:40:42:Lustre: 21248:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299452/real 1434299452]  req@ffff88006d4753c0 x1503970942065616/t0(0) o36->lustre-MDT0000-mdc-ffff88007c74cc00@10.2.4.185@tcp:12/10 lens 496/568 e 0 to 1 dl 1434299459 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      16:40:42:Lustre: lustre-MDT0000-mdc-ffff88007c74cc00: Connection to lustre-MDT0000 (at 10.2.4.185@tcp) was lost; in progress operations using this service will wait for recovery to complete
      16:40:43:Lustre: 31077:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299452/real 1434299452]  req@ffff880070c73c80 x1503970942069540/t0(0) o400->MGC10.2.4.185@tcp@10.2.4.185@tcp:26/25 lens 224/224 e 0 to 1 dl 1434299459 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      16:40:43:Lustre: 31077:0:(client.c:2003:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
      16:40:43:LustreError: 166-1: MGC10.2.4.185@tcp: Connection to MGS (at 10.2.4.185@tcp) was lost; in progress operations using this service will fail
      16:40:43:Lustre: 31077:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299457/real 1434299457]  req@ffff880070cd1080 x1503970942081192/t0(0) o400->lustre-MDT0000-mdc-ffff880079777c00@10.2.4.185@tcp:12/10 lens 224/224 e 0 to 1 dl 1434299464 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      16:40:43:Lustre: 31077:0:(client.c:2003:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      16:40:43:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299470/real 1434299470]  req@ffff880070167380 x1503970942081360/t0(0) o250->MGC10.2.4.185@tcp@10.2.4.185@tcp:26/25 lens 520/544 e 0 to 1 dl 1434299481 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      16:40:43:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      16:40:43:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1434299485/real 0]  req@ffff880070a2c3c0 x1503970942081548/t0(0) o250->MGC10.2.4.185@tcp@10.2.4.185@tcp:26/25 lens 520/544 e 0 to 1 dl 1434299501 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      16:40:44:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      16:40:44:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299515/real 1434299515]  req@ffff88007010b080 x1503970942081892/t0(0) o38->lustre-MDT0000-mdc-ffff880079777c00@10.2.4.185@tcp:12/10 lens 520/544 e 0 to 1 dl 1434299531 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      16:40:44:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299535/real 1434299536]  req@ffff88006d7f20c0 x1503970942082156/t0(0) o38->lustre-MDT0000-mdc-ffff88007c74cc00@10.2.4.185@tcp:12/10 lens 520/544 e 0 to 1 dl 1434299556 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      16:40:44:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      16:40:44:INFO: task dir_create.sh:5103 blocked for more than 120 seconds.
      16:40:44:      Not tainted 2.6.32-504.16.2.el6.x86_64 #1
      16:40:44:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      16:40:45:dir_create.sh D 0000000000000000     0  5103   5096 0x00000080
      16:40:45: ffff88007c2e1bc8 0000000000000082 0000000000000000 ffff880000000000
      16:40:45: ffff880000000065 ffff88007a19e104 ffffffffa1b012c0 0000000000000000
      16:40:46: 000200000a0204b9 ffffffffa1dba813 ffff880079059068 ffff88007c2e1fd8
      16:40:46:Call Trace:
      16:40:46: [<ffffffffa1d5f5f5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
      16:40:46: [<ffffffffa1d86a22>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
      16:40:46: [<ffffffff8152b7a6>] __mutex_lock_slowpath+0x96/0x210
      16:40:46: [<ffffffff8152b2cb>] mutex_lock+0x2b/0x50
      16:40:46: [<ffffffffa1f40c7a>] mdc_close+0x19a/0xa60 [mdc]
      16:40:46: [<ffffffffa2017c09>] ? ll_i2suppgid+0x19/0x30 [lustre]
      16:40:46: [<ffffffffa1efefe7>] lmv_close+0x2c7/0x580 [lmv]
      16:40:46: [<ffffffffa1feac4f>] ll_close_inode_openhandle+0x2ef/0xe90 [lustre]
      16:40:47: [<ffffffffa1febb97>] ll_md_real_close+0x197/0x210 [lustre]
      16:40:47: [<ffffffffa1fed331>] ll_file_release+0x641/0xad0 [lustre]
      16:40:47: [<ffffffff8118fb85>] __fput+0xf5/0x210
      16:40:47: [<ffffffff8118fcc5>] fput+0x25/0x30
      16:40:47: [<ffffffff8118af1d>] filp_close+0x5d/0x90
      16:40:47: [<ffffffff811a3760>] sys_dup3+0x130/0x190
      16:40:47: [<ffffffff811a37d4>] sys_dup2+0x14/0x50
      16:40:47: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      16:40:47:INFO: task cat:5274 blocked for more than 120 seconds.
      16:40:47:      Not tainted 2.6.32-504.16.2.el6.x86_64 #1
      16:40:47:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      16:40:47:cat           D 0000000000000000     0  5274   5133 0x00000080
      16:40:47: ffff880071bfbbf8 0000000000000086 0000000000000000 ffff880000000000
      16:40:47: ffff880000000065 ffff88007a19e104 ffffffffa1b01f80 0000000000000000
      16:40:47: 000200000a0204b9 ffffffffa1dba813 ffff880071bf9ad8 ffff880071bfbfd8
      16:40:48:Call Trace:
      16:40:48: [<ffffffffa1d5f5f5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
      16:40:48: [<ffffffffa1d86a22>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
      16:40:48: [<ffffffff8152b7a6>] __mutex_lock_slowpath+0x96/0x210
      16:40:48: [<ffffffff8152b2cb>] mutex_lock+0x2b/0x50
      16:40:48: [<ffffffffa1f40c7a>] mdc_close+0x19a/0xa60 [mdc]
      16:40:49: [<ffffffffa2017c09>] ? ll_i2suppgid+0x19/0x30 [lustre]
      16:40:49: [<ffffffffa1efefe7>] lmv_close+0x2c7/0x580 [lmv]
      16:40:49: [<ffffffffa1feac4f>] ll_close_inode_openhandle+0x2ef/0xe90 [lustre]
      16:40:49: [<ffffffffa1febb97>] ll_md_real_close+0x197/0x210 [lustre]
      16:40:49: [<ffffffffa1fed331>] ll_file_release+0x641/0xad0 [lustre]
      16:40:49: [<ffffffff8118fb85>] __fput+0xf5/0x210
      16:40:49: [<ffffffff8118fcc5>] fput+0x25/0x30
      16:40:50: [<ffffffff8118af1d>] filp_close+0x5d/0x90
      16:40:50: [<ffffffff8118aff5>] sys_close+0xa5/0x100
      16:40:50: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      16:40:50:INFO: task dd:6983 blocked for more than 120 seconds.
      16:40:50:      Not tainted 2.6.32-504.16.2.el6.x86_64 #1
      16:40:50:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      16:40:50:dd            D 0000000000000000     0  6983   5154 0x00000080
      16:40:50: ffff880071917bf8 0000000000000086 0000000000000000 ffff880000000000
      16:40:50: ffff880000000065 ffff88007a56c244 ffffffffa1b03a00 0000000000000000
      16:40:51: 000200000a0204b9 ffffffffa1dba813 ffff880070fafad8 ffff880071917fd8
      16:40:51:Call Trace:
      16:40:51: [<ffffffffa1d5f5f5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
      16:40:51: [<ffffffffa1d86a22>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
      16:40:51: [<ffffffff8152b7a6>] __mutex_lock_slowpath+0x96/0x210
      16:40:51: [<ffffffff8152b2cb>] mutex_lock+0x2b/0x50
      16:40:51: [<ffffffffa1f40c7a>] mdc_close+0x19a/0xa60 [mdc]
      16:40:51: [<ffffffffa2017c09>] ? ll_i2suppgid+0x19/0x30 [lustre]
      16:40:52: [<ffffffffa1efefe7>] lmv_close+0x2c7/0x580 [lmv]
      16:40:52: [<ffffffffa1feac4f>] ll_close_inode_openhandle+0x2ef/0xe90 [lustre]
      16:40:52: [<ffffffffa1febb97>] ll_md_real_close+0x197/0x210 [lustre]
      16:40:52: [<ffffffffa1fed331>] ll_file_release+0x641/0xad0 [lustre]
      16:40:53: [<ffffffff8118fb85>] __fput+0xf5/0x210
      16:40:53: [<ffffffff8118fcc5>] fput+0x25/0x30
      16:40:53: [<ffffffff8118af1d>] filp_close+0x5d/0x90
      16:40:53: [<ffffffff8118aff5>] sys_close+0xa5/0x100
      16:40:53: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
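      The blocked-task reports above come from the kernel's hung-task watchdog, which flags any task in uninterruptible sleep for more than 120 seconds. For triage it can help to pull every such report out of a saved console log; a minimal sketch, assuming the log has been saved as console.log (an illustrative name, not a file from this test run):

```shell
# Sketch: list every task flagged by the hung-task watchdog in a saved
# console log. The sample lines below mirror the ticket's log format.
cat > console.log <<'EOF'
16:40:44:INFO: task dir_create.sh:5103 blocked for more than 120 seconds.
16:40:47:INFO: task cat:5274 blocked for more than 120 seconds.
16:40:50:INFO: task dd:6983 blocked for more than 120 seconds.
EOF

# Prints one "task <comm>:<pid> blocked" line per watchdog report.
grep -o 'task [^ ]* blocked' console.log
```

      Adding `-A 25` to a plain `grep 'blocked for more than' console.log` would also print the stack trace that follows each report.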
      

      Activity


            Created LU-11359 to track this issue, because this ticket probably covers several distinct issues.

            James Nunez (Inactive) added a comment

            We're seeing this issue, or something very close to it, at:
            https://testing.whamcloud.com/test_sets/df96b5ee-aae1-11e8-80f7-52540065bddc
            https://testing.whamcloud.com/test_sets/e801c64e-a918-11e8-80f7-52540065bddc
            https://testing.whamcloud.com/test_sets/7fcd4432-ab38-11e8-80f7-52540065bddc

            In these runs, we see ls or dir_create hang in the client console logs. In the MDS console log, we see:

            [ 8094.197467] Lustre: DEBUG MARKER: == racer test 1: racer on clients: onyx-41vm10,onyx-41vm9.onyx.whamcloud.com DURATION=300 ============ 01:57:42 (1535507862)
            [ 8106.463255] LNetError: 14167:0:(lib-msg.c:779:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
            [ 8113.461582] Lustre: lustre-MDT0000: Client 27389196-d811-d02a-8653-acd954b84381 (at 10.2.8.140@tcp) reconnecting
            [ 8260.743287] LNet: Service thread pid 27085 was inactive for 62.03s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [ 8260.745290] Pid: 27085, comm: mdt00_003 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Fri Aug 17 20:37:05 UTC 2018
            [ 8260.746424] Call Trace:
            [ 8260.746727]  [<ffffffffc0d55f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
            [ 8260.747759]  [<ffffffffc0d57e93>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
            [ 8260.748642]  [<ffffffffc11b1bee>] mdt_dom_discard_data+0xfe/0x130 [mdt]
            [ 8260.749734]  [<ffffffffc118b603>] mdt_reint_rename_internal.isra.40+0x1923/0x28a0 [mdt]
            [ 8260.750712]  [<ffffffffc118dfcb>] mdt_reint_rename_or_migrate.isra.43+0x19b/0x860 [mdt]
            [ 8260.751671]  [<ffffffffc118e6c3>] mdt_reint_rename+0x13/0x20 [mdt]
            [ 8260.752421]  [<ffffffffc1193033>] mdt_reint_rec+0x83/0x210 [mdt]
            [ 8260.753455]  [<ffffffffc11721d2>] mdt_reint_internal+0x6b2/0xa80 [mdt]
            [ 8260.754249]  [<ffffffffc117d1e7>] mdt_reint+0x67/0x140 [mdt]
            [ 8260.754927]  [<ffffffffc0df431a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [ 8260.755931]  [<ffffffffc0d974ab>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [ 8260.756855]  [<ffffffffc0d9ace4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
            [ 8260.757643]  [<ffffffffa3cbb621>] kthread+0xd1/0xe0
            [ 8260.758294]  [<ffffffffa43205f7>] ret_from_fork_nospec_end+0x0/0x39
            [ 8260.759073]  [<ffffffffffffffff>] 0xffffffffffffffff
            [ 8260.759863] LustreError: dumping log to /tmp/lustre-log.1535508029.27085
            [ 8261.761641] Pid: 3455, comm: mdt00_028 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Fri Aug 17 20:37:05 UTC 2018
            [ 8261.762905] Call Trace:
            [ 8261.763216]  [<ffffffffc0d55f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
            [ 8261.764129]  [<ffffffffc0d57e93>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
            [ 8261.764980]  [<ffffffffc11763c7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
            [ 8261.765776]  [<ffffffffc1176a70>] mdt_object_lock_internal+0x70/0x330 [mdt]
            [ 8261.766546]  [<ffffffffc1177ada>] mdt_getattr_name_lock+0x83a/0x1c00 [mdt]
            [ 8261.767348]  [<ffffffffc117f885>] mdt_intent_getattr+0x2b5/0x480 [mdt]
            [ 8261.768089]  [<ffffffffc117c768>] mdt_intent_policy+0x2f8/0xd10 [mdt]
            [ 8261.768912]  [<ffffffffc0d3ce9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
            [ 8261.769714]  [<ffffffffc0d65523>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
            [ 8261.770557]  [<ffffffffc0deb9d2>] tgt_enqueue+0x62/0x210 [ptlrpc]
            [ 8261.771307]  [<ffffffffc0df431a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [ 8261.772124]  [<ffffffffc0d974ab>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [ 8261.773076]  [<ffffffffc0d9ace4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
            [ 8261.773824]  [<ffffffffa3cbb621>] kthread+0xd1/0xe0
            [ 8261.774388]  [<ffffffffa43205f7>] ret_from_fork_nospec_end+0x0/0x39
            [ 8261.775109]  [<ffffffffffffffff>] 0xffffffffffffffff
            
            James Nunez (Inactive) added a comment

            We still see what looks like this hang with parallel_scale_nfsv3 test_racer_on_nfs. Logs for the most recent hang are at:
            https://testing.hpdd.intel.com/test_sets/59c82126-7286-11e7-bb95-5254006e85c2

            The logs are interesting. On client 1, in the console log, we see a segfault and then hangs in several different processes, including dd, truncate, file_concat.sh, setfattr, ln, and more:

            03:08:04:[11566.277896] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 03:04:22 (1501124662)
            03:08:04:[11566.741676] 8[31347]: segfault at 8 ip 00007fc99889b3b8 sp 00007fff9753dd40 error 4 in ld-2.17.so[7fc998890000+20000]
            03:08:04:[11760.098045] INFO: task dd:3382 blocked for more than 120 seconds.
            03:08:04:[11760.100574] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            03:08:04:[11760.103192] dd              D ffff88007ae64278     0  3382  30972 0x00000080
            03:08:04:[11760.105770]  ffff880036c9fd10 0000000000000082 ffff880078760fb0 ffff880036c9ffd8
            03:08:04:[11760.108465]  ffff880036c9ffd8 ffff880036c9ffd8 ffff880078760fb0 ffff88007ae64270
            03:08:04:[11760.111045]  ffff88007ae64274 ffff880078760fb0 00000000ffffffff ffff88007ae64278
            03:08:04:[11760.113570] Call Trace:
            03:08:04:[11760.115753]  [<ffffffff8168d6c9>] schedule_preempt_disabled+0x29/0x70
            03:08:04:[11760.118168]  [<ffffffff8168b315>] __mutex_lock_slowpath+0xc5/0x1d0
            03:08:04:[11760.120533]  [<ffffffff8168a76f>] mutex_lock+0x1f/0x2f
            03:08:04:[11760.122898]  [<ffffffff811833f6>] generic_file_aio_write+0x46/0xa0
            03:08:04:[11760.125271]  [<ffffffffa057911b>] nfs_file_write+0xbb/0x1e0 [nfs]
            03:08:04:[11760.127568]  [<ffffffff811fdfbd>] do_sync_write+0x8d/0xd0
            03:08:04:[11760.129784]  [<ffffffff811fe82d>] vfs_write+0xbd/0x1e0
            03:08:04:[11760.131928]  [<ffffffff811fe6f7>] ? vfs_read+0xf7/0x170
            03:08:04:[11760.134049]  [<ffffffff811ff34f>] SyS_write+0x7f/0xe0
            03:08:04:[11760.136168]  [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
            03:08:04:[11760.138325] INFO: task truncate:3454 blocked for more than 120 seconds.
            03:08:04:[11760.140499] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            03:08:04:[11760.142775] truncate        D ffffffff8168a630     0  3454  30984 0x00000080
            03:08:04:[11760.145011]  ffff88006e3c3af0 0000000000000082 ffff880079121f60 ffff88006e3c3fd8
            03:08:04:[11760.147321]  ffff88006e3c3fd8 ffff88006e3c3fd8 ffff880079121f60 ffff88007fc16c40
            03:08:04:[11760.149568]  0000000000000000 7fffffffffffffff ffff88007ff616e8 ffffffff8168a630
            03:08:04:[11760.151888] Call Trace:
            03:08:04:[11760.153683]  [<ffffffff8168a630>] ? bit_wait+0x50/0x50
            03:08:04:[11760.155721]  [<ffffffff8168c5d9>] schedule+0x29/0x70
            03:08:04:[11760.157686]  [<ffffffff8168a019>] schedule_timeout+0x239/0x2c0
            03:08:04:[11760.159733]  [<ffffffff8113d22e>] ? delayacct_end+0x9e/0xb0
            03:08:04:[11760.161691]  [<ffffffff81060c1f>] ? kvm_clock_get_cycles+0x1f/0x30
            03:08:04:[11760.163663]  [<ffffffff8168a630>] ? bit_wait+0x50/0x50
            03:08:04:[11760.165551]  [<ffffffff8168bb7e>] io_schedule_timeout+0xae/0x130
            03:08:04:[11760.167469]  [<ffffffff8168bc18>] io_schedule+0x18/0x20
            03:08:04:[11760.169321]  [<ffffffff8168a641>] bit_wait_io+0x11/0x50
            03:08:04:[11760.171825]  [<ffffffff8168a165>] __wait_on_bit+0x65/0x90
            03:08:04:[11760.174073]  [<ffffffff81180211>] wait_on_page_bit+0x81/0xa0
            03:08:04:[11760.175867]  [<ffffffff810b1be0>] ? wake_bit_function+0x40/0x40
            03:08:04:[11760.177632]  [<ffffffff81180341>] __filemap_fdatawait_range+0x111/0x190
            03:08:04:[11760.179428]  [<ffffffff811803d4>] filemap_fdatawait_range+0x14/0x30
            03:08:04:[11760.181160]  [<ffffffff81180417>] filemap_fdatawait+0x27/0x30
            03:08:04:[11760.182841]  [<ffffffff8118258c>] filemap_write_and_wait+0x4c/0x80
            03:08:04:[11760.184525]  [<ffffffffa058a3b0>] nfs_wb_all+0x20/0x100 [nfs]
            03:08:04:[11760.186135]  [<ffffffffa057ca08>] nfs_setattr+0x1d8/0x200 [nfs]
            03:08:04:[11760.187756]  [<ffffffff8121be19>] notify_change+0x279/0x3d0
            03:08:04:[11760.189312]  [<ffffffff811fc995>] do_truncate+0x75/0xc0
            03:08:04:[11760.190830]  [<ffffffffa0578a29>] ? nfs_permission+0x199/0x1e0 [nfs]
            03:08:04:[11760.192390]  [<ffffffff811fcb44>] vfs_truncate+0x164/0x190
            03:08:04:[11760.193849]  [<ffffffff811fcbfc>] do_sys_truncate+0x8c/0xb0
            03:08:04:[11760.195276]  [<ffffffff811fcdae>] SyS_truncate+0xe/0x10
            03:08:04:[11760.196650]  [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
            03:08:04:[11760.198062] INFO: task file_concat.sh:3471 blocked for more than 120 seconds.
            03:08:04:[11760.199552] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            03:08:04:[11760.201110] file_concat.sh  D ffff88007ae64278     0  3471  31015 0x00000080
            

            On the same client, in the stack trace log, we see dd and mkdir hung and an NFS server problem:

            04:06:47:[11760.356684]  [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
            04:06:47:[11760.358129] INFO: task mkdir:3524 blocked for more than 120 seconds.
            04:06:47:[11760.359580] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            04:06:47:[11760.361164] mkdir           D ffff880006f53bc8     0  3524  31009 0x00000080
            04:06:47:[11760.362709]  ffff88006a7d7d90 0000000000000082 ffff88005b1dce70 ffff88006a7d7fd8
            04:06:47:[11760.364352]  ffff88006a7d7fd8 ffff88006a7d7fd8 ffff88005b1dce70 ffff880006f53bc0
            04:06:47:[11760.365957]  ffff880006f53bc4 ffff88005b1dce70 00000000ffffffff ffff880006f53bc8
            04:06:47:[11760.367612] Call Trace:
            04:06:47:[11760.368814]  [<ffffffff8168d6c9>] schedule_preempt_disabled+0x29/0x70
            04:06:47:[11760.370391]  [<ffffffff8168b315>] __mutex_lock_slowpath+0xc5/0x1d0
            04:06:47:[11760.371915]  [<ffffffff8168a76f>] mutex_lock+0x1f/0x2f
            04:06:47:[11760.373363]  [<ffffffff8120cae5>] filename_create+0x85/0x180
            04:06:47:[11760.374860]  [<ffffffffa057870e>] ? nfs_do_access+0x23e/0x390 [nfs]
            04:06:47:[11760.376396]  [<ffffffff811de665>] ? kmem_cache_alloc+0x35/0x1e0
            04:06:47:[11760.377893]  [<ffffffff8120f2bf>] ? getname_flags+0x4f/0x1a0
            04:06:47:[11760.379357]  [<ffffffff8120f334>] ? getname_flags+0xc4/0x1a0
            04:06:47:[11760.380793]  [<ffffffff8121eede>] ? mntput_no_expire+0x3e/0x120
            04:06:47:[11760.382236]  [<ffffffff8120f5e1>] user_path_create+0x41/0x60
            04:06:47:[11760.383667]  [<ffffffff812108f6>] SyS_mkdirat+0x46/0xe0
            04:06:47:[11760.385066]  [<ffffffff812109a9>] SyS_mkdir+0x19/0x20
            04:06:47:[11760.386419]  [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
            04:06:47:[11760.387839] INFO: task dd:3529 blocked for more than 120 seconds.
            04:06:47:[11760.389275] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            04:06:47:[11760.390867] dd              D ffffffff8168a630     0  3529  31008 0x00000080
            04:06:47:[11760.392426]  ffff880077fffbc0 0000000000000086 ffff880074fb0000 ffff880077ffffd8
            04:06:48:[11760.394065]  ffff880077ffffd8 ffff880077ffffd8 ffff880074fb0000 ffff88007fd16c40
            04:06:48:[11760.395686]  0000000000000000 7fffffffffffffff ffff88007ff5cee8 ffffffff8168a630
            04:06:48:[11760.397353] Call Trace:
            04:06:48:[11760.398564]  [<ffffffff8168a630>] ? bit_wait+0x50/0x50
            04:06:48:[11760.400060]  [<ffffffff8168c5d9>] schedule+0x29/0x70
            04:06:48:[11760.401498]  [<ffffffff8168a019>] schedule_timeout+0x239/0x2c0
            04:06:48:[11760.403027]  [<ffffffffa0267e10>] ? rpc_put_task+0x10/0x20 [sunrpc]
            04:06:48:[11760.404538]  [<ffffffffa0584174>] ? nfs_initiate_pgio+0xd4/0x160 [nfs]
            04:06:48:[11760.406022]  [<ffffffff81060c1f>] ? kvm_clock_get_cycles+0x1f/0x30
            04:06:48:[11760.407526]  [<ffffffff8168a630>] ? bit_wait+0x50/0x50
            04:06:48:[11760.408925]  [<ffffffff8168bb7e>] io_schedule_timeout+0xae/0x130
            04:06:48:[11760.410387]  [<ffffffff8168bc18>] io_schedule+0x18/0x20
            04:06:48:[11760.411777]  [<ffffffff8168a641>] bit_wait_io+0x11/0x50
            04:06:48:[11760.413167]  [<ffffffff8168a165>] __wait_on_bit+0x65/0x90
            04:06:48:[11760.414551]  [<ffffffff81180211>] wait_on_page_bit+0x81/0xa0
            04:06:48:[11760.415953]  [<ffffffff810b1be0>] ? wake_bit_function+0x40/0x40
            04:06:48:[11760.417375]  [<ffffffff81180341>] __filemap_fdatawait_range+0x111/0x190
            04:06:48:[11760.418862]  [<ffffffff811803d4>] filemap_fdatawait_range+0x14/0x30
            04:06:48:[11760.420312]  [<ffffffff81182656>] filemap_write_and_wait_range+0x56/0x90
            04:06:48:[11760.421798]  [<ffffffffa0578e46>] nfs_file_fsync+0x86/0x110 [nfs]
            04:06:48:[11760.423232]  [<ffffffff812302cb>] vfs_fsync+0x2b/0x40
            04:06:48:[11760.424589]  [<ffffffffa0579286>] nfs_file_flush+0x46/0x60 [nfs]
            04:06:48:[11760.426028]  [<ffffffff811fbfd4>] filp_close+0x34/0x80
            04:06:48:[11760.427400]  [<ffffffff8121d338>] __close_fd+0x78/0xa0
            04:06:48:[11760.428755]  [<ffffffff811fdb73>] SyS_close+0x23/0x50
            04:06:48:[11760.430084]  [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
            04:06:48:[11826.143103] nfs: server trevis-6vm4 not responding, still trying
            04:06:48:[12018.655111] nfs: server trevis-6vm4 not responding, still trying
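            The hung-task reports in these logs all share the same shape, so a short script can summarize which commands were flagged across a run. A minimal sketch (the regex, sample lines, and helper name are illustrative, not part of the ticket's tooling):

```python
import re
from collections import Counter

# Matches lines like "INFO: task dd:3382 blocked for more than 120 seconds."
HUNG_TASK = re.compile(r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked")

def blocked_commands(log_text: str) -> Counter:
    """Count hung-task watchdog reports per command name."""
    return Counter(m.group("comm") for m in HUNG_TASK.finditer(log_text))

# Sample lines mirroring the console-log excerpts in this ticket.
sample = """\
[11760.098045] INFO: task dd:3382 blocked for more than 120 seconds.
[11760.138325] INFO: task truncate:3454 blocked for more than 120 seconds.
[11760.387839] INFO: task dd:3529 blocked for more than 120 seconds.
"""
print(blocked_commands(sample))  # Counter({'dd': 2, 'truncate': 1})
```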
            
            jamesanunez James Nunez (Inactive) added a comment - We still see what looks like this hang with parallel_scale_nfsv3 test_racer_on_nfs. Logs for the most recent hang is at https://testing.hpdd.intel.com/test_sets/59c82126-7286-11e7-bb95-5254006e85c2 The logs are interesting. On the client 1, in the console log, we see a segfault and then hangs on several different processes including dd, truncate, file_concat.sh, setfattr, ln, and more: 03:08:04:[11566.277896] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 03:04:22 (1501124662) 03:08:04:[11566.741676] 8[31347]: segfault at 8 ip 00007fc99889b3b8 sp 00007fff9753dd40 error 4 in ld-2.17.so[7fc998890000+20000] 03:08:04:[11760.098045] INFO: task dd:3382 blocked for more than 120 seconds. 03:08:04:[11760.100574] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 03:08:04:[11760.103192] dd D ffff88007ae64278 0 3382 30972 0x00000080 03:08:04:[11760.105770] ffff880036c9fd10 0000000000000082 ffff880078760fb0 ffff880036c9ffd8 03:08:04:[11760.108465] ffff880036c9ffd8 ffff880036c9ffd8 ffff880078760fb0 ffff88007ae64270 03:08:04:[11760.111045] ffff88007ae64274 ffff880078760fb0 00000000ffffffff ffff88007ae64278 03:08:04:[11760.113570] Call Trace: 03:08:04:[11760.115753] [<ffffffff8168d6c9>] schedule_preempt_disabled+0x29/0x70 03:08:04:[11760.118168] [<ffffffff8168b315>] __mutex_lock_slowpath+0xc5/0x1d0 03:08:04:[11760.120533] [<ffffffff8168a76f>] mutex_lock+0x1f/0x2f 03:08:04:[11760.122898] [<ffffffff811833f6>] generic_file_aio_write+0x46/0xa0 03:08:04:[11760.125271] [<ffffffffa057911b>] nfs_file_write+0xbb/0x1e0 [nfs] 03:08:04:[11760.127568] [<ffffffff811fdfbd>] do_sync_write+0x8d/0xd0 03:08:04:[11760.129784] [<ffffffff811fe82d>] vfs_write+0xbd/0x1e0 03:08:04:[11760.131928] [<ffffffff811fe6f7>] ? 
vfs_read+0xf7/0x170
03:08:04:[11760.134049] [<ffffffff811ff34f>] SyS_write+0x7f/0xe0
03:08:04:[11760.136168] [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
03:08:04:[11760.138325] INFO: task truncate:3454 blocked for more than 120 seconds.
03:08:04:[11760.140499] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
03:08:04:[11760.142775] truncate        D ffffffff8168a630     0  3454  30984 0x00000080
03:08:04:[11760.145011]  ffff88006e3c3af0 0000000000000082 ffff880079121f60 ffff88006e3c3fd8
03:08:04:[11760.147321]  ffff88006e3c3fd8 ffff88006e3c3fd8 ffff880079121f60 ffff88007fc16c40
03:08:04:[11760.149568]  0000000000000000 7fffffffffffffff ffff88007ff616e8 ffffffff8168a630
03:08:04:[11760.151888] Call Trace:
03:08:04:[11760.153683] [<ffffffff8168a630>] ? bit_wait+0x50/0x50
03:08:04:[11760.155721] [<ffffffff8168c5d9>] schedule+0x29/0x70
03:08:04:[11760.157686] [<ffffffff8168a019>] schedule_timeout+0x239/0x2c0
03:08:04:[11760.159733] [<ffffffff8113d22e>] ? delayacct_end+0x9e/0xb0
03:08:04:[11760.161691] [<ffffffff81060c1f>] ? kvm_clock_get_cycles+0x1f/0x30
03:08:04:[11760.163663] [<ffffffff8168a630>] ? bit_wait+0x50/0x50
03:08:04:[11760.165551] [<ffffffff8168bb7e>] io_schedule_timeout+0xae/0x130
03:08:04:[11760.167469] [<ffffffff8168bc18>] io_schedule+0x18/0x20
03:08:04:[11760.169321] [<ffffffff8168a641>] bit_wait_io+0x11/0x50
03:08:04:[11760.171825] [<ffffffff8168a165>] __wait_on_bit+0x65/0x90
03:08:04:[11760.174073] [<ffffffff81180211>] wait_on_page_bit+0x81/0xa0
03:08:04:[11760.175867] [<ffffffff810b1be0>] ? wake_bit_function+0x40/0x40
03:08:04:[11760.177632] [<ffffffff81180341>] __filemap_fdatawait_range+0x111/0x190
03:08:04:[11760.179428] [<ffffffff811803d4>] filemap_fdatawait_range+0x14/0x30
03:08:04:[11760.181160] [<ffffffff81180417>] filemap_fdatawait+0x27/0x30
03:08:04:[11760.182841] [<ffffffff8118258c>] filemap_write_and_wait+0x4c/0x80
03:08:04:[11760.184525] [<ffffffffa058a3b0>] nfs_wb_all+0x20/0x100 [nfs]
03:08:04:[11760.186135] [<ffffffffa057ca08>] nfs_setattr+0x1d8/0x200 [nfs]
03:08:04:[11760.187756] [<ffffffff8121be19>] notify_change+0x279/0x3d0
03:08:04:[11760.189312] [<ffffffff811fc995>] do_truncate+0x75/0xc0
03:08:04:[11760.190830] [<ffffffffa0578a29>] ? nfs_permission+0x199/0x1e0 [nfs]
03:08:04:[11760.192390] [<ffffffff811fcb44>] vfs_truncate+0x164/0x190
03:08:04:[11760.193849] [<ffffffff811fcbfc>] do_sys_truncate+0x8c/0xb0
03:08:04:[11760.195276] [<ffffffff811fcdae>] SyS_truncate+0xe/0x10
03:08:04:[11760.196650] [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
03:08:04:[11760.198062] INFO: task file_concat.sh:3471 blocked for more than 120 seconds.
03:08:04:[11760.199552] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
03:08:04:[11760.201110] file_concat.sh  D ffff88007ae64278     0  3471  31015 0x00000080

On the same client, the stack trace log shows dd and mkdir hung and an NFS server problem:

04:06:47:[11760.356684] [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
04:06:47:[11760.358129] INFO: task mkdir:3524 blocked for more than 120 seconds.
04:06:47:[11760.359580] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
04:06:47:[11760.361164] mkdir           D ffff880006f53bc8     0  3524  31009 0x00000080
04:06:47:[11760.362709]  ffff88006a7d7d90 0000000000000082 ffff88005b1dce70 ffff88006a7d7fd8
04:06:47:[11760.364352]  ffff88006a7d7fd8 ffff88006a7d7fd8 ffff88005b1dce70 ffff880006f53bc0
04:06:47:[11760.365957]  ffff880006f53bc4 ffff88005b1dce70 00000000ffffffff ffff880006f53bc8
04:06:47:[11760.367612] Call Trace:
04:06:47:[11760.368814] [<ffffffff8168d6c9>] schedule_preempt_disabled+0x29/0x70
04:06:47:[11760.370391] [<ffffffff8168b315>] __mutex_lock_slowpath+0xc5/0x1d0
04:06:47:[11760.371915] [<ffffffff8168a76f>] mutex_lock+0x1f/0x2f
04:06:47:[11760.373363] [<ffffffff8120cae5>] filename_create+0x85/0x180
04:06:47:[11760.374860] [<ffffffffa057870e>] ? nfs_do_access+0x23e/0x390 [nfs]
04:06:47:[11760.376396] [<ffffffff811de665>] ? kmem_cache_alloc+0x35/0x1e0
04:06:47:[11760.377893] [<ffffffff8120f2bf>] ? getname_flags+0x4f/0x1a0
04:06:47:[11760.379357] [<ffffffff8120f334>] ? getname_flags+0xc4/0x1a0
04:06:47:[11760.380793] [<ffffffff8121eede>] ? mntput_no_expire+0x3e/0x120
04:06:47:[11760.382236] [<ffffffff8120f5e1>] user_path_create+0x41/0x60
04:06:47:[11760.383667] [<ffffffff812108f6>] SyS_mkdirat+0x46/0xe0
04:06:47:[11760.385066] [<ffffffff812109a9>] SyS_mkdir+0x19/0x20
04:06:47:[11760.386419] [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
04:06:47:[11760.387839] INFO: task dd:3529 blocked for more than 120 seconds.
04:06:47:[11760.389275] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
04:06:47:[11760.390867] dd              D ffffffff8168a630     0  3529  31008 0x00000080
04:06:47:[11760.392426]  ffff880077fffbc0 0000000000000086 ffff880074fb0000 ffff880077ffffd8
04:06:48:[11760.394065]  ffff880077ffffd8 ffff880077ffffd8 ffff880074fb0000 ffff88007fd16c40
04:06:48:[11760.395686]  0000000000000000 7fffffffffffffff ffff88007ff5cee8 ffffffff8168a630
04:06:48:[11760.397353] Call Trace:
04:06:48:[11760.398564] [<ffffffff8168a630>] ? bit_wait+0x50/0x50
04:06:48:[11760.400060] [<ffffffff8168c5d9>] schedule+0x29/0x70
04:06:48:[11760.401498] [<ffffffff8168a019>] schedule_timeout+0x239/0x2c0
04:06:48:[11760.403027] [<ffffffffa0267e10>] ? rpc_put_task+0x10/0x20 [sunrpc]
04:06:48:[11760.404538] [<ffffffffa0584174>] ? nfs_initiate_pgio+0xd4/0x160 [nfs]
04:06:48:[11760.406022] [<ffffffff81060c1f>] ? kvm_clock_get_cycles+0x1f/0x30
04:06:48:[11760.407526] [<ffffffff8168a630>] ? bit_wait+0x50/0x50
04:06:48:[11760.408925] [<ffffffff8168bb7e>] io_schedule_timeout+0xae/0x130
04:06:48:[11760.410387] [<ffffffff8168bc18>] io_schedule+0x18/0x20
04:06:48:[11760.411777] [<ffffffff8168a641>] bit_wait_io+0x11/0x50
04:06:48:[11760.413167] [<ffffffff8168a165>] __wait_on_bit+0x65/0x90
04:06:48:[11760.414551] [<ffffffff81180211>] wait_on_page_bit+0x81/0xa0
04:06:48:[11760.415953] [<ffffffff810b1be0>] ? wake_bit_function+0x40/0x40
04:06:48:[11760.417375] [<ffffffff81180341>] __filemap_fdatawait_range+0x111/0x190
04:06:48:[11760.418862] [<ffffffff811803d4>] filemap_fdatawait_range+0x14/0x30
04:06:48:[11760.420312] [<ffffffff81182656>] filemap_write_and_wait_range+0x56/0x90
04:06:48:[11760.421798] [<ffffffffa0578e46>] nfs_file_fsync+0x86/0x110 [nfs]
04:06:48:[11760.423232] [<ffffffff812302cb>] vfs_fsync+0x2b/0x40
04:06:48:[11760.424589] [<ffffffffa0579286>] nfs_file_flush+0x46/0x60 [nfs]
04:06:48:[11760.426028] [<ffffffff811fbfd4>] filp_close+0x34/0x80
04:06:48:[11760.427400] [<ffffffff8121d338>] __close_fd+0x78/0xa0
04:06:48:[11760.428755] [<ffffffff811fdb73>] SyS_close+0x23/0x50
04:06:48:[11760.430084] [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
04:06:48:[11826.143103] nfs: server trevis-6vm4 not responding, still trying
04:06:48:[12018.655111] nfs: server trevis-6vm4 not responding, still trying
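When several tasks hang at once, as in the console excerpts above, it helps to triage mechanically which processes tripped the hung-task detector. A minimal sketch (a hypothetical helper, not part of the Lustre test suite) that pulls the blocked-task names and PIDs out of a console log in this format:

```python
import re

# Matches hung-task reports such as:
#   INFO: task truncate:3454 blocked for more than 120 seconds.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

def blocked_tasks(console_log: str):
    """Return (task name, pid) pairs for every hung-task report in a console log."""
    return [(m.group("name"), int(m.group("pid")))
            for m in HUNG_TASK_RE.finditer(console_log)]

log = """
03:08:04:[11760.138325] INFO: task truncate:3454 blocked for more than 120 seconds.
03:08:04:[11760.198062] INFO: task file_concat.sh:3471 blocked for more than 120 seconds.
04:06:47:[11760.358129] INFO: task mkdir:3524 blocked for more than 120 seconds.
04:06:47:[11760.387839] INFO: task dd:3529 blocked for more than 120 seconds.
"""
print(blocked_tasks(log))
# -> [('truncate', 3454), ('file_concat.sh', 3471), ('mkdir', 3524), ('dd', 3529)]
```

Here all four blocked tasks (truncate, file_concat.sh, mkdir, dd) are sitting in NFS writeback or on a directory mutex, consistent with the "nfs: server trevis-6vm4 not responding" messages that follow.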
            sarah Sarah Liu added a comment -

            Hi Jim,
            I think the failure you saw is LU-8584, caused by the "unable to handle kernel" error.
            jcasper James Casper (Inactive) added a comment -

            Similar "process D" messages seen with parallel-scale-nfsv3 test_racer_on_nfs.
            Link for parallel-scale-nfsv3 test_racer_on_nfs: https://testing.hpdd.intel.com/test_sessions/d5c0f53e-f881-11e6-887f-5254006e85c2
            sarah Sarah Liu added a comment -

            I created TEI-3677 to track the missing log issue.
            sarah Sarah Liu added a comment -

            There is TEI-3392 for a similar issue; I will open another one for the partially missing logs.

            adilger Andreas Dilger added a comment -

            Looks like the client got a -30 error (-EROFS) and the MDS crashed. No way to know what is happening without the MDS logs.

            Sarah, is there a TEI ticket open to fix the console logs for RHEL7? If not, can you please open one.
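As a quick sanity check on the comment above, errno 30 on Linux is indeed EROFS, so the client's "md_setattr fails: rc = -30" decodes to "Read-only file system" — consistent with the MDT having been remounted read-only. Standard Python is enough to verify the mapping:

```python
import errno
import os

# errno 30 on Linux is EROFS; a negative rc of -30 in Lustre client
# logs therefore means the server-side filesystem went read-only.
assert errno.EROFS == 30
print(os.strerror(errno.EROFS))
# -> Read-only file system
```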
            green Oleg Drokin added a comment -

            No MDS logs in the report, and we need those to understand why something was remounted read-only.

            People

              wc-triage WC Triage
              maloo Maloo
              Votes: 0
              Watchers: 7