[LU-6758] racer test_1: test failed to respond and timed out
Created: 23/Jun/15  Updated: 11/Sep/18  Resolved: 10/Sep/18

| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None |
| Environment: | server: lustre-master build # 3071 EL7 ldiskfs |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/8303c8ec-13d8-11e5-b4b0-5254006e85c2.

The sub-test test_1 failed with the following error:

test failed to respond and timed out

There are no MDS logs.

16:40:41:Lustre: DEBUG MARKER: == racer test 1: racer on clients: onyx-40vm5,onyx-40vm6.onyx.hpdd.intel.com DURATION=900 == 09:30:44 (1434299444)
16:40:41:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer
16:40:41:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer
16:40:41:LustreError: 21155:0:(llite_lib.c:1497:ll_md_setattr()) md_setattr fails: rc = -30
16:40:42:Lustre: 21248:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299452/real 1434299452] req@ffff88006d4753c0 x1503970942065616/t0(0) o36->lustre-MDT0000-mdc-ffff88007c74cc00@10.2.4.185@tcp:12/10 lens 496/568 e 0 to 1 dl 1434299459 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
16:40:42:Lustre: lustre-MDT0000-mdc-ffff88007c74cc00: Connection to lustre-MDT0000 (at 10.2.4.185@tcp) was lost; in progress operations using this service will wait for recovery to complete
16:40:43:Lustre: 31077:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299452/real 1434299452] req@ffff880070c73c80 x1503970942069540/t0(0) o400->MGC10.2.4.185@tcp@10.2.4.185@tcp:26/25 lens 224/224 e 0 to 1 dl 1434299459 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
16:40:43:Lustre: 31077:0:(client.c:2003:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
16:40:43:LustreError: 166-1: MGC10.2.4.185@tcp: Connection to MGS (at 10.2.4.185@tcp) was lost; in progress operations using this service will fail
16:40:43:Lustre: 31077:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299457/real 1434299457] req@ffff880070cd1080 x1503970942081192/t0(0) o400->lustre-MDT0000-mdc-ffff880079777c00@10.2.4.185@tcp:12/10 lens 224/224 e 0 to 1 dl 1434299464 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
16:40:43:Lustre: 31077:0:(client.c:2003:ptlrpc_expire_one_request()) Skipped 1 previous similar message
16:40:43:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299470/real 1434299470] req@ffff880070167380 x1503970942081360/t0(0) o250->MGC10.2.4.185@tcp@10.2.4.185@tcp:26/25 lens 520/544 e 0 to 1 dl 1434299481 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
16:40:43:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
16:40:43:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1434299485/real 0] req@ffff880070a2c3c0 x1503970942081548/t0(0) o250->MGC10.2.4.185@tcp@10.2.4.185@tcp:26/25 lens 520/544 e 0 to 1 dl 1434299501 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
16:40:44:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
16:40:44:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299515/real 1434299515] req@ffff88007010b080 x1503970942081892/t0(0) o38->lustre-MDT0000-mdc-ffff880079777c00@10.2.4.185@tcp:12/10 lens 520/544 e 0 to 1 dl 1434299531 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
16:40:44:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299535/real 1434299536] req@ffff88006d7f20c0 x1503970942082156/t0(0) o38->lustre-MDT0000-mdc-ffff88007c74cc00@10.2.4.185@tcp:12/10 lens 520/544 e 0 to 1 dl 1434299556 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
16:40:44:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
16:40:44:INFO: task dir_create.sh:5103 blocked for more than 120 seconds.
16:40:44: Not tainted 2.6.32-504.16.2.el6.x86_64 #1
16:40:44:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
16:40:45:dir_create.sh D 0000000000000000 0 5103 5096 0x00000080
16:40:45: ffff88007c2e1bc8 0000000000000082 0000000000000000 ffff880000000000
16:40:45: ffff880000000065 ffff88007a19e104 ffffffffa1b012c0 0000000000000000
16:40:46: 000200000a0204b9 ffffffffa1dba813 ffff880079059068 ffff88007c2e1fd8
16:40:46:Call Trace:
16:40:46: [<ffffffffa1d5f5f5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
16:40:46: [<ffffffffa1d86a22>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
16:40:46: [<ffffffff8152b7a6>] __mutex_lock_slowpath+0x96/0x210
16:40:46: [<ffffffff8152b2cb>] mutex_lock+0x2b/0x50
16:40:46: [<ffffffffa1f40c7a>] mdc_close+0x19a/0xa60 [mdc]
16:40:46: [<ffffffffa2017c09>] ? ll_i2suppgid+0x19/0x30 [lustre]
16:40:46: [<ffffffffa1efefe7>] lmv_close+0x2c7/0x580 [lmv]
16:40:46: [<ffffffffa1feac4f>] ll_close_inode_openhandle+0x2ef/0xe90 [lustre]
16:40:47: [<ffffffffa1febb97>] ll_md_real_close+0x197/0x210 [lustre]
16:40:47: [<ffffffffa1fed331>] ll_file_release+0x641/0xad0 [lustre]
16:40:47: [<ffffffff8118fb85>] __fput+0xf5/0x210
16:40:47: [<ffffffff8118fcc5>] fput+0x25/0x30
16:40:47: [<ffffffff8118af1d>] filp_close+0x5d/0x90
16:40:47: [<ffffffff811a3760>] sys_dup3+0x130/0x190
16:40:47: [<ffffffff811a37d4>] sys_dup2+0x14/0x50
16:40:47: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
16:40:47:INFO: task cat:5274 blocked for more than 120 seconds.
16:40:47: Not tainted 2.6.32-504.16.2.el6.x86_64 #1
16:40:47:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
16:40:47:cat D 0000000000000000 0 5274 5133 0x00000080
16:40:47: ffff880071bfbbf8 0000000000000086 0000000000000000 ffff880000000000
16:40:47: ffff880000000065 ffff88007a19e104 ffffffffa1b01f80 0000000000000000
16:40:47: 000200000a0204b9 ffffffffa1dba813 ffff880071bf9ad8 ffff880071bfbfd8
16:40:48:Call Trace:
16:40:48: [<ffffffffa1d5f5f5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
16:40:48: [<ffffffffa1d86a22>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
16:40:48: [<ffffffff8152b7a6>] __mutex_lock_slowpath+0x96/0x210
16:40:48: [<ffffffff8152b2cb>] mutex_lock+0x2b/0x50
16:40:48: [<ffffffffa1f40c7a>] mdc_close+0x19a/0xa60 [mdc]
16:40:49: [<ffffffffa2017c09>] ? ll_i2suppgid+0x19/0x30 [lustre]
16:40:49: [<ffffffffa1efefe7>] lmv_close+0x2c7/0x580 [lmv]
16:40:49: [<ffffffffa1feac4f>] ll_close_inode_openhandle+0x2ef/0xe90 [lustre]
16:40:49: [<ffffffffa1febb97>] ll_md_real_close+0x197/0x210 [lustre]
16:40:49: [<ffffffffa1fed331>] ll_file_release+0x641/0xad0 [lustre]
16:40:49: [<ffffffff8118fb85>] __fput+0xf5/0x210
16:40:49: [<ffffffff8118fcc5>] fput+0x25/0x30
16:40:50: [<ffffffff8118af1d>] filp_close+0x5d/0x90
16:40:50: [<ffffffff8118aff5>] sys_close+0xa5/0x100
16:40:50: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
16:40:50:INFO: task dd:6983 blocked for more than 120 seconds.
16:40:50: Not tainted 2.6.32-504.16.2.el6.x86_64 #1
16:40:50:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
16:40:50:dd D 0000000000000000 0 6983 5154 0x00000080
16:40:50: ffff880071917bf8 0000000000000086 0000000000000000 ffff880000000000
16:40:50: ffff880000000065 ffff88007a56c244 ffffffffa1b03a00 0000000000000000
16:40:51: 000200000a0204b9 ffffffffa1dba813 ffff880070fafad8 ffff880071917fd8
16:40:51:Call Trace:
16:40:51: [<ffffffffa1d5f5f5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
16:40:51: [<ffffffffa1d86a22>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
16:40:51: [<ffffffff8152b7a6>] __mutex_lock_slowpath+0x96/0x210
16:40:51: [<ffffffff8152b2cb>] mutex_lock+0x2b/0x50
16:40:51: [<ffffffffa1f40c7a>] mdc_close+0x19a/0xa60 [mdc]
16:40:51: [<ffffffffa2017c09>] ? ll_i2suppgid+0x19/0x30 [lustre]
16:40:52: [<ffffffffa1efefe7>] lmv_close+0x2c7/0x580 [lmv]
16:40:52: [<ffffffffa1feac4f>] ll_close_inode_openhandle+0x2ef/0xe90 [lustre]
16:40:52: [<ffffffffa1febb97>] ll_md_real_close+0x197/0x210 [lustre]
16:40:52: [<ffffffffa1fed331>] ll_file_release+0x641/0xad0 [lustre]
16:40:53: [<ffffffff8118fb85>] __fput+0xf5/0x210
16:40:53: [<ffffffff8118fcc5>] fput+0x25/0x30
16:40:53: [<ffffffff8118af1d>] filp_close+0x5d/0x90
16:40:53: [<ffffffff8118aff5>] sys_close+0xa5/0x100
16:40:53: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
|
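For reference, the racer workload recorded in the DEBUG MARKER lines above can be re-run directly on a client node. A minimal sketch, assuming a standard lustre-tests install and client mounts at /mnt/lustre and /mnt/lustre2 as shown in the log:

```bash
#!/bin/bash
# Re-run the racer workload the test framework launched above.
# DURATION, MDSCOUNT, LFS, and the two mount points are taken from the
# DEBUG MARKER lines; two concurrent instances mirror the test setup.
DURATION=900 MDSCOUNT=1 LFS=/usr/bin/lfs \
  /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer &
DURATION=900 MDSCOUNT=1 LFS=/usr/bin/lfs \
  /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer &
wait
```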
| Comments |
| Comment by Oleg Drokin [ 03/Jul/15 ] |
|
No MDS logs in the report, and we need them to understand why something was remounted read-only. |
| Comment by Andreas Dilger [ 03/Jul/15 ] |
|
Looks like the client got a -30 error (-EROFS) and the MDS crashed. There is no way to know what is happening without the MDS logs. Sarah, is there a TEI ticket open to fix the console logs for RHEL7? If not, can you please open one? |
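Until the console logs are available, a read-only remount can at least be confirmed on the nodes themselves. A minimal sketch with standard commands; the grep patterns are assumptions about the exact kernel wording:

```bash
# Show any mounts whose options include "ro"; a target or client that
# hit EROFS (rc = -30, as in the description) should appear here.
awk '$4 ~ /(^|,)ro(,|$)/ {print $2, $1, $4}' /proc/mounts
# Kernel messages that typically accompany an ldiskfs remount-ro.
dmesg | grep -iE 'read-only|remount|EROFS'
```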
| Comment by Sarah Liu [ 06/Jul/15 ] |
|
There is TEI-3392 for a similar issue; I will open another one for the partially missing logs. |
| Comment by Sarah Liu [ 09/Jul/15 ] |
|
I created TEI-3677 to track the missing-log issue.
| Comment by James Casper [ 09/Mar/17 ] |
|
Similar "process D" messages seen with parallel-scale-nfsv3 test_racer_on_nfs. |
| Comment by James Casper [ 09/Mar/17 ] |
|
Link for parallel-scale-nfsv3 test_racer_on_nfs: |
| Comment by Sarah Liu [ 09/Mar/17 ] |
|
Hi Jim,
| Comment by James Nunez (Inactive) [ 01/Aug/17 ] |
|
We still see what looks like this hang with parallel-scale-nfsv3 test_racer_on_nfs. Logs for the most recent hang are at:

The logs are interesting. On client 1, in the console log, we see a segfault followed by hangs in several different processes, including dd, truncate, file_concat.sh, setfattr, ln, and more:

03:08:04:[11566.277896] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 03:04:22 (1501124662)
03:08:04:[11566.741676] 8[31347]: segfault at 8 ip 00007fc99889b3b8 sp 00007fff9753dd40 error 4 in ld-2.17.so[7fc998890000+20000]
03:08:04:[11760.098045] INFO: task dd:3382 blocked for more than 120 seconds.
03:08:04:[11760.100574] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
03:08:04:[11760.103192] dd D ffff88007ae64278 0 3382 30972 0x00000080
03:08:04:[11760.105770] ffff880036c9fd10 0000000000000082 ffff880078760fb0 ffff880036c9ffd8
03:08:04:[11760.108465] ffff880036c9ffd8 ffff880036c9ffd8 ffff880078760fb0 ffff88007ae64270
03:08:04:[11760.111045] ffff88007ae64274 ffff880078760fb0 00000000ffffffff ffff88007ae64278
03:08:04:[11760.113570] Call Trace:
03:08:04:[11760.115753] [<ffffffff8168d6c9>] schedule_preempt_disabled+0x29/0x70
03:08:04:[11760.118168] [<ffffffff8168b315>] __mutex_lock_slowpath+0xc5/0x1d0
03:08:04:[11760.120533] [<ffffffff8168a76f>] mutex_lock+0x1f/0x2f
03:08:04:[11760.122898] [<ffffffff811833f6>] generic_file_aio_write+0x46/0xa0
03:08:04:[11760.125271] [<ffffffffa057911b>] nfs_file_write+0xbb/0x1e0 [nfs]
03:08:04:[11760.127568] [<ffffffff811fdfbd>] do_sync_write+0x8d/0xd0
03:08:04:[11760.129784] [<ffffffff811fe82d>] vfs_write+0xbd/0x1e0
03:08:04:[11760.131928] [<ffffffff811fe6f7>] ? vfs_read+0xf7/0x170
03:08:04:[11760.134049] [<ffffffff811ff34f>] SyS_write+0x7f/0xe0
03:08:04:[11760.136168] [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
03:08:04:[11760.138325] INFO: task truncate:3454 blocked for more than 120 seconds.
03:08:04:[11760.140499] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
03:08:04:[11760.142775] truncate D ffffffff8168a630 0 3454 30984 0x00000080
03:08:04:[11760.145011] ffff88006e3c3af0 0000000000000082 ffff880079121f60 ffff88006e3c3fd8
03:08:04:[11760.147321] ffff88006e3c3fd8 ffff88006e3c3fd8 ffff880079121f60 ffff88007fc16c40
03:08:04:[11760.149568] 0000000000000000 7fffffffffffffff ffff88007ff616e8 ffffffff8168a630
03:08:04:[11760.151888] Call Trace:
03:08:04:[11760.153683] [<ffffffff8168a630>] ? bit_wait+0x50/0x50
03:08:04:[11760.155721] [<ffffffff8168c5d9>] schedule+0x29/0x70
03:08:04:[11760.157686] [<ffffffff8168a019>] schedule_timeout+0x239/0x2c0
03:08:04:[11760.159733] [<ffffffff8113d22e>] ? delayacct_end+0x9e/0xb0
03:08:04:[11760.161691] [<ffffffff81060c1f>] ? kvm_clock_get_cycles+0x1f/0x30
03:08:04:[11760.163663] [<ffffffff8168a630>] ? bit_wait+0x50/0x50
03:08:04:[11760.165551] [<ffffffff8168bb7e>] io_schedule_timeout+0xae/0x130
03:08:04:[11760.167469] [<ffffffff8168bc18>] io_schedule+0x18/0x20
03:08:04:[11760.169321] [<ffffffff8168a641>] bit_wait_io+0x11/0x50
03:08:04:[11760.171825] [<ffffffff8168a165>] __wait_on_bit+0x65/0x90
03:08:04:[11760.174073] [<ffffffff81180211>] wait_on_page_bit+0x81/0xa0
03:08:04:[11760.175867] [<ffffffff810b1be0>] ? wake_bit_function+0x40/0x40
03:08:04:[11760.177632] [<ffffffff81180341>] __filemap_fdatawait_range+0x111/0x190
03:08:04:[11760.179428] [<ffffffff811803d4>] filemap_fdatawait_range+0x14/0x30
03:08:04:[11760.181160] [<ffffffff81180417>] filemap_fdatawait+0x27/0x30
03:08:04:[11760.182841] [<ffffffff8118258c>] filemap_write_and_wait+0x4c/0x80
03:08:04:[11760.184525] [<ffffffffa058a3b0>] nfs_wb_all+0x20/0x100 [nfs]
03:08:04:[11760.186135] [<ffffffffa057ca08>] nfs_setattr+0x1d8/0x200 [nfs]
03:08:04:[11760.187756] [<ffffffff8121be19>] notify_change+0x279/0x3d0
03:08:04:[11760.189312] [<ffffffff811fc995>] do_truncate+0x75/0xc0
03:08:04:[11760.190830] [<ffffffffa0578a29>] ? nfs_permission+0x199/0x1e0 [nfs]
03:08:04:[11760.192390] [<ffffffff811fcb44>] vfs_truncate+0x164/0x190
03:08:04:[11760.193849] [<ffffffff811fcbfc>] do_sys_truncate+0x8c/0xb0
03:08:04:[11760.195276] [<ffffffff811fcdae>] SyS_truncate+0xe/0x10
03:08:04:[11760.196650] [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
03:08:04:[11760.198062] INFO: task file_concat.sh:3471 blocked for more than 120 seconds.
03:08:04:[11760.199552] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
03:08:04:[11760.201110] file_concat.sh D ffff88007ae64278 0 3471 31015 0x00000080

On the same client, in the stack trace log, we see dd and mkdir hung and an NFS server problem:

04:06:47:[11760.356684] [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
04:06:47:[11760.358129] INFO: task mkdir:3524 blocked for more than 120 seconds.
04:06:47:[11760.359580] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
04:06:47:[11760.361164] mkdir D ffff880006f53bc8 0 3524 31009 0x00000080
04:06:47:[11760.362709] ffff88006a7d7d90 0000000000000082 ffff88005b1dce70 ffff88006a7d7fd8
04:06:47:[11760.364352] ffff88006a7d7fd8 ffff88006a7d7fd8 ffff88005b1dce70 ffff880006f53bc0
04:06:47:[11760.365957] ffff880006f53bc4 ffff88005b1dce70 00000000ffffffff ffff880006f53bc8
04:06:47:[11760.367612] Call Trace:
04:06:47:[11760.368814] [<ffffffff8168d6c9>] schedule_preempt_disabled+0x29/0x70
04:06:47:[11760.370391] [<ffffffff8168b315>] __mutex_lock_slowpath+0xc5/0x1d0
04:06:47:[11760.371915] [<ffffffff8168a76f>] mutex_lock+0x1f/0x2f
04:06:47:[11760.373363] [<ffffffff8120cae5>] filename_create+0x85/0x180
04:06:47:[11760.374860] [<ffffffffa057870e>] ? nfs_do_access+0x23e/0x390 [nfs]
04:06:47:[11760.376396] [<ffffffff811de665>] ? kmem_cache_alloc+0x35/0x1e0
04:06:47:[11760.377893] [<ffffffff8120f2bf>] ? getname_flags+0x4f/0x1a0
04:06:47:[11760.379357] [<ffffffff8120f334>] ? getname_flags+0xc4/0x1a0
04:06:47:[11760.380793] [<ffffffff8121eede>] ? mntput_no_expire+0x3e/0x120
04:06:47:[11760.382236] [<ffffffff8120f5e1>] user_path_create+0x41/0x60
04:06:47:[11760.383667] [<ffffffff812108f6>] SyS_mkdirat+0x46/0xe0
04:06:47:[11760.385066] [<ffffffff812109a9>] SyS_mkdir+0x19/0x20
04:06:47:[11760.386419] [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
04:06:47:[11760.387839] INFO: task dd:3529 blocked for more than 120 seconds.
04:06:47:[11760.389275] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
04:06:47:[11760.390867] dd D ffffffff8168a630 0 3529 31008 0x00000080
04:06:47:[11760.392426] ffff880077fffbc0 0000000000000086 ffff880074fb0000 ffff880077ffffd8
04:06:48:[11760.394065] ffff880077ffffd8 ffff880077ffffd8 ffff880074fb0000 ffff88007fd16c40
04:06:48:[11760.395686] 0000000000000000 7fffffffffffffff ffff88007ff5cee8 ffffffff8168a630
04:06:48:[11760.397353] Call Trace:
04:06:48:[11760.398564] [<ffffffff8168a630>] ? bit_wait+0x50/0x50
04:06:48:[11760.400060] [<ffffffff8168c5d9>] schedule+0x29/0x70
04:06:48:[11760.401498] [<ffffffff8168a019>] schedule_timeout+0x239/0x2c0
04:06:48:[11760.403027] [<ffffffffa0267e10>] ? rpc_put_task+0x10/0x20 [sunrpc]
04:06:48:[11760.404538] [<ffffffffa0584174>] ? nfs_initiate_pgio+0xd4/0x160 [nfs]
04:06:48:[11760.406022] [<ffffffff81060c1f>] ? kvm_clock_get_cycles+0x1f/0x30
04:06:48:[11760.407526] [<ffffffff8168a630>] ? bit_wait+0x50/0x50
04:06:48:[11760.408925] [<ffffffff8168bb7e>] io_schedule_timeout+0xae/0x130
04:06:48:[11760.410387] [<ffffffff8168bc18>] io_schedule+0x18/0x20
04:06:48:[11760.411777] [<ffffffff8168a641>] bit_wait_io+0x11/0x50
04:06:48:[11760.413167] [<ffffffff8168a165>] __wait_on_bit+0x65/0x90
04:06:48:[11760.414551] [<ffffffff81180211>] wait_on_page_bit+0x81/0xa0
04:06:48:[11760.415953] [<ffffffff810b1be0>] ? wake_bit_function+0x40/0x40
04:06:48:[11760.417375] [<ffffffff81180341>] __filemap_fdatawait_range+0x111/0x190
04:06:48:[11760.418862] [<ffffffff811803d4>] filemap_fdatawait_range+0x14/0x30
04:06:48:[11760.420312] [<ffffffff81182656>] filemap_write_and_wait_range+0x56/0x90
04:06:48:[11760.421798] [<ffffffffa0578e46>] nfs_file_fsync+0x86/0x110 [nfs]
04:06:48:[11760.423232] [<ffffffff812302cb>] vfs_fsync+0x2b/0x40
04:06:48:[11760.424589] [<ffffffffa0579286>] nfs_file_flush+0x46/0x60 [nfs]
04:06:48:[11760.426028] [<ffffffff811fbfd4>] filp_close+0x34/0x80
04:06:48:[11760.427400] [<ffffffff8121d338>] __close_fd+0x78/0xa0
04:06:48:[11760.428755] [<ffffffff811fdb73>] SyS_close+0x23/0x50
04:06:48:[11760.430084] [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
04:06:48:[11826.143103] nfs: server trevis-6vm4 not responding, still trying
04:06:48:[12018.655111] nfs: server trevis-6vm4 not responding, still trying
|
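Given the final "nfs: server trevis-6vm4 not responding" messages, it would be worth probing the NFS server side when this reproduces. A minimal sketch; the hostname is taken from the log and NFSv3 is assumed per the test name:

```bash
SERVER=trevis-6vm4           # server named in the console messages
# Is the NFSv3 service still answering RPCs over TCP?
rpcinfo -t "$SERVER" nfs 3
# Is mountd still exporting the filesystem?
showmount -e "$SERVER"
```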
| Comment by James Nunez (Inactive) [ 29/Aug/18 ] |
|
We're seeing this issue, or something very close to it, at:

In these, we see ls or dir_create hang in the client console logs. In the MDS console log, we see:

[ 8094.197467] Lustre: DEBUG MARKER: == racer test 1: racer on clients: onyx-41vm10,onyx-41vm9.onyx.whamcloud.com DURATION=300 ============ 01:57:42 (1535507862)
[ 8106.463255] LNetError: 14167:0:(lib-msg.c:779:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
[ 8113.461582] Lustre: lustre-MDT0000: Client 27389196-d811-d02a-8653-acd954b84381 (at 10.2.8.140@tcp) reconnecting
[ 8260.743287] LNet: Service thread pid 27085 was inactive for 62.03s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 8260.745290] Pid: 27085, comm: mdt00_003 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Fri Aug 17 20:37:05 UTC 2018
[ 8260.746424] Call Trace:
[ 8260.746727] [<ffffffffc0d55f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[ 8260.747759] [<ffffffffc0d57e93>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[ 8260.748642] [<ffffffffc11b1bee>] mdt_dom_discard_data+0xfe/0x130 [mdt]
[ 8260.749734] [<ffffffffc118b603>] mdt_reint_rename_internal.isra.40+0x1923/0x28a0 [mdt]
[ 8260.750712] [<ffffffffc118dfcb>] mdt_reint_rename_or_migrate.isra.43+0x19b/0x860 [mdt]
[ 8260.751671] [<ffffffffc118e6c3>] mdt_reint_rename+0x13/0x20 [mdt]
[ 8260.752421] [<ffffffffc1193033>] mdt_reint_rec+0x83/0x210 [mdt]
[ 8260.753455] [<ffffffffc11721d2>] mdt_reint_internal+0x6b2/0xa80 [mdt]
[ 8260.754249] [<ffffffffc117d1e7>] mdt_reint+0x67/0x140 [mdt]
[ 8260.754927] [<ffffffffc0df431a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 8260.755931] [<ffffffffc0d974ab>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 8260.756855] [<ffffffffc0d9ace4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[ 8260.757643] [<ffffffffa3cbb621>] kthread+0xd1/0xe0
[ 8260.758294] [<ffffffffa43205f7>] ret_from_fork_nospec_end+0x0/0x39
[ 8260.759073] [<ffffffffffffffff>] 0xffffffffffffffff
[ 8260.759863] LustreError: dumping log to /tmp/lustre-log.1535508029.27085
[ 8261.761641] Pid: 3455, comm: mdt00_028 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Fri Aug 17 20:37:05 UTC 2018
[ 8261.762905] Call Trace:
[ 8261.763216] [<ffffffffc0d55f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[ 8261.764129] [<ffffffffc0d57e93>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[ 8261.764980] [<ffffffffc11763c7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
[ 8261.765776] [<ffffffffc1176a70>] mdt_object_lock_internal+0x70/0x330 [mdt]
[ 8261.766546] [<ffffffffc1177ada>] mdt_getattr_name_lock+0x83a/0x1c00 [mdt]
[ 8261.767348] [<ffffffffc117f885>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[ 8261.768089] [<ffffffffc117c768>] mdt_intent_policy+0x2f8/0xd10 [mdt]
[ 8261.768912] [<ffffffffc0d3ce9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
[ 8261.769714] [<ffffffffc0d65523>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
[ 8261.770557] [<ffffffffc0deb9d2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 8261.771307] [<ffffffffc0df431a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 8261.772124] [<ffffffffc0d974ab>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 8261.773076] [<ffffffffc0d9ace4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[ 8261.773824] [<ffffffffa3cbb621>] kthread+0xd1/0xe0
[ 8261.774388] [<ffffffffa43205f7>] ret_from_fork_nospec_end+0x0/0x39
[ 8261.775109] [<ffffffffffffffff>] 0xffffffffffffffff
|
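The binary debug dump named in that console output can be decoded for more detail. A minimal sketch using lctl; the dump path is taken from the LustreError line above, and the output path is an arbitrary choice:

```bash
# Convert the binary Lustre debug dump into readable text
# (lctl debug_file <input> [output]).
lctl debug_file /tmp/lustre-log.1535508029.27085 /tmp/lustre-log.txt
# The MDT threads above are stuck in ldlm_completion_ast, so the lock
# traffic around that time is a reasonable place to start reading.
grep -n 'ldlm' /tmp/lustre-log.txt | head
```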
| Comment by James Nunez (Inactive) [ 10/Sep/18 ] |
|
Created |