[LU-6758] racer test_1: test failed to respond and timed out Created: 23/Jun/15  Updated: 11/Sep/18  Resolved: 10/Sep/18

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: WC Triage
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

server: lustre-master build # 3071 EL7 ldiskfs
client: lustre-master build # 3071 EL6 ldiskfs


Issue Links:
Related
is related to LU-11359 racer test 1 times out with client hu... Resolved
is related to LU-10525 racer test_1: Failure to initialize c... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/8303c8ec-13d8-11e5-b4b0-5254006e85c2.

The sub-test test_1 failed with the following error:

test failed to respond and timed out

There are no MDS logs.
Client console:

16:40:41:Lustre: DEBUG MARKER: == racer test 1: racer on clients: onyx-40vm5,onyx-40vm6.onyx.hpdd.intel.com DURATION=900 == 09:30:44 (1434299444)
16:40:41:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 				   LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer 
16:40:41:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 				   LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer 
16:40:41:LustreError: 21155:0:(llite_lib.c:1497:ll_md_setattr()) md_setattr fails: rc = -30
16:40:42:Lustre: 21248:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299452/real 1434299452]  req@ffff88006d4753c0 x1503970942065616/t0(0) o36->lustre-MDT0000-mdc-ffff88007c74cc00@10.2.4.185@tcp:12/10 lens 496/568 e 0 to 1 dl 1434299459 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
16:40:42:Lustre: lustre-MDT0000-mdc-ffff88007c74cc00: Connection to lustre-MDT0000 (at 10.2.4.185@tcp) was lost; in progress operations using this service will wait for recovery to complete
16:40:43:Lustre: 31077:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299452/real 1434299452]  req@ffff880070c73c80 x1503970942069540/t0(0) o400->MGC10.2.4.185@tcp@10.2.4.185@tcp:26/25 lens 224/224 e 0 to 1 dl 1434299459 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
16:40:43:Lustre: 31077:0:(client.c:2003:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
16:40:43:LustreError: 166-1: MGC10.2.4.185@tcp: Connection to MGS (at 10.2.4.185@tcp) was lost; in progress operations using this service will fail
16:40:43:Lustre: 31077:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299457/real 1434299457]  req@ffff880070cd1080 x1503970942081192/t0(0) o400->lustre-MDT0000-mdc-ffff880079777c00@10.2.4.185@tcp:12/10 lens 224/224 e 0 to 1 dl 1434299464 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
16:40:43:Lustre: 31077:0:(client.c:2003:ptlrpc_expire_one_request()) Skipped 1 previous similar message
16:40:43:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299470/real 1434299470]  req@ffff880070167380 x1503970942081360/t0(0) o250->MGC10.2.4.185@tcp@10.2.4.185@tcp:26/25 lens 520/544 e 0 to 1 dl 1434299481 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
16:40:43:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
16:40:43:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1434299485/real 0]  req@ffff880070a2c3c0 x1503970942081548/t0(0) o250->MGC10.2.4.185@tcp@10.2.4.185@tcp:26/25 lens 520/544 e 0 to 1 dl 1434299501 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
16:40:44:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
16:40:44:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299515/real 1434299515]  req@ffff88007010b080 x1503970942081892/t0(0) o38->lustre-MDT0000-mdc-ffff880079777c00@10.2.4.185@tcp:12/10 lens 520/544 e 0 to 1 dl 1434299531 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
16:40:44:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434299535/real 1434299536]  req@ffff88006d7f20c0 x1503970942082156/t0(0) o38->lustre-MDT0000-mdc-ffff88007c74cc00@10.2.4.185@tcp:12/10 lens 520/544 e 0 to 1 dl 1434299556 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
16:40:44:Lustre: 31075:0:(client.c:2003:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
16:40:44:INFO: task dir_create.sh:5103 blocked for more than 120 seconds.
16:40:44:      Not tainted 2.6.32-504.16.2.el6.x86_64 #1
16:40:44:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
16:40:45:dir_create.sh D 0000000000000000     0  5103   5096 0x00000080
16:40:45: ffff88007c2e1bc8 0000000000000082 0000000000000000 ffff880000000000
16:40:45: ffff880000000065 ffff88007a19e104 ffffffffa1b012c0 0000000000000000
16:40:46: 000200000a0204b9 ffffffffa1dba813 ffff880079059068 ffff88007c2e1fd8
16:40:46:Call Trace:
16:40:46: [<ffffffffa1d5f5f5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
16:40:46: [<ffffffffa1d86a22>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
16:40:46: [<ffffffff8152b7a6>] __mutex_lock_slowpath+0x96/0x210
16:40:46: [<ffffffff8152b2cb>] mutex_lock+0x2b/0x50
16:40:46: [<ffffffffa1f40c7a>] mdc_close+0x19a/0xa60 [mdc]
16:40:46: [<ffffffffa2017c09>] ? ll_i2suppgid+0x19/0x30 [lustre]
16:40:46: [<ffffffffa1efefe7>] lmv_close+0x2c7/0x580 [lmv]
16:40:46: [<ffffffffa1feac4f>] ll_close_inode_openhandle+0x2ef/0xe90 [lustre]
16:40:47: [<ffffffffa1febb97>] ll_md_real_close+0x197/0x210 [lustre]
16:40:47: [<ffffffffa1fed331>] ll_file_release+0x641/0xad0 [lustre]
16:40:47: [<ffffffff8118fb85>] __fput+0xf5/0x210
16:40:47: [<ffffffff8118fcc5>] fput+0x25/0x30
16:40:47: [<ffffffff8118af1d>] filp_close+0x5d/0x90
16:40:47: [<ffffffff811a3760>] sys_dup3+0x130/0x190
16:40:47: [<ffffffff811a37d4>] sys_dup2+0x14/0x50
16:40:47: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
16:40:47:INFO: task cat:5274 blocked for more than 120 seconds.
16:40:47:      Not tainted 2.6.32-504.16.2.el6.x86_64 #1
16:40:47:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
16:40:47:cat           D 0000000000000000     0  5274   5133 0x00000080
16:40:47: ffff880071bfbbf8 0000000000000086 0000000000000000 ffff880000000000
16:40:47: ffff880000000065 ffff88007a19e104 ffffffffa1b01f80 0000000000000000
16:40:47: 000200000a0204b9 ffffffffa1dba813 ffff880071bf9ad8 ffff880071bfbfd8
16:40:48:Call Trace:
16:40:48: [<ffffffffa1d5f5f5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
16:40:48: [<ffffffffa1d86a22>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
16:40:48: [<ffffffff8152b7a6>] __mutex_lock_slowpath+0x96/0x210
16:40:48: [<ffffffff8152b2cb>] mutex_lock+0x2b/0x50
16:40:48: [<ffffffffa1f40c7a>] mdc_close+0x19a/0xa60 [mdc]
16:40:49: [<ffffffffa2017c09>] ? ll_i2suppgid+0x19/0x30 [lustre]
16:40:49: [<ffffffffa1efefe7>] lmv_close+0x2c7/0x580 [lmv]
16:40:49: [<ffffffffa1feac4f>] ll_close_inode_openhandle+0x2ef/0xe90 [lustre]
16:40:49: [<ffffffffa1febb97>] ll_md_real_close+0x197/0x210 [lustre]
16:40:49: [<ffffffffa1fed331>] ll_file_release+0x641/0xad0 [lustre]
16:40:49: [<ffffffff8118fb85>] __fput+0xf5/0x210
16:40:49: [<ffffffff8118fcc5>] fput+0x25/0x30
16:40:50: [<ffffffff8118af1d>] filp_close+0x5d/0x90
16:40:50: [<ffffffff8118aff5>] sys_close+0xa5/0x100
16:40:50: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
16:40:50:INFO: task dd:6983 blocked for more than 120 seconds.
16:40:50:      Not tainted 2.6.32-504.16.2.el6.x86_64 #1
16:40:50:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
16:40:50:dd            D 0000000000000000     0  6983   5154 0x00000080
16:40:50: ffff880071917bf8 0000000000000086 0000000000000000 ffff880000000000
16:40:50: ffff880000000065 ffff88007a56c244 ffffffffa1b03a00 0000000000000000
16:40:51: 000200000a0204b9 ffffffffa1dba813 ffff880070fafad8 ffff880071917fd8
16:40:51:Call Trace:
16:40:51: [<ffffffffa1d5f5f5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
16:40:51: [<ffffffffa1d86a22>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
16:40:51: [<ffffffff8152b7a6>] __mutex_lock_slowpath+0x96/0x210
16:40:51: [<ffffffff8152b2cb>] mutex_lock+0x2b/0x50
16:40:51: [<ffffffffa1f40c7a>] mdc_close+0x19a/0xa60 [mdc]
16:40:51: [<ffffffffa2017c09>] ? ll_i2suppgid+0x19/0x30 [lustre]
16:40:52: [<ffffffffa1efefe7>] lmv_close+0x2c7/0x580 [lmv]
16:40:52: [<ffffffffa1feac4f>] ll_close_inode_openhandle+0x2ef/0xe90 [lustre]
16:40:52: [<ffffffffa1febb97>] ll_md_real_close+0x197/0x210 [lustre]
16:40:52: [<ffffffffa1fed331>] ll_file_release+0x641/0xad0 [lustre]
16:40:53: [<ffffffff8118fb85>] __fput+0xf5/0x210
16:40:53: [<ffffffff8118fcc5>] fput+0x25/0x30
16:40:53: [<ffffffff8118af1d>] filp_close+0x5d/0x90
16:40:53: [<ffffffff8118aff5>] sys_close+0xa5/0x100
16:40:53: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
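
All three stacks block the same way: filp_close() reaches mdc_close(), which takes a mutex before issuing the close RPC; with the MDS connection lost, the holder stalls waiting for recovery and every other closing task queues behind it. Below is a minimal user-space analogue of that convoy (an illustrative sketch using pthreads, not Lustre code; the sleep stands in for the RPC that never completes):

/* build: cc convoy.c -o convoy -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t rpc_lock = PTHREAD_MUTEX_INITIALIZER;

static void *closer(void *arg)
{
    long id = (long)arg;
    pthread_mutex_lock(&rpc_lock);    /* cf. mutex_lock() in the mdc_close() frames above */
    sleep(id == 0 ? 120 : 0);         /* thread 0 "sends" an RPC that stalls for 120s */
    pthread_mutex_unlock(&rpc_lock);
    printf("closer %ld done\n", id);  /* the others finish only after the stall clears */
    return NULL;
}

int main(void)
{
    pthread_t t[3];
    for (long i = 0; i < 3; i++)
        pthread_create(&t[i], NULL, closer, (void *)i);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return 0;
}

If thread 0 wins the mutex first, the other "closers" sit in __mutex_lock_slowpath for the full two minutes, which is exactly the 120-second hung-task threshold reported in the console log.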


 Comments   
Comment by Oleg Drokin [ 03/Jul/15 ]

There are no MDS logs in the report, and we need those to understand why something was remounted read-only.

Comment by Andreas Dilger [ 03/Jul/15 ]

Looks like the client got a -30 error (-EROFS) and the MDS crashed. No way to know what is happening without the MDS logs.
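
For reference, Lustre logs kernel-style negative errnos, so the rc = -30 from ll_md_setattr() in the client console decodes as EROFS. A quick check (plain C, assuming standard Linux errno numbering):

#include <stdio.h>
#include <string.h>

int main(void)
{
    int rc = -30;                             /* from: md_setattr fails: rc = -30 */
    printf("rc = %d -> %s\n", rc, strerror(-rc)); /* prints: rc = -30 -> Read-only file system */
    return 0;
}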

Sarah, is there a TEI ticket open to fix the console logs for RHEL7? If not, can you please open one?

Comment by Sarah Liu [ 06/Jul/15 ]

There is TEI-3392 for a similar issue; I will open another one for the partially missing logs.

Comment by Sarah Liu [ 09/Jul/15 ]

I created TEI-3677 to track the missing log issue

Comment by James Casper [ 09/Mar/17 ]

Similar "process D" messages seen with parallel-scale-nfsv3 test_racer_on_nfs.

Comment by James Casper [ 09/Mar/17 ]

Link for parallel-scale-nfsv3 test_racer_on_nfs:
https://testing.hpdd.intel.com/test_sessions/d5c0f53e-f881-11e6-887f-5254006e85c2

Comment by Sarah Liu [ 09/Mar/17 ]

Hi Jim,
I think the failure you saw is LU-8584, caused by the "unable to handle kernel" error.

Comment by James Nunez (Inactive) [ 01/Aug/17 ]

We still see what looks like this hang with parallel-scale-nfsv3 test_racer_on_nfs. Logs for the most recent hang are at
https://testing.hpdd.intel.com/test_sets/59c82126-7286-11e7-bb95-5254006e85c2

The logs are interesting. On client 1, in the console log, we see a segfault followed by hangs in several different processes, including dd, truncate, file_concat.sh, setfattr, ln, and more:

03:08:04:[11566.277896] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 03:04:22 (1501124662)
03:08:04:[11566.741676] 8[31347]: segfault at 8 ip 00007fc99889b3b8 sp 00007fff9753dd40 error 4 in ld-2.17.so[7fc998890000+20000]
03:08:04:[11760.098045] INFO: task dd:3382 blocked for more than 120 seconds.
03:08:04:[11760.100574] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
03:08:04:[11760.103192] dd              D ffff88007ae64278     0  3382  30972 0x00000080
03:08:04:[11760.105770]  ffff880036c9fd10 0000000000000082 ffff880078760fb0 ffff880036c9ffd8
03:08:04:[11760.108465]  ffff880036c9ffd8 ffff880036c9ffd8 ffff880078760fb0 ffff88007ae64270
03:08:04:[11760.111045]  ffff88007ae64274 ffff880078760fb0 00000000ffffffff ffff88007ae64278
03:08:04:[11760.113570] Call Trace:
03:08:04:[11760.115753]  [<ffffffff8168d6c9>] schedule_preempt_disabled+0x29/0x70
03:08:04:[11760.118168]  [<ffffffff8168b315>] __mutex_lock_slowpath+0xc5/0x1d0
03:08:04:[11760.120533]  [<ffffffff8168a76f>] mutex_lock+0x1f/0x2f
03:08:04:[11760.122898]  [<ffffffff811833f6>] generic_file_aio_write+0x46/0xa0
03:08:04:[11760.125271]  [<ffffffffa057911b>] nfs_file_write+0xbb/0x1e0 [nfs]
03:08:04:[11760.127568]  [<ffffffff811fdfbd>] do_sync_write+0x8d/0xd0
03:08:04:[11760.129784]  [<ffffffff811fe82d>] vfs_write+0xbd/0x1e0
03:08:04:[11760.131928]  [<ffffffff811fe6f7>] ? vfs_read+0xf7/0x170
03:08:04:[11760.134049]  [<ffffffff811ff34f>] SyS_write+0x7f/0xe0
03:08:04:[11760.136168]  [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
03:08:04:[11760.138325] INFO: task truncate:3454 blocked for more than 120 seconds.
03:08:04:[11760.140499] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
03:08:04:[11760.142775] truncate        D ffffffff8168a630     0  3454  30984 0x00000080
03:08:04:[11760.145011]  ffff88006e3c3af0 0000000000000082 ffff880079121f60 ffff88006e3c3fd8
03:08:04:[11760.147321]  ffff88006e3c3fd8 ffff88006e3c3fd8 ffff880079121f60 ffff88007fc16c40
03:08:04:[11760.149568]  0000000000000000 7fffffffffffffff ffff88007ff616e8 ffffffff8168a630
03:08:04:[11760.151888] Call Trace:
03:08:04:[11760.153683]  [<ffffffff8168a630>] ? bit_wait+0x50/0x50
03:08:04:[11760.155721]  [<ffffffff8168c5d9>] schedule+0x29/0x70
03:08:04:[11760.157686]  [<ffffffff8168a019>] schedule_timeout+0x239/0x2c0
03:08:04:[11760.159733]  [<ffffffff8113d22e>] ? delayacct_end+0x9e/0xb0
03:08:04:[11760.161691]  [<ffffffff81060c1f>] ? kvm_clock_get_cycles+0x1f/0x30
03:08:04:[11760.163663]  [<ffffffff8168a630>] ? bit_wait+0x50/0x50
03:08:04:[11760.165551]  [<ffffffff8168bb7e>] io_schedule_timeout+0xae/0x130
03:08:04:[11760.167469]  [<ffffffff8168bc18>] io_schedule+0x18/0x20
03:08:04:[11760.169321]  [<ffffffff8168a641>] bit_wait_io+0x11/0x50
03:08:04:[11760.171825]  [<ffffffff8168a165>] __wait_on_bit+0x65/0x90
03:08:04:[11760.174073]  [<ffffffff81180211>] wait_on_page_bit+0x81/0xa0
03:08:04:[11760.175867]  [<ffffffff810b1be0>] ? wake_bit_function+0x40/0x40
03:08:04:[11760.177632]  [<ffffffff81180341>] __filemap_fdatawait_range+0x111/0x190
03:08:04:[11760.179428]  [<ffffffff811803d4>] filemap_fdatawait_range+0x14/0x30
03:08:04:[11760.181160]  [<ffffffff81180417>] filemap_fdatawait+0x27/0x30
03:08:04:[11760.182841]  [<ffffffff8118258c>] filemap_write_and_wait+0x4c/0x80
03:08:04:[11760.184525]  [<ffffffffa058a3b0>] nfs_wb_all+0x20/0x100 [nfs]
03:08:04:[11760.186135]  [<ffffffffa057ca08>] nfs_setattr+0x1d8/0x200 [nfs]
03:08:04:[11760.187756]  [<ffffffff8121be19>] notify_change+0x279/0x3d0
03:08:04:[11760.189312]  [<ffffffff811fc995>] do_truncate+0x75/0xc0
03:08:04:[11760.190830]  [<ffffffffa0578a29>] ? nfs_permission+0x199/0x1e0 [nfs]
03:08:04:[11760.192390]  [<ffffffff811fcb44>] vfs_truncate+0x164/0x190
03:08:04:[11760.193849]  [<ffffffff811fcbfc>] do_sys_truncate+0x8c/0xb0
03:08:04:[11760.195276]  [<ffffffff811fcdae>] SyS_truncate+0xe/0x10
03:08:04:[11760.196650]  [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
03:08:04:[11760.198062] INFO: task file_concat.sh:3471 blocked for more than 120 seconds.
03:08:04:[11760.199552] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
03:08:04:[11760.201110] file_concat.sh  D ffff88007ae64278     0  3471  31015 0x00000080

On the same client, in the stack trace log, we see dd and mkdir hung and an NFS server problem:

04:06:47:[11760.356684]  [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
04:06:47:[11760.358129] INFO: task mkdir:3524 blocked for more than 120 seconds.
04:06:47:[11760.359580] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
04:06:47:[11760.361164] mkdir           D ffff880006f53bc8     0  3524  31009 0x00000080
04:06:47:[11760.362709]  ffff88006a7d7d90 0000000000000082 ffff88005b1dce70 ffff88006a7d7fd8
04:06:47:[11760.364352]  ffff88006a7d7fd8 ffff88006a7d7fd8 ffff88005b1dce70 ffff880006f53bc0
04:06:47:[11760.365957]  ffff880006f53bc4 ffff88005b1dce70 00000000ffffffff ffff880006f53bc8
04:06:47:[11760.367612] Call Trace:
04:06:47:[11760.368814]  [<ffffffff8168d6c9>] schedule_preempt_disabled+0x29/0x70
04:06:47:[11760.370391]  [<ffffffff8168b315>] __mutex_lock_slowpath+0xc5/0x1d0
04:06:47:[11760.371915]  [<ffffffff8168a76f>] mutex_lock+0x1f/0x2f
04:06:47:[11760.373363]  [<ffffffff8120cae5>] filename_create+0x85/0x180
04:06:47:[11760.374860]  [<ffffffffa057870e>] ? nfs_do_access+0x23e/0x390 [nfs]
04:06:47:[11760.376396]  [<ffffffff811de665>] ? kmem_cache_alloc+0x35/0x1e0
04:06:47:[11760.377893]  [<ffffffff8120f2bf>] ? getname_flags+0x4f/0x1a0
04:06:47:[11760.379357]  [<ffffffff8120f334>] ? getname_flags+0xc4/0x1a0
04:06:47:[11760.380793]  [<ffffffff8121eede>] ? mntput_no_expire+0x3e/0x120
04:06:47:[11760.382236]  [<ffffffff8120f5e1>] user_path_create+0x41/0x60
04:06:47:[11760.383667]  [<ffffffff812108f6>] SyS_mkdirat+0x46/0xe0
04:06:47:[11760.385066]  [<ffffffff812109a9>] SyS_mkdir+0x19/0x20
04:06:47:[11760.386419]  [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
04:06:47:[11760.387839] INFO: task dd:3529 blocked for more than 120 seconds.
04:06:47:[11760.389275] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
04:06:47:[11760.390867] dd              D ffffffff8168a630     0  3529  31008 0x00000080
04:06:47:[11760.392426]  ffff880077fffbc0 0000000000000086 ffff880074fb0000 ffff880077ffffd8
04:06:48:[11760.394065]  ffff880077ffffd8 ffff880077ffffd8 ffff880074fb0000 ffff88007fd16c40
04:06:48:[11760.395686]  0000000000000000 7fffffffffffffff ffff88007ff5cee8 ffffffff8168a630
04:06:48:[11760.397353] Call Trace:
04:06:48:[11760.398564]  [<ffffffff8168a630>] ? bit_wait+0x50/0x50
04:06:48:[11760.400060]  [<ffffffff8168c5d9>] schedule+0x29/0x70
04:06:48:[11760.401498]  [<ffffffff8168a019>] schedule_timeout+0x239/0x2c0
04:06:48:[11760.403027]  [<ffffffffa0267e10>] ? rpc_put_task+0x10/0x20 [sunrpc]
04:06:48:[11760.404538]  [<ffffffffa0584174>] ? nfs_initiate_pgio+0xd4/0x160 [nfs]
04:06:48:[11760.406022]  [<ffffffff81060c1f>] ? kvm_clock_get_cycles+0x1f/0x30
04:06:48:[11760.407526]  [<ffffffff8168a630>] ? bit_wait+0x50/0x50
04:06:48:[11760.408925]  [<ffffffff8168bb7e>] io_schedule_timeout+0xae/0x130
04:06:48:[11760.410387]  [<ffffffff8168bc18>] io_schedule+0x18/0x20
04:06:48:[11760.411777]  [<ffffffff8168a641>] bit_wait_io+0x11/0x50
04:06:48:[11760.413167]  [<ffffffff8168a165>] __wait_on_bit+0x65/0x90
04:06:48:[11760.414551]  [<ffffffff81180211>] wait_on_page_bit+0x81/0xa0
04:06:48:[11760.415953]  [<ffffffff810b1be0>] ? wake_bit_function+0x40/0x40
04:06:48:[11760.417375]  [<ffffffff81180341>] __filemap_fdatawait_range+0x111/0x190
04:06:48:[11760.418862]  [<ffffffff811803d4>] filemap_fdatawait_range+0x14/0x30
04:06:48:[11760.420312]  [<ffffffff81182656>] filemap_write_and_wait_range+0x56/0x90
04:06:48:[11760.421798]  [<ffffffffa0578e46>] nfs_file_fsync+0x86/0x110 [nfs]
04:06:48:[11760.423232]  [<ffffffff812302cb>] vfs_fsync+0x2b/0x40
04:06:48:[11760.424589]  [<ffffffffa0579286>] nfs_file_flush+0x46/0x60 [nfs]
04:06:48:[11760.426028]  [<ffffffff811fbfd4>] filp_close+0x34/0x80
04:06:48:[11760.427400]  [<ffffffff8121d338>] __close_fd+0x78/0xa0
04:06:48:[11760.428755]  [<ffffffff811fdb73>] SyS_close+0x23/0x50
04:06:48:[11760.430084]  [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
04:06:48:[11826.143103] nfs: server trevis-6vm4 not responding, still trying
04:06:48:[12018.655111] nfs: server trevis-6vm4 not responding, still trying
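
The dd stack above shows why even close() wedges here: on NFS, filp_close() runs nfs_file_flush(), which fsyncs dirty pages back to the server, so a non-responsive server blocks the final close. A minimal user-space sketch of the same sequence (the /mnt/nfs path is hypothetical):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical mount point; any file on a dead NFS server shows the hang */
    int fd = open("/mnt/nfs/testfile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096] = {0};
    if (write(fd, buf, sizeof(buf)) < 0)  /* lands in the client page cache */
        perror("write");

    /* close() -> filp_close() -> nfs_file_flush() -> vfs_fsync():
     * blocks in filemap_write_and_wait_range() until the server replies,
     * matching the dd stack captured above. */
    if (close(fd) < 0)
        perror("close");
    return 0;
}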
Comment by James Nunez (Inactive) [ 29/Aug/18 ]

We're seeing this issue or something very close to it at
https://testing.whamcloud.com/test_sets/df96b5ee-aae1-11e8-80f7-52540065bddc
https://testing.whamcloud.com/test_sets/e801c64e-a918-11e8-80f7-52540065bddc
https://testing.whamcloud.com/test_sets/7fcd4432-ab38-11e8-80f7-52540065bddc

In these runs, the client console logs show ls or dir_create hanging. In the MDS console log, we see:

[ 8094.197467] Lustre: DEBUG MARKER: == racer test 1: racer on clients: onyx-41vm10,onyx-41vm9.onyx.whamcloud.com DURATION=300 ============ 01:57:42 (1535507862)
[ 8106.463255] LNetError: 14167:0:(lib-msg.c:779:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
[ 8113.461582] Lustre: lustre-MDT0000: Client 27389196-d811-d02a-8653-acd954b84381 (at 10.2.8.140@tcp) reconnecting
[ 8260.743287] LNet: Service thread pid 27085 was inactive for 62.03s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 8260.745290] Pid: 27085, comm: mdt00_003 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Fri Aug 17 20:37:05 UTC 2018
[ 8260.746424] Call Trace:
[ 8260.746727]  [<ffffffffc0d55f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[ 8260.747759]  [<ffffffffc0d57e93>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[ 8260.748642]  [<ffffffffc11b1bee>] mdt_dom_discard_data+0xfe/0x130 [mdt]
[ 8260.749734]  [<ffffffffc118b603>] mdt_reint_rename_internal.isra.40+0x1923/0x28a0 [mdt]
[ 8260.750712]  [<ffffffffc118dfcb>] mdt_reint_rename_or_migrate.isra.43+0x19b/0x860 [mdt]
[ 8260.751671]  [<ffffffffc118e6c3>] mdt_reint_rename+0x13/0x20 [mdt]
[ 8260.752421]  [<ffffffffc1193033>] mdt_reint_rec+0x83/0x210 [mdt]
[ 8260.753455]  [<ffffffffc11721d2>] mdt_reint_internal+0x6b2/0xa80 [mdt]
[ 8260.754249]  [<ffffffffc117d1e7>] mdt_reint+0x67/0x140 [mdt]
[ 8260.754927]  [<ffffffffc0df431a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 8260.755931]  [<ffffffffc0d974ab>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 8260.756855]  [<ffffffffc0d9ace4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[ 8260.757643]  [<ffffffffa3cbb621>] kthread+0xd1/0xe0
[ 8260.758294]  [<ffffffffa43205f7>] ret_from_fork_nospec_end+0x0/0x39
[ 8260.759073]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 8260.759863] LustreError: dumping log to /tmp/lustre-log.1535508029.27085
[ 8261.761641] Pid: 3455, comm: mdt00_028 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Fri Aug 17 20:37:05 UTC 2018
[ 8261.762905] Call Trace:
[ 8261.763216]  [<ffffffffc0d55f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[ 8261.764129]  [<ffffffffc0d57e93>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[ 8261.764980]  [<ffffffffc11763c7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
[ 8261.765776]  [<ffffffffc1176a70>] mdt_object_lock_internal+0x70/0x330 [mdt]
[ 8261.766546]  [<ffffffffc1177ada>] mdt_getattr_name_lock+0x83a/0x1c00 [mdt]
[ 8261.767348]  [<ffffffffc117f885>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[ 8261.768089]  [<ffffffffc117c768>] mdt_intent_policy+0x2f8/0xd10 [mdt]
[ 8261.768912]  [<ffffffffc0d3ce9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
[ 8261.769714]  [<ffffffffc0d65523>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
[ 8261.770557]  [<ffffffffc0deb9d2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 8261.771307]  [<ffffffffc0df431a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 8261.772124]  [<ffffffffc0d974ab>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 8261.773076]  [<ffffffffc0d9ace4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[ 8261.773824]  [<ffffffffa3cbb621>] kthread+0xd1/0xe0
[ 8261.774388]  [<ffffffffa43205f7>] ret_from_fork_nospec_end+0x0/0x39
[ 8261.775109]  [<ffffffffffffffff>] 0xffffffffffffffff
Comment by James Nunez (Inactive) [ 10/Sep/18 ]

Created LU-11359 to track this issue, since this ticket probably covers several distinct issues.
