[LU-10415] parallel-scale-nfsv3 test_racer_on_nfs: Timeout occurred after 335 mins, last suite running was parallel-scale-nfsv3, restarting cluster to continue tests
Created: 20/Dec/17  Updated: 01/Apr/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Casper Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

onyx, full DNE
servers: el7.4, ldiskfs, branch master, v2.10.56, b3678
clients: el7.4, branch master, v2.10.56, b3678


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

session: https://testing.hpdd.intel.com/test_sessions/45ec3e40-419a-47db-95d1-7dbe1c6a0b66
test set: https://testing.hpdd.intel.com/test_sets/bd99b32a-e0a6-11e7-9c63-52540065bddc

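After parallel-scale-nfsv3 times out, the console log contains 10 hung-task traces. The dd and ln traces share the same top frames, down through filemap_fdatawait_range, and differ only in the NFS entry path below that (nfs_file_fsync on close for dd, nfs_wb_all from nfs_getattr on stat for ln):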

From console log:

[17022.391464] nfs: server onyx-30vm4 not responding, still trying
[17040.277998] INFO: task dd:9525 blocked for more than 120 seconds.
[17040.278758] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[17040.279567] dd              D ffffffff816a76b0     0  9525   3854 0x00000080
[17040.280435]  ffff88006b797bd0 0000000000000082 ffff88004ef3bf40 ffff88006b797fd8
[17040.281271]  ffff88006b797fd8 ffff88006b797fd8 ffff88004ef3bf40 ffff88007fd16cc0
[17040.282108]  0000000000000000 7fffffffffffffff ffff88007ff682e8 ffffffff816a76b0
[17040.282991] Call Trace:
[17040.283302]  [<ffffffff816a76b0>] ? bit_wait+0x50/0x50
[17040.283823]  [<ffffffff816a9589>] schedule+0x29/0x70
[17040.284335]  [<ffffffff816a7099>] schedule_timeout+0x239/0x2c0
[17040.285022]  [<ffffffff81062efe>] ? kvm_clock_get_cycles+0x1e/0x20
[17040.285681]  [<ffffffff810e93ac>] ? ktime_get_ts64+0x4c/0xf0
[17040.286276]  [<ffffffff81062efe>] ? kvm_clock_get_cycles+0x1e/0x20
[17040.286954]  [<ffffffff810e93ac>] ? ktime_get_ts64+0x4c/0xf0
[17040.287534]  [<ffffffff816a76b0>] ? bit_wait+0x50/0x50
[17040.288113]  [<ffffffff816a8c0d>] io_schedule_timeout+0xad/0x130
[17040.288743]  [<ffffffff816a8ca8>] io_schedule+0x18/0x20
[17040.289291]  [<ffffffff816a76c1>] bit_wait_io+0x11/0x50
[17040.289862]  [<ffffffff816a71e5>] __wait_on_bit+0x65/0x90
[17040.290411]  [<ffffffff81181cc1>] wait_on_page_bit+0x81/0xa0
[17040.291041]  [<ffffffff810b19e0>] ? wake_bit_function+0x40/0x40
[17040.291656]  [<ffffffff81181df1>] __filemap_fdatawait_range+0x111/0x190
[17040.292335]  [<ffffffff81181e84>] filemap_fdatawait_range+0x14/0x30
[17040.293040]  [<ffffffff81183dc6>] filemap_write_and_wait_range+0x56/0x90
[17040.293751]  [<ffffffffc04e3516>] nfs_file_fsync+0x86/0x110 [nfs]
[17040.294387]  [<ffffffff812333cb>] vfs_fsync+0x2b/0x40
[17040.294982]  [<ffffffffc04e3956>] nfs_file_flush+0x46/0x60 [nfs]
[17040.295583]  [<ffffffff811fe294>] filp_close+0x34/0x80
[17040.296148]  [<ffffffff81220388>] __close_fd+0x78/0xa0
[17040.296709]  [<ffffffff811ffd03>] SyS_close+0x23/0x50
[17040.297241]  [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b

and

[17040.339627] ln              D ffffffff816a76b0     0  9837   3840 0x00000080
[17040.340461]  ffff880046a7bb20 0000000000000086 ffff88007b6eeeb0 ffff880046a7bfd8
[17040.341300]  ffff880046a7bfd8 ffff880046a7bfd8 ffff88007b6eeeb0 ffff88007fd16cc0
[17040.342135]  0000000000000000 7fffffffffffffff ffff88007ff682e8 ffffffff816a76b0
[17040.343018] Call Trace:
[17040.343304]  [<ffffffff816a76b0>] ? bit_wait+0x50/0x50
[17040.343836]  [<ffffffff816a9589>] schedule+0x29/0x70
[17040.344344]  [<ffffffff816a7099>] schedule_timeout+0x239/0x2c0
[17040.345048]  [<ffffffff81062efe>] ? kvm_clock_get_cycles+0x1e/0x20
[17040.345696]  [<ffffffff810e93ac>] ? ktime_get_ts64+0x4c/0xf0
[17040.346285]  [<ffffffff81062efe>] ? kvm_clock_get_cycles+0x1e/0x20
[17040.346973]  [<ffffffff810e93ac>] ? ktime_get_ts64+0x4c/0xf0
[17040.347563]  [<ffffffff816a76b0>] ? bit_wait+0x50/0x50
[17040.348104]  [<ffffffff816a8c0d>] io_schedule_timeout+0xad/0x130
[17040.348751]  [<ffffffff816a8ca8>] io_schedule+0x18/0x20
[17040.349299]  [<ffffffff816a76c1>] bit_wait_io+0x11/0x50
[17040.349880]  [<ffffffff816a71e5>] __wait_on_bit+0x65/0x90
[17040.350431]  [<ffffffff81181cc1>] wait_on_page_bit+0x81/0xa0
[17040.351069]  [<ffffffff810b19e0>] ? wake_bit_function+0x40/0x40
[17040.351685]  [<ffffffff81181df1>] __filemap_fdatawait_range+0x111/0x190
[17040.352376]  [<ffffffff81181e84>] filemap_fdatawait_range+0x14/0x30
[17040.353073]  [<ffffffff81181ec7>] filemap_fdatawait+0x27/0x30
[17040.353681]  [<ffffffff81183cfc>] filemap_write_and_wait+0x4c/0x80
[17040.354327]  [<ffffffffc04f4910>] nfs_wb_all+0x20/0x100 [nfs]
[17040.354982]  [<ffffffffc04e7b7b>] nfs_getattr+0x1bb/0x250 [nfs]
[17040.355571]  [<ffffffff812062c6>] vfs_getattr+0x46/0x80
[17040.356112]  [<ffffffff812063f5>] vfs_fstatat+0x75/0xc0
[17040.356705]  [<ffffffff8120694e>] SYSC_newstat+0x2e/0x60
[17040.357262]  [<ffffffff816b0456>] ? trace_do_page_fault+0x56/0x150
[17040.357940]  [<ffffffff816afaea>] ? do_async_page_fault+0x1a/0xd0
[17040.358556]  [<ffffffff816ac5f8>] ? async_page_fault+0x28/0x30
[17040.359218]  [<ffffffff81206c2e>] SyS_newstat+0xe/0x10
[17040.359763]  [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
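
To check mechanically that two traces share the same top frames, something like the following could be used. This is only a rough sketch and not part of the test suite: the helper names (parse_frames, common_prefix) are made up for illustration, the regular expression assumes the frame format seen in the console log above, and the two inputs are abbreviated excerpts of the dd and ln traces rather than the full traces.

```python
import re

# Assumed frame format, as in the console log above, e.g.:
# [17040.283823]  [<ffffffff816a9589>] schedule+0x29/0x70
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_frames(trace_text, include_unreliable=False):
    """Return the function names of a call trace, innermost frame first."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if not match:
            continue
        # A leading '?' marks a frame the unwinder could not confirm.
        if match.group(1) and not include_unreliable:
            continue
        frames.append(match.group(2))
    return frames


def common_prefix(a, b):
    """Longest run of identical frames starting at the top of both traces."""
    shared = []
    for x, y in zip(a, b):
        if x != y:
            break
        shared.append(x)
    return shared


# Abbreviated excerpts of the two traces above (not the full traces).
DD_EXCERPT = """
[17040.283823]  [<ffffffff816a9589>] schedule+0x29/0x70
[17040.290411]  [<ffffffff81181cc1>] wait_on_page_bit+0x81/0xa0
[17040.291041]  [<ffffffff810b19e0>] ? wake_bit_function+0x40/0x40
[17040.291656]  [<ffffffff81181df1>] __filemap_fdatawait_range+0x111/0x190
[17040.292335]  [<ffffffff81181e84>] filemap_fdatawait_range+0x14/0x30
[17040.293040]  [<ffffffff81183dc6>] filemap_write_and_wait_range+0x56/0x90
[17040.293751]  [<ffffffffc04e3516>] nfs_file_fsync+0x86/0x110 [nfs]
"""

LN_EXCERPT = """
[17040.343836]  [<ffffffff816a9589>] schedule+0x29/0x70
[17040.350431]  [<ffffffff81181cc1>] wait_on_page_bit+0x81/0xa0
[17040.351069]  [<ffffffff810b19e0>] ? wake_bit_function+0x40/0x40
[17040.351685]  [<ffffffff81181df1>] __filemap_fdatawait_range+0x111/0x190
[17040.352376]  [<ffffffff81181e84>] filemap_fdatawait_range+0x14/0x30
[17040.353073]  [<ffffffff81181ec7>] filemap_fdatawait+0x27/0x30
[17040.354327]  [<ffffffffc04f4910>] nfs_wb_all+0x20/0x100 [nfs]
"""

dd_frames = parse_frames(DD_EXCERPT)
ln_frames = parse_frames(LN_EXCERPT)
shared = common_prefix(dd_frames, ln_frames)

print("shared top frames:", shared)
print("dd diverges at:", dd_frames[len(shared)])
print("ln diverges at:", ln_frames[len(shared)])
```

Frames marked with '?' are skipped by default because they are unreliable stack entries; passing include_unreliable=True keeps them. The same comparison could be run over the other traces from the console log to see whether they also block in wait_on_page_bit.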
