Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Affects Version/s: Lustre 2.13.0
- Severity: 3
Description
parallel-scale-nfsv3 test_compilebench hangs.
We have tickets for many different parallel-scale-nfs* test compilebench hangs/crashes, and even more can be found in Maloo, but I don't think we have a ticket that reflects this particular hang, which has so far only been seen once.
Looking at the client1 console log for https://testing.whamcloud.com/test_sets/0cf1218c-035b-11ea-b934-52540065bddc, we see that many python2 threads are blocked:
[113647.675108] Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.14272 -i 2 -r 2 --makej
[113973.657133] nfs: server trevis-10vm9 not responding, still trying
[114001.169356] INFO: task python2:18055 blocked for more than 120 seconds.
[114001.170025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[114001.170758] python2 D ffff93cffb5aa0e0 0 18055 17683 0x00000080
[114001.171465] Call Trace:
[114001.205334] [<ffffffff9a17d1f0>] ? bit_wait+0x50/0x50
[114001.205915] [<ffffffff9a17f229>] schedule+0x29/0x70
[114001.206412] [<ffffffff9a17cbb1>] schedule_timeout+0x221/0x2d0
[114001.207084] [<ffffffffc05f97e6>] ? rpc_run_task+0xf6/0x150 [sunrpc]
[114001.215173] [<ffffffffc06085c0>] ? rpc_put_task+0x10/0x20 [sunrpc]
[114001.224520] [<ffffffff99a6c27e>] ? kvm_clock_get_cycles+0x1e/0x20
[114001.225563] [<ffffffff99b047e2>] ? ktime_get_ts64+0x52/0xf0
[114001.226160] [<ffffffff9a17d1f0>] ? bit_wait+0x50/0x50
[114001.226675] [<ffffffff9a17e79d>] io_schedule_timeout+0xad/0x130
[114001.227293] [<ffffffff9a17e838>] io_schedule+0x18/0x20
[114001.227812] [<ffffffff9a17d201>] bit_wait_io+0x11/0x50
[114001.228384] [<ffffffff9a17cd27>] __wait_on_bit+0x67/0x90
[114001.232799] [<ffffffff99bbaf91>] wait_on_page_bit+0x81/0xa0
[114001.233941] [<ffffffff99ac6280>] ? wake_bit_function+0x40/0x40
[114001.234530] [<ffffffff99bbb0c1>] __filemap_fdatawait_range+0x111/0x190
[114001.235267] [<ffffffff99bc8d31>] ? do_writepages+0x21/0x50
[114001.235828] [<ffffffff99bbb154>] filemap_fdatawait_range+0x14/0x30
[114001.236496] [<ffffffff99bbd616>] filemap_write_and_wait_range+0x56/0x90
[114001.237197] [<ffffffffc08b6746>] nfs_file_fsync+0x86/0x110 [nfs]
[114001.242659] [<ffffffff99c7d7ee>] vfs_fsync+0x2e/0x40
[114001.243247] [<ffffffffc08b6b86>] nfs_file_flush+0x46/0x60 [nfs]
[114001.285726] [<ffffffff99c45b87>] filp_close+0x37/0x90
[114001.286356] [<ffffffff99c6932c>] __close_fd+0x8c/0xb0
[114001.286861] [<ffffffff99c476a3>] SyS_close+0x23/0x50
[114001.287363] [<ffffffff9a18cede>] system_call_fastpath+0x25/0x2a
[114001.288074] [<ffffffff9a18ce21>] ? system_call_after_swapgs+0xae/0x146
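For reference, the workload that hangs is compilebench run over an NFSv3 re-export of the Lustre client mount, and the python2 threads above are stuck in close()/fsync() of files on that NFS mount. A rough sketch of what the test exercises (the export options and mount command are illustrative assumptions, not the exact commands the test framework runs; only the compilebench invocation is taken verbatim from the DEBUG MARKER above):

# On the NFS server node (a Lustre client, trevis-10vm9 in this run):
exportfs -o rw,async,no_root_squash '*:/mnt/lustre'
# On the NFS client node:
mount -t nfs -o vers=3 trevis-10vm9:/mnt/lustre /mnt/lustre
# The benchmark that then hangs:
./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.14272 -i 2 -r 2 --makej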
Looking at the MDS console log, we see precreate errors
[49327.258797] Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.14272 -i 2 -r 2 --makej
[50073.882612] Lustre: 1329:0:(client.c:2219:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1573341502/real 1573341502] req@ffff96cdd619f180 x1649716485161792/t0(0) o5->lustre-OST0001-osc-MDT0000@10.9.4.114@tcp:28/4 lens 432/432 e 24 to 1 dl 1573342103 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'osp-pre-1-0.0'
[50073.885539] Lustre: lustre-OST0001-osc-MDT0000: Connection to lustre-OST0001 (at 10.9.4.114@tcp) was lost; in progress operations using this service will wait for recovery to complete
[50073.887123] LustreError: 1329:0:(osp_precreate.c:680:osp_precreate_send()) lustre-OST0001-osc-MDT0000: can't precreate: rc = -11
[50073.888246] LustreError: 1329:0:(osp_precreate.c:1342:osp_precreate_thread()) lustre-OST0001-osc-MDT0000: cannot precreate objects: rc = -11
[50073.889887] LustreError: 1329:0:(osp_precreate.c:1342:osp_precreate_thread()) Skipped 1 previous similar message
[50073.891298] Lustre: lustre-OST0000-osc-MDT0000: Connection restored to 10.9.4.114@tcp (at 10.9.4.114@tcp)
[50080.302649] Lustre: 2442:0:(client.c:2219:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1573341508/real 1573341508] req@ffff96cddbab9f80 x1649716485173056/t0(0) o103->lustre-OST0001-osc-ffff96ce28e20800@10.9.4.114@tcp:17/18 lens 3584/224 e 24 to 1 dl 1573342109 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'ldlm_bl_03.0'
[50080.305618] Lustre: 2442:0:(client.c:2219:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
[50089.211705] LustreError: 24140:0:(osc_cache.c:962:osc_extent_wait()) extent ffff96cdf8f299d8@{[0 -> 255/255], [3|0|+|rpc|wihY|ffff96cdf8f2d000], [1703936|2|+|-|ffff96cdf8f50000|256|ffff96cddcf13150]} lustre-OST0000-osc-ffff96ce28e20800: wait ext to 0 timedout, recovery in progress?
[50089.214130] LustreError: 24140:0:(osc_cache.c:962:osc_extent_wait()) ### extent: ffff96cdf8f299d8 ns: lustre-OST0000-osc-ffff96ce28e20800 lock: ffff96cdf8f50000/0xf87744c60f1507dc lrc: 3/0,0 mode: PW/PW res: [0xa8aaf:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->8191) flags: 0x800029400000000 nid: local remote: 0x41701af42269a674 expref: -99 pid: 13661 timeout: 0 lvb_type: 1
[50099.269764] LustreError: 23731:0:(osc_cache.c:962:osc_extent_wait()) extent ffff96cdfac29738@{[0 -> 255/255], [3|0|+|rpc|wihY|ffff96cdfac28000], [1703936|1|+|-|ffff96ce106f6fc0|256|ffff96ce2af1b150]} lustre-OST0001-osc-ffff96ce28e20800: wait ext to 0 timedout, recovery in progress?
[50099.272268] LustreError: 23731:0:(osc_cache.c:962:osc_extent_wait()) ### extent: ffff96cdfac29738 ns: lustre-OST0001-osc-ffff96ce28e20800 lock: ffff96ce106f6fc0/0xf87744c60f171f86 lrc: 3/0,0 mode: PW/PW res: [0x4dc2e:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x800029400000000 nid: local remote: 0x41701af4226a1483 expref: -99 pid: 13661 timeout: 0 lvb_type: 1
[50119.273909] LustreError: 9636:0:(osc_cache.c:962:osc_extent_wait()) extent ffff96cdd5a57e70@{[0 -> 255/255], [3|0|+|rpc|wiuY|ffff96cddc2cb280], [1703936|2|+|-|ffff96ce28f8a480|256|ffff96ce035da0e0]} lustre-OST0000-osc-ffff96ce28e20800: wait ext to 0 timedout, recovery in progress?
[50119.276401] LustreError: 9636:0:(osc_cache.c:962:osc_extent_wait()) Skipped 1 previous similar message
[50119.277329] LustreError: 9636:0:(osc_cache.c:962:osc_extent_wait()) ### extent: ffff96cdd5a57e70 ns: lustre-OST0000-osc-ffff96ce28e20800 lock: ffff96ce28f8a480/0xf87744c60f17d31e lrc: 3/0,0 mode: PW/PW res: [0xa8db8:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->8191) flags: 0x800029400000000 nid: local remote: 0x41701af4226a3b2d expref: -99 pid: 13661 timeout: 0 lvb_type: 1
[50119.280639] LustreError: 9636:0:(osc_cache.c:962:osc_extent_wait()) Skipped 1 previous similar message
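Note that rc = -11 is -EAGAIN, i.e. the OSP precreate thread could not precreate objects while the connection to the OST was lost. If this reproduces, it may be worth capturing the precreate state on the MDS while the hang is in progress; a sketch, assuming the usual osp tunables are available (parameter names may vary between versions):

lctl get_param osp.*.prealloc_status
lctl get_param osp.*.prealloc_next_id osp.*.prealloc_last_id
lctl get_param osp.*.destroys_in_flight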
Looking at the OSS console log, we see
[113559.530046] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test compilebench: compilebench ============================================== 23:15:55 (1573341355)
[113559.848465] Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.parallel-scale-nfs\/d0.compilebench.14272 -i 2 -r 2 --makej
[113560.073261] Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.14272 -i 2 -r 2 --makej
[113649.380426] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[113649.381606] LNetError: Skipped 9 previous similar messages
[113649.382196] LNetError: 4778:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[113649.383192] LNetError: 4778:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[113649.384133] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[113649.385618] LNetError: Skipped 9 previous similar messages
[113704.382622] LNetError: 4785:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[113704.384004] LNetError: 4785:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 42 previous similar messages
[113745.813831] Lustre: ll_ost00_004: service thread pid 8437 was inactive for 40.091 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[113745.815728] Pid: 8437, comm: ll_ost00_004 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Tue Oct 1 17:02:18 UTC 2019
[113745.816701] Call Trace:
[113745.817090] [<ffffffffbdd90f88>] call_rwsem_down_read_failed+0x18/0x30
[113745.817805] [<ffffffffc0cd7fba>] lu_cache_shrink+0x6a/0x2d0 [obdclass]
[113745.818647] [<ffffffffbdbcf73e>] shrink_slab+0xae/0x340
[113745.819337] [<ffffffffbdbd2aca>] do_try_to_free_pages+0x3ca/0x520
[113745.819973] [<ffffffffbdbd2d1c>] try_to_free_pages+0xfc/0x180
[113745.820606] [<ffffffffbe1745fd>] __alloc_pages_slowpath+0x457/0x724
[113745.821282] [<ffffffffbdbc6b84>] __alloc_pages_nodemask+0x404/0x420
[113745.822027] [<ffffffffbdc14c68>] alloc_pages_current+0x98/0x110
[113745.822684] [<ffffffffbdc1fea5>] new_slab+0x2c5/0x390
[113745.823287] [<ffffffffbdc2170c>] ___slab_alloc+0x3ac/0x4f0
[113745.823863] [<ffffffffbe175a1e>] __slab_alloc+0x40/0x5c
[113745.824452] [<ffffffffbdc235fb>] kmem_cache_alloc+0x19b/0x1f0
[113745.825051] [<ffffffffc03aa3c9>] spl_kmem_cache_alloc+0x99/0x150 [spl]
[113745.825788] [<ffffffffc053d627>] zio_buf_alloc+0x57/0x60 [zfs]
[113745.836584] [<ffffffffc0475d7d>] arc_get_data_buf.isra.25+0x4d/0x60 [zfs]
[113745.837313] [<ffffffffc0476a0c>] arc_buf_alloc_impl.isra.29+0x21c/0x2e0 [zfs]
[113745.838138] [<ffffffffc04777e1>] arc_read+0x711/0xab0 [zfs]
[113745.838774] [<ffffffffc0480cc9>] dbuf_read+0x349/0x9e0 [zfs]
[113745.839428] [<ffffffffc048d090>] dmu_buf_hold_by_dnode+0x50/0x80 [zfs]
[113745.840115] [<ffffffffc05006f2>] zap_get_leaf_byblk.isra.11+0x92/0x260 [zfs]
[113745.840897] [<ffffffffc05009a8>] zap_deref_leaf+0xe8/0x100 [zfs]
[113745.841569] [<ffffffffc05019e2>] fzap_lookup+0x62/0x180 [zfs]
[113745.842173] [<ffffffffc05062e1>] zap_lookup_impl+0xd1/0x200 [zfs]
[113745.842866] [<ffffffffc0506c5e>] zap_lookup_norm+0x8e/0xd0 [zfs]
[113745.843544] [<ffffffffc0506cd3>] zap_lookup+0x33/0x40 [zfs]
[113745.844134] [<ffffffffc12057c1>] osd_fid_lookup+0x3e1/0x4c0 [osd_zfs]
[113745.844951] [<ffffffffc11fd729>] osd_object_init+0x389/0xca0 [osd_zfs]
[113745.845666] [<ffffffffc0cd67fb>] lu_object_start.isra.35+0x8b/0x120 [obdclass]
[113745.846486] [<ffffffffc0cda511>] lu_object_find_at+0x1e1/0xa60 [obdclass]
[113745.847209] [<ffffffffc0cdada6>] lu_object_find+0x16/0x20 [obdclass]
[113745.847935] [<ffffffffc133b315>] ofd_object_find+0x35/0x100 [ofd]
[113745.848640] [<ffffffffc13495b7>] ofd_lvbo_update+0x577/0xe00 [ofd]
[113745.849293] [<ffffffffc0f920e4>] ldlm_request_cancel+0x404/0x780 [ptlrpc]
[113745.850259] [<ffffffffc0f93ab4>] ldlm_handle_enqueue0+0x84/0x15f0 [ptlrpc]
[113745.851042] [<ffffffffc101e042>] tgt_enqueue+0x62/0x210 [ptlrpc]
[113745.851805] [<ffffffffc102683a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[113745.852685] [<ffffffffc0fc8a96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[113745.853560] [<ffffffffc0fcc5cc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[113745.854232] [<ffffffffbdac50d1>] kthread+0xd1/0xe0
[113745.854803] [<ffffffffbe18cd37>] ret_from_fork_nospec_end+0x0/0x39
[113745.855524] [<ffffffffffffffff>] 0xffffffffffffffff
[113745.856105] Pid: 8438, comm: ll_ost00_005 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Tue Oct 1 17:02:18 UTC 2019
[113745.857140] Call Trace:
[113745.857424] [<ffffffffbdd90f88>] call_rwsem_down_read_failed+0x18/0x30
[113745.858097] [<ffffffffc0cd7fba>] lu_cache_shrink+0x6a/0x2d0 [obdclass]
[113745.858845] [<ffffffffbdbcf73e>] shrink_slab+0xae/0x340
[113745.859483] [<ffffffffbdbd2aca>] do_try_to_free_pages+0x3ca/0x520
[113745.860117] [<ffffffffbdbd2d1c>] try_to_free_pages+0xfc/0x180
[113745.860755] [<ffffffffbe1745fd>] __alloc_pages_slowpath+0x457/0x724
[113745.861467] [<ffffffffbdbc6b84>] __alloc_pages_nodemask+0x404/0x420
[113745.862110] [<ffffffffbdc14c68>] alloc_pages_current+0x98/0x110
[113745.862754] [<ffffffffbdc1fea5>] new_slab+0x2c5/0x390
[113745.863355] [<ffffffffbdc2170c>] ___slab_alloc+0x3ac/0x4f0
[113745.863927] [<ffffffffbe175a1e>] __slab_alloc+0x40/0x5c
[113745.864510] [<ffffffffbdc235fb>] kmem_cache_alloc+0x19b/0x1f0
[113745.865107] [<ffffffffc03aa3c9>] spl_kmem_cache_alloc+0x99/0x150 [spl]
[113745.865879] [<ffffffffc053d627>] zio_buf_alloc+0x57/0x60 [zfs]
[113745.866546] [<ffffffffc0475d7d>] arc_get_data_buf.isra.25+0x4d/0x60 [zfs]
[113745.867254] [<ffffffffc0476a0c>] arc_buf_alloc_impl.isra.29+0x21c/0x2e0 [zfs]
[113745.868051] [<ffffffffc047abd5>] arc_alloc_buf+0x55/0xb0 [zfs]
[113745.868718] [<ffffffffc0480426>] dbuf_hold_copy+0x56/0xf0 [zfs]
[113745.869419] [<ffffffffc04831d0>] __dbuf_hold_impl+0x440/0x5c0 [zfs]
[113745.870082] [<ffffffffc04833f2>] dbuf_hold_impl+0xa2/0xd0 [zfs]
[113745.870748] [<ffffffffc0483455>] dbuf_hold_level+0x35/0x60 [zfs]
[113745.871443] [<ffffffffc0483496>] dbuf_hold+0x16/0x20 [zfs]
[113745.872034] [<ffffffffc048cf39>] dmu_buf_hold_noread+0x89/0x110 [zfs]
[113745.872776] [<ffffffffc048cfef>] dmu_buf_hold+0x2f/0x80 [zfs]
[113745.873453] [<ffffffffc050674e>] zap_lockdir+0x4e/0xd0 [zfs]
[113745.874072] [<ffffffffc0506c30>] zap_lookup_norm+0x60/0xd0 [zfs]
[113745.874757] [<ffffffffc0506cd3>] zap_lookup+0x33/0x40 [zfs]
[113745.875423] [<ffffffffc12057c1>] osd_fid_lookup+0x3e1/0x4c0 [osd_zfs]
[113745.876091] [<ffffffffc11fd729>] osd_object_init+0x389/0xca0 [osd_zfs]
[113745.876799] [<ffffffffc0cd67fb>] lu_object_start.isra.35+0x8b/0x120 [obdclass]
[113745.877615] [<ffffffffc0cda511>] lu_object_find_at+0x1e1/0xa60 [obdclass]
[113745.878334] [<ffffffffc0cdada6>] lu_object_find+0x16/0x20 [obdclass]
[113745.879049] [<ffffffffc133b315>] ofd_object_find+0x35/0x100 [ofd]
[113745.879749] [<ffffffffc13495b7>] ofd_lvbo_update+0x577/0xe00 [ofd]
[113745.880463] [<ffffffffc0f920e4>] ldlm_request_cancel+0x404/0x780 [ptlrpc]
[113745.881187] [<ffffffffc0f93ab4>] ldlm_handle_enqueue0+0x84/0x15f0 [ptlrpc]
[113745.881964] [<ffffffffc101e042>] tgt_enqueue+0x62/0x210 [ptlrpc]
[113745.882657] [<ffffffffc102683a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[113745.883464] [<ffffffffc0fc8a96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[113745.884261] [<ffffffffc0fcc5cc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[113745.884981] [<ffffffffbdac50d1>] kthread+0xd1/0xe0
[113745.885543] [<ffffffffbe18cd37>] ret_from_fork_nospec_end+0x0/0x39
[113745.886175] [<ffffffffffffffff>] 0xffffffffffffffff
[113745.886783] Pid: 8440, comm: ll_ost00_007 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Tue Oct 1 17:02:18 UTC 2019
[113745.887753] Call Trace:
[113745.888033] [<ffffffffbdd90f88>] call_rwsem_down_read_failed+0x18/0x30
[113745.888760] [<ffffffffc0cd7fba>] lu_cache_shrink+0x6a/0x2d0 [obdclass]
[113745.889509] [<ffffffffbdbcf73e>] shrink_slab+0xae/0x340
[113745.890061] [<ffffffffbdbd2aca>] do_try_to_free_pages+0x3ca/0x520
[113745.890894] [<ffffffffbdbd2d1c>] try_to_free_pages+0xfc/0x180
[113745.891533] [<ffffffffbe1745fd>] __alloc_pages_slowpath+0x457/0x724
[113745.892180] [<ffffffffbdbc6b84>] __alloc_pages_nodemask+0x404/0x420
[113745.892889] [<ffffffffbdc14c68>] alloc_pages_current+0x98/0x110
[113745.893585] [<ffffffffbdc1fea5>] new_slab+0x2c5/0x390
[113745.894122] [<ffffffffbdc2170c>] ___slab_alloc+0x3ac/0x4f0
[113745.894747] [<ffffffffbe175a1e>] __slab_alloc+0x40/0x5c
[113745.895486] [<ffffffffbdc235fb>] kmem_cache_alloc+0x19b/0x1f0
[113745.896337] [<ffffffffc03aa3c9>] spl_kmem_cache_alloc+0x99/0x150 [spl]
[113745.897013] [<ffffffffc053d627>] zio_buf_alloc+0x57/0x60 [zfs]
[113745.897685] [<ffffffffc0475d7d>] arc_get_data_buf.isra.25+0x4d/0x60 [zfs]
[113745.898456] [<ffffffffc0476a0c>] arc_buf_alloc_impl.isra.29+0x21c/0x2e0 [zfs]
[113745.899205] [<ffffffffc04777e1>] arc_read+0x711/0xab0 [zfs]
[113745.899855] [<ffffffffc0480cc9>] dbuf_read+0x349/0x9e0 [zfs]
[113745.900536] [<ffffffffc048d090>] dmu_buf_hold_by_dnode+0x50/0x80 [zfs]
[113745.901227] [<ffffffffc05006f2>] zap_get_leaf_byblk.isra.11+0x92/0x260 [zfs]
[113745.902013] [<ffffffffc05009a8>] zap_deref_leaf+0xe8/0x100 [zfs]
[113745.902696] [<ffffffffc05019e2>] fzap_lookup+0x62/0x180 [zfs]
[113745.903389] [<ffffffffc05062e1>] zap_lookup_impl+0xd1/0x200 [zfs]
[113745.904060] [<ffffffffc0506c5e>] zap_lookup_norm+0x8e/0xd0 [zfs]
[113745.904740] [<ffffffffc0506cd3>] zap_lookup+0x33/0x40 [zfs]
[113745.905411] [<ffffffffc12057c1>] osd_fid_lookup+0x3e1/0x4c0 [osd_zfs]
[113745.906085] [<ffffffffc11fd729>] osd_object_init+0x389/0xca0 [osd_zfs]
[113745.906793] [<ffffffffc0cd67fb>] lu_object_start.isra.35+0x8b/0x120 [obdclass]
[113745.907648] [<ffffffffc0cda511>] lu_object_find_at+0x1e1/0xa60 [obdclass]
[113745.908417] [<ffffffffc0cdada6>] lu_object_find+0x16/0x20 [obdclass]
[113745.909095] [<ffffffffc133b315>] ofd_object_find+0x35/0x100 [ofd]
[113745.909764] [<ffffffffc133b879>] ofd_precreate_objects+0x229/0x1df0 [ofd]
[113745.910523] [<ffffffffc132f024>] ofd_create_hdl+0x474/0x2100 [ofd]
[113745.911167] [<ffffffffc102683a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[113745.911945] [<ffffffffc0fc8a96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[113745.912779] [<ffffffffc0fcc5cc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[113745.913506] [<ffffffffbdac50d1>] kthread+0xd1/0xe0
[113745.914019] [<ffffffffbe18cd37>] ret_from_fork_nospec_end+0x0/0x39
[113745.914723] [<ffffffffffffffff>] 0xffffffffffffffff
[113745.915315] Lustre: ll_ost00_006: service thread pid 8439 was inactive for 40.227 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[113762.069942] Lustre: ll_ost_io00_003: service thread pid 8436 was inactive for 40.050 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[113772.182011] Lustre: ll_ost_io00_005: service thread pid 8600 was inactive for 40.047 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[113800.086199] Lustre: ll_ost_io00_007: service thread pid 14628 was inactive for 62.213 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[113800.087709] Lustre: Skipped 4 previous similar messages
[113832.342416] Lustre: ldlm_cn00_003: service thread pid 20512 was inactive for 120.072 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[113832.343880] Lustre: Skipped 1 previous similar message
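The OSS stacks above all show OST service threads stuck under lu_cache_shrink (via call_rwsem_down_read_failed) while allocating memory inside the ZFS ARC, which suggests memory pressure on the OSS in addition to the network errors. If this is hit again, it would be worth grabbing the memory/ARC state on the OSS at that moment; a sketch using standard Linux/ZFS interfaces (nothing test-specific is assumed):

free -m
cat /proc/meminfo
cat /proc/spl/kstat/zfs/arcstats
# dump all task stacks to the console for comparison with the watchdog traces
echo t > /proc/sysrq-trigger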
After many network errors, all ll_ost* threads have call traces like
[115509.156340] Lustre: ll_ost00_001: service thread pid 18767 was inactive for 1202.443 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[115509.158178] Lustre: Skipped 2 previous similar messages
[115509.158664] Pid: 18767, comm: ll_ost00_001 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Tue Oct 1 17:02:18 UTC 2019
[115509.159590] Call Trace:
[115509.159896] [<ffffffffc132f873>] ofd_create_hdl+0xcc3/0x2100 [ofd]
[115509.160527] [<ffffffffc102683a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[115509.161254] [<ffffffffc0fc8a96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[115509.162126] [<ffffffffc0fcc5cc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[115509.162782] [<ffffffffbdac50d1>] kthread+0xd1/0xe0
[115509.163328] [<ffffffffbe18cd37>] ret_from_fork_nospec_end+0x0/0x39
[115509.163943] [<ffffffffffffffff>] 0xffffffffffffffff
[115509.164476] Pid: 18773, comm: ll_ost_io00_002 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Tue Oct 1 17:02:18 UTC 2019
[115509.165436] Call Trace:
[115509.165738] [<ffffffffc0cdab77>] lu_object_find_at+0x847/0xa60 [obdclass]
[115509.166492] [<ffffffffc0cdada6>] lu_object_find+0x16/0x20 [obdclass]
[115509.167142] [<ffffffffc133b315>] ofd_object_find+0x35/0x100 [ofd]
[115509.167769] [<ffffffffc1344074>] ofd_preprw_write.isra.31+0xa4/0xec0 [ofd]
[115509.168543] [<ffffffffc13452b2>] ofd_preprw+0x422/0x11b0 [ofd]
[115509.169138] [<ffffffffc1021fa3>] tgt_brw_write+0xcd3/0x1c50 [ptlrpc]
[115509.169824] [<ffffffffc102683a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[115509.170573] [<ffffffffc0fc8a96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[115509.171348] [<ffffffffc0fcc5cc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[115509.172032] [<ffffffffbdac50d1>] kthread+0xd1/0xe0
[115509.172537] [<ffffffffbe18cd37>] ret_from_fork_nospec_end+0x0/0x39
[115509.173153] [<ffffffffffffffff>] 0xffffffffffffffff
[115534.402093] LNetError: 4785:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
and
[116266.921609] Pid: 8474, comm: ll_ost00_008 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Tue Oct 1 17:02:18 UTC 2019
[116266.922582] Call Trace:
[116266.922853] [<ffffffffc132f873>] ofd_create_hdl+0xcc3/0x2100 [ofd]
[116266.923513] [<ffffffffc102683a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[116266.924378] [<ffffffffc0fc8a96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[116266.925228] [<ffffffffc0fcc5cc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[116266.925922] [<ffffffffbdac50d1>] kthread+0xd1/0xe0
[116266.926443] [<ffffffffbe18cd37>] ret_from_fork_nospec_end+0x0/0x39
[116266.927077] [<ffffffffffffffff>] 0xffffffffffffffff
[116266.927797] Pid: 8475, comm: ll_ost00_009 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Tue Oct 1 17:02:18 UTC 2019
[116266.928843] Call Trace:
[116266.929166] [<ffffffffc132f873>] ofd_create_hdl+0xcc3/0x2100 [ofd]
[116266.929860] [<ffffffffc102683a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[116266.930602] [<ffffffffc0fc8a96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[116266.931459] [<ffffffffc0fcc5cc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[116266.932167] [<ffffffffbdac50d1>] kthread+0xd1/0xe0
[116266.932778] [<ffffffffbe18cd37>] ret_from_fork_nospec_end+0x0/0x39
[116266.933422] [<ffffffffffffffff>] 0xffffffffffffffff
[116271.015250] Lustre: ll_ost00_003: service thread pid 21872 was inactive for 1203.282 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[116271.017183] Lustre: Skipped 1 previous similar message
[116271.017704] Pid: 21872, comm: ll_ost00_003 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Tue Oct 1 17:02:18 UTC 2019
[116271.018758] Call Trace:
[116271.019022] [<ffffffffbdac4f2a>] kthread_create_on_node+0xaa/0x140
[116271.019669] [<ffffffffc0fca385>] ptlrpc_start_thread+0x7f5/0xce0 [ptlrpc]
[116271.020453] [<ffffffffc0fcc52b>] ptlrpc_main+0xb0b/0x1540 [ptlrpc]
[116271.021157] [<ffffffffbdac50d1>] kthread+0xd1/0xe0
[116271.021775] [<ffffffffbe18cd37>] ret_from_fork_nospec_end+0x0/0x39
[116271.022425] [<ffffffffffffffff>] 0xffffffffffffffff
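A quick way to see how widespread the hang is from a console log like the above is to count the watchdog messages per service thread; a sketch, where oss-console.log is a hypothetical filename for the saved OSS console output:

# total number of "service thread ... was inactive" watchdog messages
grep -c 'was inactive for' oss-console.log
# break the count down by service thread name
grep 'was inactive for' oss-console.log | awk -F'Lustre: |: service thread' '{print $2}' | sort | uniq -c | sort -rn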