Details
Type: Bug
Resolution: Unresolved
Priority: Blocker
Fix Version/s: None
Affects Version/s: Lustre 2.10.7
Labels: None
Environment: Most clients are 2.12.0 (Sherlock) behind 2.12 routers, a few local clients are 2.10.6
Severity: 2
Rank: 9223372036854775807
Description
We recently upgraded the Oak servers to 2.10.7, but since then the OSS have been showing a lot of these errors:
[29459.670359] Pid: 184632, comm: ll_ost_io00_061 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
[29459.682008] Call Trace:
[29459.684747] [<ffffffffa9996684>] __lock_page+0x74/0x90
[29459.690595] [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
[29459.696924] [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
[29459.703545] [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
[29459.711052] [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
[29459.717672] [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
[29459.724829] [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[29459.732543] [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[29459.741128] [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[29459.748166] [<ffffffffa98bdf21>] kthread+0xd1/0xe0
[29459.753622] [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
[29459.760630] [<ffffffffffffffff>] 0xffffffffffffffff
[29459.766199] LustreError: dumping log to /tmp/lustre-log.1554392681.184632
[29558.439589] Lustre: 184902:0:(service.c:1346:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply req@ffff883e21427450 x1628649065264016/t0(0) o3->214046b3-82aa-4e1f-6785-53da21760b7f@10.8.7.25@o2ib6:295/0 lens 488/432 e 0 to 0 dl 1554392785 ref 2 fl Interpret:/2/0 rc 0/0
[29558.471803] Lustre: 184902:0:(service.c:1346:ptlrpc_at_send_early_reply()) Skipped 86 previous similar messages
[29565.756656] LustreError: 184652:0:(ldlm_lib.c:3199:target_bulk_io()) @@@ bulk READ failed: rc -107 req@ffff8841a2c23c50 x1629017475649200/t0(0) o3->9c21af89-2623-9dd8-0966-6a7e83d1ca0c@10.9.105.61@o2ib4:106/0 lens 488/432 e 0 to 0 dl 1554392596 ref 1 fl Interpret:/2/0 rc 0/0
[29565.783924] LustreError: 184652:0:(ldlm_lib.c:3199:target_bulk_io()) Skipped 41 previous similar messages
[29565.794652] Lustre: 184652:0:(service.c:2114:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (600:191s); client may timeout. req@ffff8841a2c23c50 x1629017475649200/t0(0) o3->9c21af89-2623-9dd8-0966-6a7e83d1ca0c@10.9.105.61@o2ib4:106/0 lens 488/432 e 0 to 0 dl 1554392596 ref 1 fl Complete:/2/ffffffff rc -107/-1
[29565.827647] Lustre: 184652:0:(service.c:2114:ptlrpc_server_handle_request()) Skipped 4 previous similar messages
[29592.627263] Lustre: oak-OST002f: deleting orphan objects from 0x0:23118956 to 0x0:23118977
[29652.157536] LNet: Service thread pid 184907 was inactive for 1200.29s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[29652.176670] Pid: 184907, comm: ll_ost_io01_098 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
[29652.188320] Call Trace:
[29652.191067] [<ffffffffa9996684>] __lock_page+0x74/0x90
[29652.196917] [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
[29652.203246] [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
[29652.209865] [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
[29652.217369] [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
[29652.223997] [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
[29652.231172] [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[29652.238889] [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[29652.247474] [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[29652.254504] [<ffffffffa98bdf21>] kthread+0xd1/0xe0
[29652.259960] [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
[29652.266967] [<ffffffffffffffff>] 0xffffffffffffffff
[29652.272539] LustreError: dumping log to /tmp/lustre-log.1554392874.184907
[29688.912990] Lustre: oak-OST0023: Connection restored to MGC10.0.2.51@o2ib5_1 (at 10.0.2.52@o2ib5)
[29688.922902] Lustre: Skipped 242 previous similar messages
[29688.929136] Lustre: oak-OST0023: deleting orphan objects from 0x0:23044807 to 0x0:23044835
[29690.017000] Lustre: oak-OST001d: deleting orphan objects from 0x0:24237516 to 0x0:24237537
[29690.216982] Lustre: oak-OST000d: deleting orphan objects from 0x0:22136235 to 0x0:22136257
[29692.305008] Lustre: oak-OST0007: deleting orphan objects from 0x0:21489097 to 0x0:21489122
[29692.721065] Lustre: oak-OST0021: deleting orphan objects from 0x0:22812231 to 0x0:22812257
[29754.557446] LNet: Service thread pid 184833 was inactive for 1201.54s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[29754.576566] Pid: 184833, comm: ll_ost_io01_080 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
[29754.588216] Call Trace:
[29754.590953] [<ffffffffa9996684>] __lock_page+0x74/0x90
[29754.596802] [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
[29754.603129] [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
[29754.609750] [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
[29754.617253] [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
[29754.623881] [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
[29754.631039] [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[29754.638753] [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[29754.647335] [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[29754.654363] [<ffffffffa98bdf21>] kthread+0xd1/0xe0
[29754.659821] [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
[29754.666831] [<ffffffffffffffff>] 0xffffffffffffffff
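To get a rough idea of how many I/O service threads are stuck like this, the following small sketch tallies the "Service thread pid ... was inactive" warnings from the console log above (an assumption on my part: it reads dmesg output from stdin and the regex only matches the exact wording shown here):

#!/usr/bin/env python3
# Rough triage of the console log above: list every
# "Service thread pid ... was inactive for ...s" warning so we can see how
# many service threads are hung and for how long.
# Usage: dmesg | python3 hung_threads.py
import re
import sys

PATTERN = re.compile(r'Service thread pid (\d+) was inactive for ([\d.]+)s')

inactive = {}  # pid -> longest reported inactivity, in seconds
for line in sys.stdin:
    m = PATTERN.search(line)
    if m:
        pid, secs = m.group(1), float(m.group(2))
        inactive[pid] = max(secs, inactive.get(pid, 0.0))

print("%d distinct hung service threads" % len(inactive))
for pid, secs in sorted(inactive.items(), key=lambda kv: -kv[1]):
    print("  pid %s inactive for %.0fs" % (pid, secs))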
Users are reporting problems accessing some file data (from 2.12.0 clients).
When checking with htop, the OSS are quiet, definitely too quiet, yet the load keeps increasing, which makes me think there is some kind of thread deadlock on the OSS. I'll attach logs soon.
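For context, the Linux load average counts threads in uninterruptible (D-state) sleep as well as runnable ones, so a quiet-looking node with climbing load usually means threads are wedged the way the ll_ost_io stacks above are. A minimal sketch to count them, assuming it is run directly on an OSS (nothing here is Lustre-specific):

#!/usr/bin/env python3
# Count threads in uninterruptible sleep (D state) by walking /proc; these
# are the tasks that inflate the load average while the CPUs stay idle.
# Assumes it runs directly on the OSS, ideally as root to see all tasks.
import glob

hung = []
for stat_path in glob.glob('/proc/[0-9]*/task/[0-9]*/stat'):
    try:
        with open(stat_path) as f:
            data = f.read()
        # Format: "<tid> (<comm>) <state> ..."; comm may itself contain spaces.
        head, _, tail = data.rpartition(') ')
        comm = head.split('(', 1)[1]
        state = tail.split()[0]
    except (OSError, IndexError):
        continue  # task exited while we were scanning
    if state == 'D':
        hung.append((stat_path.split('/')[4], comm))

print("%d threads in D state" % len(hung))
for tid, comm in sorted(hung, key=lambda t: t[1]):
    print("  %s %s" % (tid, comm))

If the ll_ost_io threads really are stuck in __lock_page, the D-state count reported here should track the load average shown by uptime.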
Attachments
Issue Links
- duplicates LU-12018 deadlock on OSS: quota reintegration vs memory release (Resolved)