Details
- Type: Bug
- Resolution: Unresolved
- Priority: Blocker
- Affects Version/s: Lustre 2.10.7
- Environment: Most clients are 2.12.0 (Sherlock) behind 2.12 routers, a few local clients are 2.10.6
- Severity: 2
Description
We recently upgraded the Oak servers to 2.10.7, but since then the OSS have been showing a lot of these errors:
[29459.670359] Pid: 184632, comm: ll_ost_io00_061 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
[29459.682008] Call Trace:
[29459.684747] [<ffffffffa9996684>] __lock_page+0x74/0x90
[29459.690595] [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
[29459.696924] [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
[29459.703545] [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
[29459.711052] [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
[29459.717672] [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
[29459.724829] [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[29459.732543] [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[29459.741128] [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[29459.748166] [<ffffffffa98bdf21>] kthread+0xd1/0xe0
[29459.753622] [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
[29459.760630] [<ffffffffffffffff>] 0xffffffffffffffff
[29459.766199] LustreError: dumping log to /tmp/lustre-log.1554392681.184632
[29558.439589] Lustre: 184902:0:(service.c:1346:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply req@ffff883e21427450 x1628649065264016/t0(0) o3->214046b3-82aa-4e1f-6785-53da21760b7f@10.8.7.25@o2ib6:295/0 lens 488/432 e 0 to 0 dl 1554392785 ref 2 fl Interpret:/2/0 rc 0/0
[29558.471803] Lustre: 184902:0:(service.c:1346:ptlrpc_at_send_early_reply()) Skipped 86 previous similar messages
[29565.756656] LustreError: 184652:0:(ldlm_lib.c:3199:target_bulk_io()) @@@ bulk READ failed: rc -107 req@ffff8841a2c23c50 x1629017475649200/t0(0) o3->9c21af89-2623-9dd8-0966-6a7e83d1ca0c@10.9.105.61@o2ib4:106/0 lens 488/432 e 0 to 0 dl 1554392596 ref 1 fl Interpret:/2/0 rc 0/0
[29565.783924] LustreError: 184652:0:(ldlm_lib.c:3199:target_bulk_io()) Skipped 41 previous similar messages
[29565.794652] Lustre: 184652:0:(service.c:2114:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (600:191s); client may timeout. req@ffff8841a2c23c50 x1629017475649200/t0(0) o3->9c21af89-2623-9dd8-0966-6a7e83d1ca0c@10.9.105.61@o2ib4:106/0 lens 488/432 e 0 to 0 dl 1554392596 ref 1 fl Complete:/2/ffffffff rc -107/-1
[29565.827647] Lustre: 184652:0:(service.c:2114:ptlrpc_server_handle_request()) Skipped 4 previous similar messages
[29592.627263] Lustre: oak-OST002f: deleting orphan objects from 0x0:23118956 to 0x0:23118977
[29652.157536] LNet: Service thread pid 184907 was inactive for 1200.29s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[29652.176670] Pid: 184907, comm: ll_ost_io01_098 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
[29652.188320] Call Trace:
[29652.191067] [<ffffffffa9996684>] __lock_page+0x74/0x90
[29652.196917] [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
[29652.203246] [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
[29652.209865] [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
[29652.217369] [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
[29652.223997] [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
[29652.231172] [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[29652.238889] [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[29652.247474] [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[29652.254504] [<ffffffffa98bdf21>] kthread+0xd1/0xe0
[29652.259960] [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
[29652.266967] [<ffffffffffffffff>] 0xffffffffffffffff
[29652.272539] LustreError: dumping log to /tmp/lustre-log.1554392874.184907
[29688.912990] Lustre: oak-OST0023: Connection restored to MGC10.0.2.51@o2ib5_1 (at 10.0.2.52@o2ib5)
[29688.922902] Lustre: Skipped 242 previous similar messages
[29688.929136] Lustre: oak-OST0023: deleting orphan objects from 0x0:23044807 to 0x0:23044835
[29690.017000] Lustre: oak-OST001d: deleting orphan objects from 0x0:24237516 to 0x0:24237537
[29690.216982] Lustre: oak-OST000d: deleting orphan objects from 0x0:22136235 to 0x0:22136257
[29692.305008] Lustre: oak-OST0007: deleting orphan objects from 0x0:21489097 to 0x0:21489122
[29692.721065] Lustre: oak-OST0021: deleting orphan objects from 0x0:22812231 to 0x0:22812257
[29754.557446] LNet: Service thread pid 184833 was inactive for 1201.54s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[29754.576566] Pid: 184833, comm: ll_ost_io01_080 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
[29754.588216] Call Trace:
[29754.590953] [<ffffffffa9996684>] __lock_page+0x74/0x90
[29754.596802] [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
[29754.603129] [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
[29754.609750] [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
[29754.617253] [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
[29754.623881] [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
[29754.631039] [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[29754.638753] [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[29754.647335] [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[29754.654363] [<ffffffffa98bdf21>] kthread+0xd1/0xe0
[29754.659821] [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
[29754.666831] [<ffffffffffffffff>] 0xffffffffffffffff
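(For reference, the "dumping log to /tmp/lustre-log.*" lines above refer to binary Lustre debug dumps; a minimal sketch for decoding one into readable text, assuming lctl is available on the OSS and the dump file is still in place:)
# decode the binary debug dump referenced in the console log into ASCII text
lctl debug_file /tmp/lustre-log.1554392681.184632 /tmp/lustre-log.1554392681.184632.txt
less /tmp/lustre-log.1554392681.184632.txt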
Users are reporting problems accessing some file data (from 2.12.0 clients).
When checking with htop, the OSS look quiet, definitely too quiet, yet the load keeps increasing, which makes me think there is some kind of thread deadlock on the OSS. I'll attach logs soon.
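(For anyone wanting to run the same check, a minimal sketch, assuming a stock CentOS 7 OSS where /proc/<pid>/stack is readable: it lists the ll_ost* service threads stuck in uninterruptible sleep and dumps their kernel stacks, which should match the traces above if the threads are really deadlocked.)
# list ll_ost* service threads in D state and dump their kernel stacks
for pid in $(ps -eo pid,stat,comm | awk '$2 ~ /^D/ && $3 ~ /^ll_ost/ {print $1}'); do
    echo "=== $(cat /proc/$pid/comm) (pid $pid) ==="
    cat /proc/$pid/stack
done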
Attachments
Issue Links
- duplicates LU-12018, "deadlock on OSS: quota reintegration vs memory release" (Resolved)
I think we're seeing the same issue at Pawsey following an upgrade of our astrofs servers the other week:
[root@astrofs-oss3 ~]# rpm -qa | grep lustre | sort
kernel-3.10.0-862.9.1.el7_lustre.x86_64
kernel-3.10.0-957.el7_lustre.x86_64
kmod-lustre-2.10.6-1.el7.x86_64
kmod-lustre-osd-ldiskfs-2.10.6-1.el7.x86_64
lustre-2.10.6-1.el7.x86_64
lustre-osd-ldiskfs-mount-2.10.6-1.el7.x86_64
lustre-resource-agents-2.10.6-1.el7.x86_64
[root@astrofs-oss3 ~]# uname -r
3.10.0-957.el7_lustre.x86_64
[root@astrofs-oss3 ~]#
and it's giving the same call trace that Stephane originally reported, as well as a bunch of other bulk IO read and write errors.
For info, the storage for this is via ib_srp to a DDN array rather than the SAS arrays we're using on our other filesystems, so I wonder if it is timing-related?
We're also seeing blocked IO.
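(If it is timing-related, one sketch that might help narrow it down, assuming ldiskfs OSTs where the obdfilter brw_stats are exposed, is comparing backend IO completion times on the SRP/DDN-backed OSTs against the SAS-backed filesystems, e.g.:)
# per-OST histograms of bulk IO sizes and disk IO completion times
lctl get_param obdfilter.*.brw_stats
# quick check for blocked (D state) processes while IO is stalled
vmstat 5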