Lustre / LU-12162

Major issue on OSS after upgrading Oak to 2.10.7

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Affects Version/s: Lustre 2.10.7
    • Environment: Most clients are 2.12.0 (Sherlock) behind 2.12 routers; a few local clients are 2.10.6
    • Severity: 2

    Description

      We recently upgraded the Oak servers to 2.10.7, but since then the OSS have been showing a lot of these errors:

       

      [29459.670359] Pid: 184632, comm: ll_ost_io00_061 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
      [29459.682008] Call Trace:
      [29459.684747]  [<ffffffffa9996684>] __lock_page+0x74/0x90
      [29459.690595]  [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
      [29459.696924]  [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
      [29459.703545]  [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
      [29459.711052]  [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
      [29459.717672]  [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
      [29459.724829]  [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [29459.732543]  [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [29459.741128]  [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [29459.748166]  [<ffffffffa98bdf21>] kthread+0xd1/0xe0
      [29459.753622]  [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
      [29459.760630]  [<ffffffffffffffff>] 0xffffffffffffffff
      [29459.766199] LustreError: dumping log to /tmp/lustre-log.1554392681.184632
      [29558.439589] Lustre: 184902:0:(service.c:1346:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
        req@ffff883e21427450 x1628649065264016/t0(0) o3->214046b3-82aa-4e1f-6785-53da21760b7f@10.8.7.25@o2ib6:295/0 lens 488/432 e 0 to 0 dl 1554392785 ref 2 fl Interpret:/2/0 rc 0/0
      [29558.471803] Lustre: 184902:0:(service.c:1346:ptlrpc_at_send_early_reply()) Skipped 86 previous similar messages
      [29565.756656] LustreError: 184652:0:(ldlm_lib.c:3199:target_bulk_io()) @@@ bulk READ failed: rc -107  req@ffff8841a2c23c50 x1629017475649200/t0(0) o3->9c21af89-2623-9dd8-0966-6a7e83d1ca0c@10.9.105.61@o2ib4:106/0 lens 488/432 e 0 to 0 dl 1554392596 ref 1 fl Interpret:/2/0 rc 0/0
      [29565.783924] LustreError: 184652:0:(ldlm_lib.c:3199:target_bulk_io()) Skipped 41 previous similar messages
      [29565.794652] Lustre: 184652:0:(service.c:2114:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (600:191s); client may timeout.  req@ffff8841a2c23c50 x1629017475649200/t0(0) o3->9c21af89-2623-9dd8-0966-6a7e83d1ca0c@10.9.105.61@o2ib4:106/0 lens 488/432 e 0 to 0 dl 1554392596 ref 1 fl Complete:/2/ffffffff rc -107/-1
      [29565.827647] Lustre: 184652:0:(service.c:2114:ptlrpc_server_handle_request()) Skipped 4 previous similar messages
      [29592.627263] Lustre: oak-OST002f: deleting orphan objects from 0x0:23118956 to 0x0:23118977
      [29652.157536] LNet: Service thread pid 184907 was inactive for 1200.29s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [29652.176670] Pid: 184907, comm: ll_ost_io01_098 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
      [29652.188320] Call Trace:
      [29652.191067]  [<ffffffffa9996684>] __lock_page+0x74/0x90
      [29652.196917]  [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
      [29652.203246]  [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
      [29652.209865]  [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
      [29652.217369]  [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
      [29652.223997]  [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
      [29652.231172]  [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [29652.238889]  [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [29652.247474]  [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [29652.254504]  [<ffffffffa98bdf21>] kthread+0xd1/0xe0
      [29652.259960]  [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
      [29652.266967]  [<ffffffffffffffff>] 0xffffffffffffffff
      [29652.272539] LustreError: dumping log to /tmp/lustre-log.1554392874.184907
      [29688.912990] Lustre: oak-OST0023: Connection restored to MGC10.0.2.51@o2ib5_1 (at 10.0.2.52@o2ib5)
      [29688.922902] Lustre: Skipped 242 previous similar messages
      [29688.929136] Lustre: oak-OST0023: deleting orphan objects from 0x0:23044807 to 0x0:23044835
      [29690.017000] Lustre: oak-OST001d: deleting orphan objects from 0x0:24237516 to 0x0:24237537
      [29690.216982] Lustre: oak-OST000d: deleting orphan objects from 0x0:22136235 to 0x0:22136257
      [29692.305008] Lustre: oak-OST0007: deleting orphan objects from 0x0:21489097 to 0x0:21489122
      [29692.721065] Lustre: oak-OST0021: deleting orphan objects from 0x0:22812231 to 0x0:22812257
      [29754.557446] LNet: Service thread pid 184833 was inactive for 1201.54s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [29754.576566] Pid: 184833, comm: ll_ost_io01_080 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
      [29754.588216] Call Trace:
      [29754.590953]  [<ffffffffa9996684>] __lock_page+0x74/0x90
      [29754.596802]  [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
      [29754.603129]  [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
      [29754.609750]  [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
      [29754.617253]  [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
      [29754.623881]  [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
      [29754.631039]  [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [29754.638753]  [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [29754.647335]  [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [29754.654363]  [<ffffffffa98bdf21>] kthread+0xd1/0xe0
      [29754.659821]  [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
      [29754.666831]  [<ffffffffffffffff>] 0xffffffffffffffff
      

      Users are reporting problems accessing some file data (from 2.12.0 clients).

      When checking with htop, the OSS are quiet, definitely too quiet, but the load keeps increasing, which makes me think there is some kind of thread deadlock on the OSS. I'll attach logs soon.
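
      For reference, a minimal sketch of the commands typically used to capture the state of hung OST service threads in this situation (file names are just examples, and sysrq must be enabled via kernel.sysrq):

      echo t > /proc/sysrq-trigger                  # dump all task stack traces to the kernel log
      dmesg -T > /tmp/oss-sysrq-t.$(date +%s).log
      lctl dk /tmp/oss-lustre-dk.$(date +%s).log    # dump the Lustre debug buffer referenced by the "dumping log to ..." lines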

Attachments

Issue Links

Activity

            [LU-12162] Major issue on OSS after upgrading Oak to 2.10.7

            Elwell Andrew Elwell added a comment -

            I think we're seeing the same issue at Pawsey following an upgrade of our astrofs servers the other week:

            [root@astrofs-oss3 ~]# rpm -qa | grep lustre | sort
            kernel-3.10.0-862.9.1.el7_lustre.x86_64
            kernel-3.10.0-957.el7_lustre.x86_64
            kmod-lustre-2.10.6-1.el7.x86_64
            kmod-lustre-osd-ldiskfs-2.10.6-1.el7.x86_64
            lustre-2.10.6-1.el7.x86_64
            lustre-osd-ldiskfs-mount-2.10.6-1.el7.x86_64
            lustre-resource-agents-2.10.6-1.el7.x86_64
            [root@astrofs-oss3 ~]# uname -r
            3.10.0-957.el7_lustre.x86_64
            [root@astrofs-oss3 ~]#

             

            and it's giving the same call trace that Stephane originally reported:

            [Fri Apr 12 10:35:31 2019] LustreError: dumping log to /tmp/lustre-log.1555036556.16340
            [Fri Apr 12 10:35:31 2019] Pid: 16367, comm: ll_ost_io02_074 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
            [Fri Apr 12 10:35:31 2019] Call Trace:
            [Fri Apr 12 10:35:31 2019]  [<ffffffffb51b58d4>] __lock_page+0x74/0x90
            [Fri Apr 12 10:35:31 2019]  [<ffffffffb51b65d4>] __find_lock_page+0x54/0x70
            [Fri Apr 12 10:35:31 2019]  [<ffffffffb51b7244>] find_or_create_page+0x34/0xa0
            [Fri Apr 12 10:35:31 2019]  [<ffffffffc1018755>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
            [Fri Apr 12 10:35:31 2019]  [<ffffffffc115d5d2>] ofd_preprw_write.isra.30+0x202/0xda0 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115e592>] ofd_preprw+0x422/0x1170 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5706a>] tgt_brw_write+0xc3a/0x17d0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5342a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cfbe5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cff5a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb50c1c31>] kthread+0xd1/0xe0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb5774c1d>] ret_from_fork_nospec_begin+0x7/0x21
            [Fri Apr 12 10:35:32 2019]  [<ffffffffffffffff>] 0xffffffffffffffff
            [Fri Apr 12 10:35:32 2019] Pid: 16256, comm: ll_ost_io02_020 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
            [Fri Apr 12 10:35:32 2019] Call Trace:
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b58d4>] __lock_page+0x74/0x90
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b65d4>] __find_lock_page+0x54/0x70
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b7244>] find_or_create_page+0x34/0xa0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc1018755>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115d5d2>] ofd_preprw_write.isra.30+0x202/0xda0 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115e592>] ofd_preprw+0x422/0x1170 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5706a>] tgt_brw_write+0xc3a/0x17d0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5342a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cfbe5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cff5a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb50c1c31>] kthread+0xd1/0xe0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb5774c1d>] ret_from_fork_nospec_begin+0x7/0x21
            [Fri Apr 12 10:35:32 2019]  [<ffffffffffffffff>] 0xffffffffffffffff
            [Fri Apr 12 10:35:32 2019] Pid: 15878, comm: ll_ost_io02_002 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
            [Fri Apr 12 10:35:32 2019] Call Trace:
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b58d4>] __lock_page+0x74/0x90
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b65d4>] __find_lock_page+0x54/0x70
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b7244>] find_or_create_page+0x34/0xa0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc1018755>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115d5d2>] ofd_preprw_write.isra.30+0x202/0xda0 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115e592>] ofd_preprw+0x422/0x1170 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5706a>] tgt_brw_write+0xc3a/0x17d0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5342a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cfbe5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cff5a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb50c1c31>] kthread+0xd1/0xe0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb5774c1d>] ret_from_fork_nospec_begin+0x7/0x21
            [Fri Apr 12 10:35:32 2019]  [<ffffffffffffffff>] 0xffffffffffffffff
            [Fri Apr 12 10:35:32 2019] Pid: 16302, comm: ll_ost_io02_037 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
            [Fri Apr 12 10:35:32 2019] Call Trace:
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b58d4>] __lock_page+0x74/0x90
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b65d4>] __find_lock_page+0x54/0x70
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b7244>] find_or_create_page+0x34/0xa0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc1018755>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115d5d2>] ofd_preprw_write.isra.30+0x202/0xda0 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115e592>] ofd_preprw+0x422/0x1170 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5706a>] tgt_brw_write+0xc3a/0x17d0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5342a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cfbe5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cff5a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb50c1c31>] kthread+0xd1/0xe0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb5774c1d>] ret_from_fork_nospec_begin+0x7/0x21
            [Fri Apr 12 10:35:32 2019]  [<ffffffffffffffff>] 0xffffffffffffffff
            [Fri Apr 12 10:35:32 2019] LNet: Service thread pid 16285 was inactive for 1201.11s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
            [Fri Apr 12 10:35:32 2019] LNet: Skipped 34 previous similar messages 

            as well as a bunch of other Bulk IO read and write errors.

            For info, the storage for this is via ib_srp to a DDN array rather than the SAS arrays we're using on our other filesystems - so I wonder if it is timing related?
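
            A couple of hedged checks for ruling the SRP path in or out:

            [root@astrofs-oss3 ~]# lsscsi -t | grep srp      # confirm which LUNs sit on the SRP transport
            [root@astrofs-oss3 ~]# multipath -ll             # look for failed/faulty paths to the array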

            and we're also seeing IO blocked

            [root@astrofs-oss3 ~]# vmstat 1
            procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
             r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
             1 193      0 734084 3879288 120170720    0    0  4520  8360    2    1  0  1 67 32  0
             1 193      0 733936 3879288 120170720    0    0     0    16 3474 16751  0  0  0 100  0
             0 193      0 733944 3879288 120170720    0    0     0    20 3141 15211  0  0  0 100  0
             0 193      0 733944 3879288 120170720    0    0     0     8 2111 10145  0  0  0 100  0
             0 193      0 733944 3879288 120170720    0    0     0     8 2723 13012  0  0  0 100  0
             0 193      0 733944 3879288 120170720    0    0     0    12 2938 14461  0  0  0 100  0
             0 193      0 733944 3879288 120170720    0    0     0    16 4151 20537  0  0  0 100  0
             0 193      0 734156 3879288 120171072    0    0     0    20 4384 21279  0  0  0 100  0
            ^C
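
            For reference, a hedged way to correlate the vmstat "b" column with the hung Lustre threads is to list the tasks in uninterruptible sleep (D state):

            [root@astrofs-oss3 ~]# ps -eo pid,stat,wchan:30,comm | awk '$2 ~ /^D/'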
            sthiell Stephane Thiell added a comment - - edited

            pfarrell I confirm the patch does successfully apply on top of 2.10.7 and I was able to build new server rpms for Oak. Thanks!

            pjones I think it's important to clarify what exactly happened, and the title of this issue should probably be changed.

            The deadlock happened repeatedly on two servers serving a set of OSTs in IO Cell #2. But it's critical to highlight two things:

            • Every time the OST recovery finished on those servers (always successfully, against about 1400 clients that stayed up the whole time), we hit this deadlock shortly afterwards. I definitely think we hit a recovery scenario that triggered this same issue each time (and that's exactly what recovery is for... replaying transactions) - aborting the recovery on those OSTs (see the sketch after this list) just broke the cycle and we were finally able to resume operation.
            • Last night I tried to downgrade to Lustre 2.10.6, 2.10.5 and even 2.10.3+patch! (an old version that we have run for quite some time), and EVERY TIME the same issue happened too. This is definitely not a 2.10.7-specific issue. Please feel free to change the title accordingly.
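
            For reference, a hedged sketch of the two usual ways to abort recovery on an OST (mentioned in the first bullet above; the device path and target name are placeholders, not Oak's actual ones):

            # at mount time
            mount -t lustre -o abort_recov /dev/mapper/oak-ost0030 /mnt/oak-ost0030
            # or on an already-mounted target
            lctl --device oak-OST0030 abort_recovery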

            As for what really triggered the issue, I was thinking it is perhaps due to the recent upgrade of Sherlock (our Lustre clients) to 2.12; because we hadn't restarted Oak since then, we only hit this now while upgrading Oak... Do you know if LU-12018 also occurred with 2.12 clients? That would be interesting to check.

            I would also like to mention that when this issue occurred it was extremely difficult to diagnose at first. Oak was down for about 18 hours and we spent most of that time chasing the wrong causes, like network or backend issues, because Lustre was reporting these timeout errors and ldiskfs watchdogs as a consequence of the deadlock. But I'm glad this issue was already found, and special thanks to Alex for pointing us to the right patch (at least we hope so, but it very much looks like it).

            Finally, to answer your question, I have the updated rpms ready to be deployed on the Oak servers, but the plan is not to touch anything right now as everything is working again as expected (both 2.10 on Oak and 2.12 on Fir). We'll patch only if an issue occurs, or at the next maintenance (TBD).

            pjones Peter Jones added a comment -

            sthiell

            How soon do you expect to know whether this patch has helped? What puzzles me is that this issue has been seen (albeit quite rarely) as far back as the 2.7.x releases, so why would it suddenly start happening frequently after upgrading to 2.10.7...

            Peter

            sthiell Stephane Thiell added a comment - - edited

            Ok thanks, that's great. It is very short indeed!
            I'm going to prepare a patched 2.10.7 for that then.


            pfarrell Patrick Farrell (Inactive) added a comment -

            Stephane,

            The patch shouldn't require porting.  You can just apply it to the 2.10.7 source without issue.
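
            In case it's useful to anyone following along, a hedged sketch of that workflow against the lustre-release tree (the tag name and the Gerrit change ref are placeholders; substitute the actual 2.10.7 tag and the LU-12018 change):

            git clone git://git.whamcloud.com/fs/lustre-release.git && cd lustre-release
            git checkout v2_10_7                      # 2.10.7 release tag (exact name may differ)
            # fetch and apply the LU-12018 fix; NN/NNNNN/N is a placeholder change ref
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/NN/NNNNN/N
            git cherry-pick FETCH_HEAD
            sh autogen.sh && ./configure              # may need --with-linux=<kernel source> for the server kernel
            make rpms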


            sthiell Stephane Thiell added a comment -

            bzzz Thanks!!! That must be it! I noticed the same symptoms as described in LU-12018.

            Would you be able to port the patch to b2_10?

            bzzz Alex Zhuravlev added a comment - - edited
            Call Trace:
             [<ffffffff816a94e9>] schedule+0x29/0x70
             [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
             [<ffffffff810c9ef9>] ? select_task_rq_fair+0x549/0x700
             [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
             [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
             [<ffffffff810b07ea>] kthread_create_on_node+0xaa/0x140
             [<ffffffffc0f92110>] ? qsd_reconciliation+0xa90/0xa90 [lquota]
             [<ffffffffc09f4549>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
             [<ffffffffc0f938f8>] qsd_start_reint_thread+0x858/0xc10 [lquota]
             [<ffffffff81267292>] ? dqput+0x162/0x1e0
             [<ffffffffc0f97781>] qsd_ready+0x231/0x3c0 [lquota]
             [<ffffffffc0f9a6d2>] qsd_adjust+0xa2/0x890 [lquota]
             [<ffffffffc0f8cbfa>] ? qsd_refresh_usage+0x6a/0x2b0 [lquota]
             [<ffffffffc0f9ba80>] qsd_op_adjust+0x4e0/0x730 [lquota]
             [<ffffffffc0ff5d80>] osd_object_delete+0x230/0x330 [osd_ldiskfs]
             [<ffffffffc0a106cd>] lu_object_free.isra.31+0x9d/0x1a0 [obdclass]
             [<ffffffffc0a1132e>] lu_site_purge_objects+0x2fe/0x520 [obdclass]
             [<ffffffffc0a12179>] lu_cache_shrink+0x259/0x2d0 [obdclass]
             [<ffffffff81195413>] shrink_slab+0x163/0x330
             [<ffffffff811f7521>] ? vmpressure+0x61/0x90
             [<ffffffff81198091>] zone_reclaim+0x1d1/0x2f0
             [<ffffffff8118c264>] get_page_from_freelist+0x2c4/0x9e0
             [<ffffffff81029557>] ? __switch_to+0xd7/0x510
             [<ffffffff8118caf6>] __alloc_pages_nodemask+0x176/0x420
             [<ffffffff811d1108>] alloc_pages_current+0x98/0x110
             [<ffffffff81182917>] __page_cache_alloc+0x97/0xb0
             [<ffffffff81183625>] find_or_create_page+0x45/0xa0
             [<ffffffffc1018885>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
             [<ffffffffc1192728>] ofd_preprw+0x6a8/0x1150 [ofd]
            
            Call Trace:
             [<ffffffff816a94e9>] schedule+0x29/0x70
             [<ffffffff816aab1d>] rwsem_down_read_failed+0x10d/0x1a0
             [<ffffffff8121a554>] ? shrink_dentry_list+0x274/0x490
             [<ffffffff81332038>] call_rwsem_down_read_failed+0x18/0x30
             [<ffffffff816a8780>] down_read+0x20/0x40
             [<ffffffffc0a11f8a>] lu_cache_shrink+0x6a/0x2d0 [obdclass]
             [<ffffffff81195359>] shrink_slab+0xa9/0x330
             [<ffffffff8118861f>] ? zone_watermark_ok+0x1f/0x30
             [<ffffffff811a8853>] ? compaction_suitable+0xa3/0xb0
             [<ffffffff81198091>] zone_reclaim+0x1d1/0x2f0
             [<ffffffff8118c264>] get_page_from_freelist+0x2c4/0x9e0
             [<ffffffff81085e1b>] ? copy_process+0xeeb/0x1970
             [<ffffffff8118caf6>] __alloc_pages_nodemask+0x176/0x420
             [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
             [<ffffffff8108510d>] copy_process+0x1dd/0x1970
             [<ffffffff81029557>] ? __switch_to+0xd7/0x510
             [<ffffffff81086a51>] do_fork+0x91/0x320
             [<ffffffff816a8fad>] ? __schedule+0x39d/0x8b0
             [<ffffffff81086d06>] kernel_thread+0x26/0x30
             [<ffffffff810b1341>] kthreadd+0x2c1/0x300
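
            Reading the two traces together: the first is an OST IO thread that fell into the lu_cache_shrink shrinker from zone_reclaim() while allocating pages in osd_bufs_get(), and is now waiting in kthread_create_on_node() for a quota reintegration thread to be spawned; the second is kthreadd itself, blocked in down_read() inside lu_cache_shrink(), also via zone_reclaim(), so the thread creation the first one is waiting for can never complete. Since both paths enter through zone_reclaim(), a couple of hedged checks on the affected OSS (an observation only, not a fix for the underlying LU-12018 deadlock):

            ps -o pid,stat,wchan:32,comm -p 2     # kthreadd is PID 2; if it is stuck in D state, no new kernel threads can be created
            cat /proc/sys/vm/zone_reclaim_mode    # 0 = NUMA zone reclaim disabled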
            

            bzzz Alex Zhuravlev added a comment -

            sthiell this looks like LU-12018


            sthiell Stephane Thiell added a comment -

            I think we found the issue. There is some kind of incompatibility in the 2.10 recovery for 2.12 clients. It leads to a deadlock as shown in my logs (I also attached a second sysrq-t dump last night). After remounting all the 2.10.7 OSTs that were hitting Bulk IO errors with -o abort_recov (GRR, I really don't like doing that, it killed a lot of jobs), things are back to normal!! I think! Logs are all super clean. And we are on 2.10.7 now, at least!!


            sthiell Stephane Thiell added a comment -

            Hi - we have made some progress but not resolved the issue yet.

            Oak has 3 IO cells, each with 2 OSS, and only a single IO cell is impacted by this problem: IO cell #2. They all run 2.10.7 now with our well-known kernel, yet only IO cell #2 exhibits the issues, on both of its redundant servers oak-io2-s1 and oak-io2-s2. All of Oak's OSS use the same LNet routers. So if the problem is localized to one IO cell, it could be the backend, but there is no indication of an issue there, and today we even power-cycled the full IO cell (rack). Still, what makes me think the problem is in the backend rather than the network is backtraces like this one:

            [ 7874.415621] Lustre: 272943:0:(service.c:1346:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
              req@ffff883ad89b0450 x1629296681700080/t0(0) o3->f060da2b-47c0-9c4a-3590-7c18d2889f30@10.9.113.6@o2ib4:358/0 lens 488/432 e 0 to 0 dl 1554441168 ref 2 fl Interpret:/2/0 rc 0/0
            [ 7874.447638] Lustre: 272943:0:(service.c:1346:ptlrpc_at_send_early_reply()) Skipped 73 previous similar messages
            [ 7879.489456] LustreError: 272844:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 89+0s  req@ffff883ad89b2c50 x1628590488062352/t0(0) o3->0ef862b6-a2d6-21a8-44a8-9efaf84bbf69@10.9.103.11@o2ib4:358/0 lens 488/432 e 0 to 0 dl 1554441168 ref 1 fl Interpret:/2/0 rc 0/0
            [ 7879.517403] LustreError: 272844:0:(ldlm_lib.c:3239:target_bulk_io()) Skipped 242 previous similar messages
            [ 7879.528213] Lustre: oak-OST0048: Bulk IO read error with 0ef862b6-a2d6-21a8-44a8-9efaf84bbf69 (at 10.9.103.11@o2ib4), client will retry: rc -110
            [ 7879.542676] Lustre: Skipped 141 previous similar messages
            [ 7885.530499] LNet: Service thread pid 274929 completed after 1762.63s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [ 7885.548754] LNet: Skipped 29 previous similar messages
            [ 7973.018750] Lustre: oak-OST0036: Bulk IO write error with f1e4ff8b-e962-43e4-68cf-4383c7c66071 (at 10.9.104.25@o2ib4), client will retry: rc = -110
            [ 7973.033510] Lustre: Skipped 135 previous similar messages
            [ 7992.876118] Pid: 272883, comm: ll_ost_io00_013 3.10.0-693.2.2.el7_lustre.pl3.x86_64 #1 SMP Thu Mar 15 13:06:45 PDT 2018
            [ 7992.888157] Call Trace:
            [ 7992.890898]  [<ffffffff81182094>] __lock_page+0x74/0x90
            [ 7992.896748]  [<ffffffff81182a34>] __find_lock_page+0x54/0x70
            [ 7992.903076]  [<ffffffff81183614>] find_or_create_page+0x34/0xa0
            [ 7992.909695]  [<ffffffffc1018885>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
            [ 7992.917201]  [<ffffffffc1192728>] ofd_preprw+0x6a8/0x1150 [ofd]
            [ 7992.923832]  [<ffffffffc0c5322f>] tgt_brw_read+0x96f/0x1850 [ptlrpc]
            [ 7992.931008]  [<ffffffffc0c51115>] tgt_request_handle+0x925/0x1370 [ptlrpc]
            [ 7992.938727]  [<ffffffffc0bf9dd6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
            [ 7992.947312]  [<ffffffffc0bfd512>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            [ 7992.954345]  [<ffffffff810b098f>] kthread+0xcf/0xe0
            [ 7992.959795]  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
            [ 7992.965835]  [<ffffffffffffffff>] 0xffffffffffffffff
            

            as it is in osd_ldiskfs. A slow backend leading to slow responses leading to IO timeouts - am I right to think this?
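
            If it is the backend, a couple of hedged checks that usually show a slow device quickly (the OST name pattern is just an example):

            iostat -xm 5                                     # per-device await/util on the OSS
            lctl get_param osd-ldiskfs.oak-OST*.brw_stats    # per-OST I/O time histograms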

            It would be easy to explain if the errors came from a single OST, as in that case it could be a slow drive, a slow volume or something like that, but no, they don't come from just a single OST, as shown by lctl dk:

            [root@oak-io2-s2 ~]# lctl dk | grep Bulk
            00000020:02000400:21.0:1554442186.572762:0:274935:0:(tgt_handler.c:2046:tgt_brw_read()) oak-OST003f: Bulk IO read error with bf2139cf-7e53-4025-4e6e-f6f1d7d97a02 (at 10.9.101.50@o2ib4), client will retry: rc -110
            00000020:02000400:8.0:1554442187.507754:0:274982:0:(tgt_handler.c:2046:tgt_brw_read()) oak-OST0048: Bulk IO read error with 0ef862b6-a2d6-21a8-44a8-9efaf84bbf69 (at 10.9.103.11@o2ib4), client will retry: rc -110
            00000020:02000400:3.0:1554442187.551764:0:273000:0:(tgt_handler.c:2046:tgt_brw_read()) oak-OST0048: Bulk IO read error with 99d2fc1e-a85b-fa24-ca8b-03d73ee86475 (at 10.9.112.8@o2ib4), client will retry: rc -110
            00000020:02000400:31.0:1554442187.703820:0:274997:0:(tgt_handler.c:2347:tgt_brw_write()) oak-OST003e: Bulk IO write error with 5588751e-34a5-8e15-f371-63b5b81517bb (at 10.9.101.20@o2ib4), client will retry: rc = -110
            00000020:02000400:15.0:1554442187.709735:0:275042:0:(tgt_handler.c:2347:tgt_brw_write()) oak-OST0042: Bulk IO write error with 51bb6bf0-2ddf-6781-f22c-360735045eba (at 10.9.107.21@o2ib4), client will retry: rc = -110
            00000020:02000400:11.0:1554442187.709742:0:273003:0:(tgt_handler.c:2347:tgt_brw_write()) oak-OST0036: Bulk IO write error with f1e4ff8b-e962-43e4-68cf-4383c7c66071 (at 10.9.104.25@o2ib4), client will retry: rc = -110
            00000020:02000400:21.0:1554442187.713740:0:275045:0:(tgt_handler.c:2347:tgt_brw_write()) oak-OST0052: Bulk IO write error with 526957d3-630b-5bfe-a257-c024f120aeda (at 10.9.112.3@o2ib4), client will retry: rc = -110
            00000020:02000400:11.0:1554442187.728733:0:272998:0:(tgt_handler.c:2347:tgt_brw_write()) oak-OST0030: Bulk IO write error with 526957d3-630b-5bfe-a257-c024f120aeda (at 10.9.112.3@o2ib4), client will retry: rc = -110
            00000020:02000400:15.0:1554442187.816743:0:272950:0:(tgt_handler.c:2347:tgt_brw_write()) oak-OST005c: Bulk IO write error with ed020952-e5b6-7729-7a78-b61d9e90b696 (at 10.9.105.32@o2ib4), client will retry: rc = -110
            00000020:02000400:9.0:1554442188.093753:0:272924:0:(tgt_handler.c:2046:tgt_brw_read()) oak-OST0033: Bulk IO read error with ace149ff-a4e3-8eeb-f9dc-177b4b8f7f6d (at 10.9.106.25@o2ib4), client will retry: rc -110
            00000020:02000400:23.0:1554442188.207750:0:274976:0:(tgt_handler.c:2046:tgt_brw_read()) oak-OST005b: Bulk IO read error with 8bab6b54-7a61-a87c-68e6-da0d28bb25d6 (at 10.9.106.1@o2ib4), client will retry: rc -110
            00000020:02000400:17.0:1554442188.230752:0:273030:0:(tgt_handler.c:2347:tgt_brw_write()) oak-OST0048: Bulk IO write error with b312f397-6359-6d9f-b796-be326bf43e45 (at 10.9.101.29@o2ib4), client will retry: rc = -110
            00000020:02000400:5.0:1554442188.259746:0:275020:0:(tgt_handler.c:2347:tgt_brw_write()) oak-OST0054: Bulk IO write error with d8bbb1fe-48fb-3725-9c52-55f23effddce (at 10.9.108.24@o2ib4), client will retry: rc = -110
            00000020:02000400:19.0:1554442188.260735:0:275002:0:(tgt_handler.c:2347:tgt_brw_write()) oak-OST0054: Bulk IO write error with d8bbb1fe-48fb-3725-9c52-55f23effddce (at 10.9.108.24@o2ib4), client will retry: rc = -110
            00000020:02000400:24.0:1554442206.739748:0:272887:0:(tgt_handler.c:2347:tgt_brw_write()) oak-OST0055: Bulk IO write error with d8bbb1fe-48fb-3725-9c52-55f23effddce (at 10.9.108.24@o2ib4), client will retry: rc = -110
            00000020:02000400:1.0:1554442206.739750:0:275047:0:(tgt_handler.c:2347:tgt_brw_write()) oak-OST0055: Bulk IO write error with d8bbb1fe-48fb-3725-9c52-55f23effddce (at 10.9.108.24@o2ib4), client will retry: rc = -110
            

            For the night, we have set max_create_count=0 to prevent new files from being allocated on these OSTs (the 48 of the cell). Any ideas would be greatly appreciated.
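
            For completeness, max_create_count is set per OST on the MDS; a hedged sketch with a single OST as an example:

            # on the MDS: stop allocating new objects (files) on oak-OST0030
            lctl set_param osp.oak-OST0030-osc-MDT0000.max_create_count=0
            # later, to re-enable (20000 is the usual default)
            lctl set_param osp.oak-OST0030-osc-MDT0000.max_create_count=20000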


            sthiell Stephane Thiell added a comment -

            A note about LU-12065 "lnd: increase CQ entries":

            10.9.112.3@o2ib4 in the logs from my previous comment is one of the clients that has 2.12.0 + the patch for LU-12065 (lnd: increase CQ entries).

            All Oak servers are now back to 2.10.7, so they have it too.

            The LNet routers between Oak (o2ib5) and Sherlock (o2ib4/6) also have the patch on top of 2.12.0.

             

            So at least we know that this specific patch doesn't fix this issue.


            People

              Assignee: pjones Peter Jones
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 10
