
Major issue on OSS after upgrading Oak to 2.10.7

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.7
    • Labels: None
    • Environment: Most clients are 2.12.0 (Sherlock) behind 2.12 routers, a few local clients are 2.10.6
    • Severity: 2

    Description

      We recently upgraded the Oak servers to 2.10.7, but since then the OSS have been showing a lot of these errors:

       

      [29459.670359] Pid: 184632, comm: ll_ost_io00_061 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
      [29459.682008] Call Trace:
      [29459.684747]  [<ffffffffa9996684>] __lock_page+0x74/0x90
      [29459.690595]  [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
      [29459.696924]  [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
      [29459.703545]  [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
      [29459.711052]  [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
      [29459.717672]  [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
      [29459.724829]  [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [29459.732543]  [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [29459.741128]  [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [29459.748166]  [<ffffffffa98bdf21>] kthread+0xd1/0xe0
      [29459.753622]  [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
      [29459.760630]  [<ffffffffffffffff>] 0xffffffffffffffff
      [29459.766199] LustreError: dumping log to /tmp/lustre-log.1554392681.184632
      [29558.439589] Lustre: 184902:0:(service.c:1346:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
        req@ffff883e21427450 x1628649065264016/t0(0) o3->214046b3-82aa-4e1f-6785-53da21760b7f@10.8.7.25@o2ib6:295/0 lens 488/432 e 0 to 0 dl 1554392785 ref 2 fl Interpret:/2/0 rc 0/0
      [29558.471803] Lustre: 184902:0:(service.c:1346:ptlrpc_at_send_early_reply()) Skipped 86 previous similar messages
      [29565.756656] LustreError: 184652:0:(ldlm_lib.c:3199:target_bulk_io()) @@@ bulk READ failed: rc -107  req@ffff8841a2c23c50 x1629017475649200/t0(0) o3->9c21af89-2623-9dd8-0966-6a7e83d1ca0c@10.9.105.61@o2ib4:106/0 lens 488/432 e 0 to 0 dl 1554392596 ref 1 fl Interpret:/2/0 rc 0/0
      [29565.783924] LustreError: 184652:0:(ldlm_lib.c:3199:target_bulk_io()) Skipped 41 previous similar messages
      [29565.794652] Lustre: 184652:0:(service.c:2114:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (600:191s); client may timeout.  req@ffff8841a2c23c50 x1629017475649200/t0(0) o3->9c21af89-2623-9dd8-0966-6a7e83d1ca0c@10.9.105.61@o2ib4:106/0 lens 488/432 e 0 to 0 dl 1554392596 ref 1 fl Complete:/2/ffffffff rc -107/-1
      [29565.827647] Lustre: 184652:0:(service.c:2114:ptlrpc_server_handle_request()) Skipped 4 previous similar messages
      [29592.627263] Lustre: oak-OST002f: deleting orphan objects from 0x0:23118956 to 0x0:23118977
      [29652.157536] LNet: Service thread pid 184907 was inactive for 1200.29s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [29652.176670] Pid: 184907, comm: ll_ost_io01_098 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
      [29652.188320] Call Trace:
      [29652.191067]  [<ffffffffa9996684>] __lock_page+0x74/0x90
      [29652.196917]  [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
      [29652.203246]  [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
      [29652.209865]  [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
      [29652.217369]  [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
      [29652.223997]  [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
      [29652.231172]  [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [29652.238889]  [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [29652.247474]  [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [29652.254504]  [<ffffffffa98bdf21>] kthread+0xd1/0xe0
      [29652.259960]  [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
      [29652.266967]  [<ffffffffffffffff>] 0xffffffffffffffff
      [29652.272539] LustreError: dumping log to /tmp/lustre-log.1554392874.184907
      [29688.912990] Lustre: oak-OST0023: Connection restored to MGC10.0.2.51@o2ib5_1 (at 10.0.2.52@o2ib5)
      [29688.922902] Lustre: Skipped 242 previous similar messages
      [29688.929136] Lustre: oak-OST0023: deleting orphan objects from 0x0:23044807 to 0x0:23044835
      [29690.017000] Lustre: oak-OST001d: deleting orphan objects from 0x0:24237516 to 0x0:24237537
      [29690.216982] Lustre: oak-OST000d: deleting orphan objects from 0x0:22136235 to 0x0:22136257
      [29692.305008] Lustre: oak-OST0007: deleting orphan objects from 0x0:21489097 to 0x0:21489122
      [29692.721065] Lustre: oak-OST0021: deleting orphan objects from 0x0:22812231 to 0x0:22812257
      [29754.557446] LNet: Service thread pid 184833 was inactive for 1201.54s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [29754.576566] Pid: 184833, comm: ll_ost_io01_080 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
      [29754.588216] Call Trace:
      [29754.590953]  [<ffffffffa9996684>] __lock_page+0x74/0x90
      [29754.596802]  [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
      [29754.603129]  [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
      [29754.609750]  [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
      [29754.617253]  [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
      [29754.623881]  [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
      [29754.631039]  [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [29754.638753]  [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [29754.647335]  [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [29754.654363]  [<ffffffffa98bdf21>] kthread+0xd1/0xe0
      [29754.659821]  [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
      [29754.666831]  [<ffffffffffffffff>] 0xffffffffffffffff
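      As a side note (standard lctl usage, not part of the original report), the "dumping log to /tmp/lustre-log.*" lines above refer to binary Lustre debug dumps, which can be converted to text for inspection:

      # Convert the binary debug dump referenced in the log above into readable text
      lctl debug_file /tmp/lustre-log.1554392681.184632 /tmp/lustre-log.1554392681.184632.txt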
      

      Users are reporting problems accessing some file data (from 2.12.0 clients).

      When checking with htop, the OSS are quiet, definitely too quiet, but the load keeps increasing, which makes me think there is some kind of thread deadlock on the OSS. I'll attach logs soon.
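      As a rough illustration of that suspicion (generic commands, not from the original report): a load average that climbs while the CPUs stay idle usually means threads stuck in uninterruptible sleep (D state), and their kernel stacks can be inspected directly:

      # List threads in uninterruptible sleep; hung ll_ost_io threads show up here
      ps -eLo pid,stat,comm | awk '$2 ~ /^D/'
      # Dump the kernel stack of one suspect thread (PID taken from the trace above)
      cat /proc/184632/stack
      # If sysrq is enabled, dump all task stacks to dmesg (the "sysrq t" mentioned later)
      echo t > /proc/sysrq-trigger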

      Attachments

        Issue Links

          Activity

            [LU-12162] Major issue on OSS after upgrading Oak to 2.10.7

            sthiell Stephane Thiell added a comment -

            Hi Andrew,
            Thanks for your comment, and sorry to hear that. It's interesting that it also happened shortly after a server upgrade for you. Were you able to recover from this? If so, how? Did you abort the recovery like we did?
            Also, I'm curious: are you using any 2.12 clients?


            Elwell Andrew Elwell added a comment -

            I think we're seeing the same issue at Pawsey following an upgrade of our astrofs servers the other week:

            [root@astrofs-oss3 ~]# rpm -qa | grep lustre | sort
            kernel-3.10.0-862.9.1.el7_lustre.x86_64
            kernel-3.10.0-957.el7_lustre.x86_64
            kmod-lustre-2.10.6-1.el7.x86_64
            kmod-lustre-osd-ldiskfs-2.10.6-1.el7.x86_64
            lustre-2.10.6-1.el7.x86_64
            lustre-osd-ldiskfs-mount-2.10.6-1.el7.x86_64
            lustre-resource-agents-2.10.6-1.el7.x86_64
            [root@astrofs-oss3 ~]# uname -r
            3.10.0-957.el7_lustre.x86_64
            [root@astrofs-oss3 ~]#

             

            and it's giving the same call trace that Stephane originally reported:

            [Fri Apr 12 10:35:31 2019] LustreError: dumping log to /tmp/lustre-log.1555036556.16340
            [Fri Apr 12 10:35:31 2019] Pid: 16367, comm: ll_ost_io02_074 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
            [Fri Apr 12 10:35:31 2019] Call Trace:
            [Fri Apr 12 10:35:31 2019]  [<ffffffffb51b58d4>] __lock_page+0x74/0x90
            [Fri Apr 12 10:35:31 2019]  [<ffffffffb51b65d4>] __find_lock_page+0x54/0x70
            [Fri Apr 12 10:35:31 2019]  [<ffffffffb51b7244>] find_or_create_page+0x34/0xa0
            [Fri Apr 12 10:35:31 2019]  [<ffffffffc1018755>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
            [Fri Apr 12 10:35:31 2019]  [<ffffffffc115d5d2>] ofd_preprw_write.isra.30+0x202/0xda0 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115e592>] ofd_preprw+0x422/0x1170 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5706a>] tgt_brw_write+0xc3a/0x17d0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5342a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cfbe5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cff5a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb50c1c31>] kthread+0xd1/0xe0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb5774c1d>] ret_from_fork_nospec_begin+0x7/0x21
            [Fri Apr 12 10:35:32 2019]  [<ffffffffffffffff>] 0xffffffffffffffff
            [Fri Apr 12 10:35:32 2019] Pid: 16256, comm: ll_ost_io02_020 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
            [Fri Apr 12 10:35:32 2019] Call Trace:
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b58d4>] __lock_page+0x74/0x90
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b65d4>] __find_lock_page+0x54/0x70
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b7244>] find_or_create_page+0x34/0xa0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc1018755>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115d5d2>] ofd_preprw_write.isra.30+0x202/0xda0 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115e592>] ofd_preprw+0x422/0x1170 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5706a>] tgt_brw_write+0xc3a/0x17d0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5342a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cfbe5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cff5a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb50c1c31>] kthread+0xd1/0xe0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb5774c1d>] ret_from_fork_nospec_begin+0x7/0x21
            [Fri Apr 12 10:35:32 2019]  [<ffffffffffffffff>] 0xffffffffffffffff
            [Fri Apr 12 10:35:32 2019] Pid: 15878, comm: ll_ost_io02_002 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
            [Fri Apr 12 10:35:32 2019] Call Trace:
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b58d4>] __lock_page+0x74/0x90
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b65d4>] __find_lock_page+0x54/0x70
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b7244>] find_or_create_page+0x34/0xa0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc1018755>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115d5d2>] ofd_preprw_write.isra.30+0x202/0xda0 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115e592>] ofd_preprw+0x422/0x1170 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5706a>] tgt_brw_write+0xc3a/0x17d0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5342a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cfbe5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cff5a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb50c1c31>] kthread+0xd1/0xe0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb5774c1d>] ret_from_fork_nospec_begin+0x7/0x21
            [Fri Apr 12 10:35:32 2019]  [<ffffffffffffffff>] 0xffffffffffffffff
            [Fri Apr 12 10:35:32 2019] Pid: 16302, comm: ll_ost_io02_037 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
            [Fri Apr 12 10:35:32 2019] Call Trace:
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b58d4>] __lock_page+0x74/0x90
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b65d4>] __find_lock_page+0x54/0x70
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b7244>] find_or_create_page+0x34/0xa0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc1018755>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115d5d2>] ofd_preprw_write.isra.30+0x202/0xda0 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115e592>] ofd_preprw+0x422/0x1170 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5706a>] tgt_brw_write+0xc3a/0x17d0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5342a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cfbe5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cff5a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb50c1c31>] kthread+0xd1/0xe0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb5774c1d>] ret_from_fork_nospec_begin+0x7/0x21
            [Fri Apr 12 10:35:32 2019]  [<ffffffffffffffff>] 0xffffffffffffffff
            [Fri Apr 12 10:35:32 2019] LNet: Service thread pid 16285 was inactive for 1201.11s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
            [Fri Apr 12 10:35:32 2019] LNet: Skipped 34 previous similar messages 

            as well as a bunch of other Bulk IO read and write errors.

            For info, the storage for this is via ib_srp to a DDN array rather than the SAS arrays we're using on our other filesystems - so I wonder if it is timing related?

            and we're also seeing IO blocked:

            [root@astrofs-oss3 ~]# vmstat 1
            procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
             r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
             1 193      0 734084 3879288 120170720    0    0  4520  8360    2    1  0  1 67 32  0
             1 193      0 733936 3879288 120170720    0    0     0    16 3474 16751  0  0  0 100  0
             0 193      0 733944 3879288 120170720    0    0     0    20 3141 15211  0  0  0 100  0
             0 193      0 733944 3879288 120170720    0    0     0     8 2111 10145  0  0  0 100  0
             0 193      0 733944 3879288 120170720    0    0     0     8 2723 13012  0  0  0 100  0
             0 193      0 733944 3879288 120170720    0    0     0    12 2938 14461  0  0  0 100  0
             0 193      0 733944 3879288 120170720    0    0     0    16 4151 20537  0  0  0 100  0
             0 193      0 734156 3879288 120171072    0    0     0    20 4384 21279  0  0  0 100  0
            ^C
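            For context, the "b" column in that vmstat output is the count of processes in uninterruptible sleep, so roughly 193 threads are blocked while the CPUs sit in iowait (wa=100). A quick, generic way to see what they are (illustration only, not from the original comment):

            # Group D-state (uninterruptible) tasks by command name
            ps -eo state,comm | awk '$1 == "D"' | sort | uniq -c | sort -rn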
            sthiell Stephane Thiell added a comment - - edited

            pfarrell I confirm the patch does successfully apply on top of 2.10.7 and I was able to build new server rpms for Oak. Thanks!

            pjones I think it's important to clarify what exactly happened, and the title of this issue probably needs to be changed.

            The deadlock happened repeatedly on two servers serving a set of OSTs in IO Cell #2. But it's critical to highlight two things:

            • Every time the OST recovery finished on those servers (always successfully, against about 1400 clients that stayed up the whole time), we hit this deadlock issue shortly afterwards. I definitely think we hit a recovery scenario that triggered this same issue each time (and that's exactly what recovery is for... replaying transactions) - aborting the recovery on those OSTs just broke the cycle and we were finally able to resume operation.
            • Last night I tried to downgrade to Lustre 2.10.6, 2.10.5 and even 2.10.3+patch! (an old version that we have run for quite some time), and EVERY TIME the same issue happened too. This is definitely not a 2.10.7-specific issue. Please feel free to change the title accordingly.

            As for what really triggered the issue, I was thinking it might be due to the recent upgrade of Sherlock (our Lustre clients) to 2.12: because we hadn't restarted Oak since then, we only hit this now when we upgraded Oak... Do you know whether LU-12018 also occurred with 2.12 clients? That would be interesting to check.

            I would also like to mention that when this issue occurred it was extremely difficult to diagnose at first. Oak was down for about 18 hours and we spent most of the time chasing the wrong causes, like network or backend issues, because Lustre was reporting timeout errors and ldiskfs watchdogs due to the deadlock. But I'm glad this issue had already been found, and special thanks to Alex for pointing us to the right patch (at least we hope so, but it very much looks like it).

            Finally, to answer your question, I have the updated rpms ready to be deployed on the Oak servers, but the plan is to not touch anything right now as everything is working again as expected (both 2.10 on Oak and 2.12 on Fir). We'll patch only if an issue occurs, or at the next maintenance (TBD).

            pjones Peter Jones added a comment -

            sthiell

            How soon do you expect to know whether this patch has helped? What puzzles me is that this issue has been seen (albeit quite rarely) as far back as the 2.7.x releases, so it's hard to see why it would suddenly start happening frequently after upgrading to 2.10.7...

            Peter

            sthiell Stephane Thiell added a comment - - edited

            Ok thanks, that's great. It is very short indeed!
            I'm going to prepare a patched 2.10.7 for that then.


            pfarrell Patrick Farrell (Inactive) added a comment -

            Stephane,

            The patch shouldn't require porting.  You can just apply it to the 2.10.7 source without issue.
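            For reference, a minimal sketch of what applying the fix to the 2.10.7 source and rebuilding server RPMs might look like, assuming the standard lustre-release git workflow (the patch file name and configure options are placeholders, not taken from this ticket):

            git clone git://git.whamcloud.com/fs/lustre-release.git
            cd lustre-release
            git checkout v2_10_7                  # 2.10.7 tag; the b2_10 branch also works
            git am /path/to/LU-12018-fix.patch    # placeholder patch file name
            sh ./autogen.sh
            ./configure --enable-server           # plus site-specific kernel/ldiskfs options
            make rpms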


            sthiell Stephane Thiell added a comment -

            bzzz Thanks!!! That must be it! I noticed the same symptoms as described in LU-12018.

            Would you be able to port the patch to b2_10?

            bzzz Alex Zhuravlev added a comment - - edited
            Call Trace:
             [<ffffffff816a94e9>] schedule+0x29/0x70
             [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
             [<ffffffff810c9ef9>] ? select_task_rq_fair+0x549/0x700
             [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
             [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
             [<ffffffff810b07ea>] kthread_create_on_node+0xaa/0x140
             [<ffffffffc0f92110>] ? qsd_reconciliation+0xa90/0xa90 [lquota]
             [<ffffffffc09f4549>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
             [<ffffffffc0f938f8>] qsd_start_reint_thread+0x858/0xc10 [lquota]
             [<ffffffff81267292>] ? dqput+0x162/0x1e0
             [<ffffffffc0f97781>] qsd_ready+0x231/0x3c0 [lquota]
             [<ffffffffc0f9a6d2>] qsd_adjust+0xa2/0x890 [lquota]
             [<ffffffffc0f8cbfa>] ? qsd_refresh_usage+0x6a/0x2b0 [lquota]
             [<ffffffffc0f9ba80>] qsd_op_adjust+0x4e0/0x730 [lquota]
             [<ffffffffc0ff5d80>] osd_object_delete+0x230/0x330 [osd_ldiskfs]
             [<ffffffffc0a106cd>] lu_object_free.isra.31+0x9d/0x1a0 [obdclass]
             [<ffffffffc0a1132e>] lu_site_purge_objects+0x2fe/0x520 [obdclass]
             [<ffffffffc0a12179>] lu_cache_shrink+0x259/0x2d0 [obdclass]
             [<ffffffff81195413>] shrink_slab+0x163/0x330
             [<ffffffff811f7521>] ? vmpressure+0x61/0x90
             [<ffffffff81198091>] zone_reclaim+0x1d1/0x2f0
             [<ffffffff8118c264>] get_page_from_freelist+0x2c4/0x9e0
             [<ffffffff81029557>] ? __switch_to+0xd7/0x510
             [<ffffffff8118caf6>] __alloc_pages_nodemask+0x176/0x420
             [<ffffffff811d1108>] alloc_pages_current+0x98/0x110
             [<ffffffff81182917>] __page_cache_alloc+0x97/0xb0
             [<ffffffff81183625>] find_or_create_page+0x45/0xa0
             [<ffffffffc1018885>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
             [<ffffffffc1192728>] ofd_preprw+0x6a8/0x1150 [ofd]
            
            Call Trace:
             [<ffffffff816a94e9>] schedule+0x29/0x70
             [<ffffffff816aab1d>] rwsem_down_read_failed+0x10d/0x1a0
             [<ffffffff8121a554>] ? shrink_dentry_list+0x274/0x490
             [<ffffffff81332038>] call_rwsem_down_read_failed+0x18/0x30
             [<ffffffff816a8780>] down_read+0x20/0x40
             [<ffffffffc0a11f8a>] lu_cache_shrink+0x6a/0x2d0 [obdclass]
             [<ffffffff81195359>] shrink_slab+0xa9/0x330
             [<ffffffff8118861f>] ? zone_watermark_ok+0x1f/0x30
             [<ffffffff811a8853>] ? compaction_suitable+0xa3/0xb0
             [<ffffffff81198091>] zone_reclaim+0x1d1/0x2f0
             [<ffffffff8118c264>] get_page_from_freelist+0x2c4/0x9e0
             [<ffffffff81085e1b>] ? copy_process+0xeeb/0x1970
             [<ffffffff8118caf6>] __alloc_pages_nodemask+0x176/0x420
             [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
             [<ffffffff8108510d>] copy_process+0x1dd/0x1970
             [<ffffffff81029557>] ? __switch_to+0xd7/0x510
             [<ffffffff81086a51>] do_fork+0x91/0x320
             [<ffffffff816a8fad>] ? __schedule+0x39d/0x8b0
             [<ffffffff81086d06>] kernel_thread+0x26/0x30
             [<ffffffff810b1341>] kthreadd+0x2c1/0x300
            

            bzzz Alex Zhuravlev added a comment -

            sthiell this looks like LU-12018


            sthiell Stephane Thiell added a comment -

            I think we found the issue. There is some kind of incompatibility in the 2.10 recovery with 2.12 clients. It leads to a deadlock, as shown in my logs (I also attached a second sysrq-t dump last night). After remounting all the 2.10.7 OSTs that were hitting bulk IO errors using -o abort_recov (GRR, I really don't like that, it killed a lot of jobs), things are back to normal!! I think! Logs are all super clean. And we are on 2.10.7 now, at least!!
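            For anyone hitting the same cycle, a minimal sketch of that workaround (device path and mount point are placeholders; note that aborting recovery evicts clients with uncommitted requests, which is why it killed jobs):

            # On the OSS: remount an affected OST with recovery aborted
            umount /mnt/lustre/oak-OST0023
            mount -t lustre -o abort_recov /dev/mapper/oak-OST0023 /mnt/lustre/oak-OST0023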


            People

              pjones Peter Jones
              sthiell Stephane Thiell
              Votes: 0
              Watchers: 10
