
Major issue on OSS after upgrading Oak to 2.10.7

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.7
    • Labels: None
    • Environment: Most clients are 2.12.0 (Sherlock) behind 2.12 routers, a few local clients are 2.10.6
    • Severity: 2

    Description

      We recently upgraded the Oak servers to 2.10.7, but since then the OSS have been showing a lot of these errors:

       

      [29459.670359] Pid: 184632, comm: ll_ost_io00_061 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
      [29459.682008] Call Trace:
      [29459.684747]  [<ffffffffa9996684>] __lock_page+0x74/0x90
      [29459.690595]  [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
      [29459.696924]  [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
      [29459.703545]  [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
      [29459.711052]  [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
      [29459.717672]  [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
      [29459.724829]  [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [29459.732543]  [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [29459.741128]  [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [29459.748166]  [<ffffffffa98bdf21>] kthread+0xd1/0xe0
      [29459.753622]  [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
      [29459.760630]  [<ffffffffffffffff>] 0xffffffffffffffff
      [29459.766199] LustreError: dumping log to /tmp/lustre-log.1554392681.184632
      [29558.439589] Lustre: 184902:0:(service.c:1346:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
        req@ffff883e21427450 x1628649065264016/t0(0) o3->214046b3-82aa-4e1f-6785-53da21760b7f@10.8.7.25@o2ib6:295/0 lens 488/432 e 0 to 0 dl 1554392785 ref 2 fl Interpret:/2/0 rc 0/0
      [29558.471803] Lustre: 184902:0:(service.c:1346:ptlrpc_at_send_early_reply()) Skipped 86 previous similar messages
      [29565.756656] LustreError: 184652:0:(ldlm_lib.c:3199:target_bulk_io()) @@@ bulk READ failed: rc -107  req@ffff8841a2c23c50 x1629017475649200/t0(0) o3->9c21af89-2623-9dd8-0966-6a7e83d1ca0c@10.9.105.61@o2ib4:106/0 lens 488/432 e 0 to 0 dl 1554392596 ref 1 fl Interpret:/2/0 rc 0/0
      [29565.783924] LustreError: 184652:0:(ldlm_lib.c:3199:target_bulk_io()) Skipped 41 previous similar messages
      [29565.794652] Lustre: 184652:0:(service.c:2114:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (600:191s); client may timeout.  req@ffff8841a2c23c50 x1629017475649200/t0(0) o3->9c21af89-2623-9dd8-0966-6a7e83d1ca0c@10.9.105.61@o2ib4:106/0 lens 488/432 e 0 to 0 dl 1554392596 ref 1 fl Complete:/2/ffffffff rc -107/-1
      [29565.827647] Lustre: 184652:0:(service.c:2114:ptlrpc_server_handle_request()) Skipped 4 previous similar messages
      [29592.627263] Lustre: oak-OST002f: deleting orphan objects from 0x0:23118956 to 0x0:23118977
      [29652.157536] LNet: Service thread pid 184907 was inactive for 1200.29s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [29652.176670] Pid: 184907, comm: ll_ost_io01_098 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
      [29652.188320] Call Trace:
      [29652.191067]  [<ffffffffa9996684>] __lock_page+0x74/0x90
      [29652.196917]  [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
      [29652.203246]  [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
      [29652.209865]  [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
      [29652.217369]  [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
      [29652.223997]  [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
      [29652.231172]  [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [29652.238889]  [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [29652.247474]  [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [29652.254504]  [<ffffffffa98bdf21>] kthread+0xd1/0xe0
      [29652.259960]  [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
      [29652.266967]  [<ffffffffffffffff>] 0xffffffffffffffff
      [29652.272539] LustreError: dumping log to /tmp/lustre-log.1554392874.184907
      [29688.912990] Lustre: oak-OST0023: Connection restored to MGC10.0.2.51@o2ib5_1 (at 10.0.2.52@o2ib5)
      [29688.922902] Lustre: Skipped 242 previous similar messages
      [29688.929136] Lustre: oak-OST0023: deleting orphan objects from 0x0:23044807 to 0x0:23044835
      [29690.017000] Lustre: oak-OST001d: deleting orphan objects from 0x0:24237516 to 0x0:24237537
      [29690.216982] Lustre: oak-OST000d: deleting orphan objects from 0x0:22136235 to 0x0:22136257
      [29692.305008] Lustre: oak-OST0007: deleting orphan objects from 0x0:21489097 to 0x0:21489122
      [29692.721065] Lustre: oak-OST0021: deleting orphan objects from 0x0:22812231 to 0x0:22812257
      [29754.557446] LNet: Service thread pid 184833 was inactive for 1201.54s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [29754.576566] Pid: 184833, comm: ll_ost_io01_080 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
      [29754.588216] Call Trace:
      [29754.590953]  [<ffffffffa9996684>] __lock_page+0x74/0x90
      [29754.596802]  [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
      [29754.603129]  [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
      [29754.609750]  [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
      [29754.617253]  [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
      [29754.623881]  [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
      [29754.631039]  [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [29754.638753]  [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [29754.647335]  [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [29754.654363]  [<ffffffffa98bdf21>] kthread+0xd1/0xe0
      [29754.659821]  [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
      [29754.666831]  [<ffffffffffffffff>] 0xffffffffffffffff
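      As a side note (standard lctl usage, not part of the original report), the "dumping log to /tmp/lustre-log.*" lines above refer to binary Lustre debug dumps, which can be converted to text for inspection:

      # Convert the binary debug dump referenced in the log above into readable text
      lctl debug_file /tmp/lustre-log.1554392681.184632 /tmp/lustre-log.1554392681.184632.txt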
      

      Users are reporting problems accessing some file data (from 2.12.0 clients).

      When checking with htop, the OSS are quiet, definitely too quiet, but the load keeps increasing, which makes me think there is some kind of thread deadlock on the OSS. I'll attach logs soon.
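      As a rough illustration of that suspicion (generic commands, not from the original report): a load average that climbs while the CPUs stay idle usually means threads stuck in uninterruptible sleep (D state), and their kernel stacks can be inspected directly:

      # List threads in uninterruptible sleep; hung ll_ost_io threads show up here
      ps -eLo pid,stat,comm | awk '$2 ~ /^D/'
      # Dump the kernel stack of one suspect thread (PID taken from the trace above)
      cat /proc/184632/stack
      # If sysrq is enabled, dump all task stacks to dmesg (the "sysrq t" mentioned later)
      echo t > /proc/sysrq-trigger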

      Attachments

        Issue Links

          Activity

            [LU-12162] Major issue on OSS after upgrading Oak to 2.10.7

            sthiell Stephane Thiell added a comment -

            Hi Andrew,
            Thanks for your comment, and sorry to hear that. It's interesting that it also happened shortly after a server upgrade for you. Were you able to recover from this? If so, how? Did you abort the recovery like we did?
            Also, I'm curious: are you using any 2.12 clients?


            Elwell Andrew Elwell added a comment -

            I think we're seeing the same issue at Pawsey following an upgrade of our astrofs servers the other week:

            [root@astrofs-oss3 ~]# rpm -qa | grep lustre | sort
            kernel-3.10.0-862.9.1.el7_lustre.x86_64
            kernel-3.10.0-957.el7_lustre.x86_64
            kmod-lustre-2.10.6-1.el7.x86_64
            kmod-lustre-osd-ldiskfs-2.10.6-1.el7.x86_64
            lustre-2.10.6-1.el7.x86_64
            lustre-osd-ldiskfs-mount-2.10.6-1.el7.x86_64
            lustre-resource-agents-2.10.6-1.el7.x86_64
            [root@astrofs-oss3 ~]# uname -r
            3.10.0-957.el7_lustre.x86_64
            [root@astrofs-oss3 ~]#

             

            and it's giving the same call trace that Stephane originally reported:

            [Fri Apr 12 10:35:31 2019] LustreError: dumping log to /tmp/lustre-log.1555036556.16340
            [Fri Apr 12 10:35:31 2019] Pid: 16367, comm: ll_ost_io02_074 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
            [Fri Apr 12 10:35:31 2019] Call Trace:
            [Fri Apr 12 10:35:31 2019]  [<ffffffffb51b58d4>] __lock_page+0x74/0x90
            [Fri Apr 12 10:35:31 2019]  [<ffffffffb51b65d4>] __find_lock_page+0x54/0x70
            [Fri Apr 12 10:35:31 2019]  [<ffffffffb51b7244>] find_or_create_page+0x34/0xa0
            [Fri Apr 12 10:35:31 2019]  [<ffffffffc1018755>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
            [Fri Apr 12 10:35:31 2019]  [<ffffffffc115d5d2>] ofd_preprw_write.isra.30+0x202/0xda0 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115e592>] ofd_preprw+0x422/0x1170 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5706a>] tgt_brw_write+0xc3a/0x17d0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5342a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cfbe5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cff5a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb50c1c31>] kthread+0xd1/0xe0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb5774c1d>] ret_from_fork_nospec_begin+0x7/0x21
            [Fri Apr 12 10:35:32 2019]  [<ffffffffffffffff>] 0xffffffffffffffff
            [Fri Apr 12 10:35:32 2019] Pid: 16256, comm: ll_ost_io02_020 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
            [Fri Apr 12 10:35:32 2019] Call Trace:
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b58d4>] __lock_page+0x74/0x90
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b65d4>] __find_lock_page+0x54/0x70
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b7244>] find_or_create_page+0x34/0xa0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc1018755>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115d5d2>] ofd_preprw_write.isra.30+0x202/0xda0 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115e592>] ofd_preprw+0x422/0x1170 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5706a>] tgt_brw_write+0xc3a/0x17d0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5342a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cfbe5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cff5a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb50c1c31>] kthread+0xd1/0xe0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb5774c1d>] ret_from_fork_nospec_begin+0x7/0x21
            [Fri Apr 12 10:35:32 2019]  [<ffffffffffffffff>] 0xffffffffffffffff
            [Fri Apr 12 10:35:32 2019] Pid: 15878, comm: ll_ost_io02_002 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
            [Fri Apr 12 10:35:32 2019] Call Trace:
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b58d4>] __lock_page+0x74/0x90
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b65d4>] __find_lock_page+0x54/0x70
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b7244>] find_or_create_page+0x34/0xa0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc1018755>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115d5d2>] ofd_preprw_write.isra.30+0x202/0xda0 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115e592>] ofd_preprw+0x422/0x1170 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5706a>] tgt_brw_write+0xc3a/0x17d0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5342a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cfbe5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cff5a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb50c1c31>] kthread+0xd1/0xe0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb5774c1d>] ret_from_fork_nospec_begin+0x7/0x21
            [Fri Apr 12 10:35:32 2019]  [<ffffffffffffffff>] 0xffffffffffffffff
            [Fri Apr 12 10:35:32 2019] Pid: 16302, comm: ll_ost_io02_037 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
            [Fri Apr 12 10:35:32 2019] Call Trace:
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b58d4>] __lock_page+0x74/0x90
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b65d4>] __find_lock_page+0x54/0x70
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb51b7244>] find_or_create_page+0x34/0xa0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc1018755>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115d5d2>] ofd_preprw_write.isra.30+0x202/0xda0 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc115e592>] ofd_preprw+0x422/0x1170 [ofd]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5706a>] tgt_brw_write+0xc3a/0x17d0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0d5342a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cfbe5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffc0cff5a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb50c1c31>] kthread+0xd1/0xe0
            [Fri Apr 12 10:35:32 2019]  [<ffffffffb5774c1d>] ret_from_fork_nospec_begin+0x7/0x21
            [Fri Apr 12 10:35:32 2019]  [<ffffffffffffffff>] 0xffffffffffffffff
            [Fri Apr 12 10:35:32 2019] LNet: Service thread pid 16285 was inactive for 1201.11s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
            [Fri Apr 12 10:35:32 2019] LNet: Skipped 34 previous similar messages 

            as well as a bunch of other Bulk IO read and write errors.

            For info, the storage for this is via ib_srp to a DDN array rather than the SAS arrays we're using on our other filesystems - so I wonder if it is timing related?

            and we're also seeing IO blocked:

            [root@astrofs-oss3 ~]# vmstat 1
            procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
             r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
             1 193      0 734084 3879288 120170720    0    0  4520  8360    2    1  0  1 67 32  0
             1 193      0 733936 3879288 120170720    0    0     0    16 3474 16751  0  0  0 100  0
             0 193      0 733944 3879288 120170720    0    0     0    20 3141 15211  0  0  0 100  0
             0 193      0 733944 3879288 120170720    0    0     0     8 2111 10145  0  0  0 100  0
             0 193      0 733944 3879288 120170720    0    0     0     8 2723 13012  0  0  0 100  0
             0 193      0 733944 3879288 120170720    0    0     0    12 2938 14461  0  0  0 100  0
             0 193      0 733944 3879288 120170720    0    0     0    16 4151 20537  0  0  0 100  0
             0 193      0 734156 3879288 120171072    0    0     0    20 4384 21279  0  0  0 100  0
            ^C
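            For context, the "b" column in that vmstat output is the count of processes in uninterruptible sleep, so roughly 193 threads are blocked while the CPUs sit in iowait (wa=100). A quick, generic way to see what they are (illustration only, not from the original comment):

            # Group D-state (uninterruptible) tasks by command name
            ps -eo state,comm | awk '$1 == "D"' | sort | uniq -c | sort -rn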
            sthiell Stephane Thiell added a comment - - edited

            pfarrell I confirm the patch does successfully apply on top of 2.10.7 and I was able to build new server rpms for Oak. Thanks!

            pjones I think it's important to clarify what exactly happened, and the title of this issue probably needs to be changed.

            The deadlock happened repeatedly on two servers serving a set of OSTs in IO Cell #2. But it's critical to highlight two things:

            • Every time the OST recovery finished on those servers (always successfully, against about 1400 clients that stayed up the whole time), we hit this deadlock issue shortly afterwards. I definitely think we hit a recovery scenario that triggered this same issue each time (and that's exactly what recovery is for... replaying transactions) - aborting the recovery on those OSTs just broke the cycle and we were finally able to resume operation.
            • Last night I tried to downgrade to Lustre 2.10.6, 2.10.5 and even 2.10.3+patch! (an old version that we have run for quite some time), and EVERY TIME the same issue happened too. This is definitely not a 2.10.7-specific issue. Please feel free to change the title accordingly.

            As for what really triggered the issue, I was thinking it might be due to the recent upgrade of Sherlock (our Lustre clients) to 2.12: because we hadn't restarted Oak since then, we only hit this now when we upgraded Oak... Do you know whether LU-12018 also occurred with 2.12 clients? That would be interesting to check.

            I would also like to mention that when this issue occurred it was extremely difficult to diagnose at first. Oak was down for about 18 hours and we spent most of the time chasing the wrong causes, like network or backend issues, because Lustre was reporting timeout errors and ldiskfs watchdogs due to the deadlock. But I'm glad this issue had already been found, and special thanks to Alex for pointing us to the right patch (at least we hope so, but it very much looks like it).

            Finally, to answer your question, I have the updated rpms ready to be deployed on the Oak servers, but the plan is to not touch anything right now as everything is working again as expected (both 2.10 on Oak and 2.12 on Fir). We'll patch only if an issue occurs, or at the next maintenance (TBD).

            pjones Peter Jones added a comment -

            sthiell

            How soon do you expect to know whether this patch has helped? What puzzles me is that this issue has been seen (albeit quite rarely) as far back as the 2.7.x releases, so it's hard to see why it would suddenly start happening frequently after upgrading to 2.10.7...

            Peter

            sthiell Stephane Thiell added a comment - - edited

            Ok thanks, that's great. It is very short indeed!
            I'm going to prepare a patched 2.10.7 for that then.


            pfarrell Patrick Farrell (Inactive) added a comment -

            Stephane,

            The patch shouldn't require porting.  You can just apply it to the 2.10.7 source without issue.
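            For reference, a minimal sketch of what applying the fix to the 2.10.7 source and rebuilding server RPMs might look like, assuming the standard lustre-release git workflow (the patch file name and configure options are placeholders, not taken from this ticket):

            git clone git://git.whamcloud.com/fs/lustre-release.git
            cd lustre-release
            git checkout v2_10_7                  # 2.10.7 tag; the b2_10 branch also works
            git am /path/to/LU-12018-fix.patch    # placeholder patch file name
            sh ./autogen.sh
            ./configure --enable-server           # plus site-specific kernel/ldiskfs options
            make rpms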


            sthiell Stephane Thiell added a comment -

            bzzz Thanks!!! That must be it! I noticed the same symptoms as described in LU-12018.

            Would you be able to port the patch to b2_10?

            bzzz Alex Zhuravlev added a comment - - edited
            Call Trace:
             [<ffffffff816a94e9>] schedule+0x29/0x70
             [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
             [<ffffffff810c9ef9>] ? select_task_rq_fair+0x549/0x700
             [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
             [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
             [<ffffffff810b07ea>] kthread_create_on_node+0xaa/0x140
             [<ffffffffc0f92110>] ? qsd_reconciliation+0xa90/0xa90 [lquota]
             [<ffffffffc09f4549>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
             [<ffffffffc0f938f8>] qsd_start_reint_thread+0x858/0xc10 [lquota]
             [<ffffffff81267292>] ? dqput+0x162/0x1e0
             [<ffffffffc0f97781>] qsd_ready+0x231/0x3c0 [lquota]
             [<ffffffffc0f9a6d2>] qsd_adjust+0xa2/0x890 [lquota]
             [<ffffffffc0f8cbfa>] ? qsd_refresh_usage+0x6a/0x2b0 [lquota]
             [<ffffffffc0f9ba80>] qsd_op_adjust+0x4e0/0x730 [lquota]
             [<ffffffffc0ff5d80>] osd_object_delete+0x230/0x330 [osd_ldiskfs]
             [<ffffffffc0a106cd>] lu_object_free.isra.31+0x9d/0x1a0 [obdclass]
             [<ffffffffc0a1132e>] lu_site_purge_objects+0x2fe/0x520 [obdclass]
             [<ffffffffc0a12179>] lu_cache_shrink+0x259/0x2d0 [obdclass]
             [<ffffffff81195413>] shrink_slab+0x163/0x330
             [<ffffffff811f7521>] ? vmpressure+0x61/0x90
             [<ffffffff81198091>] zone_reclaim+0x1d1/0x2f0
             [<ffffffff8118c264>] get_page_from_freelist+0x2c4/0x9e0
             [<ffffffff81029557>] ? __switch_to+0xd7/0x510
             [<ffffffff8118caf6>] __alloc_pages_nodemask+0x176/0x420
             [<ffffffff811d1108>] alloc_pages_current+0x98/0x110
             [<ffffffff81182917>] __page_cache_alloc+0x97/0xb0
             [<ffffffff81183625>] find_or_create_page+0x45/0xa0
             [<ffffffffc1018885>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
             [<ffffffffc1192728>] ofd_preprw+0x6a8/0x1150 [ofd]
            
            Call Trace:
             [<ffffffff816a94e9>] schedule+0x29/0x70
             [<ffffffff816aab1d>] rwsem_down_read_failed+0x10d/0x1a0
             [<ffffffff8121a554>] ? shrink_dentry_list+0x274/0x490
             [<ffffffff81332038>] call_rwsem_down_read_failed+0x18/0x30
             [<ffffffff816a8780>] down_read+0x20/0x40
             [<ffffffffc0a11f8a>] lu_cache_shrink+0x6a/0x2d0 [obdclass]
             [<ffffffff81195359>] shrink_slab+0xa9/0x330
             [<ffffffff8118861f>] ? zone_watermark_ok+0x1f/0x30
             [<ffffffff811a8853>] ? compaction_suitable+0xa3/0xb0
             [<ffffffff81198091>] zone_reclaim+0x1d1/0x2f0
             [<ffffffff8118c264>] get_page_from_freelist+0x2c4/0x9e0
             [<ffffffff81085e1b>] ? copy_process+0xeeb/0x1970
             [<ffffffff8118caf6>] __alloc_pages_nodemask+0x176/0x420
             [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
             [<ffffffff8108510d>] copy_process+0x1dd/0x1970
             [<ffffffff81029557>] ? __switch_to+0xd7/0x510
             [<ffffffff81086a51>] do_fork+0x91/0x320
             [<ffffffff816a8fad>] ? __schedule+0x39d/0x8b0
             [<ffffffff81086d06>] kernel_thread+0x26/0x30
             [<ffffffff810b1341>] kthreadd+0x2c1/0x300
            

            bzzz Alex Zhuravlev added a comment -

            sthiell this looks like LU-12018


            sthiell Stephane Thiell added a comment -

            I think we found the issue. There is some kind of incompatibility in the 2.10 recovery with 2.12 clients. It leads to a deadlock, as shown in my logs (I also attached a second sysrq-t dump last night). After remounting all the 2.10.7 OSTs that were hitting bulk IO errors using -o abort_recov (GRR, I really don't like that, it killed a lot of jobs), things are back to normal!! I think! Logs are all super clean. And we are on 2.10.7 now, at least!!
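            For anyone hitting the same cycle, a minimal sketch of that workaround (device path and mount point are placeholders; note that aborting recovery evicts clients with uncommitted requests, which is why it killed jobs):

            # On the OSS: remount an affected OST with recovery aborted
            umount /mnt/lustre/oak-OST0023
            mount -t lustre -o abort_recov /dev/mapper/oak-OST0023 /mnt/lustre/oak-OST0023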


            People

              pjones Peter Jones
              sthiell Stephane Thiell
              Votes: 0
              Watchers: 10
