LU-12162: Major issue on OSS after upgrading Oak to 2.10.7

Project: Lustre

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.7
    • Component/s: None
    • Environment: Most clients are 2.12.0 (Sherlock) behind 2.12 routers, a few local clients are 2.10.6
    • Severity: 2

    Description

      We recently upgraded Oak servers to 2.10.7 but since then, the OSS are showing a lot of these errors:

       

      [29459.670359] Pid: 184632, comm: ll_ost_io00_061 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
      [29459.682008] Call Trace:
      [29459.684747]  [<ffffffffa9996684>] __lock_page+0x74/0x90
      [29459.690595]  [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
      [29459.696924]  [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
      [29459.703545]  [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
      [29459.711052]  [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
      [29459.717672]  [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
      [29459.724829]  [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [29459.732543]  [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [29459.741128]  [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [29459.748166]  [<ffffffffa98bdf21>] kthread+0xd1/0xe0
      [29459.753622]  [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
      [29459.760630]  [<ffffffffffffffff>] 0xffffffffffffffff
      [29459.766199] LustreError: dumping log to /tmp/lustre-log.1554392681.184632
      [29558.439589] Lustre: 184902:0:(service.c:1346:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
        req@ffff883e21427450 x1628649065264016/t0(0) o3->214046b3-82aa-4e1f-6785-53da21760b7f@10.8.7.25@o2ib6:295/0 lens 488/432 e 0 to 0 dl 1554392785 ref 2 fl Interpret:/2/0 rc 0/0
      [29558.471803] Lustre: 184902:0:(service.c:1346:ptlrpc_at_send_early_reply()) Skipped 86 previous similar messages
      [29565.756656] LustreError: 184652:0:(ldlm_lib.c:3199:target_bulk_io()) @@@ bulk READ failed: rc -107  req@ffff8841a2c23c50 x1629017475649200/t0(0) o3->9c21af89-2623-9dd8-0966-6a7e83d1ca0c@10.9.105.61@o2ib4:106/0 lens 488/432 e 0 to 0 dl 1554392596 ref 1 fl Interpret:/2/0 rc 0/0
      [29565.783924] LustreError: 184652:0:(ldlm_lib.c:3199:target_bulk_io()) Skipped 41 previous similar messages
      [29565.794652] Lustre: 184652:0:(service.c:2114:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (600:191s); client may timeout.  req@ffff8841a2c23c50 x1629017475649200/t0(0) o3->9c21af89-2623-9dd8-0966-6a7e83d1ca0c@10.9.105.61@o2ib4:106/0 lens 488/432 e 0 to 0 dl 1554392596 ref 1 fl Complete:/2/ffffffff rc -107/-1
      [29565.827647] Lustre: 184652:0:(service.c:2114:ptlrpc_server_handle_request()) Skipped 4 previous similar messages
      [29592.627263] Lustre: oak-OST002f: deleting orphan objects from 0x0:23118956 to 0x0:23118977
      [29652.157536] LNet: Service thread pid 184907 was inactive for 1200.29s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [29652.176670] Pid: 184907, comm: ll_ost_io01_098 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
      [29652.188320] Call Trace:
      [29652.191067]  [<ffffffffa9996684>] __lock_page+0x74/0x90
      [29652.196917]  [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
      [29652.203246]  [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
      [29652.209865]  [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
      [29652.217369]  [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
      [29652.223997]  [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
      [29652.231172]  [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [29652.238889]  [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [29652.247474]  [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [29652.254504]  [<ffffffffa98bdf21>] kthread+0xd1/0xe0
      [29652.259960]  [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
      [29652.266967]  [<ffffffffffffffff>] 0xffffffffffffffff
      [29652.272539] LustreError: dumping log to /tmp/lustre-log.1554392874.184907
      [29688.912990] Lustre: oak-OST0023: Connection restored to MGC10.0.2.51@o2ib5_1 (at 10.0.2.52@o2ib5)
      [29688.922902] Lustre: Skipped 242 previous similar messages
      [29688.929136] Lustre: oak-OST0023: deleting orphan objects from 0x0:23044807 to 0x0:23044835
      [29690.017000] Lustre: oak-OST001d: deleting orphan objects from 0x0:24237516 to 0x0:24237537
      [29690.216982] Lustre: oak-OST000d: deleting orphan objects from 0x0:22136235 to 0x0:22136257
      [29692.305008] Lustre: oak-OST0007: deleting orphan objects from 0x0:21489097 to 0x0:21489122
      [29692.721065] Lustre: oak-OST0021: deleting orphan objects from 0x0:22812231 to 0x0:22812257
      [29754.557446] LNet: Service thread pid 184833 was inactive for 1201.54s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [29754.576566] Pid: 184833, comm: ll_ost_io01_080 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
      [29754.588216] Call Trace:
      [29754.590953]  [<ffffffffa9996684>] __lock_page+0x74/0x90
      [29754.596802]  [<ffffffffa99971a4>] __find_lock_page+0x54/0x70
      [29754.603129]  [<ffffffffa9997e14>] find_or_create_page+0x34/0xa0
      [29754.609750]  [<ffffffffc1257ac5>] osd_bufs_get+0x235/0x430 [osd_ldiskfs]
      [29754.617253]  [<ffffffffc13d18fb>] ofd_preprw+0x6bb/0x1170 [ofd]
      [29754.623881]  [<ffffffffc0dc7845>] tgt_brw_read+0x975/0x1860 [ptlrpc]
      [29754.631039]  [<ffffffffc0dc572a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [29754.638753]  [<ffffffffc0d6e04b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [29754.647335]  [<ffffffffc0d71792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [29754.654363]  [<ffffffffa98bdf21>] kthread+0xd1/0xe0
      [29754.659821]  [<ffffffffa9f255f7>] ret_from_fork_nospec_end+0x0/0x39
      [29754.666831]  [<ffffffffffffffff>] 0xffffffffffffffff
      

      Users are reporting problems accessing some file data (from 2.12.0 clients).

      When checking with htop, the OSS are quiet, definitely too quiet, but the load keeps increasing, which makes me think there is some kind of thread deadlock on the OSS. I'll attach logs soon.
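      One quick way to check the deadlock theory (a sketch, not part of the original report) is to look for service threads stuck in uninterruptible sleep (D state): a thread blocked in __lock_page, as in the traces above, would show STAT "D" and a matching wait channel. The PID used below is taken from the trace above purely as an example.

      ```shell
      # List all tasks currently in uninterruptible sleep (D state).
      # Hung ll_ost_io threads blocked in __lock_page would appear here
      # with STAT starting with "D" and names like ll_ost_io01_098.
      ps -eo pid,stat,wchan:32,comm --no-headers | awk '$2 ~ /^D/'

      # For one specific suspect thread (184907 is the PID from the
      # trace above), the live kernel stack can be dumped as root:
      #   cat /proc/184907/stack
      ```

      If many ll_ost_io threads sit in D state on the same wait channel while the disks are idle, that points to a lock-ordering problem rather than slow storage.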

People

    Assignee: Peter Jones (pjones)
    Reporter: Stephane Thiell (sthiell)
    Votes: 0
    Watchers: 10