Lustre / LU-8488

LBUG in ptl_send_rpc(), request->rq_xid <= imp->imp_known_replied_xid fails

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.8.0

    Description

      We have had three clients crash pretty quickly when testing Lustre 2.8 clients with lustre 2.5.5 servers. The clients were using tag 2.8.0_0.0.llnlpreview.18, the servers are using 2.5.5-8chaos (see the lustre-release-fe-llnl repo for each).

      They hit the following LBUG:

      2016-08-09 08:24:53 [569127.746044] Lustre: lcy-OST0004-osc-ffff882025188000: Connection restored to 10.1.1.175@o2ib9 (at 10.1.1.175@o2ib9)
      2016-08-09 08:24:53 [569127.760391] Lustre: Skipped 7 previous similar messages
      2016-08-09 08:27:04 [569258.797932] LustreError: 166-1: MGC10.1.1.169@o2ib9: Connection to MGS (at 10.1.1.169@o2ib9) was lost; in progress operations using this service will fail
      2016-08-09 08:28:14 [569328.893397] Lustre: lcy-OST0004-osc-ffff882025188000: Connection to lcy-OST0004 (at 10.1.1.175@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
      2016-08-09 08:28:14 [569328.918048] Lustre: lcy-OST0004-osc-ffff882025188000: Connection restored to 10.1.1.175@o2ib9 (at 10.1.1.175@o2ib9)
      2016-08-09 08:28:14 [569328.932430] Lustre: Skipped 1 previous similar message
      2016-08-09 08:36:14 [569809.168166] Lustre: 130908:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1470756919/real 1470756919]  req@ffff880d11cae600 x1541675108398336/t0(0) o8->lquake-OST0003-osc-ffff882022b7f800@172.19.1.130@o2ib100:28/4 lens 520/544 e 0 to 1 dl 1470756974 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      2016-08-09 08:36:14 [569809.209141] Lustre: 130908:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
      2016-08-09 08:36:40 [569835.180706] Lustre: lcy-OST0008-osc-ffff882025188000: Connection to lcy-OST0008 (at 10.1.1.179@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
      2016-08-09 08:36:40 [569835.206029] Lustre: lcy-OST0008-osc-ffff882025188000: Connection restored to 10.1.1.179@o2ib9 (at 10.1.1.179@o2ib9)
      2016-08-09 08:46:15 [570411.139216] Lustre: 130984:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1470757409/real 1470757409]  req@ffff881f9afbce00 x1541675108443980/t0(0) o10->lcy-OST0001-osc-ffff882025188000@10.1.1.172@o2ib9:6/4 lens 560/432 e 3 to 1 dl 1470757575 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
      2016-08-09 08:46:15 [570411.180184] Lustre: 130984:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 22 previous similar messages
      2016-08-09 08:46:15 [570411.195473] LustreError: 130984:0:(niobuf.c:732:ptl_send_rpc()) @@@ xid: 1541675108443980, replied: 1541675108452731, list_empty:0
      2016-08-09 08:46:15 [570411.195473]   req@ffff881f9afbce00 x1541675108443980/t0(0) o10->lcy-OST0001-osc-ffff882025188000@10.1.1.172@o2ib9:6/4 lens 560/432 e 3 to 1 dl 1470757575 ref 1 fl Rpc:XS/2/ffffffff rc -11/-1
      2016-08-09 08:46:15 [570411.238718] LustreError: 130984:0:(niobuf.c:733:ptl_send_rpc()) LBUG
      2016-08-09 08:46:15 [570411.248645] Pid: 130984, comm: ptlrpcd_07_05
      2016-08-09 08:46:15 [570411.256196]
      2016-08-09 08:46:15 [570411.256196] Call Trace:
      2016-08-09 08:46:15 [570411.266553]  [<ffffffffa0aea7e3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      2016-08-09 08:46:15 [570411.278045]  [<ffffffffa0aead85>] lbug_with_loc+0x45/0xc0 [libcfs]
      2016-08-09 08:46:15 [570411.288199]  [<ffffffffa0ec853a>] ptl_send_rpc+0xa1a/0xdb0 [ptlrpc]
      2016-08-09 08:46:15 [570411.298337]  [<ffffffffa0ec1209>] ptlrpc_check_set.part.21+0x1789/0x1d80 [ptlrpc]
      2016-08-09 08:46:15 [570411.309879]  [<ffffffffa0ec185b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
      2016-08-09 08:46:16 [570411.319667]  [<ffffffffa0eed03b>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc]
      2016-08-09 08:46:16 [570411.329345]  [<ffffffffa0eed3eb>] ptlrpcd+0x2bb/0x560 [ptlrpc]
      2016-08-09 08:46:16 [570411.338362]  [<ffffffff810bd4b0>] ? default_wake_function+0x0/0x20
      2016-08-09 08:46:16 [570411.347762]  [<ffffffffa0eed130>] ? ptlrpcd+0x0/0x560 [ptlrpc]
      2016-08-09 08:46:16 [570411.356709]  [<ffffffff810a997f>] kthread+0xcf/0xe0
      2016-08-09 08:46:16 [570411.364548]  [<ffffffff810a98b0>] ? kthread+0x0/0xe0
      2016-08-09 08:46:16 [570411.372443]  [<ffffffff8165d658>] ret_from_fork+0x58/0x90
      2016-08-09 08:46:16 [570411.380778]  [<ffffffff810a98b0>] ? kthread+0x0/0xe0
      2016-08-09 08:46:16 [570411.388582] 
      2016-08-09 08:46:16 [570411.392834] Kernel panic - not syncing: LBUG
      2016-08-09 08:46:16 [570411.399778] CPU: 33 PID: 130984 Comm: ptlrpcd_07_05 Tainted: P           OE  ------------   3.10.0-327.22.2.1chaos.ch6.x86_64 #1
      2016-08-09 08:46:16 [570411.416933] Hardware name: Penguin Computing Relion OCP1930e/S2600KPR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
      2016-08-09 08:46:16 [570411.431470]  ffffffffa0b06e0f 000000003006f545 ffff880fb772bb48 ffffffff8164c6b4
      2016-08-09 08:46:16 [570411.441962]  ffff880fb772bbc8 ffffffff816456af ffffffff00000008 ffff880fb772bbd8
      2016-08-09 08:46:16 [570411.452453]  ffff880fb772bb78 000000003006f545 ffffffffa0f5a675 0000000000000246
      2016-08-09 08:46:16 [570411.462912] Call Trace:
      2016-08-09 08:46:16 [570411.467750]  [<ffffffff8164c6b4>] dump_stack+0x19/0x1b
      2016-08-09 08:46:16 [570411.475583]  [<ffffffff816456af>] panic+0xd8/0x1e7
      2016-08-09 08:46:16 [570411.482985]  [<ffffffffa0aeadeb>] lbug_with_loc+0xab/0xc0 [libcfs]
      2016-08-09 08:46:16 [570411.491940]  [<ffffffffa0ec853a>] ptl_send_rpc+0xa1a/0xdb0 [ptlrpc]
      2016-08-09 08:46:16 [570411.500942]  [<ffffffffa0ec1209>] ptlrpc_check_set.part.21+0x1789/0x1d80 [ptlrpc]
      2016-08-09 08:46:16 [570411.511280]  [<ffffffffa0ec185b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
      2016-08-09 08:46:16 [570411.520423]  [<ffffffffa0eed03b>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc]
      2016-08-09 08:46:16 [570411.529433]  [<ffffffffa0eed3eb>] ptlrpcd+0x2bb/0x560 [ptlrpc]
      2016-08-09 08:46:16 [570411.537780]  [<ffffffff810bd4b0>] ? wake_up_state+0x20/0x20
      2016-08-09 08:46:16 [570411.545821]  [<ffffffffa0eed130>] ? ptlrpcd_check+0x5e0/0x5e0 [ptlrpc]
      2016-08-09 08:46:16 [570411.554875]  [<ffffffff810a997f>] kthread+0xcf/0xe0
      2016-08-09 08:46:16 [570411.562041]  [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
      2016-08-09 08:46:16 [570411.571021]  [<ffffffff8165d658>] ret_from_fork+0x58/0x90
      2016-08-09 08:46:16 [570411.578700]  [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
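      The assertion that fired at niobuf.c:732 compares the request's xid against the import's highest known-replied xid. The following is a minimal standalone sketch of that check, using the xid values from this crash; the struct names mirror the Lustre fields but are trimmed stand-ins, not Lustre code.

      ```c
      /* Standalone model of the sanity check that fired in ptl_send_rpc().
       * Struct names mirror Lustre fields but are hypothetical stand-ins. */
      #include <stdint.h>
      #include <stdio.h>

      struct ptlrpc_request {                  /* trimmed to one field */
              uint64_t rq_xid;                 /* transfer id of this request */
      };

      struct obd_import {                      /* trimmed to one field */
              uint64_t imp_known_replied_xid;  /* highest xid known to be replied */
      };

      /* Invariant: a request about to be (re)sent must carry an xid strictly
       * greater than the highest xid the server is known to have replied to. */
      static int xid_sane(const struct ptlrpc_request *req,
                          const struct obd_import *imp)
      {
              return req->rq_xid > imp->imp_known_replied_xid;
      }

      int main(void)
      {
              /* The values from this crash: the request xid is BELOW the
               * known-replied xid, so the check fails and the real code
               * calls LBUG(). */
              struct obd_import imp = { .imp_known_replied_xid = 1541675108452731ULL };
              struct ptlrpc_request req = { .rq_xid = 1541675108443980ULL };

              if (!xid_sane(&req, &imp))
                      printf("LBUG: xid %llu <= known replied %llu\n",
                             (unsigned long long)req.rq_xid,
                             (unsigned long long)imp.imp_known_replied_xid);
              return 0;
      }
      ```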
      

      Attachments

        Activity

          pjones Peter Jones added a comment -

          Patch ported and landed to 2.8 FE branch


          ofaaland Olaf Faaland added a comment -

          Alex,

          It looks like something happened in the autotest system and you need to kick off testing again. There appear to be no test results.

          bzzz Alex Zhuravlev added a comment -

          I've ported LU-6808: https://review.whamcloud.com/#/c/30018/ but am still waiting for confirmation from autotest.
          pjones Peter Jones added a comment -

          Alex

          Please connect with Minh to get the necessary permissions

          Peter


          bzzz Alex Zhuravlev added a comment -

          Unfortunately I can't access the lustre-release-fe-llnl repo; could someone please make a tarball and attach it?
          ofaaland Olaf Faaland added a comment -

          Recently, we have seen this with 2.5.x servers and 2.8.x clients.  Our general file system tests produced it a couple of times per day, but I do not know the volume of file system tests that ran before it appeared.  They are a fairly small subset of our total test suite.

          pjones Peter Jones added a comment -

          LLNL have flagged this as important. Is the situation the same that this LBUG is only seen with 2.5.x servers and 2.8.x clients? And this happens pretty quickly in such a configuration?


          niu Niu Yawei (Inactive) added a comment -

          All of these failures happened on requests to old OST servers, so I think it's highly likely related to LU-8193: before the fix for LU-8193, a bulk resend would change a request's rq_xid directly while the request was on the unreplied list, which breaks the sanity check against the known replied xid.

          LU-6808 also has some fixes related to this (bulk requests, connecting to old servers).
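          The failure mode described above can be sketched with a toy model. In Lustre, the known-replied xid is derived from the head of the sorted unreplied list (everything below the smallest unreplied xid is assumed replied); the sketch below shows how changing a request's xid in place, without re-inserting it at its sorted position, lets that derived value leapfrog requests still in flight. The array, xid values, and helper here are illustrative stand-ins, not the real imp_unreplied_list code.

          ```c
          /* Toy model of the LU-8193 scenario: an in-place xid change corrupts
           * the sorted unreplied list. All names and values are illustrative. */
          #include <assert.h>
          #include <stdint.h>
          #include <stdio.h>

          #define NREQ 3

          /* Known-replied xid derived from the head of the (assumed sorted)
           * unreplied list: everything below the smallest unreplied xid is
           * taken to be replied. */
          static uint64_t known_replied_xid(const uint64_t *unreplied, int n)
          {
                  return n ? unreplied[0] - 1 : 0;
          }

          int main(void)
          {
                  uint64_t unreplied[NREQ] = { 100, 101, 102 };  /* sorted by xid */

                  assert(known_replied_xid(unreplied, NREQ) == 99);

                  /* Buggy bulk resend: assign the head request a fresh xid
                   * without re-inserting it at its sorted position. The head
                   * entry is no longer the minimum. */
                  unreplied[0] = 103;

                  /* The derived value now claims 102, even though xids 101
                   * and 102 are still in flight. */
                  uint64_t known = known_replied_xid(unreplied, NREQ);
                  printf("known_replied_xid = %llu\n", (unsigned long long)known);

                  /* Sending the request with xid 101 now trips the
                   * ptl_send_rpc() check: rq_xid <= imp_known_replied_xid. */
                  uint64_t rq_xid = 101;
                  if (rq_xid <= known)
                          printf("LBUG: xid %llu <= known replied %llu\n",
                                 (unsigned long long)rq_xid,
                                 (unsigned long long)known);
                  return 0;
          }
          ```

          The LU-8193 fix avoids this by re-inserting a resent request into the unreplied list after assigning its new xid, preserving the sorted-order assumption.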

          parinay parinay v kondekar (Inactive) added a comment - edited

          James,
          > No it does fix this. We are running 2.8.0 with this patch already merged and still saw this LBUG

          To understand this clearly: does the patch from LU-3534 fix the LBUG seen here, or is the LBUG reported here still reproducible even with the LU-3534 patch applied?

          Thanks.

          simmonsja James A Simmons added a comment -

          No it does fix this. We are running 2.8.0 with this patch already merged and still saw this LBUG.

          People

            bzzz Alex Zhuravlev
            morrone Christopher Morrone (Inactive)
            Votes: 1
            Watchers: 15

            Dates

              Created:
              Updated:
              Resolved: