Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-8488

LBUG in ptl_send_rpc(), request->rq_xid <= imp->imp_known_replied_xid fails

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.8.0
    • 3
    • 9223372036854775807

    Description

      We have had three clients crash pretty quickly when testing Lustre 2.8 clients with lustre 2.5.5 servers. The clients were using tag 2.8.0_0.0.llnlpreview.18, the servers are using 2.5.5-8chaos (see the lustre-release-fe-llnl repo for each).

      They hit the following LBUG:

      2016-08-09 08:24:53 [569127.746044] Lustre: lcy-OST0004-osc-ffff882025188000: Connection restored to 10.1.1.175@o2ib9 (at 10.1.1.175@o2ib9)
      2016-08-09 08:24:53 [569127.760391] Lustre: Skipped 7 previous similar messages
      2016-08-09 08:27:04 [569258.797932] LustreError: 166-1: MGC10.1.1.169@o2ib9: Connection to MGS (at 10.1.1.169@o2ib9) was lost; in progress operations using this service will fail
      2016-08-09 08:28:14 [569328.893397] Lustre: lcy-OST0004-osc-ffff882025188000: Connection to lcy-OST0004 (at 10.1.1.175@o2ib9) was lost; in progress operations using this service will wait for rec
      overy to complete
      2016-08-09 08:28:14 [569328.918048] Lustre: lcy-OST0004-osc-ffff882025188000: Connection restored to 10.1.1.175@o2ib9 (at 10.1.1.175@o2ib9)
      2016-08-09 08:28:14 [569328.932430] Lustre: Skipped 1 previous similar message
      2016-08-09 08:36:14 [569809.168166] Lustre: 130908:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1470756919/real 1470756919]  req@ffff880d11ca
      e600 x1541675108398336/t0(0) o8->lquake-OST0003-osc-ffff882022b7f800@172.19.1.130@o2ib100:28/4 lens 520/544 e 0 to 1 dl 1470756974 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      2016-08-09 08:36:14 [569809.209141] Lustre: 130908:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
      2016-08-09 08:36:40 [569835.180706] Lustre: lcy-OST0008-osc-ffff882025188000: Connection to lcy-OST0008 (at 10.1.1.179@o2ib9) was lost; in progress operations using this service will wait for rec
      overy to complete
      2016-08-09 08:36:40 [569835.206029] Lustre: lcy-OST0008-osc-ffff882025188000: Connection restored to 10.1.1.179@o2ib9 (at 10.1.1.179@o2ib9)
      2016-08-09 08:46:15 [570411.139216] Lustre: 130984:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1470757409/real 1470757409]  req@ffff881f9afbce00 x1541675108443980/t0(0) o10->lcy-OST0001-osc-ffff882025188000@10.1.1.172@o2ib9:6/4 lens 560/432 e 3 to 1 dl 1470757575 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
      2016-08-09 08:46:15 [570411.180184] Lustre: 130984:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 22 previous similar messages
      2016-08-09 08:46:15 [570411.195473] LustreError: 130984:0:(niobuf.c:732:ptl_send_rpc()) @@@ xid: 1541675108443980, replied: 1541675108452731, list_empty:0
      2016-08-09 08:46:15 [570411.195473]   req@ffff881f9afbce00 x1541675108443980/t0(0) o10->lcy-OST0001-osc-ffff882025188000@10.1.1.172@o2ib9:6/4 lens 560/432 e 3 to 1 dl 1470757575 ref 1 fl Rpc:XS/2/ffffffff rc -11/-1
      2016-08-09 08:46:15 [570411.238718] LustreError: 130984:0:(niobuf.c:733:ptl_send_rpc()) LBUG
      2016-08-09 08:46:15 [570411.248645] Pid: 130984, comm: ptlrpcd_07_05
      2016-08-09 08:46:15 Aug  9 08:51:14 opal173 kernel: [570411.256196] 
      2016-08-09 08:46:15 [570411.256196] Call Trace:
      2016-08-09 08:46:15 LustreError: 130984:0:(niobuf.c:733:ptl_send_rpc()) LBUG
      2016-08-09 08:46:15 Aug  9[570411.266553]  [<ffffffffa0aea7e3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      2016-08-09 08:46:15  08:46:15 opal173 kernel: [57041[570411.278045]  [<ffffffffa0aead85>] lbug_with_loc+0x45/0xc0 [libcfs]
      2016-08-09 08:46:15 1.238718] LustreError: 130984:0:[570411.288199]  [<ffffffffa0ec853a>] ptl_send_rpc+0xa1a/0xdb0 [ptlrpc]
      2016-08-09 08:46:15 (niobuf.c:733:ptl_send_rpc()) LB[570411.298337]  [<ffffffffa0ec1209>] ptlrpc_check_set.part.21+0x1789/0x1d80 [ptlrpc]
      2016-08-09 08:46:15 UG
      2016-08-09 08:46:15 [570411.309879]  [<ffffffffa0ec185b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
      2016-08-09 08:46:16 [570411.319667]  [<ffffffffa0eed03b>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc]
      2016-08-09 08:46:16 [570411.329345]  [<ffffffffa0eed3eb>] ptlrpcd+0x2bb/0x560 [ptlrpc]
      2016-08-09 08:46:16 [570411.338362]  [<ffffffff810bd4b0>] ? default_wake_function+0x0/0x20
      2016-08-09 08:46:16 [570411.347762]  [<ffffffffa0eed130>] ? ptlrpcd+0x0/0x560 [ptlrpc]
      2016-08-09 08:46:16 [570411.356709]  [<ffffffff810a997f>] kthread+0xcf/0xe0
      2016-08-09 08:46:16 [570411.364548]  [<ffffffff810a98b0>] ? kthread+0x0/0xe0
      2016-08-09 08:46:16 [570411.372443]  [<ffffffff8165d658>] ret_from_fork+0x58/0x90
      2016-08-09 08:46:16 [570411.380778]  [<ffffffff810a98b0>] ? kthread+0x0/0xe0
      2016-08-09 08:46:16 [570411.388582] 
      2016-08-09 08:46:16 [570411.392834] Kernel panic - not syncing: LBUG
      2016-08-09 08:46:16 [570411.399778] CPU: 33 PID: 130984 Comm: ptlrpcd_07_05 Tainted: P           OE  ------------   3.10.0-327.22.2.1chaos.ch6.x86_64 #1
      2016-08-09 08:46:16 [570411.416933] Hardware name: Penguin Computing Relion OCP1930e/S2600KPR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
      2016-08-09 08:46:16 [570411.431470]  ffffffffa0b06e0f 000000003006f545 ffff880fb772bb48 ffffffff8164c6b4
      2016-08-09 08:46:16 [570411.441962]  ffff880fb772bbc8 ffffffff816456af ffffffff00000008 ffff880fb772bbd8
      2016-08-09 08:46:16 [570411.452453]  ffff880fb772bb78 000000003006f545 ffffffffa0f5a675 0000000000000246
      2016-08-09 08:46:16 [570411.462912] Call Trace:
      2016-08-09 08:46:16 [570411.467750]  [<ffffffff8164c6b4>] dump_stack+0x19/0x1b
      2016-08-09 08:46:16 [570411.475583]  [<ffffffff816456af>] panic+0xd8/0x1e7
      2016-08-09 08:46:16 [570411.482985]  [<ffffffffa0aeadeb>] lbug_with_loc+0xab/0xc0 [libcfs]
      2016-08-09 08:46:16 [570411.491940]  [<ffffffffa0ec853a>] ptl_send_rpc+0xa1a/0xdb0 [ptlrpc]
      2016-08-09 08:46:16 [570411.500942]  [<ffffffffa0ec1209>] ptlrpc_check_set.part.21+0x1789/0x1d80 [ptlrpc]
      2016-08-09 08:46:16 [570411.511280]  [<ffffffffa0ec185b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
      2016-08-09 08:46:16 [570411.520423]  [<ffffffffa0eed03b>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc]
      2016-08-09 08:46:16 [570411.529433]  [<ffffffffa0eed3eb>] ptlrpcd+0x2bb/0x560 [ptlrpc]
      2016-08-09 08:46:16 [570411.537780]  [<ffffffff810bd4b0>] ? wake_up_state+0x20/0x20
      2016-08-09 08:46:16 [570411.545821]  [<ffffffffa0eed130>] ? ptlrpcd_check+0x5e0/0x5e0 [ptlrpc]
      2016-08-09 08:46:16 [570411.554875]  [<ffffffff810a997f>] kthread+0xcf/0xe0
      2016-08-09 08:46:16 [570411.562041]  [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
      2016-08-09 08:46:16 [570411.571021]  [<ffffffff8165d658>] ret_from_fork+0x58/0x90
      2016-08-09 08:46:16 [570411.578700]  [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
      

      Attachments

        Activity

          People

            bzzz Alex Zhuravlev
            morrone Christopher Morrone (Inactive)
            Votes:
            1 Vote for this issue
            Watchers:
            15 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: