Lustre / LU-8488

LBUG in ptl_send_rpc(), request->rq_xid <= imp->imp_known_replied_xid fails

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.8.0
    • Severity: 3

    Description

      We have had three clients crash pretty quickly when testing Lustre 2.8 clients against Lustre 2.5.5 servers. The clients were running tag 2.8.0_0.0.llnlpreview.18; the servers are running 2.5.5-8chaos (see the lustre-release-fe-llnl repo for each).

      They hit the following LBUG:

      2016-08-09 08:24:53 [569127.746044] Lustre: lcy-OST0004-osc-ffff882025188000: Connection restored to 10.1.1.175@o2ib9 (at 10.1.1.175@o2ib9)
      2016-08-09 08:24:53 [569127.760391] Lustre: Skipped 7 previous similar messages
      2016-08-09 08:27:04 [569258.797932] LustreError: 166-1: MGC10.1.1.169@o2ib9: Connection to MGS (at 10.1.1.169@o2ib9) was lost; in progress operations using this service will fail
      2016-08-09 08:28:14 [569328.893397] Lustre: lcy-OST0004-osc-ffff882025188000: Connection to lcy-OST0004 (at 10.1.1.175@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
      2016-08-09 08:28:14 [569328.918048] Lustre: lcy-OST0004-osc-ffff882025188000: Connection restored to 10.1.1.175@o2ib9 (at 10.1.1.175@o2ib9)
      2016-08-09 08:28:14 [569328.932430] Lustre: Skipped 1 previous similar message
      2016-08-09 08:36:14 [569809.168166] Lustre: 130908:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1470756919/real 1470756919]  req@ffff880d11cae600 x1541675108398336/t0(0) o8->lquake-OST0003-osc-ffff882022b7f800@172.19.1.130@o2ib100:28/4 lens 520/544 e 0 to 1 dl 1470756974 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      2016-08-09 08:36:14 [569809.209141] Lustre: 130908:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
      2016-08-09 08:36:40 [569835.180706] Lustre: lcy-OST0008-osc-ffff882025188000: Connection to lcy-OST0008 (at 10.1.1.179@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
      2016-08-09 08:36:40 [569835.206029] Lustre: lcy-OST0008-osc-ffff882025188000: Connection restored to 10.1.1.179@o2ib9 (at 10.1.1.179@o2ib9)
      2016-08-09 08:46:15 [570411.139216] Lustre: 130984:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1470757409/real 1470757409]  req@ffff881f9afbce00 x1541675108443980/t0(0) o10->lcy-OST0001-osc-ffff882025188000@10.1.1.172@o2ib9:6/4 lens 560/432 e 3 to 1 dl 1470757575 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
      2016-08-09 08:46:15 [570411.180184] Lustre: 130984:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 22 previous similar messages
      2016-08-09 08:46:15 [570411.195473] LustreError: 130984:0:(niobuf.c:732:ptl_send_rpc()) @@@ xid: 1541675108443980, replied: 1541675108452731, list_empty:0
      2016-08-09 08:46:15 [570411.195473]   req@ffff881f9afbce00 x1541675108443980/t0(0) o10->lcy-OST0001-osc-ffff882025188000@10.1.1.172@o2ib9:6/4 lens 560/432 e 3 to 1 dl 1470757575 ref 1 fl Rpc:XS/2/ffffffff rc -11/-1
      2016-08-09 08:46:15 [570411.238718] LustreError: 130984:0:(niobuf.c:733:ptl_send_rpc()) LBUG
      2016-08-09 08:46:15 [570411.248645] Pid: 130984, comm: ptlrpcd_07_05
      2016-08-09 08:46:15 [570411.256196]
      2016-08-09 08:46:15 [570411.256196] Call Trace:
      2016-08-09 08:46:15 [570411.266553]  [<ffffffffa0aea7e3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      2016-08-09 08:46:15 [570411.278045]  [<ffffffffa0aead85>] lbug_with_loc+0x45/0xc0 [libcfs]
      2016-08-09 08:46:15 [570411.288199]  [<ffffffffa0ec853a>] ptl_send_rpc+0xa1a/0xdb0 [ptlrpc]
      2016-08-09 08:46:15 [570411.298337]  [<ffffffffa0ec1209>] ptlrpc_check_set.part.21+0x1789/0x1d80 [ptlrpc]
      2016-08-09 08:46:15 [570411.309879]  [<ffffffffa0ec185b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
      2016-08-09 08:46:16 [570411.319667]  [<ffffffffa0eed03b>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc]
      2016-08-09 08:46:16 [570411.329345]  [<ffffffffa0eed3eb>] ptlrpcd+0x2bb/0x560 [ptlrpc]
      2016-08-09 08:46:16 [570411.338362]  [<ffffffff810bd4b0>] ? default_wake_function+0x0/0x20
      2016-08-09 08:46:16 [570411.347762]  [<ffffffffa0eed130>] ? ptlrpcd+0x0/0x560 [ptlrpc]
      2016-08-09 08:46:16 [570411.356709]  [<ffffffff810a997f>] kthread+0xcf/0xe0
      2016-08-09 08:46:16 [570411.364548]  [<ffffffff810a98b0>] ? kthread+0x0/0xe0
      2016-08-09 08:46:16 [570411.372443]  [<ffffffff8165d658>] ret_from_fork+0x58/0x90
      2016-08-09 08:46:16 [570411.380778]  [<ffffffff810a98b0>] ? kthread+0x0/0xe0
      2016-08-09 08:46:16 [570411.388582] 
      2016-08-09 08:46:16 [570411.392834] Kernel panic - not syncing: LBUG
      2016-08-09 08:46:16 [570411.399778] CPU: 33 PID: 130984 Comm: ptlrpcd_07_05 Tainted: P           OE  ------------   3.10.0-327.22.2.1chaos.ch6.x86_64 #1
      2016-08-09 08:46:16 [570411.416933] Hardware name: Penguin Computing Relion OCP1930e/S2600KPR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
      2016-08-09 08:46:16 [570411.431470]  ffffffffa0b06e0f 000000003006f545 ffff880fb772bb48 ffffffff8164c6b4
      2016-08-09 08:46:16 [570411.441962]  ffff880fb772bbc8 ffffffff816456af ffffffff00000008 ffff880fb772bbd8
      2016-08-09 08:46:16 [570411.452453]  ffff880fb772bb78 000000003006f545 ffffffffa0f5a675 0000000000000246
      2016-08-09 08:46:16 [570411.462912] Call Trace:
      2016-08-09 08:46:16 [570411.467750]  [<ffffffff8164c6b4>] dump_stack+0x19/0x1b
      2016-08-09 08:46:16 [570411.475583]  [<ffffffff816456af>] panic+0xd8/0x1e7
      2016-08-09 08:46:16 [570411.482985]  [<ffffffffa0aeadeb>] lbug_with_loc+0xab/0xc0 [libcfs]
      2016-08-09 08:46:16 [570411.491940]  [<ffffffffa0ec853a>] ptl_send_rpc+0xa1a/0xdb0 [ptlrpc]
      2016-08-09 08:46:16 [570411.500942]  [<ffffffffa0ec1209>] ptlrpc_check_set.part.21+0x1789/0x1d80 [ptlrpc]
      2016-08-09 08:46:16 [570411.511280]  [<ffffffffa0ec185b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
      2016-08-09 08:46:16 [570411.520423]  [<ffffffffa0eed03b>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc]
      2016-08-09 08:46:16 [570411.529433]  [<ffffffffa0eed3eb>] ptlrpcd+0x2bb/0x560 [ptlrpc]
      2016-08-09 08:46:16 [570411.537780]  [<ffffffff810bd4b0>] ? wake_up_state+0x20/0x20
      2016-08-09 08:46:16 [570411.545821]  [<ffffffffa0eed130>] ? ptlrpcd_check+0x5e0/0x5e0 [ptlrpc]
      2016-08-09 08:46:16 [570411.554875]  [<ffffffff810a997f>] kthread+0xcf/0xe0
      2016-08-09 08:46:16 [570411.562041]  [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
      2016-08-09 08:46:16 [570411.571021]  [<ffffffff8165d658>] ret_from_fork+0x58/0x90
      2016-08-09 08:46:16 [570411.578700]  [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
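
      As I read it, the assertion that fires (niobuf.c:732-733 in this build) runs just before ptl_send_rpc() posts the request, and checks that the request is still linked on its import's unreplied list and that its xid is above imp->imp_known_replied_xid. In the console output above, the request has xid 1541675108443980 while the import already considers 1541675108452731 replied, so the xid comparison fails even though the request is still on the list (list_empty:0).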
      

      Attachments

        Activity

          pjones Peter Jones added a comment -

          Alex

          Please connect with Minh to get the necessary permissions

          Peter


          bzzz Alex Zhuravlev added a comment -

          Unfortunately, I can't access the lustre-release-fe-llnl repo. Could someone please make a tarball and attach it?
          ofaaland Olaf Faaland added a comment -

          Recently, we have seen this with 2.5.x servers and 2.8.x clients.  Our general file system tests produced it a couple of times per day, but I do not know the volume of file system tests that ran before it appeared.  They are a fairly small subset of our total test suite.

          pjones Peter Jones added a comment -

          LLNL have flagged this as important. Is it still the case that this LBUG is only seen with 2.5.x servers and 2.8.x clients? And does it happen pretty quickly in such a configuration?


          niu Niu Yawei (Inactive) added a comment -

          All these failures happened on requests to old OST servers, so I think it's highly likely related to LU-8193. Before the fix for LU-8193, a bulk resend would change the request's rq_xid directly while the request was still on the unreplied list, which breaks the sanity check against the known replied xid.

          LU-6808 also has some fixes related to this (bulk requests, connecting to old servers).
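
          To illustrate the mechanism Niu describes, here is a minimal sketch of the unreplied-list invariant. It is illustrative only: known_replied_xid() below is a stand-in for the real ptlrpc bookkeeping, not the actual helper. The list is kept sorted by rq_xid and the known-replied xid is derived from its head, so a resend that rewrites rq_xid in place without re-sorting silently invalidates both.

          /*
           * Illustrative sketch only -- not the actual Lustre code.  Every
           * request an import has sent but not yet seen replied sits on
           * imp->imp_unreplied_list, sorted by rq_xid; everything below the
           * smallest unreplied xid is considered "known replied".
           */
          static __u64 known_replied_xid(struct obd_import *imp)
          {
                  struct ptlrpc_request *req;

                  if (list_empty(&imp->imp_unreplied_list))
                          return 0;

                  /* head of the sorted list == smallest xid still in flight */
                  req = list_entry(imp->imp_unreplied_list.next,
                                   struct ptlrpc_request, rq_unreplied_list);
                  return req->rq_xid - 1;
          }

          /*
           * A pre-LU-8193 bulk resend effectively rewrote the xid in place:
           *
           *         req->rq_xid = ptlrpc_next_xid();
           *
           * without re-inserting req at its sorted position.  The list is
           * then mis-ordered, so a value derived this way can advance past
           * xids that are still in flight, and the check in ptl_send_rpc()
           * LBUGs.
           */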

          parinay parinay v kondekar (Inactive) added a comment - edited

          James,
          > No it does fix this. We are running 2.8.0 with this patch already merged and still saw this LBUG

          To understand this clearly: does the patch from LU-3534 help fix the LBUG seen here? Or, even with the LU-3534 patch, is the LBUG reported here still reproducible?

          Thanks.

          No it does fix this. We are running 2.8.0 with this patch already merged and still saw this LBUG.


          parinay parinay v kondekar (Inactive) added a comment - edited

          We have noticed a similar crash (several ssexec threads hit the assertion at once, so the stack traces below are interleaved):

          Lustre: 47121:0:(client.c:2093:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1472152273/real 1472152273]  req@ffff8813bd6e0cc0 x1543375481248976/t0(0) o3->lustre02-OST0006-osc-ffff8813c2ad6400@10.253.108.30@tcp:6/4 lens 488/432 e 0 to 1 dl 1472152289 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
          Lustre: lustre02-OST0006-osc-ffff8813c2ad6400: Connection to lustre02-OST0006 (at 10.253.108.30@tcp) was lost; in progress operations using this service will wait for recovery to complete
          Lustre: 47110:0:(client.c:2093:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1472152290/real 1472152290]  req@ffff8805b92893c0 x1543375481250220/t0(0) o8->lustre02-OST0006-osc-ffff8813c2ad6400@10.253.108.30@tcp:28/4 lens 520/544 e 0 to 1 dl 1472152310 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          Lustre: lustre02-OST0006-osc-ffff8813c2ad6400: Connection restored to 10.253.108.31@tcp (at 10.253.108.31@tcp)
          LustreError: 62105:0:(niobuf.c:750:ptl_send_rpc()) @@@ xid: 1543375481425516, replied: 1543375481435211, list_empty:0
            req@ffff8805f42e43c0 x1543375481425516/t0(0) o13->lustre02-OST0006-osc-ffff8813c2ad6400@10.253.108.31@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1 
          LustreError: 62094:0:(niobuf.c:751:ptl_send_rpc()) LBUG 
          Pid: 62094, comm: ssexec
          LustreError: 62099:0:(niobuf.c:751:ptl_send_rpc()) LBUG 
          
          Call Trace:
          LustreError: 62088:0:(niobuf.c:751:ptl_send_rpc()) LBUG 
          Pid: 62099, comm: ssexec
          Pid: 62088, comm: ssexec
          
          Call Trace:
          
          Call Trace:
           [<ffffffffa147f895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
           [<ffffffffa147f895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
           [<ffffffffa1480007>] lbug_with_loc+0x47/0xb0 [libcfs]
           [<ffffffffa147f895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
           [<ffffffffa1480007>] lbug_with_loc+0x47/0xb0 [libcfs]
           [<ffffffffa1480007>] lbug_with_loc+0x47/0xb0 [libcfs]
           [<ffffffffa17a9478>] ptl_send_rpc+0x3f8/0xe70 [ptlrpc]
           [<ffffffffa17a9478>] ptl_send_rpc+0x3f8/0xe70 [ptlrpc]
           [<ffffffffa17a9478>] ptl_send_rpc+0x3f8/0xe70 [ptlrpc]
           [<ffffffffa17e06a4>] ? sptlrpc_req_refresh_ctx+0x154/0x910 [ptlrpc]
           [<ffffffffa17e06a4>] ? sptlrpc_req_refresh_ctx+0x154/0x910 [ptlrpc]
           [<ffffffffa17e06a4>] ? sptlrpc_req_refresh_ctx+0x154/0x910 [ptlrpc]
          LustreError: 62086:0:(niobuf.c:751:ptl_send_rpc()) LBUG 
          Pid: 62086, comm: ssexec
          
          Call Trace:
           [<ffffffffa17a313c>] ptlrpc_check_set+0x17cc/0x1d90 [ptlrpc]
           [<ffffffffa17a313c>] ptlrpc_check_set+0x17cc/0x1d90 [ptlrpc]
           [<ffffffffa147f895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
           [<ffffffffa1480007>] lbug_with_loc+0x47/0xb0 [libcfs]
           [<ffffffffa17a313c>] ptlrpc_check_set+0x17cc/0x1d90 [ptlrpc]
           [<ffffffffa17a39ea>] ptlrpc_set_wait+0x2ea/0x9c0 [ptlrpc]
           [<ffffffffa17a39ea>] ptlrpc_set_wait+0x2ea/0x9c0 [ptlrpc]
           [<ffffffffa17a9478>] ptl_send_rpc+0x3f8/0xe70 [ptlrpc]
           [<ffffffffa17a39ea>] ptlrpc_set_wait+0x2ea/0x9c0 [ptlrpc]
           [<ffffffffa1798c20>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
           [<ffffffffa1798c20>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
           [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
           [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
           [<ffffffffa1798c20>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
           [<ffffffffa17e06a4>] ? sptlrpc_req_refresh_ctx+0x154/0x910 [ptlrpc]
           [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
           [<ffffffffa1a997d4>] obd_statfs_rqset+0x144/0x560 [lustre]
           [<ffffffffa1a997d4>] obd_statfs_rqset+0x144/0x560 [lustre]
           [<ffffffffa1a9b51b>] ll_statfs_internal+0x2bb/0x7f0 [lustre]
           [<ffffffffa1a9b51b>] ll_statfs_internal+0x2bb/0x7f0 [lustre]
           [<ffffffffa17a313c>] ptlrpc_check_set+0x17cc/0x1d90 [ptlrpc]
           [<ffffffffa1a997d4>] obd_statfs_rqset+0x144/0x560 [lustre]
           [<ffffffffa1a9bae5>] ll_statfs+0x95/0x190 [lustre]
           [<ffffffffa1a9bae5>] ll_statfs+0x95/0x190 [lustre]
           [<ffffffff811bcdd4>] statfs_by_dentry+0x74/0xa0
           [<ffffffff811bcdd4>] statfs_by_dentry+0x74/0xa0
           [<ffffffff811bcf0b>] vfs_statfs+0x1b/0xb0
           [<ffffffff811bcf0b>] vfs_statfs+0x1b/0xb0
           [<ffffffffa1a9b51b>] ll_statfs_internal+0x2bb/0x7f0 [lustre]
           [<ffffffff811bd107>] user_statfs+0x47/0xb0
           [<ffffffff811bd107>] user_statfs+0x47/0xb0
           [<ffffffffa17a39ea>] ptlrpc_set_wait+0x2ea/0x9c0 [ptlrpc]
           [<ffffffff811bd20a>] sys_statfs+0x2a/0x50
           [<ffffffff811bd20a>] sys_statfs+0x2a/0x50
           [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
           [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
          
          

          The code

          745         if (list_empty(&request->rq_unreplied_list) ||
          746             request->rq_xid <= imp->imp_known_replied_xid) {
          747                 DEBUG_REQ(D_ERROR, request, "xid: "LPU64", replied: "LPU64", "
          748                           "list_empty:%d\n", request->rq_xid,
          749                           imp->imp_known_replied_xid,
          750                           list_empty(&request->rq_unreplied_list));
          751                 LBUG();
          752         }
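
          My reading of this check (an interpretation, not an authoritative description): before ptl_send_rpc() posts a request, the request must still be linked on its import's unreplied list, and its rq_xid must be strictly greater than imp_known_replied_xid, the highest xid the import already knows to be replied. Failing either condition means the xid/unreplied-list bookkeeping was corrupted somewhere upstream, hence the LBUG rather than an error return.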
          

          from lustre logs

          req@ffff881eaf9e90c0 x1543375481252856/t0(0) o13->lustre02-OST0006-osc-ffff8813c2ad6400@10.253.108.31@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
           00000100:00020000:4.0F:1472152509.139389:0:62099:0:(niobuf.c:750:ptl_send_rpc()) @@@ xid: 1543375481269472, replied: 1543375481435211, list_empty:0
          

          from crash

          crash> ptlrpc_request ffff881eaf9e90c0 | grep rq_xid
            rq_xid = 1543375481252856, 
          crash>
          
          crash> ptlrpc_request ffff881eaf9e90c0 | grep rq_import
            rq_import = 0xffff881a34ef2000, 
          crash> obd_import  0xffff881a34ef2000 | grep imp_known_replied_xid
            imp_known_replied_xid = 1543375481435211, 
          crash> 
          
          crash> ptlrpc_request ffff881eaf9e90c0 | grep -A 3 cr_unreplied_list
                cr_unreplied_list = {
                  next = 0xffff880e1f320b48, 
                  prev = 0xffff881f38437548
                },
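
          So the assertion's second branch is what fired here: the unreplied-list entry is clearly linked (next/prev point at other requests, not back at the list head), but rq_xid (1543375481252856) is below imp_known_replied_xid (1543375481435211). Note also that the request inspected in crash (x1543375481252856) is not the one in the console message above (x1543375481269472); several threads tripped the same LBUG at once.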
          

          The issue was seen during compilebench runs from lustre/tests (1 out of 100 runs; the 87th run failed).

          JFYI, I re-ran the tests with the LU-3534 patch and was unable to reproduce the failure seen here. I am not sure if that's the fix.

          yujian Jian Yu added a comment -

          Hi Alex,

          Could you please advise? Thank you.


          simmonsja James A Simmons added a comment -

          We just hit it as well. If you need a dump, we can provide it.

          kilian Kilian Cavalotti added a comment -

          We (Stanford Research Computing) have seen a very similar LBUG on a client (a single occurrence so far):

          Aug 15 20:55:06 sherlock-ln03 kernel: Lustre: regal-OST000a-osc-ffff880332e3b000: Connection restored to regal-OST000a (at 10.210.34.203@o2ib1)
          Aug 15 20:55:06 sherlock-ln03 kernel: Lustre: Skipped 1 previous similar message
          Aug 15 20:55:06 sherlock-ln03 kernel: LustreError: 20432:0:(niobuf.c:737:ptl_send_rpc()) @@@ xid: 1542242077009292, replied: 1542242077009307, list_empty:0
          Aug 15 20:55:06 sherlock-ln03 kernel:  req@ffff880ed871c3c0 x1542242077009292/t0(0) o4->regal-OST004f-osc-ffff880332e3b000@10.210.34.216@o2ib1:6/4 lens 488/448 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
          Aug 15 20:55:06 sherlock-ln03 kernel: LustreError: 20432:0:(niobuf.c:738:ptl_send_rpc()) LBUG
          Aug 15 20:55:06 sherlock-ln03 kernel: Pid: 20432, comm: ptlrpcd_00_06
          Aug 15 20:55:06 sherlock-ln03 kernel:
          Aug 15 20:55:06 sherlock-ln03 kernel: Call Trace:
          Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa03d6895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa03d6e97>] lbug_with_loc+0x47/0xb0 [libcfs]
          Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07c1498>] ptl_send_rpc+0x408/0xe80 [ptlrpc]
          Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07f80c4>] ? sptlrpc_req_refresh_ctx+0x154/0x910 [ptlrpc]
          Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07c17d5>] ? ptl_send_rpc+0x745/0xe80 [ptlrpc]
          Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07b5de1>] ptlrpc_send_new_req+0x511/0x9b0 [ptlrpc]
          Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07ba280>] ptlrpc_check_set+0xa20/0x1ca0 [ptlrpc]
          Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
          Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07e8633>] ptlrpcd_check+0x3d3/0x610 [ptlrpc]
          Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07b0416>] ? ptlrpc_set_next_timeout+0xc6/0x1a0 [ptlrpc]
          Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07e8ae2>] ptlrpcd+0x272/0x4f0 [ptlrpc]
          Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
          Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07e8870>] ? ptlrpcd+0x0/0x4f0 [ptlrpc]
          Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffff810a0fce>] kthread+0x9e/0xc0
          Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
          Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
          Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
          

          That's using IEEL 3.0.0.0

          Cheers,

          Kilian


          People

            Assignee: bzzz Alex Zhuravlev
            Reporter: morrone Christopher Morrone (Inactive)
            Votes: 1
            Watchers: 15

            Dates

              Created:
              Updated:
              Resolved: