Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9274

LBUG: (recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.10.0, Lustre 2.11.0
    • Lustre 2.10.0
    • Soak cluster version=lustre: 2.9.52_73_gb5c4f03
    • 3
    • 9223372036854775807

    Description

      Soak is running, performing successful OSS failover, partitions are being recovered:

      2017-03-30 00:47:10,761:fsmgmt.fsmgmt:INFO     Next recovery check in 15s...
      2017-03-30 00:47:32,586:fsmgmt.fsmgmt:DEBUG    Recovery Result Record: {'soak-4': {'soaked-OST000f': 'RECOVERING', 'soaked-OST000e': 'INACTIVE', 'soaked-OST0008': 'INACTIVE', 'soaked-OST0009': 'RECOVERING', 'soaked-OST0002': 'INACTIVE', 'soaked-OST0003': 'RECOVERING', 'soaked-OST0015': 'COMPLETE', 'soaked-OST0014': 'INACTIVE'}}
      

      Single client has LBUG, after recovering some partitions

      Mar 30 00:47:33 soak-36 kernel: Lustre: soaked-OST0009-osc-ffff88085b72c000: Connection restored to 192.168.1.104@o2ib10 (at 192.168.1.104@o2ib10)
      .....:q
      Mar 30 00:48:31 soak-36 kernel: LustreError: 4753:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
      Mar 30 00:48:31 soak-36 kernel: LustreError: 4753:0:(recover.c:157:ptlrpc_replay_next()) LBUG
      Mar 30 00:48:31 soak-36 kernel: Pid: 4753, comm: ptlrpcd_rcv
      Mar 30 00:48:31 soak-36 kernel: #012Call Trace:
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa092c7f3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa092c861>] lbug_with_loc+0x41/0xb0 [libcfs]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d31c87>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d55682>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d2a2ff>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d2e0b5>] ptlrpc_check_set.part.23+0x425/0x1dd0 [ptlrpc]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d2fabb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d5bb8b>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d5bf3b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffff810c4fd0>] ? default_wake_function+0x0/0x20
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d5bc80>] ? ptlrpcd+0x0/0x560 [ptlrpc]
      Mar 30 00:48:32 soak-36 kernel: [<ffffffff810b064f>] kthread+0xcf/0xe0
      Mar 30 00:48:32 soak-36 kernel: [<ffffffff810b0580>] ? kthread+0x0/0xe0
      Mar 30 00:48:32 soak-36 kernel: [<ffffffff81696958>] ret_from_fork+0x58/0x90
      Mar 30 00:48:32 soak-36 kernel: [<ffffffff810b0580>] ? kthread+0x0/0xe0
      Mar 30 00:48:32 soak-36 kernel:
      Mar 30 00:48:32 soak-36 kernel: Kernel panic - not syncing: LBUG
      

      vmcore-dmesg.txt attached. Full crash dump is available on soak-36

      Attachments

        1. soak-10.2.log.gz
          9.34 MB
        2. soak-10.debug.log.gz
          6.50 MB
        3. soak-10.dump.txt.gz
          3.78 MB
        4. soak-10.log.gz
          188 kB
        5. soak-10.lustre.log.txt.gz
          10.34 MB
        6. soak-10.stack.txt
          1.23 MB
        7. soak-10.stacks.txt
          1.07 MB
        8. soak-10.stacks.txt.gz
          80 kB
        9. soak-11.debug.log.gz
          15.27 MB
        10. soak-11.dump.txt.gz
          5.81 MB
        11. soak-11.lustre.log.txt.gz
          11.37 MB
        12. soak-11.stacks.txt
          1.25 MB
        13. soak-11.stacks.txt.gz
          109 kB
        14. soak-16.lustre-log.1498910471.2905.gz
          70 kB
        15. soak-16.syslog.txt
          4.15 MB
        16. soak-29.console.txt.gz
          10.38 MB
        17. soak-8.debug.log.gz
          6.57 MB
        18. soak-8.dump.txt.gz
          5.85 MB
        19. soak-8.lustre.log.txt.gz
          11.09 MB
        20. soak-8.stacks.txt
          1.36 MB
        21. soak-8.stacks.txt.gz
          116 kB
        22. soak-9.debug.log.gz
          15.41 MB
        23. soak-9.dump.txt.gz
          4.25 MB
        24. soak-9.lustre.log.txt.gz
          10.93 MB
        25. soak-9.stacks.txt
          1.35 MB
        26. soak-9.stacks.txt.gz
          114 kB
        27. vmcore-dmesg.txt
          135 kB

        Issue Links

          Activity

            [LU-9274] LBUG: (recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:

            Patch landed for 2.10.

            niu Niu Yawei (Inactive) added a comment - Patch landed for 2.10.

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27920/
            Subject: LU-9274 ptlrpc: add replay request into unreplied list
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7d29d3167684b13a612c8c1bff860019a218115c

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27920/ Subject: LU-9274 ptlrpc: add replay request into unreplied list Project: fs/lustre-release Branch: master Current Patch Set: Commit: 7d29d3167684b13a612c8c1bff860019a218115c

            Peter, I think the fix from https://review.whamcloud.com/#/c/27920/ should be enough.

            niu Niu Yawei (Inactive) added a comment - Peter, I think the fix from https://review.whamcloud.com/#/c/27920/ should be enough.
            pjones Peter Jones added a comment -

            Niu

            the test also included a reconfiguration on soak which it was hoped might make this race less likely as the disk faliure had coincided with this race being hit far more frenquently (see DCO-7264), Given that this has not worked would it be enough to just land this one fix or did earlier testing suggest further fixes would be needed too?

            Peter

            pjones Peter Jones added a comment - Niu the test also included a reconfiguration on soak which it was hoped might make this race less likely as the disk faliure had coincided with this race being hit far more frenquently (see DCO-7264), Given that this has not worked would it be enough to just land this one fix or did earlier testing suggest further fixes would be needed too? Peter

            Cliff, the fix hasn't been landed yet, so the crash is expected.

            niu Niu Yawei (Inactive) added a comment - Cliff, the fix hasn't been landed yet, so the crash is expected.

            Attached lustre dump from all MDS

            cliffw Cliff White (Inactive) added a comment - Attached lustre dump from all MDS

            Now running lustre-b2_10 build #3

            Jul 11 01:13:20 soak-17 kernel: LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
            Jul 11 01:13:20 soak-17 kernel: LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) LBUG
            Jul 11 01:13:20 soak-17 kernel: Pid: 2937, comm: ptlrpcd_rcv
            Jul 11 01:13:20 soak-17 kernel: #012Call Trace:
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa08447ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa084487c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b76817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b99de2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b6ef1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b72c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b7464b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0ba02bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0ba066b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0ba03b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffff810b06ff>] kthread+0xcf/0xe0
            Jul 11 01:13:20 soak-17 kernel: [<ffffffff810b0630>] ? kthread+0x0/0xe0
            Jul 11 01:13:20 soak-17 kernel: [<ffffffff81696a58>] ret_from_fork+0x58/0x90
            Jul 11 01:13:20 soak-17 kernel: [<ffffffff810b0630>] ? kthread+0x0/0xe0
            Jul 11 01:13:20 soak-17 kernel:
            Jul 11 01:13:20 soak-17 kernel: Kernel panic - not syncing: LBUG
            

            Crash dump is available on soak-17

            [ 3947.210017] perf: interrupt took too long (3225 > 3192), lowering kernel.perf_event_max_sample_rate to 62000
            [ 4161.246530] Lustre: 2953:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1499735243/real 0]  req@ffff880397b67200 x1572582391402544/t0(0) o101->soaked-OST0008-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 328/400 e 0 to 1 dl 1499735256 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [ 4161.282464] Lustre: soaked-OST0008-osc-ffff8808276d0000: Connection to soaked-OST0008 (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            [ 4162.030527] Lustre: soaked-OST000e-osc-ffff8808276d0000: Connection to soaked-OST000e (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            [ 4169.926508] Lustre: soaked-OST0014-osc-ffff8808276d0000: Connection to soaked-OST0014 (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            [ 4194.948696] Lustre: 2944:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1499735290/real 1499735290]  req@ffff880334b43c00 x1572582391418656/t0(0) o400->soaked-OST0002-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 224/224 e 0 to 1 dl 1499735303 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
            [ 4194.986416] Lustre: 2944:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
            [ 4194.998649] Lustre: soaked-OST0002-osc-ffff8808276d0000: Connection to soaked-OST0002 (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            [ 4245.020397] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1499735340/real 1499735340]  req@ffff880334b42100 x1572582391449120/t0(0) o8->soaked-OST0002-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 520/544 e 0 to 1 dl 1499735351 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
            [ 4245.057630] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
            [ 4320.019906] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1499735415/real 1499735415]  req@ffff880335d55a00 x1572582391496080/t0(0) o8->soaked-OST0008-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 520/544 e 0 to 1 dl 1499735436 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
            [ 4320.057297] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
            [ 4451.018399] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1499735515/real 0]  req@ffff880335d53c00 x1572582391561888/t0(0) o8->soaked-OST0014-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 520/544 e 0 to 1 dl 1499735546 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            [ 4451.054303] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
            [ 4504.566715] LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
            [ 4504.587823] LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) LBUG
            [ 4504.598875] Pid: 2937, comm: ptlrpcd_rcv
            [ 4504.606014]
            Call Trace:
            [ 4504.615457]  [<ffffffffa08447ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
            [ 4504.624991]  [<ffffffffa084487c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            [ 4504.634021]  [<ffffffffa0b76817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
            [ 4504.643628]  [<ffffffffa0b99de2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
            [ 4504.654851]  [<ffffffffa0b6ef1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
            [ 4504.664682]  [<ffffffffa0b72c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
            [ 4504.674633]  [<ffffffffa0b7464b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
            [ 4504.683405]  [<ffffffffa0ba02bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
            [ 4504.692018]  [<ffffffffa0ba066b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
            [ 4504.699993]  [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
            [ 4504.708298]  [<ffffffffa0ba03b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
            [ 4504.716207]  [<ffffffff810b06ff>] kthread+0xcf/0xe0
            [ 4504.723002]  [<ffffffff810b0630>] ? kthread+0x0/0xe0
            [ 4504.729868]  [<ffffffff81696a58>] ret_from_fork+0x58/0x90
            [ 4504.737213]  [<ffffffff810b0630>] ? kthread+0x0/0xe0
            [ 4504.744039]
            [ 4504.746963] Kernel panic - not syncing: LBUG
            [ 4504.752969] CPU: 14 PID: 2937 Comm: ptlrpcd_rcv Tainted: G           OE  ------------   3.10.0-514.10.2.el7.x86_64 #1
            [ 4504.766051] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [ 4504.779814]  ffffffffa0862e8b 0000000057e50b83 ffff88082ab1fb30 ffffffff816864ef
            [ 4504.789340]  ffff88082ab1fbb0 ffffffff8167f8f6 ffffffff00000008 ffff88082ab1fbc0
            [ 4504.798846]  ffff88082ab1fb60 0000000057e50b83 0000000057e50b83 ffff88082d98f838
            [ 4504.808330] Call Trace:
            [ 4504.812212]  [<ffffffff816864ef>] dump_stack+0x19/0x1b
            [ 4504.819084]  [<ffffffff8167f8f6>] panic+0xe3/0x1f2
            [ 4504.825543]  [<ffffffffa0844894>] lbug_with_loc+0x64/0xb0 [libcfs]
            [ 4504.833554]  [<ffffffffa0b76817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
            [ 4504.842225]  [<ffffffffa0b99de2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
            [ 4504.852623]  [<ffffffffa0b6ef1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
            [ 4504.861745]  [<ffffffffa0b72c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
            [ 4504.871040]  [<ffffffffa0b7464b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
            [ 4504.879243]  [<ffffffffa0ba02bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
            [ 4504.887319]  [<ffffffffa0ba066b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
            
            cliffw Cliff White (Inactive) added a comment - Now running lustre-b2_10 build #3 Jul 11 01:13:20 soak-17 kernel: LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed: Jul 11 01:13:20 soak-17 kernel: LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) LBUG Jul 11 01:13:20 soak-17 kernel: Pid: 2937, comm: ptlrpcd_rcv Jul 11 01:13:20 soak-17 kernel: #012Call Trace: Jul 11 01:13:20 soak-17 kernel: [<ffffffffa08447ee>] libcfs_call_trace+0x4e/0x60 [libcfs] Jul 11 01:13:20 soak-17 kernel: [<ffffffffa084487c>] lbug_with_loc+0x4c/0xb0 [libcfs] Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b76817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc] Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b99de2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc] Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b6ef1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc] Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b72c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc] Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b7464b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc] Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0ba02bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc] Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0ba066b>] ptlrpcd+0x2bb/0x560 [ptlrpc] Jul 11 01:13:20 soak-17 kernel: [<ffffffff810c5080>] ? default_wake_function+0x0/0x20 Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0ba03b0>] ? ptlrpcd+0x0/0x560 [ptlrpc] Jul 11 01:13:20 soak-17 kernel: [<ffffffff810b06ff>] kthread+0xcf/0xe0 Jul 11 01:13:20 soak-17 kernel: [<ffffffff810b0630>] ? kthread+0x0/0xe0 Jul 11 01:13:20 soak-17 kernel: [<ffffffff81696a58>] ret_from_fork+0x58/0x90 Jul 11 01:13:20 soak-17 kernel: [<ffffffff810b0630>] ? kthread+0x0/0xe0 Jul 11 01:13:20 soak-17 kernel: Jul 11 01:13:20 soak-17 kernel: Kernel panic - not syncing: LBUG Crash dump is available on soak-17 [ 3947.210017] perf: interrupt took too long (3225 > 3192), lowering kernel.perf_event_max_sample_rate to 62000 [ 4161.246530] Lustre: 2953:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1499735243/real 0] req@ffff880397b67200 x1572582391402544/t0(0) o101->soaked-OST0008-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 328/400 e 0 to 1 dl 1499735256 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 [ 4161.282464] Lustre: soaked-OST0008-osc-ffff8808276d0000: Connection to soaked-OST0008 (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete [ 4162.030527] Lustre: soaked-OST000e-osc-ffff8808276d0000: Connection to soaked-OST000e (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete [ 4169.926508] Lustre: soaked-OST0014-osc-ffff8808276d0000: Connection to soaked-OST0014 (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete [ 4194.948696] Lustre: 2944:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1499735290/real 1499735290] req@ffff880334b43c00 x1572582391418656/t0(0) o400->soaked-OST0002-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 224/224 e 0 to 1 dl 1499735303 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1 [ 4194.986416] Lustre: 2944:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages [ 4194.998649] Lustre: soaked-OST0002-osc-ffff8808276d0000: Connection to soaked-OST0002 (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete [ 4245.020397] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1499735340/real 1499735340] req@ffff880334b42100 x1572582391449120/t0(0) o8->soaked-OST0002-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 520/544 e 0 to 1 dl 1499735351 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1 [ 4245.057630] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages [ 4320.019906] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1499735415/real 1499735415] req@ffff880335d55a00 x1572582391496080/t0(0) o8->soaked-OST0008-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 520/544 e 0 to 1 dl 1499735436 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1 [ 4320.057297] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages [ 4451.018399] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1499735515/real 0] req@ffff880335d53c00 x1572582391561888/t0(0) o8->soaked-OST0014-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 520/544 e 0 to 1 dl 1499735546 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 [ 4451.054303] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages [ 4504.566715] LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed: [ 4504.587823] LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) LBUG [ 4504.598875] Pid: 2937, comm: ptlrpcd_rcv [ 4504.606014] Call Trace: [ 4504.615457] [<ffffffffa08447ee>] libcfs_call_trace+0x4e/0x60 [libcfs] [ 4504.624991] [<ffffffffa084487c>] lbug_with_loc+0x4c/0xb0 [libcfs] [ 4504.634021] [<ffffffffa0b76817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc] [ 4504.643628] [<ffffffffa0b99de2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc] [ 4504.654851] [<ffffffffa0b6ef1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc] [ 4504.664682] [<ffffffffa0b72c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc] [ 4504.674633] [<ffffffffa0b7464b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc] [ 4504.683405] [<ffffffffa0ba02bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc] [ 4504.692018] [<ffffffffa0ba066b>] ptlrpcd+0x2bb/0x560 [ptlrpc] [ 4504.699993] [<ffffffff810c5080>] ? default_wake_function+0x0/0x20 [ 4504.708298] [<ffffffffa0ba03b0>] ? ptlrpcd+0x0/0x560 [ptlrpc] [ 4504.716207] [<ffffffff810b06ff>] kthread+0xcf/0xe0 [ 4504.723002] [<ffffffff810b0630>] ? kthread+0x0/0xe0 [ 4504.729868] [<ffffffff81696a58>] ret_from_fork+0x58/0x90 [ 4504.737213] [<ffffffff810b0630>] ? kthread+0x0/0xe0 [ 4504.744039] [ 4504.746963] Kernel panic - not syncing: LBUG [ 4504.752969] CPU: 14 PID: 2937 Comm: ptlrpcd_rcv Tainted: G OE ------------ 3.10.0-514.10.2.el7.x86_64 #1 [ 4504.766051] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [ 4504.779814] ffffffffa0862e8b 0000000057e50b83 ffff88082ab1fb30 ffffffff816864ef [ 4504.789340] ffff88082ab1fbb0 ffffffff8167f8f6 ffffffff00000008 ffff88082ab1fbc0 [ 4504.798846] ffff88082ab1fb60 0000000057e50b83 0000000057e50b83 ffff88082d98f838 [ 4504.808330] Call Trace: [ 4504.812212] [<ffffffff816864ef>] dump_stack+0x19/0x1b [ 4504.819084] [<ffffffff8167f8f6>] panic+0xe3/0x1f2 [ 4504.825543] [<ffffffffa0844894>] lbug_with_loc+0x64/0xb0 [libcfs] [ 4504.833554] [<ffffffffa0b76817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc] [ 4504.842225] [<ffffffffa0b99de2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc] [ 4504.852623] [<ffffffffa0b6ef1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc] [ 4504.861745] [<ffffffffa0b72c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc] [ 4504.871040] [<ffffffffa0b7464b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc] [ 4504.879243] [<ffffffffa0ba02bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc] [ 4504.887319] [<ffffffffa0ba066b>] ptlrpcd+0x2bb/0x560 [ptlrpc]

            There is a crash dump available on soak-31

            [ 4874.102156] Lustre: 3604:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
            [ 4928.198801] Lustre: soaked-OST0002-osc-ffff88105a2ec000: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib)
            [ 4929.712442] Lustre: soaked-OST0014-osc-ffff88105a2ec000: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib)
            [ 4929.724803] Lustre: Skipped 1 previous similar message
            [ 4931.461826] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
            [ 4931.475358] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) LBUG
            [ 4931.482951] Pid: 3604, comm: ptlrpcd_rcv
            [ 4931.487341]
            Call Trace:
            [ 4931.491756]  [<ffffffffa07807ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
            [ 4931.499055]  [<ffffffffa078087c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            [ 4931.506024]  [<ffffffffa0b19817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
            [ 4931.513639]  [<ffffffffa0b3cde2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
            [ 4931.522999]  [<ffffffffa0b11f1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
            [ 4931.531098]  [<ffffffffa0b15c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
            [ 4931.539393]  [<ffffffffa0b1764b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
            [ 4931.546623]  [<ffffffffa0b432bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
            [ 4931.553754]  [<ffffffffa0b4366b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
            [ 4931.560279]  [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
            [ 4931.567215]  [<ffffffffa0b433b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
            [ 4931.573730]  [<ffffffff810b06ff>] kthread+0xcf/0xe0
            [ 4931.579187]  [<ffffffff810b0630>] ? kthread+0x0/0xe0
            [ 4931.584740]  [<ffffffff81696a58>] ret_from_fork+0x58/0x90
            [ 4931.590778]  [<ffffffff810b0630>] ? kthread+0x0/0xe0
            [ 4931.596326]
            [ 4931.597997] Kernel panic - not syncing: LBUG
            [ 4931.602769] CPU: 5 PID: 3604 Comm: ptlrpcd_rcv Tainted: G          IOE  ------------   3.10.0-514.10.2.el7.x86_64 #1
            [ 4931.614514] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [ 4931.626160]  ffffffffa079ee8b 00000000f7bba2b7 ffff880841747b30 ffffffff816864ef
            [ 4931.634457]  ffff880841747bb0 ffffffff8167f8f6 ffffffff00000008 ffff880841747bc0
            [ 4931.642744]  ffff880841747b60 00000000f7bba2b7 00000000f7bba2b7 ffff88085ed4f838
            [ 4931.651035] Call Trace:
            [ 4931.653764]  [<ffffffff816864ef>] dump_stack+0x19/0x1b
            [ 4931.659507]  [<ffffffff8167f8f6>] panic+0xe3/0x1f2
            [ 4931.664856]  [<ffffffffa0780894>] lbug_with_loc+0x64/0xb0 [libcfs]
            [ 4931.671781]  [<ffffffffa0b19817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
            [ 4931.679388]  [<ffffffffa0b3cde2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
            [ 4931.688742]  [<ffffffffa0b11f1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
            [ 4931.696832]  [<ffffffffa0b15c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
            [ 4931.705118]  [<ffffffffa0b1764b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
            [ 4931.712339]  [<ffffffffa0b432bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
            [ 4931.719465]  [<ffffffffa0b4366b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
            [ 4931.725981]  [<ffffffff810c5080>] ? wake_up_state+0x20/0x20
            [ 4931.732224]  [<ffffffffa0b433b0>] ? ptlrpcd_check+0x5d0/0x5d0 [ptlrpc]
            [ 4931.739517]  [<ffffffff810b06ff>] kthread+0xcf/0xe0
            [ 4931.744959]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
            [ 4931.752243]  [<ffffffff81696a58>] ret_from_fork+0x58/0x90
            [ 4931.758267]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
            
            cliffw Cliff White (Inactive) added a comment - There is a crash dump available on soak-31 [ 4874.102156] Lustre: 3604:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages [ 4928.198801] Lustre: soaked-OST0002-osc-ffff88105a2ec000: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib) [ 4929.712442] Lustre: soaked-OST0014-osc-ffff88105a2ec000: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib) [ 4929.724803] Lustre: Skipped 1 previous similar message [ 4931.461826] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed: [ 4931.475358] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) LBUG [ 4931.482951] Pid: 3604, comm: ptlrpcd_rcv [ 4931.487341] Call Trace: [ 4931.491756] [<ffffffffa07807ee>] libcfs_call_trace+0x4e/0x60 [libcfs] [ 4931.499055] [<ffffffffa078087c>] lbug_with_loc+0x4c/0xb0 [libcfs] [ 4931.506024] [<ffffffffa0b19817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc] [ 4931.513639] [<ffffffffa0b3cde2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc] [ 4931.522999] [<ffffffffa0b11f1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc] [ 4931.531098] [<ffffffffa0b15c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc] [ 4931.539393] [<ffffffffa0b1764b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc] [ 4931.546623] [<ffffffffa0b432bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc] [ 4931.553754] [<ffffffffa0b4366b>] ptlrpcd+0x2bb/0x560 [ptlrpc] [ 4931.560279] [<ffffffff810c5080>] ? default_wake_function+0x0/0x20 [ 4931.567215] [<ffffffffa0b433b0>] ? ptlrpcd+0x0/0x560 [ptlrpc] [ 4931.573730] [<ffffffff810b06ff>] kthread+0xcf/0xe0 [ 4931.579187] [<ffffffff810b0630>] ? kthread+0x0/0xe0 [ 4931.584740] [<ffffffff81696a58>] ret_from_fork+0x58/0x90 [ 4931.590778] [<ffffffff810b0630>] ? kthread+0x0/0xe0 [ 4931.596326] [ 4931.597997] Kernel panic - not syncing: LBUG [ 4931.602769] CPU: 5 PID: 3604 Comm: ptlrpcd_rcv Tainted: G IOE ------------ 3.10.0-514.10.2.el7.x86_64 #1 [ 4931.614514] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015 [ 4931.626160] ffffffffa079ee8b 00000000f7bba2b7 ffff880841747b30 ffffffff816864ef [ 4931.634457] ffff880841747bb0 ffffffff8167f8f6 ffffffff00000008 ffff880841747bc0 [ 4931.642744] ffff880841747b60 00000000f7bba2b7 00000000f7bba2b7 ffff88085ed4f838 [ 4931.651035] Call Trace: [ 4931.653764] [<ffffffff816864ef>] dump_stack+0x19/0x1b [ 4931.659507] [<ffffffff8167f8f6>] panic+0xe3/0x1f2 [ 4931.664856] [<ffffffffa0780894>] lbug_with_loc+0x64/0xb0 [libcfs] [ 4931.671781] [<ffffffffa0b19817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc] [ 4931.679388] [<ffffffffa0b3cde2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc] [ 4931.688742] [<ffffffffa0b11f1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc] [ 4931.696832] [<ffffffffa0b15c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc] [ 4931.705118] [<ffffffffa0b1764b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc] [ 4931.712339] [<ffffffffa0b432bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc] [ 4931.719465] [<ffffffffa0b4366b>] ptlrpcd+0x2bb/0x560 [ptlrpc] [ 4931.725981] [<ffffffff810c5080>] ? wake_up_state+0x20/0x20 [ 4931.732224] [<ffffffffa0b433b0>] ? ptlrpcd_check+0x5d0/0x5d0 [ptlrpc] [ 4931.739517] [<ffffffff810b06ff>] kthread+0xcf/0xe0 [ 4931.744959] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140 [ 4931.752243] [<ffffffff81696a58>] ret_from_fork+0x58/0x90 [ 4931.758267] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140

            Hit this again on lustre-master, build 3606

            [ 4931.461826] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed: ^M
            [ 4931.475358] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) LBUG^M
            [ 4931.482951] Pid: 3604, comm: ptlrpcd_rcv^M
            [ 4931.487341] ^M
            [ 4931.487341] Call Trace:^M
            [ 4931.491756]  [<ffffffffa07807ee>] libcfs_call_trace+0x4e/0x60 [libcfs]^M
            [ 4931.499055]  [<ffffffffa078087c>] lbug_with_loc+0x4c/0xb0 [libcfs]^M
            [ 4931.506024]  [<ffffffffa0b19817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]^M
            [ 4931.513639]  [<ffffffffa0b3cde2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]^M
            [ 4931.522999]  [<ffffffffa0b11f1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]^M
            [ 4931.531098]  [<ffffffffa0b15c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]^M
            [ 4931.539393]  [<ffffffffa0b1764b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]^M
            [ 4931.546623]  [<ffffffffa0b432bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]^M
            [ 4931.553754]  [<ffffffffa0b4366b>] ptlrpcd+0x2bb/0x560 [ptlrpc]^M
            [ 4931.560279]  [<ffffffff810c5080>] ? default_wake_function+0x0/0x20^M
            [ 4931.567215]  [<ffffffffa0b433b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]^M
            [ 4931.573730]  [<ffffffff810b06ff>] kthread+0xcf/0xe0^M
            [ 4931.579187]  [<ffffffff810b0630>] ? kthread+0x0/0xe0^M
            [ 4931.584740]  [<ffffffff81696a58>] ret_from_fork+0x58/0x90^M
            [ 4931.590778]  [<ffffffff810b0630>] ? kthread+0x0/0xe0^M
            [ 4931.596326] ^M
            [ 4931.597997] Kernel panic - not syncing: LBUG^M
            
            cliffw Cliff White (Inactive) added a comment - Hit this again on lustre-master, build 3606 [ 4931.461826] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed: ^M [ 4931.475358] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) LBUG^M [ 4931.482951] Pid: 3604, comm: ptlrpcd_rcv^M [ 4931.487341] ^M [ 4931.487341] Call Trace:^M [ 4931.491756] [<ffffffffa07807ee>] libcfs_call_trace+0x4e/0x60 [libcfs]^M [ 4931.499055] [<ffffffffa078087c>] lbug_with_loc+0x4c/0xb0 [libcfs]^M [ 4931.506024] [<ffffffffa0b19817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]^M [ 4931.513639] [<ffffffffa0b3cde2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]^M [ 4931.522999] [<ffffffffa0b11f1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]^M [ 4931.531098] [<ffffffffa0b15c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]^M [ 4931.539393] [<ffffffffa0b1764b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]^M [ 4931.546623] [<ffffffffa0b432bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]^M [ 4931.553754] [<ffffffffa0b4366b>] ptlrpcd+0x2bb/0x560 [ptlrpc]^M [ 4931.560279] [<ffffffff810c5080>] ? default_wake_function+0x0/0x20^M [ 4931.567215] [<ffffffffa0b433b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]^M [ 4931.573730] [<ffffffff810b06ff>] kthread+0xcf/0xe0^M [ 4931.579187] [<ffffffff810b0630>] ? kthread+0x0/0xe0^M [ 4931.584740] [<ffffffff81696a58>] ret_from_fork+0x58/0x90^M [ 4931.590778] [<ffffffff810b0630>] ? kthread+0x0/0xe0^M [ 4931.596326] ^M [ 4931.597997] Kernel panic - not syncing: LBUG^M

            I have created LU-9748 for the DNE recovery issue

            cliffw Cliff White (Inactive) added a comment - I have created LU-9748 for the DNE recovery issue

            People

              niu Niu Yawei (Inactive)
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: