Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.4.3
    • None
    • (Git repo at https://github.com/jlan/lustre-nas)
      Server: 2.4.3-12nasS, kernel: 2.6.32-358.23.2.el6
      Client: 2.4.3-11nasC, kernel: sles11sp3 3.0.101-0.31.1.
    • 3
    • 17268

    Description

      We have several clients panicked yesterday on this problem. There was ptlrpc_expire_one_request message on nbp9 (10.151.26.5):
      [1422397399.479345] Lustre: 5076:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1422397397/real 1422397397] req@ffff88020b91dc00 x1487556472851932/t0(0) o250->MGC10.151.26.5@o2ib@10.151.26.5@o2ib:26/25 lens 400/544 e 0 to 1 dl 1422397502 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [1422397399.507345] Lustre: 5076:0:(client.c:1878:ptlrpc_expire_one_request()) Skipped 3507 previous similar messages

      The client hits LBUG as shown in dmesg below:

      [1422397952.774718] LustreError: 22882:0:(sec_null.c:196:null_free_reqbuf()) ASSERTION( req->rq_reqmsg == req->rq_reqbuf ) failed: req ffff880837378000: reqmsg ffff8808373780e8 is not reqbuf ffff8805b759a400 in null sec
      [1422397952.794717] LustreError: 22882:0:(sec_null.c:196:null_free_reqbuf()) LBUG
      [1422397952.802717] Pid: 22882, comm: kworker/u:1
      [1422397952.806717]
      [1422397952.806717] Call Trace:
      [1422397952.810716] [<ffffffff81004935>] dump_trace+0x75/0x310
      [1422397952.818716] [<ffffffffa062c82a>] libcfs_debug_dumpstack+0x4a/0x70 [libcfs]
      [1422397952.818716] [<ffffffffa062cd5e>] lbug_with_loc+0x3e/0xb0 [libcfs]
      [1422397952.818716] [<ffffffffa0914030>] null_free_reqbuf+0x260/0x2d0 [ptlrpc]
      [1422397952.818716] [<ffffffffa09022e4>] sptlrpc_cli_free_reqbuf+0x44/0x130 [ptlrpc]
      [1422397952.818716] [<ffffffffa08c77ad>] __ptlrpc_free_req+0x15d/0x590 [ptlrpc]
      [1422397952.818716] [<ffffffffa08c7d3b>] __ptlrpc_req_finished+0x15b/0x260 [ptlrpc]
      [1422397952.818716] [<ffffffffa08de527>] request_out_callback+0xb7/0x240 [ptlrpc]
      [1422397952.818716] [<ffffffffa06e2da7>] lnet_msg_detach_md+0x57/0xf0 [lnet]
      [1422397952.818716] [<ffffffffa06e42b9>] lnet_finalize+0x89/0x3e0 [lnet]
      [1422397952.818716] [<ffffffffa085b4d7>] kiblnd_tx_done+0x117/0x3e0 [ko2iblnd]
      [1422397952.818716] [<ffffffffa085c372>] kiblnd_txlist_done+0x52/0x60 [ko2iblnd]
      [1422397952.818716] [<ffffffffa085c463>] kiblnd_peer_connect_failed+0xe3/0x2e0 [ko2iblnd]
      [1422397952.818716] [<ffffffffa08655ed>] kiblnd_cm_callback+0x4ed/0x1260 [ko2iblnd]
      [1422397952.818716] [<ffffffffa04f1a02>] cma_work_handler+0x72/0xa0 [rdma_cm]
      [1422397952.818716] [<ffffffff8107b4dc>] process_one_work+0x16c/0x350
      [1422397952.818716] [<ffffffff8107e10a>] worker_thread+0x17a/0x410
      [1422397952.818716] [<ffffffff81082476>] kthread+0x96/0xa0
      [1422397952.818716] [<ffffffff8147aae4>] kernel_thread_helper+0x4/0x10
      [1422397952.818716]
      [1422397952.818716] Kernel panic - not syncing: LBUG
      [1422397952.818716] Pid: 22882, comm: kworker/u:1 Tainted: GF NX 3.0.101-0.31.1.20140612-nasa #1
      [1422397952.818716] Call Trace:
      [1422397952.818716] [<ffffffff81004935>] dump_trace+0x75/0x310
      [1422397952.818716] [<ffffffff8146ec63>] dump_stack+0x69/0x6f
      [1422397952.818716] [<ffffffff8146ecfc>] panic+0x93/0x201
      Broadcast message from root (Wed Jan 28 12:11:27 2015):+0xa3/0xb0 [libcfs]
      [1422397952.818716] [<ffffffffa0914030>] null_free_reqbuf+0x260/0x2d0 [ptlrpc]
      System is being shutdown to apply patches during scheduled dedicated time. Plea[e logoff now to a]oid losing data. 1912,1 99%
      [he system is goin] DOWN for reboot in 9 minutes!
      [1422397952.818716] [<ffffffffa09022e4>] sptlrpc_cli_free_reqbuf+0x44/0x130 [ptlrpc]
      [1422397952.818716] [<ffffffffa08c77ad>] __ptlrpc_free_req+0x15d/0x590 [ptlrpc]
      [1422397952.818716] [<ffffffffa08c7d3b>] __ptlrpc_req_finished+0x15b/0x260 [ptlrpc]
      [1422397952.818716] [<ffffffffa08de527>] request_out_callback+0xb7/0x240 [ptlrpc]
      [1422397952.818716] [<ffffffffa06e2da7>] lnet_msg_detach_md+0x57/0xf0 [lnet]
      [1422397952.818716] [<ffffffffa06e42b9>] lnet_finalize+0x89/0x3e0 [lnet]
      [1422397952.818716] [<ffffffffa085b4d7>] kiblnd_tx_done+0x117/0x3e0 [ko2iblnd]
      [1422397952.818716] [<ffffffffa085c372>] kiblnd_txlist_done+0x52/0x60 [ko2iblnd]
      [1422397952.818716] [<ffffffffa085c463>] kiblnd_peer_connect_failed+0xe3/0x2e0 [ko2iblnd]
      [1422397952.818716] [<ffffffffa08655ed>] kiblnd_cm_callback+0x4ed/0x1260 [ko2iblnd]
      [1422397952.818716] [<ffffffffa04f1a02>] cma_work_handler+0x72/0xa0 [rdma_cm]
      [1422397952.818716] [<ffffffff8107b4dc>] process_one_work+0x16c/0x350
      [1422397952.818716] [<ffffffff8107e10a>] worker_thread+0x17a/0x410
      [1422397952.818716] [<ffffffff81082476>] kthread+0x96/0xa0
      [1422397952.818716] [<ffffffff8147aae4>] kernel_thread_helper+0x4/0x10

      Attachments

        Issue Links

          Activity

            [LU-6172] Client LBUG in null_free_reqbuf

            Duplicate of LU-3333.
            ~ jfc.

            jfc John Fuchs-Chesney (Inactive) added a comment - Duplicate of LU-3333 . ~ jfc.
            ys Yang Sheng added a comment -

            So can we mark this ticket as duplicated with lu-3333?

            ys Yang Sheng added a comment - So can we mark this ticket as duplicated with lu-3333?

            Thanks, guys!

            I cherry-picked the patch to nas-2.4.3 branch with no problem.

            jaylan Jay Lan (Inactive) added a comment - Thanks, guys! I cherry-picked the patch to nas-2.4.3 branch with no problem.
            pjones Peter Jones added a comment -

            Yang Sheng

            NASA have their own Lustre distribution and so can pick up the patch. They will advise if they need any assistance in porting the fix back to b2_4. Alternatively, if this is a rare issue, they may wait, knowing that this issue will disappear when they upgrade to 2.5.x clients.

            Peter

            pjones Peter Jones added a comment - Yang Sheng NASA have their own Lustre distribution and so can pick up the patch. They will advise if they need any assistance in porting the fix back to b2_4. Alternatively, if this is a rare issue, they may wait, knowing that this issue will disappear when they upgrade to 2.5.x clients. Peter
            ys Yang Sheng added a comment -

            Hi, Oleg. I think you may right. Looks like not other place can cause such issue. But seem we won't landed patch to 2.4. How to handle this ticket?

            ys Yang Sheng added a comment - Hi, Oleg. I think you may right. Looks like not other place can cause such issue. But seem we won't landed patch to 2.4. How to handle this ticket?
            green Oleg Drokin added a comment -

            I suspect this might be an instance of LU-3333

            green Oleg Drokin added a comment - I suspect this might be an instance of LU-3333
            pjones Peter Jones added a comment -

            Yang Sheng

            Could you please advise?

            Peter

            pjones Peter Jones added a comment - Yang Sheng Could you please advise? Peter

            The Call Trace part in the decription of the problem was cut-n-paste from a screeen that was contaminated by broadcast messages from root.

            The correct stack trace should be as below:

            [1422397952.818716] Kernel panic - not syncing: LBUG
            [1422397952.818716] Pid: 22882, comm: kworker/u:1 Tainted: GF NX 3.0.101-0.31.1.20140612-nasa #1
            [1422397952.818716] Call Trace:
            [1422397952.818716] [<ffffffff81004935>] dump_trace+0x75/0x310
            [1422397952.818716] [<ffffffff8146ec63>] dump_stack+0x69/0x6f
            [1422397952.818716] [<ffffffff8146ecfc>] panic+0x93/0x201
            [1422397952.818716] [<ffffffffa062cdc3>] lbug_with_loc+0xa3/0xb0 [libcfs]
            [1422397952.818716] [<ffffffffa0914030>] null_free_reqbuf+0x260/0x2d0 [ptlrpc]
            [1422397952.818716] [<ffffffffa09022e4>] sptlrpc_cli_free_reqbuf+0x44/0x130 [ptlrpc]
            [1422397952.818716] [<ffffffffa08c77ad>] __ptlrpc_free_req+0x15d/0x590 [ptlrpc]
            [1422397952.818716] [<ffffffffa08c7d3b>] __ptlrpc_req_finished+0x15b/0x260 [ptlrpc]
            [1422397952.818716] [<ffffffffa08de527>] request_out_callback+0xb7/0x240 [ptlrpc]
            [1422397952.818716] [<ffffffffa06e2da7>] lnet_msg_detach_md+0x57/0xf0 [lnet]
            [1422397952.818716] [<ffffffffa06e42b9>] lnet_finalize+0x89/0x3e0 [lnet]
            [1422397952.818716] [<ffffffffa085b4d7>] kiblnd_tx_done+0x117/0x3e0 [ko2iblnd]
            [1422397952.818716] [<ffffffffa085c372>] kiblnd_txlist_done+0x52/0x60 [ko2iblnd]
            [1422397952.818716] [<ffffffffa085c463>] kiblnd_peer_connect_failed+0xe3/0x2e0 [ko2iblnd]
            [1422397952.818716] [<ffffffffa08655ed>] kiblnd_cm_callback+0x4ed/0x1260 [ko2iblnd]
            [1422397952.818716] [<ffffffffa04f1a02>] cma_work_handler+0x72/0xa0 [rdma_cm]
            [1422397952.818716] [<ffffffff8107b4dc>] process_one_work+0x16c/0x350
            [1422397952.818716] [<ffffffff8107e10a>] worker_thread+0x17a/0x410
            [1422397952.818716] [<ffffffff81082476>] kthread+0x96/0xa0
            [1422397952.818716] [<ffffffff8147aae4>] kernel_thread_helper+0x4/0x10

            jaylan Jay Lan (Inactive) added a comment - The Call Trace part in the decription of the problem was cut-n-paste from a screeen that was contaminated by broadcast messages from root. The correct stack trace should be as below: [1422397952.818716] Kernel panic - not syncing: LBUG [1422397952.818716] Pid: 22882, comm: kworker/u:1 Tainted: GF NX 3.0.101-0.31.1.20140612-nasa #1 [1422397952.818716] Call Trace: [1422397952.818716] [<ffffffff81004935>] dump_trace+0x75/0x310 [1422397952.818716] [<ffffffff8146ec63>] dump_stack+0x69/0x6f [1422397952.818716] [<ffffffff8146ecfc>] panic+0x93/0x201 [1422397952.818716] [<ffffffffa062cdc3>] lbug_with_loc+0xa3/0xb0 [libcfs] [1422397952.818716] [<ffffffffa0914030>] null_free_reqbuf+0x260/0x2d0 [ptlrpc] [1422397952.818716] [<ffffffffa09022e4>] sptlrpc_cli_free_reqbuf+0x44/0x130 [ptlrpc] [1422397952.818716] [<ffffffffa08c77ad>] __ptlrpc_free_req+0x15d/0x590 [ptlrpc] [1422397952.818716] [<ffffffffa08c7d3b>] __ptlrpc_req_finished+0x15b/0x260 [ptlrpc] [1422397952.818716] [<ffffffffa08de527>] request_out_callback+0xb7/0x240 [ptlrpc] [1422397952.818716] [<ffffffffa06e2da7>] lnet_msg_detach_md+0x57/0xf0 [lnet] [1422397952.818716] [<ffffffffa06e42b9>] lnet_finalize+0x89/0x3e0 [lnet] [1422397952.818716] [<ffffffffa085b4d7>] kiblnd_tx_done+0x117/0x3e0 [ko2iblnd] [1422397952.818716] [<ffffffffa085c372>] kiblnd_txlist_done+0x52/0x60 [ko2iblnd] [1422397952.818716] [<ffffffffa085c463>] kiblnd_peer_connect_failed+0xe3/0x2e0 [ko2iblnd] [1422397952.818716] [<ffffffffa08655ed>] kiblnd_cm_callback+0x4ed/0x1260 [ko2iblnd] [1422397952.818716] [<ffffffffa04f1a02>] cma_work_handler+0x72/0xa0 [rdma_cm] [1422397952.818716] [<ffffffff8107b4dc>] process_one_work+0x16c/0x350 [1422397952.818716] [<ffffffff8107e10a>] worker_thread+0x17a/0x410 [1422397952.818716] [<ffffffff81082476>] kthread+0x96/0xa0 [1422397952.818716] [<ffffffff8147aae4>] kernel_thread_helper+0x4/0x10

            People

              ys Yang Sheng
              jaylan Jay Lan (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: