Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.5.0
    • Lustre 2.5.0
    • 3
    • 9674

    Description

      Disable OUT_PORTAL on the OST. Otherwise, when an MDT and an OST run on the same node, the receiver cannot tell which target a request is meant for, and an ll_ost_outxx thread may end up handling CONNECT requests from MDTs. This can cause a panic such as:

      Lustre: 13261:0:(client.c:1896:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1401492648/real 1401492648] req@ffff880044bf9000 x1469570233011472/t0(0) o1000->lustre-MDT0000-osp-MDT0001@0@lo:24/10 lens 8416/8416 e 0 to 1 dl 1401492655 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      Lustre: lustre-MDT0000-osp-MDT0001: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      Lustre: Skipped 5 previous similar messages
      Lustre: lustre-MDT0000: Client lustre-MDT0001-mdtlov_UUID (at 0@lo) reconnecting
      LustreError: 4239:0:(mdt_handler.c:3191:mdt_tgt_connect()) ASSERTION( mti != ((void *)0) ) failed:
      LustreError: 4239:0:(mdt_handler.c:3191:mdt_tgt_connect()) LBUG
      Pid: 4239, comm: ll_ost_out01_00

      Call Trace:
      [<ffffffffa0603905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [<ffffffffa0603f07>] lbug_with_loc+0x47/0xb0 [libcfs]
      [<ffffffffa0ddab95>] mdt_tgt_connect+0x515/0x550 [mdt]
      [<ffffffffa0939f5d>] tgt_request_handle+0x57d/0xe30 [ptlrpc]
      [<ffffffffa08f6718>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      [<ffffffffa06045be>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      [<ffffffffa061629f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      [<ffffffffa08edd09>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      [<ffffffffa06147e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      [<ffffffff810533f3>] ? __wake_up+0x53/0x70
      [<ffffffffa08f7aac>] ptlrpc_main+0xacc/0x1750 [ptlrpc]
      [<ffffffffa08f6fe0>] ? ptlrpc_main+0x0/0x1750 [ptlrpc]
      [<ffffffff81091d66>] kthread+0x96/0xa0
      [<ffffffff8100c14a>] child_rip+0xa/0x20
      [<ffffffff81091cd0>] ? kthread+0x0/0xa0
      [<ffffffff8100c140>] ? child_rip+0x0/0x20

          Activity

            [LU-3751] disable OUT_PORTAL on OST for now

            jlevi Jodi Levi (Inactive) added a comment - The patch has landed to master. Let me know if more work is needed and I will reopen this ticket.

            jlevi Jodi Levi (Inactive) added a comment - With change 7323 now landed to master, can this ticket be closed, or is additional work needed?

            bzzz Alex Zhuravlev added a comment - sure, I will do.

            tappro Mikhail Pershin added a comment - The problem should go away once the MDT part of the UT patch lands, so there is no need to change the protocol. Alex, could you base your changes on the later patches in the UT series? I expect it should work with http://review.whamcloud.com/6973

            bzzz Alex Zhuravlev added a comment - I'm not saying that change is the right thing in the long term, but at least it gives me a way to develop stuff before UT is complete.

            adilger Andreas Dilger added a comment - What are the implications here for unified targets? Isn't the whole point of UT that the same RPC to the same portal will execute the same operation on an OSD? If the "OST OUT" has a different portal than the "MDT OUT", we will need to handle OST updates separately from MDT updates forever in the future, which doesn't make sense to me.

            My preference would be Di's patch that just disables this code for 2.5 (rather than changing the protocol forever in the future as Alex's patch does). However, disabling the OST OUT handler will cause problems for LFSCK Phase 2, which is supposed to be using this service for MDT->OST communications, though that will only become a problem in 2.6.

            Mike, there needs to be some way for OUT to handle RPCs for both MDT and OST devices. Is that part of your later UT patch series?

            di.wang Di Wang added a comment - Alex, could you please push your patch to review and try to land it?

            bzzz Alex Zhuravlev added a comment - I used the following:

            diff --git a/lustre/include/lustre/lustre_idl.h b/lustre/include/lustre/lustre_idl.h
            index 3ee0c7e..109ae00 100644
            --- a/lustre/include/lustre/lustre_idl.h
            +++ b/lustre/include/lustre/lustre_idl.h
            @@ -141,6 +141,7 @@
             #define SEQ_DATA_PORTAL 31
             #define SEQ_CONTROLLER_PORTAL 32
             #define MGS_BULK_PORTAL 33
            +#define OUT_OST_PORTAL 34

             /* Portal 63 is reserved for the Cray Inc DVS - nic@cray.com, roe@cray.com, n8851@cray.com */

            diff --git a/lustre/ost/ost_handler.c b/lustre/ost/ost_handler.c
            index 7880341..2f4421f 100644
            --- a/lustre/ost/ost_handler.c
            +++ b/lustre/ost/ost_handler.c
            @@ -2797,7 +2797,7 @@ static int ost_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
             .bc_buf_size = OUT_BUFSIZE,
             .bc_req_max_size = OUT_MAXREQSIZE,
             .bc_rep_max_size = OUT_MAXREPSIZE,
            -.bc_req_portal = OUT_PORTAL,
            +.bc_req_portal = OUT_OST_PORTAL,
             .bc_rep_portal = OSC_REPLY_PORTAL,
             },
             /*

            di.wang Di Wang added a comment - http://review.whamcloud.com/7323

            People

              di.wang Di Wang
              di.wang Di Wang