Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Labels: None
    • Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
    • Severity: 2
    • Rank: 6494

    Description

      We receive many messages like:

      Jan 8 04:21:55 osiride-lp-030 kernel: LustreError: 11463:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff810a722ccc00 x1388786345868037/t0 o101->MGS@MGC10.121.13.31@tcp_0:26/25 lens 296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
      Jan 8 04:21:55 osiride-lp-030 kernel: LustreError: 11463:0:(client.c:858:ptlrpc_import_delay_req()) Skipped 179 previous similar messages
      Jan 8 04:22:38 osiride-lp-030 kernel: Lustre: 6743:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1388786345868061 sent from MGC10.121.13.31@tcp to NID 0@lo 5s ago has timed out (5s prior to deadline).
      Jan 8 04:22:38 osiride-lp-030 kernel: req@ffff810256529800 x1388786345868061/t0 o250->MGS@MGC10.121.13.31@tcp_0:26/25 lens 368/584 e 0 to 1 dl 1325992958 ref 1 fl Rpc:N/0/0 rc 0/0
      Jan 8 04:22:38 osiride-lp-030 kernel: Lustre: 6743:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 100 previous similar messages

      I have attached the "messages" log from the MDS/MGS server.

      Can you explain the meaning of these messages and how we can fix them?
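
      For reference, a quick way to check from the affected node whether the MGS NID named in these messages is reachable at all is to ping it over LNET and confirm that the local MGC device is UP. This is only a generic diagnostic sketch, not a procedure taken from this ticket:

      # ping the MGS NID reported in the errors over LNET
      lctl ping 10.121.13.31@tcp
      # list the local Lustre devices; the MGC entry should show as UP
      lctl dl | grep -i mgc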

      Attachments

        1. dump-18-01.tgz
          1 kB
        2. lustre.info
          14 kB
        3. messages-lp-030.bz2
          12 kB
        4. messages-lp-031.bz2
          58 kB
        5. tunefs.tgz
          1 kB

        Issue Links

          Activity

            [LU-970] Invalid Import messages
            pjones Peter Jones added a comment -

            Thanks!


            lustre.support Supporto Lustre Jnet2000 (Inactive) added a comment -

            Please close this issue

            lustre.support Supporto Lustre Jnet2000 (Inactive) added a comment -

            No, I have not. We are planning to upgrade Lustre to the latest stable version. I will change the configuration during the upgrade.

            johann Johann Lombardi (Inactive) added a comment -

            Cool. To be clear, you have also fixed the MGS configuration with tunefs.lustre as explained in my comment on 19/Jan/12 9:17 AM, right?

            lustre.support Supporto Lustre Jnet2000 (Inactive) added a comment -

            Ok, we have rebalanced the services on osiride-lp-030 and osiride-lp-031 and restarted all the clients. There are no more Lustre errors; here is the output of the lctl dl command on both servers:

            [root@osiride-lp-030 ~]# lctl dl
            0 UP mgs MGS MGS 45
            1 UP mgc MGC10.121.13.31@tcp 5c2ce5e0-645a-2b58-6c0d-c5a9a11671f5 5
            2 UP ost OSS OSS_uuid 3
            3 UP obdfilter home-OST0001 home-OST0001_UUID 43
            4 UP obdfilter home-OST0002 home-OST0002_UUID 43
            5 UP obdfilter home-OST0000 home-OST0000_UUID 43
            6 UP mdt MDS MDS_uuid 3
            7 UP lov home-mdtlov home-mdtlov_UUID 4
            8 UP mds home-MDT0000 home-MDT0000_UUID 41
            9 UP osc home-OST0000-osc home-mdtlov_UUID 5
            10 UP osc home-OST0001-osc home-mdtlov_UUID 5
            11 UP osc home-OST0002-osc home-mdtlov_UUID 5
            12 UP osc home-OST0003-osc home-mdtlov_UUID 5
            13 UP osc home-OST0004-osc home-mdtlov_UUID 5
            14 UP osc home-OST0005-osc home-mdtlov_UUID 5
            15 UP osc home-OST0006-osc home-mdtlov_UUID 5
            16 UP osc home-OST0007-osc home-mdtlov_UUID 5
            17 UP osc home-OST0008-osc home-mdtlov_UUID 5
            18 UP osc home-OST0009-osc home-mdtlov_UUID 5
            19 UP osc home-OST000a-osc home-mdtlov_UUID 5
            20 UP osc home-OST000b-osc home-mdtlov_UUID 5

            [root@osiride-lp-031 ~]# lctl dl
            0 UP mgc MGC10.121.13.31@tcp e4919e7b-230b-9ce3-910d-3ec6e1bed6fc 5
            1 UP ost OSS OSS_uuid 3
            2 UP obdfilter home-OST0006 home-OST0006_UUID 43
            3 UP obdfilter home-OST0004 home-OST0004_UUID 43
            4 UP obdfilter home-OST0007 home-OST0007_UUID 43
            5 UP obdfilter home-OST0003 home-OST0003_UUID 43
            6 UP obdfilter home-OST0009 home-OST0009_UUID 43
            7 UP obdfilter home-OST0008 home-OST0008_UUID 43
            8 UP obdfilter home-OST0005 home-OST0005_UUID 43
            9 UP obdfilter home-OST000b home-OST000b_UUID 43
            10 UP obdfilter home-OST000a home-OST000a_UUID 43

            Could you please close the issue? Thanks in advance


            johann Johann Lombardi (Inactive) added a comment -

            I'm afraid we do not have enough logs from this incident to find out why the MGS was unresponsive at the time. I would suggest fixing the mgsnode configuration error first; if the problem happens again, we can look into it then.
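
            The exact tunefs.lustre steps from the comment of 19/Jan/12 9:17 AM are not quoted in this thread. As a generic, hedged sketch only, the mgsnode and failover parameters stored on a target are usually inspected with a dry run of tunefs.lustre on the unmounted device; the device path below is a placeholder, and any actual rewrite (e.g. with --mgsnode/--writeconf) should follow the manual for the Lustre version in use:

            # inspect the parameters currently stored on an unmounted target (no changes made)
            tunefs.lustre --dryrun /dev/<target_device>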

            lustre.support Supporto Lustre Jnet2000 (Inactive) added a comment -

            So once we start the node 10.121.13.31@tcp and rebalance the services, the Lustre errors will be gone? But why did we see the Lustre errors before the failover of the 10.121.13.31@tcp node?

            Thanks in advance

            johann Johann Lombardi (Inactive) added a comment -

            > How should I change the configuration according to this setup to avoid the Lustre errors?

            There is no need to change the configuration. Please just follow the procedure I detailed in my comment on 19/Jan/12 9:17 AM and the error messages will be gone.

            lustre.support Supporto Lustre Jnet2000 (Inactive) added a comment -

            Hi Johann and Zhenyu, the normal configuration is:

            • MGT, MDT, OST0000, OST0001, OST0002 are owned by 10.121.13.31@tcp, with 10.121.13.62@tcp as the failover node
            • OST0003 -> OST000b are owned by 10.121.13.62@tcp, with 10.121.13.31@tcp as the failover node

            How should I change the configuration according to this setup to avoid the Lustre errors?

            When we took the tunefs output and the dumpfs output, we were in a failed-over situation, because all the targets were mounted on 10.121.13.62@tcp.

            We see the Lustre errors both before and after the shutdown of the 10.121.13.31@tcp node, as you can see in the attached messages logs.

            Thanks in advance

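            For reference, an ownership/failover layout like the one described in the comment above is typically declared when the targets are formatted. The sketch below is illustrative only; device paths, indices, and the exact NID pairing are assumptions, not taken from this ticket:

            # OST primarily served by 10.121.13.31@tcp, failing over to 10.121.13.62@tcp
            mkfs.lustre --fsname=home --ost --index=0 \
                --mgsnode=10.121.13.31@tcp --mgsnode=10.121.13.62@tcp \
                --failnode=10.121.13.62@tcp /dev/<ost0000_device>

            # OST primarily served by 10.121.13.62@tcp, failing over to 10.121.13.31@tcp
            mkfs.lustre --fsname=home --ost --index=3 \
                --mgsnode=10.121.13.31@tcp --mgsnode=10.121.13.62@tcp \
                --failnode=10.121.13.31@tcp /dev/<ost0003_device>
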
            bobijam Zhenyu Xu added a comment -

            No, there is no need to set a fsname on a separate MGT, which can handle multiple filesystems at once. My fault for mentioning the incorrect info in my comment on 18/Jan/12 10:45 AM.

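            To illustrate the point above: a standalone MGT is formatted without any --fsname (it can then serve the configuration of several filesystems), and the fsname only appears on the MDT/OST targets. A minimal, hedged sketch with placeholder device paths:

            # standalone MGS/MGT: no --fsname needed
            mkfs.lustre --mgs /dev/<mgt_device>

            # MDT of the "home" filesystem, registering with that MGS
            mkfs.lustre --fsname=home --mdt --mgsnode=10.121.13.31@tcp /dev/<mdt_device>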

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: lustre.support Supporto Lustre Jnet2000 (Inactive)
              Votes:
              0
              Watchers:
              5

              Dates

                Created:
                Updated:
                Resolved: