[LU-970] Invalid Import messages Created: 09/Jan/12 Updated: 02/Feb/12 Resolved: 02/Feb/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.x (1.8.0 - 1.8.5) |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Supporto Lustre Jnet2000 (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | log, server | ||
| Environment: |
Lustre version: 1.8.5.54-20110316022453-PRISTINE-2.6.18-194.17.1.el5_lustre.20110315140510 |
||
| Attachments: |
|
| Severity: | 2 |
| Epic: | log, server |
| Rank (Obsolete): | 6494 |
| Description |
|
We receive many messages like:
Jan 8 04:21:55 osiride-lp-030 kernel: LustreError: 11463:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff810a722ccc00 x1388786345868037/t0 o101->MGS@MGC10.121.13.31@tcp_0:26/25 lens 296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
I have attached the "messages" file of the MDS/MGS server. Can you explain the meaning of these messages and how we can fix them? |
| Comments |
| Comment by Johann Lombardi (Inactive) [ 09/Jan/12 ] |
|
This means that the MDT somehow cannot reach the MGS, which is supposed to run locally.
Also, you mentioned that you are running "1.8.5.54-20110316022453". Do I understand correctly that you are running a beta version of Oracle's 1.8.6, which isn't intended to be used in production? If so, I would really advise you to upgrade to 1.8.7-wc1. |
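|
For reference, the import state can be checked directly on the MDS/MGS node; a minimal sketch using the same commands quoted in the comments below (the mgc wildcard matches the single local MGC device):
# List the configured Lustre devices on this node
lctl dl
# Show the MGC import; in a healthy setup its "state" field reads FULL
lctl get_param mgc.*.import |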
| Comment by Peter Jones [ 09/Jan/12 ] |
|
Bobi, could you please take care of this ticket? Thanks. Peter |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 09/Jan/12 ] |
|
[root@osiride-lp-031 wisi281]# lctl dl |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 09/Jan/12 ] |
|
[root@osiride-lp-031 wisi281]# lctl get_param mgc.*.import |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 10/Jan/12 ] |
|
Hi,
The first server hosts these services:
The second server hosts services OST03 to OST0b. We have a dedicated 10GbE network using Broadcom Corporation NetXtreme II BCM57711E 10-Gigabit PCIe adapters. We run a Red Hat Cluster Suite cluster to provide high availability. The output of "lctl dl" and "lctl get_param mgc.*.import" was taken after a failover, with all the Lustre services hosted on the osiride-lp-031 server. We see the same messages on osiride-lp-031. I have attached the "messages" file of osiride-lp-031 from before and after the failover. Thanks in advance |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 10/Jan/12 ] |
|
messages of osiride-lp-031 |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 10/Jan/12 ] |
|
We are planning to upgrade to the latest GA version of Lustre at the end of January. |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 10/Jan/12 ] |
|
Hi,
>> 0 UP mgc MGC10.121.13.31@tcp 326e50f4-053e-14d7-29f8-10a8ae98140d 5
Thanks in advance for your support |
| Comment by Johann Lombardi (Inactive) [ 10/Jan/12 ] |
|
This indeed looks weird. Could you please run the following command on the MDT and on each OST device?
tunefs.lustre --print <target device>
|
| Comment by Zhenyu Xu [ 10/Jan/12 ] |
|
It could be that the MDT device or some OST devices were formatted with mkfs.lustre using a wrong "--mgsnode" argument. |
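|
The parameters recorded on a target can be inspected in place; a minimal sketch (the device path follows the naming used later in this ticket):
# Print the configuration stored on the target, without modifying it
tunefs.lustre --print /dev/mpath/ost07p1
# Check the "Parameters:" line for the mgsnode and failover.node values |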
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 10/Jan/12 ] |
|
Hi Zhenyu, Hi Johann, I'm not able to take the output of tunefs.lustre because this problem: [root@osiride-lp-031 ~]# tunefs.lustre --print /dev/mpath/ost07p1 tunefs.lustre FATAL: Device /dev/mpath/ost07p1 has not been formatted with I try on the /dev/dm-31 that are the real block device, but I receive the same error. |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 10/Jan/12 ] |
|
Do you think that having these two entries in the device list table is the cause of the errors that I receive in the "messages"? Thanks in advance |
| Comment by Zhenyu Xu [ 10/Jan/12 ] |
|
Please umount /dev/mpath/ost07p1, mount it as type 'ldiskfs', and upload its "CONFIGS/mountdata" file here. Also check whether the "Invalid Import" messages persist while ost07 is "offline" from the filesystem. |
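|
A minimal sketch of that procedure (the /mnt/ost07 mount point is a placeholder):
# Stop the target, mount its backing filesystem directly, and copy out the config
umount /dev/mpath/ost07p1
mkdir -p /mnt/ost07
mount -t ldiskfs /dev/mpath/ost07p1 /mnt/ost07
cp /mnt/ost07/CONFIGS/mountdata /tmp/mountdata-ost07
umount /mnt/ost07 |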
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 10/Jan/12 ] |
|
Sorry Zhenyu, but I receive the tunefs.lustre error on all of the Lustre block devices:
/dev/mpath/mgsp1 on /lustre/mgs type lustre (rw)
Should I use tunefs.lustre on the real SCSI disk devices and not on the multipathed block devices? |
| Comment by Zhenyu Xu [ 11/Jan/12 ] |
|
Then try running tunefs.lustre on the real SCSI disk device. Or use
debugfs -R "dump CONFIGS/mountdata /tmp/mountdata" /dev/mpath/ost07p1
to dump the file and upload it here. |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 11/Jan/12 ] |
|
Oops:
debugfs -R "dump CONFIGS/mountdata /tmp/mountdata-ost07" /dev/mpath/ost07p1
debugfs 1.41.10.sun2 (24-Feb-2010) |
| Comment by Johann Lombardi (Inactive) [ 11/Jan/12 ] |
|
This problem (i.e. debugfs cannot open the filesystem due to MMP) has been fixed in recent e2fsprogs. Could you please update the e2fsprogs package and rerun the tunefs.lustre command? |
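|
Assuming an RPM-based system (the kernel string in the Environment field is el5), the installed version can be checked and the updated package installed roughly as follows (the package file name is a placeholder):
# Check the currently installed e2fsprogs version
rpm -q e2fsprogs
# Install the newer Lustre e2fsprogs package for your distribution
rpm -Uvh e2fsprogs-<new version>.x86_64.rpm |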
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 13/Jan/12 ] |
|
Thanks Johann, |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 13/Jan/12 ] |
|
The end user agrees to upgrade e2fsprogs, but we should start with the test environment. I'm planning to give you the configuration information on 17th January. See you soon! |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 17/Jan/12 ] |
|
The end user asked us to wait another 2 days before upgrading the e2fsprogs tools in production. Thanks in advance |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 18/Jan/12 ] |
|
dumps |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 18/Jan/12 ] |
|
OK, we upgraded e2fsprogs and made the dumps of the configuration. I have attached them. Thanks in advance |
| Comment by Zhenyu Xu [ 18/Jan/12 ] |
|
#strings mdt
#strings ost*
You've formatted your devices with inconsistent fsnames. This could have happened if you formatted the MGS device without specifying the "--fsname" argument ("lustre" is its default value) while formatting the other devices with "--fsname=home". |
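|
For reference, the fsname can be read from a dumped mountdata file the same way; a small sketch using the dump name from the debugfs command above:
# The fsname is stored as a plain string inside CONFIGS/mountdata
strings /tmp/mountdata-ost07 | head
# Here the OSTs/MDT are expected to show "home" while the MGS shows "lustre" |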
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 18/Jan/12 ] |
|
Could this problem be the cause of the LustreError messages that we see? |
| Comment by Zhenyu Xu [ 18/Jan/12 ] |
|
Yes, this could cause the error messages. You would need to run "tunefs.lustre --mgs --fsname=home <other options> <mgs device>" and remount it, or possibly all the other devices as well. |
| Comment by Johann Lombardi (Inactive) [ 18/Jan/12 ] |
|
Please don't run this command until we can look at the output of "tunefs.lustre --print". Thanks in advance. |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 19/Jan/12 ] |
|
the tunefs output |
| Comment by Johann Lombardi (Inactive) [ 19/Jan/12 ] |
|
I am afraid that you forgot to specify the failover mgsnode when formatting the OSTs & MDT:
Read previous values:
Target: home-OST0000
Index: 0
Lustre FS: home
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.121.13.31@tcp failover.node=10.121.13.62@tcp ost.quota_type=ug
The OSTs and the MDT should be given the full list of NIDs where the MGS can run. In your case, this is both 10.121.13.31@tcp and 10.121.13.62@tcp. This explains why the targets cannot reach the MGS when the latter runs on 10.121.13.62@tcp. That's the root cause of the error messages you see. To fix this, you would have to apply the following procedure to each OST and to the MDT (see the sketch at the end of this comment):
I also noticed that some OSTs have "failover.node=10.121.13.62@tcp" while some others have "10.121.13.31@tcp". Could you please also send the output of "lctl get_param {mdc,osc}.*.import"? |
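|
A minimal sketch of the kind of per-target procedure meant above, assuming tunefs.lustre appends the extra mgsnode to the parameters already stored on the target (device path and mount point are placeholders):
# With the target unmounted, register the second MGS NID, then remount
umount <mount point>
tunefs.lustre --mgsnode=10.121.13.62@tcp <target device>
mount -t lustre <target device> <mount point>
# Re-run tunefs.lustre --print to verify that both MGS NIDs now appear |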
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 19/Jan/12 ] |
|
the lctl get_param {mdc,osc}.*.import output |
| Comment by Johann Lombardi (Inactive) [ 20/Jan/12 ] |
|
The MDT/OST failover config looks fine, so you just have to fix the mgsnode issue as mentioned above. |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 20/Jan/12 ] |
|
Johann, should I fix the MGS configuration too? |
| Comment by Zhenyu Xu [ 20/Jan/12 ] |
|
No, there is no need to set an fsname on a separate MGT, which can handle multiple filesystems at once. My fault for mentioning the incorrect info in my comment on 18/Jan/12 10:45 AM. |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 20/Jan/12 ] |
|
Hi Johann and Zhenyu, the normal configuration is:
How should I change the configuration according to this setup to avoid the Lustre errors? When we took the tunefs output and the dumpfs output, we were in a failed-over situation, because all the targets were mounted on 10.121.13.62@tcp. We had the Lustre errors both before and after the shutdown of the 10.121.13.31@tcp node, as you can see in the messages. Thanks in advance |
| Comment by Johann Lombardi (Inactive) [ 20/Jan/12 ] |
|
> How should I change the configuration according to this setup to avoid the Lustre errors?
There is no need to change the configuration. Please just follow the procedure I detailed in my comment on 19/Jan/12 9:17 AM and the error messages will be gone. |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 20/Jan/12 ] |
|
So when we start the 10.121.13.31@tcp node and rebalance the services, the Lustre errors will be gone? But why did we see the Lustre errors before the failover of the 10.121.13.31@tcp node? Thanks in advance |
| Comment by Johann Lombardi (Inactive) [ 20/Jan/12 ] |
|
I'm afraid that we don't have enough logs from this incident to find out why the MGS wasn't responsive at that time. |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 28/Jan/12 ] |
|
OK,
[root@osiride-lp-030 ~]# lctl dl
[root@osiride-lp-031 ~]# lctl dl
Could you please close the issue? Thanks in advance |
| Comment by Johann Lombardi (Inactive) [ 30/Jan/12 ] |
|
Cool. To be clear, you've also fixed the MGS configuration with tunefs.lustre as explained in my comment on 19/Jan/12 9:17 AM, right? |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 30/Jan/12 ] |
|
No, I have not. We are planning to upgrade Lustre to the latest stable version. I will change the configuration during the upgrade. |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 02/Feb/12 ] |
|
Please close this issue |
| Comment by Peter Jones [ 02/Feb/12 ] |
|
Thanks! |