[LU-2330] OST unable to complete registration with MGS Created: 14/Nov/12  Updated: 12/Jan/19  Resolved: 12/Jan/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Ned Bass Assignee: Li Wei (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: llnl
Environment:

lustre-orion-2.3.54-6chaos


Severity: 3
Rank (Obsolete): 5562

 Description   

We created a new filesystem with 768 zfs-osd OSTs. The OSTs were started out-of-index order, in parallel. There was a failure in the initial registration of about 35 of the OSTs:

2012-11-14 12:30:07 Lustre: Lustre: Build Version: 2.3.54-6chaos-6chaos--PRISTINE-2.6.32-220.23.1.2chaos.ch5.x86_64
2012-11-14 12:30:19 LustreError: 41374:0:(client.c:1123:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff881011707000 x1418644692140034/t0(0) o253->MGC172.20.5.1@o2ib500@172.20.5.1@o2ib500:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
2012-11-14 12:30:19 Lustre: Error -5 communicating with the MGS, is the MGS running?
2012-11-14 12:30:25 LustreError: 41374:0:(client.c:1123:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff881011707000 x1418644692140035/t0(0) o101->MGC172.20.5.1@o2ib500@172.20.5.1@o2ib500:26/25 lens 328/384 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
2012-11-14 12:30:31 LustreError: 41374:0:(client.c:1123:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff881011707000 x1418644692140036/t0(0) o101->MGC172.20.5.1@o2ib500@172.20.5.1@o2ib500:26/25 lens 328/384 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
2012-11-14 12:30:31 LustreError: 15c-8: MGC172.20.5.1@o2ib500: The configuration from log 'lsfull-OST0062' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
2012-11-14 12:30:31 LustreError: 41374:0:(obd_mount.c:1851:server_start_targets()) failed to start server lsfull-OST0062: -5
2012-11-14 12:30:31 Lustre: lsfull-OST0062: Unable to start target: -5
2012-11-14 12:30:31 LustreError: 41374:0:(obd_mount.c:1350:lustre_disconnect_osp()) Can't end config log lsfull
2012-11-14 12:30:31 LustreError: 41374:0:(obd_mount.c:2113:server_put_super()) lsfull-OST0062: failed to disconnect osp-on-ost (rc=-2)!
2012-11-14 12:30:31 LustreError: 41374:0:(obd_mount.c:2143:server_put_super()) no obd lsfull-OST0062
2012-11-14 12:30:31 LustreError: 41374:0:(obd_mount.c:1418:lustre_stop_osp()) Can not find osp-on-ost lsfull-MDT0000-osp-OST0062
2012-11-14 12:30:31 LustreError: 41374:0:(obd_mount.c:2158:server_put_super()) lsfull-OST0062: Fail to stop osp-on-ost!
2012-11-14 12:30:58 Lustre: server umount lsfull-OST0062 complete
2012-11-14 12:30:58 LustreError: 41374:0:(obd_mount.c:2990:lustre_fill_super()) Unable to mount  (-5)
2012-11-14 12:32:54 LustreError: 42061:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
2012-11-14 12:32:54 Lustre: lsfull-OST0062: Initializing new disk
2012-11-14 12:34:10 LustreError: 166-1: MGC172.20.5.1@o2ib500: Connection to MGS (at 172.20.5.1@o2ib500) was lost; in progress operations using this service will fail
2012-11-14 12:50:50 Lustre: Evicted from MGS (at MGC172.20.5.1@o2ib500_0) after server handle changed from 0x2ca8c28d7be253b7 to 0xa10ed509d011108e
2012-11-14 12:50:50 Lustre: MGC172.20.5.1@o2ib500: Connection restored to MGS (at 172.20.5.1@o2ib500)
2012-11-14 12:56:27 LustreError: 137-5: UUID 'ls1-OST0062_UUID' is not available for connect (no target)
2012-11-14 12:56:27 LustreError: Skipped 21 previous similar messages

df shows the OST mounted, and the lustre:svname property value no longer has the ":", however, but it has no exports:

# grove98 /root > df -t lustre
Filesystem           1K-blocks      Used Available Use% Mounted on
grove98/lsfull-ost0  67554518656      1152 67554515456   1% /mnt/lustre/local/lsfull-OST0062
# grove98 /root > zfs get lustre:svname grove98/lsfull-ost0
NAME                 PROPERTY       VALUE           SOURCE
grove98/lsfull-ost0  lustre:svname  lsfull-OST0062  local
# grove98 /root > ls /proc/fs/lustre/obdfilter/lsfull-OST0062/exports/
clear

If I try to restart the OST, the MGS complains it isn't registered:

MGS:

LustreError: 142-7: The target lsfull-OST0062 has not registered yet. It must be started before failnids can be added.
LustreError: 527:0:(mgs_llog.c:2956:mgs_write_log_param()) err -2 on param 'failover.node=172.20.1.97@o2ib500'
LustreError: 527:0:(mgs_handler.c:393:mgs_handle_target_reg()) Failed to write lsfull-OST0062 log (-2)

OSS:

2012-11-14 15:52:56 Lustre: Error -2 communicating with the MGS, is the MGS running?
2012-11-14 15:52:56 LustreError: 45074:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
2012-11-14 15:52:56 LustreError: 15c-8: MGC172.20.5.1@o2ib500: The configuration from log 'lsfull-OST0062' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
2012-11-14 15:52:56 LustreError: 45074:0:(obd_mount.c:1851:server_start_targets()) failed to start server lsfull-OST0062: -2
2012-11-14 15:52:56 Lustre: lsfull-OST0062: Unable to start target: -2
2012-11-14 15:52:56 LustreError: 45074:0:(obd_mount.c:1350:lustre_disconnect_osp()) Can't end config log lsfull
2012-11-14 15:52:56 LustreError: 45074:0:(obd_mount.c:2113:server_put_super()) lsfull-OST0062: failed to disconnect osp-on-ost (rc=-2)!
2012-11-14 15:52:56 LustreError: 45074:0:(obd_mount.c:2143:server_put_super()) no obd lsfull-OST0062
2012-11-14 15:52:56 LustreError: 45074:0:(obd_mount.c:1418:lustre_stop_osp()) Can not find osp-on-ost lsfull-MDT0000-osp-OST0062
2012-11-14 15:52:56 LustreError: 45074:0:(obd_mount.c:2158:server_put_super()) lsfull-OST0062: Fail to stop osp-on-ost!
2012-11-14 15:52:57 Lustre: server umount lsfull-OST0062 complete
2012-11-14 15:52:57 LustreError: 45074:0:(obd_mount.c:2990:lustre_fill_super()) Unable to mount  (-2)

Finally, if put the ":" back in the lustre:svname property to try to force re-registration, the MGS complains the index is already in use:

OSS:

# grove98 /root > zfs set lustre:svname=lsfull:OST0062 grove98/lsfull-ost0
# grove98 /root > zfs get lustre:svname grove98/lsfull-ost0
NAME                 PROPERTY       VALUE           SOURCE
grove98/lsfull-ost0  lustre:svname  lsfull:OST0062  local
# grove98 /root > /etc/init.d/lustre start
Mounting grove98/lsfull-ost0 on /mnt/lustre/local/lsfull-OST0062
mount.lustre: mount grove98/lsfull-ost0 at /mnt/lustre/local/lsfull-OST0062 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
# grove98 /root > dmesg | tail
Lustre: Error -98 communicating with the MGS, is the MGS running?
LustreError: 45333:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
LustreError: 15c-8: MGC172.20.5.1@o2ib500: The configuration from log 'lsfull-OST0062' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 45333:0:(obd_mount.c:1851:server_start_targets()) failed to start server lsfull-OST0062: -2
Lustre: lsfull-OST0062: Unable to start target: -2
LustreError: 45333:0:(obd_mount.c:1350:lustre_disconnect_osp()) Can't end config log lsfull
LustreError: 45333:0:(obd_mount.c:2113:server_put_super()) lsfull-OST0062: failed to disconnect osp-on-ost (rc=-2)!
LustreError: 45333:0:(obd_mount.c:2143:server_put_super()) no obd lsfull-OST0062
LustreError: 45333:0:(obd_mount.c:1418:lustre_stop_osp()) Can not find osp-on-ost lsfull-MDT0000-osp-OST0062
LustreError: 45333:0:(obd_mount.c:2158:server_put_super()) lsfull-OST0062: Fail to stop osp-on-ost!
Lustre: server umount lsfull-OST0062 complete
LustreError: 45333:0:(obd_mount.c:2990:lustre_fill_super()) Unable to mount  (-2)

MGS:

LustreError: 140-5: Server lsfull-OST0062 requested index 98, but that index is already in use. Use --writeconf to force
LustreError: 534:0:(mgs_llog.c:3005:mgs_write_log_target()) Can't get index (-98)
LustreError: 534:0:(mgs_handler.c:393:mgs_handle_target_reg()) Failed to write lsfull-OST0062 log (-98)


 Comments   
Comment by Peter Jones [ 15/Nov/12 ]

Alex will triage this one

Comment by Alex Zhuravlev [ 20/Nov/12 ]

I'm going to reproduce this locally.. it seems due to high load MGS was not able to respond in time, so OST got a timeout and error, but MGS finally processed the request and registered the target.

Comment by Alex Zhuravlev [ 28/Nov/12 ]

Ned, can you clarify a bit.. after the very first try to mount OST0062 which did not succeed you found OST0062 in df output ?

the first log claims "Lustre: server umount lsfull-OST0062 complete" which implies no remainings in df/mount output.

Comment by Ned Bass [ 28/Nov/12 ]

Alex, I updated the OST log in the description to include timestamps to give a better idea of the timing.

Yes that was the confusing part of this bug. df definitely showed OST0062 mounted. It looks like a second attempt was made to start the OST at 12:32:54, after the initial attempt failed. The OST was then evicted by the MGS, and reconnected. It was at this point that I logged in and ran df.

Comment by Sarah Liu [ 18/Feb/13 ]

It seems this is another instance seen in zfs testing:
https://maloo.whamcloud.com/test_sets/07920122-7788-11e2-987d-52540035b04c

Comment by Peter Jones [ 12/Jan/19 ]

Closing ancient ticket

Generated at Sat Feb 10 01:24:18 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.