Details
- Bug
- Resolution: Cannot Reproduce
- Major
- None
- None
- 3
- 14209
Description
A production Lustre cluster, "porter", was upgraded from lustre-2.4.0-28chaos to lustre-2.4.2-11chaos today. The OSTs now will not start.
# porter1 /root > /etc/init.d/lustre start
Stopping snmpd: [ OK ]
Shutting down cerebrod: [ OK ]
Mounting porter1/lse-ost0 on /mnt/lustre/local/lse-OST0001
mount.lustre: mount porter1/lse-ost0 at /mnt/lustre/local/lse-OST0001 failed: Input/output error
Is the MGS running?
# porter1 /root >
Lustre: Lustre: Build Version: 2.4.2-11chaos-11chaos--PRISTINE-2.6.32-431.17.2.1chaos.ch5.2.x86_64
LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.115.67@o2ib10 (no target)
LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.120.38@o2ib7 (no target)
LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.120.101@o2ib7 (no target)
LustreError: 5426:0:(client.c:1053:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff881026873800 x1470103660003336/t0(0) o253->MGC172.19.1.165@o2ib100@172.19.1.165@o2ib100:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 5426:0:(obd_mount_server.c:1140:server_register_target()) lse-OST0001: error registering with the MGS: rc = -5 (not fatal)
LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.116.205@o2ib5 (no target)
LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.114.162@o2ib5 (no target)
LustreError: Skipped 19 previous similar messages
LustreError: 5426:0:(client.c:1053:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff881026873800 x1470103660003340/t0(0) o101->MGC172.19.1.165@o2ib100@172.19.1.165@o2ib100:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.120.162@o2ib7 (no target)
LustreError: Skipped 23 previous similar messages
LustreError: 5426:0:(client.c:1053:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff881026873800 x1470103660003344/t0(0) o101->MGC172.19.1.165@o2ib100@172.19.1.165@o2ib100:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 15c-8: MGC172.19.1.165@o2ib100: The configuration from log 'lse-OST0001' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 5426:0:(obd_mount_server.c:1273:server_start_targets()) failed to start server lse-OST0001: -5
Lustre: lse-OST0001: Unable to start target: -5
LustreError: 5426:0:(obd_mount_server.c:865:lustre_disconnect_lwp()) lse-MDT0000-lwp-OST0001: Can't end config log lse-client.
LustreError: 5426:0:(obd_mount_server.c:1442:server_put_super()) lse-OST0001: failed to disconnect lwp. (rc=-2)
LustreError: 5426:0:(obd_mount_server.c:1472:server_put_super()) no obd lse-OST0001
Lustre: server umount lse-OST0001 complete
LustreError: 5426:0:(obd_mount.c:1290:lustre_fill_super()) Unable to mount (-5)
# porter1 /root > lctl ping 172.19.1.165@o2ib100    # <-- MGS NID
12345-0@lo
12345-172.19.1.165@o2ib100
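For triage, a console log like the one above can be condensed into counts per LustreError message ID and per reported return code. The following is only a rough sketch: the function name `summarize` and both regular expressions are assumptions made for illustration, derived from the message shapes visible in this ticket, and are not part of any Lustre tooling.

```python
import re
from collections import Counter

# Hypothetical pattern: message IDs such as "137-5" or "15c-8" that directly
# follow "LustreError:". Lines of the form "LustreError: <pid>:0:(file.c:...)"
# do not match because the PID is followed by ":" rather than "-".
MSG_ID = re.compile(r"LustreError: ([0-9a-f]+-[0-9a-f]+):")

# Hypothetical pattern: return codes written as "rc = -5" or "rc=-2".
# Other spellings seen in the log, such as "failed (-5)", are not counted.
RC = re.compile(r"rc\s*=\s*(-\d+)")


def summarize(log_text):
    """Return (message-ID counts, return-code counts) for a console log."""
    ids = Counter(MSG_ID.findall(log_text))
    rcs = Counter(RC.findall(log_text))
    return ids, rcs


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py < console.log
    ids, rcs = summarize(sys.stdin.read())
    for msg_id, count in ids.most_common():
        print(msg_id, count)
    for rc, count in rcs.most_common():
        print("rc", rc, count)
```

Because the patterns search the whole text rather than individual lines, the sketch works on both the flattened and the line-per-message form of the log. Suppressed repeats ("Skipped 19 previous similar messages") are not added to the counts.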
Issue Links
- is related to: LU-2887 sanity-quota test_12a: slow due to ZFS VMs sharing single disk (Resolved)
Looking at the stack traces in the logs, it seems every thread is blocked either on the transaction commit wait inside ZFS or on a semaphore held by a thread that is itself waiting on the transaction commit.
So it really looks like some sort of in-ZFS wait to me. There is no dump of all thread stacks here, so I wonder if you have one that shows a Lustre-induced deadlock of some sort above ZFS?