Details
- Bug
- Resolution: Cannot Reproduce
- Critical
- None
- None
- 3
- 7507
Description
Three 2.1.4-3chaos OSS nodes rebooted this morning. A client login node has failed to reconnect to most of their OSTs. I will attach complete console logs.
# aztec3 /root > lfs check servers | grep -v active
lsc-OST0004-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0008-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST000c-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0010-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0014-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0018-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST001c-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0020-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0024-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0028-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST002c-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0030-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0034-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0038-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0000-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST007f-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0083-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0087-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST008b-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST008f-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0093-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0097-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST009b-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST009f-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST00a3-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST00a7-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST00ab-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST00af-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST00b3-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST007b-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0130-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0134-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0138-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST013c-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0140-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0144-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0148-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST014c-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0150-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0154-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0158-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST015c-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0160-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0164-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST012c-osc-ffff880c3043a000: check error: Resource temporarily unavailable
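A long list of failing targets like the one above is easier to compare against the rebooted OSS nodes once it is reduced to sorted OST indexes. The following is a minimal Python sketch of that reduction, not part of the original report. The helper name `failed_ost_indexes` is hypothetical, and the regular expression assumes the `<fsname>-OST<4 hex digits>-osc-<id>: check error` line shape seen in this output.

```python
import re

# Assumed line shape, taken from the output in this report:
#   lsc-OST0004-osc-ffff880c3043a000: check error: Resource temporarily unavailable
FAILED_OST = re.compile(r"(\w+)-OST([0-9a-fA-F]{4})-osc-\w+: check error")


def failed_ost_indexes(output):
    """Return the sorted OST indexes (as ints) that reported a check error."""
    return sorted(int(idx, 16) for _fsname, idx in FAILED_OST.findall(output))


# Three lines copied from the report; a real run would read the full
# output of `lfs check servers` instead.
sample = """\
lsc-OST0004-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST0000-osc-ffff880c3043a000: check error: Resource temporarily unavailable
lsc-OST012c-osc-ffff880c3043a000: check error: Resource temporarily unavailable
"""

for index in failed_ost_indexes(sample):
    print(f"OST{index:04x}")
```

Because the pattern does not depend on line breaks, it also works on output that has been pasted as a single line.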
Example dmesg from client:
# aztec3 /root > dmesg | grep OST012c
Lustre: 7485:0:(client.c:1820:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1364918372/real 1364918372] req@ffff88011258ec00 x1430682336999865/t0(0) o400->lsc-OST012c-osc-ffff880c3043a000@172.19.1.121@o2ib100:28/4 lens 192/192 e 0 to 1 dl 1364918478 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: lsc-OST012c-osc-ffff880c3043a000: Connection to lsc-OST012c (at 172.19.1.121@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 7485:0:(client.c:1820:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1364918397/real 1364918397] req@ffff8804cf6c4800 x1430682337000483/t0(0) o400->lsc-OST012c-osc-ffff880c3043a000@172.19.1.121@o2ib100:28/4 lens 192/192 e 0 to 1 dl 1364918503 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 7485:0:(client.c:1820:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1364918447/real 1364918447] req@ffff880633f37c00 x1430682337002326/t0(0) o400->lsc-OST012c-osc-ffff880c3043a000@172.19.1.121@o2ib100:28/4 lens 192/192 e 0 to 1 dl 1364918553 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
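The timed-out requests above each carry a `sent` timestamp and a `dl` value. The sketch below, which is not part of the original report, pulls both out and prints the gap between them. It assumes that both fields are epoch seconds and that `dl` is the request deadline; that is a reading of the log format rather than something the report states. The helper name `reply_windows` is hypothetical.

```python
import re

# Assumption: "[sent <epoch>/real <epoch>]" is the send time and
# "dl <epoch>" is the request deadline, both in seconds.
REQUEST = re.compile(r"\[sent (\d+)/real \d+\].*?\bdl (\d+)")


def reply_windows(log):
    """Return (sent, dl - sent) pairs for each timed-out request in the log."""
    return [(int(sent), int(dl) - int(sent)) for sent, dl in REQUEST.findall(log)]


# Two request lines copied from the client dmesg in this report.
sample = (
    "Lustre: 7485:0:(client.c:1820:ptlrpc_expire_one_request()) @@@ Request sent "
    "has timed out for slow reply: [sent 1364918372/real 1364918372] "
    "req@ffff88011258ec00 x1430682336999865/t0(0) "
    "o400->lsc-OST012c-osc-ffff880c3043a000@172.19.1.121@o2ib100:28/4 "
    "lens 192/192 e 0 to 1 dl 1364918478 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1\n"
    "Lustre: 7485:0:(client.c:1820:ptlrpc_expire_one_request()) @@@ Request sent "
    "has timed out for slow reply: [sent 1364918397/real 1364918397] "
    "req@ffff8804cf6c4800 x1430682337000483/t0(0) "
    "o400->lsc-OST012c-osc-ffff880c3043a000@172.19.1.121@o2ib100:28/4 "
    "lens 192/192 e 0 to 1 dl 1364918503 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1\n"
)

for sent, window in reply_windows(sample):
    print(f"sent={sent} window={window}s")
```

Under that reading, every request shown in the client log was given the same 106-second window before it expired.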
Example dmesg for one server:
# sumom23 /root > dmesg | grep 172.16.66.53
Lustre: lsc-OST0156: Client ea981a59-5970-b37c-8c49-7b6886344dc2 (at 172.16.66.53@tcp) refused reconnection, still busy with 5 active RPCs
Lustre: lsc-OST0156: Bulk IO read error with ea981a59-5970-b37c-8c49-7b6886344dc2 (at 172.16.66.53@tcp), client will retry: rc -110
Lustre: lsc-OST0156: Bulk IO read error with ea981a59-5970-b37c-8c49-7b6886344dc2 (at 172.16.66.53@tcp), client will retry: rc -110
Lustre: lsc-OST0156: Bulk IO read error with ea981a59-5970-b37c-8c49-7b6886344dc2 (at 172.16.66.53@tcp), client will retry: rc -110
Lustre: lsc-OST013e: Bulk IO read error with ea981a59-5970-b37c-8c49-7b6886344dc2 (at 172.16.66.53@tcp), client will retry: rc -110
Lustre: lsc-OST0156: Client ea981a59-5970-b37c-8c49-7b6886344dc2 (at 172.16.66.53@tcp) refused reconnection, still busy with 1 active RPCs
LustreError: 23397:0:(ldlm_lib.c:2715:target_bulk_io()) @@@ Reconnect on bulk PUT req@ffff8802fb0af000 x1430682219726723/t0(0) o3->ea981a59-5970-b37c-8c49-7b6886344dc2@172.16.66.53@tcp:0/0 lens 456/400 e 1 to 0 dl 1364611876 ref 1 fl Interpret:/2/0 rc 0/0
Lustre: lsc-OST0156: Bulk IO read error with ea981a59-5970-b37c-8c49-7b6886344dc2 (at 172.16.66.53@tcp), client will retry: rc -110
LLNL-bug-id: TOSS-2006