[LU-3645] Interop 2.1.5 <--> 2.4 Write operations during failover error out instead of stalling Created: 26/Jul/13  Updated: 15/Mar/14  Resolved: 22/Nov/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.5
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Girish Shilamkar (Inactive) Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: mn1

Attachments: File LU-XXXX.tgz     File client_lctl_dk_20130911.tgz     File client_messages_20130911.tgz     Text File ll10987.out.gz     File mds1.llog.gz     Text File mds2.llog.gz     File server_lctl_dk_20130911.tgz     Text File ucbn003.localdomain.llog.gz    
Severity: 3
Rank (Obsolete): 9382

 Description   

During acceptance testing for KIT, OSS failover was tried while several applications were running, and the applications got I/O errors ("can't create file" and similar messages). This should not happen; I/O should simply stall until the failover completes.
The clients were running 2.4 and the servers 2.1.5. We tried with 2.1.5 clients and did not see this issue. I have attached the client and server logs.



 Comments   
Comment by Peter Jones [ 26/Jul/13 ]

Hongchao

Could you please look into this one?

Thanks

Peter

Comment by Kit Westneat (Inactive) [ 26/Jul/13 ]

This is actually a duplicate of LU-3621, we got our wires crossed here at DDN HQ, sorry about that.

Comment by Peter Jones [ 26/Jul/13 ]

ok - thanks Kit

Comment by Kit Westneat (Inactive) [ 30/Jul/13 ]

Hi Peter, I was mistaken on this - it is a different issue uncovered during the same testing. Can you reopen the ticket and advise us on what kind of debug information we should get to help diagnose the issue? I've attached rpctrace and vfstrace output.

Comment by Kit Westneat (Inactive) [ 30/Jul/13 ]

The I/O error appeared on ucbn003 and the bonnie++ error message was:
Create files in random order...Can't create file v8KnMgSE00000005fb

Comment by James Nunez (Inactive) [ 30/Jul/13 ]

Per Kit Westneat :
We accidentally closed out LU-3645 as a duplicate, but it's actually a different problem which arose at the same time. I just wanted to make sure it didn't fall off the radar since it was already closed as a duplicate. I have updated the ticket with new logs.

Thanks,
Kit

Comment by Hongchao Zhang [ 01/Aug/13 ]

Status update:
The logs are under investigation, but there is no result yet. I will update this ticket once any progress is made, thanks!

Comment by Hongchao Zhang [ 02/Aug/13 ]

I can't find any eviction-related messages in the client logs, but there are some on the servers (MDS & OSS).
What is running on the following IPs? Thanks
172.26.4.6
172.26.4.1
172.26.7.228
172.26.7.229
172.26.7.231

LU-XXXX/oss1/messages-20130623:Jun 21 14:34:05 oss1 kernel: : Lustre: pfscdat1-OST0000: haven't heard from client 52a702a3-cdba-d2a9-b8ee-c95c3cce895b (at 172.26.4.6@o2ib) in 230 seconds. I think it's dead, and I am evicting it. exp ffff8800b073f800, cur 1371818045 expire 1371817895 last 1371817815
LU-XXXX/oss1/messages-20130623:Jun 21 14:34:13 oss1 kernel: : Lustre: pfscwor1-OST0000: haven't heard from client ac49323c-bd2e-c726-9261-28fae514373c (at 172.26.4.6@o2ib) in 238 seconds. I think it's dead, and I am evicting it. exp ffff880814dda400, cur 1371818053 expire 1371817903 last 1371817815
LU-XXXX/oss1/messages-20130630:Jun 26 11:21:43 oss1 kernel: : Lustre: pfscwor1-OST0000: haven't heard from client a12614ba-2215-a27d-dd84-d404b5350351 (at 172.26.7.228@o2ib) in 231 seconds. I think it's dead, and I am evicting it. exp ffff88081469d400, cur 1372238503 expire 1372238353 last 1372238272
LU-XXXX/oss1/messages-20130630:Jun 26 15:12:08 oss1 kernel: : Lustre: pfscdat1-OST0000: haven't heard from client b47094a7-bd11-5591-8fda-789563de29b3 (at 172.26.7.229@o2ib) in 234 seconds. I think it's dead, and I am evicting it. exp ffff88018615c400, cur 1372252328 expire 1372252178 last 1372252094
LU-XXXX/oss1/messages-20130630:Jun 26 15:38:53 oss1 kernel: : Lustre: pfscwor1-OST0000: haven't heard from client 1c07b9d5-3d84-2150-c182-e634ff69e868 (at 172.26.7.229@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff880814f3ec00, cur 1372253933 expire 1372253783 last 1372253706
LU-XXXX/oss1/messages-20130630:Jun 26 17:02:35 oss1 kernel: : Lustre: pfscwor1-OST0000: haven't heard from client 2aa5cfe7-24e1-82df-7822-0695a2c9a627 (at 172.26.7.229@o2ib) in 228 seconds. I think it's dead, and I am evicting it. exp ffff880385efc400, cur 1372258955 expire 1372258805 last 1372258727
LU-XXXX/oss1/messages-20130630:Jun 27 12:45:03 oss1 kernel: : Lustre: pfscwor1-OST0000: haven't heard from client ff2ba913-cde8-1aa0-ac07-4e6e1a8b775d (at 172.26.7.229@o2ib) in 229 seconds. I think it's dead, and I am evicting it. exp ffff8803898ba800, cur 1372329903 expire 1372329753 last 1372329674
LU-XXXX/oss1/messages-20130630:Jun 27 12:45:05 oss1 kernel: : Lustre: pfscdat1-OST0000: haven't heard from client b7b0858f-7c85-8d52-0995-d93c0d0807c9 (at 172.26.7.229@o2ib) in 231 seconds. I think it's dead, and I am evicting it. exp ffff88081469d000, cur 1372329905 expire 1372329755 last 1372329674
LU-XXXX/oss1/messages-20130630:Jun 28 21:12:45 oss1 kernel: : Lustre: pfscwor1-OST0000: recovery is timed out, evict stale exports
LU-XXXX/oss1/messages-20130630:Jun 28 21:12:46 oss1 kernel: : Lustre: pfscdat2-OST0000: recovery is timed out, evict stale exports
LU-XXXX/oss1/messages-20130707:Jul 4 15:40:15 oss1 kernel: : Lustre: pfscwor1-OST0000: haven't heard from client 14f37ec5-cbc1-09f9-5ebb-c2be149e63d8 (at 172.26.7.231@o2ib) in 234 seconds. I think it's dead, and I am evicting it. exp ffff881013986000, cur 1372945215 expire 1372945065 last 1372944981
LU-XXXX/oss1/messages-20130707:Jul 4 15:45:50 oss1 kernel: : Lustre: pfscwor1-OST0000: haven't heard from client 14f37ec5-cbc1-09f9-5ebb-c2be149e63d8 (at 172.26.7.231@o2ib) in 228 seconds. I think it's dead, and I am evicting it. exp ffff880806639400, cur 1372945550 expire 1372945400 last 1372945322
LU-XXXX/oss1/messages-20130707:Jul 4 15:50:00 oss1 kernel: : Lustre: pfscdat2-OST0000: haven't heard from client 0514dccd-a92e-ce94-41cd-9c76585df636 (at 172.26.7.231@o2ib) in 228 seconds. I think it's dead, and I am evicting it. exp ffff88081b2c8800, cur 1372945800 expire 1372945650 last 1372945572
LU-XXXX/oss1/messages-20130714:Jul 8 09:32:07 oss1 kernel: : Lustre: pfscdat2-OST0000: haven't heard from client 9bb010b4-be70-74ca-50c0-d8a26989ff2b (at 172.26.7.229@o2ib) in 274 seconds. I think it's dead, and I am evicting it. exp ffff88081b336400, cur 1373268727 expire 1373268577 last 1373268453
LU-XXXX/oss1/messages-20130714:Jul 11 16:21:25 oss1 kernel: : Lustre: pfscwor1-OST0000: haven't heard from client f91d0897-30fc-c2ca-b15e-36b1093067eb (at 172.26.4.1@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff880815de1c00, cur 1373552485 expire 1373552335 last 1373552258
LU-XXXX/oss2/messages-20130623:Jun 21 14:34:13 oss2 kernel: : Lustre: pfscdat2-OST0000: haven't heard from client e5a9b005-ddb0-703e-9583-ec501dad4374 (at 172.26.4.6@o2ib) in 238 seconds. I think it's dead, and I am evicting it. exp ffff880661c34400, cur 1371818053 expire 1371817903 last 1371817815
LU-XXXX/oss2/messages-20130711:Jun 26 11:21:43 oss2 kernel: : Lustre: pfscdat2-OST0000: haven't heard from client 72a4d8b7-4cd1-b836-d32e-9870e7cde133 (at 172.26.7.228@o2ib) in 231 seconds. I think it's dead, and I am evicting it. exp ffff880f4962b800, cur 1372238503 expire 1372238353 last 1372238272
LU-XXXX/oss2/messages-20130711:Jun 26 15:12:08 oss2 kernel: : Lustre: pfscdat2-OST0000: haven't heard from client 0b9d9c9d-305a-0c95-263a-ee4767c3649d (at 172.26.7.229@o2ib) in 234 seconds. I think it's dead, and I am evicting it. exp ffff880ff0366c00, cur 1372252328 expire 1372252178 last 1372252094
LU-XXXX/oss2/messages-20130711:Jun 26 15:38:53 oss2 kernel: : Lustre: pfscdat2-OST0000: haven't heard from client 7082363c-bbaf-7a14-8f27-bba2748ac8b6 (at 172.26.7.229@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff880ec28e0800, cur 1372253933 expire 1372253783 last 1372253706
LU-XXXX/oss2/messages-20130711:Jun 26 17:02:35 oss2 kernel: : Lustre: pfscdat2-OST0000: haven't heard from client 2b993f06-3f9e-d311-c501-1dc637f23b9b (at 172.26.7.229@o2ib) in 228 seconds. I think it's dead, and I am evicting it. exp ffff880feaeac400, cur 1372258955 expire 1372258805 last 1372258727
LU-XXXX/oss2/messages-20130711:Jun 27 12:45:03 oss2 kernel: : Lustre: pfscdat2-OST0000: haven't heard from client 3b957e97-f989-033d-9b2f-d32653cdb7b5 (at 172.26.7.229@o2ib) in 229 seconds. I think it's dead, and I am evicting it. exp ffff880ff6b27000, cur 1372329903 expire 1372329753 last 1372329674

Comment by Girish Shilamkar (Inactive) [ 06/Aug/13 ]

> Hongchao Zhang added a comment - 02/Aug/13 4:56 AM
> I can't find any eviction-related messages in the client logs, but there are
> some on the servers (MDS & OSS). What is running on the following IPs? Thanks
> 172.26.4.6
> 172.26.4.1
> 172.26.7.228
> 172.26.7.229
> 172.26.7.231

They are the client IPs. 172.26.7.* had mounted the file systems but were not part of the failover testing.

> LU-XXXX/oss1/messages-20130623:Jun 21 14:34:05 oss1 kernel: : Lustre: pfscdat1-OST0000: haven't heard from client 52a702a3-cdba-d2a9-b8ee-c95c3cce895b (at 172.26.4.6@o2ib) in 230 seconds. I think it's dead, and I am evicting it. exp ffff8800b073f800, cur 1371818045 expire 1371817895 last 1371817815
...

All these messages are very old. The complete server system was rebooted on 2013-07-16, i.e. these messages should not be relevant.

Comment by Girish Shilamkar (Inactive) [ 07/Aug/13 ]

Can you please increase the priority of this issue? The problem is blocking this new system from getting into production.
Are there any additional logs or other information needed from our side?

Thanks

Comment by Hongchao Zhang [ 07/Aug/13 ]

Okay!
I'm trying to reproduce it myself with 2 nodes (2.1.5 client + 2.4 server) using "bonnie++", but have not been able to reproduce it so far.

Can this issue be reproduced with fewer nodes at your site (say, 2 nodes)?
Could you please attach new syslogs for ucbn003/oss1/oss2? The latest logs in LU-XXXX.tgz are from July 16. Thanks

Comment by Girish Shilamkar (Inactive) [ 07/Aug/13 ]

> I'm trying to reproduce it myself with 2 nodes (2.1.5 client + 2.4 server) using "bonnie++", but have not been able to reproduce it so far.
> Can this issue be reproduced with fewer nodes at your site (say, 2 nodes)?

This problem was seen during acceptance testing, and the setup had only an MDS failover pair and two OSSes.

>Could you please attach new syslogs for ucbn003/oss1/oss2? The latest logs in LU-XXXX.tgz are from July 16.
The test was run on July 15/16, so all the necessary information should be there. Do you want us to collect logs with any particular debug options?

I can see ost_write operations failing with -ENODEV. I ran replay-ost-single on 2.1.5 (ost_write fails with -ENODEV) and on 2.4 (ost_write fails with -ENOTCONN).

Does the error returned by ost_write affect the failover behavior?

Comment by Girish Shilamkar (Inactive) [ 07/Aug/13 ]

> I'm trying to reproduce it myself with 2 nodes (2.1.5 client + 2.4 server) using "bonnie++", but have not been able to reproduce it so far.

I am assuming "2.1.5 client + 2.4 server" is a typo, as this problem is seen with 2.1.5 servers and 2.4 clients.

Comment by Kit Westneat (Inactive) [ 07/Aug/13 ]

I sorted out the relevant bits from the Lustre log so I could see what was going on a bit better; hopefully this is helpful.

client:

00000080:00200000:0.0:1375191524.336997:0:11756:0:(namei.c:503:ll_lookup_it()) VFS Op:name=v8KnMgSE00000005fb,dir=144115440287769024/33554490(ffff8808619835f8),intent=open|creat
00000100:00100000:0.0:1375191524.337573:0:11756:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc bonnie++:8ff70f2e-0c20-af6e-ad31-a95bc6b86f3a:11756:1441894532689196:172.26.17.2@o2ib:101
00000100:00100000:0.0:1375191524.337599:0:11756:0:(client.c:2090:ptlrpc_set_wait()) set ffff88086ac7b2c0 going to sleep for 6 seconds
00000100:00100000:0.0:1375191524.423682:0:11756:0:(client.c:1805:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc bonnie++:8ff70f2e-0c20-af6e-ad31-a95bc6b86f3a:11756:1441894532689196:172.26.17.2@o2ib:101
00000002:00100000:0.0:1375191524.423714:0:11756:0:(mdc_locks.c:579:mdc_finish_enqueue()) @@@ op: 3 disposition: 17, status: -5  req@ffff880867f32c00 x1441894532689196/t0(0) o101->pfscdat2-MDT0000-mdc-ffff8810708aec00@172.26.17.2@o2ib:12/10 lens 592/544 e 0 to 0 dl 1375191531 ref 1 fl Complete:R/0/0 rc 301/301

mds2:

00000100:00100000:1.0:1375191524.337381:0:17462:0:(service.c:1547:ptlrpc_server_handle_req_in()) got req x1441894532689196
00000100:00100000:1.0:1375191524.337395:0:17462:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_10:8ff70f2e-0c20-af6e-ad31-a95bc6b86f3a+3316:11756:x1441894532689196:12345-172.26.4.3@o2ib:101
00000100:00080000:3.0:1375191524.337404:0:7060:0:(recover.c:216:ptlrpc_request_handle_notconn()) import pfscdat2-OST0000-osc-MDT0000 of pfscdat2-OST0000_UUID@172.26.17.3@o2ib abruptly disconnected: reconnecting
00000100:02000400:3.0:1375191524.337410:0:7060:0:(import.c:170:ptlrpc_set_import_discon()) pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
00000100:00080000:3.0:1375191524.337419:0:7060:0:(import.c:180:ptlrpc_set_import_discon()) ffff88082aeed800 pfscdat2-OST0000_UUID: changing import state from FULL to DISCONN
00000100:00080000:3.0:1375191524.337424:0:7060:0:(import.c:626:ptlrpc_connect_import()) ffff88082aeed800 pfscdat2-OST0000_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:3.0:1375191524.337428:0:7060:0:(import.c:482:import_select_connection()) pfscdat2-OST0000-osc-MDT0000: connect to NID 172.26.17.3@o2ib last attempt 4982526859
00000100:00080000:3.0:1375191524.337433:0:7060:0:(import.c:561:import_select_connection()) pfscdat2-OST0000-osc-MDT0000: import ffff88082aeed800 using connection 172.26.17.3@o2ib/172.26.17.3@o2ib
00000100:00100000:3.0:1375191524.337453:0:7060:0:(import.c:725:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5)  req@ffff881001505c00 x1441623136776061/t0(0) o8->pfscdat2-OST0000-osc-MDT0000@172.26.17.3@o2ib:28/4 lens 368/512 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
00000100:00100000:5.0:1375191524.337462:0:17454:0:(service.c:1547:ptlrpc_server_handle_req_in()) got req x1441894532689216
00000100:00020000:3.0:1375191524.337469:0:7060:0:(layout.c:1659:__req_capsule_get()) @@@ Wrong buffer for field `ost_body' (1 of 1) in format `OST_CREATE': 0 vs. 208 (server)
  req@ffff8810019bf000 x1441623136776022/t0(0) o5->pfscdat2-OST0000-osc-MDT0000@172.26.17.3@o2ib:7/4 lens 400/192 e 0 to 0 dl 1375191531 ref 1 fl Interpret:RN/0/0 rc -107/-107
00000100:00100000:5.0:1375191524.337475:0:17454:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_02:8ff70f2e-0c20-af6e-ad31-a95bc6b86f3a+3319:11760:x1441894532689216:12345-172.26.4.3@o2ib:35
00000100:00100000:13.0:1375191524.337481:0:7061:0:(client.c:1434:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-rcv:pfscdat2-MDT0000-mdtlov_UUID:7061:1441623136776061:172.26.17.3@o2ib:8
00000100:00100000:5.0:1375191524.337511:0:17454:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_02:8ff70f2e-0c20-af6e-ad31-a95bc6b86f3a+3318:11760:x1441894532689216:12345-172.26.4.3@o2ib:35 Request procesed in 38us (60us total) trans 51541021103 rc 0/0
00000100:02020000:13.0:1375191524.347566:0:7061:0:(client.c:1132:ptlrpc_check_status()) 11-0: an error occurred while communicating with 172.26.17.3@o2ib. The ost_connect operation failed with -19
00000100:00080000:13.0:1375191524.347576:0:7061:0:(import.c:1112:ptlrpc_connect_interpret()) ffff88082aeed800 pfscdat2-OST0000_UUID: changing import state from CONNECTING to DISCONN
00000100:00080000:13.0:1375191524.347579:0:7061:0:(import.c:1158:ptlrpc_connect_interpret()) recovery of pfscdat2-OST0000_UUID on 172.26.17.3@o2ib failed (-19)
00000100:00100000:13.0:1375191524.347583:0:7061:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-rcv:pfscdat2-MDT0000-mdtlov_UUID:7061:1441623136776061:172.26.17.3@o2ib:8
00000008:00020000:3.0:1375191524.381692:0:7060:0:(osc_create.c:175:osc_interpret_create()) @@@ Unknown rc -107 from async create: failing oscc  req@ffff8810019bf000 x1441623136776022/t0(0) o5->pfscdat2-OST0000-osc-MDT0000@172.26.17.3@o2ib:7/4 lens 400/192 e 0 to 0 dl 1375191531 ref 1 fl Interpret:RN/0/0 rc -107/-107
00000100:00080000:3.0:1375191524.410395:0:7060:0:(import.c:195:ptlrpc_set_import_discon()) osc: import ffff88082aeed800 already not connected (conn 5, was 4): DISCONN
00000008:00080000:3.0:1375191524.410401:0:7060:0:(osc_create.c:182:osc_interpret_create()) preallocated through id 5539121 (next to use 5534160)
00020000:00080000:3.0:1375191524.410404:0:7060:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x8bb8 sub-object on OST idx 0/1: rc = -107
00000100:00100000:3.0:1375191524.410411:0:7060:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:7060:1441623136776022:172.26.17.3@o2ib:5
00020000:00020000:1.0:1375191524.410415:0:17462:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x8bb8 sub-object on OST idx 0/1: rc = -5
00000100:00100000:1.0:1375191524.423247:0:17462:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_10:8ff70f2e-0c20-af6e-ad31-a95bc6b86f3a+3317:11756:x1441894532689196:12345-172.26.4.3@o2ib:101 Request procesed in 85853us (85878us total) trans 0 rc 301/301

oss1:

00000100:00100000:8.0:1375191524.325142:0:2026:0:(service.c:1547:ptlrpc_server_handle_req_in()) got req x1441623136776022
00000100:00100000:8.0:1375191524.325163:0:2026:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_creat_02:0+-99:7060:x1441623136776022:12345-172.26.17.2@o2ib:5
00000010:00080000:8.0:1375191524.325170:0:2026:0:(ost_handler.c:2085:ost_handle()) operation 5 on unconnected OST from 12345-172.26.17.2@o2ib
00000100:00100000:8.0:1375191524.325200:0:2026:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_creat_02:0+-99:7060:x1441623136776022:12345-172.26.17.2@o2ib:5 Request procesed in 41us (115us total) trans 0 rc -107/-107
Comment by Oleg Drokin [ 09/Aug/13 ]

The last log excerpts you show explain the EIO as being due to MDS2 being disconnected from the OST, so when we try to create an object there is nowhere to get the object from, hence the EIO.

Now, why did MDS2 disconnect from the OST? Is the OST still in recovery and does not recognize the MDS, so it does not let it in?

Comment by Kit Westneat (Inactive) [ 09/Aug/13 ]

I found the kernel logs for the servers:

mds2:

Jul 30 15:34:08 mds2 kernel: : Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
Jul 30 15:38:44 mds2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.3@o2ib. The ost_create operation failed with -107
Jul 30 15:38:44 mds2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Jul 30 15:38:44 mds2 kernel: : LustreError: 7060:0:(layout.c:1659:__req_capsule_get()) @@@ Wrong buffer for field `ost_body' (1 of 1) in format `OST_CREATE': 0 vs. 208 (server)
Jul 30 15:38:44 mds2 kernel: : req@ffff8810019bf000 x1441623136776022/t0(0) o5->pfscdat2-OST0000-osc-MDT0000@172.26.17.3@o2ib:7/4 lens 400/192 e 0 to 0 dl 1375191531 ref 1 fl Interpret:RN/0/0 rc -107/-107
Jul 30 15:38:44 mds2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.3@o2ib. The ost_connect operation failed with -19
Jul 30 15:38:44 mds2 kernel: : LustreError: 7060:0:(osc_create.c:175:osc_interpret_create()) @@@ Unknown rc -107 from async create: failing oscc  req@ffff8810019bf000 x1441623136776022/t0(0) o5->pfscdat2-OST0000-osc-MDT0000@172.26.17.3@o2ib:7/4 lens 400/192 e 0 to 0 dl 1375191531 ref 1 fl Interpret:RN/0/0 rc -107/-107
Jul 30 15:38:44 mds2 kernel: : LustreError: 17462:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x8bb8 sub-object on OST idx 0/1: rc = -5
Jul 30 15:39:23 mds2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.4@o2ib)

oss1:

Jul 30 15:38:43 oss1 kernel: : Lustre: Failing over pfscdat2-OST0000
Jul 30 15:38:44 oss1 kernel: : LustreError: 137-5: pfscdat2-OST0000: Not available for connect from 172.26.4.3@o2ib (stopping)
Jul 30 15:38:44 oss1 kernel: : LustreError: 137-5: pfscdat2-OST0000: Not available for connect from 172.26.17.2@o2ib (stopping)
Jul 30 15:38:44 oss1 kernel: : Lustre: pfscdat2-OST0000: shutting down for failover; client state will be preserved.
Jul 30 15:38:45 oss1 kernel: : Lustre: OST pfscdat2-OST0000 has stopped.

oss2:

Jul 30 15:38:56 oss2 kernel: : Lustre: 4795:0:(ldlm_lib.c:2028:target_recovery_init()) RECOVERY: service pfscdat2-OST0000, 11 recoverable clients, last_transno 42740433
Jul 30 15:38:56 oss2 kernel: : Lustre: pfscdat2-OST0000: Now serving pfscdat2-OST0000/ on /dev/mapper/ost_pfscdat2_0 with recovery enabled

So the OST was cleanly unmounted from oss1 and failed over to oss2. Shouldn't the MDT I/O hang until it can connect to oss2?

Comment by Hongchao Zhang [ 12/Aug/13 ]

This is the normal way for the MDS to handle object precreation on the OST, because the precreation request is marked as no-delay and no-resend.
The interesting part of the logs would be where the MDT returned the error (in this case, -5/-EIO) to the client, which was not found in the log.
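For illustration, here is a minimal self-contained model of why a no-delay/no-resend request errors out instead of stalling during failover. The struct and field names are stand-ins, not the real ptlrpc structures:

/* Toy model: a request that may neither wait for recovery nor be resent
 * fails immediately, while an ordinary request would simply stall. */
#include <stdio.h>
#include <errno.h>

struct fake_req {
        int rq_no_delay;        /* must not wait for the import to recover */
        int rq_no_resend;       /* must not be resent after a failure */
};

enum import_state { IMP_FULL, IMP_DISCONN };

static int send_or_fail(const struct fake_req *req, enum import_state state)
{
        if (state == IMP_FULL)
                return 0;                /* connection healthy, send it */
        if (req->rq_no_delay || req->rq_no_resend)
                return -ENOTCONN;        /* precreate-style RPC errors out */
        return -EAGAIN;                  /* ordinary RPC: caller waits and retries */
}

int main(void)
{
        struct fake_req precreate = { 1, 1 };
        struct fake_req regular   = { 0, 0 };

        printf("precreate during failover: %d\n", send_or_fail(&precreate, IMP_DISCONN));
        printf("regular RPC during failover: %d\n", send_or_fail(&regular, IMP_DISCONN));
        return 0;
}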

Comment by Kit Westneat (Inactive) [ 12/Aug/13 ]

Hi Hongchao,

I am not sure what you mean; I see this in the MDS logs:
00020000:00020000:1.0:1375191524.410415:0:17462:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x8bb8 sub-object on OST idx 0/1: rc = -5
00000100:00100000:1.0:1375191524.423247:0:17462:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_10:8ff70f2e-0c20-af6e-ad31-a95bc6b86f3a+3317:11756:x1441894532689196:12345-172.26.4.3@o2ib:101 Request procesed in 85853us (85878us total) trans 0 rc 301/301

Isn't that the part where the MDT is returning EIO? Let me know if there is another debug level I should use to get more information.

Thanks.

Comment by Hongchao Zhang [ 13/Aug/13 ]

Hi Kit,

At the MDS, the OSC creates objects in advance (precreation) for the MDT to use, so a failure of the creation request is normally not seen by the client; the MDT will check and wait if it finds that the specified OSC is in recovery while processing the "open|create" request.

Is the log of the client that ran the failed application on July 30 available? It could contain the entries showing where and when the MDT returned the error to the client.

Thanks

Comment by Kit Westneat (Inactive) [ 13/Aug/13 ]

Sure, here's the client syslogs from that period:

Jul 29 13:37:06 ucbn003 kernel: Lustre: Trying to mount a client with IR setting not compatible with current mgc. Force to use current mgc setting that is IR disabled.
Jul 30 15:33:14 ucbn003 kernel: LustreError: 11-0: pfscdat2-OST0000-osc-ffff8810708aec00: Communicating with 172.26.17.4@o2ib, operation ost_destroy failed with -19.
Jul 30 15:33:14 ucbn003 kernel: Lustre: pfscdat2-OST0000-osc-ffff8810708aec00: Connection to pfscdat2-OST0000 (at 172.26.17.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Jul 30 15:34:08 ucbn003 kernel: Lustre: pfscdat2-OST0000-osc-ffff8810708aec00: Connection restored to pfscdat2-OST0000 (at 172.26.17.3@o2ib)
Jul 30 15:38:44 ucbn003 kernel: LustreError: 11-0: pfscdat2-OST0000-osc-ffff8810708aec00: Communicating with 172.26.17.3@o2ib, operation ldlm_enqueue failed with -107.
Jul 30 15:38:44 ucbn003 kernel: Lustre: pfscdat2-OST0000-osc-ffff8810708aec00: Connection to pfscdat2-OST0000 (at 172.26.17.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Jul 30 15:39:23 ucbn003 kernel: Lustre: pfscdat2-OST0000-osc-ffff8810708aec00: Connection restored to pfscdat2-OST0000 (at 172.26.17.4@o2ib)

Just to make sure we are on the same page, the EIO was in the client lctl dk logs posted earlier:
00000002:00100000:0.0:1375191524.423714:0:11756:0:(mdc_locks.c:579:mdc_finish_enqueue()) @@@ op: 3 disposition: 17, status: -5  req@ffff880867f32c00 x1441894532689196/t0(0) o101->pfscdat2-MDT0000-mdc-ffff8810708aec00@172.26.17.2@o2ib:12/10 lens 592/544 e 0 to 0 dl 1375191531 ref 1 fl Complete:R/0/0 rc 301/301

It looks like pid 17462 gets the EIO in lov_update_create_set and then returns it to the client to complete x1441894532689196, which looks like the open|creat on the file.

Thanks.

Comment by Hongchao Zhang [ 14/Aug/13 ]

Hi KIT,

Yes, lov_create failed somehow. Could you please retest with this patch, which prints more logs to help find the cause? Thanks!

diff --git a/lustre/lov/lov_request.c b/lustre/lov/lov_request.c
index 0fc635e..dd7c081 100644
--- a/lustre/lov/lov_request.c
+++ b/lustre/lov/lov_request.c
@@ -573,10 +573,10 @@ static int lov_update_create_set(struct lov_request_set *set,
                 /* Pre-creating objects may timeout via -ETIMEDOUT or
                  * -ENOTCONN both are always non-critical events. */
                 CDEBUG(rc == -ETIMEDOUT || rc == -ENOTCONN ? D_HA : D_ERROR,
-                       "error creating fid "LPX64" sub-object "
+                       "error creating fid "LPX64"/"LPX64" sub-object "
                        "on OST idx %d/%d: rc = %d\n",
-                       set->set_oi->oi_oa->o_id, req->rq_idx,
-                       lsm->lsm_stripe_count, rc);
+                       set->set_oi->oi_oa->o_id, set->set_oi->oi_oa->o_seq,
+		       req->rq_idx, lsm->lsm_stripe_count, rc);
                 if (rc > 0) {
                         CERROR("obd_create returned invalid err %d\n", rc);
                         rc = -EIO;
@@ -636,6 +636,8 @@ static int create_done(struct obd_export *exp, struct lov_request_set *set,
                         req->rq_complete = 0;
 
                         rc = qos_remedy_create(set, req);
+			if (rc != 0)
+				CERROR("qos_remedy_create failed %d\n", rc);
                         lov_update_create_set(set, req, rc);
                 }
         }
@@ -733,6 +735,8 @@ int cb_create_update(void *cookie, int rc)
                 if (lovreq->rq_idx == cfs_fail_val)
                         rc = -ENOTCONN;
 
+	if (rc != 0)
+		CERROR("cb_create_update failed %d\n", rc);
         rc= lov_update_create_set(lovreq->rq_rqset, lovreq, rc);
         if (lov_set_finished(lovreq->rq_rqset, 0))
                 lov_put_reqset(lovreq->rq_rqset);
Comment by Kit Westneat (Inactive) [ 14/Aug/13 ]

Hi Hongchao,

I was reviewing this ticket and I realized that in this test configuration, the file system only has one OST. Looking at the code for the creates, it appears that it is doing what is intended. When all OSTs are offline, the MDT will return EIO to the client. Is this correct?

I will work on rebuilding a server with that debug patch.

Thanks.

Comment by Hongchao Zhang [ 15/Aug/13 ]

Hi Kit,

Having only one OST isn't a problem; the MDT/clients will wait for it to come back. In my local test (two VM nodes), both OSTs were unmounted and mounted again without hitting this error.

Thanks

Comment by Kit Westneat (Inactive) [ 04/Sep/13 ]

Hi Hongchao,

Here are the results of the debug patch:

Sep  4 14:34:48 mds2 kernel: : Lustre: 4591:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1378298081/real 1378298081]  req@ffff8808173ce000 x1445232561408895/t0(0) o400->pfscdat2-OST0000-osc-MDT0000@172.26.17.4@o2ib:28/4 lens 192/192 e 0 to 1 dl 1378298088 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep  4 14:34:48 mds2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Sep  4 14:34:54 mds2 kernel: : Lustre: 4592:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1378298088/real 1378298088]  req@ffff880825dca000 x1445232561408896/t0(0) o8->pfscdat2-OST0000-osc-MDT0000@172.26.17.4@o2ib:28/4 lens 368/512 e 0 to 1 dl 1378298094 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep  4 14:35:27 mds2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.3@o2ib)
Sep  4 14:35:27 mds2 kernel: : Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
Sep  4 14:40:00 mds2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.3@o2ib. The ost_setattr operation failed with -19
Sep  4 14:40:00 mds2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Sep  4 14:40:00 mds2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.3@o2ib. The ost_connect operation failed with -19
Sep  4 14:40:51 mds2 kernel: : LustreError: 4591:0:(lov_request.c:739:cb_create_update()) cb_create_update failed -11
Sep  4 14:40:51 mds2 kernel: : LustreError: 4591:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x9a8/0x0 sub-object on OST idx 0/1: rc = -11
Sep  4 14:40:51 mds2 kernel: : LustreError: 15954:0:(lov_request.c:640:create_done()) qos_remedy_create failed -5
Sep  4 14:40:51 mds2 kernel: : LustreError: 15954:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x9a8/0x0 sub-object on OST idx 0/1: rc = -5
Sep  4 14:40:51 mds2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.4@o2ib)
Sep  4 14:40:51 mds2 kernel: : Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
Comment by Hongchao Zhang [ 05/Sep/13 ]

Hi Kit,

Thanks for your effort to test the debug patch!

This issue is caused by the failed precreate request (rc = -107/-ENOTCONN), which had just been sent to precreate objects and triggered the recovery.
The fake request queued by the OSC to wait for the objects to be created was then woken up with that error, and lov_create/osc_create_async failed.

Step 1: the OSC found there were not enough precreated objects, so it called osc_internal_create to send a request to precreate more.
Step 2: one of the following osc_create_async calls found the objects were used up and queued a fake request to wait for the objects to be created.
Step 3: the request sent in step 1 triggered recovery with a recoverable error (-ENOTCONN or -ENODEV), and the fake request queued in step 2 was woken up with that error, so osc_create_async failed.

Normally the failure of the precreate request does not affect the OSC, because there should be enough precreated objects to use during the processing interval of the precreate request (which normally takes very little time, except on a bad network). In this case, the precreated objects were used up before the reply to the precreate request was received!
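To make the timing concrete, here is a small self-contained model of this sequence (illustrative names only; the real code paths are osc_internal_create, osc_create_async and osc_interpret_create):

/* Toy model of the race: the precreated-object pool runs dry before the
 * precreate reply arrives, and the queued waiter is woken with the RPC
 * error instead of waiting for recovery. Not Lustre code. */
#include <stdio.h>
#include <errno.h>

struct pool {
        int precreated;         /* objects already created on the OST */
        int waiter_queued;      /* a "fake" request is waiting for more */
};

/* Step 2: a create finds the pool empty and queues a waiter. */
static int reserve_object(struct pool *p)
{
        if (p->precreated > 0) {
                p->precreated--;
                return 0;
        }
        p->waiter_queued = 1;
        return -EINPROGRESS;    /* caller sleeps until the reply wakes it */
}

/* Step 3: the precreate reply wakes the waiter with the RPC result; during
 * failover that result is -ENOTCONN, and the waiter simply fails. */
static int precreate_reply(struct pool *p, int rpc_rc)
{
        if (rpc_rc == 0)
                p->precreated += 32;
        if (p->waiter_queued) {
                p->waiter_queued = 0;
                return rpc_rc;
        }
        return 0;
}

int main(void)
{
        struct pool p = { 0, 0 };

        /* Step 1: the pool is empty; a precreate RPC is already in flight. */
        int rc = reserve_object(&p);                    /* queues the waiter */
        if (rc == -EINPROGRESS)
                rc = precreate_reply(&p, -ENOTCONN);    /* OST is failing over */
        printf("create result seen on the MDT: %d\n", rc);  /* -107, later turned into -EIO */
        return 0;
}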

The patch will be attached soon!

Comment by Hongchao Zhang [ 05/Sep/13 ]

the patch is tracked at http://review.whamcloud.com/#/c/7559/

Comment by Kit Westneat (Inactive) [ 05/Sep/13 ]

Hi Hongchao,

Is this an MDT-only patch, or is it for the clients too?

Thanks,
Kit

Comment by Hongchao Zhang [ 06/Sep/13 ]

Hi Kit

the patch is for MDT only.

Thanks

Comment by Kit Westneat (Inactive) [ 13/Sep/13 ]

Hi Hongchao,

We've run into a problem with this patch. After upgrading the MDT, we are starting to get data corruption on the system. Here is a description of the issue:

Another admin has been working on the test file system this morning. At first she could execute normal commands on the file system. Then normal commands (like vi) added 112 MB of rubbish to files. Even a touch on a non-existing file created a file of 112 MB.

Another admin logged in and his .Xauthority had grown. He looked at the binary data and it seemed like the additional data came from a software package installed on the same Lustre file system, i.e. the rubbish does not appear to be arbitrary data but rather comes from another location on the file system.

root@iccn999:/software/all/tsm/sbin# touch gaga1
Wed Sep 11-14:38:21 (14/1012) - ACTIVE
root@iccn999:/software/all/tsm/sbin# ls -l gaga1
-rw-r--r-- 1 root root 116430464 Sep 11 14:38 gaga1

On another client the behaviour is different:
root@iccn996:/software/all/tsm/sbin# touch gaga2
touch: setting times of `gaga2': No such file or directory
Wed Sep 11-14:39:15 (5/41)
root@iccn996:/software/all/tsm/sbin# ls -l gaga2
-rw-r--r-- 1 root root 0 Sep 11 14:39 gaga2

I will upload the lctl dk logs (with vfstrace and rpctrace). Is there any other information we should get?

Comment by Kit Westneat (Inactive) [ 13/Sep/13 ]

I should add that we didn't see the problem anymore after downgrading to stock 2.1.5. We did, however, see the issue with both 2.1.6+patch and 2.1.5+patch.

Comment by Hongchao Zhang [ 16/Sep/13 ]

Sorry, there was an error in the previous patch; it has been updated (http://review.whamcloud.com/#/c/7559/).

Comment by Kit Westneat (Inactive) [ 26/Sep/13 ]

Hi Hongchao,

The patch fixed the data corruption, but we are still seeing the same errors on failover. Should we reproduce the issue with the original debugging patch + fix? Here are the kernel messages from the fix (patch 7559):

Sep 25 15:16:49 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.3@o2ib. The ost_setattr operation failed with -19
Sep 25 15:16:49 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Sep 25 15:17:39 pfscn2 kernel: : LustreError: 9001:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x3288 sub-object on OST idx 0/1: rc = -11
Sep 25 15:17:39 pfscn2 kernel: : LustreError: 7308:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x3288 sub-object on OST idx 0/1: rc = -5
Sep 25 15:17:40 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.4@o2ib)
Sep 25 15:17:40 pfscn2 kernel: : Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
Comment by Kit Westneat (Inactive) [ 27/Sep/13 ]

Hi Hongchao,

I was trying to follow the create logic and it has me a bit confused. It seems like the fake requests are handled by oscc_internal_create, but if the oscc is recovering (due to the EAGAIN) all it does is return 0. Am I reading that right?

Thanks,
Kit

Comment by Hongchao Zhang [ 30/Sep/13 ]

Hi
The fake requests are queued on osc_creator.oscc_wait_create_list in osc_create_async to wait for the object creation to complete, and they are woken up in osc_interpret_create with the value returned from the OST. In this case, -ENOTCONN or -ENODEV was returned and failed these fake requests; they should instead keep waiting for the recovery of the OST to complete, after which the objects will be ready.
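A minimal sketch of that intended behaviour, i.e. what the woken waiter would do if recoverable errors were treated as "keep waiting" (illustrative only, not the actual handle_async_create/osc_interpret_create code):

/* Toy decision function: a queued create-waiter is woken with the result
 * of the precreate RPC and must decide whether to fail or keep waiting. */
#include <stdio.h>
#include <errno.h>

static int handle_wakeup(int rpc_rc, int *requeue)
{
        *requeue = 0;
        if (rpc_rc == 0)
                return 0;                       /* objects are available */
        if (rpc_rc == -ENOTCONN || rpc_rc == -ENODEV) {
                *requeue = 1;                   /* OST failing over: wait for recovery */
                return 0;
        }
        return rpc_rc;                          /* genuine failure (e.g. -ENOSPC) */
}

int main(void)
{
        int requeue, rc;

        rc = handle_wakeup(-ENOTCONN, &requeue);
        printf("failover error: rc=%d requeue=%d\n", rc, requeue);

        rc = handle_wakeup(-ENOSPC, &requeue);
        printf("real error: rc=%d requeue=%d\n", rc, requeue);
        return 0;
}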

I'll create a new debug patch on top of http://review.whamcloud.com/#/c/7559/

Thanks!

Comment by Kit Westneat (Inactive) [ 01/Oct/13 ]

Ah ok, that makes sense. Should handle_async_create add the RECOVERING flag to the oscc? It seems like osc_interpret_create would add it, but is it possible that it could get to oscc_internal_create without that flag?

Comment by Kit Westneat (Inactive) [ 03/Oct/13 ]

Hi Hongchao,

Any updates?

Thanks,
Kit

Comment by Kit Westneat (Inactive) [ 03/Oct/13 ]

I tested adding the degraded flag to the oscc and it seemed to fix the problem on my test system:

        if(rc != 0 && rc != -ENOTCONN && rc != -ENODEV)
                GOTO(out_wake, rc);
           
+        if (rc == -ENOTCONN || rc == -ENODEV)
+                oscc->oscc_flags |= OSCC_FLAG_DEGRADED;

        /* Handle the critical type errors first.

I ran into one issue where the OST ran out of inodes and it caused the lov_create threads on the MDT to deadlock. I'm not sure if that's new behavior or if it happens with stock 2.1.6 as well. I'll investigate.

Comment by Hongchao Zhang [ 08/Oct/13 ]

Hi Kit,

Sorry for the delayed response; we were on holiday for the last several days.

If "OSCC_FLAG_DEGRADED" is set, then this OST won't be preferred when creating objects. Are you only testing the failover on one OSS?

BTW, is there anything like "Unknown rc XXX from async create: failing oscc" in the kernel logs?

Thanks

Comment by Kit Westneat (Inactive) [ 09/Oct/13 ]

The fix that worked on my test system didn't work on the customer system, so it might have been a fluke.

I haven't seen that message in the new logs. Here are the latest server dk logs:

00000100:00100000:14.0:1381243911.199943:0:4598:0:(import.c:725:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5)  req@ffff8808181fb400 x1448249214211642/t0(0) o8->pfscdat2-OST0000-osc-MDT0000@172.26.17.3@o2ib:28/4 lens 368/512 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
00000100:00100000:13.0:1381243911.212619:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:-1:1448249214211637:172.26.17.3@o2ib:-1
00020000:00020000:13.0:1381243911.212624:0:4596:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10f5a sub-object on OST idx 0/1: rc = -11
00000100:00100000:13.0:1381243911.225308:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:-1:1448249214211638:172.26.17.3@o2ib:-1
00020000:00020000:13.0:1381243911.225312:0:4596:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x18c7 sub-object on OST idx 0/1: rc = -11
00000100:00100000:13.0:1381243911.250579:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:-1:1448249214211639:172.26.17.3@o2ib:-1
00020000:00020000:6.0:1381243911.250579:0:29944:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10f5a sub-object on OST idx 0/1: rc = -5
00000100:00100000:13.0:1381243911.250586:0:4596:0:(client.c:1434:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:e76ce10d-1d27-cdc4-1091-649094c70331:4596:1448249214211641:172.26.17.1@o2ib:400
00020000:00020000:5.0:1381243911.250612:0:4755:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x18c7 sub-object on OST idx 0/1: rc = -5
00000100:00100000:13.0:1381243911.250638:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:0:1448249214211644:172.26.17.3@o2ib:13
00000100:00100000:5.0:1381243911.250741:0:4755:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_02:94b08aa2-54d7-b32e-019d-2561ed3286b5+6:58211:x1447047461344560:12345-172.26.4.3@o2ib:101 Request procesed in 49643595us (49643620us total) trans 0 rc 301/301
Comment by Hongchao Zhang [ 10/Oct/13 ]

Hi Kit,

could you please test with the following debug patch

diff --git a/lustre/lov/lov_request.c b/lustre/lov/lov_request.c
index 0fc635e..f0476cb 100644
--- a/lustre/lov/lov_request.c
+++ b/lustre/lov/lov_request.c
@@ -577,6 +577,14 @@ static int lov_update_create_set(struct lov_request_set *se
                        "on OST idx %d/%d: rc = %d\n",
                        set->set_oi->oi_oa->o_id, req->rq_idx,
                        lsm->lsm_stripe_count, rc);
+
+#ifdef __KERNEL__
+               if (rc == -EAGAIN) {
+                       CDEBUG(D_ERROR, "shouldn't encounter -EAGAIN!\n");
+                       dump_stack();
+               }
+#endif
+
                 if (rc > 0) {
                         CERROR("obd_create returned invalid err %d\n", rc);
                         rc = -EIO;
diff --git a/lustre/osc/osc_create.c b/lustre/osc/osc_create.c
index ffc81d4..22265bd 100644
--- a/lustre/osc/osc_create.c
+++ b/lustre/osc/osc_create.c
@@ -412,7 +412,7 @@ static int handle_async_create(struct ptlrpc_request *req, i
 
         LASSERT_SPIN_LOCKED(&oscc->oscc_lock);
 
-        if(rc)
+        if(rc != 0 && rc != -ENOTCONN && rc != -ENODEV)
                 GOTO(out_wake, rc);
 
         /* Handle the critical type errors first.
Comment by Kit Westneat (Inactive) [ 23/Oct/13 ]

Hi Hongchao,

Sorry for the delay in getting the output:

Oct 23 10:47:48 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 23 10:48:39 pfscn2 kernel: : LustreError: 9651:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10949 sub-object on OST idx 0/1: rc = -11
Oct 23 10:48:39 pfscn2 kernel: : LustreError: 9651:0:(lov_request.c:583:lov_update_create_set()) shouldn't encounter -EAGAIN!
Oct 23 10:48:39 pfscn2 kernel: : Pid: 9651, comm: ptlrpcd Not tainted 2.6.32-358.11.1.el6_lustre.es124.x86_64 #1
Oct 23 10:48:39 pfscn2 kernel: : Call Trace:
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0ad3505>] ? lov_update_create_set+0x4f5/0x500 [lov]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0ad78cd>] ? cb_create_update+0x2d/0x100 [lov]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0a6f28b>] ? handle_async_create+0x7b/0x390 [osc]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0a6fad9>] ? async_create_interpret+0x39/0x50 [osc]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0811c9b>] ? ptlrpc_check_set+0x29b/0x1b00 [ptlrpc]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffff8150ecda>] ? schedule_timeout+0x19a/0x2e0
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0842f20>] ? ptlrpcd_check+0x1a0/0x230 [ptlrpc]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa08431be>] ? ptlrpcd+0x20e/0x370 [ptlrpc]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffff81063310>] ? default_wake_function+0x0/0x20
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Oct 23 10:48:39 pfscn2 kernel: : LustreError: 10250:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10949 sub-object on OST idx 0/1: rc = -5
Oct 23 10:49:21 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.3@o2ib)
Oct 23 10:49:21 pfscn2 kernel: : Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
Oct 23 10:53:38 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.3@o2ib. The ost_setattr operation failed with -19
Oct 23 10:53:38 pfscn2 kernel: : LustreError: Skipped 1 previous similar message
Oct 23 10:53:38 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 23 10:54:09 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.4@o2ib. The ost_connect operation failed with -19
Oct 23 10:54:09 pfscn2 kernel: : LustreError: Skipped 1 previous similar message
Oct 23 10:54:16 pfscn2 kernel: : Lustre: 10169:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection from 6ede50b8-462e-c607-2c6d-f747541fabbd@172.26.17.4@o2ib t0 exp (null) cur 1382518456 last 0
Oct 23 10:54:28 pfscn2 kernel: : LustreError: 9651:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0xd948 sub-object on OST idx 0/1: rc = -11
Oct 23 10:54:28 pfscn2 kernel: : LustreError: 9651:0:(lov_request.c:583:lov_update_create_set()) shouldn't encounter -EAGAIN!
Oct 23 10:54:28 pfscn2 kernel: : Pid: 9651, comm: ptlrpcd Not tainted 2.6.32-358.11.1.el6_lustre.es124.x86_64 #1
Oct 23 10:54:28 pfscn2 kernel: : Call Trace:
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0ad3505>] ? lov_update_create_set+0x4f5/0x500 [lov]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0ad78cd>] ? cb_create_update+0x2d/0x100 [lov]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0a6f28b>] ? handle_async_create+0x7b/0x390 [osc]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0a6fad9>] ? async_create_interpret+0x39/0x50 [osc]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0811c9b>] ? ptlrpc_check_set+0x29b/0x1b00 [ptlrpc]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffff8150ecda>] ? schedule_timeout+0x19a/0x2e0
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0842f20>] ? ptlrpcd_check+0x1a0/0x230 [ptlrpc]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa08431be>] ? ptlrpcd+0x20e/0x370 [ptlrpc]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffff81063310>] ? default_wake_function+0x0/0x20
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Oct 23 10:54:28 pfscn2 kernel: : LustreError: 24178:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0xd948 sub-object on OST idx 0/1: rc = -5
Comment by Kit Westneat (Inactive) [ 23/Oct/13 ]

Should it actually be:
+ if(rc != 0 && rc != -ENOTCONN && rc != -ENODEV && rc != -EAGAIN)

I built a version with that just to test. I didn't get any IO errors, but I did get:

Oct 23 22:27:56 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.4@o2ib. The ost_create operation failed with -19
Oct 23 22:27:56 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 23 22:27:56 pfscn2 kernel: : LustreError: 9255:0:(osc_create.c:175:osc_interpret_create()) @@@ Unknown rc -19 from async create: failing oscc  req@ffff88081f6c3800 x1449716075693976/t0(0) o5->pfscdat2-OST0000-osc-MDT0000@172.26.17.4@o2ib:7/4 lens 400/400 e 0 to 0 dl 1382560083 ref 1 fl Interpret:RN/0/0 rc -19/-19
Oct 23 22:28:53 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.3@o2ib)
Oct 23 22:28:53 pfscn2 kernel: : Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
Oct 23 22:28:53 pfscn2 kernel: : Lustre: Skipped 1 previous similar message
Oct 23 22:31:16 pfscn2 kernel: : Lustre: Service thread pid 10987 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Oct 23 22:31:16 pfscn2 kernel: : Pid: 10987, comm: mdt_13
Oct 23 22:31:16 pfscn2 kernel: :
Oct 23 22:31:16 pfscn2 kernel: : Call Trace:
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa04de923>] ? cfs_alloc+0x63/0x90 [libcfs]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0808a45>] ? ptlrpc_next_xid+0x15/0x40 [ptlrpc]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa04de60e>] cfs_waitq_wait+0xe/0x10 [libcfs]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0ac650a>] lov_create+0xbaa/0x1400 [lov]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0d5c0d6>] ? mdd_get_md+0x96/0x2f0 [mdd]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffff81063310>] ? default_wake_function+0x0/0x20
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0d7c916>] ? mdd_read_unlock+0x26/0x30 [mdd]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0d6090e>] mdd_lov_create+0x9ee/0x1ba0 [mdd]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0d72871>] mdd_create+0xf81/0x1a90 [mdd]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0e72414>] ? osd_object_init+0xe4/0x420 [osd_ldiskfs]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0eaa3f7>] cml_create+0x97/0x250 [cmm]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0de55b1>] ? mdt_version_get_save+0x91/0xd0 [mdt]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0dfb156>] mdt_reint_open+0x1aa6/0x2940 [mdt]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa081d754>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0d7556e>] ? md_ucred+0x1e/0x60 [mdd]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0de3c51>] mdt_reint_rec+0x41/0xe0 [mdt]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0ddaed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0ddb53d>] mdt_intent_reint+0x1ed/0x500 [mdt]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0dd9c09>] mdt_intent_policy+0x379/0x690 [mdt]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa07d9391>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa07ff1dd>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0dda586>] mdt_enqueue+0x46/0x130 [mdt]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0dcf772>] mdt_handle_common+0x932/0x1750 [mdt]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0dd0665>] mdt_regular_handle+0x15/0x20 [mdt]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa082db9e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa082cf50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffff8100c0ca>] child_rip+0xa/0x20
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa082cf50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa082cf50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
Oct 23 22:31:16 pfscn2 kernel: : [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Comment by Kit Westneat (Inactive) [ 23/Oct/13 ]

The decoded Lustre log dump is attached.

Comment by Hongchao Zhang [ 28/Oct/13 ]

Hi Kit,

Did you apply any patch on top of the original 2.1.5? According to the logs, the "-EAGAIN" error originates from ptlrpc_check_set, and it should be "ptlrpc_import_delay_req" that sets this error when the fake request (queued by osc_create_async) times out, but no "-EAGAIN" is found in that function.

In any case, the logs show the cause of the issue (the fake request failed and was not restarted even though the OSC was recovering); I will try to create an updated patch.

Thanks!

Comment by Kit Westneat (Inactive) [ 31/Oct/13 ]

I think it was stock 2.1.5; I will see if I can find any patches.

Any updates on creating the patch? The customer is getting impatient. When can I tell them to expect the fix?

Thanks,
Kit

Comment by Hongchao Zhang [ 01/Nov/13 ]

the patch is under test and will be attached soon

Comment by Hongchao Zhang [ 01/Nov/13 ]

the new patch is tracked at http://review.whamcloud.com/#/c/7559/

Comment by Kit Westneat (Inactive) [ 05/Nov/13 ]

Hi Hongchao,

Still getting errors:

Nov  4 21:33:56 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.4@o2ib. The ost_create operation failed with -107
Nov  4 21:33:56 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Nov  4 21:33:56 pfscn2 kernel: : LustreError: 9244:0:(layout.c:1659:__req_capsule_get()) @@@ Wrong buffer for field `ost_body' (1 of 1) in format `OST_CREATE': 0 vs. 208 (server)
Nov  4 21:33:56 pfscn2 kernel: : req@ffff8808092fa000 x1450802648479850/t0(0) o5->pfscdat2-OST0000-osc-MDT0000@172.26.17.4@o2ib:7/4 lens 400/192 e 0 to 0 dl 1383597243 ref 1 fl Interpret:RN/0/0 rc -107/-107
Nov  4 21:33:56 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.4@o2ib. The ost_connect operation failed with -19
Nov  4 21:33:56 pfscn2 kernel: : LustreError: 9244:0:(osc_create.c:177:osc_interpret_create()) @@@ Unknown rc -107 from async create: failing oscc  req@ffff8808092fa000 x1450802648479850/t0(0) o5->pfscdat2-OST0000-osc-MDT0000@172.26.17.4@o2ib:7/4 lens 400/192 e 0 to 0 dl 1383597243 ref 1 fl Interpret:RN/0/0 rc -107/-107
Nov  4 21:33:56 pfscn2 kernel: : LustreError: 12037:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x1377d sub-object on OST idx 0/1: rc = -5
Nov  4 21:34:48 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.3@o2ib)

Are you able to reproduce the error? Basically we are doing it by having 8 clients writing small files, then failing over an OSS.

Comment by Kit Westneat (Inactive) [ 05/Nov/13 ]

Do you think that patchset 4 would fix this issue? It actually seems as if we are hitting 2 separate issues depending on where in the code path the OSS fails.

Comment by Kit Westneat (Inactive) [ 05/Nov/13 ]

With both patchsets 4 and 5 applied, the issue seems to be solved:

LustreError: 11-0: an error occurred while communicating with 172.26.17.4@o2ib. The ost_create operation failed with -107
Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 9266:0:(layout.c:1659:__req_capsule_get()) @@@ Wrong buffer for field `ost_body' (1 of 1) in format `OST_CREATE': 0 vs. 208 (server)
  req@ffff880808005800 x1450889175112924/t0(0) o5->pfscdat2-OST0000-osc-MDT0000@172.26.17.4@o2ib:7/4 lens 400/192 e 0 to 0 dl 1383677590 ref 1 fl Interpret:RN/0/0 rc -107/-107
LustreError: 11-0: an error occurred while communicating with 172.26.17.4@o2ib. The ost_connect operation failed with -19
LustreError: 9266:0:(osc_create.c:177:osc_interpret_create()) @@@ Unknown rc -107 from async create: failing oscc  req@ffff880808005800 x1450889175112924/t0(0) o5->pfscdat2-OST0000-osc-MDT0000@172.26.17.4@o2ib:7/4 lens 400/192 e 0 to 0 dl 1383677590 ref 1 fl Interpret:RN/0/0 rc -107/-107
Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.3@o2ib)
Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
LustreError: 11-0: an error occurred while communicating with 172.26.17.3@o2ib. The ost_statfs operation failed with -107
Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.4@o2ib)
Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
LustreError: 11-0: an error occurred while communicating with 172.26.17.4@o2ib. The ost_statfs operation failed with -107
LustreError: Skipped 1 previous similar message
Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 9266:0:(osc_create.c:177:osc_interpret_create()) @@@ Unknown rc -19 from async create: failing oscc  req@ffff881001a70000 x1450889176260853/t0(0) o5->pfscdat2-OST0000-osc-MDT0000@172.26.17.4@o2ib:7/4 lens 400/400 e 0 to 0 dl 1383678377 ref 1 fl Interpret:RN/0/0 rc -19/-19
Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.3@o2ib)
Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
LustreError: 11-0: an error occurred while communicating with 172.26.17.3@o2ib. The obd_ping operation failed with -107
LustreError: Skipped 2 previous similar messages
Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.4@o2ib)
Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans

This error message looks somewhat concerning:

LustreError: 9266:0:(layout.c:1659:__req_capsule_get()) @@@ Wrong buffer for field `ost_body' (1 of 1) in format `OST_CREATE': 0 vs. 208 (server)

Do you know if this should cause issues?

Comment by Hongchao Zhang [ 06/Nov/13 ]

This error message should not cause issues; it only means that there was no "ost_body" in the reply message (server side) of the failed ost_create request, and 208 is just the expected size of "ost_body":

    if (field->rmf_flags & RMF_F_STRUCT_ARRAY) {
        /*
         * We've already asserted that field->rmf_size > 0 in
         * req_layout_init().
         */
        len = lustre_msg_buflen(msg, offset);
        if ((len % field->rmf_size) != 0) {
            CERROR("%s: array field size mismatch "
                   "%d modulo %d != 0 (%d)\n",
                   field->rmf_name, len, field->rmf_size, loc);
            return NULL;
        }
    } else if (pill->rc_area[loc][offset] != -1) {
        len = pill->rc_area[loc][offset];
    } else {
        len = max(field->rmf_size, 0);
    }
    value = getter(msg, offset, len);

    if (value == NULL) {
        DEBUG_REQ(D_ERROR, pill->rc_req,
                  "Wrong buffer for field `%s' (%d of %d) "
                  "in format `%s': %d vs. %d (%s)\n",
                  field->rmf_name, offset, lustre_msg_bufcount(msg),
                  fmt->rf_name, lustre_msg_buflen(msg, offset), len,
                  rcl_names[loc]);

Yes, the two patches should be combined, and the patch has been updated.
Thanks

Comment by Kit Westneat (Inactive) [ 06/Nov/13 ]

Would it be possible to suppress the warning message in this case? I think it is going to confuse sysadmins.

Comment by Hongchao Zhang [ 07/Nov/13 ]

Okay, the patch has been updated to suppress this error message if the corresponding request failed.

Thanks

Comment by Kit Westneat (Inactive) [ 07/Nov/13 ]

Perfect! I confirmed with the customer that this patch fixes their issue, so once this lands, I guess we can close the ticket.

Comment by Kit Westneat (Inactive) [ 13/Nov/13 ]

Hi Hongchao,

Do you think something like this patch is necessary on master? The osc appears to be a lot different in that version.

Thanks,
Kit

Comment by Hongchao Zhang [ 16/Nov/13 ]

Hi Kit,

On master, the wait for object precreation is handled in "osp_declare_object_create"/"osp_precreate_reserve"/"osp_precreate_ready_condition", which already covers the recovering case:

static int osp_precreate_ready_condition(const struct lu_env *env,
                                         struct osp_device *d)
{
        if (d->opd_pre_recovering)
                return 0;

        /* ready if got enough precreated objects */
        /* we need to wait for others (opd_pre_reserved) and our object (+1) */
        if (d->opd_pre_reserved + 1 < osp_objs_precreated(env, d))
                return 1;

        /* ready if OST reported no space and no destroys in progress */
        if (d->opd_syn_changes + d->opd_syn_rpc_in_progress == 0 &&
            d->opd_pre_status == -ENOSPC)
                return 1;

        /* Bail out I/O fails to OST */
        if (d->opd_pre_status == -EIO)
                return 1;

        return 0;
}