[LU-990]  (lov_request.c:690:lov_update_create_set()) error creating fid xxx rc=-107 Created: 13/Jan/12  Updated: 06/Feb/14  Resolved: 06/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Zhenyu Xu
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

lustre-1.8.3


Severity: 3
Rank (Obsolete): 6838

 Description   

The customer saw the following error messages on the MDS while 128 clients were creating many small files (roughly 1 million files). They then also saw Input/output errors on some of these files when they opened them.

(lov_request.c:690:lov_update_create_set()) error creating fid xxx rc=-107

As far as we can see in the log files on the OSSs, there are many slow IO messages at the same time.

-107 = -ENOTCONN; does this indicate a connection loss between server and client? (MDS to OSS as well?)

Please advise.
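
For reference, the errno in question is defined in the Linux headers as:

#define ENOTCONN 107 /* Transport endpoint is not connected */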



 Comments   
Comment by Peter Jones [ 13/Jan/12 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 13/Jan/12 ]

I saw several different kinds of error messages in the logs:

On OS1, there are many network error messages:

Dec 21 00:45:02 os1 kernel: LustreError: 11686:0:(o2iblnd_cb.c:1232:kiblnd_connect_peer()) Can't resolve addr for 172.20.1.23@o2ib: -101
Dec 21 00:45:02 os1 kernel: LustreError: 11683:0:(o2iblnd_cb.c:1232:kiblnd_connect_peer()) Can't resolve addr for 172.20.1.28@o2ib: -101
Dec 21 00:46:44 os1 kernel: LustreError: 11639:0:(o2iblnd_cb.c:1232:kiblnd_connect_peer()) Can't resolve addr for 172.20.1.26@o2ib: -101
Dec 21 00:46:45 os1 kernel: LustreError: 11608:0:(o2iblnd_cb.c:1232:kiblnd_connect_peer()) Can't resolve addr for 172.20.1.25@o2ib: -101
Dec 21 00:48:52 os1 kernel: LustreError: 11876:0:(o2iblnd_cb.c:1232:kiblnd_connect_peer()) Can't resolve addr for 172.20.1.22@o2ib: -101
Dec 21 00:48:56 os1 kernel: LustreError: 11855:0:(o2iblnd_cb.c:1232:kiblnd_connect_peer()) Can't resolve addr for 172.20.1.24@o2ib: -101
Dec 21 02:44:43 os1 kernel: LustreError: 11909:0:(o2iblnd_cb.c:1232:kiblnd_connect_peer()) Can't resolve addr for 172.20.1.1@o2ib: -101
...

#define ENETUNREACH 101 /* Network is unreachable */

On OS2:

Dec 22 00:54:22 os2 kernel: LustreError: 11568:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.20.1.61@o2ib: -113
Dec 22 00:54:22 os2 kernel: LustreError: 8091:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.20.1.59@o2ib: -113
Dec 22 16:19:55 os2 kernel: LustreError: 11667:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.20.1.56@o2ib: -113
...

#define EHOSTUNREACH 113 /* No route to host */
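
All three errno values seen so far can be decoded the same way on any of the Linux nodes; a minimal illustrative C sketch (not from the original logs):

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* errno values from the MDS and OSS logs above */
	int codes[] = { 107, 101, 113 };
	int i;

	for (i = 0; i < 3; i++)
		printf("-%d = %s\n", codes[i], strerror(codes[i]));
	return 0;
}

/* prints:
 * -107 = Transport endpoint is not connected
 * -101 = Network is unreachable
 * -113 = No route to host
 */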

Are there network problems at the site?

Comment by Shuichi Ihara (Inactive) [ 13/Jan/12 ]

Thanks. We asked about the network issue, but they said there are no error messages on the InfiniBand side.
Let me ask again, though. They did run a large job which uses a lot of CPU resources, so the clients might have been slow to respond or taken a long time on anything.
Even if there were network issues, we don't see any eviction messages, so the clients were not evicted completely.

Any advice on preventing this issue if they run the large job again?

Comment by Zhenyu Xu [ 13/Jan/12 ]

Without a debug log I don't know what the exact problem is, though I do see many heavy IO load warning messages in the syslogs.

Please check out LU-952; there is a discussion and a patch for a high-load issue. Disabling the read cache and the writethrough cache may help with this issue; if not, please collect a Lustre debug log and upload it.
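
For example, something along these lines (a sketch assuming the stock 1.8.x obdfilter tunables; adjust target names as needed):

# on each OSS: disable the read cache and writethrough cache (see LU-952)
lctl set_param obdfilter.*.read_cache_enable=0
lctl set_param obdfilter.*.writethrough_cache_enable=0

# if the errors recur: enable full debugging and dump the kernel debug buffer
lctl set_param debug=-1
lctl dk /tmp/lustre-debug.log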

Comment by Shuichi Ihara (Inactive) [ 06/Feb/14 ]

We haven't seen the same issue much since then. Please close this issue.

Comment by Peter Jones [ 06/Feb/14 ]

ok - thanks Ihara!
