[LU-990] (lov_request.c:690:lov_update_create_set()) error creating fid xxx rc=-107 Created: 13/Jan/12 Updated: 06/Feb/14 Resolved: 06/Feb/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: | lustre-1.8.3 |
| Severity: | 3 |
| Rank (Obsolete): | 6838 |
| Description |
|
The customer saw the following error messages on the MDS while 128 clients were creating many small files (roughly 1 million files). They also got Input/output errors on some of these files when opening them.

(lov_request.c:690:lov_update_create_set()) error creating fid xxx rc=-107

Looking at the log files on the OSSs, there are also many slow I/O messages at the same time.

-107 = -ENOTCONN; does this indicate a connection loss between client and server (and between MDS and OSS as well)? Please advise. |
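For reference, the rc values in these console messages are ordinary kernel errno codes, negated. A minimal userspace sketch (plain C, not Lustre code) that decodes the three codes seen in this ticket:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    /* The rc values in the Lustre console messages are negated errno codes. */
    int main(void)
    {
        const int codes[] = { ENOTCONN, ENETUNREACH, EHOSTUNREACH };
        const char *names[] = { "ENOTCONN", "ENETUNREACH", "EHOSTUNREACH" };

        for (int i = 0; i < 3; i++)
            printf("rc=-%d  %-12s  %s\n", codes[i], names[i], strerror(codes[i]));
        return 0;
    }

On Linux this prints -107 as ENOTCONN ("Transport endpoint is not connected"), -101 as ENETUNREACH and -113 as EHOSTUNREACH, matching the values quoted in the comments below.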
| Comments |
| Comment by Peter Jones [ 13/Jan/12 ] |
|
Bobijam, could you please look into this one? Thanks. Peter |
| Comment by Zhenyu Xu [ 13/Jan/12 ] |
|
I saw several different kinds of error messages in the logs. On OS1, there are many network error messages:

Dec 21 00:45:02 os1 kernel: LustreError: 11686:0:(o2iblnd_cb.c:1232:kiblnd_connect_peer()) Can't resolve addr for 172.20.1.23@o2ib: -101
Dec 21 00:45:02 os1 kernel: LustreError: 11683:0:(o2iblnd_cb.c:1232:kiblnd_connect_peer()) Can't resolve addr for 172.20.1.28@o2ib: -101
Dec 21 00:46:44 os1 kernel: LustreError: 11639:0:(o2iblnd_cb.c:1232:kiblnd_connect_peer()) Can't resolve addr for 172.20.1.26@o2ib: -101
Dec 21 00:46:45 os1 kernel: LustreError: 11608:0:(o2iblnd_cb.c:1232:kiblnd_connect_peer()) Can't resolve addr for 172.20.1.25@o2ib: -101
Dec 21 00:48:52 os1 kernel: LustreError: 11876:0:(o2iblnd_cb.c:1232:kiblnd_connect_peer()) Can't resolve addr for 172.20.1.22@o2ib: -101
Dec 21 00:48:56 os1 kernel: LustreError: 11855:0:(o2iblnd_cb.c:1232:kiblnd_connect_peer()) Can't resolve addr for 172.20.1.24@o2ib: -101
Dec 21 02:44:43 os1 kernel: LustreError: 11909:0:(o2iblnd_cb.c:1232:kiblnd_connect_peer()) Can't resolve addr for 172.20.1.1@o2ib: -101
...
#define ENETUNREACH 101 /* Network is unreachable */

On OS2:

Dec 22 00:54:22 os2 kernel: LustreError: 11568:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.20.1.61@o2ib: -113
Dec 22 00:54:22 os2 kernel: LustreError: 8091:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.20.1.59@o2ib: -113
Dec 22 16:19:55 os2 kernel: LustreError: 11667:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.20.1.56@o2ib: -113
...
#define EHOSTUNREACH 113 /* No route to host */

Are there network problems at the site? |
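Since ENETUNREACH and EHOSTUNREACH also surface from ordinary sockets when IP routing to the IPoIB interface is broken, one sanity check from the OSS is a plain connect() probe against the peer addresses from the logs. This is only a hypothetical diagnostic sketch (the port below is illustrative), not part of Lustre:

    #include <arpa/inet.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Hypothetical reachability probe for a peer IPoIB address from the
     * OS1/OS2 logs above.  ENETUNREACH (101) or EHOSTUNREACH (113) from
     * connect() would point at an IP-level routing problem on the o2ib
     * fabric; ECONNREFUSED or success means the host is reachable. */
    static void probe(const char *ip, int port)
    {
        struct sockaddr_in sa = { 0 };
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
            return;
        sa.sin_family = AF_INET;
        sa.sin_port   = htons(port);
        if (inet_pton(AF_INET, ip, &sa.sin_addr) == 1) {
            if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
                printf("%s: connect: %s (errno %d)\n", ip, strerror(errno), errno);
            else
                printf("%s: reachable\n", ip);
        }
        close(fd);
    }

    int main(void)
    {
        probe("172.20.1.23", 988);   /* address from the log; port is illustrative */
        return 0;
    }

If lctl is available on the node, "lctl ping <nid>" exercises the same path through LNet itself rather than plain TCP/IP.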
| Comment by Shuichi Ihara (Inactive) [ 13/Jan/12 ] |
|
Thanks. We asked about the network issue, but they said there are no error messages on the InfiniBand side. Any advice on how to prevent this issue if they run the large job again? |
| Comment by Zhenyu Xu [ 13/Jan/12 ] |
|
Without a debug log I don't know what the exact problem is, though I do see many heavy I/O load warning messages in the syslogs. Please check out |
| Comment by Shuichi Ihara (Inactive) [ 06/Feb/14 ] |
|
We have not seen the same issue again. Please close this issue. |
| Comment by Peter Jones [ 06/Feb/14 ] |
|
ok - thanks Ihara! |