[LU-12198] lnetctl peer show hangs for ~2600 clients, ioctl getting E2BIG Created: 18/Apr/19 Updated: 22/Jun/20 Resolved: 25/Feb/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.5 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Ruth Klundt (Inactive) | Assignee: | Dominique Martinet (Inactive) |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Environment: |
x86 servers, 2.12 no patches, RHEL 7.6 |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
The command `lnetctl peer show` appears to hang; strace shows it looping on:
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x65, 0x64, 0xb8), 0x7fffffffccf0) = -1 E2BIG (Argument list too long)
There are 2605 lines in /sys/kernel/debug/lnet/peers. |
| Comments |
| Comment by Sonia Sharma (Inactive) [ 19/Apr/19 ] |
|
Yes, this is because we have a limit of 1000 peers when allocating the buffer used to get the peer list:

    count = 1000;
    size = count * sizeof(struct lnet_process_id);
    list = malloc(size);
    [...]
    LIBCFS_IOC_INIT_V2(peer_info, prcfg_hdr);
    peer_info.prcfg_hdr.ioc_len = sizeof(peer_info);
    peer_info.prcfg_size = size;
    peer_info.prcfg_bulk = list;

    l_errno = 0;
    rc = l_ioctl(LNET_DEV_ID, IOC_LIBCFS_GET_PEER_LIST,
                 &peer_info);
    count = peer_info.prcfg_count;
    if (rc == 0)
        break;
    l_errno = errno;
|
| Comment by Ruth Klundt (Inactive) [ 22/Apr/19 ] |
|
FYI, a bit further down the code appears to retry with the count and size returned from the ioctl. Those values must not be getting across correctly, though, because the call loops indefinitely. |
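To illustrate the retry behaviour described in this comment, here is a minimal userspace sketch, not the actual lnetctl code. It assumes a hypothetical `fake_get_peer_list()` in place of the real ioctl, a hypothetical `struct fake_process_id` in place of `struct lnet_process_id` (the real layout may differ), and an attempt cap so the demonstration terminates. When the simulated call reports the needed count on E2BIG, the second attempt succeeds; when the count never comes back, every retry reuses the same 1000-entry buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct lnet_process_id; real layout may differ. */
struct fake_process_id {
    unsigned long long nid;
    unsigned int pid;
};

/* Hypothetical stand-in for the ioctl argument (only the fields used here). */
struct fake_peer_list_req {
    size_t size;   /* in: buffer size in bytes */
    size_t count;  /* in: entries offered; out: entries needed or returned */
    void *bulk;    /* in: caller's buffer */
};

#define TOTAL_PEERS 2605  /* peer count taken from the ticket description */

/*
 * Simulated ioctl (an assumption, not the real kernel path). If the buffer is
 * too small it fails with E2BIG and only updates req->count when
 * reports_needed_count is non-zero.
 */
static int fake_get_peer_list(struct fake_peer_list_req *req,
                              int reports_needed_count)
{
    size_t capacity = req->size / sizeof(struct fake_process_id);

    if (capacity < TOTAL_PEERS) {
        if (reports_needed_count)
            req->count = TOTAL_PEERS;
        errno = E2BIG;
        return -1;
    }
    req->count = TOTAL_PEERS;
    return 0;
}

/*
 * Retry loop: start with room for 1000 entries and resize from the count the
 * call hands back. Returns the number of attempts used on success, 0 if
 * max_attempts was exhausted (the "hang"), or -1 on any other error.
 */
int fetch_peer_list(int reports_needed_count, int max_attempts)
{
    size_t count = 1000;

    assert(max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        size_t size = count * sizeof(struct fake_process_id);
        void *list = malloc(size);
        if (list == NULL)
            return -1;

        struct fake_peer_list_req req = {
            .size = size, .count = count, .bulk = list,
        };
        int rc = fake_get_peer_list(&req, reports_needed_count);
        int saved_errno = errno;

        count = req.count;  /* stays at 1000 if nothing was reported back */
        free(list);
        if (rc == 0)
            return attempt;
        if (saved_errno != E2BIG)
            return -1;
    }
    return 0;
}
```

In this sketch `fetch_peer_list(1, 10)` succeeds on the second attempt, while `fetch_peer_list(0, 10)` uses up all ten attempts; without the cap it would loop forever, which matches the strace output above.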
| Comment by Mahmoud Hanafi [ 05/Sep/19 ] |
|
We are hitting this issue on our routers. `lnetctl peer show` hangs and strace shows: ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x65, 0x64, 0xb8), 0x7fffffffe6b0) = -1 E2BIG (Argument list too long) |
| Comment by Dominique Martinet (Inactive) [ 13/Feb/20 ] |
|
The problem is that the ioctl header (hdr) in the kernel is not copied back to userspace on error. When lnet_get_peer_list writes the required size to *sizep, the value stays in the kernel copy and never reaches lnetctl, so lnetctl cannot grow its buffer. |
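The copy-back behaviour described here can be modelled in plain C. The sketch below is an assumption-laden illustration, not the kernel code or the content of the patch: `struct fake_hdr`, `fake_handler()` and `fake_dispatch()` are hypothetical names, `memcpy` stands in for copy_from_user/copy_to_user, and 16 bytes per peer entry is an assumed size. It shows that if the header is copied out only on success, the caller still sees its original size after E2BIG.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the ioctl header plus its size field. */
struct fake_hdr {
    unsigned int ioc_len;
    unsigned int size;  /* in: bytes offered; out: bytes required */
};

/* 2605 peers (from the ticket) times an assumed 16 bytes per entry. */
#define REQUIRED_BYTES (2605u * 16u)

/* Stand-in for the handler: always records the required size in the kernel copy. */
static int fake_handler(struct fake_hdr *khdr)
{
    unsigned int offered = khdr->size;

    khdr->size = REQUIRED_BYTES;
    return offered < REQUIRED_BYTES ? -E2BIG : 0;
}

/*
 * Stand-in for the ioctl dispatch path: copy the header in, run the handler,
 * copy the header out. With copy_back_on_e2big == 0 the header is copied out
 * only on success, which models the behaviour described in the comment.
 */
int fake_dispatch(struct fake_hdr *uhdr, int copy_back_on_e2big)
{
    struct fake_hdr khdr;
    int rc;

    memcpy(&khdr, uhdr, sizeof(khdr));      /* models copy_from_user */
    rc = fake_handler(&khdr);
    if (rc == 0 || (rc == -E2BIG && copy_back_on_e2big))
        memcpy(uhdr, &khdr, sizeof(khdr));  /* models copy_to_user */
    return rc;
}

/* Size the caller sees after offering room for 1000 assumed-16-byte entries. */
unsigned int size_seen_after_e2big(int copy_back_on_e2big)
{
    struct fake_hdr uhdr = { .ioc_len = sizeof(uhdr), .size = 1000u * 16u };
    int rc = fake_dispatch(&uhdr, copy_back_on_e2big);

    assert(rc == -E2BIG);
    return uhdr.size;
}
```

With `copy_back_on_e2big` set to 0 the caller still sees 16000 bytes and has nothing to resize from; with it set to 1 the caller sees 41680 bytes and can allocate a large enough buffer for the next call. Copying the header back on E2BIG is one possible way to model a fix and is not necessarily what the merged patch does.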
| Comment by Gerrit Updater [ 13/Feb/20 ] |
|
Dominique Martinet (dominique.martinet@cea.fr) uploaded a new patch: https://review.whamcloud.com/37559 |
| Comment by Gerrit Updater [ 25/Feb/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37559/ |
| Comment by Peter Jones [ 25/Feb/20 ] |
|
Landed for 2.14 |
| Comment by Gerrit Updater [ 25/Feb/20 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37720 |
| Comment by Mahmoud Hanafi [ 30/Mar/20 ] |
|
It would be nice if this could land in the next 2.12.x. |
| Comment by Gerrit Updater [ 06/Apr/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37720/ |
| Comment by Joe Grund [ 22/Jun/20 ] |
|
I'm seeing something similar to this when calling both lnetctl export and lnetctl peer show. These commands hang, and running strace shows the above. |
| Comment by James A Simmons [ 22/Jun/20 ] |
|
Is this with 2.14? I suspect we are reaching the limits of using ioctls. |