[LU-12198] lnetctl peer show hangs for ~2600 clients, ioctl getting E2BIG Created: 18/Apr/19  Updated: 22/Jun/20  Resolved: 25/Feb/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.14.0, Lustre 2.12.5

Type: Bug Priority: Minor
Reporter: Ruth Klundt (Inactive) Assignee: Dominique Martinet (Inactive)
Resolution: Fixed Votes: 1
Labels: None
Environment:

x86 servers, 2.12 no patches, RHEL 7.6


Issue Links:
Related
is related to LU-9680 Improve the user land to kernel space... In Progress
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The command `lnetctl peer show` appears to hang; strace shows it looping on:

ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x65, 0x64, 0xb8), 0x7fffffffccf0) = -1 E2BIG (Argument list too long)

There are 2605 lines in /sys/kernel/debug/lnet/peers.



 Comments   
Comment by Sonia Sharma (Inactive) [ 19/Apr/19 ]

Yes, this is because we have a limit of 1000 peers when allocating the buffer to get the peer list:

        count = 1000;
        size = count * sizeof(struct lnet_process_id);
        list = malloc(size);
        ...
                        LIBCFS_IOC_INIT_V2(peer_info, prcfg_hdr);
                        peer_info.prcfg_hdr.ioc_len = sizeof(peer_info);
                        peer_info.prcfg_size = size;
                        peer_info.prcfg_bulk = list;

                        l_errno = 0;
                        rc = l_ioctl(LNET_DEV_ID, IOC_LIBCFS_GET_PEER_LIST,
                                     &peer_info);
                        count = peer_info.prcfg_count;
                        if (rc == 0)
                                break;
                        l_errno = errno;

Comment by Ruth Klundt (Inactive) [ 22/Apr/19 ]

FYI, a bit further down, the code appears to retry with the count and size returned from the ioctl. Those values must not be getting across correctly, though, because the call loops indefinitely.
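
The retry behaviour described here can be illustrated with a small userspace sketch. It is not the lnetctl source: `struct peer_list_req`, `fetch_peer_list` and the two mock callbacks are hypothetical names, and the 16-byte entry size is an arbitrary assumption. The sketch shows a grow-and-retry loop that depends on the callee reporting the required size, and how a check for "no progress" turns an endless E2BIG loop into an error.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical request; stands in for the peer-list ioctl argument. */
struct peer_list_req {
	size_t size;	/* in: buffer bytes; out: bytes required */
	void *bulk;	/* caller-allocated buffer */
};

/* Hypothetical ioctl-like callback: 0 on success, -1 with errno set. */
typedef int (*peer_list_fn)(struct peer_list_req *req);

/* Assumed size for the demo: 2605 entries of an arbitrary 16 bytes each. */
#define MOCK_NEEDED ((size_t)2605 * 16)

/* Mock that reports the required size on E2BIG (the intended behaviour). */
int mock_reports_size(struct peer_list_req *req)
{
	int too_small = req->size < MOCK_NEEDED;

	req->size = MOCK_NEEDED;
	if (too_small) {
		errno = E2BIG;
		return -1;
	}
	return 0;
}

/* Mock that fails with E2BIG but never updates the size (the symptom). */
int mock_drops_size(struct peer_list_req *req)
{
	(void)req;
	errno = E2BIG;
	return -1;
}

/*
 * Grow-and-retry. On success returns 0 and hands the buffer to the
 * caller. If E2BIG comes back without a larger size, fail with EPROTO
 * instead of retrying with the same size forever.
 */
int fetch_peer_list(peer_list_fn call, size_t initial_size,
		    void **out, size_t *out_size)
{
	size_t size = initial_size;

	for (;;) {
		struct peer_list_req req = { .size = size, .bulk = malloc(size) };
		int saved;

		if (req.bulk == NULL) {
			errno = ENOMEM;
			return -1;
		}
		if (call(&req) == 0) {
			*out = req.bulk;
			*out_size = req.size;
			return 0;
		}
		saved = errno;
		free(req.bulk);
		if (saved != E2BIG) {
			errno = saved;
			return -1;
		}
		if (req.size <= size) {	/* no progress: size never came back */
			errno = EPROTO;
			return -1;
		}
		size = req.size;
	}
}
```

Calling `fetch_peer_list(mock_reports_size, 1000 * 16, &buf, &len)` succeeds on the second attempt, while the same call with `mock_drops_size` fails with `EPROTO` rather than spinning. Without the no-progress check, the second case would loop just as the strace output suggests.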

Comment by Mahmoud Hanafi [ 05/Sep/19 ]

We are hitting this issue on our routers.

 lnetctl peer show

will hang, and strace shows

 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x65, 0x64, 0xb8), 0x7fffffffe6b0) = -1 E2BIG (Argument list too long)

Comment by Dominique Martinet (Inactive) [ 13/Feb/20 ]

The problem is that the ioctl header (hdr) in the kernel is not copied back to userspace on error, so when lnet_get_peer_list writes the required size to *sizep, the value stays in the kernel and never reaches lnetctl, which therefore cannot grow the buffer.
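
A minimal userspace simulation of this failure mode follows. It is a sketch under stated assumptions, not the libcfs code: `struct demo_hdr`, `demo_get_list` and `demo_ioctl` are hypothetical names, and `memcpy` stands in for the kernel's `copy_from_user()` and `copy_to_user()`. The `always_copy_back` flag contrasts copying the header back only on success with copying it back unconditionally, which is the behaviour the patch subject describes.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical header; not the real libcfs ioctl header layout. */
struct demo_hdr {
	size_t size;	/* in: caller's buffer size; out: size required */
};

/* Hypothetical handler: rejects small buffers and records the need. */
static int demo_get_list(struct demo_hdr *khdr, size_t needed)
{
	int rc = khdr->size < needed ? -E2BIG : 0;

	khdr->size = needed;	/* written to the kernel-side copy only */
	return rc;
}

/*
 * Simulated dispatch: copy the header in, run the handler, copy it out.
 * With always_copy_back == 0 the header is returned only on success,
 * so a caller that gets E2BIG never sees the required size.
 */
int demo_ioctl(struct demo_hdr *uhdr, size_t needed, int always_copy_back)
{
	struct demo_hdr khdr;
	int rc;

	memcpy(&khdr, uhdr, sizeof(khdr));	/* stands in for copy_from_user() */
	rc = demo_get_list(&khdr, needed);
	if (rc == 0 || always_copy_back)
		memcpy(uhdr, &khdr, sizeof(khdr));	/* stands in for copy_to_user() */
	return rc;
}
```

Starting from a header with `size = 1000` and `needed = 2605`, the call with `always_copy_back = 0` returns `-E2BIG` and leaves `size` at 1000, so a caller has nothing to grow toward. With `always_copy_back = 1` the same error comes back with `size` set to 2605, and a retry with that size succeeds.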

Comment by Gerrit Updater [ 13/Feb/20 ]

Dominique Martinet (dominique.martinet@cea.fr) uploaded a new patch: https://review.whamcloud.com/37559
Subject: LU-12198 libcfs: always copy ioctl header back to user
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 818fc691a0e29e5764bfcd65d2a1918c5369fe7c

Comment by Gerrit Updater [ 25/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37559/
Subject: LU-12198 libcfs: always copy ioctl header back to user
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9e02ef474f8caa833d6a1b5e0068d5323a57e8c4

Comment by Peter Jones [ 25/Feb/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 25/Feb/20 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37720
Subject: LU-12198 libcfs: always copy ioctl header back to user
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 321a757880445089a48d26acfe0554853750ca3f

Comment by Mahmoud Hanafi [ 30/Mar/20 ]

It would be nice if this could land in the next 2.12.x.

Comment by Gerrit Updater [ 06/Apr/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37720/
Subject: LU-12198 libcfs: always copy ioctl header back to user
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: a3c687a943233a7c5ae7e3fb906d1913b063c95c

Comment by Joe Grund [ 22/Jun/20 ]

I'm seeing something similar to this while calling both

lnetctl export

and

lnetctl peer show

These commands hang and running strace shows the above.

Comment by James A Simmons [ 22/Jun/20 ]

Is this with 2.14?  I suspect we are reaching the limits of using ioctls.

Generated at Sat Feb 10 02:50:29 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.