[LU-6407] acceptor_000 runs at 100% all the time Created: 27/Mar/15 Updated: 27/Apr/15 Resolved: 27/Apr/15 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | John Hammond | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | lnet | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Using 2.7.51, after I run llmount.sh I see acceptor_000 running at 100% all the time:
top - 11:29:59 up 1 min, 2 users, load average: 0.71, 0.19, 0.06
Tasks: 298 total, 2 running, 296 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 25.1%sy, 0.0%ni, 74.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 3901240k total, 596948k used, 3304292k free, 25188k buffers
Swap: 0k total, 0k used, 0k free, 229524k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2335 root 20 0 0 0 0 R 100.0 0.0 0:33.71 acceptor_000
2278 root 20 0 15164 1352 908 R 0.7 0.0 0:00.28 top
1 root 20 0 19352 1500 1188 S 0.0 0.0 0:00.85 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.03 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.08 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
...
I crashed the machine and got a backtrace:
crash> bt
PID: 27520 TASK: ffff8800c0fa0580 CPU: 2 COMMAND: "acceptor_000"
#0 [ffff88002c407e30] crash_nmi_callback at ffffffff8103054d
#1 [ffff88002c407e50] notifier_call_chain at ffffffff81559e45
#2 [ffff88002c407e90] __atomic_notifier_call_chain at ffffffff81559edc
#3 [ffff88002c407ee0] atomic_notifier_call_chain at ffffffff81559f26
#4 [ffff88002c407ef0] notify_die at ffffffff810a57be
#5 [ffff88002c407f20] do_nmi at ffffffff815576a3
#6 [ffff88002c407f50] nmi at ffffffff815571f0
[exception RIP: check_poison_obj+80]
RIP: ffffffff811840a0 RSP: ffff880012479bf0 RFLAGS: 00000293
RAX: 000000000000006b RBX: 0000000000000124 RCX: ffffffff8146c68f
RDX: 000000000000006b RSI: ffff8800aa5d4568 RDI: ffff88011dd81500
RBP: ffff880012479c40 R8: 0000000000000000 R9: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000510 R14: ffff8800aa5d4570 R15: 000000000000050f
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#7 [ffff880012479bf0] check_poison_obj at ffffffff811840a0
#8 [ffff880012479c48] cache_alloc_debugcheck_after at ffffffff8118439c
#9 [ffff880012479c88] kmem_cache_alloc at ffffffff81187806
#10 [ffff880012479cd8] sock_alloc_inode at ffffffff8146c68f
#11 [ffff880012479cf8] alloc_inode at ffffffff811c0cf7
#12 [ffff880012479d18] new_inode at ffffffff811c19fb
#13 [ffff880012479d48] sock_alloc at ffffffff8146d389
#14 [ffff880012479d58] sock_create_lite at ffffffff8146dca5
#15 [ffff880012479da8] lnet_sock_accept at ffffffffa0b07e86 [lnet]
#16 [ffff880012479e08] lnet_acceptor at ffffffffa0b1a9b7 [lnet]
#17 [ffff880012479eb8] kthread at ffffffff8109e856
#18 [ffff880012479f48] kernel_thread at ffffffff8100c30a
|
| Comments |
| Comment by Isaac Huang (Inactive) [ 27/Mar/15 ] |
|
It might have something to do with the recent dynamic acceptor start/stop work. Otherwise the acceptor mechanism hasn't changed for years. Also, why is the thread named "acceptor_000"? There can be at most one acceptor thread per host, so it could simply be named "acceptor". |
| Comment by John Hammond [ 27/Mar/15 ] |
|
James, it looks like libcfs_sock_accept() had a set_current_state(TASK_INTERRUPTIBLE) that lnet_sock_accept() does not. See http://review.whamcloud.com/#/c/13760/9..10/lnet/lnet/lib-socket.c. |
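To illustrate why the missing set_current_state(TASK_INTERRUPTIBLE) would produce a spinning thread, here is a toy user-space model in C. It is not the Lustre or kernel code: every name in it (`struct model`, `model_schedule`, `model_accept`, `model_sock_accept`, `acceptor_cpu_ticks`) is hypothetical, and the "clock ticks" are an assumption made for the sketch. It only models the general rule that a task calling schedule() while still marked runnable gets the CPU back almost immediately, whereas a task marked interruptible sleeps until it is woken.

```c
#include <stdbool.h>

/* Toy model only: none of these names exist in Lustre or the kernel. */
enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE };

struct model {
	enum task_state state; /* stand-in for the current task's state */
	int pending_at;        /* tick at which a connection arrives */
	int now;               /* simulated clock, in ticks */
	long cpu_ticks;        /* how often the thread ran the accept path */
};

/* Stand-in for schedule(): a runnable caller just yields for one tick,
 * an interruptible caller sleeps until the connection wakes it. */
static void model_schedule(struct model *m)
{
	if (m->state == TASK_INTERRUPTIBLE) {
		if (m->now < m->pending_at)
			m->now = m->pending_at; /* slept, no CPU used */
		m->state = TASK_RUNNING;        /* wakeup makes it runnable */
	} else {
		m->now++;                       /* yielded, then ran again */
	}
}

/* Stand-in for a non-blocking accept: 0 on success, -1 for "try again". */
static int model_accept(struct model *m)
{
	m->cpu_ticks++;
	return m->now >= m->pending_at ? 0 : -1;
}

/* Accept once, wait via model_schedule() if nothing is ready, retry once. */
static int model_sock_accept(struct model *m, bool set_state)
{
	int rc;

	if (set_state)
		m->state = TASK_INTERRUPTIBLE;

	rc = model_accept(m);
	if (rc != 0) {
		model_schedule(m);
		rc = model_accept(m);
	}

	m->state = TASK_RUNNING;
	return rc;
}

/* Acceptor loop: retry until a connection is accepted and report how
 * many times the accept path had to run to get there. */
long acceptor_cpu_ticks(int arrival_tick, bool set_state)
{
	struct model m = { TASK_RUNNING, arrival_tick, 0, 0 };

	while (model_sock_accept(&m, set_state) != 0)
		; /* retry, as an acceptor loop would on "try again" */

	return m.cpu_ticks;
}
```

In this model, `acceptor_cpu_ticks(1000, true)` returns 2 (one failed attempt, one sleep, one successful attempt), while `acceptor_cpu_ticks(1000, false)` returns 2000 because the loop keeps polling until the connection arrives. The second case corresponds to the behaviour reported in this ticket.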
| Comment by John Hammond [ 27/Mar/15 ] |
|
Or "lnet_acceptor" so that it doesn't sound like part of sendmail. |
| Comment by Peter Jones [ 27/Mar/15 ] |
|
Amir, could you please look into this issue? Thanks, Peter |
| Comment by Oleg Drokin [ 29/Mar/15 ] |
|
Hm, I see this too, actually. |
| Comment by Gerrit Updater [ 30/Mar/15 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/14265 |
| Comment by James A Simmons [ 30/Mar/15 ] |
|
Oh, that is my bad. That call got removed because I was trying to move to kernel_accept(), which didn't work. When I reverted that code, I missed putting the set_current_state() back. |
| Comment by Isaac Huang (Inactive) [ 31/Mar/15 ] |
|
It looks like commit c8fd9c3c accidentally changed the acceptor thread name from "acceptor_%03d", accept_port to "acceptor_%03ld", secure. I'd suggest restoring the original name and adding an lnet_ prefix, as John suggested. |
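A small C sketch of the two format strings quoted above shows how the name changes. The helper names `name_from_port` and `name_from_secure` are hypothetical, and the values passed in the usage note are assumptions: 988 is assumed here as the acceptor port, and 0 is the value of `secure` implied by the observed thread name "acceptor_000".

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: the original name, built from the acceptor port. */
void name_from_port(char *buf, size_t len, int accept_port)
{
	snprintf(buf, len, "acceptor_%03d", accept_port);
}

/* Hypothetical helper: the name after the change, built from the
 * `secure` flag instead of the port. */
void name_from_secure(char *buf, size_t len, long secure)
{
	snprintf(buf, len, "acceptor_%03ld", secure);
}
```

With a 32-byte buffer, `name_from_port(buf, sizeof(buf), 988)` yields "acceptor_988", while `name_from_secure(buf, sizeof(buf), 0)` yields "acceptor_000", the name seen in the `top` output in the description.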
| Comment by Andrew Zenk [ 07/Apr/15 ] |
|
I'm seeing the same issue here on the jenkins CentOS 6.6 inkernel build #2963. Build #2962 does not show the same issue. These are the changes listed for build 2963:
|
| Comment by Gerrit Updater [ 26/Apr/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14265/ |