[LU-6407] acceptor_000 runs at 100% all the time Created: 27/Mar/15  Updated: 27/Apr/15  Resolved: 27/Apr/15

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug
Priority: Minor
Reporter: John Hammond
Assignee: Amir Shehata (Inactive)
Resolution: Fixed
Votes: 0
Labels: lnet

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

Using 2.7.51, after I run llmount.sh I see acceptor_000 running at 100% CPU all the time.

top - 11:29:59 up 1 min,  2 users,  load average: 0.71, 0.19, 0.06
Tasks: 298 total,   2 running, 296 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us, 25.1%sy,  0.0%ni, 74.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3901240k total,   596948k used,  3304292k free,    25188k buffers
Swap:        0k total,        0k used,        0k free,   229524k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2335 root      20   0     0    0    0 R 100.0  0.0   0:33.71 acceptor_000
 2278 root      20   0 15164 1352  908 R  0.7  0.0   0:00.28 top
    1 root      20   0 19352 1500 1188 S  0.0  0.0   0:00.85 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.03 kthreadd
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.08 migration/0
    4 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
...

I crashed the machine and got a backtrace:

crash> bt
PID: 27520  TASK: ffff8800c0fa0580  CPU: 2   COMMAND: "acceptor_000"
 #0 [ffff88002c407e30] crash_nmi_callback at ffffffff8103054d
 #1 [ffff88002c407e50] notifier_call_chain at ffffffff81559e45
 #2 [ffff88002c407e90] __atomic_notifier_call_chain at ffffffff81559edc
 #3 [ffff88002c407ee0] atomic_notifier_call_chain at ffffffff81559f26
 #4 [ffff88002c407ef0] notify_die at ffffffff810a57be
 #5 [ffff88002c407f20] do_nmi at ffffffff815576a3
 #6 [ffff88002c407f50] nmi at ffffffff815571f0
    [exception RIP: check_poison_obj+80]
    RIP: ffffffff811840a0  RSP: ffff880012479bf0  RFLAGS: 00000293
    RAX: 000000000000006b  RBX: 0000000000000124  RCX: ffffffff8146c68f
    RDX: 000000000000006b  RSI: ffff8800aa5d4568  RDI: ffff88011dd81500
    RBP: ffff880012479c40   R8: 0000000000000000   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000510  R14: ffff8800aa5d4570  R15: 000000000000050f
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #7 [ffff880012479bf0] check_poison_obj at ffffffff811840a0
 #8 [ffff880012479c48] cache_alloc_debugcheck_after at ffffffff8118439c
 #9 [ffff880012479c88] kmem_cache_alloc at ffffffff81187806
#10 [ffff880012479cd8] sock_alloc_inode at ffffffff8146c68f
#11 [ffff880012479cf8] alloc_inode at ffffffff811c0cf7
#12 [ffff880012479d18] new_inode at ffffffff811c19fb
#13 [ffff880012479d48] sock_alloc at ffffffff8146d389
#14 [ffff880012479d58] sock_create_lite at ffffffff8146dca5
#15 [ffff880012479da8] lnet_sock_accept at ffffffffa0b07e86 [lnet]
#16 [ffff880012479e08] lnet_acceptor at ffffffffa0b1a9b7 [lnet]
#17 [ffff880012479eb8] kthread at ffffffff8109e856
#18 [ffff880012479f48] kernel_thread at ffffffff8100c30a


Comments
Comment by Isaac Huang (Inactive) [ 27/Mar/15 ]

It might have something to do with the recent dynamic acceptor start/stop work. Otherwise the acceptor mechanism hasn't changed for years. Also, why is the thread named "acceptor_000"? There can be at most one acceptor thread per host, so it could simply be named "acceptor".

Comment by John Hammond [ 27/Mar/15 ]

James, it looks like libcfs_sock_accept() had a set_current_state(TASK_INTERRUPTIBLE) that lnet_sock_accept() does not. See http://review.whamcloud.com/#/c/13760/9..10/lnet/lnet/lib-socket.c.
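
To illustrate why the missing set_current_state(TASK_INTERRUPTIBLE) would turn the accept loop into a busy loop, here is a minimal user-space sketch. It is a toy model, not the Lustre or kernel code: toy_task, toy_schedule() and toy_wait_for_connection() are hypothetical names, and the model only assumes that a scheduler call returns straight away for a task still marked runnable, while a task marked interruptible is parked until it is woken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model only; none of these names exist in the kernel or LNet. */
enum toy_state { TOY_RUNNING, TOY_INTERRUPTIBLE };

struct toy_task {
	enum toy_state state;
	int wakeup_after;	/* simulated ticks until a connection arrives */
};

/*
 * Model of a scheduler call: a task that is still marked running gets the
 * CPU back after one tick, while a task marked interruptible is parked
 * until the wakeup event and is then marked running again.
 */
static void toy_schedule(struct toy_task *t, int *ticks)
{
	assert(t != NULL && ticks != NULL);

	if (t->state == TOY_RUNNING) {
		(*ticks)++;
		return;
	}
	if (*ticks < t->wakeup_after)
		*ticks = t->wakeup_after;
	t->state = TOY_RUNNING;
}

/* Returns how many passes through the wait loop the task needed. */
static int toy_wait_for_connection(struct toy_task *t, bool set_state)
{
	int ticks = 0;
	int passes = 0;

	while (ticks < t->wakeup_after) {
		if (set_state)
			t->state = TOY_INTERRUPTIBLE;	/* the step that was lost */
		toy_schedule(t, &ticks);
		passes++;
	}
	return passes;
}
```

In this model, a task with wakeup_after = 1000 goes through the loop 1000 times when the state is never set and only once when it is set before each scheduler call, which is consistent with a thread that shows 100% CPU while doing no useful work.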

Comment by John Hammond [ 27/Mar/15 ]

Or "lnet_acceptor" so that it doesn't sound like part of sendmail.

Comment by Peter Jones [ 27/Mar/15 ]

Amir

Could you please look into this issue?

Thanks

Peter

Comment by Oleg Drokin [ 29/Mar/15 ]

Hm, I see this too, actually.

Comment by Gerrit Updater [ 30/Mar/15 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/14265
Subject: LU-6407 lnet: set task state before scheduling
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 54179229716fa2a545c884f093f05d018c6d89f1

Comment by James A Simmons [ 30/Mar/15 ]

Oh, that is my bad. It got removed because I was trying to move to kernel_accept(), which didn't work. When I reverted that code I missed putting the set_current_state() back.

Comment by Isaac Huang (Inactive) [ 31/Mar/15 ]

It looks like commit c8fd9c3c accidentally changed the acceptor thread name from "acceptor_%03d", accept_port to "acceptor_%03ld", secure. I'd suggest restoring the original name and adding an lnet_ prefix, as John suggested.
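
The effect of the two format strings can be reproduced in user space with snprintf(). This is only a sketch: the helper names are hypothetical, the port value 988 used in the usage note is an assumption (the port is not stated in this ticket), and a secure value of 0 is inferred from the observed thread name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: the original naming scheme, keyed on the accept port. */
static void name_from_port(char *buf, size_t len, int accept_port)
{
	assert(buf != NULL && len > 0);
	snprintf(buf, len, "acceptor_%03d", accept_port);
}

/* Hypothetical helper: the naming scheme after the change, keyed on "secure". */
static void name_from_secure(char *buf, size_t len, long secure)
{
	assert(buf != NULL && len > 0);
	snprintf(buf, len, "acceptor_%03ld", secure);
}
```

With a 32-byte buffer, name_from_port(buf, sizeof(buf), 988) yields "acceptor_988", while name_from_secure(buf, sizeof(buf), 0) yields "acceptor_000", the name seen in the top output in the description.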

Comment by Andrew Zenk [ 07/Apr/15 ]

I'm seeing the same issue here on the Jenkins CentOS 6.6 inkernel build #2963. Build #2962 does not show the issue. These are the changes listed for build #2963:

LU-5823 clio: add cl_object_fiemap()
LU-6245 libcfs: remove tcpip abstraction from libcfs
LU-6245 libcfs: move lucache from libcfs to lustre
LU-5757 hsm: strengthen checks for flags and archive id

Comment by Gerrit Updater [ 26/Apr/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14265/
Subject: LU-6407 lnet: set task state before scheduling
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9002fabc81f6cb1c467c5b89548161579fcd48f6

Generated at Sat Feb 10 01:59:57 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.