[LU-6286] During path failover KMMPD hangs updating mmp Created: 25/Feb/15  Updated: 15/Oct/15  Resolved: 15/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Bruno Faccini (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

kernel 2.6.32-358.23.2.el6


Attachments: File service165.gz    
Severity: 3
Rank (Obsolete): 17618

 Description   

During SRP path failover, kmmpd hangs, which forces a reboot of the OSS.

Full details are in the attached log file (service165.gz).

The path failed at 14:46:50:

Feb 24 14:46:50 nbp9-oss5 OpenSM[4146]: SM port is down
Feb 24 14:46:50 nbp9-oss5 OpenSM[4146]: Entering DISCOVERING state
Feb 24 14:47:02 nbp9-oss5 run_srp_daemon[95911]: failed srp_daemon: [HCA=mlx4_1] [port=1] [exit status=110]. Will try to restart srp_daemon periodically. No more warnings will be issued in the next 7200 seconds if the same problem repeats
Feb 24 14:47:10 nbp9-oss5 run_srp_daemon[95917]: starting srp_daemon: [HCA=mlx4_1] [port=1]
Feb 24 14:47:18 nbp9-oss5 kernel: scsi host12: ib_srp: failed receive status 5
Feb 24 14:47:18 nbp9-oss5 kernel: scsi host12: ib_srp: failed receive status 5
.....
Feb 24 14:49:07 nbp9-oss5 kernel: INFO: task kmmpd-dm-20:20927 blocked for more than 120 seconds.
Feb 24 14:49:07 nbp9-oss5 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 24 14:49:10 nbp9-oss5 kernel: kmmpd-dm-20   D 0000000000000000     0 20927      2 0x00000080
Feb 24 14:49:10 nbp9-oss5 kernel: ffff880aad5f7d20 0000000000000046 0000000000000000 ffffffffa001740c
Feb 24 14:49:10 nbp9-oss5 kernel: ffff880301b415c0 0000000000000008 0000000000007030 000000000fd00014
Feb 24 14:49:10 nbp9-oss5 kernel: ffff880aad45faf8 ffff880aad5f7fd8 000000000000fc40 ffff880aad45faf8
Feb 24 14:49:10 nbp9-oss5 kernel: Call Trace:
Feb 24 14:49:10 nbp9-oss5 kernel: [<ffffffffa001740c>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
Feb 24 14:49:10 nbp9-oss5 kernel: [<ffffffff811b2c60>] ? sync_buffer+0x0/0x50
Feb 24 14:49:10 nbp9-oss5 kernel: [<ffffffff8153fe63>] io_schedule+0x73/0xc0
Feb 24 14:49:10 nbp9-oss5 kernel: [<ffffffff811b2ca0>] sync_buffer+0x40/0x50
Feb 24 14:49:10 nbp9-oss5 kernel: [<ffffffff8154081f>] __wait_on_bit+0x5f/0x90
Feb 24 14:49:11 nbp9-oss5 kernel: [<ffffffff811b2c60>] ? sync_buffer+0x0/0x50
Feb 24 14:49:11 nbp9-oss5 kernel: [<ffffffff815408c8>] out_of_line_wait_on_bit+0x78/0x90
Feb 24 14:49:11 nbp9-oss5 kernel: [<ffffffff81096350>] ? wake_bit_function+0x0/0x50
Feb 24 14:49:11 nbp9-oss5 kernel: [<ffffffff811b2c56>] __wait_on_buffer+0x26/0x30
Feb 24 14:49:11 nbp9-oss5 kernel: [<ffffffffa0c9d40a>] write_mmp_block+0x5a/0x80 [ldiskfs]
Feb 24 14:49:11 nbp9-oss5 kernel: [<ffffffffa0c9d955>] kmmpd+0x1a5/0x3b0 [ldiskfs]
Feb 24 14:49:11 nbp9-oss5 kernel: [<ffffffffa0c9d7b0>] ? kmmpd+0x0/0x3b0 [ldiskfs]
Feb 24 14:49:11 nbp9-oss5 kernel: [<ffffffff81095fa6>] kthread+0x96/0xa0
Feb 24 14:49:11 nbp9-oss5 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Feb 24 14:49:11 nbp9-oss5 kernel: [<ffffffff81095f10>] ? kthread+0x0/0xa0
Feb 24 14:49:11 nbp9-oss5 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
....


 Comments   
Comment by Bruno Faccini (Inactive) [ 26/Feb/15 ]

The full syslog you provided for the node indicates that the IB/SRP path and devices never recovered, so the hung stacks subsequently reported for the kmmpd and OST threads should only be normal consequences of that.
This looks like an IB/SRP driver or device problem (software, hardware, or firmware) in the I/O path, upstream of the Lustre layers.
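
To make the timing argument concrete, here is a minimal sketch (not part of Lustre or of any tool used on this ticket) that reads syslog text, finds the first "SM port is down" line, and works out the latest time each hung task can have started blocking by subtracting the reported hung-task timeout from the report timestamp. The function name `timeline`, the sample constant, and the assumed year 2015 (syslog lines carry no year) are illustrative assumptions.

```python
import re
from datetime import datetime, timedelta

# Three lines copied from the description; the real input would be the
# full syslog from the attachment.
SAMPLE = """\
Feb 24 14:46:50 nbp9-oss5 OpenSM[4146]: SM port is down
Feb 24 14:47:18 nbp9-oss5 kernel: scsi host12: ib_srp: failed receive status 5
Feb 24 14:49:07 nbp9-oss5 kernel: INFO: task kmmpd-dm-20:20927 blocked for more than 120 seconds.
"""

HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")


def timeline(text, year=2015):
    """Return (path_down, hung) where hung is a list of
    (task_name, reported_at, blocked_since_at_latest) tuples."""
    path_down = None
    hung = []
    for line in text.splitlines():
        try:
            # Syslog timestamps have no year, so one is supplied (assumption).
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading syslog timestamp
        if path_down is None and "SM port is down" in line:
            path_down = ts
        match = HUNG_RE.search(line)
        if match:
            timeout = timedelta(seconds=int(match.group(3)))
            hung.append((match.group(1), ts, ts - timeout))
    return path_down, hung


if __name__ == "__main__":
    down, tasks = timeline(SAMPLE)
    for name, reported, since in tasks:
        gap = (since - down).total_seconds()
        print(f"{name}: reported {reported:%H:%M:%S}, blocked since "
              f"{since:%H:%M:%S} at the latest ({gap:.0f}s after path down)")
```

On the sample lines, this shows kmmpd-dm-20 blocked since 14:47:07 at the latest, that is, within 17 seconds of the SM port going down at 14:46:50, which is consistent with the hang being a consequence of the path failure.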

Comment by Peter Jones [ 15/Oct/15 ]

As per NASA, OK to close.

Generated at Sat Feb 10 01:58:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.