[LU-893] system hang when running recovery-mds-scale FLAVOR=OSS Created: 02/Dec/11  Updated: 03/Oct/19  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Sarah Liu
Assignee: WC Triage
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Environment:

lustre-master build #353 RHEL6-x86_64 for both server and client


Issue Links:
Related
is related to LU-885 recovery-mds-scale (FLAVOR=mds) fail,... Resolved
Severity: 3
Rank (Obsolete): 10260

 Description   

When running recovery-mds-scale FLAVOR=OSS with quota enabled and HARD failure mode, the console log shows that the network on one of the OSS nodes comes up, but after a while the node can no longer be accessed. After the node is rebooted, it is usable again.

==== Checking the clients loads AFTER failover – failure NOT OK
ost6 has failed over 1 times, and counting...
sleeping 421 seconds ...
==== Checking the clients loads BEFORE failover – failure NOT OK ELAPSED=179 DURATION=86400 PERIOD=600
Wait ost4 recovery complete before doing next failover ....
affected facets: ost1,ost2,ost3,ost4,ost5,ost6
client-12: *.lustre-OST0000.recovery_status status: INACTIVE
client-12: *.lustre-OST0001.recovery_status status: COMPLETE
client-12: *.lustre-OST0002.recovery_status status: INACTIVE
client-12: *.lustre-OST0003.recovery_status status: COMPLETE
client-12: *.lustre-OST0004.recovery_status status: INACTIVE
client-12: *.lustre-OST0005.recovery_status status: COMPLETE
Checking clients are in FULL state before doing next failover
client-13: osc.lustre-OST0000-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-13: cannot run remote command on client-13,client-17,client-18 with
client-13: osc.lustre-OST0001-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-18: osc.lustre-OST0000-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-13: cannot run remote command on client-13,client-17,client-18 with
client-18: cannot run remote command on client-13,client-17,client-18 with
client-13: osc.lustre-OST0002-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-18: osc.lustre-OST0001-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-13: cannot run remote command on client-13,client-17,client-18 with
client-13: osc.lustre-OST0003-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-18: cannot run remote command on client-13,client-17,client-18 with
client-13: cannot run remote command on client-13,client-17,client-18 with
client-17: osc.lustre-OST0000-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-13: osc.lustre-OST0004-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-18: osc.lustre-OST0002-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-13: cannot run remote command on client-13,client-17,client-18 with
client-17: cannot run remote command on client-13,client-17,client-18 with
client-18: cannot run remote command on client-13,client-17,client-18 with
client-13: osc.lustre-OST0005-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-17: osc.lustre-OST0001-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-18: osc.lustre-OST0003-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-13: cannot run remote command on client-13,client-17,client-18 with
client-17: cannot run remote command on client-13,client-17,client-18 with
client-18: cannot run remote command on client-13,client-17,client-18 with
client-17: osc.lustre-OST0002-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-18: osc.lustre-OST0004-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-17: cannot run remote command on client-13,client-17,client-18 with
client-18: cannot run remote command on client-13,client-17,client-18 with
client-17: osc.lustre-OST0003-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-18: osc.lustre-OST0005-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-17: cannot run remote command on client-13,client-17,client-18 with
client-18: cannot run remote command on client-13,client-17,client-18 with
client-17: osc.lustre-OST0004-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-17: cannot run remote command on client-13,client-17,client-18 with
client-17: osc.lustre-OST0005-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-17: cannot run remote command on client-13,client-17,client-18 with
Starting failover on ost4
Failing ost4 on node client-12
+ pm -h powerman --off client-12
Command completed successfully
affected facets: ost1,ost2,ost3,ost4,ost5,ost6
+ pm -h powerman --on client-12
Command completed successfully
Failover ost1 to fat-amd-2
Failover ost2 to fat-amd-2
Failover ost3 to fat-amd-2
Failover ost4 to fat-amd-2
Failover ost5 to fat-amd-2
Failover ost6 to fat-amd-2
15:04:41 (1322867081) waiting for fat-amd-2 network 900 secs ...
15:04:41 (1322867081) network interface is UP
Starting ost1: /dev/disk/by-id/scsi-1IET_00020001 /mnt/ost1
fat-amd-2: debug=0xb3f0405
fat-amd-2: subsystem_debug=0xffb7efff
fat-amd-2: debug_mb=48
Started lustre-OST0000
Starting ost2: /dev/disk/by-id/scsi-1IET_00030001 /mnt/ost2
-------------------------------------------------------------------

PING fat-amd-2.lab.whamcloud.com (10.10.4.133) 56(84) bytes of data.
From brent.lab.whamcloud.com (10.10.0.1) icmp_seq=1 Destination Host Unreachable
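
The failover line `15:04:41 (1322867081)` in the log above pairs a local wall-clock time with a Unix epoch value. When correlating this log with console output from other nodes, it can help to convert the epoch value to UTC. The following is a minimal sketch; the helper name `epoch_to_utc` is hypothetical and not part of the test framework.

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch_seconds: int) -> str:
    """Format a Unix epoch value (seconds) as a UTC timestamp string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


# Epoch value copied from the "waiting for fat-amd-2 network" log line.
print(epoch_to_utc(1322867081))  # 2011-12-02 23:04:41 UTC
```

The result is eight hours ahead of the 15:04:41 shown in the log, which suggests the test node's clock was set to UTC-8.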



 Comments   
Comment by Oleg Drokin [ 03/Jan/12 ]

Is it possible to check the console of this node to see what happened to the network?

With no Maloo report, I guess it is also impossible to see the console logs.

Comment by Sarah Liu [ 03/Jan/12 ]

I will try to reproduce this bug to see if I can get more information, and will keep you updated.

Comment by Sarah Liu [ 04/Jan/12 ]

I reran this test on https://newbuild.whamcloud.com/job/lustre-master/376/ (RHEL6).
1. When I used IB, I got this error on one of the OSS nodes:

Loading ib_core.ko module
usb 5-3: New USB device found, idVendor=0557, idProduct=2221
usb 5-3: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 5-3: Product: Hermon USB hidmouse Device
usb 5-3: Manufacturer: Winbond Electronics Corp
usb 5-3: configuration #1 chosen from 1 choice
Loading mlx4_core.ko module
input: Winbond Electronics Corp Hermon USB hidmouse Device as /devices/pci0000:00/0000:00:13.0/usb5/5-3/5-3:1.0/input/input3
generic-usb 0003:0557:2221.0001: input,hidraw0: USB HID v1.00 Mouse [Winbond Electronics Corp Hermon USB hidmouse Device] on usb-0000:00:13.0-3/input0
mlx4_core: Mellanox ConnectX core driver v0.01 (May 1, 2007)
mlx4_core: Initializing 0000:04:00.0
mlx4_core 0000:04:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
input: Winbond Electronics Corp Hermon USB hidmouse Device as /devices/pci0000:00/0000:00:13.0/usb5/5-3/5-3:1.1/input/input4
generic-usb 0003:0557:2221.0002: input,hidraw1: USB HID v1.00 Keyboard [Winbond Electronics Corp Hermon USB hidmouse Device] on usb-0000:00:13.0-3/input1
work_for_cpu invoked oom-killer: gfp_mask=0x80d0, order=0, oom_adj=0
work_for_cpu cpuset=/ mems_allowed=0
Pid: 116, comm: work_for_cpu Not tainted 2.6.32-131.17.1.el6_lustre.g2e85b73.x86_64 #1
Call Trace:
[<ffffffff810c00f1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff811102bb>] ? oom_kill_process+0xcb/0x2e0
[<ffffffff81110880>] ? select_bad_process+0xd0/0x110
[<ffffffff81110918>] ? __out_of_memory+0x58/0xc0
[<ffffffff81110b19>] ? out_of_memory+0x199/0x210
[<ffffffff81120262>] ? __alloc_pages_nodemask+0x812/0x8b0
[<ffffffff81010e86>] ? dma_generic_alloc_coherent+0xa6/0x160
[<ffffffffa00ace99>] ? mlx4_create_eq+0x139/0x6b0 [mlx4_core]
[<ffffffffa00ad5f9>] ? mlx4_init_eq_table+0x1e9/0x560 [mlx4_core]
[<ffffffffa00b24d0>] ? mlx4_setup_hca+0xa0/0x5c0 [mlx4_core]
[<ffffffffa00b3045>] ? __mlx4_init_one+0x2f5/0x880 [mlx4_core]
[<ffffffff81088ce0>] ? do_work_for_cpu+0x0/0x30
[<ffffffffa00b815f>] ? mlx4_init_one+0x42/0x47 [mlx4_core]
[<ffffffff81281087>] ? local_pci_probe+0x17/0x20
[<ffffffff81088cf8>] ? do_work_for_cpu+0x18/0x30
[<ffffffff8108de16>] ? kthread+0x96/0xa0
[<ffffffff810886d0>] ? worker_thread+0x0/0x2a0
[<ffffffff8100c1ca>] ? child_rip+0xa/0x20
[<ffffffff8108dd80>] ? kthread+0x0/0xa0
[<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 42, btch: 7 usd: 18
active_anon:24 inactive_anon:32 isolated_anon:0
active_file:288 inactive_file:0 isolated_file:0
unevictable:2985 dirty:0 writeback:0 unstable:0
free:403 slab_reclaimable:991 slab_unreclaimable:4105
mapped:50 shmem:0 pagetables:9 bounce:0
Node 0 DMA free:188kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:312kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 126 126 126
Node 0 DMA32 free:1424kB min:1436kB low:1792kB high:2152kB active_anon:96kB inactive_anon:128kB active_file:1152kB inactive_file:0kB unevictable:11940kB isolated(anon):0kB isolated(file):0kB present:129504kB mlocked:0kB dirty:0kB writeback:0kB mapped:200kB shmem:0kB slab_reclaimable:3964kB slab_unreclaimable:16420kB kernel_stack:304kB pagetables:36kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:3019 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 1*8kB 1*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 188kB
Node 0 DMA32: 0*4kB 0*8kB 1*16kB 0*32kB 0*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1424kB
3275 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
41195 pages RAM
12945 pages reserved
74 pages shared
20269 pages non-shared
Out of memory: kill process 107 (insmod) score 20 or a child
Killed process 107 (insmod) vsz:1296kB, anon-rss:192kB, file-rss:116kB
INFO: task insmod:107 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
insmod D 0000000000000000 0 107 1 0x00100004
ffff880009e7fb78 0000000000000082 ffff880009e7fb08 ffffffff8105055a
ffff880009e7fb08 ffff8800095e4ab8 0000000000000000 ffff880002a15f80
ffff880009d10638 ffff880009e7ffd8 000000000000f598 ffff880009d10638
Call Trace:
[<ffffffff8105055a>] ? enqueue_entity+0x13a/0x340
[<ffffffff81270ccc>] ? __bitmap_weight+0x8c/0xb0
[<ffffffff814dc035>] schedule_timeout+0x215/0x2e0
[<ffffffff8105055a>] ? enqueue_entity+0x13a/0x340
[<ffffffff814dbcb3>] wait_for_common+0x123/0x180
[<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
[<ffffffff814dbdcd>] wait_for_completion+0x1d/0x20
[<ffffffff81088a5e>] work_on_cpu+0xae/0xd0
[<ffffffff81281070>] ? local_pci_probe+0x0/0x20
[<ffffffff814dc6ae>] ? mutex_lock+0x1e/0x50
[<ffffffff8128223b>] pci_device_probe+0xcb/0x120
[<ffffffff8133bb12>] ? driver_sysfs_add+0x62/0x90
[<ffffffff8133bcb0>] driver_probe_device+0xa0/0x2a0
[<ffffffff8133bf5b>] __driver_attach+0xab/0xb0
[<ffffffff8133beb0>] ? __driver_attach+0x0/0xb0
[<ffffffff8133af14>] bus_for_each_dev+0x64/0x90
[<ffffffff8133ba4e>] driver_attach+0x1e/0x20
[<ffffffff8133b350>] bus_add_driver+0x200/0x300
[<ffffffff8133c286>] driver_register+0x76/0x140
[<ffffffff810899a8>] ? __create_workqueue_key+0x1e8/0x280
[<ffffffff812824d6>] __pci_register_driver+0x56/0xd0
[<ffffffffa00c3031>] ? mlx4_init+0x0/0xbf [mlx4_core]
[<ffffffffa00c3031>] ? mlx4_init+0x0/0xbf [mlx4_core]
[<ffffffffa00c30b2>] mlx4_init+0x81/0xbf [mlx4_core]
[<ffffffff8100204c>] do_one_initcall+0x3c/0x1d0
[<ffffffff810aca7f>] sys_init_module+0xdf/0x250
[<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
[The identical "INFO: task insmod:107 blocked for more than 120 seconds." report, with the same call trace, repeats four more times.]
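
The figures in the Mem-Info dump above can be checked with a little arithmetic. The sketch below assumes 4 kB pages (the smallest entry in the buddy lists is 4kB) and works on an abbreviated copy of the DMA32 zone line. The helper `zone_fields` is hypothetical, written only for this illustration.

```python
import re

# Abbreviated copy of the "Node 0 DMA32" line from the Mem-Info dump;
# only the fields used below are kept.
DMA32_LINE = "Node 0 DMA32 free:1424kB min:1436kB low:1792kB high:2152kB present:129504kB"

PAGE_KB = 4  # assumption: 4 kB pages, matching the smallest buddy-list entry


def zone_fields(line: str) -> dict:
    """Collect every 'name:<number>kB' field of a zone line as {name: kB}."""
    return {name: int(kb) for name, kb in re.findall(r"(\w+):(\d+)kB", line)}


zone = zone_fields(DMA32_LINE)
below_min = zone["free"] < zone["min"]  # 1424 kB versus 1436 kB
total_ram_kb = 41195 * PAGE_KB          # from "41195 pages RAM"
free_kb = 403 * PAGE_KB                 # from "free:403" (pages, all zones)

print(f"DMA32 free below min watermark: {below_min}")
print(f"total RAM: {total_ram_kb} kB ({total_ram_kb / 1024:.1f} MiB)")
print(f"free memory: {free_kb} kB")
```

Under these assumptions the node reports roughly 161 MiB of RAM in total, and free memory in the DMA32 zone (1424 kB) is below its min watermark (1436 kB), which is consistent with the allocation in mlx4_create_eq ending up in the OOM killer. The computed free total of 1612 kB matches the sum of the two zones (188 kB + 1424 kB).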

2. Then I changed to TCP, but the system still hangs, with these messages on one of the OSS nodes. The OSS cannot even be rebooted with pm.

Lustre: MGC10.10.4.12@tcp: Reactivating import
Lustre: lustre-OST0000: new disk, initializing
Lustre: lustre-OST0000: Now serving lustre-OST0000 on /dev/sdm with recovery enabled
Lustre: 3706:0:(debug.c:326:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
Lustre: 3707:0:(debug.c:326:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
LDISKFS-fs (sdj): mounted filesystem with ordered data mode
LDISKFS-fs (sdj): mounted filesystem with ordered data mode
Lustre: lustre-OST0002: new disk, initializing
Lustre: lustre-OST0002: Now serving lustre-OST0002 on /dev/sdj with recovery enabled
Lustre: 3928:0:(debug.c:326:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
Lustre: 3457:0:(ldlm_lib.c:909:target_handle_connect()) lustre-OST0000: connection from lustre-MDT0000-mdtlov_UUID@10.10.4.12@tcp t0 exp (null) cur 1325730017 last 0
Lustre: 3457:0:(filter.c:2695:filter_connect_internal()) lustre-OST0000: Received MDS connection for group 0
Lustre: 3456:0:(ldlm_lib.c:909:target_handle_connect()) lustre-OST0002: connection from lustre-MDT0000-mdtlov_UUID@10.10.4.12@tcp t0 exp (null) cur 1325730017 last 0
Lustre: 3456:0:(filter.c:2695:filter_connect_internal()) lustre-OST0002: Received MDS connection for group 0
Lustre: import lustre-OST0002->NET_0x200000a0a040c_UUID netid 20000: select flavor null
Lustre: lustre-OST0002: received MDS connection from 10.10.4.12@tcp
Lustre: lustre-OST0000: received MDS connection from 10.10.4.12@tcp
LDISKFS-fs (sdh): mounted filesystem with ordered data mode
LDISKFS-fs (sdh): mounted filesystem with ordered data mode
Lustre: lustre-OST0004: new disk, initializing
Lustre: lustre-OST0004: Now serving lustre-OST0004 on /dev/sdh with recovery enabled
Lustre: 4149:0:(debug.c:326:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
Lustre: 4149:0:(debug.c:326:libcfs_debug_str2mask()) Skipped 1 previous similar message
Lustre: 3457:0:(ldlm_lib.c:909:target_handle_connect()) lustre-OST0000: connection from c59ba0b4-190e-55a2-df32-cfae2f798b2c@10.10.4.133@tcp t0 exp (null) cur 1325730020 last 0
Lustre: import lustre-OST0002->NET_0x200000a0a0485_UUID netid 20000: select flavor null
Lustre: Skipped 1 previous similar message
CLIENT MAC ADDR: 00 25 90 14 4E 48 GUID: 534D4349 0002 1490 2500 14902500484E
CLIENT IP: 10.10.4.132 MASK: 255.255.0.0 DHCP IP: 10.10.0.6
GATEWAY IP: 10.10.0.1
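
The import names in the log above, such as `NET_0x200000a0a040c_UUID`, embed a hexadecimal peer identifier. The sketch below decodes it under the assumption that the value is a 64-bit NID with the network in the upper 32 bits (type in the high 16 bits, network number in the low 16) and an IPv4 address in the lower 32 bits. The mapping of type 2 to `tcp` is also an assumption, chosen because it agrees with the `@tcp` addresses printed elsewhere in the same log. `decode_nid` is a hypothetical helper, not a Lustre utility.

```python
import ipaddress

# Assumed LND type numbers; only type 2 (socket/TCP) appears in this log.
LND_NAMES = {2: "tcp"}


def decode_nid(nid_hex: str) -> str:
    """Decode a hex NID such as '0x200000a0a040c' into 'a.b.c.d@net'."""
    nid = int(nid_hex, 16)
    net = nid >> 32            # upper 32 bits: network
    addr = nid & 0xFFFFFFFF    # lower 32 bits: IPv4 address
    lnd_type = net >> 16       # high half of the network: LND type
    net_num = net & 0xFFFF     # low half of the network: network number
    name = LND_NAMES.get(lnd_type, f"lnd{lnd_type}")
    suffix = f"{name}{net_num}" if net_num else name
    return f"{ipaddress.IPv4Address(addr)}@{suffix}"


# Values copied from the two "import lustre-OST0002->NET_..." lines.
print(decode_nid("0x200000a0a040c"))  # 10.10.4.12@tcp
print(decode_nid("0x200000a0a0485"))  # 10.10.4.133@tcp
```

Under these assumptions the two imports correspond to 10.10.4.12, which the log identifies as the MDS, and 10.10.4.133, the address that fat-amd-2 resolved to in the ping output in the description.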

Since the OSS is completely inaccessible, I cannot get any further useful information. It should be easy to reproduce on TORO if someone wants to investigate it. This is probably a duplicate of LU-885.

Thanks

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.

Generated at Sat Feb 10 05:50:02 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.