[LU-967] OSS hangs due to heavey IO loads Created: 06/Jan/12  Updated: 16/Jan/12  Resolved: 16/Jan/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None
Environment:

lustre-1.8.4ddn2.2, RHEL5.5


Attachments: File t2s007053.messages.gz     File t2s007059_messages.gz    
Severity: 3
Rank (Obsolete): 6496

 Description   

please let me know if there are filed similar bugs or situation on here.



 Comments   
Comment by Andreas Dilger [ 06/Jan/12 ]

Ihara, it would be more useful if the bug report actually contained some more information, instead of just an attachment. Otherwise, it is much more difficult to find this and similar bugs in the future since there is nothing to search for.

Comment by Shuichi Ihara (Inactive) [ 06/Jan/12 ]

Andreas, sorry about that.
We saw OSS hangs several times, the attachement is the latest log file when we saw last OSS hangs at Jan/3/2012. As far as I could see the log files, there many slow xxx due to heavy IO load messges, then finally OS dumpes call traces below. After that, OSS was going to be slow down.

Jan 3 04:31:07 t2s007059 kernel: [1751926.394880] Lustre: 16341:0:(client.c:1482:ptlrpc_expire_one_request()) @@@ Request x1388084806175732 sent from work0-OST0030 to NID 10.1.7.17@o2ib 7s ago has timed out (7s prior to deadline).
Jan 3 04:31:07 t2s007059 kernel: [1751926.394882] req@ffff810583a24000 x1388084806175732/t0 o601->@:15/16 lens 224/416 e 0 to 1 dl 1325532667 ref 2 fl Rpc:N/0/0 rc 0/0
Jan 3 04:31:07 t2s007059 kernel: [1751926.400092] LustreError: 16341:0:(quota_context.c:699:dqacq_completion()) acquire qunit got error! (rc:-107)
Jan 3 04:31:08 t2s007059 kernel: [1751927.401587] Lustre: 724:0:(quota_interface.c:460:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 1 retries (err=0, rc=-107)
Jan 3 04:31:10 t2s007059 kernel: [1751929.441959] BUG: soft lockup - CPU#7 stuck for 10s! [kiblnd_sd_00:16258]
Jan 3 04:31:10 t2s007059 kernel: [1751929.443208] CPU 7:
Jan 3 04:31:10 t2s007059 kernel: [1751929.443716] Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) jbd2(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) dm_round_robin(U) sg(U) autofs4(U) lockd(U) sunrpc(U) ib_iser(U) libiscsi2(U) scsi_transport_iscsi2(U) scsi_transport_iscsi(U) ib_srp(U) rds(U) ib_sdp(U) ib_ipoib(U) ipoib_helper(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) rdma_ucm(U) rdma_cm(U) ib_ucm(U) ib_uverbs(U) ib_umad(U) ib_cm(U) iw_cm(U) ib_addr(U) ib_sa(U) loop(U) dm_multipath(U) scsi_dh(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) i2c_core(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) mlx4_ib(U) ib_mad(U) ib_core(U) mlx4_en(U) mlx4_core(U) bnx2(U) hpilo(U) serio_raw(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) shpchp(U) cciss(U) sd_mod
Jan 3 04:31:10 t2s007059 kernel: U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Jan 3 04:31:10 t2s007059 kernel: [1751929.463727] Pid: 16258, comm: kiblnd_sd_00 Tainted: G 2.6.18-194.17.1.el5_lustre.1.8.4.ddn2.2 #1
Jan 3 04:31:10 t2s007059 kernel: [1751929.465428] RIP: 0010:[<ffffffff8006f217>] [<ffffffff8006f217>] __write_lock_failed+0xf/0x20
Jan 3 04:31:10 t2s007059 kernel: [1751929.467061] RSP: 0018:ffff81061460d7c8 EFLAGS: 00000287
Jan 3 04:31:10 t2s007059 kernel: [1751929.468135] RAX: 0000000000006000 RBX: ffff81061460d7d0 RCX: ffff8103b0726000
Jan 3 04:31:10 t2s007059 kernel: [1751929.469470] RDX: ffff8104aa27a2c0 RSI: ffff81061fffe140 RDI: ffffffff8038ebcc
Jan 3 04:31:10 t2s007059 kernel: [1751929.470802] RBP: ffffffff80068c1b R08: 00000000ffffffff R09: 0000000000000000
Jan 3 04:31:10 t2s007059 kernel: [1751929.472141] R10: 000000000000003c R11: 0000000000000000 R12: ffff81061460d780
Jan 3 04:31:10 t2s007059 kernel: [1751929.473473] R13: ffffffff801910ec R14: ffff81061460d730 R15: ffff81061fe43400
Jan 3 04:31:10 t2s007059 kernel: [1751929.474806] FS: 00002b84181b9240(0000) GS:ffff81061fe1fbc0(0000) knlGS:0000000000000000
Jan 3 04:31:10 t2s007059 kernel: [1751929.476313] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 3 04:31:10 t2s007059 kernel: [1751929.477404] CR2: 000000001c587158 CR3: 0000000000201000 CR4: 00000000000006e0
Jan 3 04:31:10 t2s007059 kernel: [1751929.478807]
Jan 3 04:31:10 t2s007059 kernel: [1751929.478807] Call Trace:
Jan 3 04:31:10 t2s007059 kernel: [1751929.479610] [<ffffffff80072252>] _write_lock+0x12/0x20
Jan 3 04:31:10 t2s007059 kernel: [1751929.480607] [<ffffffff800f2b1a>] __get_vm_area_node+0xba/0x1b0
Jan 3 04:31:10 t2s007059 kernel: [1751929.481731] [<ffffffff800f2c30>] get_vm_area_node+0x20/0x30
Jan 3 04:31:10 t2s007059 kernel: [1751929.482799] [<ffffffff800f3438>] __vmalloc_node+0x48/0x70
Jan 3 04:31:10 t2s007059 kernel: [1751929.483845] [<ffffffff800f348e>] __vmalloc+0xe/0x10
Jan 3 04:31:10 t2s007059 kernel: [1751929.484797] [<ffffffff800f3538>] vmalloc+0x18/0x20
Jan 3 04:31:10 t2s007059 kernel: [1751929.485742] [<ffffffff8861e3c9>] :libcfs:cfs_alloc_large+0x9/0x10
Jan 3 04:31:10 t2s007059 kernel: [1751929.486980] [<ffffffff88809ede>] :ko2iblnd:kiblnd_create_tx_pool+0x95e/0x1020
Jan 3 04:31:10 t2s007059 kernel: [1751929.488329] [<ffffffff8880684b>] :ko2iblnd:kiblnd_pool_alloc_node+0x18b/0x260
Jan 3 04:31:10 t2s007059 kernel: [1751929.489676] [<ffffffff8880f898>] :ko2iblnd:kiblnd_get_idle_tx+0x18/0x1d0
Jan 3 04:31:10 t2s007059 kernel: [1751929.490953] [<ffffffff888106bd>] :ko2iblnd:kiblnd_check_sends+0x28d/0x4a0
Jan 3 04:31:10 t2s007059 kernel: [1751929.492245] [<ffffffff88258a4a>] :mlx4_ib:mlx4_ib_post_recv+0x1ea/0x200
Jan 3 04:31:10 t2s007059 kernel: [1751929.493502] [<ffffffff8881156b>] :ko2iblnd:kiblnd_post_rx+0x2cb/0x2f0
Jan 3 04:31:10 t2s007059 kernel: [1751929.494761] [<ffffffff8866d47d>] :lnet:lnet_enq_event_locked+0x9d/0xd0
Jan 3 04:31:10 t2s007059 kernel: [1751929.496090] [<ffffffff888119de>] :ko2iblnd:kiblnd_recv+0x44e/0x470
Jan 3 04:31:10 t2s007059 kernel: [1751929.497404] [<ffffffff8880ea32>] :ko2iblnd:kiblnd_init_tx_msg+0x152/0x1c0
Jan 3 04:31:10 t2s007059 kernel: [1751929.498779] [<ffffffff8867054f>] :lnet:lnet_ni_recv+0x1ef/0x220
Jan 3 04:31:10 t2s007059 kernel: [1751929.499994] [<ffffffff8867066b>] :lnet:lnet_recv_put+0xeb/0x110
Jan 3 04:31:10 t2s007059 kernel: [1751929.501203] [<ffffffff88674892>] :lnet:lnet_parse+0x1012/0x1ab0
Jan 3 04:31:10 t2s007059 kernel: [1751929.502416] [<ffffffff888104d6>] :ko2iblnd:kiblnd_check_sends+0xa6/0x4a0
Jan 3 04:31:10 t2s007059 kernel: [1751929.503770] [<ffffffff88812027>] :ko2iblnd:kiblnd_handle_rx+0x4c7/0x520
Jan 3 04:31:10 t2s007059 kernel: [1751929.505112] [<ffffffff88808cb2>] :ko2iblnd:kiblnd_unpack_msg+0x372/0x6a0
Jan 3 04:31:10 t2s007059 kernel: [1751929.506470] [<ffffffff88812418>] :ko2iblnd:kiblnd_rx_complete+0x2b8/0x350
Jan 3 04:31:10 t2s007059 kernel: [1751929.507906] [<ffffffff800a1c3d>] default_wake_function+0xd/0x10
Jan 3 04:31:10 t2s007059 kernel: [1751929.509124] [<ffffffff88812581>] :ko2iblnd:kiblnd_complete+0xd1/0xe0
Jan 3 04:31:10 t2s007059 kernel: [1751929.510415] [<ffffffff88818302>] :ko2iblnd:kiblnd_scheduler+0x4f2/0x690
Jan 3 04:31:10 t2s007059 kernel: [1751929.833105] [<ffffffff801910ec>] list_add+0xc/0x10
Jan 3 04:31:10 t2s007059 kernel: [1751929.834140] [<ffffffff800a1c30>] default_wake_function+0x0/0x10
Jan 3 04:31:10 t2s007059 kernel: [1751929.835302] [<ffffffff8006afb1>] child_rip+0xa/0x11
Jan 3 04:31:10 t2s007059 kernel: [1751929.836334] [<ffffffff88817e10>] :ko2iblnd:kiblnd_scheduler+0x0/0x690
Jan 3 04:31:10 t2s007059 kernel: [1751929.837555] [<ffffffff8006afa7>] child_rip+0x0/0x11
Jan 3 04:31:10 t2s007059 kernel: [1751929.838510]
Jan 3 04:31:15 t2s007059 kernel: [1751934.402464] Lustre: 16341:0:(client.c:1482:ptlrpc_expire_one_request()) @@@ Request x1388084806175733 sent from work0-OST0030 to NID 10.1.7.17@o2ib 7s ago has timed out (7s prior to deadline).

Comment by Peter Jones [ 06/Jan/12 ]

Lai

Can you please look into this one?

Thanks

Peter

Comment by Shuichi Ihara (Inactive) [ 08/Jan/12 ]

Might be related to LU-952?

Comment by Lai Siyao [ 11/Jan/12 ]

Hi Shuichi, it looks quite possible to be the same issue of LU-952, since there is a fix for LU-952, could you verify that works?

Comment by Lai Siyao [ 11/Jan/12 ]

If this still happens, it may be the chain reaction of slow disk I/O:
slow disk I/O --> slow journal --> slow memory reclamation --> slow memory allocation --> all operations hung

And under this situation, could you try with read cache disabled?

Comment by Shuichi Ihara (Inactive) [ 11/Jan/12 ]

Hi Lai,

we applied the patch for LU-952 and didn't see issue so far.
I'm attaching the newer log messages on OSS. You can see final reboot time (after attached the patch) on the attached log file below.

Jan 9 14:17:01 t2s007053 kernel: BIOS-provided physical RAM map:
Jan 9 14:17:01 t2s007053 kernel: BIOS-e820: 0000000000010000 - 000000000009f400 (usable)
Jan 9 14:17:01 t2s007053 kernel: BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved)
Jan 9 14:17:01 t2s007053 kernel: BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
Jan 9 14:17:01 t2s007053 kernel: BIOS-e820: 0000000000100000 - 00000000df62f000 (usable)
Jan 9 14:17:01 t2s007053 kernel: BIOS-e820: 00000000df62f000 - 00000000df63c000 (ACPI data)
Jan 9 14:17:01 t2s007053 kernel: BIOS-e820: 00000000df63c000 - 00000000df63d000 (usable)
Jan 9 14:17:01 t2s007053 kernel: BIOS-e820: 00000000df63d000 - 00000000e4000000 (reserved)
Jan 9 14:17:01 t2s007053 kernel: BIOS-e820: 00000000fec00000 - 00000000fee10000 (reserved)
Jan 9 14:17:01 t2s007053 kernel: BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved)
Jan 9 14:17:01 t2s007053 kernel: BIOS-e820: 000000010000000

Could you please have a look at the log file and find out if any concerns are still existing?

Thanks!

Comment by Lai Siyao [ 12/Jan/12 ]

Hmm, the new log shows may operations blocked on IO, but not on memory allocation blocking any more, and in the end the operations can finish.

Comment by Shuichi Ihara (Inactive) [ 16/Jan/12 ]

Lai, still any issues are remained?
we haven't seen issue happens since we applied patch for LU-952.

Comment by Lai Siyao [ 16/Jan/12 ]

This is a duplicate of LU-952, which has been fixed.

Generated at Sat Feb 10 01:12:12 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.