[LU-8449] OSS crash with oom-killer started Created: 29/Jul/16  Updated: 11/Aug/16  Resolved: 11/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: Alex Zhuravlev
Resolution: Duplicate Votes: 0
Labels: soak
Environment:

lola
build: tip of master, commit 0f37c051158a399f7b00536eeec27f5dbdd54168


Attachments: File abrt-kernel-oops.tar.bz2     File all-lustre-log.tar.bz2     File allocation-per-slab.tar.bz2     File console-lola-5.log.bz2     File lola-2-leak_finder.output.bz2     Text File lola-2-lustre-log.1470213950.128013.bz2     Text File lola-6.timeouts.txt     Text File lola-7.errors.txt     File memory-counter-lola-5-20160728-021116.dat.bz2     File messages-lola-5.log.bz2     File slab-details-counter-lola-5-20160728-021116.dat.bz2     File slab-sorted-alloaction.dat.bz2    
Issue Links:
Related
is related to LU-7899 osd_xattr_set() to batch actual EA up... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

error happened during soaktesting of build '20160727' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160727)
OSTs formatted with zfs, MDSs formatted with ldiskfs
DNE is enabled, HSM/robinhood enable and integrated
4 MDSs with 1 MDT / MDS
6 OSSs with 4 OSTs / OSS
Server nodes configured in active-active HA confguration

Sequence of events:

  • 2016-07-28 08:48:37 - Soak session started
  • 2016-07-28 08:50:34 - First LNet time out:
     
    Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3114:kiblnd_check_txs_locked()) Timed out tx: active_txs, 3 seconds
    Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3177:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib10 (62): c: 0, oc: 0, rc: 8
    Jul 28 08:50:43 lola-5 kernel: Lustre: Skipped 4 previous similar messages
    Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [ll_ost_io00_006:28605]
    Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#2 stuck for 67s! [ll_ost_io00_048:28758]
    

    (see also attached file abrt-kernel-oops.tar.bz2; In total 1545 event records of this form
    had been written till the node crashed)

  • 2016-07-28 08:51:03 - First occurrenance of error below. These error flooded the console after some time. (see console log after entry 'Jul 28 08:45:01 lola-5 TIME: Time stamp for console')
    Jul 28 08:51:03 lola-5 kernel: Pid: 28758, comm: ll_ost_io00_048 Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
    Jul 28 08:51:03 lola-5 kernel: RIP: 0010:[<ffffffff8129e8af>]  [<ffffffff8129e8af>] __write_lock_failed+0xf/0x20
    Jul 28 08:51:03 lola-5 kernel: RSP: 0018:ffff8803c8e2b918  EFLAGS: 00000287
    Jul 28 08:51:03 lola-5 kernel: RAX: 0000000000000000 RBX: ffff8803c8e2b920 RCX: 0000000000000000
    Jul 28 08:51:03 lola-5 kernel: RDX: ffff88044e415a00 RSI: ffff880335d78400 RDI: ffff8803fc143dd8
    Jul 28 08:51:03 lola-5 kernel: RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000
    Jul 28 08:51:03 lola-5 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000008fa9fcc28
    Jul 28 08:51:03 lola-5 kernel: R13: 0000000200000008 R14: ffff8803bac2b0b8 R15: ffffffff810674be
    Jul 28 08:51:03 lola-5 kernel: FS:  0000000000000000(0000) GS:ffff880038640000(0000) knlGS:0000000000000000
    Jul 28 08:51:03 lola-5 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    Jul 28 08:51:03 lola-5 kernel: CR2: 00007f88b9e46000 CR3: 0000000001a8d000 CR4: 00000000000407e0
    Jul 28 08:51:03 lola-5 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    Jul 28 08:51:03 lola-5 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Jul 28 08:51:03 lola-5 kernel: Process ll_ost_io00_048 (pid: 28758, threadinfo ffff8803c8e28000, task ffff8803cb648040)
    
  • 2016-07-28-16 16:02 - oom-killer started (see entry in console) and last mtime update
    for collect data file:
    -rw-r--r-- 1 root root  738427536 Jul 28 16:02 lola-5-20160728-021116.raw.gz
    
  • Node neither accessible via ssh nor console. Node rebooted. No crash dump file was written. (Parameter set_param panic_on_lbug=1 was set).

Attached files:
message, console, and debug message (written inbetween Jul 28, 08:48 - 16:02), abrt-kernel-oops.tar.bz2 (content for single event)
collectl memory and slab counters.

We'll try to trigger a crashdump on the node which will be affected next and increase debug mask. Current debug files don't contain slab information, as far as I could see.



 Comments   
Comment by Frank Heckes (Inactive) [ 29/Jul/16 ]

The top-10-consumers read as:

#Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
slab-details/filp.dat:20160728 16:01:40 filp 86528715 22151351040 86528715 22151351040 5768581 23628107776 5768581 23628107776 19533824 0
slab-details/size-131072.dat:20160728 16:01:40 size-131072 8696 1139802112 8696 1139802112 8696 1139802112 8696 1139802112 0 0
slab-details/dentry.dat:20160728 16:01:40 dentry 2671440 512916480 2671440 512916480 133572 547110912 133572 547110912 26107904 5
slab-details/zio_buf_16384.dat:20160728 16:01:40 zio_buf_16384 18261 299188224 18261 299188224 18261 299188224 18261 299188224 0 0
slab-details/size-1048576.dat:20160728 16:01:40 size-1048576 256 268435456 256 268435456 256 268435456 256 268435456 0 0
slab-details/task_struct.dat:20160728 16:01:40 task_struct 43410 115991520 43410 115991520 14470 118538240 14470 118538240 57344 0
slab-details/zio_cache.dat:20160728 16:01:40 zio_cache 31401 34917912 94507 105091784 13501 110600192 13501 110600192 0 0
slab-details/size-262144.dat:20160728 16:01:40 size-262144 395 103546880 395 103546880 395 103546880 395 103546880 0 0
slab-details/biovec-256.dat:20160728 16:01:40 biovec-256 23042 94380032 23042 94380032 23042 94380032 23042 94380032 0 0
slab-details/size-1024.dat:20160728 16:01:40 size-1024 61913 63398912 61956 63442944 15489 63442944 15489 63442944 -4096 0

see also attached files slab-sorted-alloaction.dat, allocation-per-slab.tar.bz2

Comment by Andreas Dilger [ 29/Jul/16 ]

Could this be related to "LU-7646 lnet: Stop Infinite CON RACE Condition" where the connection fails and reconnecting consumes all the memory?

Comment by Doug Oucharek (Inactive) [ 02/Aug/16 ]

If CON RACEs were involved, we should be seeing some rejection logs from o2iblnd with the reason CON RACE.

Comment by Frank Heckes (Inactive) [ 03/Aug/16 ]

I attached debug logs (and output of leak_finder.pl, most likely not so important) of a debug log (512M, -1) of an OSS node (lola-2) that crashed with the same error. Test session was started at 2016-08-03 01:39 and node crashed very soon with:

Aug  3 01:43:22 lola-2 kernel: LNetError: 9516:0:(o2iblnd_cb.c:3114:kiblnd_check_txs_locked()) Timed out tx: tx_queu
e, 0 seconds
Aug  3 01:43:22 lola-2 kernel: LNetError: 9516:0:(o2iblnd_cb.c:3177:kiblnd_check_conns()) Timed out RDMA with 192.16
8.1.114@o2ib10 (51): c: 0, oc: 0, rc: 7
Aug  3 01:43:22 lola-2 kernel: LustreError: 9516:0:(events.c:449:server_bulk_callback()) event type 3, status -103, desc ffff8803fd339400
Aug  3 01:43:22 lola-2 kernel: LustreError: 9516:0:(events.c:449:server_bulk_callback()) event type 5, status -103, desc ffff88032a550e00
Aug  3 01:43:59 lola-2 kernel: BUG: soft lockup - CPU#0 stuck for 67s! [ll_ost_io00_028:22505]
Aug  3 01:43:59 lola-2 kernel: Modules linked in:

(Btw. 192.168.1.114 is the LNet router. There's no message that can be correlated in the router's syslog nor console log.

Comment by Cliff White (Inactive) [ 05/Aug/16 ]

Downgraded OSS nodes to build lustre-master 3406 - no oom-killer. Run stopped when MDS failed to complete recovery (existing bug)
Upgrading OSS nodes to build 3409 Revision: 9d8469694738288fcd246a1e9f22c47918a32c58
Soak restart Thu Aug 4 19:38:27 PDT 2016
First soft lockup: Aug 4 19:58:26 lola-7 kernel: BUG: soft lockup - CPU#3 stuck for 67s! [ll_ost_io00_031:20580]

Comment by John Hammond [ 05/Aug/16 ]

Do we have stack traces for the soft lockups?

Build 3046 is 4af430de697249d07bcb7fe3401a6caa4415fe31. There have been 56 commits on master from 4af430de697249d07bcb7fe3401a6caa4415fe31 to 0f37c051158a399f7b00536eeec27f5dbdd54168. How long does this take to reproduce?

Comment by John Hammond [ 05/Aug/16 ]

What is kernel.panic set to here? If it's set to 0 then please try sysctl -w kernel.panic=60 to see if we can get the full crash dump message on the console.

Comment by Cliff White (Inactive) [ 05/Aug/16 ]

The soft lockups appear within minutes of starting soak. Once the first one occurs, we get a cascade.

Aug  4 19:58:26 lola-7 kernel: BUG: soft lockup - CPU#3 stuck for 67s! [ll_ost_io00_031:20580]
Aug  4 19:58:26 lola-7 kernel: Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm raid1 dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core joydev i2c_i801 ioatdma ses enclosure sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci wmi isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Aug  4 19:58:26 lola-7 kernel: CPU 3
Aug  4 19:58:26 lola-7 kernel: Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm raid1 dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core joydev i2c_i801 ioatdma ses enclosure sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci wmi isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Aug  4 19:58:26 lola-7 kernel:
Aug  4 19:58:26 lola-7 kernel: Pid: 20580, comm: ll_ost_io00_031 Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
Aug  4 19:58:26 lola-7 kernel: RIP: 0010:[<ffffffff8129e8a9>]  [<ffffffff8129e8a9>] __write_lock_failed+0x9/0x20
Aug  4 19:58:26 lola-7 kernel: RSP: 0018:ffff88030bfb3778  EFLAGS: 00000287
Aug  4 19:58:26 lola-7 kernel: RAX: ffff8802a01ab780 RBX: ffff88030bfb3780 RCX: 0000000000026c0a
Aug  4 19:58:26 lola-7 kernel: RDX: ffff8803c52a8000 RSI: ffff8803c52a8000 RDI: ffff8803a66afb88
Aug  4 19:58:26 lola-7 kernel: RBP: ffffffff8100bc0e R08: ffff8802d28a96c0 R09: 0000000000000008
Aug  4 19:58:26 lola-7 kernel: R10: 0000000a0007d83d R11: 0000000000000000 R12: ffff88030bfb36f0
Aug  4 19:58:26 lola-7 kernel: R13: 0000000000000000 R14: ffffffffa0943036 R15: ffff880360c1f660
Aug  4 19:58:26 lola-7 kernel: FS:  0000000000000000(0000) GS:ffff880038660000(0000) knlGS:0000000000000000
Aug  4 19:58:26 lola-7 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Aug  4 19:58:26 lola-7 kernel: CR2: 0000000001620874 CR3: 0000000001a8d000 CR4: 00000000000407e0
Aug  4 19:58:26 lola-7 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug  4 19:58:26 lola-7 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug  4 19:58:26 lola-7 kernel: Process ll_ost_io00_031 (pid: 20580, threadinfo ffff88030bfb0000, task ffff88030bc72ab0)
Aug  4 19:58:26 lola-7 kernel: Stack:
Aug  4 19:58:26 lola-7 kernel: ffffffff8153d007 ffff88030bfb37b0 ffffffffa0fe2ab0 ffff8802d28a96d0
Aug  4 19:58:26 lola-7 kernel: <d> ffff8803c52a8000 ffff8803a66afad0 ffff88030bfb38d0 ffff88030bfb37e0
Aug  4 19:58:26 lola-7 kernel: <d> ffffffffa0fee79b 0000000000000000 ffff8803f6147180 ffff88030bfb37e0
Aug  4 19:58:26 lola-7 kernel: Call Trace:
Aug  4 19:58:26 lola-7 kernel: [<ffffffff8153d007>] ? _write_lock+0x17/0x20
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0fe2ab0>] ? osd_object_sa_dirty_add+0x40/0x80 [osd_zfs]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0fee79b>] ? __osd_sa_xattr_schedule_update+0x4b/0x160 [osd_zfs]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0ff09af>] ? __osd_sa_xattr_set+0x25f/0x350 [osd_zfs]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0ff0d8c>] ? osd_xattr_set+0x2ec/0x4d0 [osd_zfs]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa090cdb2>] ? dt_version_set+0x72/0x230 [obdclass]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b5a443>] ? tgt_txn_stop_cb+0x2c3/0x1160 [ptlrpc]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa090b96e>] ? dt_txn_hook_stop+0x5e/0x90 [obdclass]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0fddc3c>] ? osd_trans_stop+0x29c/0x520 [osd_zfs]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa1121a0f>] ? ofd_trans_stop+0x1f/0x60 [ofd]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa1129100>] ? ofd_commitrw_write+0x4d0/0xfa0 [ofd]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa112a18f>] ? ofd_commitrw+0x5bf/0xb10 [ofd]
Aug  4 19:58:26 lola-7 kernel: [<ffffffff81149461>] ? kzfree+0x31/0x40
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa08e8121>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b612e4>] ? obd_commitrw+0x114/0x380 [ptlrpc]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b6a080>] ? tgt_brw_write+0xc70/0x1530 [ptlrpc]
Aug  4 19:58:26 lola-7 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
Aug  4 19:58:26 lola-7 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
Aug  4 19:58:26 lola-7 kernel: [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0abee20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b688bc>] ? tgt_request_handle+0x8ec/0x1440 [ptlrpc]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b153c1>] ? ptlrpc_main+0xd31/0x1800 [ptlrpc]
Aug  4 19:58:26 lola-7 kernel: [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
Aug  4 19:58:26 lola-7 kernel: [<ffffffff81539896>] ? schedule+0x176/0x3a0
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b14690>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
Aug  4 19:58:26 lola-7 kernel: [<ffffffff810a138e>] ? kthread+0x9e/0xc0
Aug  4 19:58:26 lola-7 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
Aug  4 19:58:26 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
Aug  4 19:58:26 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Aug  4 19:58:26 lola-7 kernel: Code: 00 00 48 8b 5b 20 48 83 eb 07 48 39 d9 73 06 48 89 01 31 c0 c3 b8 f2 ff ff ff c3 90 90 90 90 90 90 90 f0 81 07 00 00 00 01 f3 90 <81> 3f 00 00 00 01 75 f6 f0 81 2f 00 00 00 01 0f 85 e2 ff ff ff
Aug  4 19:58:26 lola-7 kernel: Call Trace:
Aug  4 19:58:26 lola-7 kernel: [<ffffffff8153d007>] ? _write_lock+0x17/0x20
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0fe2ab0>] ? osd_object_sa_dirty_add+0x40/0x80 [osd_zfs]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0fee79b>] ? __osd_sa_xattr_schedule_update+0x4b/0x160 [osd_zfs]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0ff09af>] ? __osd_sa_xattr_set+0x25f/0x350 [osd_zfs]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0ff0d8c>] ? osd_xattr_set+0x2ec/0x4d0 [osd_zfs]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa090cdb2>] ? dt_version_set+0x72/0x230 [obdclass]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b5a443>] ? tgt_txn_stop_cb+0x2c3/0x1160 [ptlrpc]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa090b96e>] ? dt_txn_hook_stop+0x5e/0x90 [obdclass]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0fddc3c>] ? osd_trans_stop+0x29c/0x520 [osd_zfs]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa1121a0f>] ? ofd_trans_stop+0x1f/0x60 [ofd]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa1129100>] ? ofd_commitrw_write+0x4d0/0xfa0 [ofd]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa112a18f>] ? ofd_commitrw+0x5bf/0xb10 [ofd]
Aug  4 19:58:26 lola-7 kernel: [<ffffffff81149461>] ? kzfree+0x31/0x40
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa08e8121>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b612e4>] ? obd_commitrw+0x114/0x380 [ptlrpc]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b6a080>] ? tgt_brw_write+0xc70/0x1530 [ptlrpc]
Aug  4 19:58:26 lola-7 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
Aug  4 19:58:26 lola-7 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
Aug  4 19:58:26 lola-7 kernel: [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0abee20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b688bc>] ? tgt_request_handle+0x8ec/0x1440 [ptlrpc]
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b153c1>] ? ptlrpc_main+0xd31/0x1800 [ptlrpc]
Aug  4 19:58:26 lola-7 kernel: [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
Aug  4 19:58:26 lola-7 kernel: [<ffffffff81539896>] ? schedule+0x176/0x3a0
Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b14690>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
Aug  4 19:58:26 lola-7 kernel: [<ffffffff810a138e>] ? kthread+0x9e/0xc0
Aug  4 19:58:26 lola-7 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
Aug  4 19:58:26 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
Aug  4 19:58:26 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Aug  4 19:59:41 lola-7 kernel: INFO: task txg_quiesce:12866 blocked for more than 120 seconds.
Aug  4 19:59:41 lola-7 kernel:      Tainted: P           --L------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
Aug  4 19:59:41 lola-7 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  4 19:59:41 lola-7 kernel: txg_quiesce   D 0000000000000009     0 12866      2 0x00000080
Aug  4 19:59:41 lola-7 kernel: ffff88040f2afd20 0000000000000046 0000000000000000 ffff88040f2afce4
Aug  4 19:59:41 lola-7 kernel: 0000020a00000000 ffff88043fe83000 0000020be7124652 ffff880038675a00
Aug  4 19:59:41 lola-7 kernel: 000000000000af7e 00000001001dbf93 ffff88041a5f7ad8 ffff88040f2affd8
Aug  4 19:59:41 lola-7 kernel: Call Trace:
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0313e5d>] cv_wait_common+0x11d/0x130 [spl]
Aug  4 19:59:41 lola-7 kernel: [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
Aug  4 19:59:41 lola-7 kernel: [<ffffffff812a0d8c>] ? __bitmap_weight+0x8c/0xb0
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0313ec5>] __cv_wait+0x15/0x20 [spl]
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa03d19fb>] txg_quiesce_thread+0x32b/0x470 [zfs]
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa03d16d0>] ? txg_quiesce_thread+0x0/0x470 [zfs]
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa03d16d0>] ? txg_quiesce_thread+0x0/0x470 [zfs]
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa030f238>] thread_generic_wrapper+0x68/0x80 [spl]
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa030f1d0>] ? thread_generic_wrapper+0x0/0x80 [spl]
Aug  4 19:59:41 lola-7 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
Aug  4 19:59:41 lola-7 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Aug  4 19:59:41 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
Aug  4 19:59:41 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Aug  4 19:59:41 lola-7 kernel: INFO: task ll_ost_io01_001:13081 blocked for more than 120 seconds.
Aug  4 19:59:41 lola-7 kernel:      Tainted: P           --L------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
Aug  4 19:59:41 lola-7 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  4 19:59:41 lola-7 kernel: ll_ost_io01_0 D 0000000000000005     0 13081      2 0x00000080
Aug  4 19:59:41 lola-7 kernel: ffff88040abdf818 0000000000000046 ffff88040abdf7e0 ffff88040abdf7dc
Aug  4 19:59:41 lola-7 kernel: 0000000000000000 ffff88043fe82800 0000020be1fe58b0 ffff8800386d5a00
Aug  4 19:59:41 lola-7 kernel: 0000000000000400 00000001001dbf28 ffff88040abc45f8 ffff88040abdffd8
Aug  4 19:59:41 lola-7 kernel: Call Trace:
Aug  4 19:59:41 lola-7 kernel: [<ffffffff8153cc55>] rwsem_down_failed_common+0x95/0x1d0
Aug  4 19:59:41 lola-7 kernel: [<ffffffff8153cde6>] rwsem_down_read_failed+0x26/0x30
Aug  4 19:59:41 lola-7 kernel: [<ffffffff8129e8f4>] call_rwsem_down_read_failed+0x14/0x30
Aug  4 19:59:41 lola-7 kernel: [<ffffffff8153c2e4>] ? down_read+0x24/0x30
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0fe88d7>] osd_attr_get+0x77/0x2e0 [osd_zfs]
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa1122861>] ofd_attr_handle_ugid+0x161/0x430 [ofd]
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa112842d>] ofd_write_attr_set+0xad/0x8b0 [ofd]
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa1128e5e>] ofd_commitrw_write+0x22e/0xfa0 [ofd]
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa112e23d>] ? ofd_fmd_find_nolock+0xad/0xd0 [ofd]
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa112a18f>] ofd_commitrw+0x5bf/0xb10 [ofd]
Aug  4 19:59:41 lola-7 kernel: [<ffffffff81149461>] ? kzfree+0x31/0x40
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa08e8121>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0b612e4>] obd_commitrw+0x114/0x380 [ptlrpc]
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0b6a080>] tgt_brw_write+0xc70/0x1530 [ptlrpc]
Aug  4 19:59:41 lola-7 kernel: [<ffffffff8105ab9c>] ? check_preempt_curr+0x7c/0x90
Aug  4 19:59:41 lola-7 kernel: [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0abee20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0b688bc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0b153c1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
Aug  4 19:59:41 lola-7 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0b14690>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
Aug  4 19:59:41 lola-7 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
Aug  4 19:59:41 lola-7 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Aug  4 19:59:41 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
Aug  4 19:59:41 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Comment by John Hammond [ 05/Aug/16 ]

This is likely introduced by:

commit 6cd79ab5860c59c2a640a9e8ca4ee86eec050b43
Author: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Date:   Fri Mar 25 12:21:16 2016 +0300

    LU-7899 osd: batch EA updates
    
    during file creation we set number of EAs: LMA, VBR, LinkEA, LOVEA, ACLs.
    calling into SA to refill spill again and again is expensive. thus it
    makes sense to postpone this to osd_trans_stop() where all changed EAs
    has been already collected in a temporary buffer.
    
    Change-Id: I8f02a287b96615c3aa550d63ffd9dd3da51b39ee
    Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
    Reviewed-on: http://review.whamcloud.com/19143
    Tested-by: Jenkins
    Tested-by: Maloo <hpdd-maloo@intel.com>
    Reviewed-by: Lai Siyao <lai.siyao@intel.com>
    Reviewed-by: Bobi Jam <bobijam@hotmail.com>
    Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>

Alex, can you comment here?

Comment by Cliff White (Inactive) [ 05/Aug/16 ]

Set kernel.panic = 60 on all OSS nodes. Re-ran test. Errors are attached

Comment by Gerrit Updater [ 09/Aug/16 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/21827
Subject: LU-8449 osd: retake locks as planned
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7a6836c0e49a3065ef876b6fa41e233502adecd6

Comment by Cliff White (Inactive) [ 10/Aug/16 ]

We have not seen an oom-killer, however the immediate soft lockups continue.
Tends to freeze the node over time. Full console from timeouts attached.

Comment by Cliff White (Inactive) [ 10/Aug/16 ]

Will continue with run to see if it ooo-kills

Comment by Cliff White (Inactive) [ 10/Aug/16 ]

System did not oom-kill however all OSS are unusable due to massive timeous.

Comment by Alex Zhuravlev [ 10/Aug/16 ]

Cliff, with the patch above?

Comment by Cliff White (Inactive) [ 10/Aug/16 ]

Yes, with the patch above

Comment by Alex Zhuravlev [ 10/Aug/16 ]

hmm... would you mind to try refreshed version, please? please, attach the logs.

Comment by Peter Jones [ 11/Aug/16 ]

Fixed by reverting LU-7899

Generated at Sat Feb 10 02:17:38 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.