[LU-2766] lov_object.c:635:lov_layout_change()) ASSERTION( atomic_read(&lov->lo_active_ios) == 0 ) failed
Created: 06/Feb/13  Updated: 24/Jun/16  Resolved: 17/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
Fix Version/s: Lustre 2.9.0

Type: Bug
Priority: Critical
Reporter: Oleg Drokin
Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed
Votes: 0
Labels: LB, cea
Environment:

Hit during racer.


Issue Links:
Duplicate:
  is duplicated by LU-8273 lov_sub_get()) ASSERTION( stripe < li... (Resolved)
Related:
  is related to LU-2652 lov_io.c:222:lov_sub_get()) ASSERTION... (Resolved)
  is related to LU-7073 racer with OST object migration hangs... (Resolved)
  is related to LU-8273 lov_sub_get()) ASSERTION( stripe < li... (Resolved)
Severity: 3
Rank (Obsolete): 6706

 Description   

This happens with http://review.whamcloud.com/#change,4507 applied; I will upgrade it to blocker once that change lands.

[34905.437520] LustreError: 3593:0:(ofd_io.c:147:ofd_preprw_write()) lustre-OST0001: BRW to missing obj 319/9663677440
[34910.734879] LustreError: 1156:0:(lov_object.c:635:lov_layout_change()) ASSERTION( atomic_read(&lov->lo_active_ios) == 0 ) failed: 
[34910.734881] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d8
[34910.734883] IP: [<ffffffffa0636d3e>] cl_object_top+0xe/0x150 [obdclass]
[34910.734913] PGD 35e44067 PUD c1c3067 PMD 0 
[34910.734915] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[34910.734917] last sysfs file: /sys/devices/system/cpu/possible
[34910.734919] CPU 5 
[34910.734920] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs exportfs mdd mgs lquota jbd obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet sha512_generic sha256_generic libcfs ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: speedstep_lib]
[34910.734941] 
[34910.734942] Pid: 1312, comm: lfs Not tainted 2.6.32-debug #6 Bochs Bochs
[34910.734944] RIP: 0010:[<ffffffffa0636d3e>]  [<ffffffffa0636d3e>] cl_object_top+0xe/0x150 [obdclass]
[34910.734968] RSP: 0018:ffff8800375599c8  EFLAGS: 00010292
[34910.734970] RAX: ffff88006bc13bf0 RBX: ffff88006ab94f50 RCX: 00000000000000d8
[34910.734971] RDX: ffff8800660abdf0 RSI: ffffffffa0687ac0 RDI: 00000000000000d8
[34910.734972] RBP: ffff8800375599d8 R08: 0000000000000000 R09: 0000000000000000
[34910.734973] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007071ceb8
[34910.734975] R13: 0000000000000005 R14: 00000000000000d8 R15: ffff88006bc13bf0
[34910.734976] FS:  00007f4bce93e700(0000) GS:ffff880006340000(0000) knlGS:0000000000000000
[34910.734978] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[34910.734979] CR2: 00000000000000d8 CR3: 000000009f3d7000 CR4: 00000000000006e0
[34910.734983] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[34910.734986] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[34910.734988] Process lfs (pid: 1312, threadinfo ffff880037558000, task ffff8800496ea500)
[34910.734989] Stack:
[34910.734990]  ffff8800375599d8 ffff88006ab94f50 ffff880037559a18 ffffffffa0647a1d
[34910.734992] <d> 0000000000000000 ffff8800b3e51db0 ffff88007071ce60 0000000000000000
[34910.734994] <d> ffff8800683eceb8 ffff8800a4cbaed0 ffff880037559a78 ffffffffa0aaea68
[34910.734997] Call Trace:
[34910.735017]  [<ffffffffa0647a1d>] cl_io_sub_init+0x3d/0xc0 [obdclass]
[34910.735031]  [<ffffffffa0aaea68>] lov_sub_get+0x218/0x690 [lov]
[34910.735040]  [<ffffffffa0aaac72>] lov_sublock_env_get+0xd2/0x140 [lov]
[34910.735047]  [<ffffffffa0aac381>] lov_sublock_alloc+0xf1/0x450 [lov]
[34910.735055]  [<ffffffffa0aad55c>] lov_lock_init_raid0+0x3dc/0xdd0 [lov]
[34910.735058]  [<ffffffff81163bf1>] ? kmem_cache_alloc+0x141/0x240
[34910.735066]  [<ffffffffa0aa534e>] lov_lock_init+0x1e/0x60 [lov]
[34910.735085]  [<ffffffffa064380c>] cl_lock_hold_mutex+0x34c/0x660 [obdclass]
[34910.735104]  [<ffffffffa0643c82>] cl_lock_request+0x62/0x270 [obdclass]
[34910.735122]  [<ffffffffa0e89644>] cl_get_grouplock+0x134/0x270 [lustre]
[34910.735132]  [<ffffffffa0e37b65>] ll_get_grouplock+0xe5/0x4a0 [lustre]
[34910.735142]  [<ffffffffa0e485df>] ll_file_ioctl+0x9af/0x2240 [lustre]
[34910.735145]  [<ffffffff8118e112>] vfs_ioctl+0x22/0xa0
[34910.735149]  [<ffffffff810385d8>] ? pvclock_clocksource_read+0x58/0xd0
[34910.735151]  [<ffffffff8118ea9e>] do_vfs_ioctl+0x3ee/0x5e0
[34910.735153]  [<ffffffff810376cc>] ? kvm_clock_read+0x1c/0x20
[34910.735155]  [<ffffffff810376d9>] ? kvm_clock_get_cycles+0x9/0x10
[34910.735158]  [<ffffffff8109af60>] ? getnstimeofday+0x60/0xf0
[34910.735159]  [<ffffffff8118ed11>] sys_ioctl+0x81/0xa0
[34910.735162]  [<ffffffff81070160>] ? sys_gettimeofday+0x40/0x90
[34910.735165]  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
[34910.735166] Code: 05 00 00 00 04 00 e8 a2 16 e4 ff 48 c7 c7 00 83 68 a0 e8 a6 11 e3 ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 <48> 8b 07 0f 1f 80 00 00 00 00 48 89 c2 48 8b 80 b0 00 00 00 48 
[34910.735180] RIP  [<ffffffffa0636d3e>] cl_object_top+0xe/0x150 [obdclass]
[34910.735198]  RSP <ffff8800375599c8>
[34910.735199] CR2: 00000000000000d8


 Comments   
Comment by Jinshan Xiong (Inactive) [ 19/Feb/13 ]

This should have been fixed.

Comment by John Hammond [ 28/Jun/13 ]

I saw this while running racer today on 2.4.51-3-g9f5eea8. It reproduces much more readily if you run two concurrent swap_layouts loops over the same pair of files:

# llmount.sh
# cd /mnt/lustre
# touch 0 1
# while true; do
    lfs swap_layouts $((RANDOM % 2)) $((RANDOM % 2))
done &
# while true; do
    lfs swap_layouts $((RANDOM % 2)) $((RANDOM % 2))
done &

The same is true of LU-2652.

Comment by Jinshan Xiong (Inactive) [ 30/Jun/13 ]

The patch is at: http://review.whamcloud.com/6828

Comment by Henri Doreau (Inactive) [ 28/Nov/14 ]

Jinshan, since this is an issue for us and the patch needed to be refreshed, I took the liberty of rebasing it. I hope that is OK.

Comment by Gerrit Updater [ 16/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/6828/
Subject: LU-2766 llite: don't ignore layout for group lock request
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 16b86fe65d285de93c3331e0c9afd5f0fb608e84
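
For readers connecting the failed assertion to the patch subject above, here is a toy model in C of the invariant being checked. It is only a sketch under stated assumptions, not Lustre code: struct toy_object, toy_io_begin(), toy_io_end(), toy_layout_change() and the ignore_layout flag are hypothetical names, and the model assumes that only I/O which does not ignore the layout is counted. Where the real code asserts, the model returns -EBUSY so the behaviour can be observed.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for an object with a changeable layout (NOT a Lustre struct). */
struct toy_object {
    atomic_int active_ios; /* plays the role of lo_active_ios */
    int layout_gen;        /* bumped on every successful layout change */
};

/* Hypothetical: start an I/O. I/O that ignores the layout is not counted. */
static void toy_io_begin(struct toy_object *obj, bool ignore_layout)
{
    if (!ignore_layout)
        atomic_fetch_add(&obj->active_ios, 1);
}

/* Hypothetical: finish an I/O started with the same ignore_layout value. */
static void toy_io_end(struct toy_object *obj, bool ignore_layout)
{
    if (!ignore_layout) {
        int prev = atomic_fetch_sub(&obj->active_ios, 1);
        assert(prev > 0); /* an end without a matching begin is a bug */
        (void)prev;
    }
}

/*
 * Hypothetical: change the layout only when no counted I/O is in flight.
 * Returns 0 on success, -EBUSY otherwise (the real code asserts instead).
 */
static int toy_layout_change(struct toy_object *obj)
{
    if (atomic_load(&obj->active_ios) != 0)
        return -EBUSY;
    obj->layout_gen++;
    return 0;
}
```

In this model an I/O started with ignore_layout set is invisible to toy_layout_change(), so the layout can change underneath it, while a counted I/O makes the change back off. Whether this matches what change 6828 actually does for group lock requests should be checked against the patch itself.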
