[LU-1773] Oops in lov_delete_raid0() Created: 20/Aug/12  Updated: 30/Aug/12  Resolved: 26/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: Lustre 2.3.0, Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

server,client lustre-master-tag2.2.93 RHEL6


Issue Links:
Related
is related to LU-1480 failure on replay-single test_74: ASS... Resolved
Severity: 3
Rank (Obsolete): 4477

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/2bce893c-e9a0-11e1-881a-52540035b04c.

This test timed out, but the logs contain no useful information.



 Comments   
Comment by Sarah Liu [ 20/Aug/12 ]

I manually reran this test and hit the following error on the client:

Lustre: DEBUG MARKER: == lfsck lfsck.sh test complete, duration 160 sec == 15:24:04 (1345501444)
LustreError: 10275:0:(lov_object.c:157:lov_init_sub()) Stripe is already owned by other file (0).
LustreError: 10275:0:(lov_object.c:158:lov_init_sub()) header@ffff880331062e80[0x0, 3, [0x100000000:0xe:0x0] hash]{ 
LustreError: 10275:0:(lov_object.c:158:lov_init_sub()) ....lovsub@ffff880331062f18[0]
LustreError: 10275:0:(lov_object.c:158:lov_init_sub()) ....osc@ffff88031d4c1e28id: 14 gr: 0 idx: 0 gen: 0 kms_valid: 1 kms 0 rc: 0 force_sync: 0 min_xid: 0 size: 1048576 mtime: 1345501306 atime: 0 ctime: 1345501306 blocks: 2048
LustreError: 10275:0:(lov_object.c:158:lov_init_sub()) } header@ffff880331062e80
LustreError: 10275:0:(lov_object.c:158:lov_init_sub()) 
LustreError: 10275:0:(lov_object.c:160:lov_init_sub()) header@ffff88031d5cae70[0x1, 0, [0x200000400:0x63:0x0]]{ 
LustreError: 10275:0:(lov_object.c:160:lov_init_sub()) ....vvp@ffff88031d5caf08(- 0 0) inode: ffff88031d5afc78 144115205255725155/33554436 100644 0 0 ffff88031d5caf08 [0x200000400:0x63:0x0]
LustreError: 10275:0:(lov_object.c:160:lov_init_sub()) ....lov@ffff88031d5a4e10stripes: 1:
LustreError: 10275:0:(lov_object.c:160:lov_init_sub()) header@ffff880331062e80[0x1, 1, [0x100000000:0xe:0x0] hash]{ 
LustreError: 10275:0:(lov_object.c:160:lov_init_sub()) ....lovsub@ffff880331062f18[0]
LustreError: 10275:0:(lov_object.c:160:lov_init_sub()) ....osc@ffff88031d4c1e28id: 14 gr: 0 idx: 0 gen: 0 kms_valid: 1 kms 0 rc: 0 force_sync: 0 min_xid: 0 size: 1048576 mtime: 1345501306 atime: 0 ctime: 1345501306 blocks: 2048
LustreError: 10275:0:(lov_object.c:160:lov_init_sub()) } header@ffff880331062e80
LustreError: 10275:0:(lov_object.c:160:lov_init_sub()) 
LustreError: 10275:0:(lov_object.c:160:lov_init_sub()) } header@ffff88031d5cae70
LustreError: 10275:0:(lov_object.c:160:lov_init_sub()) old
LustreError: 10275:0:(lov_object.c:161:lov_init_sub()) header@ffff8803318adeb0[0x0, 1, [0x290006e:0x2d75dcf9:0x0]]
LustreError: 10275:0:(lov_object.c:161:lov_init_sub()) new
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffffa099f867>] lov_delete_raid0+0x47/0x430 [lov]
PGD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
CPU 1 
Modules linked in: nfs fscache lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa igb mlx4_ib ib_mad ib_core mlx4_en mlx4_core microcode i2c_i801 i2c_core serio_raw sg iTCO_wdt iTCO_vendor_support i7core_edac edac_core ioatdma dca shpchp ext3 jbd mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 10275, comm: ll_sa_10274 Not tainted 2.6.32-279.2.1.el6.x86_64 #1 Supermicro X8DTT/X8DTT
RIP: 0010:[<ffffffffa099f867>]  [<ffffffffa099f867>] lov_delete_raid0+0x47/0x430 [lov]
RSP: 0018:ffff8803319338e0  EFLAGS: 00010246
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffffffffa09b47d0 RSI: 0000000000000000 RDI: 0000000000000002
RBP: ffff880331933980 R08: ffff880331932000 R09: 00000000ffffffff
R10: 0000000000000001 R11: 0000000000000000 R12: ffff880330d92728
R13: ffff8803318adeb0 R14: ffff88031d5f9ef8 R15: ffff88031d5f9f88
FS:  00007f6d48a82700(0000) GS:ffff880032e20000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001a85000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ll_sa_10274 (pid: 10275, threadinfo ffff880331932000, task ffff88031c600080)
Stack:
 000000000290006e 0000000000002c57 000000000290006e ffff88031d5f9ef8
<d> ffff880330d92728 ffff880331932000 ffff880330d92728 ffff8803318adeb0
<d> ffffc9001904e000 0000000000000030 ffff880330d92728 ffffffff810623da
Call Trace:
 [<ffffffff810623da>] ? __cond_resched+0x2a/0x40
 [<ffffffffa09a01a9>] lov_object_delete+0x69/0x190 [lov]
 [<ffffffffa0535069>] lu_object_free+0x89/0x1b0 [obdclass]
 [<ffffffffa038cfb8>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa0535326>] lu_object_alloc+0x196/0x310 [obdclass]
 [<ffffffffa0535a29>] lu_object_find_at+0x139/0x450 [obdclass]
 [<ffffffffa0adc61f>] ? cl_file_inode_init+0x5f/0x360 [lustre]
 [<ffffffffa0392521>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa0535d7f>] lu_object_find_slice+0x1f/0x80 [obdclass]
 [<ffffffffa053c782>] cl_object_find+0x42/0xb0 [obdclass]
 [<ffffffffa0adc7df>] cl_file_inode_init+0x21f/0x360 [lustre]
 [<ffffffffa0aa9e92>] ll_update_inode+0x112/0xe60 [lustre]
 [<ffffffffa0392521>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa0aab858>] ll_read_inode2+0x88/0x440 [lustre]
 [<ffffffffa0ac1e33>] ll_iget+0x1a3/0x2a0 [lustre]
 [<ffffffffa0aab0c1>] ll_prep_inode+0x4e1/0xbf0 [lustre]
 [<ffffffffa0ad6487>] do_statahead_interpret+0x347/0xde0 [lustre]
 [<ffffffffa0ada7ea>] ll_statahead_thread+0x27a/0xf60 [lustre]
 [<ffffffff81127c5f>] ? free_hot_page+0x2f/0x60
 [<ffffffff81060250>] ? default_wake_function+0x0/0x20
 [<ffffffffa0ada570>] ? ll_statahead_thread+0x0/0xf60 [lustre]
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffffa0ada570>] ? ll_statahead_thread+0x0/0xf60 [lustre]
 [<ffffffffa0ada570>] ? ll_statahead_thread+0x0/0xf60 [lustre]
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: 01 48 89 7d b0 49 89 f6 49 89 d7 48 8b 5e 70 74 0d f6 05 ce 1d a1 ff 02 0f 85 9e 01 00 00 48 89 de bf 02 00 00 00 e8 79 a0 b5 ff <8b> 03 83 f8 01 0f 8f ee 02 00 00 49 8b 47 08 48 85 c0 0f 84 1e 
RIP  [<ffffffffa099f867>] lov_delete_raid0+0x47/0x430 [lov]
 RSP <ffff8803319338e0>
CR2: 0000000000000000
Comment by Peter Jones [ 20/Aug/12 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 20/Aug/12 ]

Patch is tracked at http://review.whamcloud.com/3732

patch description
    LU-1773 lov: lov_delete_raid0() need heed lov_fini_raid0()

    * When lov_init_raid0() fails, lov_fini_raid0() frees lov->lo_lsm,
      so lov_delete_raid0() must account for that.
    * lov_fini_raid0() needs to check that lov->lsm exists, since the
      error handling path may call it twice (lu_object_alloc->
      lov_object_init->lov_init_raid0--(fail)-->lov_fini_raid0--(return to)-->
      lu_object_alloc->lu_object_free->lov_object_free->lov_fini_raid0).
    * Add a sanity test case.
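
The double-teardown pattern described in the patch notes can be illustrated with a small standalone sketch. This is not the Lustre code and not the landed patch: every name here (toy_lov, toy_lsm, toy_init, toy_fini, toy_delete, toy_alloc, fini_frees) is hypothetical and only models the idea that a failed init has already released its state, so the later delete and fini calls have to tolerate a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the real Lustre structures. */
struct toy_lsm {
    int stripe_count;
};

struct toy_lov {
    struct toy_lsm *lsm;  /* plays the role of the layout pointer */
    int fini_frees;       /* how many times fini actually freed memory */
};

/* Teardown that is safe to call twice: free once, then clear the pointer. */
static void toy_fini(struct toy_lov *lov)
{
    assert(lov != NULL);
    if (lov->lsm == NULL)
        return;           /* already released by an earlier call */
    free(lov->lsm);
    lov->lsm = NULL;
    lov->fini_frees++;
}

/* Init that can fail part-way and cleans up after itself. */
static int toy_init(struct toy_lov *lov, int stripes, int fail)
{
    lov->lsm = malloc(sizeof(*lov->lsm));
    if (lov->lsm == NULL)
        return -1;
    lov->lsm->stripe_count = stripes;
    if (fail) {
        toy_fini(lov);    /* first fini, on the init error path */
        return -1;
    }
    return 0;
}

/* Delete must not dereference state that a failed init already released. */
static int toy_delete(struct toy_lov *lov)
{
    if (lov->lsm == NULL)
        return 0;         /* nothing to delete; skipping avoids a NULL deref */
    return lov->lsm->stripe_count;  /* number of stripes "deleted" */
}

/* Models the sequence: alloc -> init (fail) -> generic free -> delete + fini. */
static int toy_alloc(struct toy_lov *lov, int stripes, int fail)
{
    int rc;

    lov->lsm = NULL;
    lov->fini_frees = 0;
    rc = toy_init(lov, stripes, fail);
    if (rc != 0) {
        toy_delete(lov);
        toy_fini(lov);    /* second fini, from the generic free path */
    }
    return rc;
}
```

In this sketch, calling toy_alloc() with fail set to 1 runs the teardown twice but frees the memory only once, and toy_delete() returns without touching the released pointer.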
Comment by Zhenyu Xu [ 22/Aug/12 ]

LU-1480 has the same root cause.

Patch updated.

patch description
LU-1773 lov: lov_delete_raid0() need heed lov_fini_raid0()

Add a sanity test case and handle failure of lov_init_raid0()
correctly.

LU-1480 is also due to the failure of lov_init_raid0().
Comment by Peter Jones [ 26/Aug/12 ]

Landed for 2.3 and 2.4

Generated at Sat Feb 10 01:19:33 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.