[LU-14781] sanity-flr test 70: mirror create and split race crash Created: 23/Jun/21  Updated: 14/Jan/22  Resolved: 30/Nov/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Major
Reporter: Zhenyu Xu Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14648 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC i... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for wangshilong <wshilong@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/515ea0f6-6502-4f1c-96ad-2df998d95993

[ 6188.499816] Lustre: DEBUG MARKER: == sanity-flr test 70: mirror create and split race ================================================== 03:31:01 (1624246261)
[ 6231.305351] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 6231.306707] IP: [<ffffffffc130bb3d>] osp_object_free+0x4d/0x490 [osp]
[ 6231.307887] PGD 0
[ 6231.308240] Oops: 0000 1 SMP
[ 6231.308838] Modules linked in: dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc iosf_mbi crc32_pclmul ppdev ghash_clmulni_intel dm_mod aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev virtio_balloon pcspkr i2c_piix4 parport_pc parport ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk ata_piix libata 8139too crct10dif_pclmul crct10dif_common crc32c_intel serio_raw
[ 6231.321323] virtio_pci 8139cp virtio_ring mii virtio floppy [last unloaded: dm_flakey]
[ 6231.322546] CPU: 1 PID: 23593 Comm: mdt00_003 Kdump: loaded Tainted: G OE ------------ 3.10.0-1160.21.1.el7_lustre.ddn15.x86_64 #1
[ 6231.324404] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 6231.325248] task: ffff9e48d5310000 ti: ffff9e48d5098000 task.ti: ffff9e48d5098000
[ 6231.326340] RIP: 0010:[<ffffffffc130bb3d>] [<ffffffffc130bb3d>] osp_object_free+0x4d/0x490 [osp]
[ 6231.327661] RSP: 0018:ffff9e48d509b7c0 EFLAGS: 00010217
[ 6231.328447] RAX: ffff9e48e0eadae8 RBX: 0000000000000000 RCX: dead000000000200
[ 6231.329493] RDX: ffff9e48d509b7f8 RSI: ffff9e48e0eadb08 RDI: ffff9e48e0eadab8
[ 6231.330537] RBP: ffff9e48d509b7e8 R08: ffff9e48e0eadb20 R09: 0000000000000105
[ 6231.331579] R10: 0000000000000195 R11: 000000000007ffff R12: ffff9e48e0eadb20
[ 6231.332623] R13: ffff9e48e0eadbd8 R14: 00001fffffffffff R15: ffff9e48e0eadab8
[ 6231.333687] FS: 0000000000000000(0000) GS:ffff9e48ffd00000(0000) knlGS:0000000000000000
[ 6231.334876] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6231.335722] CR2: 0000000000000000 CR3: 0000000077340000 CR4: 00000000001606e0
[ 6231.336779] Call Trace:
[ 6231.337803] [<ffffffffc0a94ba2>] lu_object_free.isra.30+0xf2/0x170 [obdclass]
[ 6231.338948] [<ffffffffc0a989a6>] lu_object_find_at+0x496/0x930 [obdclass]
[ 6231.340080] [<ffffffffa59870d2>] ? mutex_lock+0x12/0x2f
[ 6231.341069] [<ffffffffc1220fcb>] ? lod_putref+0x6b/0x980 [lod]
[ 6231.341954] [<ffffffffc121f5f0>] ? lod_free_comp_buffer+0x2b0/0x460 [lod]
[ 6231.342971] [<ffffffffc1225ca4>] lod_initialize_objects+0x3e4/0xba0 [lod]
[ 6231.343982] [<ffffffffc1226af3>] lod_parse_striping+0x693/0xc20 [lod]
[ 6231.344946] [<ffffffffc1227332>] lod_striping_load+0x2b2/0x660 [lod]
[ 6231.345940] [<ffffffffc123d31b>] lod_declare_destroy+0x12b/0x600 [lod]
[ 6231.346920] [<ffffffffc123e62a>] ? lod_declare_attr_set+0xba/0x6b0 [lod]
[ 6231.348138] [<ffffffffc12bd209>] ? mdd_env_info+0x19/0x90 [mdd]
[ 6231.349034] [<ffffffffc12a52d1>] mdd_declare_finish_unlink+0x91/0x210 [mdd]
[ 6231.350078] [<ffffffffc12b50cf>] mdd_unlink+0x48f/0xab0 [mdd]
[ 6231.351294] [<ffffffffc11322c2>] mdt_reint_unlink+0xc32/0x1550 [mdt]
[ 6231.353207] [<ffffffffc0d79e27>] ? lustre_msg_buf+0x17/0x60 [ptlrpc]
[ 6231.354201] [<ffffffffc1136c43>] mdt_reint_rec+0x83/0x210 [mdt]
[ 6231.355116] [<ffffffffc1111f01>] mdt_reint_internal+0x6e1/0xb00 [mdt]
[ 6231.356087] [<ffffffffc111e2f7>] mdt_reint+0x67/0x140 [mdt]
[ 6231.357167] [<ffffffffc0de748e>] tgt_request_handle+0xaee/0x15f0 [ptlrpc]
[ 6231.358336] [<ffffffffc0853297>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 6231.359350] [<ffffffffc0d897cb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 6231.360536] [<ffffffffa52d32c3>] ? __wake_up+0x13/0x20
[ 6231.361346] [<ffffffffc0d8d134>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[ 6231.362274] [<ffffffffa52d4c1e>] ? finish_task_switch+0x4e/0x1c0
[ 6231.363178] [<ffffffffa598899f>] ? __schedule+0x3af/0x860
[ 6231.364013] [<ffffffffc0d8c600>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[ 6231.365129] [<ffffffffa52c5da1>] kthread+0xd1/0xe0
[ 6231.365862] [<ffffffffa52c5cd0>] ? insert_kthread_work+0x40/0x40
[ 6231.366775] [<ffffffffa5995df7>] ret_from_fork_nospec_begin+0x21/0x21
[ 6231.367753] [<ffffffffa52c5cd0>] ? insert_kthread_work+0x40/0x40
[ 6231.368653] Code: 8d 7f 50 4d 8d af 20 01 00 00 49 be ff ff ff ff ff 1f 00 00 e8 95 df 78 ff 48 89 df e8 8d 8e 78 ff 49 8b 9f 20 01 00 00 4c 39 eb <4c> 8b 23 48 89 df 0f 85 ce 00 00 00 e9 9b 01 00 00 66 90 4c 01
[ 6231.373352] RIP [<ffffffffc130bb3d>] osp_object_free+0x4d/0x490 [osp]
[ 6231.374350] RSP <ffff9e48d509b7c0>
[ 6231.374883] CR2: 0000000000000000



 Comments   
Comment by Gerrit Updater [ 23/Jun/21 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/44055
Subject: LU-14781 osp: init proper lu_object_header
Project: fs/lustre-release
Branch: master
Current Patch Set: 2
Commit: 051cd86ccd395283473e9edd66448694bb7001e6

Comment by Alex Zhuravlev [ 31/Aug/21 ]

on master https://testing.whamcloud.com/test_sessions/c74ca3d7-6927-41c3-abca-65bf5108eb71

Comment by Zhenyu Xu [ 03/Sep/21 ]

The stack trace of the Oops shows that osp_object_free encountered NULL pointer

Call Trace:
[ 6532.245084]  [] lu_object_free.isra.30+0xf2/0x170 [obdclass]
[ 6532.246905]  [] lu_object_find_at+0x496/0x930 [obdclass]
[ 6532.251132]  [] lod_initialize_objects+0x3e4/0xba0 [lod]
[ 6532.252240]  [] lod_parse_striping+0x693/0xc20 [lod]
[ 6532.253293]  [] lod_striping_load+0x2b2/0x660 [lod]
[ 6532.254347]  [] lod_declare_destroy+0x12b/0x600 [lod]
[ 6532.258781]  [] mdd_declare_finish_unlink+0x91/0x210 [mdd]
[ 6532.259911]  [] mdd_unlink+0x48f/0xab0 [mdd]
[ 6532.260957]  [] mdt_reint_unlink+0xc32/0x1550 [mdt]
[ 6532.263377]  [] mdt_reint_rec+0x83/0x210 [mdt]
[ 6532.264354]  [] mdt_reint_internal+0x6e1/0xb00 [mdt]
[ 6532.265410]  [] mdt_reint+0x67/0x140 [mdt]
[ 6532.266424]  [] tgt_request_handle+0xaee/0x15f0 [ptlrpc]
[ 6532.268678]  [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 6532.270797]  [] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[ 6532.273053]  [] kthread+0xd1/0xe0
[ 6532.274830]  [] ret_from_fork_nospec_begin+0x21/0x21
[ 6532.276848] Code: 8d 7f 50 4d 8d af 20 01 00 00 49 be ff ff ff ff ff 1f 00 00 e8 75 e1 79 ff 48 89 df e8 6d 90 79 ff 49 8b 9f 20 01 00 00 4c 39 eb  8b 23 48 89 df 0f 85 ce 00 00 00 e9 9b 01 00 00 66 90 4c 01 
[ 6532.281936] RIP  [] osp_object_free+0x4d/0x490 [osp]
[ 6532.283028]  RSP 
[ 6532.283602] CR2: 0000000000000000

And

 
(gdb) l *(osp_object_free+0x4d)
0x7b6d is in osp_object_free (/root/work/lustre/lustre/osp/osp_object.c:2296).
2289		struct lu_object_header	*h = o->lo_header;
2290		struct osp_xattr_entry *oxe;
2291		struct osp_xattr_entry *tmp;
2292		int			count;
2293	
2294		dt_object_fini(&obj->opo_obj);
2295		lu_object_header_fini(h);
2296		list_for_each_entry_safe(oxe, tmp, &obj->opo_xattr_list, oxe_list) {
2297			list_del(&oxe->oxe_list);
2298			count = atomic_read(&oxe->oxe_ref);
2299			LASSERTF(count == 1,
2300				 "Still has %d users on the xattr entry %.*s\n",

So the object's lu_object_header is NULL? Which leads me to

osp_object_alloc()
111 static struct lu_object *osp_object_alloc(const struct lu_env *env,             
 112                                           const struct lu_object_header *hdr,   
 113                                           struct lu_device *d)                  
 114 {                                                                               
 115         struct lu_object_header *h = NULL;                                      
 116         struct osp_object       *o;                                             
 117         struct lu_object        *l;                                             
 118                                                                                 
 119         OBD_SLAB_ALLOC_PTR_GFP(o, osp_object_kmem, GFP_NOFS);                   
 120         if (o != NULL) {                                                        
 121                 l = &o->opo_obj.do_lu;                                          
 122                                                                                 
 123                 /* If hdr is NULL, it means the object is not built             
 124                  * from the top dev(MDT/OST), usually it happens when           
 125                  * building striped object, like data object on MDT or          
 126                  * striped object for directory */                              
 127                 if (hdr == NULL) {                                              
 128                         h = &o->opo_header;                                     
 129                         lu_object_header_init(h);                               
 130                         dt_object_init(&o->opo_obj, h, d);                      
 131                         lu_object_add_top(h, l);                                
 132                 } else {                                                        
 133                         dt_object_init(&o->opo_obj, h, d);                              
 134                 }                                                               
 135                                                                                 
 136                 l->lo_ops = &osp_lu_obj_ops;                                    
 137                                                                                 
 138                 return l;                                                       
 139         } else {                                                                
 140                 return NULL;                                                    
 141         }                                                                       
 142 }                   

It seems to me that line 133 is faulty, it should use hdr instead of h.

Comment by Gerrit Updater [ 22/Sep/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44055/
Subject: LU-14781 osp: osp object header could be NULL
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 90dab62d635666001b8b4e2b7aa8cc1f23bc6f5b

Comment by Peter Jones [ 22/Sep/21 ]

Landed for 2.15

Comment by Zhenyu Xu [ 02/Nov/21 ]

another hit (https://testing.whamcloud.com/test_sets/8e467190-cbc4-43f6-8b42-89c274f86735) reveals that osp_object free could still access NULL memory

Comment by Gerrit Updater [ 03/Nov/21 ]

"Bobi Jam <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/45442
Subject: LU-14781 osp: osp_object_free access NULL pointer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 20dde5a8d428b3f9bf2d0421b333a09545be1c65

Comment by Gerrit Updater [ 30/Nov/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45442/
Subject: LU-14781 osp: osp_object_free access NULL pointer
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 86e6d90f69d8c34f6af4c6e7eb7cad3e60b7278e

Comment by Peter Jones [ 30/Nov/21 ]

Landed for 2.15

Generated at Sat Feb 10 09:29:18 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.