[LU-5340] oops in cl_object_top() after reconstructed setattr on removed directory Created: 14/Jul/14 Updated: 20/Jul/15 Resolved: 20/Jul/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.7.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | John Hammond | Assignee: | Lai Siyao |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llite, malloc | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 14897 | ||||||||
| Description |
|
See the discussion on from http://review.whamcloud.com/#/c/11025/2/lustre/mdt/mdt_recovery.c. To reproduce: MOUNT_2=y llmount.sh mkdir /mnt/lustre/d0 stat /mnt/lustre/d0 lctl set_param fail_loc=0x119 # OBD_FAIL_MDS_REINT_NET_REP touch /mnt/lustre/d0 & sleep 1 lctl set_param fail_loc=0 rmdir /mnt/lustre2/d0 [ 132.938246] Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt [ 132.970143] Lustre: *** cfs_fail_loc=119, val=2147483648*** [ 132.972029] LustreError: 3623:0:(ldlm_lib.c:2399:target_send_reply_msg()) @@@ dropping reply req@ffff8801f8edd908 x1473623045964728/t4294967298(0) o36->ac7dcf52-7dca-015b-3e5b-0a840c48c232@0@lo:0/0 lens 488/456 e 0 to 0 dl 1405356492 ref 1 fl Interpret:/0/0 rc 0/0 [ 139.969185] Lustre: 3855:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1405356486/real 1405356486] req@ffff8801f64beb00 x1473623045964728/t0(0) o36->lustre-MDT0000-mdc-ffff880203aa4860@0@lo:12/10 lens 488/1016 e 0 to 1 dl 1405356493 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 [ 139.978122] Lustre: lustre-MDT0000-mdc-ffff880203aa4860: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete [ 139.983045] Lustre: lustre-MDT0000: Client ac7dcf52-7dca-015b-3e5b-0a840c48c232 (at 0@lo) reconnecting [ 139.984979] Lustre: lustre-MDT0000-mdc-ffff880203aa4860: Connection restored to lustre-MDT0000 (at 0@lo) [ 139.987189] BUG: unable to handle kernel NULL pointer dereference at (null) [ 139.988113] IP: [<ffffffffa046084e>] cl_object_top+0xe/0x150 [obdclass] [ 139.988113] PGD 1f4083067 PUD 1fbfb9067 PMD 0 [ 139.988113] Oops: 0000 [#1] SMP [ 139.988113] last sysfs file: /sys/devices/system/cpu/possible [ 139.988113] CPU 0 [ 139.988113] Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) mdd(U) mgs(U) nodemap(U) osd_ldiskfs(U) ldiskfs(U) exportfs lquota(U) lfsck(U) jbd obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib] [ 139.988113] [ 139.988113] Pid: 3855, comm: touch Not tainted 2.6.32-431.5.1.el6.lustre.x86_64 #1 Bochs Bochs [ 139.988113] RIP: 0010:[<ffffffffa046084e>] [<ffffffffa046084e>] cl_object_top+0xe/0x150 [obdclass] [ 139.988113] RSP: 0018:ffff8801f40b9c58 EFLAGS: 00010292 [ 139.988113] RAX: ffff8801f8ed2f60 RBX: ffff880219e9a690 RCX: 0000000000000000 [ 139.988113] RDX: ffff8801f8f8bce8 RSI: ffffffffa04bca20 RDI: 0000000000000000 [ 139.988113] RBP: ffff8801f40b9c68 R08: 0000000000000001 R09: 0000000000000001 [ 139.988113] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880217e903c8 [ 139.988113] R13: 0000000000000002 R14: 0000000000000000 R15: ffff8801f8ed2f60 [ 139.988113] FS: 00007f641a061700(0000) GS:ffff88002f800000(0000) knlGS:0000000000000000 [ 139.988113] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 139.988113] CR2: 0000000000000000 CR3: 00000001f4074000 CR4: 00000000000006f0 [ 139.988113] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 139.988113] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 139.988113] Process touch (pid: 3855, threadinfo ffff8801f40b8000, task ffff8801fbf16380) [ 139.988113] Stack: [ 139.988113] ffff8801f40b9c68 ffff880219e9a690 ffff8801f40b9ca8 ffffffffa046fded [ 139.988113] <d> ffff8801f40b9cd0 ffff880217e903a0 ffff880219e9a690 ffff8801f40b9e48 [ 139.988113] <d> ffff880217e903c8 ffff8801f40b9cd4 ffff8801f40b9d08 ffffffffa0e846e9 [ 139.988113] Call Trace: [ 139.988113] [<ffffffffa046fded>] cl_io_init+0x3d/0xe0 [obdclass] [ 139.988113] [<ffffffffa0e846e9>] cl_setattr_ost+0xe9/0x2f0 [lustre] [ 139.988113] [<ffffffffa0e4e98c>] ll_setattr_raw+0xa2c/0x10d0 [lustre] [ 139.988113] [<ffffffff8107b997>] ? current_fs_time+0x27/0x30 [ 139.988113] [<ffffffffa0e4f095>] ll_setattr+0x65/0xd0 [lustre] [ 139.988113] [<ffffffff811c16a8>] notify_change+0x168/0x340 [ 139.988113] [<ffffffff811d601e>] utimes_common+0xde/0x1c0 [ 139.988113] [<ffffffff8119f51b>] ? put_unused_fd+0x3b/0x90 [ 139.988113] [<ffffffff811d61d0>] do_utimes+0xd0/0x170 [ 139.988113] [<ffffffff811d6372>] sys_utimensat+0x32/0x90 [ 139.988113] [<ffffffff81554222>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 139.988113] [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b [ 139.988113] Code: 48 89 df e8 85 71 d2 e0 48 c7 c3 f4 ff ff ff e9 2a ff ff ff 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 <48> 8b 07 0f 1f 80 00 00 00 00 48 89 c2 48 8b 80 a0 00 00 00 48 [ 139.988113] RIP [<ffffffffa046084e>] cl_object_top+0xe/0x150 [obdclass] [ 139.988113] RSP <ffff8801f40b9c58> [ 139.988113] CR2: 0000000000000000 |
| Comments |
| Comment by Oleg Drokin [ 13/Aug/14 ] |
|
This same issue hits without any fault injection in upstream kernel: [ 516.480189] Lustre: lustre-MDT0000-mdc-ffff8800d87a9800: Connection restored to lustre-MDT0000 (at 192.168.10.226@tcp) [ 516.481053] BUG: unable to handle kernel NULL pointer dereference at (null) [ 516.481778] IP: [<ffffffffa02c3255>] cl_object_top+0x5/0x110 [obdclass] [ 516.482014] PGD 8e69b067 PUD 8c162067 PMD 0 [ 516.482014] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC [ 516.482782] Modules linked in: lmv(C) fld(C) mgc(C) lustre(C) mdc(C) fid(C) lov(C) osc(C) ksocklnd(C) ptlrpc(C) obdclass(C) lvfs(C) lnet(C) sha512_generic crc32 libcfs(C) cfg80211 rfkill rpcsec_gss_krb5 ttm drm_kms_helper drm joydev i2c_piix4 pcspkr microcode serio_raw virtio_blk i2c_core nfsd [ 516.483868] CPU: 3 PID: 26002 Comm: chmod Tainted: G C 3.16.0-vm-nfs+ #26 [ 516.483868] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 516.483868] task: ffff8800bef95700 ti: ffff88008ea84000 task.ti: ffff88008ea84000 [ 516.483868] RIP: 0010:[<ffffffffa02c3255>] [<ffffffffa02c3255>] cl_object_top+0x5/0x110 [obdclass] [ 516.483868] RSP: 0018:ffff88008ea87d40 EFLAGS: 00010246 [ 516.483868] RAX: ffff880087e9cc00 RBX: 0000000000000000 RCX: 0000000000000000 [ 516.483868] RDX: ffff880090a82e00 RSI: ffffffffa02fd140 RDI: 0000000000000000 [ 516.483868] RBP: ffff88008ea87d70 R08: 0000000000000001 R09: 0000000000000000 [ 516.483868] R10: 0000000000000002 R11: 0000000000000008 R12: ffff8800bb12bce8 [ 516.483868] R13: ffff8800bd346eb8 R14: ffff880087e9cc00 R15: 0000000000000002 [ 516.483868] FS: 00007f20523d1740(0000) GS:ffff88011f4c0000(0000) knlGS:0000000000000000 [ 516.483868] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 516.483868] CR2: 0000000000000000 CR3: 000000010dabb000 CR4: 00000000000006e0 [ 516.483868] Stack: [ 516.483868] ffffffffa02cd90e ffff8800bd346e90 ffff8800bb12bce8 ffff8800bd346eb8 [ 516.483868] 0000000000000000 ffff88008ea87eb8 ffff88008ea87db8 ffffffffa05d61ed [ 516.483868] ffff88008de8ebc8 0000000100000000 ffff88008de8ebc8 0000000000000000 [ 516.483868] Call Trace: [ 516.483868] [<ffffffffa02cd90e>] ? cl_io_init+0x2e/0xd0 [obdclass] [ 516.483868] [<ffffffffa05d61ed>] cl_setattr_ost+0xfd/0x2a0 [lustre] [ 516.483868] [<ffffffffa05b6af9>] ll_setattr_raw+0x959/0xf10 [lustre] [ 516.483868] [<ffffffffa05b7113>] ll_setattr+0x63/0xc0 [lustre] [ 516.483868] [<ffffffff811ebc61>] notify_change+0x241/0x390 [ 516.483868] [<ffffffff811cdc71>] chmod_common+0x81/0x100 [ 516.483868] [<ffffffff811cf278>] SyS_fchmodat+0x48/0xb0 [ 516.483868] [<ffffffff816fb7ad>] system_call_fastpath+0x1a/0x1f [ 516.483868] Code: 00 48 8d 58 08 8d 51 01 89 90 98 00 00 00 41 89 14 24 e9 55 ff ff ff e8 70 09 01 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <48> 8b 0f eb 09 66 0f 1f 44 00 00 48 89 c1 48 8b 81 80 00 00 00 [ 516.483868] RIP [<ffffffffa02c3255>] cl_object_top+0x5/0x110 [obdclass] [ 516.483868] RSP <ffff88008ea87d40> [ 516.483868] CR2: 0000000000000000 (connection fails due to EPROTO in some enqueue that is a separate issue I guess. |
| Comment by Peter Jones [ 20/Jul/15 ] |
|
John Are you still able to trigger this on master now that the CLIO simplification changes are in place? Peter |
| Comment by John Hammond [ 20/Jul/15 ] |
|
No, I cannot. |
| Comment by Peter Jones [ 20/Jul/15 ] |
|
ok - thanks John! |