[LU-5340] oops in cl_object_top() after reconstructed setattr on removed directory Created: 14/Jul/14  Updated: 20/Jul/15  Resolved: 20/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: John Hammond Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: llite, malloc

Issue Links:
Related
is related to LU-5285 mdt_reconstruct_setattr() calls mdt_a... Resolved
Severity: 3
Rank (Obsolete): 14897

 Description   

See the discussion on from http://review.whamcloud.com/#/c/11025/2/lustre/mdt/mdt_recovery.c.

To reproduce:

MOUNT_2=y llmount.sh
mkdir /mnt/lustre/d0
stat /mnt/lustre/d0
lctl set_param fail_loc=0x119 # OBD_FAIL_MDS_REINT_NET_REP
touch /mnt/lustre/d0 &
sleep 1
lctl set_param fail_loc=0
rmdir /mnt/lustre2/d0
[  132.938246] Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt
[  132.970143] Lustre: *** cfs_fail_loc=119, val=2147483648***
[  132.972029] LustreError: 3623:0:(ldlm_lib.c:2399:target_send_reply_msg()) @@@ dropping reply  req@ffff8801f8edd908 x1473623045964728/t4294967298(0) o36->ac7dcf52-7dca-015b-3e5b-0a840c48c232@0@lo:0/0 lens 488/456 e 0 to 0 dl 1405356492 ref 1 fl Interpret:/0/0 rc 0/0
[  139.969185] Lustre: 3855:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1405356486/real 1405356486]  req@ffff8801f64beb00 x1473623045964728/t0(0) o36->lustre-MDT0000-mdc-ffff880203aa4860@0@lo:12/10 lens 488/1016 e 0 to 1 dl 1405356493 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[  139.978122] Lustre: lustre-MDT0000-mdc-ffff880203aa4860: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[  139.983045] Lustre: lustre-MDT0000: Client ac7dcf52-7dca-015b-3e5b-0a840c48c232 (at 0@lo) reconnecting
[  139.984979] Lustre: lustre-MDT0000-mdc-ffff880203aa4860: Connection restored to lustre-MDT0000 (at 0@lo)
[  139.987189] BUG: unable to handle kernel NULL pointer dereference at (null)
[  139.988113] IP: [<ffffffffa046084e>] cl_object_top+0xe/0x150 [obdclass]
[  139.988113] PGD 1f4083067 PUD 1fbfb9067 PMD 0 
[  139.988113] Oops: 0000 [#1] SMP 
[  139.988113] last sysfs file: /sys/devices/system/cpu/possible
[  139.988113] CPU 0 
[  139.988113] Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) mdd(U) mgs(U) nodemap(U) osd_ldiskfs(U) ldiskfs(U) exportfs lquota(U) lfsck(U) jbd obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
[  139.988113] 
[  139.988113] Pid: 3855, comm: touch Not tainted 2.6.32-431.5.1.el6.lustre.x86_64 #1 Bochs Bochs
[  139.988113] RIP: 0010:[<ffffffffa046084e>]  [<ffffffffa046084e>] cl_object_top+0xe/0x150 [obdclass]
[  139.988113] RSP: 0018:ffff8801f40b9c58  EFLAGS: 00010292
[  139.988113] RAX: ffff8801f8ed2f60 RBX: ffff880219e9a690 RCX: 0000000000000000
[  139.988113] RDX: ffff8801f8f8bce8 RSI: ffffffffa04bca20 RDI: 0000000000000000
[  139.988113] RBP: ffff8801f40b9c68 R08: 0000000000000001 R09: 0000000000000001
[  139.988113] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880217e903c8
[  139.988113] R13: 0000000000000002 R14: 0000000000000000 R15: ffff8801f8ed2f60
[  139.988113] FS:  00007f641a061700(0000) GS:ffff88002f800000(0000) knlGS:0000000000000000
[  139.988113] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  139.988113] CR2: 0000000000000000 CR3: 00000001f4074000 CR4: 00000000000006f0
[  139.988113] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  139.988113] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  139.988113] Process touch (pid: 3855, threadinfo ffff8801f40b8000, task ffff8801fbf16380)
[  139.988113] Stack:
[  139.988113]  ffff8801f40b9c68 ffff880219e9a690 ffff8801f40b9ca8 ffffffffa046fded
[  139.988113] <d> ffff8801f40b9cd0 ffff880217e903a0 ffff880219e9a690 ffff8801f40b9e48
[  139.988113] <d> ffff880217e903c8 ffff8801f40b9cd4 ffff8801f40b9d08 ffffffffa0e846e9
[  139.988113] Call Trace:
[  139.988113]  [<ffffffffa046fded>] cl_io_init+0x3d/0xe0 [obdclass]
[  139.988113]  [<ffffffffa0e846e9>] cl_setattr_ost+0xe9/0x2f0 [lustre]
[  139.988113]  [<ffffffffa0e4e98c>] ll_setattr_raw+0xa2c/0x10d0 [lustre]
[  139.988113]  [<ffffffff8107b997>] ? current_fs_time+0x27/0x30
[  139.988113]  [<ffffffffa0e4f095>] ll_setattr+0x65/0xd0 [lustre]
[  139.988113]  [<ffffffff811c16a8>] notify_change+0x168/0x340
[  139.988113]  [<ffffffff811d601e>] utimes_common+0xde/0x1c0
[  139.988113]  [<ffffffff8119f51b>] ? put_unused_fd+0x3b/0x90
[  139.988113]  [<ffffffff811d61d0>] do_utimes+0xd0/0x170
[  139.988113]  [<ffffffff811d6372>] sys_utimensat+0x32/0x90
[  139.988113]  [<ffffffff81554222>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[  139.988113]  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[  139.988113] Code: 48 89 df e8 85 71 d2 e0 48 c7 c3 f4 ff ff ff e9 2a ff ff ff 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 <48> 8b 07 0f 1f 80 00 00 00 00 48 89 c2 48 8b 80 a0 00 00 00 48 
[  139.988113] RIP  [<ffffffffa046084e>] cl_object_top+0xe/0x150 [obdclass]
[  139.988113]  RSP <ffff8801f40b9c58>
[  139.988113] CR2: 0000000000000000


 Comments   
Comment by Oleg Drokin [ 13/Aug/14 ]

This same issue hits without any fault injection in upstream kernel:

[  516.480189] Lustre: lustre-MDT0000-mdc-ffff8800d87a9800: Connection restored to lustre-MDT0000 (at 192.168.10.226@tcp)
[  516.481053] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  516.481778] IP: [<ffffffffa02c3255>] cl_object_top+0x5/0x110 [obdclass]
[  516.482014] PGD 8e69b067 PUD 8c162067 PMD 0 
[  516.482014] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[  516.482782] Modules linked in: lmv(C) fld(C) mgc(C) lustre(C) mdc(C) fid(C) lov(C) osc(C) ksocklnd(C) ptlrpc(C) obdclass(C) lvfs(C) lnet(C) sha512_generic crc32 libcfs(C) cfg80211 rfkill rpcsec_gss_krb5 ttm drm_kms_helper drm joydev i2c_piix4 pcspkr microcode serio_raw virtio_blk i2c_core nfsd
[  516.483868] CPU: 3 PID: 26002 Comm: chmod Tainted: G         C     3.16.0-vm-nfs+ #26
[  516.483868] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  516.483868] task: ffff8800bef95700 ti: ffff88008ea84000 task.ti: ffff88008ea84000
[  516.483868] RIP: 0010:[<ffffffffa02c3255>]  [<ffffffffa02c3255>] cl_object_top+0x5/0x110 [obdclass]
[  516.483868] RSP: 0018:ffff88008ea87d40  EFLAGS: 00010246
[  516.483868] RAX: ffff880087e9cc00 RBX: 0000000000000000 RCX: 0000000000000000
[  516.483868] RDX: ffff880090a82e00 RSI: ffffffffa02fd140 RDI: 0000000000000000
[  516.483868] RBP: ffff88008ea87d70 R08: 0000000000000001 R09: 0000000000000000
[  516.483868] R10: 0000000000000002 R11: 0000000000000008 R12: ffff8800bb12bce8
[  516.483868] R13: ffff8800bd346eb8 R14: ffff880087e9cc00 R15: 0000000000000002
[  516.483868] FS:  00007f20523d1740(0000) GS:ffff88011f4c0000(0000) knlGS:0000000000000000
[  516.483868] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  516.483868] CR2: 0000000000000000 CR3: 000000010dabb000 CR4: 00000000000006e0
[  516.483868] Stack:
[  516.483868]  ffffffffa02cd90e ffff8800bd346e90 ffff8800bb12bce8 ffff8800bd346eb8
[  516.483868]  0000000000000000 ffff88008ea87eb8 ffff88008ea87db8 ffffffffa05d61ed
[  516.483868]  ffff88008de8ebc8 0000000100000000 ffff88008de8ebc8 0000000000000000
[  516.483868] Call Trace:
[  516.483868]  [<ffffffffa02cd90e>] ? cl_io_init+0x2e/0xd0 [obdclass]
[  516.483868]  [<ffffffffa05d61ed>] cl_setattr_ost+0xfd/0x2a0 [lustre]
[  516.483868]  [<ffffffffa05b6af9>] ll_setattr_raw+0x959/0xf10 [lustre]
[  516.483868]  [<ffffffffa05b7113>] ll_setattr+0x63/0xc0 [lustre]
[  516.483868]  [<ffffffff811ebc61>] notify_change+0x241/0x390
[  516.483868]  [<ffffffff811cdc71>] chmod_common+0x81/0x100
[  516.483868]  [<ffffffff811cf278>] SyS_fchmodat+0x48/0xb0
[  516.483868]  [<ffffffff816fb7ad>] system_call_fastpath+0x1a/0x1f
[  516.483868] Code: 00 48 8d 58 08 8d 51 01 89 90 98 00 00 00 41 89 14 24 e9 55 ff ff ff e8 70 09 01 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <48> 8b 0f eb 09 66 0f 1f 44 00 00 48 89 c1 48 8b 81 80 00 00 00 
[  516.483868] RIP  [<ffffffffa02c3255>] cl_object_top+0x5/0x110 [obdclass]
[  516.483868]  RSP <ffff88008ea87d40>
[  516.483868] CR2: 0000000000000000

(connection fails due to EPROTO in some enqueue that is a separate issue I guess.

Comment by Peter Jones [ 20/Jul/15 ]

John

Are you still able to trigger this on master now that the CLIO simplification changes are in place?

Peter

Comment by John Hammond [ 20/Jul/15 ]

No, I cannot.

Comment by Peter Jones [ 20/Jul/15 ]

ok - thanks John!

Generated at Sat Feb 10 01:50:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.