[LU-3228] fc18: sanity test_183: @@@@@@ FAIL: test_183 failed with 1 Created: 25/Apr/13  Updated: 01/Jul/14  Resolved: 25/Jan/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.4.1
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Minh Diep Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: yuc2

Issue Links:
Duplicate: duplicates LU-3373 "ldiskfs patches for FC19" (Resolved)
Related: is related to LU-1480 "failure on replay-single test_74: ASS..." (Resolved)
Severity: 3
Rank (Obsolete): 7889

 Description   

== sanity test 183: No crash or request leak in case of strange dispositions ========================= 20:23:20 (1366860200)
CMD: client-24vm3 /usr/sbin/lctl set_param fail_loc=0x148
fail_loc=0x148
ls: cannot open directory /mnt/lustre/d0.sanity/d183: No such file or directory
cat: /mnt/lustre/d0.sanity/d183/f.sanity.183: No such file or directory
CMD: client-24vm3 /usr/sbin/lctl set_param fail_loc=0
fail_loc=0
touch: cannot touch ‘/mnt/lustre/d0.sanity/d183/f.sanity.183’: No such file or directory
rm: cannot remove ‘/mnt/lustre/d0.sanity/d183’: Directory not empty
sanity test_183: @@@@@@ FAIL: test_183 failed with 1
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4024:error_noexit()
= /usr/lib64/lustre/tests/test-framework.sh:4047:error()
= /usr/lib64/lustre/tests/test-framework.sh:4301:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:4334:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4189:run_test()
= /usr/lib64/lustre/tests/sanity.sh:9626:main()
Dumping lctl log to /logdir/test_logs/2013-04-24/lustre-reviews-el6-x86_64-vs-lustre-reviews-fc18-x86_64-review-2_4_1_15074_-70194495514420-185046/sanity.test_183.*.1366860201.log
CMD: client-24vm1,client-24vm2.lab.whamcloud.com,client-24vm3,client-24vm4 /usr/sbin/lctl dk > /logdir/test_logs/2013-04-24/lustre-reviews-el6-x86_64-vs-lustre-reviews-fc18-x86_64-review-2_4_1_15074_-70194495514420-185046/sanity.test_183.debug_log.\$(hostname -s).1366860201.log;
dmesg > /logdir/test_logs/2013-04-24/lustre-reviews-el6-x86_64-vs-lustre-reviews-fc18-x86_64-review-2_4_1_15074_-70194495514420-185046/sanity.test_183.dmesg.\$(hostname -s).1366860201.log

https://maloo.whamcloud.com/test_sets/83b66856-ad8e-11e2-bbea-52540035b04c



 Comments   
Comment by Peter Jones [ 25/Apr/13 ]

Yu, Jian

Could you please look into this one?

Thanks

Peter

Comment by Jian Yu [ 22/May/13 ]

Lustre Branch: master (tag 2.4.50)
Lustre Build: http://build.whamcloud.com/job/lustre-master/1502/

Sanity test 183 was added for LU-2275 in http://review.whamcloud.com/4458.

It passed on a RHEL6.4 client:
https://maloo.whamcloud.com/test_sets/456be8de-c2db-11e2-b2eb-52540035b04c

but failed on an FC18 client:
https://maloo.whamcloud.com/test_sets/a6a8e2d4-c2d4-11e2-b2eb-52540035b04c

After fail_loc was set back to 0 on the MDS, the touch operation failed on the client node:

fail_loc=0
touch: cannot touch '/mnt/lustre/d0.sanity/d183/f.sanity.183': No such file or directory

The debug log on the client node showed:

00000001:00000001:0.0:1369222356.453337:0:8789:0:(debug.c:444:libcfs_debug_mark_buffer()) ***************************************************
00000001:02000400:0.0:1369222356.453339:0:8789:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: fail_loc=0
00000001:00000001:0.0:1369222356.453345:0:8789:0:(debug.c:446:libcfs_debug_mark_buffer()) ***************************************************
......
00000080:00000001:0.0:1369222356.699798:0:8803:0:(dcache.c:96:ll_dcompare()) Process entered
00000080:00002000:0.0:1369222356.699799:0:8803:0:(dcache.c:119:ll_dcompare()) found name f.sanity.183(ffff88007a14f540) flags 0xce refc 0
00000080:00000001:0.0:1369222356.699800:0:8803:0:(dcache.c:128:ll_dcompare()) Process leaving (rc=0 : 0 : 0)
00000080:00000001:0.0:1369222356.699801:0:8803:0:(dcache.c:669:ll_revalidate_nd()) Process entered
00000080:00200000:0.0:1369222356.699801:0:8803:0:(dcache.c:671:ll_revalidate_nd()) VFS Op:name=f.sanity.183,flags=769
00000080:00000001:0.0:1369222356.699802:0:8803:0:(dcache.c:684:ll_revalidate_nd()) Process leaving (rc=1 : 1 : 1)
00000080:00000001:0.0:1369222356.699803:0:8803:0:(file.c:2978:ll_inode_permission()) Process entered
00000080:00200000:0.0:1369222356.699803:0:8803:0:(file.c:3001:ll_inode_permission()) VFS Op:inode=144115205255725060/33554436(ffff88007b35c9f8), inode mode 41ed mask 3
00000080:00000001:0.0:1369222356.699804:0:8803:0:(file.c:3009:ll_inode_permission()) Process leaving (rc=0 : 0 : 0)
00000080:00000001:0.0:1369222356.699806:0:8803:0:(namei.c:622:ll_atomic_open()) Process entered
00000080:00200000:0.0:1369222356.699807:0:8803:0:(namei.c:627:ll_atomic_open()) VFS Op:name=f.sanity.183,dir=144115205255725060/33554436(ffff88007b35c9f8),file ffff88007adf5400,open_flags 8941,mode 81b6 opened 0
00000080:00000010:0.0:1369222356.699809:0:8803:0:(namei.c:629:ll_atomic_open()) kmalloced 'it': 72 at ffff88007a8bb420.
00000080:00000001:0.0:1369222356.699811:0:8803:0:(namei.c:496:ll_lookup_it()) Process entered
00000080:00200000:0.0:1369222356.699811:0:8803:0:(namei.c:503:ll_lookup_it()) VFS Op:name=f.sanity.183,dir=144115205255725060/33554436(ffff88007b35c9f8),intent=open|creat
......
00000080:00000001:0.0:1369222356.702127:0:8803:0:(namei.c:568:ll_lookup_it()) Process leaving via out (rc=0 : 0 : 0x0)
00000100:00000001:0.0:1369222356.702128:0:8803:0:(client.c:2271:__ptlrpc_req_finished()) Process entered
00000100:00000040:0.0:1369222356.702130:0:8803:0:(client.c:2283:__ptlrpc_req_finished()) @@@ refcount now 2 req@ffff88007b35b400 x1435733615182304/t4294967313(4294967313) o101->lustre-MDT0000-mdc-ffff88007c342000@10.10.4.208@tcp:12/10 lens 568/600 e 0 to 0 dl 1369222363 ref 3 fl Complete:RP/4/0 rc 0/0
00000100:00000001:0.0:1369222356.702133:0:8803:0:(client.c:2290:__ptlrpc_req_finished()) Process leaving (rc=0 : 0 : 0)
00000080:00000001:0.0:1369222356.702135:0:8803:0:(dcache.c:289:ll_intent_release()) Process entered
......
00000080:00000001:0.0:1369222356.702158:0:8803:0:(dcache.c:304:ll_intent_release()) Process leaving
00000080:00000010:0.0:1369222356.702158:0:8803:0:(namei.c:688:ll_atomic_open()) kfreed 'it': 72 at ffff88007a8bb420.
00000080:00000001:0.0:1369222356.702168:0:8803:0:(namei.c:690:ll_atomic_open()) Process leaving (rc=1 : 1 : 1)

ll_atomic_open() failed with 1, the value returned from finish_no_open(). I'm still digging.
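
For reference, a minimal sketch of the generic ->atomic_open()/finish_no_open() pattern (not the Lustre code; example_atomic_open() and example_lookup() are made-up names and the prototype is simplified): finish_no_open() stores the dentry in the file and returns 1, which matches the rc=1 above and tells the VFS to fall back to a regular open; if the dentry is still negative at that point, the open fails with ENOENT.

#include <linux/fs.h>
#include <linux/dcache.h>

/* Hypothetical helper standing in for the filesystem's lookup+intent code
 * (ll_lookup_it() in Lustre); here it does nothing and leaves the dentry
 * negative, which is exactly the failure mode in this ticket. */
static int example_lookup(struct inode *dir, struct dentry *dentry,
			  unsigned int open_flags)
{
	return 0;
}

/* Illustrative sketch only, not the Lustre implementation. */
static int example_atomic_open(struct inode *dir, struct dentry *dentry,
			       struct file *file, unsigned int open_flags)
{
	int rc;

	rc = example_lookup(dir, dentry, open_flags);
	if (rc)
		return rc;

	/*
	 * Decline to open here and hand the dentry back to the VFS:
	 * finish_no_open() stashes the dentry in the file and returns 1,
	 * matching the "Process leaving (rc=1)" in the debug log.  The VFS
	 * then retries a regular open; if the dentry is still negative
	 * (no inode attached), that open fails with -ENOENT, which is the
	 * "No such file or directory" the test reports.
	 */
	return finish_no_open(file, dentry);
}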

Comment by Jian Yu [ 09/Aug/13 ]

Lustre Branch: b2_4
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/27/
Distro/Arch: RHEL6.4/x86_64 + FC18/x86_64 (Server + Client)

The failure occurred regularly on the Lustre b2_4 branch with an FC18 client:
https://maloo.whamcloud.com/test_sets/996b2278-fd79-11e2-9fdb-52540035b04c

Comment by Jian Yu [ 04/Sep/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)
Distro/Arch: RHEL6.4/x86_64 + FC18/x86_64 (Server + Client)

Sanity test 183 hit the same failure:
https://maloo.whamcloud.com/test_sets/0cbde1d0-14ee-11e3-ac48-52540035b04c

Comment by Oleg Drokin [ 24/Jan/14 ]

After failing like this, the next unmount, which happens to be in test 223, crashes with a reference count assertion.
This is LU-1480.

This bug is also observed with the mainline kernel.

Comment by Oleg Drokin [ 25/Jan/14 ]

Apparently this bug was introduced by commit 784cd144103871bd421c139c09bfbf4d5d29ca08, which came from patch http://review.whamcloud.com/4387 adding atomic_open support.

The problematic hunk is this:

@@ -438,13 +443,21 @@ int ll_lookup_it_finish(struct ptlrpc_request *request,
                    Also see bug 7198. */
        }
 
-       *de = ll_splice_alias(inode, *de);
+       /* Only hash *de if it is unhashed (new dentry).
+        * Atoimc_open may passin hashed dentries for open.
+        */
+       if (d_unhashed(*de))
+               *de = ll_splice_alias(inode, *de);

When this hits (i.e. the dentry IS hashed), two things happen:
1. We leak the inode (and, obviously, never attach it to the dentry).
2. We return NULL from ll_lookup_it, so the caller uses the passed-in dentry, but it is negative, which leads to that no_open branch, where we pass in de, which is NULL (we should probably pass in dentry at all times?).
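
A minimal sketch of the two cases, assuming only generic VFS helpers (lookup_finish_sketch() is a hypothetical name; this is neither the Lustre code nor the eventual fix): the lookup path holds a reference on the inode, so skipping the splice for an already-hashed dentry both leaks that reference and leaves the dentry negative.

#include <linux/fs.h>
#include <linux/dcache.h>
#include <linux/err.h>

/* Schematic illustration of the leak described above. */
static int lookup_finish_sketch(struct inode *inode, struct dentry **de)
{
	if (d_unhashed(*de)) {
		/* New dentry from a plain lookup: splice attaches the
		 * inode (or reuses an existing alias) and hashes it. */
		struct dentry *alias = d_splice_alias(inode, *de);

		if (IS_ERR(alias))
			return PTR_ERR(alias);
		if (alias)
			*de = alias;
	} else {
		/*
		 * Hashed dentry passed in by ->atomic_open(): the buggy
		 * hunk simply fell through here, so the inode reference
		 * taken during lookup was never consumed (leak) and *de
		 * stayed negative.  One way to handle this case is to
		 * attach the inode to the existing dentry so the reference
		 * is consumed and the subsequent open can succeed.
		 */
		d_instantiate(*de, inode);
	}
	return 0;
}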

Comment by Oleg Drokin [ 25/Jan/14 ]

Apparently this was fixed by http://review.whamcloud.com/8110

Comment by Oleg Drokin [ 25/Jan/14 ]

I verified that the patch in http://review.whamcloud.com/8110 fixes the issue.
