[LU-6147] sanity-lfsck test_4: '(7) unexpected status' Created: 22/Jan/15  Updated: 15/Jul/15  Resolved: 03/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 17167

 Description   

This issue was created by maloo for John Hammond <john.hammond@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/7626441a-a14b-11e4-807a-5254006e85c2.

The sub-test test_4 failed with the following error:

(7) unexpected status

Please provide additional information about the failure here.

Info required for matching: sanity-lfsck 4



 Comments   
Comment by John Hammond [ 22/Jan/15 ]

When I run this locally I see an lfsck thread asleep in lu_object_find_at():

[  515.234216] Lustre: DEBUG MARKER: == sanity-lfsck test 4: FID-in-dirent can be rebuilt after MDT file-level backup/restore == 10:23:00 (1421943780)
[  521.780085] Lustre: Failing over lustre-MDT0000
[  524.984187] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 192.168.122.201@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[  524.987971] LustreError: Skipped 13 previous similar messages
[  527.956175] Lustre: 5093:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1421943787/real 1421943787]  req@ffff880064738188 x1491015708973216/t0(0) o251->MGC192.168.122.200@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1421943793 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[  527.961754] Lustre: 5093:0:(client.c:1942:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[  528.510534] Lustre: server umount lustre-MDT0000 complete
[  529.308146] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
[  531.180857] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
[  531.902794] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
[  537.656353] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
[  539.324451] Lustre: *** cfs_fail_loc=1601, val=0***
[  547.009622] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 3 clients reconnect
[  547.011917] Lustre: lustre-MDT0000: Denying connection for new client lustre-MDT0000-lwp-MDT0001_UUID (at 192.168.122.201@tcp), waiting for all 3 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 1:06
[  547.128862] Lustre: lustre-MDT0000: Recovery over after 0:01, of 3 clients 3 recovered and 0 were evicted.
[  578.733248] Lustre: DEBUG MARKER: sanity-lfsck test_4: @@@@@@ FAIL: (7) unexpected status
[  724.197393] INFO: task lfsck:5461 blocked for more than 120 seconds.
[  724.198800]       Not tainted 2.6.32-431.29.2.el6.lustre.x86_64 #1
[  724.200136] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  724.201748] lfsck         D 0000000000000000     0  5461      2 0x00000080
[  724.203226]  ffff880067f37c00 0000000000000046 0000000000000000 ffff880065abe188
[  724.204927]  ffff880067f37b70 ffffffffa03183fe ffff880067f37c00 ffffffffa04a5cd9
[  724.206596]  ffff880062298cc0 ffff880067f37fd8 000000000000fcc8 ffff880062298cc0
[  724.208431] Call Trace:
[  724.208981]  [<ffffffffa03183fe>] ? cfs_hash_spin_unlock+0xe/0x10 [libcfs]
[  724.210466]  [<ffffffffa04a5cd9>] ? lu_object_find_try+0x99/0x2b0 [obdclass]
[  724.211890]  [<ffffffffa04a5f2d>] lu_object_find_at+0x3d/0xe0 [obdclass]
[  724.212765]  [<ffffffff8155625b>] ? _read_unlock+0x2b/0x40
[  724.213476]  [<ffffffff81061d90>] ? default_wake_function+0x0/0x20
[  724.214280]  [<ffffffffa04a600f>] lu_object_find_slice+0x1f/0x80 [obdclass]
[  724.215198]  [<ffffffffa09e0838>] lfsck_master_oit_engine+0x608/0x1f40 [lfsck]
[  724.216154]  [<ffffffff810b74bd>] ? trace_hardirqs_on+0xd/0x10
[  724.216999]  [<ffffffffa09e2da1>] lfsck_master_engine+0xc31/0x14b0 [lfsck]
[  724.217933]  [<ffffffff81553065>] ? thread_return+0x4e/0x7e9
[  724.218670]  [<ffffffff81061d90>] ? default_wake_function+0x0/0x20
[  724.219474]  [<ffffffffa09e2170>] ? lfsck_master_engine+0x0/0x14b0 [lfsck]
[  724.220382]  [<ffffffff8109e856>] kthread+0x96/0xa0
[  724.220999]  [<ffffffff8100c30a>] child_rip+0xa/0x20
[  724.221665]  [<ffffffff815562e0>] ? _spin_unlock_irq+0x30/0x40
[  724.222430]  [<ffffffff8100bb10>] ? restore_args+0x0/0x30
[  724.223110]  [<ffffffff8109e7c0>] ? kthread+0x0/0xa0
[  724.223796]  [<ffffffff8100c300>] ? child_rip+0x0/0x20
Comment by nasf (Inactive) [ 22/Jan/15 ]

This is another failure instance of LU-6109, caused by LFSCK deadlocking with itself when it tries to find some special object, such as .lustre/lost+found/MDTxxxx
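The general shape of the problem can be sketched without Lustre internals: a thread already holding a lookup lock re-enters the same lookup path for a nested special object and blocks on the lock it holds. The following is a minimal, hypothetical Python illustration of that self-deadlock pattern, NOT Lustre code; the names (`find_object`, the single `lock`) are invented for the sketch, and a timeout stands in for the hung-task detection seen in the trace above.

```python
import threading

# Illustrative only (NOT Lustre code): a non-reentrant lock models the
# lookup path that the lfsck thread re-entered. The real thread blocked
# indefinitely in lu_object_find_at(); here a timeout lets the sketch
# report the deadlock instead of hanging.

lock = threading.Lock()  # stands in for the shared lookup lock

def find_object(name, nested=False):
    # Try to take the lookup lock, with a timeout so the demo terminates.
    got = lock.acquire(timeout=1)
    if not got:
        return "deadlock detected while finding " + name
    try:
        if not nested:
            # Resolving this object requires resolving a special child
            # object first -- re-entering the same lookup path while the
            # lock is still held, which is the self-deadlock.
            return find_object(name + "/child", nested=True)
        return "found " + name
    finally:
        lock.release()

print(find_object(".lustre/lost+found/MDT0000"))
# -> deadlock detected while finding .lustre/lost+found/MDT0000/child
```

The fix direction in the patch below is the usual one for this pattern: avoid re-entering the lookup (here, by not letting OI scrub purge the object) rather than making the lock reentrant.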

Comment by nasf (Inactive) [ 22/Jan/15 ]

John, would you please verify whether the patch http://review.whamcloud.com/#/c/13493/ works for you? Thanks!

Comment by John Hammond [ 22/Jan/15 ]

Yes. It works for me.

Comment by nasf (Inactive) [ 23/Jan/15 ]

Thanks John.

Comment by nasf (Inactive) [ 23/Jan/15 ]

LU-6109 fails for a more complex reason and needs further investigation.
But the reason for this failure (LU-6147) is known, so we will handle it separately.

Comment by nasf (Inactive) [ 23/Jan/15 ]

Here is the patch: http://review.whamcloud.com/#/c/13493/

Comment by nasf (Inactive) [ 02/Feb/15 ]

Many sanity-lfsck test_4 runs have failed on Maloo recently:
https://testing.hpdd.intel.com/sub_tests/query?utf8=✓&test_set%5Btest_set_script_id%5D=4f25830c-64fe-11e2-bfb2-52540035b04c&sub_test%5Bsub_test_script_id%5D=50859f34-64fe-11e2-bfb2-52540035b04c&sub_test%5Bstatus%5D=FAIL&sub_test%5Bquery_bugs%5D=&test_session%5Btest_host%5D=&test_session%5Btest_group%5D=&test_session%5Buser_id%5D=&test_session%5Bquery_date%5D=&test_session%5Bquery_recent_period%5D=&test_node%5Bos_type_id%5D=&test_node%5Bdistribution_type_id%5D=&test_node%5Barchitecture_type_id%5D=&test_node%5Bfile_system_type_id%5D=&test_node%5Blustre_branch_id%5D=&test_node_network%5Bnetwork_type_id%5D=&commit=Update+results

So we need to land this patch ASAP to avoid blocking others.

Comment by Jian Yu [ 02/Feb/15 ]

This is blocking patch review testing on master branch:
https://testing.hpdd.intel.com/test_sets/8e856a20-aa3a-11e4-9ffb-5254006e85c2
https://testing.hpdd.intel.com/test_sets/1c657568-aaa1-11e4-8eda-5254006e85c2

Comment by Gerrit Updater [ 03/Feb/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13493/
Subject: LU-6147 lfsck: NOT purge object by OI scrub
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d11360f4cc5d38cd748a97ca05e10121353ae616

Comment by Peter Jones [ 03/Feb/15 ]

Landed for 2.7

Comment by Gerrit Updater [ 23/Feb/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13840
Subject: LU-6147 lfsck: NOT purge object by OI scrub
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: ef9a3b471bfda6587f1da282a073f5b9a5226cfb

Comment by Bruno Faccini (Inactive) [ 27/Feb/15 ]

+1 at https://testing.hpdd.intel.com/test_sets/6e9169d6-a179-11e4-9db6-5254006e85c2

Generated at Sat Feb 10 01:57:40 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.