[LU-4970] Test failure sanity-lfsck test_14: ls should success Created: 28/Apr/14  Updated: 01/Oct/14  Resolved: 05/Sep/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-4524 ldlm_handle_enqueue0 incorrectly inte... Resolved
is related to LU-5184 sanity-lfsck test 18d: file size neve... Resolved
Severity: 3
Rank (Obsolete): 13760

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run:
http://maloo.whamcloud.com/test_sets/33c0eb0c-cd6b-11e3-b548-52540035b04c
https://maloo.whamcloud.com/test_sets/1ca115d8-bb68-11e3-8ec1-52540035b04c
https://maloo.whamcloud.com/test_sets/beed7496-c0a2-11e3-b5ea-52540035b04c

The sub-test test_14 failed with the following error:

'ls' should success after layout LFSCK repairing
ls: cannot access /mnt/lustre/d14.sanity-lfsck/f46: Cannot allocate memory
ls: cannot access /mnt/lustre/d14.sanity-lfsck/f32: Cannot allocate memory
sanity-lfsck test_14: @@@@@@ FAIL: (5) ls should success.

Info required for matching: sanity-lfsck 14



 Comments   
Comment by Andreas Dilger [ 29/Apr/14 ]

Nathaniel, it looks like there is a memory allocation problem during the running of test_14 on all three failures you reported.

Could you please take a look to see if this is something specific to ZFS, or is it related to LFSCK?

Comment by Nathaniel Clark [ 05/May/14 ]

I don't see this on LFSCK

OST debug_log:

2306:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue0()) ### server-side enqueue handler START
2305:0:(ldlm_resource.c:1154:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x171:0x0: rc = -2
2306:0:(ldlm_resource.c:1154:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x180:0x0: rc = -2
2305:0:(ldlm_lockd.c:1431:ldlm_handle_enqueue0()) ### server-side enqueue handler END (lock (null), rc -12)
Comment by Nathaniel Clark [ 19/May/14 ]

This bug doesn't date back much prior to Apr 1, 2014

The earliest commit I can associate a failure to is:
588a29b Wed Mar 26 00:14:46 2014 +0000 LU-4462 mdt: don't apply mdt_object_fid() to ERR_PTRs

but this is the root of long patch chains. The earliest patch I can associate a direct failure to:
a38ac90 Tue Apr 1 16:11:43 2014 +0000 LU-4805 lmv: lookup remote migrating object in LMV

I don't have a suspect patch but it appears somewhere in here an issue was introduced.

Comment by Andreas Dilger [ 27/May/14 ]

I see there is still the "ls" failure in the test logs:

ls: cannot access /mnt/lustre/d14.sanity-lfsck/f46: Cannot allocate memory

but nothing in the Lustre debug logs, since they are only running with the default debug level.

It probably makes sense to bump the debug level to -1 for test_14, and make the ls (9) failure an error_ignore() for the short term. That would allow debugging the problem more easily than just disabling the test.

Comment by Nathaniel Clark [ 28/May/14 ]

Turn up debugging:

http://review.whamcloud.com/10472

Comment by Nathaniel Clark [ 28/May/14 ]

EXCEPT test for zfs:
http://review.whamcloud.com/10473

Comment by Andreas Dilger [ 29/May/14 ]

It appears that ls: cannot allocate memory is a bogus message caused by lvbo_init returning NULL if the object cannot be found, and NULL being interpreted by the higher layers as -ENOMEM. This would be fixed by the patch http://review.whamcloud.com/9004 that returns the proper error in this case (-ENOENT). There would still a test failure, but at least it would be more understandable.

Comment by Nathaniel Clark [ 30/May/14 ]

Failures with full debugging turned on:
https://maloo.whamcloud.com/test_sets/8d617942-cffa-11e3-9aa7-52540035b04c
https://maloo.whamcloud.com/test_sets/c14112a2-e7d5-11e3-8bff-52540035b04c

Comment by Nathaniel Clark [ 30/May/14 ]

OST debug_log:

(ofd_objects.c:634:ofd_attr_get()) Process entered
(ofd_objects.c:659:ofd_attr_get()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
(service.c:1068:ptlrpc_update_export_timer()) Process entered
(service.c:1081:ptlrpc_update_export_timer()) Process leaving
(ofd_lvb.c:98:ofd_lvbo_init()) Process leaving via out_obj (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)
(lustre_fid.h:712:fid_flatten32()) Process leaving (rc=4278190179 : 4278190179 : ff000063)
(lu_object.c:1643:keys_init()) kmalloced 'ctx->lc_value': 320 at ffff88006f9e9000.
(lu_object.c:429:lu_global_key_fini()) kfreed 'info': 512 at ffff88006f9e2e00.
(capa.c:416:lu_capainfo_key_init()) kmalloced 'value': 128 at ffff880071527ac0.
(dt_object.c:56:dt_global_key_fini()) kfreed 'info': 1216 at ffff88006f4ef800.
(lu_ucred.c:53:lu_ucred_key_init()) kmalloced 'value': 72 at ffff880072641cc0.
(tgt_main.c:163:tgt_key_fini()) kfreed 'info': 1152 at ffff88006bace000.
(fld_handler.c:60:fld_key_fini()) kfreed 'info': 72 at ffff88006fdac7c0.
(tgt_main.c:187:tgt_ses_key_init()) kmalloced 'value': 144 at ffff880071461bc0.
(fid_handler.c:406:seq_key_fini()) kfreed 'info': 48 at ffff88006e3b4800.
(service.c:1678:ptlrpc_server_request_add()) Process entered
(service.c:1579:ptlrpc_server_hpreq_init()) Process entered
(lquota_lib.c:53:lquota_key_fini()) kfreed 'info': 240 at ffff880072539cc0.
(qsd_lib.c:71:qsd_key_fini()) kfreed 'info': 376 at ffff88006e688a00.
(service.c:1630:ptlrpc_server_hpreq_init()) Process leaving (rc=0 : 0 : 0)
(service.c:1693:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
(lfsck_lib.c:55:lfsck_key_fini()) kfreed 'info': 1144 at ffff8800723f7000.
(ofd_dev.c:2332:ofd_key_fini()) kfreed 'info': 656 at ffff880071e1e000.
(service.c:2006:ptlrpc_server_handle_req_in()) Process leaving (rc=1 : 1 : 1)
(lu_object.c:1593:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff880071a58c00.
(ldlm_resource.c:1161:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x163:0x0: rc = -2

So ENOENT is the correct error that should be being passed up.

Comment by Nathaniel Clark [ 19/Jun/14 ]

This has been EXCEPT'd for 2.6

Comment by Nathaniel Clark [ 28/Jul/14 ]

This seems to have shown up on review-dne-part-2:
https://testing.hpdd.intel.com/test_sets/3db646c0-1433-11e4-88ed-5254006e85c2

Comment by nasf (Inactive) [ 30/Jul/14 ]

Hi Clark, I think that it is my code issue caused the failure, I will take the bug and fix it.

Comment by nasf (Inactive) [ 30/Jul/14 ]

Here are the patches:

http://review.whamcloud.com/#/c/11276/
http://review.whamcloud.com/#/c/11590/

Comment by nasf (Inactive) [ 05/Sep/14 ]

Patches have been landed to master.

Generated at Sat Feb 10 01:47:25 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.