[LU-4970] Test failure sanity-lfsck test_14: ls should success Created: 28/Apr/14 Updated: 01/Oct/14 Resolved: 05/Sep/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 13760 | ||||||||||||
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com> This issue relates to the following test suite run: The sub-test test_14 failed with the following error:
Info required for matching: sanity-lfsck 14 |
| Comments |
| Comment by Andreas Dilger [ 29/Apr/14 ] |
|
Nathaniel, it looks like there is a memory allocation problem during the running of test_14 on all three failures you reported. Could you please take a look to see if this is something specific to ZFS, or is it related to LFSCK? |
| Comment by Nathaniel Clark [ 05/May/14 ] |
|
I don't see this on LFSCK OST debug_log: 2306:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue0()) ### server-side enqueue handler START 2305:0:(ldlm_resource.c:1154:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x171:0x0: rc = -2 2306:0:(ldlm_resource.c:1154:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x180:0x0: rc = -2 2305:0:(ldlm_lockd.c:1431:ldlm_handle_enqueue0()) ### server-side enqueue handler END (lock (null), rc -12) |
| Comment by Nathaniel Clark [ 19/May/14 ] |
|
This bug doesn't date back much prior to Apr 1, 2014 The earliest commit I can associate a failure to is: but this is the root of long patch chains. The earliest patch I can associate a direct failure to: I don't have a suspect patch but it appears somewhere in here an issue was introduced. |
| Comment by Andreas Dilger [ 27/May/14 ] |
|
I see there is still the "ls" failure in the test logs: ls: cannot access /mnt/lustre/d14.sanity-lfsck/f46: Cannot allocate memory but nothing in the Lustre debug logs, since they are only running with the default debug level. It probably makes sense to bump the debug level to -1 for test_14, and make the ls (9) failure an error_ignore() for the short term. That would allow debugging the problem more easily than just disabling the test. |
| Comment by Nathaniel Clark [ 28/May/14 ] |
|
Turn up debugging: |
| Comment by Nathaniel Clark [ 28/May/14 ] |
|
EXCEPT test for zfs: |
| Comment by Andreas Dilger [ 29/May/14 ] |
|
It appears that ls: cannot allocate memory is a bogus message caused by lvbo_init returning NULL if the object cannot be found, and NULL being interpreted by the higher layers as -ENOMEM. This would be fixed by the patch http://review.whamcloud.com/9004 that returns the proper error in this case (-ENOENT). There would still a test failure, but at least it would be more understandable. |
| Comment by Nathaniel Clark [ 30/May/14 ] |
|
Failures with full debugging turned on: |
| Comment by Nathaniel Clark [ 30/May/14 ] |
|
OST debug_log: (ofd_objects.c:634:ofd_attr_get()) Process entered (ofd_objects.c:659:ofd_attr_get()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe) (service.c:1068:ptlrpc_update_export_timer()) Process entered (service.c:1081:ptlrpc_update_export_timer()) Process leaving (ofd_lvb.c:98:ofd_lvbo_init()) Process leaving via out_obj (rc=18446744073709551614 : -2 : 0xfffffffffffffffe) (lustre_fid.h:712:fid_flatten32()) Process leaving (rc=4278190179 : 4278190179 : ff000063) (lu_object.c:1643:keys_init()) kmalloced 'ctx->lc_value': 320 at ffff88006f9e9000. (lu_object.c:429:lu_global_key_fini()) kfreed 'info': 512 at ffff88006f9e2e00. (capa.c:416:lu_capainfo_key_init()) kmalloced 'value': 128 at ffff880071527ac0. (dt_object.c:56:dt_global_key_fini()) kfreed 'info': 1216 at ffff88006f4ef800. (lu_ucred.c:53:lu_ucred_key_init()) kmalloced 'value': 72 at ffff880072641cc0. (tgt_main.c:163:tgt_key_fini()) kfreed 'info': 1152 at ffff88006bace000. (fld_handler.c:60:fld_key_fini()) kfreed 'info': 72 at ffff88006fdac7c0. (tgt_main.c:187:tgt_ses_key_init()) kmalloced 'value': 144 at ffff880071461bc0. (fid_handler.c:406:seq_key_fini()) kfreed 'info': 48 at ffff88006e3b4800. (service.c:1678:ptlrpc_server_request_add()) Process entered (service.c:1579:ptlrpc_server_hpreq_init()) Process entered (lquota_lib.c:53:lquota_key_fini()) kfreed 'info': 240 at ffff880072539cc0. (qsd_lib.c:71:qsd_key_fini()) kfreed 'info': 376 at ffff88006e688a00. (service.c:1630:ptlrpc_server_hpreq_init()) Process leaving (rc=0 : 0 : 0) (service.c:1693:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0) (lfsck_lib.c:55:lfsck_key_fini()) kfreed 'info': 1144 at ffff8800723f7000. (ofd_dev.c:2332:ofd_key_fini()) kfreed 'info': 656 at ffff880071e1e000. (service.c:2006:ptlrpc_server_handle_req_in()) Process leaving (rc=1 : 1 : 1) (lu_object.c:1593:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff880071a58c00. (ldlm_resource.c:1161:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x163:0x0: rc = -2 So ENOENT is the correct error that should be being passed up. |
| Comment by Nathaniel Clark [ 19/Jun/14 ] |
|
This has been EXCEPT'd for 2.6 |
| Comment by Nathaniel Clark [ 28/Jul/14 ] |
|
This seems to have shown up on review-dne-part-2: |
| Comment by nasf (Inactive) [ 30/Jul/14 ] |
|
Hi Clark, I think that it is my code issue caused the failure, I will take the bug and fix it. |
| Comment by nasf (Inactive) [ 30/Jul/14 ] |
|
Here are the patches: http://review.whamcloud.com/#/c/11276/ |
| Comment by nasf (Inactive) [ 05/Sep/14 ] |
|
Patches have been landed to master. |