[LU-4926] sanity-scrub: Failed to copy files to mds2, Bad file descriptor Created: 17/Apr/14 Updated: 04/Jun/14 Resolved: 22/Apr/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | nasf (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 13616 |
| Description |
|
This issue was created by maloo for nasf <fan.yong@intel.com>. This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/cb28be52-c631-11e3-874b-52540035b04c. The sub-test test_5 failed with the following error:
Info required for matching: sanity-scrub 5
Under a newly formatted DNE environment, copying a file from the local system to MDS2 fails with "Bad file descriptor". |
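For context, a minimal manual sketch of the kind of operation that triggers the failure, assuming a two-MDT (DNE) filesystem with a client mounted at /mnt/lustre; the mount point and directory name are assumptions, not details from the test logs:

    # Assumed mount point and directory name; MDT index 1 corresponds to the second MDS (MDS2).
    lfs mkdir -i 1 /mnt/lustre/dir_on_mdt1     # create a directory served by MDT0001
    cp /etc/hosts /mnt/lustre/dir_on_mdt1/     # the copy fails with "Bad file descriptor"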
| Comments |
| Comment by nasf (Inactive) [ 18/Apr/14 ] |
|
The attachment is the log from another failure instance. It seems related to CLIO. Jinshan, could you please take a look? Thanks. |
| Comment by Andreas Dilger [ 21/Apr/14 ] |
|
Nasf, are you able to reproduce this with manual testing? What steps do you take? It seems serious that a file cannot be copied to a new filesystem. Is there something that the test scripts are doing that might be hiding this bug from us? |
| Comment by nasf (Inactive) [ 21/Apr/14 ] |
|
I can reproduce it by running sanity-scrub.sh with DNE enabled, but I cannot reproduce it manually yet. I will investigate further. |
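For reference, a sketch of how such a run is commonly invoked through the Lustre test framework; the variable names follow the usual test-framework conventions, but the installation path and exact values are assumptions rather than details taken from this ticket:

    # Assumed test installation path; MDSCOUNT=2 sets up two MDTs (DNE), ONLY=5 runs only test_5.
    cd /usr/lib64/lustre/tests
    MDSCOUNT=2 ONLY=5 sh sanity-scrub.sh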
| Comment by nasf (Inactive) [ 22/Apr/14 ] |
|
The copy fails because mdc_intent_lock() does not fetch the LOV EA for the newly created file; its caller then calls ll_prep_inode() to set up the local object with an empty stripe, so the subsequent write operations against the empty LOV EA are regarded as invalid. In the debug logs, the MDT explicitly skips the LOV EA, the client switches the layout to EMPTY, and the write then fails with -EBADF (errno 9):

00000004:00000001:0.0:1395927909.641540:0:27857:0:(mdt_handler.c:750:mdt_getattr_internal()) Process entered
00000004:00000040:0.0:1395927909.641540:0:27857:0:(mdt_handler.c:780:mdt_getattr_internal()) lustre-MDT0001: RPC from acaf5fbc-81d0-3a35-8cbb-3c2d460ad809: does not need LOVEA.
...
00000080:00000001:1.0:1395927909.641917:0:27983:0:(dcache.c:317:ll_revalidate_it_finish()) Process entered
00000080:00000001:1.0:1395927909.641918:0:27983:0:(llite_lib.c:2411:ll_prep_inode()) Process entered
...
00020000:00000002:1.0:1395927909.641948:0:27983:0:(lov_object.c:722:lov_layout_change()) [0x2c0002340:0x4f:0x0] from RAID0 to EMPTY
...
00000080:00000001:1.0:1395927909.642709:0:27983:0:(file.c:1355:ll_file_write()) Process entered
00020000:00000001:1.0:1395927909.642730:0:27983:0:(lov_io.c:909:lov_io_init_empty()) Process entered
00020000:00000001:1.0:1395927909.642730:0:27983:0:(lov_io.c:938:lov_io_init_empty()) Process leaving (rc=1 : 1 : 1)
00000020:00000001:1.0:1395927909.642731:0:27983:0:(cl_io.c:179:cl_io_init0()) Process leaving (rc=1 : 1 : 1)
00000020:00000001:1.0:1395927909.642731:0:27983:0:(cl_io.c:240:cl_io_rw_init()) Process leaving (rc=1 : 1 : 1)
00000080:00000001:1.0:1395927909.642732:0:27983:0:(file.c:1189:ll_file_io_generic()) Process leaving via out (rc=18446744073709551607 : -9 : 0xfffffffffffffff7)
...

The relevant client-side code path is lov_io_init_empty(), which returns -EBADF for a write on an object that has no stripes:

int lov_io_init_empty(const struct lu_env *env, struct cl_object *obj,
                      struct cl_io *io)
{
        struct lov_object *lov = cl2lov(obj);
        struct lov_io *lio = lov_env_io(env);
        int result;
        ENTRY;

        lio->lis_object = lov;
        switch (io->ci_type) {
        default:
                LBUG();
        case CIT_MISC:
        case CIT_READ:
                result = 0;
                break;
        case CIT_FSYNC:
        case CIT_SETATTR:
                result = +1;
                break;
        case CIT_WRITE:
                result = -EBADF;
                break;
        case CIT_FAULT:
                result = -EFAULT;
                CERROR("Page fault on a file without stripes: "DFID"\n",
                       PFID(lu_object_fid(&obj->co_lu)));
                break;
        }

        if (result == 0) {
                cl_io_slice_add(io, &lio->lis_cl, obj, &lov_empty_io_ops);
                cfs_atomic_inc(&lov->lo_active_ios);
        }

        io->ci_result = result < 0 ? result : 0;
        RETURN(result != 0);
}

In fact, this issue has already been resolved by the patch http://review.whamcloud.com/#/c/9862/. I have repeatedly run sanity-scrub with that patch applied and can no longer reproduce the failure. |
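As a side note, a quick way to check whether another debug log shows the same signature is to search for the messages quoted above; this is only a sketch and assumes the debug buffer has already been dumped to a file, e.g. with lctl dk:

    # Assumes the debug log was dumped with: lctl dk > /tmp/lustre-debug.log
    grep -E "does not need LOVEA|from RAID0 to EMPTY|ll_file_io_generic.*: -9 :" /tmp/lustre-debug.log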
| Comment by nasf (Inactive) [ 22/Apr/14 ] |
|
It is another failure instance of |