[LU-4926] sanity-scrub: Failed to copy files to mds2, Bad file descriptor Created: 17/Apr/14  Updated: 04/Jun/14  Resolved: 22/Apr/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: nasf (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Attachments: Text File sanity-scrub_1b.log    
Severity: 3
Rank (Obsolete): 13616

 Description   

This issue was created by maloo for nasf <fan.yong@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/cb28be52-c631-11e3-874b-52540035b04c.

The sub-test test_5 failed with the following error:

Failed to copy files to mds2

Info required for matching: sanity-scrub 5

In a newly formatted DNE environment, copying a file from the local system to MDS2 fails with "Bad file descriptor".



 Comments   
Comment by nasf (Inactive) [ 18/Apr/14 ]

The attachment is the log from another failure instance. It seems to be related to CLIO.

Jinshan, could you please take a look? Thanks.

Comment by Andreas Dilger [ 21/Apr/14 ]

Nasf, are you able to reproduce this with manual testing? What steps do you take? It seems serious that a file cannot be copied to a new filesystem. Is there something that the test scripts are doing that might be hiding this bug from us?

Comment by nasf (Inactive) [ 21/Apr/14 ]

I can reproduce it by running sanity-scrub.sh with DNE enabled, but I have not been able to reproduce it manually yet. I will investigate further.

Comment by nasf (Inactive) [ 22/Apr/14 ]

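The copy fails because mdc_intent_lock() does not fetch the LOV EA for the newly created file. Its caller then calls ll_prep_inode() to prepare the local object with an empty stripe, and subsequent write operations against the empty LOV EA are rejected as invalid. The debug log below shows the sequence: the MDT reports that the RPC does not need the LOV EA, the layout changes from RAID0 to EMPTY, and ll_file_write() then fails with rc = -9 (-EBADF) via lov_io_init_empty().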

00000004:00000001:0.0:1395927909.641540:0:27857:0:(mdt_handler.c:750:mdt_getattr_internal()) Process entered
00000004:00000040:0.0:1395927909.641540:0:27857:0:(mdt_handler.c:780:mdt_getattr_internal()) lustre-MDT0001: RPC from acaf5fbc-81d0-3a35-8cbb-3c2d460ad809: does not need LOVEA.
...
00000080:00000001:1.0:1395927909.641917:0:27983:0:(dcache.c:317:ll_revalidate_it_finish()) Process entered
00000080:00000001:1.0:1395927909.641918:0:27983:0:(llite_lib.c:2411:ll_prep_inode()) Process entered
...
00020000:00000002:1.0:1395927909.641948:0:27983:0:(lov_object.c:722:lov_layout_change()) [0x2c0002340:0x4f:0x0] from RAID0 to EMPTY
...
00000080:00000001:1.0:1395927909.642709:0:27983:0:(file.c:1355:ll_file_write()) Process entered
00020000:00000001:1.0:1395927909.642730:0:27983:0:(lov_io.c:909:lov_io_init_empty()) Process entered
00020000:00000001:1.0:1395927909.642730:0:27983:0:(lov_io.c:938:lov_io_init_empty()) Process leaving (rc=1 : 1 : 1)
00000020:00000001:1.0:1395927909.642731:0:27983:0:(cl_io.c:179:cl_io_init0()) Process leaving (rc=1 : 1 : 1)
00000020:00000001:1.0:1395927909.642731:0:27983:0:(cl_io.c:240:cl_io_rw_init()) Process leaving (rc=1 : 1 : 1)
00000080:00000001:1.0:1395927909.642732:0:27983:0:(file.c:1189:ll_file_io_generic()) Process leaving via out (rc=18446744073709551607 : -9 : 0xfffffffffffffff7)
...
int lov_io_init_empty(const struct lu_env *env, struct cl_object *obj,
                      struct cl_io *io)
{
        struct lov_object *lov = cl2lov(obj);
        struct lov_io *lio = lov_env_io(env);
        int result;
        ENTRY;

        lio->lis_object = lov;
        switch (io->ci_type) {
        default:
                LBUG();
        case CIT_MISC:
        case CIT_READ:
                result = 0;
                break;
        case CIT_FSYNC:
        case CIT_SETATTR:
                result = +1;
                break;
        case CIT_WRITE:
                result = -EBADF;
                break;
        case CIT_FAULT:
                result = -EFAULT;
                CERROR("Page fault on a file without stripes: "DFID"\n",
                       PFID(lu_object_fid(&obj->co_lu)));
                break;
        }
        if (result == 0) {
                cl_io_slice_add(io, &lio->lis_cl, obj, &lov_empty_io_ops);
                cfs_atomic_inc(&lov->lo_active_ios);
        }

        io->ci_result = result < 0 ? result : 0;
        RETURN(result != 0);
}
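
To make the return path easier to follow, here is a minimal standalone sketch of the switch above, not the Lustre code itself. The enum and the names model_io_init_empty() and model_log_rc_to_signed() are hypothetical stand-ins introduced only for illustration. It shows why a write on an EMPTY layout makes the function return 1 with ci_result set to -EBADF, and why the log prints 18446744073709551607 for rc = -9.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the Lustre cl_io types; not the real definitions. */
enum model_io_type {
        MODEL_CIT_READ,
        MODEL_CIT_WRITE,
        MODEL_CIT_SETATTR,
        MODEL_CIT_FAULT,
        MODEL_CIT_FSYNC,
        MODEL_CIT_MISC
};

/*
 * Simplified model of the switch in lov_io_init_empty().
 * Stores the modelled io->ci_result in *ci_result and returns
 * (result != 0), mirroring RETURN(result != 0) in the original.
 */
static int model_io_init_empty(enum model_io_type type, int *ci_result)
{
        int result;

        switch (type) {
        case MODEL_CIT_MISC:
        case MODEL_CIT_READ:
                result = 0;          /* I/O proceeds on the empty layout */
                break;
        case MODEL_CIT_FSYNC:
        case MODEL_CIT_SETATTR:
                result = 1;          /* nothing to do, but not an error */
                break;
        case MODEL_CIT_WRITE:
                result = -EBADF;     /* the failure seen in test_5 */
                break;
        case MODEL_CIT_FAULT:
                result = -EFAULT;
                break;
        default:
                assert(0 && "unexpected io type"); /* stands in for LBUG() */
                result = -EINVAL;    /* assumption: only reached with NDEBUG */
                break;
        }

        *ci_result = result < 0 ? result : 0;
        return result != 0;
}

/* Reinterpret the unsigned 64-bit rc printed in the debug log as signed. */
static int64_t model_log_rc_to_signed(uint64_t raw)
{
        /* Two's-complement decode without relying on a narrowing cast. */
        return raw > INT64_MAX ? -(int64_t)(~raw) - 1 : (int64_t)raw;
}
```

With this model, model_io_init_empty(MODEL_CIT_WRITE, &rc) returns 1 and leaves rc equal to -EBADF, which matches the "rc=1" lines from lov_io_init_empty(), cl_io_init0() and cl_io_rw_init() followed by -9 from ll_file_io_generic() in the log. Passing the logged value 18446744073709551607 to model_log_rc_to_signed() gives -9.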

In fact, this issue has already been resolved by the patch http://review.whamcloud.com/#/c/9862/. I have run sanity-scrub repeatedly with that patch applied and can no longer reproduce the issue.

Comment by nasf (Inactive) [ 22/Apr/14 ]

It is another failure instance of LU-4847.

Generated at Sat Feb 10 01:47:02 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.