[LU-1096] Test failure on test suite lfsck, Segmentation fault Created: 12/Feb/12  Updated: 30/May/12  Resolved: 30/May/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 6460

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/1bfcbe72-553d-11e1-9aa8-5254004bbbd3.



 Comments   
Comment by Peter Jones [ 16/Feb/12 ]

Andreas will look into this one

Comment by Peter Jones [ 19/Mar/12 ]

Niu

Andreas is out on vacation for the next two weeks so could you please make an initial assessment of this issue?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 19/Mar/12 ]
lfsck -c -l --mdsdb /scratch/mdsdb --ostdb /scratch/ostdb-0 /scratch/ostdb-1 /scratch/ostdb-2 /scratch/ostdb-3 /scratch/ostdb-4 /scratch/ostdb-5 /mnt/lustre
lfsck 1.41.90.wc4 (01-Sep-2011)
/scratch/mdsdb:mdshdr
: Invalid argument
/usr/lib64/lustre/tests/test-framework.sh: line 2679: 19558 Segmentation fault      lfsck -c -l --mdsdb /scratch/mdsdb --ostdb /scratch/ostdb-0 /scratch/ostdb-1 /scratch/ostdb-2 /scratch/ostdb-3 /scratch/ostdb-4 /scratch/ostdb-5 /mnt/lustre

It looks like opening the mdsdb failed, and that led to the segmentation fault.

+       if ((rc = dbp->open(dbp, NULL, fname, dbname, DB_HASH,
+                           DB_CREATE | DB_THREAD, 0664)) != 0)
+       {
+               dbp->err(dbp, rc, "%s:%s\n", fname, dbname);
+               dbp->close(dbp, 0);
+               return (EIO);

I suspect this failure is similar to LU-367: the db4 version on the server that generated the MDSDB differs from the db4 version on the client that opens it.

Maloo shows that the test server and client run different systems: the server is CentOS release 6.2 but the client is CentOS release 5.7, so I think the db4 versions are different.

I think we may need to improve the test script to make sure the same db4 version is used to generate and check the db file, and lfsck also needs to be improved to handle a db4 version mismatch gracefully. But I don't think this should be a blocker.

Comment by Peter Jones [ 09/Apr/12 ]

Yangsheng

Could you please look into how to address this problem? Andreas had some ideas about how to deal with this situation more gracefully.

Thanks

Peter

Comment by Yang Sheng [ 12/Apr/12 ]

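This issue isn't related to the db4 library. The log_write(...) call after a failed lfsck_opendb() is simply wrong: its format string has three conversion specifiers (%s, %s, %d) but only two arguments are passed, so rc is consumed as a string pointer: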

        rc = lfsck_opendb(mds_file, MDS_HDR, &mds_hdrdb, 0, 0, 0);
        if (rc != 0) {
      >>>>>>>   log_write("%s: error opening mds_hdr in %s: rc %d\n",
                          mds_file, rc);
                return(-EINVAL);
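
To illustrate the mismatch, here is a minimal sketch of a call with one argument per conversion specifier. format_log() and report_open_failure() are hypothetical stand-ins, not lfsck functions, and the assumption that the first %s was meant to carry the program name is a guess; the patch that actually landed may differ. The format attribute is a GCC/Clang extension that lets the compiler flag this kind of mismatch under -Wformat.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for a log_write()-style function. It formats into a
 * caller-supplied buffer so the result can be inspected. The attribute asks
 * GCC/Clang to check the arguments against the format string. */
static int format_log(char *buf, size_t len, const char *fmt, ...)
        __attribute__((format(printf, 3, 4)));

static int format_log(char *buf, size_t len, const char *fmt, ...)
{
        va_list ap;
        int n;

        va_start(ap, fmt);
        n = vsnprintf(buf, len, fmt, ap);
        va_end(ap);
        return n;
}

/* Three specifiers, three arguments. progname is an assumed extra argument
 * for the first %s; it is not taken from the lfsck source. */
static int report_open_failure(char *buf, size_t len, const char *progname,
                               const char *mds_file, int rc)
{
        return format_log(buf, len, "%s: error opening mds_hdr in %s: rc %d\n",
                          progname, mds_file, rc);
}
```

With the attribute in place, a call such as format_log(buf, len, "%s: ... %s: rc %d\n", mds_file, rc) should produce a compile-time warning instead of a crash at run time.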

Is there any guidance on submitting a patch to e2fsprogs?

Comment by Yang Sheng [ 12/Apr/12 ]

For the record:

int log_close(int status)
{
        time_t tm;

        if (logfile == NULL)
                return(0);

        time(&tm);
        if (status < 0) {
                fprintf(logfile, "ERROR: lfsck aborted\n");
        } else {
                fprintf(logfile, "lfsck run completed:  %s\n",ctime(&tm));
        }
        fprintf(logfile, "===============================================\n\n");

        fclose(logfile);
>>>>>>>>logfile = NULL;
        return(0);
}

Otherwise, a second call to log_close() in some failure cases would fclose() the same stream again, which may cause a double free.
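
The effect of resetting the pointer can be sketched in isolation. demo_log_open() and demo_log_close() below are hypothetical helpers written for this illustration only, with tmpfile() standing in for however lfsck really opens its log.

```c
#include <stdio.h>

/* Module-level log handle, mirroring the logfile variable in log_close(). */
static FILE *logfile;

/* Hypothetical helper: open a scratch log so there is something to close. */
static int demo_log_open(void)
{
        logfile = tmpfile();
        return logfile == NULL ? -1 : 0;
}

/* Close the log once. Because the pointer is reset to NULL, any later call
 * returns at the NULL check instead of calling fclose() on a stale handle. */
static int demo_log_close(void)
{
        if (logfile == NULL)
                return 0;

        fclose(logfile);
        logfile = NULL;
        return 0;
}
```

Calling demo_log_close() twice in a row is then harmless, which is the property an error path needs when it cannot know whether the log was already closed.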

Comment by Andreas Dilger [ 13/Apr/12 ]

YS, you can submit patches to Gerrit with the tools/e2fsprogs repo, on the master-lustre branch. I'm just in the middle of rebasing this tree, so any patch would need to be rebased again before it could land.

Comment by Yang Sheng [ 30/May/12 ]

Patch landed. Closing the bug.
