[LU-3231] fld_cache_lookup() copies fld_cache_entry onto lu_seq_range Created: 25/Apr/13  Updated: 26/Mar/14  Resolved: 29/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: John Hammond Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: LB, fid

Issue Links:
Related
is related to LU-3233 tgt_cb_last_committed()) ASSERTION( c... Resolved
Severity: 3
Rank (Obsolete): 7896

 Description   

fld_cache_lookup() keeps a prev pointer to a struct fld_cache_entry, but when it uses this pointer to return the previous range it copies the start of the entry onto the range argument instead of the entry's fce_range member.

int fld_cache_lookup(struct fld_cache *cache,
                     const seqno_t seq, struct lu_seq_range *range)
{
        struct fld_cache_entry *flde;
        struct fld_cache_entry *prev = NULL;
        cfs_list_t *head;
        ENTRY;

        ...
        cfs_list_for_each_entry(flde, head, fce_list) {
                if (flde->fce_range.lsr_start > seq) {
                        if (prev != NULL)
                                /* BUG: copies the start of the entry, not prev->fce_range */
                                memcpy(range, prev, sizeof(*range));
                        break;
                }
        ...
}
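
For reference, a minimal sketch of the kind of change the description implies (returning prev->fce_range rather than the raw entry). The actual fix is the patch at http://review.whamcloud.com/6171; this excerpt is illustrative only:

        cfs_list_for_each_entry(flde, head, fce_list) {
                if (flde->fce_range.lsr_start > seq) {
                        if (prev != NULL)
                                /* copy the cached range, not the entry itself */
                                *range = prev->fce_range;
                        break;
                }
        ...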


 Comments   
Comment by John Hammond [ 25/Apr/13 ]

Please see http://review.whamcloud.com/6171.

Comment by Andreas Dilger [ 26/Apr/13 ]

Di, John, how serious is this bug? Is it crashing, corrupting data, etc.? Does it affect normal usage, or only DNE? Please describe the severity of the problem, and mark it a blocker if so.

Comment by John Hammond [ 26/Apr/13 ]

Sorry, I spent too much time trying to understand what was going on.

DNE blocker. I cannot see any evidence that it affects non-DNE setups.

On the current master (2.3.64-43-g507dc87) I'm using MDSCOUNT=2 MOUNT_2=y llmount.sh to set up. About every 1 in 5 times this results in an unusable client mount:

# DURATION=10 sh ./lustre/tests/racer.sh 
Logging to shared log directory: /tmp/test_logs/1366998603
racer: /root/lustre-release/lustre/tests/racer/racer.sh with 2 MDTs
excepting tests: 
m: Checking config lustre mounted on /mnt/lustre2
Checking servers environments
Checking clients m environments
m: Checking config lustre mounted on /mnt/lustre
Checking servers environments
Checking clients m environments
Using TIMEOUT=20
disable quota as required
RACERDIRS=/mnt/lustre /mnt/lustre2

== racer test 1: racer on clients: m DURATION=10 == 12:50:04 (1366998604)
racers pids: 18395 18396 18397 18398
./dir_remote.sh: line 13: /mnt/lustre/racer/5/3: Input/output error
./dir_remote.sh: line 13: /mnt/lustre/racer/0/2: Input/output error
./dir_remote.sh: line 13: /mnt/lustre/racer/5/18: Input/output error
./dir_remote.sh: line 13: /mnt/lustre2/racer/16/5: Input/output error
./dir_remote.sh: line 13: /mnt/lustre2/racer1/7/3: No such file or directory
./dir_remote.sh: line 13: /mnt/lustre2/racer/15/19: Input/output error
./dir_remote.sh: line 13: /mnt/lustre2/racer/14/3: Input/output error
./dir_remote.sh: line 13: /mnt/lustre2/racer1/2/9: No such file or directory
./dir_remote.sh: line 13: /mnt/lustre/racer1/4/7: No such file or directory
./dir_remote.sh: line 13: /mnt/lustre/racer/12/9: Input/output error
./dir_remote.sh: line 13: /mnt/lustre2/racer/10/9: Input/output error
./dir_remote.sh: line 13: /mnt/lustre2/racer/3/16: Input/output error
./dir_remote.sh: line 13: /mnt/lustre2/racer1/3/8: Not a directory
./dir_remote.sh: line 13: /mnt/lustre2/racer/3/16: Input/output error
./dir_remote.sh: line 13: /mnt/lustre2/racer1/3/19: Not a directory
./dir_remote.sh: line 13: /mnt/lustre/racer1/2/5: Not a directory
...

Note that the requested flag "ffff8801" below is really part of a list_head pointer from the cache entry (a sketch of the struct layout follows the console log):

Lustre: DEBUG MARKER: == racer test 1: racer on clients: m DURATION=10 == 12:50:04 (1366998604)
Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000280000400-0x00000002c0000400):0:mdt
Lustre: cli-ctl-lustre-MDT0000-osp-MDT0001: Allocated super-sequence [0x00000002c0000400-0x0000000300000400):1:mdt]
LustreError: 15736:0:(fld_handler.c:158:fld_server_lookup()) srv-lustre-MDT0000: FLD cache range [0x0000000280000400-0x00000002c0000400):0:mdt does not match requested flag ffff8801: rc = -5
LustreError: 18811:0:(lmv_fld.c:78:lmv_fld_lookup()) Error while looking for mds number. Seq 0x280000400, err = -5
LustreError: 15736:0:(fld_handler.c:158:fld_server_lookup()) srv-lustre-MDT0000: FLD cache range [0x0000000280000400-0x00000002c0000400):0:mdt does not match requested flag ffff8801: rc = -5
LustreError: 18788:0:(lmv_fld.c:78:lmv_fld_lookup()) Error while looking for mds number. Seq 0x280000400, err = -5
LustreError: 15715:0:(mdt_reint.c:332:mdt_md_create()) lustre-MDT0001: remote dir is only permitted on MDT0 or set_param mdt.*.enable_remote_dir=1
LustreError: 15714:0:(mdt_reint.c:332:mdt_md_create()) lustre-MDT0001: remote dir is only permitted on MDT0 or set_param mdt.*.enable_remote_dir=1
LustreError: 19275:0:(lmv_fld.c:78:lmv_fld_lookup()) Error while looking for mds number. Seq 0x280000400, err = -5
LustreError: 19275:0:(lmv_fld.c:78:lmv_fld_lookup()) Skipped 284 previous similar messages
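
For context, a simplified sketch of the structures involved (abridged from the 2.4-era fld_internal.h and lustre_idl.h; field order shown for illustration only). Because the list heads precede fce_range in struct fld_cache_entry, memcpy(range, prev, sizeof(*range)) copies list pointers into the caller's lu_seq_range, and part of one of those kernel pointers is what gets reported as the "requested flag":

struct lu_seq_range {                   /* 24 bytes */
        __u64 lsr_start;
        __u64 lsr_end;
        __u32 lsr_index;
        __u32 lsr_flags;
};

struct fld_cache_entry {
        cfs_list_t          fce_lru;    /* two pointers */
        cfs_list_t          fce_list;   /* two pointers */
        struct lu_seq_range fce_range;  /* the data the caller actually wants */
};

/*
 * memcpy(range, prev, sizeof(*range)) copies the first 24 bytes of the
 * entry: fce_lru plus the first pointer of fce_list.  The bytes landing in
 * lsr_index/lsr_flags are part of a kernel list_head pointer, hence the
 * bogus flag value ffff8801 in the console log above.
 */
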
Comment by Di Wang [ 26/Apr/13 ]

I think it will affect non-DNE setups too: once the current meta-sequence (128k seqs) is used up and a new seq is requested, this bug should be hit during the sequence range merging in FLD, IMHO. Thanks again, John.

Comment by Peter Jones [ 29/Apr/13 ]

Landed for 2.4
