[LU-6167] endless loop in lustre_rsync Created: 28/Jan/15  Updated: 08/Feb/15  Resolved: 08/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Minor
Reporter: wu libin (Inactive) Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: patch
Environment:

CentOS


Attachments: File lustre-rsync-bug.sh    
Severity: 3
Rank (Obsolete): 17260

 Description   

When lustre_rsync is tested in combination with racer, it behaves strangely.
1. The following is the test case I used:

# Test 13 - lustre_rsync, use racer test suite
test_13() {
    init_src
    init_changelog

    local rrc=0
    local rc=0
    local clients=$CLIENTS
    local RDIRS
    local i

    # 1. init racer directories
    for d in ${RACERDIRS}; do
        is_mounted $d || continue

        RDIRS="$RDIRS $d/racer"
        mkdir -p $d/racer

        # lfs setstripe $d/racer -c -1
        if [ $MDSCOUNT -ge 2 ]; then
            for i in $(seq $((MDSCOUNT - 1))); do
                RDIRS="$RDIRS $d/racer$i"
                if [ ! -e $d/racer$i ]; then
                    $LFS mkdir -i $i $d/racer$i ||
                        error "lfs mkdir $i failed"
                fi
            done
        fi
    done

    # 2. racer start
    local rpids=""
    for rdir in $RDIRS; do
        do_nodes $clients "DURATION=$DURATION MDSCOUNT=$MDSCOUNT \
    $racer $rdir $NUM_RACER_THREADS" &
        pid=$!
        rpids="$rpids $pid"
    done
    for pid in $rpids; do
        wait $pid
        rc=$?
        echo "pid=$pid rc=$rc"
        if [ $rc != 0 ]; then
            rrc=$((rrc + 1))
        fi
    done

    # 8. Replicate the changes to $TGT and TGT2
    $LRSYNC -s $DIR -t $TGT -t $TGT2 -m $MDT0 -u $CL_USER -l $LREPL_LOG \
        -D $LRSYNC_LOG $EXTRA_FLAGS

    # 9. check difference
    check_diff $DIR $TGT
    check_diff $DIR $TGT2
    echo "check difference on target dir"
    sleep 120
    fini_changelog
    cleanup_src_tgt
    return 0
}
run_test 13 "lustre_rsync, use racer test suite"

This causes lustre_rsync to enter an endless loop and never exit.
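
Because the replication step never returns, the test hangs and never reports a failure. A minimal sketch of one way to bound the run, assuming GNU coreutils `timeout` is available on the client: the helper name `run_bounded` and the 600-second limit in the usage comment are hypothetical and not part of the Lustre test framework. `timeout` exits with status 124 when the limit expires, which the helper reports and passes through.

```shell
#!/bin/bash
# Hypothetical helper (assumption, not from the Lustre test framework):
# run a command under a time limit so that a hang becomes a test failure.
run_bounded() {
    local limit=$1      # time limit in seconds
    shift
    local rc=0
    # GNU coreutils timeout sends SIGTERM after $limit seconds
    # and then exits with status 124.
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}

# Possible use in the test case (variables as in the script above;
# the 600-second limit is an arbitrary choice):
# run_bounded 600 $LRSYNC -s $DIR -t $TGT -t $TGT2 -m $MDT0 -u $CL_USER \
#     -l $LREPL_LOG -D $LRSYNC_LOG $EXTRA_FLAGS ||
#     error "lustre_rsync failed or hung"
```

A wrapper like this only turns the hang into a visible failure; it does not address the loop itself.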


 Comments   
Comment by wu libin (Inactive) [ 28/Jan/15 ]

The script I used for testing.

Comment by wu libin (Inactive) [ 28/Jan/15 ]

lustre_rsync also dumps core; the stack trace is:
bt
#0 0x0000003566232925 in raise () from /lib64/libc.so.6
#1 0x0000003566234105 in abort () from /lib64/libc.so.6
#2 0x0000003566270837 in __libc_message () from /lib64/libc.so.6
#3 0x0000003566276166 in malloc_printerr () from /lib64/libc.so.6
#4 0x0000003566278f81 in _int_free () from /lib64/libc.so.6
#5 0x0000000000404375 in lr_cascade_move (fid=0x250a630 "[0x200000401:0x37a:0x0]", dest=0x2511d20 "/home/target/racer/11/3/11", info=0x24e0340) at lustre_rsync.c:682
#6 0x000000000040435a in lr_cascade_move (fid=0x250d960 "[0x200000400:0x366:0x0]", dest=0x2512e30 "/home/target/racer/11/3", info=0x24e0340) at lustre_rsync.c:677
#7 0x000000000040435a in lr_cascade_move (fid=0x24e0454 "[0x200000400:0x37a:0x0]", dest=0x24e1755 "/home/target/racer/11", info=0x24e0340) at lustre_rsync.c:677
#8 0x0000000000405369 in lr_move (info=0x24e0340) at lustre_rsync.c:964
#9 0x0000000000406eb8 in lr_replicate () at lustre_rsync.c:1552
#10 0x000000000040751b in main (argc=18, argv=<value optimized out>) at lustre_rsync.c:1776

The attached file is the test script I used.

Comment by Gerrit Updater [ 28/Jan/15 ]

Wu Libin (gnlwlb@gmail.com) uploaded a new patch: http://review.whamcloud.com/13545
Subject: LU-6167 utils: fix bugs in lustre_sync
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bb7871a24da289df3b0755508c3f307b21fdbdb0

Comment by wu libin (Inactive) [ 28/Jan/15 ]

I think this patch solves the problem, but I am not sure of the root cause.
http://review.whamcloud.com/#/c/13545/

Comment by Peter Jones [ 31/Jan/15 ]

Yang Sheng

Could you please take care of this patch?

Thanks

Peter

Comment by Gerrit Updater [ 08/Feb/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13545/
Subject: LU-6167 utils: fix bugs in lustre_sync
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: dcf2a82d148797b4ac204a65ec795cde141e1d3b

Comment by Peter Jones [ 08/Feb/15 ]

Landed for 2.7

Generated at Sat Feb 10 01:57:51 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.