[LU-15977] sanityn test_80b: migration stopped 2 Created: 28/Jun/22  Updated: 19/Dec/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Li Xi <pkuelelixi@gmail.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/06a33583-a064-42b7-8610-d0ee29e60d1e

test_80b failed with the following error:

== sanityn test 80b: Accessing directory during migration ========================================================== 07:06:08 (1656313568)
start migration thread 777660
accessing the migrating directory for 5 minutes...
touch file failed with 0
/usr/lib64/lustre/tests/sanityn.sh: line 4770: kill: (777660) - No such process
 sanityn test_80b: @@@@@@ FAIL: migration stopped 2 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6522:error()
  = /usr/lib64/lustre/tests/sanityn.sh:4770:test_80b()
  = /usr/lib64/lustre/tests/test-framework.sh:6857:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:6904:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:6745:run_test()
  = /usr/lib64/lustre/tests/sanityn.sh:4773:main()
Dumping lctl log to /autotest/autotest-2/2022-06-27/lustre-reviews_review-dne-part-5_88105_1_7_01a12c8c-47bf-4fb1-9cfa-e97b369d8874//sanityn.test_80b.*.1656313569.log
CMD: trevis-103vm1.trevis.whamcloud.com,trevis-103vm2,trevis-103vm3,trevis-103vm4,trevis-103vm5 /usr/sbin/lctl dk > /autotest/autotest-2/2022-06-27/lustre-reviews_review-dne-part-5_88105_1_7_01a12c8c-47bf-4fb1-9cfa-e97b369d8874//sanityn.test_80b.debug_log.\$(hostname -s).1656313569.log;
		dmesg > /autotest/autotest-2/2022-06-27/lustre-reviews_review-dne-part-5_88105_1_7_01a12c8c-47bf-4fb1-9cfa-e97b369d8874//sanityn.test_80b.dmesg.\$(hostname -s).1656313569.log
/usr/lib64/lustre/tests/sanityn.sh: line 4656: kill: (777660) - No such process

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanityn test_80b - migration stopped 2



 Comments   
Comment by Alex Zhuravlev [ 21/Apr/23 ]

The test itself is not quite correct AFAICS: it is supposed to run for 5 minutes, but all runs in Maloo stop after 3-4 seconds while still reporting success.
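
One way to catch such an early exit would be to compare the elapsed time against the intended duration. The sketch below is only an illustration: ran_long_enough is a hypothetical helper, not part of sanityn.sh or test-framework.sh, and the 300-second threshold is taken from the "5 minutes" in the test description.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if at least $2 seconds have passed
# since the $SECONDS value recorded in $1.
ran_long_enough() {
        local start=$1
        local min_duration=$2
        local elapsed=$((SECONDS - start))

        (( elapsed >= min_duration ))
}

start_time=$SECONDS
# ... the access loop would run here; in this sketch it returns at once ...

early_rc=0
ran_long_enough "$start_time" 300 || early_rc=$?
echo "duration check rc=$early_rc"
```

A run that returns after a few seconds fails this check instead of being reported as a pass.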

Comment by Etienne Aujames [ 01/Dec/23 ]

The test is not working on b2_15:

                                                             
#migrate the directories among MDTs
(
        while true; do
                mdt_idx=$((RANDOM % MDSCOUNT))
                $LFS migrate -m $mdt_idx $migrate_dir1 &>/dev/null ||
                        rc=$?
                (( $rc != 0 && $rc != 16 )) || break
        done
) &
migrate_pid=$!

echo "start migration thread $migrate_pid"
#Access the files at the same time
start_time=$SECONDS
echo "accessing the migrating directory for 5 minutes..."

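The migration process always exits on the first iteration, although the test is supposed to run for 5 minutes. The loop condition is inverted: it breaks when rc is 0 or 16, which are exactly the cases in which the loop should continue.
This should be: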

                                                             
#migrate the directories among MDTs
(
        while true; do
                mdt_idx=$((RANDOM % MDSCOUNT))
                $LFS migrate -m $mdt_idx $migrate_dir1 &>/dev/null ||
                        rc=$?
                (( $rc == 0 || $rc == 16 )) || break
        done
) &
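
The corrected condition can be exercised without a Lustre setup. The following is a minimal sketch, not the actual test code: fake_migrate is a hypothetical stand-in for "$LFS migrate" that replays a fixed list of exit codes, and 16 is assumed to be EBUSY as on Linux. It also resets rc on every iteration, on the assumption that rc might otherwise be unset or carry a stale value.

```shell
#!/bin/bash
# Hypothetical stand-in for "$LFS migrate -m <idx> <dir>": replays a fixed
# list of exit codes so the loop logic can be checked without Lustre.
codes=(0 16 0 1 0)
calls=0
fake_migrate() {
        local code=${codes[$calls]:-0}

        calls=$((calls + 1))
        return "$code"
}

migrate_loop() {
        local rc

        while true; do
                rc=0    # reset so an earlier failure code is not reused
                fake_migrate || rc=$?
                # 0 (success) and 16 (assumed EBUSY) keep the loop going;
                # any other code stops the migration.
                (( rc == 0 || rc == 16 )) || break
        done
        return "$rc"
}

loop_rc=0
migrate_loop || loop_rc=$?
echo "stopped after $calls calls with rc=$loop_rc"
```

With the codes above, the loop survives the success and the 16, and stops on the fourth call, which returns 1.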

Also, the test always returns success, because a failed access only breaks out of the loop:

echo "aaaaa" > $migrate_dir2/file4 > /dev/null || {
        echo "access file4 fails"
        break
}

This should be:

echo "aaaaa" > $migrate_dir2/file4 > /dev/null ||
        error "access file4 fails"
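
The difference between the two forms can be shown in isolation. This is a sketch under stated assumptions: the error function below is a hypothetical stand-in that prints a message and exits non-zero (the real one lives in test-framework.sh), and "false" stands in for the failing file access.

```shell
#!/bin/bash
# Hypothetical stand-in for the framework's error(): report, then exit.
error() {
        echo "FAIL: $*" >&2
        exit 1
}

# break form: the failure only leaves the loop, and the function then
# finishes normally.
access_with_break() {
        while true; do
                false || {
                        echo "access fails"
                        break
                }
        done
        return 0
}

# error form: the failure terminates the (sub)shell with a non-zero status.
access_with_error() {
        while true; do
                false || error "access fails"
        done
        return 0
}

rc_break=0
( access_with_break ) >/dev/null || rc_break=$?
rc_error=0
( access_with_error ) 2>/dev/null || rc_error=$?
echo "break variant rc=$rc_break, error variant rc=$rc_error"
```

The break variant reports 0 even though every access failed, while the error variant reports 1.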

The test was completely rewritten on master by https://review.whamcloud.com/40891 ("LU-15529 mdt: optimize dir migration locking").
So maybe we can backport that version of the test, but I am not sure whether it is compatible with 2.15.4.
