[LU-15124] sanity-flr is unable to umount client Created: 18/Oct/21  Updated: 15/Apr/22  Resolved: 15/Apr/22

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Upstream
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Alex Zhuravlev Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None

Severity: 3

 Description   

sanity-flr subtest 208a fails to umount the client:

 sanity-flr test_208a: @@@@@@ FAIL: umount failed 
  Trace dump:
  = ./../tests/test-framework.sh:6329:error()
  = sanity-flr.sh:3717:check_ost_used()
  = sanity-flr.sh:3775:test_208a()
  = ./../tests/test-framework.sh:6633:run_one()
  = ./../tests/test-framework.sh:6680:run_one_logged()
  = ./../tests/test-framework.sh:6521:run_test()

this is due to a process left running by subtest 70:

  32663 ?        S      0:07 /mnt/build/lustre/tests/../utils/lfs mirror split -d --mirror-id=1 /mnt/lustre/d70.sanity-flr/f70.sanity-flr
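
A leftover like the one above can be spotted before the umount is attempted. The following is a minimal sketch, not part of sanity-flr: the helper name procs_mentioning and the tail stand-in for the stray lfs process are hypothetical, and it relies on Linux /proc. Matching the command line is only a heuristic; a process can keep a mount busy through its working directory or an open file without naming the path, which is what tools such as fuser or lsof check.

```shell
#!/bin/bash
# List PIDs whose command line mentions the given path (Linux /proc only).
# procs_mentioning is a hypothetical helper, not a test-framework function.
procs_mentioning() {
	local path=$1 f pid
	for f in /proc/[0-9]*/cmdline; do
		pid=${f#/proc/}
		pid=${pid%/cmdline}
		# skip this script and the subshell running the function
		case $pid in "$$"|"$BASHPID") continue ;; esac
		# cmdline is NUL-separated; map NULs to spaces before matching
		if tr '\0' ' ' 2>/dev/null < "$f" | grep -qF -- "$path"; then
			echo "$pid"
		fi
	done
}

# Demo with a stand-in for the leftover "lfs mirror split":
# a background tail whose command line contains a temporary path.
demo_file=$(mktemp)
tail -f "$demo_file" > /dev/null &
bg_pid=$!
sleep 0.2    # give the child time to exec

found=$(procs_mentioning "$demo_file")
echo "processes mentioning $demo_file: $found"

kill "$bg_pid"
wait "$bg_pid" 2>/dev/null
rm -f "$demo_file"
```

In the situation reported here, the argument would be the file under /mnt/lustre and the reported PID would be the stray lfs.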


 Comments   
Comment by Gerrit Updater [ 18/Oct/21 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45280
Subject: LU-15124 tests: sanity-flr/70 to wait for completion
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0f1579e30199b1afe2320d28644d313de7fcfb77
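
The patch subject asks for the subtest to wait for completion. The sketch below shows one way background loops can be shut down so that nothing is still running afterwards; it is not the content of the patch. The do_work function is a hypothetical stand-in for the real lfs commands, and the stop-flag file is an assumption made for illustration.

```shell
#!/bin/bash
# Hypothetical stand-in for one iteration of the real loops
# ("$LFS mirror create ..." / "$LFS mirror split ..." in the test).
do_work() { sleep 0.1; }

flag_dir=$(mktemp -d)
stop_flag="$flag_dir/stop"

# Loop until the stop flag appears. The in-flight command always finishes
# before the loop exits, so no child is left running behind it.
worker() {
	while [ ! -e "$stop_flag" ]; do
		do_work
	done
}

worker &
c_pid=$!
worker &
s_pid=$!

sleep 0.5            # let the loops race for a while

touch "$stop_flag"   # request shutdown
wait "$c_pid" "$s_pid"
wait_rc=$?

rm -rf "$flag_dir"
echo "workers exited, rc=$wait_rc"
```

Killing the loop's shell instead would not stop a command that is already running, which can leave an orphan of the kind shown in the description. Note that waiting this way blocks for as long as the in-flight command does.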

Comment by Alex Zhuravlev [ 21/Oct/21 ]

there must be something else... the test creates a tiny file in a loop:

while true; do
     rm -f $tf
     $LFS mirror create -N -E 1M -c -1 -E eof -N $tf
     echo xxxx > $tf
done &

and another process removes the replica in a loop as well:

while true; do
         $LFS mirror split -d --mirror-id=1 $tf &> /dev/null
done &

the interesting thing is how mirror splitting manages to grab that much CPU time:

  32776 ?        R      0:07 /mnt/build/lustre/tests/../utils/lfs mirror split -d --mirror-id=1 /mnt/lustre/d70.sanity-flr/f70.sanity-flr

Comment by Alex Zhuravlev [ 10/Nov/21 ]

lfs gets stuck on the client side:

[  751.563462] lfs             S    0 32785      1 0x00000000
[  751.563509] Call Trace:
[  751.563540]  ? __schedule+0x2ab/0xa80
[  751.563579]  schedule+0x2a/0x80
[  751.563619]  schedule_timeout+0x1d9/0x500
[  751.563658]  ? collect_expired_timers+0xa0/0xa0
[  751.563714]  lov_io_init_composite+0x1ab2/0x1b70 [lov]
[  751.563769]  lov_io_init+0x19a/0x320 [lov]
[  751.563841]  cl_io_init0.isra.2+0x7f/0x150 [obdclass]
[  751.563934]  cl_glimpse_size0+0x82/0x240 [lustre]
[  751.564001]  ll_getattr_dentry+0x624/0xa30 [lustre]
[  751.564058]  vfs_statx+0x74/0xb0
[  751.564100]  __se_sys_newlstat+0x26/0x40
[  751.564141]  do_syscall_64+0x4b/0x190
[  751.564182]  entry_SYSCALL_64_after_hwframe+0x6a/0xdf
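
The S in the task line above is the process state (interruptible sleep). The same state can be sampled from user space without a full task dump. This is a minimal sketch assuming Linux /proc; proc_state is a hypothetical helper and the sleep process stands in for the stuck lfs.

```shell
#!/bin/bash
# Print the state letter of a PID (R, S, D, Z, ...) from /proc/PID/status.
# proc_state is a hypothetical helper, not a test-framework function.
proc_state() {
	awk '/^State:/ { print $2 }' "/proc/$1/status"
}

sleep 30 &                # stand-in for the stuck lfs process
pid=$!
sleep 0.2                 # give the child time to exec and block

state=$(proc_state "$pid")
echo "pid $pid state: $state"

kill "$pid"
wait "$pid" 2>/dev/null
```

For a stuck process, /proc/PID/stack (readable by root) shows the kernel call chain it is blocked in.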

Comment by Alex Zhuravlev [ 15/Apr/22 ]

can't reproduce with LU-15300 applied
