[LU-17251] parallel-scale test_rr_alloc: max/min OST objects (2800 : 923) too different Created: 01/Nov/23  Updated: 20/Dec/23

Status: In Progress
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Alex Deiter Assignee: Alex Deiter
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14377 parallel-scale test rr_alloc fails wi... Resolved
is related to LU-13941 parallel-scale/run_rr_alloc: Restrict... Resolved
Severity: 3

 Description   

parallel-scale test_rr_alloc: max/min OST objects (2800 : 923) too different

RHEL 8.8 x86_64 master/2.15.58.130

Failed session: https://testing.whamcloud.com/test_sessions/cde52bbc-3bcd-40b6-b9ad-fa35f8bc4deb

CMD: onyx-82vm9 /usr/sbin/lctl set_param -n 			lod.lustre-MDT*.qos_threshold_rr=100 			osp.lustre-OST*-osc-MDT*.create_count=3052
CMD: onyx-82vm10 /usr/sbin/lctl set_param -n 			lod.lustre-MDT*.qos_threshold_rr=100 			osp.lustre-OST*-osc-MDT*.create_count=3052
CMD: onyx-82vm9 /usr/sbin/lctl set_param -n 			lod.lustre-MDT*.qos_threshold_rr=100 			osp.lustre-OST*-osc-MDT*.create_count=3052
CMD: onyx-82vm10 /usr/sbin/lctl set_param -n 			lod.lustre-MDT*.qos_threshold_rr=100 			osp.lustre-OST*-osc-MDT*.create_count=3052
CMD: onyx-82vm9 /usr/sbin/lctl get_param -n debug
CMD: onyx-82vm10,onyx-82vm1.onyx.whamcloud.com,onyx-82vm2,onyx-82vm5,onyx-82vm9 /usr/sbin/lctl set_param -n debug=0
CMD: onyx-82vm1.onyx.whamcloud.com,onyx-82vm2 /usr/sbin/lctl set_param debug=\"super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck\"
CMD: onyx-82vm10,onyx-82vm5,onyx-82vm9 /usr/sbin/lctl set_param debug=\"super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck\"
CMD: onyx-82vm9 /usr/sbin/lctl get_param -n debug
CMD: onyx-82vm10,onyx-82vm1.onyx.whamcloud.com,onyx-82vm2,onyx-82vm5,onyx-82vm9 /usr/sbin/lctl set_param -n debug=0
 - unlinked 0 (time 1698827219 ; total 0 ; last 0)
total: 1032 unlinks in 1 seconds: 1032.000000 unlinks/second
CMD: onyx-82vm1.onyx.whamcloud.com,onyx-82vm2 /usr/sbin/lctl set_param debug=\"super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck\"
CMD: onyx-82vm10,onyx-82vm5,onyx-82vm9 /usr/sbin/lctl set_param debug=\"super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck\"
CMD: onyx-82vm9 lctl get_param -n osp.lustre-OST0000-osc-MDT0000.prealloc_last_id
CMD: onyx-82vm9 lctl get_param -n osp.lustre-OST0000-osc-MDT0000.prealloc_next_id
Warning: test may fail from too few objs on OST0
CMD: onyx-82vm9 lctl get_param -n osp.lustre-OST0001-osc-MDT0000.prealloc_last_id
CMD: onyx-82vm9 lctl get_param -n osp.lustre-OST0001-osc-MDT0000.prealloc_next_id
Warning: test may fail from too few objs on OST1
CMD: onyx-82vm9 lctl get_param -n osp.lustre-OST0002-osc-MDT0000.prealloc_last_id
CMD: onyx-82vm9 lctl get_param -n osp.lustre-OST0002-osc-MDT0000.prealloc_next_id
CMD: onyx-82vm9 lctl get_param -n osp.lustre-OST0003-osc-MDT0000.prealloc_last_id
CMD: onyx-82vm9 lctl get_param -n osp.lustre-OST0003-osc-MDT0000.prealloc_next_id
CMD: onyx-82vm9 lctl get_param -n osp.lustre-OST0004-osc-MDT0000.prealloc_last_id
CMD: onyx-82vm9 lctl get_param -n osp.lustre-OST0004-osc-MDT0000.prealloc_next_id
Warning: test may fail from too few objs on OST4
CMD: onyx-82vm9 lctl get_param -n osp.lustre-OST0005-osc-MDT0000.prealloc_last_id
CMD: onyx-82vm9 lctl get_param -n osp.lustre-OST0005-osc-MDT0000.prealloc_next_id
Warning: test may fail from too few objs on OST5
CMD: onyx-82vm9 lctl get_param -n osp.lustre-OST0006-osc-MDT0000.prealloc_last_id
CMD: onyx-82vm9 lctl get_param -n osp.lustre-OST0006-osc-MDT0000.prealloc_next_id
Warning: test may fail from too few objs on OST6
CMD: onyx-82vm9 lctl get_param -n osp.lustre-OST0007-osc-MDT0000.prealloc_last_id
CMD: onyx-82vm9 lctl get_param -n osp.lustre-OST0007-osc-MDT0000.prealloc_next_id
Warning: test may fail from too few objs on OST7
+ chmod 0777 /mnt/lustre
drwxrwxrwx 4 root root 4096 Nov  1 08:26 /mnt/lustre
+ su mpiuser bash -c "/usr/lib64/openmpi/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh --oversubscribe -np 22 /usr/lib64/openmpi/bin/rr_alloc /tmp/rr_alloc_mntpt/lustre/drr_alloc.parallel-scale/f 555 2 "
CMD: onyx-82vm9 /usr/sbin/lctl set_param -n lod.lustre-MDT0000-mdtlov.qos_threshold_rr=17%
CMD: onyx-82vm9 /usr/sbin/lctl set_param -n osp.lustre-OST0000-osc-MDT0000.create_count=128
CMD: onyx-82vm9 /usr/sbin/lctl set_param -n osp.lustre-OST0001-osc-MDT0000.create_count=64
CMD: onyx-82vm9 /usr/sbin/lctl set_param -n osp.lustre-OST0002-osc-MDT0000.create_count=64
CMD: onyx-82vm9 /usr/sbin/lctl set_param -n osp.lustre-OST0003-osc-MDT0000.create_count=64
CMD: onyx-82vm9 /usr/sbin/lctl set_param -n osp.lustre-OST0004-osc-MDT0000.create_count=128
CMD: onyx-82vm9 /usr/sbin/lctl set_param -n osp.lustre-OST0005-osc-MDT0000.create_count=128
CMD: onyx-82vm9 /usr/sbin/lctl set_param -n osp.lustre-OST0006-osc-MDT0000.create_count=128
CMD: onyx-82vm9 /usr/sbin/lctl set_param -n osp.lustre-OST0007-osc-MDT0000.create_count=64
 parallel-scale test_rr_alloc: @@@@@@ FAIL: max/min OST objects (2230 : 1144) too different 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6727:error()
  = /usr/lib64/lustre/tests/functions.sh:1133:run_rr_alloc()
  = /usr/lib64/lustre/tests/parallel-scale.sh:163:test_rr_alloc()
  = /usr/lib64/lustre/tests/test-framework.sh:7067:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:7123:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:6953:run_test()
  = /usr/lib64/lustre/tests/parallel-scale.sh:165:main()
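
The figures in this log can be cross-checked with a little integer arithmetic. The sketch below assumes that, in the mpirun line, 22 is the process count and 555 the number of files each process creates, that each file gets one OST object, and that the 8 OSTs (OST0000 to OST0007) would share the objects evenly under round-robin allocation. The function names and the 1.5 tolerance are hypothetical and are not taken from the test script.

```shell
#!/bin/bash
# Expected objects per OST if allocation were perfectly even.
# Arguments: process count, files per process, OST count.
expected_per_ost() {
	local nprocs=$1 nfiles=$2 nosts=$3
	echo $(( nprocs * nfiles / nosts ))
}

# max/min ratio scaled by 100, so it stays in integer arithmetic.
ratio_x100() {
	local max=$1 min=$2
	echo $(( max * 100 / min ))
}

# Hypothetical balance check: succeeds if max/min <= limit_x100 / 100.
# The default limit (150, i.e. 1.5) is an illustrative assumption.
is_balanced() {
	local max=$1 min=$2 limit_x100=${3:-150}
	local ratio
	ratio=$(ratio_x100 "$max" "$min")
	if (( ratio <= limit_x100 )); then
		return 0
	fi
	return 1
}

per_ost=$(expected_per_ost 22 555 8)
echo "expected objects per OST: $per_ost"
echo "twice the expected share: $(( per_ost * 2 ))"
echo "max/min x100, FAIL line (2230 : 1144): $(ratio_x100 2230 1144)"
echo "max/min x100, summary (2800 : 923): $(ratio_x100 2800 923)"
```

Under these assumptions the even share is 1526 objects per OST, and twice that is 3052, which matches the create_count value set at the start of the log. The FAIL line corresponds to a max/min ratio of about 1.94, and the figures in the issue summary to about 3.03.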


 Comments   
Comment by Gerrit Updater [ 01/Nov/23 ]

"Alex Deiter <alex.deiter@gmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52940
Subject: LU-17251 test: check for OST precreated objects
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2c6d1107f9c05c214022743945f74f777ad2bbf5

Comment by Andreas Dilger [ 02/Nov/23 ]

I think this is the same failure as LU-14377; only the message changed, because of the debugging added for the test failures in patch https://review.whamcloud.com/48914 "LU-14377 tests: make parallel-scale/rr_alloc less strict".

Comment by Gerrit Updater [ 03/Nov/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52968
Subject: LU-17251 osp: force precreate if create_count grows
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0c4611acf569dee51e2aa3bf4ee700eeffb208f1

Comment by Andreas Dilger [ 03/Nov/23 ]

Deiter, I think my patch is complementary to yours. Yours improves the test script, and the wait loop is still needed because my patch does not wait for the precreates to finish before returning from set_param. However, the "createmany/unlinkmany" dance is no longer needed, and maybe never was, and is counter-productive in my opinion.
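
A wait loop of the kind mentioned in this comment could look roughly like the sketch below. It is not the code from either patch. It assumes the number of available precreated objects can be estimated as prealloc_last_id minus prealloc_next_id, using the two osp parameters read in the log above; the function names, the timeout, and the one-second poll interval are hypothetical.

```shell
#!/bin/bash
# Available precreated objects for one OSP device, estimated as
# prealloc_last_id - prealloc_next_id (assumed formula).
osp_prealloc_avail() {
	local osp=$1 last next
	last=$(lctl get_param -n "osp.$osp.prealloc_last_id") || return 1
	next=$(lctl get_param -n "osp.$osp.prealloc_next_id") || return 1
	echo $(( last - next ))
}

# Poll until at least $want objects are available or $timeout seconds pass.
# $reader names the function used to read the count, so it can be stubbed.
# Returns 0 on success, 1 on timeout, 2 if the reader fails.
wait_prealloc() {
	local osp=$1 want=$2 timeout=${3:-30} reader=${4:-osp_prealloc_avail}
	local waited=0 avail

	while (( waited <= timeout )); do
		avail=$("$reader" "$osp") || return 2
		if (( avail >= want )); then
			return 0
		fi
		sleep 1
		waited=$(( waited + 1 ))
	done
	return 1
}
```

On an MDS node this might be called after raising create_count, for example `wait_prealloc lustre-OST0000-osc-MDT0000 1526 60`, where the target of 1526 objects is only an illustrative value.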

Comment by Alex Deiter [ 03/Nov/23 ]

Hello adilger,

Thank you very much for the patch and detailed explanation!
Please let me update my patch!

Thank you!

Comment by Gerrit Updater [ 18/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52968/
Subject: LU-17251 osp: force precreate if create_count grows
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: df5b4c0a8b076f36c63da89d81fe020ca29d39aa

Comment by Gerrit Updater [ 26/Nov/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53245
Subject: LU-17251 osp: start OST object precreate earlier
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6ffb849d7086a2b2ae48f274d4f5b1b8fbf83fe2

Comment by Gerrit Updater [ 20/Dec/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53245/
Subject: LU-17251 osp: start OST object precreate earlier
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0ccf1311382059d22cf4788136939647fba1317a
