[LU-11808] sanity-flr test 0b and 0c fails with ‘pool_new failed lustre.test_0b' Created: 18/Dec/18  Updated: 21/Dec/18  Resolved: 21/Dec/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: James A Simmons
Resolution: Duplicate Votes: 0
Labels: ubuntu
Environment:

Ubuntu 18.04 clients


Issue Links:
Cloners
is cloned by LU-11823 sanity-flr test 0b and 0c fails with ... Resolved
Duplicate
duplicates LU-11803 sanity test 255c fails with 'Ladvise ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-flr test_0b and test_0c fail to create a new OST pool and exit with the error 'pool_new failed lustre.test_0b'. So far, we have seen this error only on Ubuntu 18.04 clients.

The client test_log for https://testing.whamcloud.com/test_sets/d794bc4c-fdd4-11e8-a97c-52540065bddc shows:

== sanity-flr test 0b: lfs mirror create plain layout mirrors ======================================== 00:37:49 (1544488669)
CMD: trevis-19vm4 lctl pool_new lustre.test_0b
trevis-19vm4: Pool lustre.test_0b created
CMD: trevis-19vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.test_0b 				2>/dev/null || echo foo
CMD: trevis-19vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.test_0b 				2>/dev/null || echo foo
CMD: trevis-19vm1.trevis.whamcloud.com lctl get_param -n lov.lustre-*.pools.test_0b 		2>/dev/null || echo foo
CMD: trevis-19vm1.trevis.whamcloud.com lctl get_param -n lov.lustre-*.pools.test_0b 		2>/dev/null || echo foo
Waiting 90 secs for update
CMD: trevis-19vm1.trevis.whamcloud.com lctl get_param -n lov.lustre-*.pools.test_0b 		2>/dev/null || echo foo
…
Waiting 10 secs for update
CMD: trevis-19vm1.trevis.whamcloud.com lctl get_param -n lov.lustre-*.pools.test_0b 		2>/dev/null || echo foo
CMD: trevis-19vm1.trevis.whamcloud.com lctl get_param -n lov.lustre-*.pools.test_0b 		2>/dev/null || echo foo
CMD: trevis-19vm1.trevis.whamcloud.com lctl get_param -n lov.lustre-*.pools.test_0b 		2>/dev/null || echo foo
CMD: trevis-19vm1.trevis.whamcloud.com lctl get_param -n lov.lustre-*.pools.test_0b 		2>/dev/null || echo foo
Update not seen after 90s: wanted '' got 'foo'
 sanity-flr test_0b: @@@@@@ FAIL: pool_new failed lustre.test_0b 
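
The polling pattern visible in the log above (run a command repeatedly until its output matches the wanted value, or give up after a timeout) can be approximated with the loop below. This is only a minimal sketch for reproducing the check by hand, not the test framework's actual helper. The function name wait_for_value, its arguments, and the one-second default interval are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not the real test-framework code): poll a command
# until its output equals the wanted value or the timeout expires.
wait_for_value() {
    cmd=$1            # command string evaluated on each poll
    wanted=$2         # expected output
    timeout=$3        # maximum number of seconds to wait
    interval=${4:-1}  # seconds between polls (assumed default)
    waited=0
    while :; do
        got=$(eval "$cmd")
        if [ "$got" = "$wanted" ]; then
            echo "Updated after ${waited}s: wanted '$wanted' got '$got'"
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            echo "Update not seen after ${timeout}s: wanted '$wanted' got '$got'"
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

On a client, a call such as wait_for_value "lctl get_param -n lov.lustre-*.pools.test_0b 2>/dev/null || echo foo" "" 90 mirrors the check in the log: as long as get_param fails, the fallback prints 'foo', which never matches the wanted empty string.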

We see some errors in the Client 1 (vm1) console log:

[  379.473969] Lustre: DEBUG MARKER: == sanity-flr test 0b: lfs mirror create plain layout mirrors ======================================== 00:37:49 (1544488669)
[  380.882615] LustreError: 10433:0:(obd_config.c:1264:class_process_config()) no device for: lustre-clilov-000000007c485d00
[  380.883815] Lustre: 10433:0:(obd_config.c:1351:class_process_config()) Ignoring error -22 on optional command 0xce020
[  390.562916] Lustre: DEBUG MARKER: lctl get_param -n lov.lustre-*.pools.test_0b 2>/dev/null || echo foo
[  390.574202] Lustre: DEBUG MARKER: lctl get_param -n lov.lustre-*.pools.test_0b 2>/dev/null || echo foo
[  391.586752] Lustre: DEBUG MARKER: lctl get_param -n lov.lustre-*.pools.test_0b 2>/dev/null || echo foo
[  392.598782] Lustre: DEBUG MARKER: lctl get_param -n lov.lustre-*.pools.test_0b 2>/dev/null || echo foo
[  393.610433] Lustre: DEBUG MARKER: lctl get_param -n lov.lustre-*.pools.test_0b 2>/dev/null || echo foo
[  394.622561] Lustre: DEBUG MARKER: lctl get_param -n lov.lustre-*.pools.test_0b 2>/dev/null || echo foo
[  395.471515] Lustre: lustre-OST0000-osc-        (ptrval): disconnect after 22s idle
[  395.472390] Lustre: Skipped 5 previous similar messages
[  395.634478] Lustre: DEBUG MARKER: lctl get_param -n lov.lustre-*.pools.test_0b 2>/dev/null || echo foo

We see similar errors for the same test, but the test failure message is different: 'destroy pool failed lustre.test_0b'.

The Client 2 (vm9) dmesg log for https://testing.whamcloud.com/test_sets/147f0c12-e2f4-11e8-b67f-52540065bddc shows:

[  210.247534] Lustre: DEBUG MARKER: == sanity-flr test 0b: lfs mirror create plain layout mirrors ======================================== 09:05:40 (1541495140)
[  217.367773] Lustre: DEBUG MARKER: lctl get_param -n lov.lustre-*.pools.test_0b 2>/dev/null || echo foo
[  217.379702] Lustre: DEBUG MARKER: lctl get_param -n lov.lustre-*.pools.test_0b 2>/dev/null || echo foo
[  225.070588] Lustre: lustre-OST0000-osc-        (ptrval): disconnect after 21s idle
[  228.360951] Lustre: DEBUG MARKER: lctl get_param -n lov.lustre-*.pools.test_0b | sort -u | tr '\n' ' ' 
[  228.372621] Lustre: DEBUG MARKER: lctl get_param -n lov.lustre-*.pools.test_0b | sort -u | tr '\n' ' ' 
[  250.803910] random: crng init done
[  260.423743] LustreError: 10730:0:(obd_config.c:1264:class_process_config()) no device for: lustre-clilov-00000000b93a97e7
[  260.425715] Lustre: 10730:0:(obd_config.c:1351:class_process_config()) Ignoring error -22 on optional command 0xce022
[  270.916901] LustreError: 10742:0:(obd_config.c:1264:class_process_config()) no device for: lustre-clilov-00000000b93a97e7
[  270.918884] LustreError: 10742:0:(obd_config.c:1264:class_process_config()) Skipped 1 previous similar message
[  270.920584] Lustre: 10742:0:(obd_config.c:1351:class_process_config()) Ignoring error -22 on optional command 0xce022
[  270.922396] Lustre: 10742:0:(obd_config.c:1351:class_process_config()) Skipped 1 previous similar message
[  285.248874] LustreError: 10765:0:(obd_config.c:1264:class_process_config()) no device for: lustre-clilov-00000000b93a97e7
[  285.250873] Lustre: 10765:0:(obd_config.c:1351:class_process_config()) Ignoring error -22 on optional command 0xce022
[  290.469224] Lustre: DEBUG MARKER: lctl get_param -n lov.lustre-*.pools.test_0b 2>/dev/null || echo foo
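
To check whether another saved console or dmesg log contains the two signatures shown in the excerpts above (the 'no device for:' error from class_process_config() and a device name printed as '(ptrval)'), something like the following could be used. It is a rough sketch that relies only on grep. The function name scan_console_log and its output format are made up for illustration.

```shell
# Hypothetical helper: count the two signatures from this ticket in a
# saved console/dmesg log file.
scan_console_log() {
    log=$1
    # grep -c prints 0 and exits non-zero when nothing matches, so the
    # "|| true" keeps the count while ignoring that exit status.
    no_dev=$(grep -cF 'class_process_config()) no device for:' "$log" || true)
    ptrval=$(grep -cF '(ptrval)' "$log" || true)
    echo "no_device=$no_dev ptrval=$ptrval"
}
```

A non-zero count for either signature only suggests that a run resembles the ones in this ticket. It does not confirm the same root cause.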


 Comments   
Comment by Andreas Dilger [ 18/Dec/18 ]

This also has the "osc.*. (ptrval)" problem with parameters.

Comment by Peter Jones [ 18/Dec/18 ]

James

Can you please comment on this one too?

Peter

Comment by James A Simmons [ 19/Dec/18 ]

Yep, this shows that procfs has the same issue as sysfs with newer kernels. LOV pools are still in the procfs tree.

Comment by James A Simmons [ 19/Dec/18 ]

The patch https://review.whamcloud.com/#/c/33894 should resolve this

Comment by Andreas Dilger [ 21/Dec/18 ]

Duplicate of LU-11803, which has a patch.

Comment by James A Simmons [ 21/Dec/18 ]

The pool test still fails on ARM / Power8, so please keep this ticket open for 2.13. Ubuntu 18 seems to be resolved.

Comment by Peter Jones [ 21/Dec/18 ]

James, this ticket was specifically about Ubuntu 18, and it sounds like it is fixed. If there is a similar failure for ARM/Power, let's track that under a new ticket.

Comment by James A Simmons [ 21/Dec/18 ]

There is no easy way to clone the 3 tickets involved with this problem, is there?
