[LU-10230] sanity test_239: 4336 not synced Created: 10/Nov/17  Updated: 25/Nov/19  Resolved: 19/Feb/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Casper Assignee: James Nunez (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

onyx, interop
servers: el7.4, ldiskfs, branch b2_10, v2.10.1, b30
clients: el7.4, branch master, v2.10.55, b3667


Issue Links:
Related
is related to LU-11208 Interop 2.10.4<->master sanity test_2... Resolved
is related to LU-7251 reduce commit callbacks in OSP Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

session: https://testing.hpdd.intel.com/test_sessions/86f20854-dbc9-4ccd-8316-e84444480a63
test set: https://testing.hpdd.intel.com/test_sets/20819896-c58f-11e7-9c63-52540065bddc

Note: This looks like LU-5387.

From test_log:

CMD: onyx-41vm7 /usr/sbin/lctl get_param -n version 2>/dev/null ||
				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
CMD: onyx-41vm7 lctl get_param -n osp.*MDT*.sync_changes 			osp.*MDT*.sync_in_flight
 sanity test_239: @@@@@@ FAIL: 4336 not synced 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5289:error()
  = /usr/lib64/lustre/tests/sanity.sh:14057:test_239()
  = /usr/lib64/lustre/tests/test-framework.sh:5565:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5604:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5451:run_test()
  = /usr/lib64/lustre/tests/sanity.sh:14059:main()


 Comments   
Comment by Minh Diep [ 31/Jan/18 ]

+1 on b2_10
https://testing.hpdd.intel.com/sub_tests/a8eea3ca-060a-11e8-bd00-52540065bddc

Comment by Saurabh Tandan (Inactive) [ 08/May/18 ]

+1 on b2_10 for 2.10.3_132
RHEL7.3 2.9 Server
RHEL7.4 2.10 Client
https://testing.hpdd.intel.com/test_sets/df8e75b0-509f-11e8-abc3-52540065bddc

Comment by James Nunez (Inactive) [ 25/Oct/18 ]

Alex,

On the b2_10 branch, we landed a patch (commit 236f73509cdcc83cd) for LU-7251 that added the following check to sanity test 239:

+       [ $(lustre_version_code $SINGLEMDS) -gt $(version_code 2.10.1) ] &&
+               do_nodes $list "lctl set_param -n osp.*.force_sync=1"

For 2.10.2 and later servers, we would set force_sync=1 during testing.

On the master branch, we landed a patch (commit 0ba690a526be74c4cdffe7a7) for LU-7251 that added the following check to sanity test 239:

+       [ $(lustre_version_code $SINGLEMDS) -gt $(version_code 2.10.53) ] &&
+               do_nodes $list "lctl set_param -n osp.*.force_sync=1"

In this case, for servers with version 2.10.54 and later, we would set force_sync=1 during testing.

When we do interop testing with a 2.10.5 server and master clients, the server version is 2.10.5, which is lower than 2.10.53, so the master client's server version check in test 239 fails and force_sync is not set. Is this the correct behavior?
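
To illustrate why the master-branch check skips a 2.10.5 server, here is a minimal sketch. The helper ver_code is a hypothetical, simplified stand-in for the test framework's version_code (it assumes a plain three-field major.minor.patch string), and would_force_sync is likewise an illustrative name, not a function from sanity.sh.

```shell
# Hypothetical stand-in for version_code: packs "major.minor.patch"
# into a single integer so that versions can be compared numerically.
ver_code() {
	local IFS=.
	set -- $1
	echo $(( ($1 << 16) | ($2 << 8) | $3 ))
}

# Mirrors the master-branch condition: succeeds only when the server
# version is strictly greater than 2.10.53.
would_force_sync() {
	[ "$(ver_code "$1")" -gt "$(ver_code 2.10.53)" ]
}

for server in 2.10.5 2.10.6 2.10.54 2.12.0; do
	if would_force_sync "$server"; then
		echo "$server: force_sync=1 would be set"
	else
		echo "$server: force_sync would not be set"
	fi
done
```

Under these assumptions, 2.10.5 and 2.10.6 compare lower than 2.10.53 because the third field is compared as a number (5 < 53), so a b2_10 server that does support force_sync is skipped by the master client's check.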

Note: sanity test 239 was renamed to 239A in the master branch.

Comment by James Nunez (Inactive) [ 12/Feb/19 ]

Here are some recent interop failures of sanity test 239A:
2.10.6 server/2.12.51 clients - https://testing.whamcloud.com/test_sets/fd4cd52a-2e56-11e9-9b3a-52540065bddc
2.10.6 server/2.12.50 clients - https://testing.whamcloud.com/test_sets/002d2f4c-26e6-11e9-8486-52540065bddc

Comment by James Nunez (Inactive) [ 13/Feb/19 ]

I've modified the patch for LU-11208, https://review.whamcloud.com/#/c/33420/, to fix this issue for the 2.10.x server/master client configuration. Please review the patch and see if you agree with the fix.
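
For illustration only, one way a version check could cover both branches is sketched below. This is not taken from the patch itself. It assumes that b2_10 releases stay below 2.10.50 and that master development versions begin at 2.10.50; both boundaries, and the names ver_code and server_has_force_sync, are hypothetical.

```shell
# Hypothetical stand-in for version_code: packs "major.minor.patch"
# into a single integer so that versions can be compared numerically.
ver_code() {
	local IFS=.
	set -- $1
	echo $(( ($1 << 16) | ($2 << 8) | $3 ))
}

# Hypothetical combined check (assumed boundaries, see lead-in):
#   b2_10 servers:  newer than 2.10.1 and older than 2.10.50
#   master servers: newer than 2.10.53
server_has_force_sync() {
	local v
	v=$(ver_code "$1")
	if [ "$v" -gt "$(ver_code 2.10.1)" ] &&
	   [ "$v" -lt "$(ver_code 2.10.50)" ]; then
		return 0
	fi
	[ "$v" -gt "$(ver_code 2.10.53)" ]
}

# Example: a 2.10.6 server is accepted, a 2.10.52 server is not.
server_has_force_sync 2.10.6 && echo "2.10.6: set force_sync=1"
server_has_force_sync 2.10.52 || echo "2.10.52: leave force_sync alone"
```

With a range like this, a master client testing against a 2.10.6 server would still set force_sync, which is the case the failing interop runs hit.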

Comment by James Nunez (Inactive) [ 19/Feb/19 ]

Patch to modify version check landed to 2.13. If this issue persists, we can reopen this ticket.
