[LU-8274] conf-sanity test_90b fails with 'Unexpectedly send 2 modify RPCs in parallel' Created: 14/Jun/16  Updated: 01/May/20  Resolved: 01/May/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

autotest


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

conf-sanity test 90b fails with

'Unexpectedly send 2 modify RPCs in parallel' 

From the test_log, we see:

…
mount lustre on /mnt/lustre.....
Starting client: trevis-6vm5.trevis.hpdd.intel.com:  -o user_xattr,flock trevis-6vm12@tcp:/lustre /mnt/lustre
CMD: trevis-6vm5.trevis.hpdd.intel.com mkdir -p /mnt/lustre
CMD: trevis-6vm5.trevis.hpdd.intel.com mount -t lustre -o user_xattr,flock trevis-6vm12@tcp:/lustre /mnt/lustre
mdc.lustre-MDT0000-mdc-ffff880067493000.max_mod_rpcs_in_flight=1
max_mod_rpcs_in_flight set to 1
creating 2 files ...
CMD: trevis-6vm12 /usr/sbin/lctl set_param fail_loc=0x159
fail_loc=0x159
launch 0 chmod in parallel ...
CMD: trevis-6vm12 /usr/sbin/lctl set_param fail_loc=0
fail_loc=0
launch 1 additional chmod in parallel ...
/mnt/lustre/d90b.conf-sanity1/file-1 has perms 0600 OK
CMD: trevis-6vm12 /usr/sbin/lctl set_param fail_loc=0x159
fail_loc=0x159
launch 1 chmod in parallel ...
CMD: trevis-6vm12 /usr/sbin/lctl set_param fail_loc=0
fail_loc=0
launch 1 additional chmod in parallel ...
/mnt/lustre/d90b.conf-sanity1/file-2 has perms 0666, not 0644
 conf-sanity test_90b: @@@@@@ FAIL: Unexpectedly send 2 modify RPCs in parallel 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4782:error()
  = /usr/lib64/lustre/tests/conf-sanity.sh:6042:check_max_mod_rpcs_in_flight()
  = /usr/lib64/lustre/tests/conf-sanity.sh:6080:test_90b()
  = /usr/lib64/lustre/tests/test-framework.sh:5047:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5086:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4932:run_test()
  = /usr/lib64/lustre/tests/conf-sanity.sh:6120:main()

From the client dmesg log, we see

[13339.929607] Lustre: DEBUG MARKER: trevis-6vm5.trevis.hpdd.intel.com: == conf-sanity test 90b: check max_mod_rpcs_in_flight is enforced after update ======================= 00:50:49 (1465865449)
[13340.229219] Lustre: Lustre: Build Version: 2.8.54_6_g5f55128
[13340.246893] LNet: Added LNI 10.9.4.59@tcp [8/256/0/180]
[13340.249380] LNet: Accept all, port 7988
[13341.759059] Lustre: 27303:0:(gss_svc_upcall.c:1162:gss_init_svc_upcall()) Init channel is not opened by lsvcgssd, following request might be dropped until lsvcgssd is active
[13341.763808] Lustre: 27303:0:(gss_mech_switch.c:72:lgss_mech_register()) Register gssnull mechanism
[13341.766207] Key type lgssc registered
[13341.823651] Lustre: Echo OBD driver; http://www.lustre.org/
[13347.857820] Lustre: DEBUG MARKER: trevis-6vm5.trevis.hpdd.intel.com: mkdir -p /mnt/lustre
[13347.866695] Lustre: DEBUG MARKER: trevis-6vm5.trevis.hpdd.intel.com: mount -t lustre -o user_xattr,flock trevis-6vm12@tcp:/lustre /mnt/lustre
[13352.914409] Lustre: Mounted lustre-client
[13363.881053] Lustre: 27838:0:(client.c:2067:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1465865474/real 1465865474]  req@ffff88004cd90300 x1537071338488544/t0(0) o36->lustre-MDT0000-mdc-ffff880067493000@10.9.4.66@tcp:12/10 lens 608/1016 e 0 to 1 dl 1465865481 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[13363.888251] Lustre: lustre-MDT0000-mdc-ffff880067493000: Connection to lustre-MDT0000 (at 10.9.4.66@tcp) was lost; in progress operations using this service will wait for recovery to complete
[13363.897466] Lustre: lustre-MDT0000-mdc-ffff880067493000: Connection restored to 10.9.4.66@tcp (at 10.9.4.66@tcp)
[13364.328290] Lustre: DEBUG MARKER: trevis-6vm5.trevis.hpdd.intel.com: /usr/sbin/lctl mark trevis-6vm5.trevis.hpdd.intel.com:  conf-sanity test_90b: @@@@@@ FAIL: Unexpectedly send 2 modify RPCs in parallel 
[13364.525077] Lustre: DEBUG MARKER: trevis-6vm5.trevis.hpdd.intel.com: conf-sanity test_90b: @@@@@@ FAIL: Unexpectedly send 2 modify RPCs in parallel
[13364.712061] Lustre: DEBUG MARKER: trevis-6vm5.trevis.hpdd.intel.com: /usr/sbin/lctl dk > /logdir/test_logs/2016-06-13/lustre-reviews-el7-x86_64--review-zfs-part-2--1_9_1__39706__-70228218622320-204237/conf-sanity.test_90b.debug_log.$(hostname -s).1465865482.log;
         dmesg > /logdir/t

This test has failed with this error three times in the past two months:
2016-06-13 - https://testing.hpdd.intel.com/test_sets/85938326-31e7-11e6-bbf5-5254006e85c2
2016-06-09 - https://testing.hpdd.intel.com/test_sets/65ef675c-2e7b-11e6-acf3-5254006e85c2
2016-04-07 - https://testing.hpdd.intel.com/test_sets/e528b0d4-fd37-11e5-abd3-5254006e85c2



 Comments   
Comment by Emoly Liu [ 20/Jun/16 ]

Another failure on master:
https://testing.hpdd.intel.com/test_sets/0810a1fc-34be-11e6-80b9-5254006e85c2

Comment by James Nunez (Inactive) [ 18/Mar/19 ]

We're seeing a very similar failure for conf-sanity test 90a for interop testing; https://testing.whamcloud.com/test_sets/bf2c7fb0-4306-11e9-92fe-52540065bddc .

Comment by Andreas Dilger [ 01/May/20 ]

Haven't seen this for a long time.

Generated at Sat Feb 10 02:16:07 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.