[LU-7133] Interop 2.7.0 <-> master- conf-sanity test_43: check lustre-MDTall.mdt.nosquash_nids failed! Created: 10/Sep/15  Updated: 10/Sep/18

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Maloo
Assignee: Bob Glossman (Inactive)
Resolution: Unresolved
Votes: 0
Labels: None
Environment:

Client: 2.7.0
Server: lustre-master build #3166, RHEL 7


Issue Links:
Related
Severity: 3

 Description   

This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/d46e0c62-514d-11e5-9f68-5254006e85c2.

The sub-test test_43 failed with the following error:

check lustre-MDTall.mdt.nosquash_nids failed!

Test log:

Setting lustre.mdt.root_squash from 0:0 to 500:500
CMD: shadow-18vm12 /usr/sbin/lctl conf_param lustre.mdt.root_squash='500:500'
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.root_squash
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.root_squash
Waiting 90 secs for update
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.root_squash
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.root_squash
Updated after 2s: wanted '500:500' got '500:500'
CMD: shadow-18vm5.shadow.whamcloud.com /usr/sbin/lctl get_param -n llite.lustre*.root_squash
CMD: shadow-18vm5.shadow.whamcloud.com /usr/sbin/lctl get_param -n llite.lustre*.root_squash
/mnt/lustre/f43.conf-sanity-userfile: owner uid 500 (-rw-------): root read permission is granted - ok
/mnt/lustre/f43.conf-sanity-userfile: owner uid 500 (-rw-------): root write permission is granted - ok
/mnt/lustre/f43.conf-sanity-rootfile: owner uid 0 (-rw-------): root read permission is denied - ok
/mnt/lustre/f43.conf-sanity-rootfile: owner uid 0 (-rw-------): root write permission is denied - ok
/mnt/lustre/d43.conf-sanity-rootdir: owner uid 0 (drwx------): root unlink permission is denied - ok
/mnt/lustre/d43.conf-sanity-rootdir: owner uid 0 (drwx------): root create permission is denied - ok
/mnt/lustre/f43.conf-sanity-user1file: owner uid 501 (-rw-------): root read permission is denied - ok
/mnt/lustre/f43.conf-sanity-user1file: owner uid 501 (-rw-------): root write permission is denied - ok
/usr/lib64/lustre/tests/conf-sanity.sh: line 2844: 29182 Terminated              runas -u $ID1 tail -f $DIR/$tfile-user1file > /dev/null 2>&1
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
Setting lustre-MDTall.mdt.nosquash_nids from NONE to 2@elan 0@lo 10.1.4.215@tcp 192.168.0.[2,10]@tcp
CMD: shadow-18vm12 /usr/sbin/lctl conf_param lustre-MDTall.mdt.nosquash_nids='2@elan 0@lo 10.1.4.215@tcp 192.168.0.[2,10]@tcp'
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
Waiting 90 secs for update
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
Waiting 80 secs for update
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
Waiting 70 secs for update
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
Waiting 60 secs for update
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
Waiting 50 secs for update
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
Waiting 40 secs for update
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
Waiting 30 secs for update
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
Waiting 20 secs for update
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
Waiting 10 secs for update
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
CMD: shadow-18vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.nosquash_nids
Update not seen after 90s: wanted '2@elan 0@lo 10.1.4.215@tcp 192.168.0.[2,10]@tcp' got 'NONE'
 conf-sanity test_43: @@@@@@ FAIL: check lustre-MDTall.mdt.nosquash_nids failed! 
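
The log above shows the test polling `lctl get_param` until the value matches or 90 seconds elapse. A minimal Python sketch of that wait-for-update pattern follows. It assumes a hypothetical `read_value` callable standing in for the real `lctl get_param` call; `wait_for_update` and the injectable `clock`/`sleep` parameters are illustrative names, not part of the Lustre test framework.

```python
import time
from typing import Callable, Tuple


def wait_for_update(
    read_value: Callable[[], str],
    wanted: str,
    timeout_s: float = 90.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, str]:
    """Poll read_value() until it returns `wanted` or the timeout expires.

    Returns (True, value) on success and (False, last_value) on timeout.
    read_value is a hypothetical stand-in for running `lctl get_param -n ...`.
    """
    deadline = clock() + timeout_s
    last = read_value()
    while last != wanted and clock() < deadline:
        sleep(interval_s)
        last = read_value()
    return last == wanted, last
```

In the failing run, the equivalent of `read_value` kept returning 'NONE' for the full 90 seconds, so a loop like this would report a timeout along with the last value it saw.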

Console :

09:31:40:Lustre: DEBUG MARKER: == conf-sanity test 43: check root_squash and nosquash_nids == 09:28:26 (1441099706)
09:31:40:Lustre: DEBUG MARKER: mkdir -p /mnt/lustre
09:31:40:Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock shadow-18vm12@tcp:/lustre /mnt/lustre
09:31:40:LustreError: 28945:0:(obd_config.c:1322:class_process_proc_param()) llite: lustre-client-ffff8800795b0800 unknown param some_wrong_param=10
09:31:40:Lustre: Mounted lustre-client
09:31:40:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n llite.lustre*.root_squash
09:31:40:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n llite.lustre*.root_squash
09:31:40:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n llite.lustre*.nosquash_nids
09:31:40:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n llite.lustre*.nosquash_nids
09:31:40:Lustre: lustre: nosquash_nids is cleared
09:31:40:Lustre: lustre: root_squash is set to 500:500
09:31:40:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n llite.lustre*.root_squash
09:31:40:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n llite.lustre*.root_squash
09:31:40:Lustre: lustre: nosquash_nids set to 2@elan 0@lo 10.1.4.215@tcp 192.168.0.[2,10]@tcp
09:31:40:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  conf-sanity test_43: @@@@@@ FAIL: check lustre-MDTall.mdt.nosquash_nids failed! 
09:31:40:Lustre: DEBUG MARKER: conf-sanity test_43: @@@@@@ FAIL: check lustre-MDTall.mdt.nosquash_nids failed!


 Comments   
Comment by Andreas Dilger [ 29/Sep/15 ]

This is one of the top failing autotests:
https://testing.hpdd.intel.com/sub_tests/a0ca191c-664e-11e5-ba6e-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/0d7fd0e6-65ce-11e5-997c-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/18cccdd8-65af-11e5-997c-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/dfbe33d6-65b5-11e5-997c-5254006e85c2

Comment by Peter Jones [ 29/Sep/15 ]

Bob

Could you please look into this one?

Thanks

Peter

Comment by Andreas Dilger [ 29/Sep/15 ]

This started just before 2.7.59, so it may be possible to trace it to a specific patch landing. It might just be a test failure due to a feature, but it needs to be verified that it isn't an interop regression.

Comment by Bob Glossman (Inactive) [ 02/Oct/15 ]

Here's the problem, from the dmesg log of mds1, which is running the new (master) version:

[29879.051694] LNet: 14647:0:(nidstrings.c:271:parse_nidrange()) can't parse nidrange: "2@elan"
[29879.053687] Lustre: 14647:0:(lprocfs_status.c:1981:lprocfs_wr_nosquash_nids()) lustre-MDT0000: failed to set nosquash_nids to "2@elan 0@lo 10.1.4.215@tcp 192.168.0.[2,10]@tcp", can't parse rc = -22
[29879.057391] LustreError: 14647:0:(obd_config.c:1389:class_process_proc_param()) mdt.: error writing proc entry 'nosquash_nids': rc = -22

elan is one of the obsolete LNDs eliminated from master. However, it's still used in the example test nidlist in the old version of conf-sanity.sh in v2.7.0. The master server code can't parse it, so it just throws up its hands and complains. I don't see this as easily fixable on the server side in master. It could be fixed by moving part of the master fix in conf-sanity.sh into b2_7, but that won't fix the problem with interop of the currently released 2.7 with master.
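
The failure mode described above, where one unrecognized network type causes the whole nosquash_nids list to be rejected with rc = -22 (EINVAL), can be illustrated with a small Python sketch. This is not the LNet parser: `KNOWN_NET_TYPES` is an assumed, abbreviated stand-in for the server's table of supported network types, and `parse_nidlist` is a hypothetical name.

```python
import errno

# Assumed, abbreviated stand-in for the network types the server still knows.
# "elan" is deliberately absent, mirroring its removal from master.
KNOWN_NET_TYPES = {"lo", "tcp", "o2ib", "gni"}


def parse_nidlist(nidlist: str) -> int:
    """Return 0 if every NID range names a known network type, else -EINVAL."""
    for nidrange in nidlist.split():
        addr, sep, net = nidrange.rpartition("@")
        # Strip a trailing network number such as the "1" in "tcp1".
        net_type = net.rstrip("0123456789")
        if not sep or not addr or net_type not in KNOWN_NET_TYPES:
            # One bad entry rejects the whole list, as in the dmesg output.
            return -errno.EINVAL
    return 0
```

With these assumptions, the nidlist from the 2.7.0 test is rejected because of its first entry, while the same list without `2@elan` would be accepted.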

Comment by Bob Glossman (Inactive) [ 02/Oct/15 ]

From the commit header of the LU-6210 mod that removed the obsolete LNDs:

Remove old LND types from the netstrfns table, as they are
long obsolete and shouldn't be needed even for interop anymore.

Clearly this was a misstatement. At least one obsolete LND is still needed for interop, as there's a reference to it embedded in the old conf-sanity.sh.

Comment by Bob Glossman (Inactive) [ 02/Oct/15 ]

A possible fix might be to just put back an entry for the otherwise unsupported elan LND in the libcfs_netstrfns[] table. This would allow it to be parsed. However, I'm unclear whether putting an unsupported nidlist entry into lnet data structures might have bad side effects. It might get referenced, and the code might assume a functional LND is really there underneath.

Comment by James A Simmons [ 02/Oct/15 ]

You are correct, putting the elan LND support back will have negative effects. The proper fix is to update the test, like we did for master, to test for gnilnd instead of elan.

Comment by James A Simmons [ 02/Oct/15 ]

I pushed a patch: http://review.whamcloud.com/#/c/16717. I assume we need a patch for 2.6 and 2.5 as well? Let's land this to 2.7.1 before it is officially released, then we will have no further interop issues.

Comment by Peter Jones [ 02/Oct/15 ]

James

We only test 2.8 interop with 2.5.x and 2.7.x releases, so I think that is the limit of what is needed.

Peter

Comment by James A Simmons [ 02/Oct/15 ]

I see you pushed a patch, Bob, so I will abandon my patch.

Comment by Peter Jones [ 02/Oct/15 ]

To summarize though, I think that we can discount this from fix version 2.8 and just plan to tidy up the tests on the maintenance branches for future interop testing. As such, I think that we can close this ticket and track that effort separately.

Comment by Andreas Dilger [ 04/Oct/15 ]

The patch for b2_7 still needs to land.

Comment by Peter Jones [ 04/Oct/15 ]

...which will be tracked separately.

Comment by Saurabh Tandan (Inactive) [ 29/Oct/15 ]

Encountered the same issue in interop testing for tag 2.7.62.
Server: master, 2.7.62, build #3225
Client: 2.5.5, b2_5_fe/62

https://testing.hpdd.intel.com/test_sets/44cc8dd8-7b67-11e5-a83c-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 15/Dec/15 ]

Another instance for the following interop config:
Server: Master, Build# 3266, Tag 2.7.64
Client: 2.5.5, b2_5_fe/62
https://testing.hpdd.intel.com/test_sets/bc333cda-9fcc-11e5-a33d-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 16/Dec/15 ]

Server: Master, Build# 3266, Tag 2.7.64 , RHEL 7
Client: 2.5.5, b2_5_fe/62
https://testing.hpdd.intel.com/test_sets/f8bb27de-9fff-11e5-a33d-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 19/Jan/16 ]

Another instance found for interop: EL6.7 Server/2.5.5 Client
Server: master, build# 3303, RHEL 6.7
Client: 2.5.5, b2_5_fe/62
https://testing.hpdd.intel.com/test_sets/2f6cb0c2-bad6-11e5-9137-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 08/Feb/16 ]

This issue has been seen 21 times in the past 30 days.

Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ]

Another instance found for interop tag 2.7.66 - EL6.7 Server/2.5.5 Client, build# 3316
https://testing.hpdd.intel.com/test_sets/bdea5946-cc9f-11e5-963e-5254006e85c2

Another instance found for interop tag 2.7.66 - EL7 Server/2.5.5 Client, build# 3316
https://testing.hpdd.intel.com/test_sets/79a03aac-cc46-11e5-901d-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 24/Feb/16 ]

Another instance found for interop - EL6.7 Server/2.5.5 Client, tag 2.7.90.
https://testing.hpdd.intel.com/test_sessions/f99a2d60-d567-11e5-bc47-5254006e85c2
Another instance found for interop - EL7 Server/2.5.5 Client, tag 2.7.90.
https://testing.hpdd.intel.com/test_sessions/93baffee-d2ae-11e5-8697-5254006e85c2

Comment by James A Simmons [ 10/Sep/18 ]

Can we close this?

Generated at Sat Feb 10 02:06:16 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.