[LU-2458] Device MGC already exists, won't add Created: 10/Dec/12  Updated: 19/Jun/13  Resolved: 19/Jun/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.4, Lustre 2.1.6
Fix Version/s: Lustre 2.1.6

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Bob Glossman (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 5801

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/d34560e2-42a0-11e2-8dba-52540035b04c.

The sanity.sh test_77i, conf-sanity.sh test_6, and recovery-small.sh test_62 tests failed with the following errors in the console logs:

class_newdev()) Device MGC10.10.4.160@tcp already exists, won't add
class_attach()) Cannot create device MGC10.10.4.160@tcp of type mgc : -17
lustre_start_simple()) MGC10.10.4.160@tcp attach error -17
lustre_fill_super()) Unable to mount  (-17)
DEBUG MARKER: conf-sanity test_6: @@@@@@ FAIL: test_6 failed with 87

Info required for matching: mount failed
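
For reference, error -17 is -EEXIST: the attach fails because an obd device with the requested name is already registered, and the MGC device name is derived from the MGS NID ("MGC10.10.4.160@tcp"), so every mount against the same MGS contends for a single name. A minimal sketch of that behaviour (not the Lustre source; newdev() is a hypothetical stand-in for class_newdev()):

/*
 * Minimal sketch, not the Lustre source: a device name may be
 * registered once; a second registration fails with -EEXIST (-17),
 * which is exactly the "already exists, won't add" failure above.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEVICES 8

static const char *devices[MAX_DEVICES];        /* registered device names */

static int newdev(const char *name)
{
        int i, slot = -1;

        for (i = 0; i < MAX_DEVICES; i++) {
                if (devices[i] == NULL)
                        slot = i;               /* remember a free slot */
                else if (strcmp(devices[i], name) == 0)
                        return -EEXIST;         /* already exists, won't add */
        }
        if (slot < 0)
                return -ENOMEM;
        devices[slot] = name;
        return 0;
}

int main(void)
{
        printf("first attach:  %d\n", newdev("MGC10.10.4.160@tcp"));  /* 0 */
        printf("second attach: %d\n", newdev("MGC10.10.4.160@tcp"));  /* -17 */
        return 0;
}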



 Comments   
Comment by Andreas Dilger [ 11/Dec/12 ]

This looks like it may have started with the introduction of "jitter" on the test nodes. recovery-small.sh testing has had 5/8 failures since 12/10 05:30, but none before 12/08 06:26 (the previous b2_1 test run).

http://review.whamcloud.com/4482 was the base for both passing and failing tests, but had landed on 12/06, so it is questionable whether a code change caused the problem.

Comment by Jian Yu [ 12/Dec/12 ]

This is preventing the patches on b2_1 from passing the review test group:
http://review.whamcloud.com/#change,4783
http://review.whamcloud.com/#change,4784
http://review.whamcloud.com/#change,4782

and so on.

Comment by Jian Yu [ 12/Dec/12 ]

This also affected the test results of b2_1 build #148 on 2.1.3<->2.1.4 interop combinations.

Comment by Bob Glossman (Inactive) [ 12/Dec/12 ]

A similar problem was seen a few days earlier, also on a b2_1 test run:
https://maloo.whamcloud.com/test_sets/8b24f332-40d3-11e2-a16b-52540035b04c

Comment by Peter Jones [ 12/Dec/12 ]

Bob

Could you please look into this one?

Thanks

Peter

Comment by Bob Glossman (Inactive) [ 13/Dec/12 ]

It now appears that these failures are side effects of the increase in the number of virtual CPUs in our test VMs made for TT-928/TT-968. With that change backed out of our test environment, repeating the test run for http://review.whamcloud.com/#change,4516 (build 11037) now passes where it failed before.

This begs the question of whether our testing is really appropriate. With more and more multicore servers and clients in the real world, our test framework arguably should work correctly on multicore test nodes.
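
If extra parallelism is indeed the trigger, one plausible mechanism (an assumption, not confirmed in this ticket) is a teardown/startup race: a new mount's attach runs while the previous MGC for the same MGS NID is still being torn down, finds the name still registered, and fails with -EEXIST. A minimal sketch with hypothetical names:

/*
 * Illustrative sketch only: on a single vCPU the teardown usually
 * completes before the next test step runs; real parallelism widens
 * the race window.
 */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int mgc_registered = 1;          /* left over from the previous mount */

static void *slow_umount(void *arg)
{
        (void)arg;
        usleep(1000);                   /* teardown work still in flight */
        pthread_mutex_lock(&lock);
        mgc_registered = 0;             /* only now is the name released */
        pthread_mutex_unlock(&lock);
        return NULL;
}

static int mount_attach(void)
{
        int rc = 0;

        pthread_mutex_lock(&lock);
        if (mgc_registered)
                rc = -EEXIST;           /* "already exists, won't add" */
        else
                mgc_registered = 1;
        pthread_mutex_unlock(&lock);
        return rc;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, slow_umount, NULL);
        printf("attach during teardown: %d\n", mount_attach());  /* likely -17 */
        pthread_join(t, NULL);
        printf("attach after teardown:  %d\n", mount_attach());  /* 0 */
        return 0;
}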

Comment by Jian Yu [ 21/Dec/12 ]

Lustre Tag: v2_1_4_RC1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/159/
Distro/Arch: RHEL5.8/x86_64 + RHEL5.8/i686 (Server + Client)

While running replay-dual tests, the same issue occurred:
https://maloo.whamcloud.com/test_sets/df2f801c-4ad0-11e2-b87e-52540035b04c

Comment by Jian Yu [ 05/Mar/13 ]

Lustre Branch: b2_1
Lustre Build: http://build.whamcloud.com/job/lustre-reviews/13578/
Distro/Arch: RHEL5.9/x86_64

sanity test_77i, conf-sanity, and recovery-small test_62 failed with this issue again:
https://maloo.whamcloud.com/test_sets/d511b2f0-834f-11e2-98f5-52540035b04c
https://maloo.whamcloud.com/test_sets/243a6366-8352-11e2-98f5-52540035b04c
https://maloo.whamcloud.com/test_sets/681304c4-8354-11e2-98f5-52540035b04c

Comment by Jian Yu [ 06/Mar/13 ]

Hi Chris,

Did you turn on "jitter" on the test nodes again?

Comment by Jian Yu [ 23/Mar/13 ]

Lustre Client: v2_1_5_RC1
Lustre Server: 2.1.4

conf-sanity, recovery-small and replay-dual failed with this issue again:
https://maloo.whamcloud.com/test_sets/0f16b3ba-9327-11e2-b06e-52540035b04c
https://maloo.whamcloud.com/test_sets/17874dc8-9329-11e2-b06e-52540035b04c
https://maloo.whamcloud.com/test_sets/7ff9bc14-932a-11e2-b06e-52540035b04c

Comment by Jian Yu [ 05/Jun/13 ]

Lustre Tag: v2_1_6_RC1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/208
Distro/Arch: RHEL5.9/x86_64

sanity, conf-sanity, and recovery-small tests failed with this issue again:
https://maloo.whamcloud.com/test_sessions/8e59afbc-cd68-11e2-a1e0-52540035b04c

Comment by Jian Yu [ 05/Jun/13 ]

Lustre Tag: v2_1_6_RC1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/208
Distro/Arch: RHEL5.9/i686 (client), RHEL5.9/x86_64 (server)

sanity, conf-sanity, recovery-small, and replay-dual tests failed with this issue again:
https://maloo.whamcloud.com/test_sessions/dad32c08-cda7-11e2-ba28-52540035b04c

Comment by Bob Glossman (Inactive) [ 13/Jun/13 ]

In b2_1, sanity, conf-sanity, and recovery-small failed with this issue:
https://maloo.whamcloud.com/test_sessions/e3e05362-d43b-11e2-9e73-52540035b04c

Comment by Bob Glossman (Inactive) [ 14/Jun/13 ]

More failures on b2_1:
https://maloo.whamcloud.com/test_sessions/fb0ca642-d479-11e2-9e73-52540035b04c

Comment by Bruno Faccini (Inactive) [ 17/Jun/13 ]

Bob, according to the Lustre log taken during the failure, this problem could be the b2_1 duplicate of LU-639! So I decided to back-port the associated change (http://review.whamcloud.com/1896) to b2_1 and see whether it fixes the problem. I am giving it a try at http://review.whamcloud.com/6670.
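
The actual content of change 1896 is not quoted in this ticket. Purely as a hypothetical illustration, one common way to close a teardown/startup race of this kind is to retry the attach briefly while the old device goes away, instead of failing the mount on the first -EEXIST; try_attach_mgc() and attach_with_retry() below are invented names:

/*
 * Hypothetical illustration only; whether change 1896 works this way
 * is not shown in this ticket.
 */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

static int teardown_remaining = 3;      /* stub: old MGC still going away */

static int try_attach_mgc(const char *name)
{
        (void)name;
        if (teardown_remaining-- > 0)
                return -EEXIST;         /* name still registered */
        return 0;
}

static int attach_with_retry(const char *name)
{
        int i, rc = -EEXIST;

        for (i = 0; i < 10 && rc == -EEXIST; i++) {
                rc = try_attach_mgc(name);
                if (rc == -EEXIST)
                        usleep(100000); /* give the old MGC time to go away */
        }
        return rc;
}

int main(void)
{
        printf("attach rc = %d\n", attach_with_retry("MGC10.10.4.160@tcp"));
        return 0;
}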
