[LU-2326] Assertion triggered in class_add_uuid Created: 14/Nov/12  Updated: 16/Apr/20  Resolved: 16/Apr/20

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Prakash Surya (Inactive) Assignee: Alex Zhuravlev
Resolution: Cannot Reproduce Votes: 0
Labels: llnl
Environment:

2.3.54-6chaos


Severity: 3
Rank (Obsolete): 5557

 Description   

Hit this assertion on the MDS today:

LustreError: 1700:0:(lustre_peer.c:129:class_add_uuid()) ASSERTION( entry->un_nid_count < 32 ) failed: 
LustreError: 1700:0:(lustre_peer.c:129:class_add_uuid()) LBUG
PID: 1700   TASK: ffff8817a85d8040  CPU: 1   COMMAND: "llog_process_th"
 #0 [ffff88177da81aa8] machine_kexec at ffffffff8103216b
 #1 [ffff88177da81b08] crash_kexec at ffffffff810b8d12
 #2 [ffff88177da81bd8] panic at ffffffff814eea99
 #3 [ffff88177da81c58] lbug_with_loc at ffffffffa0577fcb [libcfs]
 #4 [ffff88177da81c78] class_add_uuid at ffffffffa06f8321 [obdclass]
 #5 [ffff88177da81cc8] class_process_config at ffffffffa070eb3c [obdclass]
 #6 [ffff88177da81d48] class_config_llog_handler at ffffffffa071110b [obdclass]
 #7 [ffff88177da81e28] llog_process_thread at ffffffffa06d1f3b [obdclass]
 #8 [ffff88177da81ed8] llog_process_thread_daemonize at ffffffffa06d249c [obdclass]
 #9 [ffff88177da81f48] kernel_thread at ffffffff8100c14a
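
The assertion fires in class_add_uuid() because each UUID entry stores its NIDs in a fixed-size array capped at 32, and the count is checked before appending a new NID. The following is a minimal sketch of that pattern, not the actual Lustre code: the struct layout, field names, and helper function here are hypothetical, and a plain assert() stands in for Lustre's LASSERT/LBUG. It illustrates how repeated reconfiguration that keeps appending distinct NIDs for the same UUID (without purging stale ones) can eventually exhaust the array and trip the assertion:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UN_NID_MAX 32  /* mirrors the hard per-UUID NID limit in lustre_peer.c */

/* Hypothetical stand-in for the per-UUID peer entry. */
struct uuid_nid_entry {
    uint64_t un_nids[UN_NID_MAX];
    int      un_nid_count;
};

/* Append a NID to the entry unless it is already present.
 * Returns 1 if the NID was added, 0 if it was a duplicate.
 * The assert() mirrors ASSERTION(entry->un_nid_count < 32):
 * the fixed array must not overflow, so a 33rd distinct NID
 * for one UUID is fatal (LBUG in the real code). */
static int uuid_entry_add_nid(struct uuid_nid_entry *entry, uint64_t nid)
{
    for (int i = 0; i < entry->un_nid_count; i++)
        if (entry->un_nids[i] == nid)
            return 0;                        /* duplicate: ignored */

    assert(entry->un_nid_count < UN_NID_MAX);  /* the assertion that fired */
    entry->un_nids[entry->un_nid_count++] = nid;
    return 1;
}
```

Under this model, duplicates are harmless, but every *distinct* NID ever recorded for the UUID consumes a slot, which is consistent with the later comments about administrative NID churn on this system.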


 Comments   
Comment by Ned Bass [ 14/Nov/12 ]

This happened on the grove Lustre cluster for Sequoia. We were trying to start the MDT for a new filesystem, "lsfull", that spans all of the grove OSSs. lsfull uses existing ZFS pools already in use by "ls1" and "lstest", filesystems which use only subsets of the grove OSTs. That is, we wanted to non-destructively create a new Lustre filesystem across all of grove whose datasets coexist in the ZFS pools with those of our existing filesystems. "lsfull" used the same MGS service as "ls1". We were not attempting to run lsfull concurrently with ls1 or lstest.

A couple of points may be noteworthy.

  • The system administrators restarted the ls1 filesystem with the writeconf option this morning
  • The first mount attempt of the MDT failed because the MGS hadn't started
Comment by Peter Jones [ 15/Nov/12 ]

Alex will triage this one

Comment by Alex Zhuravlev [ 22/Nov/12 ]

Ned, how many interfaces is that system equipped with? Could this or some other system have more than 32 NIDs assigned?

Comment by Prakash Surya (Inactive) [ 03/Dec/12 ]

I don't think this really warrants the "blocker" priority. It only occurred when we tried what Ned detailed above. Alex, I highly doubt the system has more than 32 network interfaces, but I'd have to check to say for sure.

Could this happen as a result of a configuration mistake (i.e. if all OSTs were configured and trying to connect using the same NID)?

Comment by Peter Jones [ 03/Dec/12 ]

Prakash

It is marked as a blocker because it had been marked as a top Sequoia issue. Is that no longer the case?

Peter

Comment by Ned Bass [ 03/Dec/12 ]

Hi Alex,

None of our systems have anywhere near 32 NIDs. However there was some administrative adding and removing of NIDs on this system. We recently started mounting this filesystem on additional clusters in our center, and I think a few different LNET configurations may have been tried before we arrived at one that worked. I don't have exact details on what was done, but I wonder if it could be a factor.

Comment by Alex Zhuravlev [ 03/Dec/12 ]

Hi, that does seem related. I'm going to try to reproduce it locally first. Thanks!

Comment by Prakash Surya (Inactive) [ 03/Dec/12 ]

Peter, well, nobody has really told me how the "topsequoia" label should be used. I've been marking anything related to Sequoia and/or Grove as "topsequoia" (not only "top" issues), and then trying to use the priority field to add finer-grained information. If it'd be better, I could use a "sequoia" label (or something similar) instead, and reserve "topsequoia" for blockers?

Generated at Sat Feb 10 01:24:16 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.