[LU-2326] Assertion triggered in class_add_uuid
Created: 14/Nov/12  Updated: 16/Apr/20  Resolved: 16/Apr/20
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Alex Zhuravlev |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llnl |
| Environment: | 2.3.54-6chaos |
| Severity: | 3 |
| Rank (Obsolete): | 5557 |
| Description |
|
Hit this assertion on the MDS today:

```
LustreError: 1700:0:(lustre_peer.c:129:class_add_uuid()) ASSERTION( entry->un_nid_count < 32 ) failed:
LustreError: 1700:0:(lustre_peer.c:129:class_add_uuid()) LBUG
PID: 1700  TASK: ffff8817a85d8040  CPU: 1  COMMAND: "llog_process_th"
 #0 [ffff88177da81aa8] machine_kexec at ffffffff8103216b
 #1 [ffff88177da81b08] crash_kexec at ffffffff810b8d12
 #2 [ffff88177da81bd8] panic at ffffffff814eea99
 #3 [ffff88177da81c58] lbug_with_loc at ffffffffa0577fcb [libcfs]
 #4 [ffff88177da81c78] class_add_uuid at ffffffffa06f8321 [obdclass]
 #5 [ffff88177da81cc8] class_process_config at ffffffffa070eb3c [obdclass]
 #6 [ffff88177da81d48] class_config_llog_handler at ffffffffa071110b [obdclass]
 #7 [ffff88177da81e28] llog_process_thread at ffffffffa06d1f3b [obdclass]
 #8 [ffff88177da81ed8] llog_process_thread_daemonize at ffffffffa06d249c [obdclass]
 #9 [ffff88177da81f48] kernel_thread at ffffffff8100c14a
```
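For context, the assertion that fired guards a fixed-size, per-UUID NID array in obdclass. Below is a minimal sketch of that pattern, assuming the layout suggested by lustre_peer.c in the 2.x tree; the names (uuid_nid_data, un_nids, NIDS_MAX, add_nid) follow that file loosely, but the code is simplified for illustration and is not the actual Lustre source.

```c
/* Simplified sketch of the pattern behind lustre_peer.c:class_add_uuid()
 * in Lustre 2.x -- illustrative only, not the actual source. Each UUID
 * entry carries a fixed-size NID array; appending a NID that is not
 * already present asserts that a slot remains, which is the check that
 * fired here. */
#include <assert.h>
#include <stdint.h>

#define NIDS_MAX 32            /* fixed capacity per UUID entry */

typedef uint64_t lnet_nid_t;   /* stand-in for the LNET NID type */

struct uuid_nid_data {
	char       un_uuid[40];
	int        un_nid_count;
	lnet_nid_t un_nids[NIDS_MAX];
};

/* Record @nid for @entry unless it is already present. */
void add_nid(struct uuid_nid_data *entry, lnet_nid_t nid)
{
	int i;

	for (i = 0; i < entry->un_nid_count; i++)
		if (entry->un_nids[i] == nid)
			return;         /* exact duplicate: nothing to do */

	/* the equivalent of ASSERTION( entry->un_nid_count < 32 ) */
	assert(entry->un_nid_count < NIDS_MAX);
	entry->un_nids[entry->un_nid_count++] = nid;
}
```

A fixed array presumably keeps the lookup path allocation-free under the list lock; the trade-off is a hard cap, so a 33rd distinct NID for one UUID produces an LBUG rather than a graceful error.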
| Comments |
| Comment by Ned Bass [ 14/Nov/12 ] |
|
This happened on the grove Lustre cluster for Sequoia. We were trying to start the MDT for a new filesystem, "lsfull", that spans all of the grove OSSs. lsfull uses existing ZFS pools already in use by "ls1" and "lstest", filesystems which use only subsets of the grove OSTs. That is, we wanted to non-destructively create a new Lustre filesystem across all of grove whose datasets coexist in the ZFS pools with those of our existing filesystems. "lsfull" used the same MGS service as "ls1". We were not attempting to run lsfull concurrently with ls1 or lstest. A couple of points may be noteworthy.
|
| Comment by Peter Jones [ 15/Nov/12 ] |
|
Alex will triage this one |
| Comment by Alex Zhuravlev [ 22/Nov/12 ] |
|
Ned, how many interfaces is that system equipped with? Perhaps this or some other system has >32 NIDs assigned? |
| Comment by Prakash Surya (Inactive) [ 03/Dec/12 ] |
|
I don't think this really warrants the "blocker" priority. It only occurred when we tried what Ned detailed above. Alex, I highly doubt the system has more than 32 network interfaces, but I'd have to check to say for sure. Could this happen as a result of a configuration mistake (i.e. if all OSTs were configured and trying to connect using the same NID)? |
| Comment by Peter Jones [ 03/Dec/12 ] |
|
Prakash, it is marked as a blocker because it had been marked as a top Sequoia issue. Is that no longer the case? Peter |
| Comment by Ned Bass [ 03/Dec/12 ] |
|
Hi Alex, None of our systems have anywhere near 32 NIDs. However there was some administrative adding and removing of NIDs on this system. We recently started mounting this filesystem on additional clusters in our center, and I think a few different LNET configurations may have been tried before we arrived at one that worked. I don't have exact details on what was done, but I wonder if it could be a factor. |
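If administrative LNET reconfiguration is a factor, the mechanism would fit the sketch above: the duplicate check only skips NIDs that match exactly, so every historically distinct NID recorded for the same UUID (say, from trying different LNET network configurations) would keep its own slot until the 32-entry cap is hit. A toy, standalone simulation of that accumulation, hypothetical code rather than anything from the Lustre tree:

```c
/* Toy simulation of the accumulation hypothesis -- hypothetical code,
 * not Lustre source. Feeding 33 distinct "NIDs" for a single UUID (as
 * replayed add_uuid config records might, after repeated LNET
 * reconfiguration) aborts on the 33rd, mirroring the LBUG seen here. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define NIDS_MAX 32

int main(void)
{
	uint64_t nids[NIDS_MAX];
	int count = 0;

	for (uint64_t nid = 1; nid <= 33; nid++) {
		int dup = 0;
		for (int i = 0; i < count; i++)
			if (nids[i] == nid)
				dup = 1;
		if (dup)
			continue;       /* identical NIDs are deduplicated */

		printf("adding nid %llu (count=%d)\n",
		       (unsigned long long)nid, count);
		assert(count < NIDS_MAX);       /* fires at the 33rd NID */
		nids[count++] = nid;
	}
	return 0;
}
```

Note that under this sketch's assumptions, repeated records for the *same* NID would be harmless; only distinct NIDs consume slots, which bears on Prakash's question above about all OSTs sharing one NID.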
| Comment by Alex Zhuravlev [ 03/Dec/12 ] |
|
Hi, that seems to be related. I'm going to try to reproduce it locally first. Thanks! |
| Comment by Prakash Surya (Inactive) [ 03/Dec/12 ] |
|
Peter, well, nobody has really told me how the "topsequoia" label should be used. I've been marking anything related to Sequoia and/or Grove as "topsequoia" (not only "top" issues), and then trying to use the priority field to add finer-grained information. If it'd be better, I can use a "sequoia" label (or something similar) instead, and reserve "topsequoia" for blockers? |