Details
-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
Lustre 2.15.0
-
TOSS 4.4-4.1
4.18.0-372.19.1.1toss.t4.x86_64
lustre 2.15.0_3.llnl
-
3
-
9223372036854775807
Description
After a test that involved powering off the MGS/MDT0 node of the test cluster garter, the MDTs fail to come back up. They are stuck in "WAITING".
An IOR test was in progress when the MGS node was powered off.
Failover seemed to work, but the IOR job was unable to finish. The cluster was eventually rebooted.
After the reboot, the logs show that lfsck_start is failing on all 4 MDTs, with a line such as:
2022-09-07 14:35:12 [ 5791.188613] Lustre: 32474:0:(mdt_handler.c:7522:mdt_postrecov()) lflood-MDT0000: auto trigger paused LFSCK failed: rc = -6
I believe that the -6 (-ENXIO) is coming from one of these two checks in lfsck_start():
lfsck = lfsck_instance_find(key, true, false);
if (unlikely(lfsck == NULL))
        RETURN(-ENXIO);
if (unlikely(lfsck->li_stopping))
        GOTO(put, rc = -ENXIO);
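As a triage aid, the mapping from rc = -6 to ENXIO can be checked mechanically. The following is a minimal sketch, not part of Lustre: the helper name decode_rc_line and the regular expression are hypothetical, and the message layout it assumes is inferred only from the single console line quoted above.

```python
import errno
import os
import re

# Assumed layout, inferred from the one console line quoted in this ticket:
#   (<file>:<line>:<function>()) <target>: <message>: rc = <code>
LUSTRE_RC_LINE = re.compile(
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+"
    r"(?P<target>[\w-]+):.*\brc = (?P<rc>-?\d+)"
)


def decode_rc_line(text):
    """Extract function, target, and rc from one log line; None if no match."""
    match = LUSTRE_RC_LINE.search(text)
    if match is None:
        return None
    rc = int(match.group("rc"))
    return {
        "func": match.group("func"),
        "target": match.group("target"),
        "rc": rc,
        # Kernel code returns negated errno values, so look up abs(rc).
        "errno_name": errno.errorcode.get(abs(rc), "unknown"),
        "errno_text": os.strerror(abs(rc)),
    }


SAMPLE = (
    "2022-09-07 14:35:12 [ 5791.188613] Lustre: "
    "32474:0:(mdt_handler.c:7522:mdt_postrecov()) "
    "lflood-MDT0000: auto trigger paused LFSCK failed: rc = -6"
)

if __name__ == "__main__":
    print(decode_rc_line(SAMPLE))
```

On Linux, errno 6 is ENXIO, which matches both return paths in the lfsck_start() excerpt. The sketch cannot tell which of the two checks fired; that still requires the debug logs.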
For our reference, our local ticket is TOSS5875
Attachments
Activity
Resolution | New: Cannot Reproduce [ 5 ] | |
Status | Original: Open [ 1 ] | New: Closed [ 6 ] |
Labels | Original: llnl topllnl | New: llnl |
Description | New: appended "For our reference, our local ticket is TOSS5875" |
Link | Original: This issue is related to JFC-21 [ JFC-21 ] |
Attachment | New: 2022-09-07_console.garter2.gz [ 45660 ] |
Attachment | New: 2022-09-07_console.garter1.gz [ 45659 ] |
Labels | New: llnl topllnl |
Link | New: This issue is related to JFC-21 [ JFC-21 ] |