[LU-1687] Unloading lustre modules and reloading again leaves MDS with an empty /proc/fs/lustre Created: 27/Jul/12 Updated: 19/Apr/19 Resolved: 06/Aug/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jay Lan (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None |
| Environment: | https://github.com/jlan/lustre-nas/tree/nas-2.1.2 |
| Attachments: | nas-config.sh.rhel62.212, nas-make.sh.rhel62.212 |
| Severity: | 3 |
| Rank (Obsolete): | 6179 |
| Description |
|
The sanity-quota test_32 failed in my testing. The problem is reproducible. Fortunately, we do not need to perform this sequence of operations on the MDS often. |
| Comments |
| Comment by Peter Jones [ 27/Jul/12 ] |
|
Lai, could you please look into this one? Thanks. Peter |
| Comment by Lai Siyao [ 30/Jul/12 ] |
|
I can't reproduce this in my setup; I'll look further into the debug log and the code to find the cause. |
| Comment by Jay Lan (Inactive) [ 30/Jul/12 ] |
|
Fortunately I am still able to reproduce it. Let me know what I can do to help debug this problem. BTW, our Lustre source can be git cloned from https://github.com/jlan/lustre-nas/tree/nas-2.1.2 (the URL in the Environment field above). |
| Comment by Lai Siyao [ 31/Jul/12 ] |
|
I can't build Lustre against your git code because LUSTRE_KERNEL_VERSION is undefined. Git log shows it was removed in commit c2751b31e55518d1791cd5b87adc842f4fbbee83; could you help verify that? Also, if the code can be built on your system, could you post the contents of /proc/fs/lustre/version? |
| Comment by Jay Lan (Inactive) [ 31/Jul/12 ] |
|
I checked again; commit c2751b3 is in the nas-2.1.2 branch. It was not removed or reverted.

    service337 ~ # cat /proc/fs/lustre/version

I built the 2.6.32-220.4.1.el6 kernel with the el6 kernel_patches from the 2.1.2 branch. Your kernel may be named differently. Also, I used the 1.5.4.1 ofa_kernel modules. I am attaching the two script files I used for my build: nas-config.sh.rhel62.212 and nas-make.sh.rhel62.212. |
| Comment by Lai Siyao [ 01/Aug/12 ] |
|
Yes, I can compile with your script, but this test still passes on my system. After this test fails, can you still successfully mount Lustre on the MDS? |
| Comment by Jay Lan (Inactive) [ 01/Aug/12 ] |
|
I am able to reproduce the "good" and "bad" cases without running sanity-quota test_32. After a Lustre system has been set up, the "good" case operation sequence is:
The filesystem will be in good shape and usable. If I 'umount /mnt/mds1' before running 'lustre_rmmod', the empty /proc/fs/lustre problem will happen after I do the same MDS recovery operations. The mount command will return, and 'mount' will show mds1 mounted; however, the filesystem is not usable. Can you try this operation sequence and let me know if you can reproduce the problem? |
| Comment by Lai Siyao [ 01/Aug/12 ] |
|
No, I can't reproduce it here. BTW, could you explain why you need to do MDS recovery? IMO it is shut down and started up normally in your case. |
| Comment by Jay Lan (Inactive) [ 01/Aug/12 ] |
|
Ah, I meant to say restart. Somehow the restart after shutdown does not behave the same way as the initial start. I will do more debugging tomorrow. |
| Comment by Jay Lan (Inactive) [ 02/Aug/12 ] |
|
I put a printk into lprocfs_seq_create():

    int lprocfs_seq_create(cfs_proc_dir_entry_t *parent, char *name, mode_t mode,
                           ...)
    {
            ...
            LPROCFS_WRITE_ENTRY();
            ...
            LPROCFS_WRITE_EXIT();
            ...
            if (entry == NULL) {
                    printk("lprocfs_seq_create: failed to create %s\n", name);
                    RETURN(-ENOMEM);
            } else
                    RETURN(0);
    }

And the syslog showed that, as far as lprocfs_seq_create() was concerned, the devices entry under /proc/fs/lustre was created successfully. Yet `ls /proc/fs/lustre` returned empty. This is weird; I will continue to look into this. |
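For context, here is a sketch of the whole instrumented function as I understand the 2.1.x code; the middle of the function (the create_proc_entry() call and the assignments on success) is reconstructed from memory rather than taken from this ticket, so treat it as an approximation:

    /* sketch of the instrumented lprocfs_seq_create(); reconstructed,
     * may differ slightly from the actual nas-2.1.2 source */
    int lprocfs_seq_create(cfs_proc_dir_entry_t *parent, char *name, mode_t mode,
                           const struct file_operations *seq_fops, void *data)
    {
            struct proc_dir_entry *entry;
            ENTRY;

            LPROCFS_WRITE_ENTRY();
            entry = create_proc_entry(name, mode, parent);
            if (entry) {
                    entry->proc_fops = seq_fops;
                    entry->data = data;
            }
            LPROCFS_WRITE_EXIT();

            if (entry == NULL) {
                    printk("lprocfs_seq_create: failed to create %s\n", name);
                    RETURN(-ENOMEM);
            }
            RETURN(0);
    }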
| Comment by Jay Lan (Inactive) [ 02/Aug/12 ] |
|
Oh, in addition to the change to lprocfs_seq_create(), I also made a change to class_procfs_init(); without it, the syslog output I quoted above would not make sense. The diff just adds:

    + printk("class_procfs_init: registering /proc/fs/lustre\n");
|
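For reference, class_procfs_init() in the 2.1.x obdclass code looks roughly like the sketch below, with the added printk in place; the exact shape and the names lprocfs_base and obd_device_list_fops are from memory and may not match the nas-2.1.2 tree exactly:

    /* rough sketch of class_procfs_init() with the extra printk;
     * not copied from the ticket or from the nas-2.1.2 source */
    int class_procfs_init(void)
    {
            int rc;
            ENTRY;

            obd_sysctl_init();
            printk("class_procfs_init: registering /proc/fs/lustre\n");
            /* registers the /proc/fs/lustre root directory */
            proc_lustre_root = lprocfs_register("fs/lustre", NULL,
                                                lprocfs_base, NULL);
            /* creates /proc/fs/lustre/devices as a seq_file entry */
            rc = lprocfs_seq_create(proc_lustre_root, "devices", 0444,
                                    &obd_device_list_fops, NULL);
            if (rc)
                    CERROR("error adding /proc/fs/lustre/devices file\n");
            RETURN(0);
    }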
| Comment by Jay Lan (Inactive) [ 03/Aug/12 ] |
|
In the test case, when /proc/fs/lustre appeared to have been removed, it actually was not.

    service337 /proc/fs # ls

You need to use 'ls -lid' to see it. So the next time we restart the MDS, another /proc/fs/lustre is created (with a different inode number). All the other inodes were created successfully under the new /proc/fs/lustre; unfortunately, the system cannot see them. What caused the original /proc/fs/lustre to hang around? From Lustre's perspective the removal completed, but SystemTap into the kernel showed that /proc/fs/lustre/version was not removed for some reason: remove_proc_entry() did not call free_proc_entry() in the case of lustre/version. Consequently, /proc/fs/lustre cannot be removed either, I suppose.

    service337 /proc/fs # ls -lid lustre/version

Why? I will investigate more... |
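The behaviour described here matches how 2.6.32-era kernels finish remove_proc_entry(): the entry is only freed once its reference count drops to zero. A paraphrased sketch of the relevant tail of fs/proc/generic.c (not the literal RHEL6 code) is:

    /* paraphrased tail of remove_proc_entry(); if something still holds a
     * reference to the entry (e.g. a process has the file open), the
     * refcount does not reach zero and free_proc_entry() is skipped */
    if (S_ISDIR(de->mode))
            parent->nlink--;
    de->nlink = 0;
    if (atomic_dec_and_test(&de->count))
            free_proc_entry(de);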
| Comment by Jay Lan (Inactive) [ 06/Aug/12 ] |
|
    service337 /proc/fs # fuser lustre/version

Please close this ticket. The problem was caused by /proc/fs/lustre/version being held open by a NASA admin script. |
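For anyone debugging a similar symptom later: a minimal, hypothetical user-space program like the one below (not the NASA script, just an illustration) is enough to keep /proc/fs/lustre/version pinned, so that unloading the Lustre modules leaves the stale /proc/fs/lustre directory behind:

    /* hold_version.c: hypothetical reproducer that keeps the proc file open */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/proc/fs/lustre/version", O_RDONLY);
            if (fd < 0) {
                    perror("open /proc/fs/lustre/version");
                    return 1;
            }
            /* While this fd stays open, remove_proc_entry() cannot free the
             * entry, so the old /proc/fs/lustre lingers after lustre_rmmod. */
            sleep(3600);
            close(fd);
            return 0;
    }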
| Comment by Peter Jones [ 06/Aug/12 ] |
|
ok thanks Jay! |