Saturday, May 3, 2014

Art of Linux Troubleshooting

Here are some valuable tips to help you recover important RHEL 5 system files that get deleted by accident.
Everything seems fine when your Linux machine works just the way you want it to. But that feeling changes dramatically when your machine starts creating problems that you find really difficult to sort out. Not everyone can troubleshoot a Linux machine efficiently… but you can, if you are ready to stay with us for the next few minutes. Let’s look at what to do when some of your important system files are deleted or corrupted under Red Hat Enterprise Linux 5. The action begins now!
Scene 1: /etc/passwd is deleted
This is an important file in Linux as it contains information about user accounts (the passwords themselves live in /etc/shadow). If it's missing from your system and you try to log in to a user account, you get a "Login incorrect" error, even after restarting the system.
Now that you have seen the problem and its consequences, it's time to solve it. Boot into single user mode. At the start of booting, press any key to enter the GRUB menu. Here you will see a list of the operating systems installed. Just select the one you are working with and press e.
It's time to have some fun with kernel parameters. So highlight the line that begins with kernel and press e again to edit its parameters.
Next, instruct the kernel to boot into single user mode, which is also known as maintenance mode. Just type 1 after a space and press the Enter key. Now press b to continue the booting process.
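For reference, on a stock RHEL 5 install the edited line looks something like the following (the kernel version and root device here are just examples; yours will likely differ). Note the 1 at the end:
kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00 rhgb quiet 1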
Now that you have booted into single user mode, you are probably asking yourself, "What next?" The tricky portion of this exercise is now over, and it takes just one command to put your passwd file back in its place. Actually, there is a file called /etc/passwd-, which is nothing but a backup of /etc/passwd. So all you need to do is issue the following command:
cp /etc/passwd- /etc/passwd
…and you are done. Now you can issue the init 5 command to switch to the graphical mode. Everything is fine now. You can also find the backup of /etc/shadow and /etc/gshadow as /etc/shadow- and /etc/gshadow- respectively.
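Should /etc/shadow or /etc/gshadow ever meet the same fate, the recovery is identical. As a sketch (assuming those backup files are intact), restore them and run pwck in read-only mode to confirm the account files are consistent:
cp /etc/shadow- /etc/shadow
cp /etc/gshadow- /etc/gshadow
pwck -r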
Scene 2: /etc/pam.d/login is deleted
If your /etc/pam.d/login file is deleted and you try to log in, it won’t ask you to enter your password after entering your username. Instead, it will continuously show the localhost login prompt. Here again, there is a single command that will solve the problem for you:
cp /etc/pam.d/system-auth /etc/pam.d/login
Just boot into the single user mode as done earlier, type this command and you’ll be able to log in normally. There is also a second solution to this problem, which we’ll look at after a while.
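If you're curious which package owns the login file (it will matter for the reinstall method coming up in Scene 3), rpm can tell you; on RHEL 5 it should report util-linux:
rpm -q --whatprovides /etc/pam.d/login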
Scene 3: /etc/inittab is deleted
We know that in Linux, init is the first process to be started and it starts all the other processes. The /etc/inittab file contains instructions for the init process and if it’s missing, then no further process can be launched. On starting a system with no inittab file, it will show the following message:
INIT: No inittab file found
…and it will ask you to enter a runlevel. When you do, it again reports that there are no more processes left in this runlevel.
Fixing this problem is not as easy, because single user mode doesn't help in this case. Here, you need the Linux rescue environment. So set your first boot device to CD, boot from the RHEL 5 CD and, at the boot prompt, type linux rescue to enter the rescue environment.
Once you have entered into the rescue environment, your system will be mounted under /mnt/sysimage. Here, reinstall the package that provides the /etc/inittab file. The overall process is given below:
chroot /mnt/sysimage
rpm -q --whatprovides /etc/inittab
mkdir /a
mount /dev/hdc /a
Here, /dev/hdc is the device node of the CD drive; it may vary on your system.
rpm -Uvh --force /a/Server/initscripts-8.45.25-1.el5.i386.rpm
You can also hit the Tab key after typing init to auto-complete the package file name.
Now you'll get your /etc/inittab file back. The same procedure can be applied to recover the /etc/pam.d/login file; in that case, you'll have to install the util-linux package instead. Once you are done, type exit to leave the rescue environment, set your first boot device back to the hard disk and boot normally.
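For reference, the equivalent recovery for /etc/pam.d/login would look something like the lines below; <version> is a placeholder, since the exact util-linux file name varies by release. Hit Tab to complete it, just as with initscripts:
rpm -q --whatprovides /etc/pam.d/login
rpm -Uvh --force /a/Server/util-linux-<version>.i386.rpm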
Scene 4: /boot/grub/grub.conf is deleted
This file is the configuration file of the GRUB boot loader. If it is deleted and you start your machine, you will see a GRUB prompt that indicates that grub.conf is missing and there is no further instruction for GRUB to carry on its operation.
But don't worry, as we'll solve this problem, too, in the next few minutes. You don't even need single user mode or the Linux rescue environment for this one. At the GRUB prompt, you can enter a few commands that will make your system boot. So here we go: type root ( and hit Tab to find out the disks attached to the system. In my case, I got hd0 and fd0, the hard disk and the floppy disk, respectively. GRUB's files live on the /boot partition, usually the first partition of the first hard disk, which GRUB calls (hd0,0). So the complete command is root (hd0,0). Enter it and press the Enter key to carry on.
You now need to find the kernel image file, so enter kernel /v and hit Tab to auto-complete it. On my system, it's vmlinuz-2.6.18-128.el5. Note it down, as we'll need this information shortly, and then press Enter.
Next, let's figure out the initrd image file. Enter initrd /i and press Tab to auto-complete it. For me, it's initrd-2.6.18-128.el5.img. Again, note it down and press Enter.
Type boot and press Enter, and the system will boot normally.
Now it’s time to create a grub.conf file manually. So create the /boot/grub/grub.conf file and enter the following data in it:
splashimage=(hd0,0)/grub/splash.xpm.gz
default=0
timeout=5
title Red Hat
root (hd0,0)
kernel /vmlinuz-2.6.18-128.el5
initrd /initrd-2.6.18-128.el5.img
Save the file and quit. You have now created a grub.conf file manually and resolved the problem. Don't forget that the kernel and initrd image file names may vary on your system; that's why I asked you to note them down earlier. You can also find them in the /boot folder once you are logged in, so it's not a big issue.
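If you skipped the note-taking, a quick look in /boot recovers both names:
ls /boot | grep -E 'vmlinuz|initrd'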




Some essential tricks for admins



The best systems administrators are set apart by their efficiency. And if an efficient systems administrator can do a task in 10 minutes that would take another mortal two hours to complete, then the efficient systems administrator should be rewarded (paid more) because the company is saving time, and time is money, right?
The trick is to prove your efficiency to management. While I won't attempt to cover that trick in this article, I will give you 10 essential gems from the lazy admin's bag of tricks. These tips will save you time, and even if you don't get paid more money to be more efficient, you'll at least have more time to play a game or two.
Trick 1: Unmounting the unresponsive DVD drive
The newbie states that when he pushes the Eject button on the DVD drive of a server running a certain Redmond-based operating system, it ejects immediately. He then complains that, on most enterprise Linux servers, if a process is running in that directory, the ejection won't happen. For too long as a Linux administrator, I would reboot the machine to get my disc back if I couldn't figure out what was running and why it wouldn't release the DVD drive. But this is inefficient.
Here’s how you find the process that holds your DVD drive and eject it to your heart’s content: First, simulate it. Stick a disk in your DVD drive, open up a terminal, and mount the DVD drive:
# mount /media/cdrom
# cd /media/cdrom
# while [ 1 ]; do echo "All your drives are belong to us!"; sleep 30; done
Now open up a second terminal and try to eject the DVD drive:
# eject
You’ll get a message like:
umount: /media/cdrom: device is busy
Before you free it, let’s find out who is using it.
# fuser /media/cdrom
fuser prints the PID of the process holding the mount; in this case it's our own while loop, so indeed it is our fault that we cannot eject the disc.
Now, if you are root, you can exercise your godlike powers and kill processes:
# fuser -k /media/cdrom
Boom! Just like that, freedom. Now solemnly unmount the drive:
# eject
fuser is good.
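One refinement, if you'd rather see the process name and owner before reaching for -k: the -v flag (or lsof, if it's installed) identifies the culprit first:
# fuser -v /media/cdrom
# lsof /media/cdrom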
Trick 2: Getting your screen back when it’s hosed
Try this:
# cat /bin/cat
Behold! Your terminal looks like garbage. Everything you type looks like you’re looking into the Matrix. What do you do?
You type reset. But wait, you say, typing reset is dangerously close to typing reboot or shutdown. Your palms start to sweat, especially if you are doing this on a production machine.
Rest assured: You can do it with the confidence that no machine will be rebooted. Go ahead, do it:
# reset
Now your screen is back to normal. This is much better than closing the window and then logging in again, especially if you just went through five machines to SSH to this machine.
Trick 3: Collaboration with screen
David, the high-maintenance user from product engineering, calls: “I need you to help me understand why I can’t compile supercode.c on these new machines you deployed.”
“Fine,” you say. “What machine are you on?”
David responds: "Posh." (Yes, this fictional company has named its five production servers in honor of the Spice Girls.) OK, you say. You exercise your godlike root powers and on another machine become David:
# su – david
Then you go over to posh:
# ssh posh
Once you are there, you run:
# screen -S foo
Then you holler at David:
“Hey David, run the following command on your terminal: # screen -x foo.”
This will cause your and David’s sessions to be joined together in the holy Linux shell. You can type or he can type, but you’ll both see what the other is doing. This saves you from walking to the other floor and lets you both have equal control. The benefit is that David can watch your troubleshooting skills and see exactly how you solve problems.
At last you both see what the problem is: David’s compile script hard-coded an old directory that does not exist on this new server. You mount it, recompile, solve the problem, and David goes back to work. You then go back to whatever lazy activity you were doing before.
The one caveat to this trick is that you both need to be logged in as the same user. Other cool things you can do with the screen command include having multiple windows and split screens. Read the man pages for more on that.
But I'll give you one last tip while you're in your screen session. To detach from it and leave it open, type Ctrl-A D. (That is, hold down the Ctrl key and strike the A key, then press the D key.)
You can then reattach by running the screen -x foo command again.
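And if you forget which sessions are open, or what you named them, screen will list them along with their attached or detached state:
# screen -ls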
Trick 4: Getting back the root password
You forgot your root password. Nice work. Now you'll just have to reinstall the entire machine. Sadly enough, I've seen more than a few people do this. But it's surprisingly easy to get on the machine and change the password. This doesn't work in all cases (like if you made a GRUB password and forgot that too), but here's how you do it in a normal case with a CentOS Linux example.
First reboot the system. When it reboots you’ll come to the GRUB screen as shown in Figure 1. Move the arrow key so that you stay on this screen instead of proceeding all the way to a normal boot.
Figure 1. GRUB screen after reboot
Next, select the kernel that will boot with the arrow keys, and type e to edit the entry. You'll then see something like Figure 2:
Figure 2. Ready to edit the kernel line
Use the arrow keys again to highlight the line that begins with kernel, and press e to edit the kernel parameters. When you get to the screen shown in Figure 3, simply append the number 1 to the arguments:
Figure 3. Append the argument with the number 1
Then press Enter, then b, and the kernel will boot into single-user mode. Once here, you can run the passwd command to change the password for user root:
sh-3.00# passwd
New UNIX password:
Retype new UNIX password:
passwd: all authentication tokens updated successfully
Now you can reboot, and the machine will boot up with your new password.
Trick 5: SSH back door
Many times I'll be at a site where I need remote support from someone who is blocked on the outside by a company firewall. Few people realize that if you can get out to the world through a firewall, then it is relatively easy to open a hole so that the world can come in to you.
In its crudest form, this is called “poking a hole in the firewall.” I’ll call it an SSH back door. To use it, you’ll need a machine on the Internet that you can use as an intermediary.
In our example, we’ll call our machine blackbox.example.com. The machine behind the company firewall is called ginger. Finally, the machine that technical support is on will be called tech. Figure 4 explains how this is set up.
Figure 4. Poking a hole in the firewall
Here’s how to proceed:
Check that what you’re doing is allowed, but make sure you ask the right people. Most people will cringe that you’re opening the firewall, but what they don’t understand is that it is completely encrypted. Furthermore, someone would need to hack your outside machine before getting into your company. Instead, you may belong to the school of “ask-for-forgiveness-instead-of-permission.” Either way, use your judgment and don’t blame me if this doesn’t go your way.
SSH from ginger to blackbox.example.com with the -R flag. I'll assume that you're the root user on ginger and that tech will need the root user ID to help you with the system. With the -R flag, you'll forward connections to port 2222 on blackbox to port 22 on ginger. This is how you set up an SSH tunnel. Note that only SSH traffic can come into ginger: you're not putting ginger out on the Internet naked.
You can do this with the following syntax:
~# ssh -R 2222:localhost:22 thedude@blackbox.example.com
Once you are into blackbox, you just need to stay logged in. I usually enter a command like:
thedude@blackbox:~$ while [ 1 ]; do date; sleep 300; done
to keep the machine busy. And minimize the window.
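As an aside: if the autossh package happens to be available on ginger, it can babysit that reverse tunnel for you and restart it when it drops. A rough equivalent of the command above (backgrounded, with no remote command) would be:
# autossh -M 0 -f -N -R 2222:localhost:22 thedude@blackbox.example.com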
Now instruct your friends at tech to SSH as thedude into blackbox without using any special SSH flags. You’ll have to give them your password:
root@tech:~# ssh thedude@blackbox.example.com
Once tech is on the blackbox, they can SSH to ginger using the following command:
thedude@blackbox:~$ ssh -p 2222 root@localhost
Tech will then be prompted for a password. They should enter the root password of ginger.
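Before telling tech to connect, you can confirm the tunnel is actually up by checking that blackbox is listening on port 2222:
thedude@blackbox:~$ netstat -tln | grep 2222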
Now you and support from tech can work together and solve the problem. You may even want to use screen together! (See Trick 3.)
Trick 6: Remote VNC session through an SSH tunnel
VNC, or Virtual Network Computing, has been around a long time. I typically find myself needing it when a remote server has some type of graphical program that is only available on that server.
For example, suppose in Trick 5, ginger is a storage server. Many storage devices come with a GUI program to manage the storage controllers. Often these GUI management tools need a direct connection to the storage through a network that is at times kept in a private subnet. Therefore, the only way to access this GUI is to do it from ginger.
You can try SSH’ing to ginger with the -X option and launch it that way, but many times the bandwidth required is too much and you’ll get frustrated waiting. VNC is a much more network-friendly tool and is readily available for nearly all operating systems.
Let’s assume that the setup is the same as in Trick 5, but you want tech to be able to get VNC access instead of SSH. In this case, you’ll do something similar but forward VNC ports instead. Here’s what you do:
Start a VNC server session on ginger. This is done by running something like:
root@ginger:~# vncserver -geometry 1024x768 -depth 24 :99
The options tell the VNC server to start up with a resolution of 1024x768 and a pixel depth of 24 bits per pixel. If you are on a really slow connection, a depth of 8 may be a better option. Using :99 specifies the display number the VNC server runs on. The VNC protocol starts at port 5900, so display :99 means the server listens on port 5999.
When you start the session, you’ll be asked to specify a password. The user ID will be the same user that you launched the VNC server from. (In our case, this is root.)
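You can double-check the display-to-port arithmetic (5900 plus the display number) by looking at what the VNC server is listening on:
root@ginger:~# netstat -tln | grep 5999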
SSH from ginger to blackbox.example.com forwarding the port 5999 on blackbox to ginger. This is done from ginger by running the command:
root@ginger:~# ssh -R 5999:localhost:5999 thedude@blackbox.example.com
Once you run this command, you’ll need to keep this SSH session open in order to keep the port forwarded to ginger. At this point if you were on blackbox, you could now access the VNC session on ginger by just running:
thedude@blackbox:~$ vncviewer localhost:99
That would forward the port through SSH to ginger. But we’re interested in letting tech get VNC access to ginger. To accomplish this, you’ll need another tunnel.
From tech, you open a tunnel via SSH to forward your port 5999 to port 5999 on blackbox. This would be done by running:
root@tech:~# ssh -L 5999:localhost:5999 thedude@blackbox.example.com
This time the SSH flag we used was -L: instead of pushing port 5999 to blackbox, it pulls the port from blackbox down to tech. Once you are in on blackbox, you'll need to leave this session open. Now you're ready to VNC from tech!
From tech, VNC to ginger by running the command:
root@tech:~# vncviewer localhost:99
Tech will now have a VNC session directly to ginger.
While the effort might seem like a bit much to set up, it beats flying across the country to fix the storage arrays. Also, if you practice this a few times, it becomes quite easy.
Let me add a trick to this trick: if tech were running the Windows® operating system and didn't have a command-line SSH client, tech could run PuTTY. PuTTY can be set to forward SSH ports through the options in its sidebar. If the port were 5902 instead of our example 5999, you would enter something like what's shown in Figure 5.
Figure 5. Putty can forward SSH ports for tunneling
If this were set up, then tech could VNC to localhost:2 just as if tech were running the Linux operating system.
Trick 7: Checking your bandwidth
Imagine this: Company A has a storage server named ginger and it is being NFS-mounted by a client node named beckham. Company A has decided they really want to get more bandwidth out of ginger because they have lots of nodes they want to have NFS mount ginger’s shared filesystem.
The most common and cheapest way to do this is to bond two Gigabit ethernet NICs together. This is cheapest because usually you have an extra on-board NIC and an extra port on your switch somewhere.
So they do this. But now the question is: How much bandwidth do they really have?
Gigabit Ethernet has a theoretical limit of 128MBps. Where does that number come from? Well,
1Gb = 1024Mb; 1024Mb/8 = 128MB ("b" means bits and "B" means bytes)
But what is it that we actually see, and what is a good way to measure it? One tool I suggest is iperf. Download the iperf source tarball onto a shared filesystem that both ginger and beckham can see, or compile and install it on both nodes. I'll compile it in the home directory of the bob user, which is visible on both nodes:
tar zxvf iperf*gz
cd iperf-2.0.2
./configure --prefix=/home/bob/perf
make
make install
On ginger, run:
# /home/bob/perf/bin/iperf -s -f M
This machine will act as the server and print out performance speeds in MBps.
On the beckham node, run:
# /home/bob/perf/bin/iperf -c ginger -P 4 -f M -w 256k -t 60
You’ll see output in both screens telling you what the speed is. On a normal server with a Gigabit Ethernet adapter, you will probably see about 112MBps. This is normal as bandwidth is lost in the TCP stack and physical cables. By connecting two servers back-to-back, each with two bonded Ethernet cards, I got about 220MBps.
In reality, what you see with NFS on bonded networks is around 150-160MBps. Still, this gives you a good indication that your bandwidth is going to be about what you’d expect. If you see something much less, then you should check for a problem.
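If you want a crude sanity check of the NFS number itself, timing a large sequential write with dd works in a pinch. This assumes beckham has ginger's share mounted at /mnt/ginger; adjust the path for your setup. GNU dd prints the throughput when it finishes:
# dd if=/dev/zero of=/mnt/ginger/testfile bs=1M count=1024
# rm /mnt/ginger/testfile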
I recently ran into a case in which the bonding driver was used to bond two NICs that used different drivers. The performance was extremely poor, leading to about 20MBps in bandwidth, less than they would have gotten had they not bonded the Ethernet cards together!
Trick 8: Command-line scripting and utilities
A Linux systems administrator becomes more efficient by using command-line scripting with authority. This includes crafting loops and knowing how to parse data using utilities like awk, grep, and sed. There are many cases where doing so takes fewer keystrokes and lessens the likelihood of user errors.
For example, suppose you need to generate a new /etc/hosts file for a Linux cluster that you are about to install. The long way would be to add IP addresses in vi or your favorite text editor. However, it can be done by taking the already existing /etc/hosts file and appending the following to it by running this on the command line:
# P=1; for i in $(seq -w 200); do echo "192.168.99.$P n$i"; P=$(expr $P + 1); done >> /etc/hosts
Two hundred host names, n001 through n200, will then be created with IP addresses 192.168.99.1 through 192.168.99.200. Populating a file like this by hand runs the risk of inadvertently creating duplicate IP addresses or host names, so this is a good example of using the built-in command line to eliminate user errors. Please note that this is done in the bash shell, the default in most Linux distributions.
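The last few appended lines would look like this:
# tail -3 /etc/hosts
192.168.99.198 n198
192.168.99.199 n199
192.168.99.200 n200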
As another example, let’s suppose you want to check that the memory size is the same in each of the compute nodes in the Linux cluster. In most cases of this sort, having a distributed or parallel shell would be the best practice, but for the sake of illustration, here’s a way to do this using SSH.
Assume the SSH is set up to authenticate without a password. Then run:
# for num in $(seq -w 200); do ssh n$num free -tm | grep Mem | awk '{print $2}'; done | sort | uniq
A command line like this looks pretty terse. (It can be worse if you put regular expressions in it.) Let’s pick it apart and uncover the mystery.
First you’re doing a loop through 001-200. This padding with 0s in the front is done with the -w option to the seq command. Then you substitute the num variable to create the host you’re going to SSH to. Once you have the target host, give the command to it. In this case, it’s:
free -tm | grep Mem | awk '{print $2}'
That command says to:
Use the free command to get the memory size in megabytes.
Take the output of that command and use grep to get the line that has the string Mem in it.
Take that line and use awk to print the second field, which is the total memory in the node.
This operation is performed on every node.
Once you have run the command on every node, the entire output of all 200 nodes is piped to the sort command so that all the memory values are sorted.
Finally, you eliminate duplicates with the uniq command. This command will result in one of the following cases:
If all the nodes, n001-n200, have the same memory size, then only one number will be displayed. This is the size of memory as seen by each operating system.
If node memory size is different, you will see several memory size values.
Finally, if the SSH failed on a certain node, then you may see some error messages.
This command isn’t perfect. If you find that a value of memory is different than what you expect, you won’t know on which node it was or how many nodes there were. Another command may need to be issued for that.
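A small variation (same passwordless-SSH assumption) prints each node name next to its value, making the odd one out easy to spot, at the cost of 200 lines of output:
# for num in $(seq -w 200); do echo -n "n$num "; ssh n$num free -tm | grep Mem | awk '{print $2}'; done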
What this trick does give you, though, is a fast way to check for something and quickly learn if something is wrong. That's its real value: speed for a quick-and-dirty check.
Trick 9: Spying on the console
Some software prints error messages to the console that may not necessarily show up in your SSH session. The vcs devices let you examine these. From within an SSH session, run the following command on the remote server: # cat /dev/vcs1. This shows you what is on the first console. You can also look at the other virtual terminals using /dev/vcs2, /dev/vcs3, and so on. If a user is typing on the remote system, you'll be able to see what he typed.
In most data farms, using a remote terminal server, KVM, or even Serial Over LAN is the best way to view this information; it also provides the additional benefit of out-of-band viewing capabilities. Using the vcs device provides a fast in-band method that may be able to save you some time from going to the machine room and looking at the console.
Trick 10: Random system information collection
In Trick 8, you saw an example of using the command line to get information about the total memory in the system. In this trick, I’ll offer up a few other methods to collect important information from the system you may need to verify, troubleshoot, or give to remote support.
First, let’s gather information about the processor. This is easily done as follows:
# cat /proc/cpuinfo
This command gives you information on the processor speed, quantity, and model. Using grep in many cases can give you the desired value.
A check that I do quite often is to ascertain the quantity of processors on the system. So, if I have purchased a dual processor quad-core server, I can run:
# cat /proc/cpuinfo | grep processor | wc -l
I would then expect to see 8 as the value. If I don’t, I call up the vendor and tell them to send me another processor.
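In the same spirit, grep can pull out the processor model and speed, and uniq -c collapses the duplicate lines across cores:
# grep "model name" /proc/cpuinfo | sort | uniq -c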
Another piece of information I may need is disk information, which the df command provides. I usually add the -h flag so that I can see the output in gigabytes or megabytes. # df -h also shows how the disk is partitioned.
And to end the list, here’s a way to look at the firmware of your system—a method to get the BIOS level and the firmware on the NIC.
To check the BIOS version, you can run the dmidecode command. Unfortunately, you can't easily grep for the information, so piping it through less is the easiest way to browse it. On my Lenovo T61 laptop, the output looks like this:
# dmidecode | less

BIOS Information
Vendor: LENOVO
Version: 7LET52WW (1.22 )
Release Date: 08/27/2007
This is much more efficient than rebooting your machine and looking at the POST output.
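That said, reasonably recent versions of dmidecode support a -s flag that pulls a single string directly; check your man page for the supported keywords. If yours has it, these save the trip through less:
# dmidecode -s bios-version
# dmidecode -s bios-release-date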
To examine the driver and firmware versions of your Ethernet adapter, run ethtool:
# ethtool -i eth0
driver: e1000
version: 7.3.20-k2-NAPI
firmware-version: 0.3-0
