Main Entry: trou-ble-shoot-er  Pronunciation: -ˌshü-tər  Function: noun  Date: 1905
From the Merriam-Webster online dictionary
Troubleshooting, problem analysis, and root cause determination require patience, determination, and experience. It is important to fully investigate the problem and collect relevant data in order to get onto the correct path. You might start on one path and end up on another. In any case, the most important skill is patience. Keep a log while solving a complex problem, as you can become distracted and forget important facts or findings.
|
The Merriam-Webster online dictionary provides a fitting definition of troubleshooting, and by extension a definition of a Unix system administrator in general: making repairs, dealing with people, and trying to anticipate as well as prevent problems. Like program debugging, Linux troubleshooting is very similar to the investigation of a crime scene. Some "obvious" leads can be, and often are, false. Finding relevant information is not easy; it can take a tremendous amount of time and requires maintaining well-organized documentation about the problem. Some highly suspicious suspects without an alibi are actually innocent. You need a plan, and the ability to see the big picture, so as not to be led off track. You need clear analytical thinking and experience to get to the root cause.
Linux is a very complex system, and for many subsystems the administrator has little or no understanding of the internals. Several lines of fault analysis can help in this situation. The kernel ring buffer, viewable with /bin/dmesg, and the system log files serve similar purposes: /bin/dmesg provides more detailed information in near real time, while the log files keep less information but preserve it for historical purposes. On systems with SELinux enabled, sestatus shows the current enforcement status.
The most general strategy is to compare the problematic system to a working system, or to a backup of the current system taken before the problem appeared. Often the problem is the result of miscommunication when several sysadmins implement changes on the same system, or when another sysadmin implements changes in the absence of the primary sysadmin. Here configuration management is critical.
As with any complex system, creating a baseline for a Linux system is critically important. On a modern system this is not a time-consuming operation and can be done when you log in to the system (run a script that creates the baseline from your .profile). The most primitive way is to tar the /etc and /root directories, as well as a couple of others that you know contain important configuration files.
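A minimal baseline script along these lines might look as follows; the destination directory and the list of directories to archive are illustrative, not prescriptive:

```shell
#!/bin/sh
# Create a date-stamped tarball of important configuration directories so a
# misbehaving system can later be compared against a known-good state.
# BASEDIR and CONF_DIRS are examples; on a real system CONF_DIRS would
# usually include /etc and /root.
BASEDIR=${BASEDIR:-$HOME/baselines}
CONF_DIRS=${CONF_DIRS:-/etc}
STAMP=$(date +%Y%m%d-%H%M%S)
mkdir -p "$BASEDIR"
# Some files may be unreadable to a non-root user; tar archives what it can.
tar czf "$BASEDIR/baseline-$STAMP.tar.gz" $CONF_DIRS 2>/dev/null || true
echo "baseline written to $BASEDIR/baseline-$STAMP.tar.gz"
```

Calling such a script from .profile, as suggested above, makes baseline creation automatic on every login; old tarballs can then be extracted and compared against the live system with diff -r.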
Supportconfig is the standard tool for collecting all information relevant to troubleshooting on SUSE. A companion tool, the Supportconfig Health Check Report Tool (schealth), helps analyze the collected information. Several similar tools are available for Red Hat. See Baseliners
When an event occurs that will cause system or application downtime, the number one priority is to get it working again as quickly as possible. But excessive zeal here is counterproductive. Much depends on the context:
Confidence comes with experience; the techniques used are generally a combination of proven solutions. If the solution is documented by the application or system vendor, it has been tested and will most likely bring the system or application online with the least downtime.
Collaboration with colleagues is important when the issue falls outside the realm of familiar territory. It never hurts to ask a question or two of those who might have seen a similar situation and have more current experience. Finally, stubborn, bulldog-like determination is the essential quality for anyone facing a troubleshooting task, even when the situation is unfamiliar.
Additional useful information can be found in Linux Troubleshooting Wiki
ltrace
ltrace is very similar to strace, except ltrace focuses on tracing library calls.
For apps that use a lot of libs, this can be a very powerful debugging tool. However, because most modern apps use libraries very heavily, the output from ltrace can sometimes be painfully verbose.
There is a distinction between what makes a system call and a call to a library function. Sometimes the line between the two is blurry, but the basic difference is that system calls are communicating to the kernel, and library calls are just running more userland code. System calls are usually required for things like I/O, process control, memory management issues, and other kernel things.
Library calls are, in bulk, generally calls to the standard C library (glibc), but can of course be calls to any library, for example Gtk, libjpeg, libnss, etc. Luckily most glibc functions are well documented and have either man or info pages. Documentation for other libraries varies greatly.
ltrace supports the -r, -tt, -p, and -c options the same as strace. In addition it supports the -S option which tells it to print out system calls as well as library calls.
One of the more useful options is "-n 2" which will indent 2 spaces for each nested call. This can make it much easier to read.
Another useful option is the "-l" option, which allows you to specify a specific library to trace, potentially cutting down on the rather verbose output.
gdb
`gdb` is the GNU debugger. A debugger is typically used by developers to debug applications in development. It allows for a very detailed examination of exactly what a program is doing.
That said, gdb isn't as useful as strace/ltrace for troubleshooting/sysadmin types of issues, but occasionally it comes in handy.
For troubleshooting, it's useful for determining which application created a core file (`file core` will typically show this information too). But gdb can also show you "where" the program crashed. Once you determine the name of the app that caused the failure, you can start gdb with:

gdb filename corefile

The unfortunate thing is that binaries are typically stripped of debugging symbols to make them smaller, so this often returns less than useful information. However, starting with Red Hat Enterprise Linux 3, and also in Fedora, there are "debuginfo" packages. These packages include all the debugging symbols. You can install them the same as any other rpm, so `rpm`, `up2date`, and `yum` all work.
The only difficult part about debuginfo rpms is figuring out which ones you need. Generally, you want the debuginfo package for the src rpm of the package that's crashing.
rpm -qif /path/to/app

will tell you the info for the binary package the app is part of. Part of that info includes the src.rpm. Just use the package name of the src rpm plus "-debuginfo".
FIXME: insert info about debug packages for other systems

top
`top` is a simple text-based system monitoring tool. It packs a lot of information onto the screen, which can be helpful when troubleshooting problems, particularly performance-related ones.
The top of the "top" output includes a basic summary of the system. The top line is the current time, uptime since the last reboot, users logged in, and the load average. The load average values here are the load for the last 1, 5, and 15 minutes. A load of 1.0 is considered 100% utilization, so loads over 1 typically mean processes are having to wait. There is a lot of leeway and approximation in these load values, however.
The memory line shows the total physical ram available on the system, how much of it is used, how much is free, and how much is shared, along with the amount of ram in buffers. These buffers are typically file system caching, but can be other things. On a system with a significant uptime, expect the buffer value to take up all free physical ram not in use by a process. The swap line is similar.
Each of the entries viewable in the system contain several fields by default. The most interesting are RES, %CPU, and time. RES shows the amount of physical ram the process is consuming. %CPU shows the percentage of the available processor time a process is taking, and time shows the total amount of processor time the process has had. A processor intensive program can easily have more "time" in just a few seconds than a long running low cpu process.
Sorting the output
- M : sorts the output by memory usage. Pretty handy for figuring out which version of openoffice.org to kill.
- P : sorts the process by the percentage of cpu time they are using.
- T : sorts by cumulative cpu time used
- A : sorts by age of the process, newest process first
Command line options
The only really useful command line options are:
- b [batch mode] writes the standard top output to stdout. Useful for a quick "system monitoring hack". For example:

top -b -d 360 >> foo.output

will append a snapshot of the system to foo.output every six minutes (the -d option sets the delay between updates, in seconds).
ps
`ps` can be thought of as a one shot `top`. But it's a bit more flexible in its output than top.
As far as `ps` commandline options go, it can get pretty hairy. The Linux version of `ps` inherits ideas from both the BSD version, and the SYSV version. So be warned.
The `ps` man page does a pretty good job of explaining this, so look there for more examples.
One thing to be aware of is that ps behaves differently depending on if a - is prepended to the options:
ps ef

and

ps -ef

are two very different things (BSD versus System V option formatting).
examples
ps aux

shows all the processes on the system in a "user"-oriented format, meaning the username of the owner of each process is shown in the first column.
ps auxww

The "w" option, when used twice, allows the output to be of unlimited width. For apps started with lots of command-line options, this will allow you to see all the options.
ps auxf

The "f" option, for "forest", tries to present the list of processes in a tree format. This is a quick and easy way to see which processes are child processes of which.
ps -eo pid,%cpu,vsz,args,wchan

This is an interesting example of the -eo option, which allows you to customize the output of `ps`. In this case, the interesting bit is the "wchan" field, which attempts to show the kernel function (wait channel) the process is currently sleeping in.
For things like apache httpds, this can be useful to get an idea of what all the processes are doing at one time. See the info in the strace section on understanding system call info for more info.
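As a further sketch of customized output (the format keywords below are standard `ps` keywords; --sort is a GNU/procps option):

```shell
# List every process's pid, parent pid, resident memory (KB), CPU share and
# command name, with the heaviest CPU consumers first.
ps -eo pid,ppid,rss,%cpu,comm --sort=-%cpu | head -10
```

This is often a faster way to answer "what is eating the CPU right now?" than watching `top` scroll.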
sysstat/sar
Sysstat works with two steps, a daemon process that collects information, and a "monitoring" tool.
The start script is typically called "sysstat", and the monitoring tool is called `sar`, which will normally perform its monitoring via the `sadc` command.
To start it, start the sysstat daemon:

sysstat start

To see a list of `sar` options, just try `sar --help`
examples
Things to note: there are lots of command-line options. In the examples below, the trailing number is the interval, i.e. the time in seconds between updates.
sar 3

will run the default sar report every three seconds.
For a complete summary, try:
sar -A

This generates a very large pile of info ;->
To get a good idea of disk i/o activity:
sar -b 3

For something like a heavily used web server, you may want to get a good idea how many processes are being created per second:

sar -c 2

It is kind of surprising to see how many processes can be created.
There's also some degree of hardware monitoring built in. Monitoring how many times an IRQ is triggered can also provide good hints at what's causing system performance problems.
Show the total number of system interrupts
sar -I SUM 3

To watch the standard IDE controller IRQ every two seconds:

sar -I 14 2

Network monitoring is in here too. Show the number of packets sent/received, number of bytes transferred, etc.:

sar -n DEV 2

Show stats on network errors:

sar -n EDEV 2

Memory usage can be monitored with something like:
sar -r 2

This is similar to the output from `free`, except more easily parsed.
For SMP machines, you can monitor per CPU stats with:
sar -U 0

(If your version of sar doesn't support the -U flag, try -P or -u.) Here 0 is the first processor; the keyword ALL will show all of them.
A really useful one on web servers and other configurations that use lots and lots of open files is:
sar -v

This will show the number of used file handles, the percentage of available file handles in use, and the same for inodes.
To show the number of context switches (a good indication of how much time a process is wasting):

sar -w 2

vmstat
This util is part of the procps package and can provide lots of useful information when diagnosing performance problems.
Here's a sample vmstat output on a lightly used desktop:
procs                      memory      swap          io     system      cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs us sy id
 1  0  0   5416   2200   1856  34612   0   1     2     1  140   194  2  1 97

And here's some sample output on a heavily used server:
procs                      memory      swap          io     system      cpu
 r  b  w   swpd    free    buff  cache  si  so    bi    bo    in     cs us sy id
16  0  0   2360  264400   96672   9400   0   0     0     1    53     24  3  1 96
24  0  0   2360  257284   96672   9400   0   0     0     6  3063  17713 64 36  0
15  0  0   2360  250024   96672   9400   0   0     0     3  3039  16811 66 34  0

The interesting numbers here are the first ones, in the "r" column: the number of processes in the run queue. This value shows how many processes are ready to be executed but cannot run at the moment because other processes need to finish. For lightly loaded systems, this is almost never above 1-3; numbers consistently higher than 10 indicate the machine is getting pounded.
Other interesting values include the "system" numbers for in and cs. The in value is the number of interrupts per second the system is receiving. A system doing a lot of network or disk I/O will have high values here, as interrupts are generated every time something is read from or written to the disk or network.
The cs value is the number of context switches per second. A context switch is when the kernel suspends the currently running process and switches the CPU to another one. It's actually _way_ more complicated than that, but that's the basic idea. Lots of context switches are bad, since each one costs a fairly large number of cycles; if you are doing lots of them, you are spending all your time changing jobs and not actually doing any work. I think we can all understand that concept.
A note on Linux memory management
This is one area where the saying "Linux is not Unix" is accurate.
Linux does not manage memory like a traditional Unix (such as HP-UX). Linux frees memory on demand: pages are not actually freed unless there is demand for them. The kernel will use all otherwise-free memory for buffer cache until there is memory pressure.
So, if you are coming from a proprietary Unix to Linux, don't freak out when you see only 10 MB of that 4 GB free: the rest is being used for file system buffer cache.
tcpdump/ethereal
Ethereal will display all the connections it traced during the capture. There are a couple ways to look for bandwidth hogs.
The "Statistics" menu has a couple of useful options. The "Protocol Hierarchy" screen shows what percentage of packets in the trace belongs to each protocol. In the case of a bandwidth hog, at least the protocol of the culprit should be easy to spot here.
The "Conversations" screen is also helpful for looking for bandwidth hogs. Since you can sort the "conversations" by number of packets, the culprit is likely to hop to the top. This isn't always the case, as it could easily be many small connections killing the bandwidth, not one big heavy connection.
As far as tcpdump goes, the best way to spot bandwidth hogs is just to start it up. Since it pretty much dumps all traffic to the screen in a text format, just keep your eyes peeled for whatever seems to come up a lot.
tcpdump can also be used to see if a given service may be unresponsive because your packets are simply not reaching the remote machine. Since tcpdump is a commandline tool, you'll very probably need to add filters - especially when you're firing tcpdump up on a remote machine, where you're logged in via SSH. Otherwise you'll get lots of packet dumps of SSH packets that are telling you of packets dumped that belong to ssh telling you of packets dumped...
tcpdump -l -i eth0 port 25

This will dump all packets aimed at, or originating from, TCP or UDP port 25. The '-l' enables line buffering, so we'll actually see each packet as it crosses the wire.
If you're debugging network connections over an SSH connection, the following will probably be the most frequent way that you'll invoke tcpdump:
tcpdump -l not port 22

And to monitor the communication between server A.local.net (running tcpdump) and the remote server B.remote.net:

tcpdump -l src or dst B.remote.net

The tcpdump filter syntax is actually surprisingly powerful; take 5 minutes and grab your nearest manpage on tcpdump if you need a better filter.
netstat
Netstat is an app for getting general information about the status of network connections to the machine.
netstat

will just show all the currently open sockets on the machine. This includes UNIX domain sockets, TCP sockets, UDP sockets, etc.
One of the more useful options is:
netstat -pa

The `-p` option tells it to try to determine which program has each socket open, which is often very useful info. For example, someone nmaps their system and wants to know what is using port 666. Running netstat -pa will show you which daemon is listening on that TCP port.
One of the most twisted, but useful invocations is:
netstat -t -n | cut -c 68- | sort | uniq -c | sort -n

This will show you a sorted list of how many sockets are in each connection state. For example:

      9 LISTEN
     21 ESTABLISHED
In short, netstat can show:
- what process is doing what and to whom over the network
- the number of sockets open
- socket status
A quick and dirty way to see what daemons are running and accepting connections on your machine is
netstat -tlpn

for TCP services and

netstat -ulpn

for UDP services. Unix domain sockets are usually more abundant than either of these two, and a lot less interesting.
If you're having trouble with network throughput for some reason, try
netstat -s

This will print a summary of the network stack's state counters, going into far more detail than the RX/TX frames-dropped counters of ifconfig. By looking at which counters are rapidly increasing, you may be able to find out why your network throughput is misbehaving.
lsof
/usr/sbin/lsof is a utility that checks to see what all open files are on the system. There's a ton of options, almost none of which you ever need.
This is mostly useful for seeing what processes have what file open. Useful in cases where you need to unmount a partition or perhaps you have deleted some file, but its space wasn't reclaimed and you want to know why.
The EXAMPLES section of the lsof man page includes many useful examples. One of the more common usages is to see which services are accepting network connections over TCP:
lsof -i tcp

fuser
Displays PIDs of processes that are using some filesystem object. Kind of like the small brother of lsof.
The most frequent use is the '-m' option, when you're trying to unmount a filesystem and get an error message telling you that the specified device is busy:

turing:/home/sr# umount /usr
umount: /usr: device is busy
turing:/home/sr# fuser -m /usr
/usr:  2522e  2604e  2646e  2652e  2662e  2761e  2764e  2775e  2798e  2804e
       2843e  2846e  2849e  2988m  3018m  3740e  3741e  3759m  3772m  3773e
       3776e  3779e  3782e  3785e  3789e  3791e  3793e  3828e  3832e  3833m
       3869e  3893e  3907e  3908m  3915e  3999e  4124m  4125m  4127m

These are the PIDs of all the processes working within the '/usr' mountpoint and keeping you from unmounting the filesystem. Check who's what with 'ps ax | grep [PID]' and kill them gently.
ldd
ldd prints out shared library dependencies.
For apps that are reporting missing libraries, this is a handy utility. It shows all the libraries a given app or library is linked to.
For most cases, what you will be looking for is missing libs. In the ldd output, they will show something like:
libpng.so.3 => (file not found)

In this case, you need to figure out why libpng.so.3 isn't being found. It might not be in the standard lib paths, or not in a path listed in /etc/ld.so.conf. Or you may need to run `ldconfig` again to update the ld cache.
ldd can also be useful when tracking down cases where an app is finding a library, but it's finding the wrong library. This can happen if there are two libraries with the same name installed on a system in different paths.
Since the `ldd` output includes the full path to the lib, you can see if anything is pointing at a wrong path. One thing to look for when scanning for this, is one lib that's in a different lib path than the rest. If an app uses libs from /usr/lib, except for one from /usr/local/lib, there's a good chance that's your culprit.
If you are missing a library, be sure to edit your ld config file (typically /etc/ld.so.conf) and re-run ldconfig.
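A quick scan for unresolved libraries can be scripted; /bin/ls here is just a stand-in for whatever binary you are chasing:

```shell
# Report any shared libraries that the dynamic linker cannot resolve
# for a given binary.
bin=/bin/ls
if ldd "$bin" 2>/dev/null | grep -q "not found"; then
    echo "missing libraries for $bin:"
    ldd "$bin" | grep "not found"
else
    echo "all libraries resolved for $bin"
fi
```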
nm
`nm` is a utility that shows all the library symbols an application expects to find. It can be used in combination with `ldd` and `ldconfig` to try to track down library linking problems.
A common case would be a binary compiled against a newer version of a library, which expects symbols that the version of the library the app is dynamically linking against does not have.
file
`file` is a simple utility that tries to figure out what kind of file a given file is. It does this by magic(5).
Where this sometimes comes in handy for troubleshooting is looking for rogue files. A .jpg file that is actually a .html file. A tar.gz that's not actually compressed. Cases like those can sometimes cause apps to behave very strangely.
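A throwaway demonstration of catching such a rogue file (the filename is invented):

```shell
# Create a "JPEG" that is really HTML, then let file(1) expose the mismatch.
printf '<html><body>not an image</body></html>\n' > /tmp/fake.jpg
file /tmp/fake.jpg    # reports HTML or ASCII text, not JPEG image data
```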
netcat / nc
Ah, netcat. That wonderful utility which functions much like the normal cat command, but uses a given host:port for stdin or stdout.
One common usage is to clone a system over a network. Using only a set of commands similar to "dd | netcat", you can clone a system disk at the bit level. Here's what you do (actual commands to follow, but for now...)
On the slave system, boot from a CD (like Knoppix) and issue a command such as
netcat -l -p 5678 | dd of=/dev/sda

Then, on the master, start sending a bit image over the network:

dd if=/dev/sda | netcat <slave_IP> 5678

CHECK THESE COMMANDS - MAY NOT BE 100% ACCURATE
md5sum
`md5sum` is a utility that calculates a checksum of a file. For troubleshooting purposes, you can assume every unique file will have a unique checksum. md5sum is not 100% secure, as it is subject to hash collisions, so for added security use `sha1sum` in addition to md5sum; a simultaneous collision in both is currently considered practically impossible. The `sha1sum` command functions exactly like `md5sum` in these examples.
verifying files
Since an MD5 sum will change if any part of a file changes, it can also be used to verify that a file has not changed. Systems like `tripwire` use this to detect if a file has been compromised in a security breach.
This can be used to see if a file has been modified or corrupted if you know what the MD5 sum is supposed to be.
You can also use it to see if two files are exactly the same or not. A common case is to check to see if a config file has been modified or if it's different from what's in a config management system.
verifying ISOs
Linux distributions are often distributed as CD images or ISOs. An MD5 sum of these images is always provided to verify the integrity of the downloaded ISOs. A few bits missing here and there is enough to make an install a painful experience.
Check the location the ISOs were downloaded to for a text file containing the MD5 sums of the ISOs. It will typically look something like:
2af10158545bc24477381e80412ff209  bar.iso
9761d6ce118a1230bc48b0a59f7b5639  foo.iso

You can run `md5sum` directly on the ISOs:

bash# md5sum bar.iso
2af10158545bc24477381e80412ff209  bar.iso

Or you can often use the md5sums text file as input to `md5sum` to tell it what to check and verify. If the above example was in a file called "iso.md5s":

md5sum -c iso.md5s

That command will check both ISOs, comparing the computed checksum against what the file lists as correct.
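The whole cycle can be sketched with throwaway files in place of real ISOs:

```shell
# Generate checksums for two files, then verify them in one pass.
cd /tmp
echo "alpha" > foo.txt
echo "beta"  > bar.txt
md5sum foo.txt bar.txt > iso.md5s
md5sum -c iso.md5s    # prints "foo.txt: OK" and "bar.txt: OK"
```

If a file is later modified or corrupted, the corresponding line switches to FAILED and md5sum -c exits non-zero.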
md5sum is also a good way to verify a burned CD. Something like:

find /mnt/cdrom -type f -exec md5sum {} \;

will run md5sum on every file on the CD mounted at /mnt/cdrom. Since md5sum checks every bit (literally) of a file, if the CD is bad, there's a good chance this will find it. If the above command causes any errors about the media, chances are the CD is bad. Better to find out now than later.
For recent Red Hat and Fedora based distros, the installer includes an option to perform a mediacheck. This is essentially the same as verifying the ISO MD5 sum by hand; if you have already done that, you can skip the media check.

diff
diff compares two files and shows the difference between them.
For troubleshooting, this is most often used on config files. If one version of a config file works, but another does not, a `diff` of the two files can often be enlightening. Since it can be very easy to miss a small difference in a file, being able to see just the differences is useful.
For debugging during development, diff (especially the versions built into revision control systems like cvs) is invaluable. Seeing exactly what changed between two versions is a great help.
For example, if foo-2.2 is acting weird, where foo-2.1 worked fine, it's not uncommon to `diff` the source code between the two versions to see if anything related to your problem changed.
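For instance, a one-line config change between a working and a broken copy shows up immediately (the filenames and directive here are invented for illustration):

```shell
# Two versions of a config file that differ by a single directive.
printf 'Port 22\nUseDNS no\n'   > /tmp/sshd_config.works
printf 'Port 2222\nUseDNS no\n' > /tmp/sshd_config.broken
diff /tmp/sshd_config.works /tmp/sshd_config.broken
# Output pinpoints just the changed line:
# 1c1
# < Port 22
# ---
# > Port 2222
```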
find
For troubleshooting a system that seems to have suddenly stopped working, find has a few tricks up its sleeve.
When a system stops working suddenly, the first question to ask is "what changed?".
find / -mtime -1

That command will recursively list all files under / that have changed in the last day.
To list all the files in /usr/lib that changed in the last 30 minutes:

find /usr/lib -mmin -30
Similar options exist for ctime and atime. To show all the files in /tmp that have been accessed in the last 30 minutes:

find /tmp -amin -30
The -atime/-amin options are useful when trying to determine if an app is actually reading the files it is supposed to. If you run the app, then run that command where the files are, and nothing has been accessed, something is wrong. If no "+" or "-" is given for the time value, find will match only exactly that time. This is handy in several cases: you can determine which files were modified or created at the same time.
A good example of this is cleaning up from a tar package that was unpacked into the wrong directory. Since all the files will have the same access time, you can use find and -exec to delete them all.
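One robust variant of that cleanup, sketched here with stand-in files, is to use a reference-file timestamp rather than matching an exact minute:

```shell
# Mark the time, "unpack" (simulated here by touch), then list everything
# newer than the mark. Once the list looks right, the same find plus
# -exec rm or -delete can remove the strays.
mkdir -p /tmp/unpack-demo && cd /tmp/unpack-demo
touch /tmp/before-unpack
sleep 1
touch stray1.txt stray2.txt     # stand-ins for a mis-unpacked tarball
find . -type f -newer /tmp/before-unpack
```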
`find` can also find files with particular permissions set. To find all world-writable files from / down:

find / -perm -0002

To find all files in /tmp owned by "alikins":

find /tmp -user alikins

Using find in combination with grep to find markers (errors, filenames, etc.)
When troubleshooting, there are plenty of cases where you want to find all instances of a filename, or a hostname, etc.
To recursively grep a large number of files, you can use find and its exec options. This will grep for "foo" on all files down from the current working directory:
find . -exec grep foo {} \;

Note that in many cases, you can also use `grep -r` to do this as well. Another common usage is with xargs:
find / -print | xargs grep "look for this"

ls/stat
While `ls` is one of the first commands Linux users learn, do not overlook its utility in troubleshooting. It's the easiest way to see what's on the file system.
finding sym links and hard links
A simple `ls -al` will show the contents of a directory. But it will also indicate what files are symlinks.
Normally, having a file being a symlink is fine, but some apps, especially security sensitive apps, are picky about what can and can not be a symlink.
The other thing to look for is dangling or broken symlinks. Some apps don't expect to get handed a symlink that doesn't go anywhere.
file system usage
Some simple `ls` invocations useful for troubleshooting.
Show a detailed view of all files, sorted by the last modified time. Quick, easy way to see if an app is modifying files:
`ls -lart`

Show a detailed view of all files in the current directory, sorted by file size. A quick, easy way to see which files are consuming all of your precious disk space:

`ls -larS`

Show some basic info about what type of file each file is. Maybe that directory the app is looking for is a file, or vice versa?
`ls -F`
df
Running out of disk space causes so many apps to fail in weird and bizarre ways. A quick `df -h` is a pretty good troubleshooting starting point.
Using it is easy; look for any volume that is 100% full. Or in the case of apps that might be writing lots of data at once, reasonably close to being filled.
It's pretty common to spend more time than anyone would like to admit debugging a problem, only to suddenly hear someone yell "Damnit! It's out of disk space!".
A quick check avoids that problem.
In addition to running out of space, it's possible to run out of file system inodes. A `df -h` will not show this, but a `df -i` will show the number of inodes available on each filesystem.
Being out of inodes can cause even more obscure failures than being out of space, so something to keep in mind.
watch
`watch` is a command that executes another command, displays its output, then repeats. It is most useful for repeatedly watching some reporting command. There is also a "-d" option that will highlight any output that changes between invocations.
For example, to watch disk space usage:

watch -d df

Another example is to simply watch `ls -al` output, to look for any tmp files that get created:
watch -d "ls -al"
Note that the above example only runs `ls -al` every two seconds (the default interval), so it will not catch all file creations. "watch" is often used in combination with commands like "ls", "df", "netstat", and "ps".
ipcs/ipcrm
- anything that uses shm/ipc
- oracle/apache/etc
A lot of apps make fairly extensive use of SysV shm and IPC (oracle, apache, gimp, etc). Most of the time, on current Linux systems, this works pretty well. But it's occasionally useful to be able to take a look at what shm is being used and how it's being used. `ipcs` is the tool for that.
One common usage is to check for Oracle's usage of "shared memory glue" (typically noticed as shm_glue), which is the method they use for large SGA creation when they cannot obtain a single shm segment large enough for their needs. A good rule of thumb is that if you see Oracle with a large number of maximum sized shared memory segments, then you have a problem and need to tune your shm sizes and restart Oracle. shm_glue is a performance killer.
Typically, you will use ipcs -ma on Linux to see shared memory, semaphores, and message queues. Here's an example from a lightly loaded system.
# ipcs -ma

------ Shared Memory Segments --------
key        shmid     owner  perms  bytes   nattch  status
0x00000000 8093696   root   600    393216  2       dest
0x00000000 8126465   root   600    393216  2       dest
0x00000000 19759106  root   666    262080  1       dest
0x00000000 19529731  root   600    393216  2       dest
0x00000000 19562500  root   600    393216  2       dest

------ Semaphore Arrays --------
key        semid     owner  perms  nsems

------ Message Queues --------
key        msqid     owner  perms  used-bytes  messages

Searching the web for error messages
A pretty common and often very effective approach to tracking down the cause of errors or problems is searching the web. Using search engines like Google or Yahoo can find documentation, FAQ's, web forum posts, mailing list archives, Usenet posts, and other useful resources.
Start by quoting the entire error message exactly and searching for it. Be sure to put the message in quotes. If it's a common problem, there's a good chance you will get some hits. Anything that looks like a FAQ is a good start; mailing list archives can also be a good source. Just be sure to check the archive indexes for other messages in the discussion.
If you are using a commercial distribution, you could also consider looking up their knowledgebase. Both Red Hat and Suse have useful documents for assisting in troubleshooting in their knowledgebase.
source code
For most Linux distros, you have the source code, so it can often be useful to search through the code for error messages, filenames, or other markers related to the problem. In many cases, you don't really need to be able to understand the programming language to get some useful info.
Kernel drivers are a great example for this, since they often include very detailed info about which hardware is supported, what's likely to break, etc.
On RPM based systems, to install the source code, you want to install the source RPM. To see which source RPM corresponds to a given file or utility, use the command:
rpm -qif /path/to/file
There will be a Source RPM field with the name of the source RPM. If you have the source CD, you can install it from there.
Alternatively, you can use up2date or other package tools to fetch the source RPM.
up2date --get-source packagename
will download the source RPM to /var/spool/up2date.
To install a source RPM, just issue the command:
rpm -Uvh /path/to/package.src.rpm
The source will be installed in /usr/src/redhat/SOURCES, with a spec file in /usr/src/redhat/SPECS, on Red Hat Linux systems. Other distros will be similar.
The easiest way to extract the source is:
rpmbuild -bp /usr/src/redhat/SPECS/package.spec
where package.spec is the spec file of the installed source package.
`find` and `grep` are good tools for searching for the markers of interest.
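For instance, a minimal sketch of that kind of search. The error text, file name, and directory below are made up so the commands can run anywhere; on a real Red Hat system you would point SRC at the tree under /usr/src/redhat/BUILD after running rpmbuild -bp:

```shell
# Fabricate a tiny stand-in "source tree"; in practice SRC would be
# the real unpacked source directory.
SRC=$(mktemp -d)
mkdir -p "$SRC/drivers/net"
printf 'printk("eth0: transmit timed out");\n' > "$SRC/drivers/net/demo.c"

# Find the candidate files, then grep them for the error message
MATCHES=$(find "$SRC" -name '*.c' -exec grep -l "transmit timed out" {} +)
echo "$MATCHES"

rm -rf "$SRC"
```

On a real tree you would typically search for the fixed part of the message (skipping the variable parts like device names or numbers).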
strings
`strings` is a utility that will search through a file and try to find text strings. For troubleshooting sometimes it is handy to be able to look for strings in an executable.
For example, you can run `strings` on a binary to see if it has any hard coded paths to helper utilities. If those utils are in the wrong place, that app may fail.
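As an illustration of the idea, here is a tiny stand-in: a file with binary junk around an embedded path, and a tr/grep pipeline that mimics what `strings somebinary | grep '^/'` does on a real executable (the path /usr/libexec/helper is made up):

```shell
# Build a small "binary" containing an embedded path between junk bytes
F=$(mktemp)
printf '\001\002/usr/libexec/helper\000\003\004' > "$F"

# Poor man's strings: turn non-printable bytes into newlines, then keep
# only the lines that look like absolute paths
FOUND=$(tr -c '[:print:]' '\n' < "$F" | grep '^/')
echo "$FOUND"    # /usr/libexec/helper

rm -f "$F"
```

With the real `strings` utility you would simply run `strings /path/to/binary | grep '^/'`.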
Searching for error messages can help as well, especially in cases where you are not sure what binary is reporting an error message.
In some ways, it's a bit like grep'ing through source code for error messages, but a bit easier. Unfortunately, it also provides far less info.
syslog/log levels
Syslog is a daemon that evolved from a sendmail debugging aid into the catch-all logging facility for Unix. A lot of applications send their log output to syslog, but an application has to do so explicitly; syslog only knows about messages that are handed to it. To keep logs apart, facilities (essentially "categories" in syslog-speak) and severities were introduced during syslog's evolution. The actual filtering of what gets written where is defined in syslog's /etc/syslog.conf(5) file.
Getting stuff into Syslog
Syslog generally can receive messages in three ways:
- Through the syslog() function most languages provide (after an appropriate call to openlog())
- Through named sockets such as /dev/log, which is enabled by default on most distributions
- Via UDP on port 514, if syslogd is running with the -r option (this can be a security hole, since the standard syslog protocol implements no authentication or authorization. Caveat emptor!)
Defining Filters in /etc/syslog.conf
The basic syntax of this file is easy, but it contains some subtleties that can lead you into a long, slow suffering (when using synchronous writes on logfiles, more about that below).
- Empty lines and everything behind a hash mark (#) are ignored
- Rules are of the format
<What> <Goes Where>
What
Your basic "what" is a specification of a facility and a severity delimited by a period:
<facility>.<severity>
This will catch all messages belonging to the given facility that have the given severity or higher.
If you only want to catch messages of exactly the given severity, prefix the severity with an equals sign (=):
<facility>.=<severity>
You can also negate the severity selection by prepending an exclamation mark (!):
<facility>.!<severity>
This selects all messages belonging to the given facility that have a severity lower than the one specified. Note that this also weeds out messages of the given severity itself, which is logical, since the opposite of >= is <.
Of course this can make things tedious if you have to list all combinations of the 20 facilities and 9 severities by hand. So there are shortcuts, such as specifying an asterisk (*) as a catchall:
<facility>.*   -> All messages belonging to <facility>
*.<severity>   -> All messages of the given <severity>
*.*            -> All messages
And then, you can specify lists of "whats", where the "whats" are delimited by semicolons (;):
<facility>.<severity>;<facility>.<severity>
Or, if you want to process the same severities of different facilities, list the facilities delimited by commas (,) first:
<facility>,<facility>.<severity>
To make matters interesting, there is also a special severity called "none", which means that no messages of the given facility are to be logged by this rule:
*.*;<facility>.none -> Log all messages except those of the given facility
Goes Where
After the "What" part with all its twists and turns, the "Where" is actually pretty simple:
</path/to/logfile>
will log everything to the given logfile.
Asynchronously
By default this logging is done with synchronous writes, which means that after each log entry, syslog waits for the operating system kernel to acknowledge that the data has indeed been written to the disk before writing its next entry. This can slow down your system 10-fold for services with extensive logging (especially mail servers!), a factor that has been verified in the wild. So if you can afford to lose the last few log entries in a crash, write your logs asynchronously.
To indicate to syslog that you want log entries to be written asynchronously, prepend a minus (-) to the logfile:
-</path/to/logfile>
This is basically what is needed in 99% of everyday life.
Note that you can specify the same "What" multiple times pointing to different "wheres" for each. The messages will then be logged to all "wheres" given.
Goes Where Again?
Ok, the "Where" part isn't actually all that simple. You have a couple of other choices:
- Remote machines:
@<hostname>
- Named pipes:
|<path to fifo>
- Terminals, by giving their device files as logfiles
- Specific users (if they're logged on), using write:
<user>,<user>
- All users logged on:
*
But again, these are things you don't need that often, and if you do, you'd better read up on them in the manpage first!
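Putting the "What" and "Where" pieces together, a small /etc/syslog.conf might look like this. The file names are the common Red Hat defaults, and the loghost name is made up:

```
# everything at info and above, except mail, synchronously to the main log
*.info;mail.none                /var/log/messages
# mail is chatty: its own file, written asynchronously
mail.*                          -/var/log/maillog
# emergencies go to every logged-in user
*.emerg                         *
# authpriv messages also go to a central loghost via UDP
authpriv.*                      @loghost.example.com
```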
RPM
RPM is the RPM Package Manager. It's a package tool widely used on many Linux distributions, including Red Hat Enterprise Linux, Fedora, Novell, and Mandriva.
It's commonly used to install, update, and remove software and to keep track of software dependencies. The RPM database also includes a lot of information about the software currently installed, and can often be a useful resource for troubleshooting.
using rpm to verify package contents
`rpm` includes support for verifying a file's contents, size, permissions, mtime, user and group ownership, and SELinux context.
If you are having problems with "gaim", you might want to verify if all of the files are correct:
rpm -V gaim
That command checks the on-disk files against the expected values in the RPM database. If a file has been modified, it will show up. See the `rpm` man page for info on decoding the string of characters at the left of the output. But if a file shows up at all, `rpm` thinks something about it has changed, which is often all you need to know without decoding the details.
Also useful is verifying all packages. Sometimes you just don't know what's changed and want an overview of files that have been edited or modified from the original:
rpm -Va
That will take a while on most systems, but it prints a list of all files `rpm` thinks have been modified. Note that on most systems some files will show up that are perfectly acceptable (edited config files, for example).
using rpm to find config files
A good place to start looking when some software is having trouble is the config files. To see a list of the config files for package "up2date":
rpm -q --configfiles up2date
using rpm to see what was installed recently
One of the bits of information `rpm` keeps track of is when a package was installed. Since most software problems originate when software is updated or installed, this is useful information.
To get a list of all installed RPM packages, in order, with their installation dates:
rpm -qa --last
The list is sorted so that the newest packages are at the top. If you are troubleshooting a problem that recently appeared, that's a good place to start looking for clues.
resetting file permissions and user/group info
If you think a file from a package has had its permissions or ownership changed, an easy way to reset them is:
rpm --setperms packagename
For file ownership there is a matching command, rpm --setugids packagename.
ksymoops
To quote from the ksymoops web page, "The Linux kernel produces error messages that contain machine specific numbers which are meaningless for debugging. 'ksymoops' reads machine specific files and the error log and does its best to converts the code to instructions and map addresses to kernel symbols. "
See the man page for more info.
Kernel core dumps (netdump, diskdump and crash)
Netdump and diskdump are utilities for logging kernel crashes. `netdump` sends the core image of the kernel (vmcore) across the network to a netdump server, while `diskdump` writes it to disk. The image can be examined with the `crash` utility.
Netdump and diskdump create a vmcore, a representation of what was in the system's memory when the crash occurred. The `crash` utility is a modified version of gdb which automates the basic steps required to analyse a vmcore.
At the time of writing, `netdump` does not work on Itanium or Itanium II architecture systems.
Netdump
Netdump requires another machine to capture the crash from the crashing kernel. The machine that is crashing is the netdump client; the machine that is going to host the core is the netdump server. One netdump server can capture crashes from multiple clients.
Server Side Configuration
The netdump server does not require any specific network card. The server and client must be on the same subnet, and there must be a clear path (no Network Address Translation or packet modification) between them.
Start the service with the command
service netdump-server start
The server saves the vmcore file in /var/crash. Ensure that there is enough space for the server to receive the file. The following formula approximates the amount of space necessary:
(RAM on client + swap on client) * 1.1
Also note that there is a RAM limit; only the first 4GB of RAM is dumped, so you can feel safe in allocating 5GB per concurrent client dump on your server. For example, if you wanted to have 4 clients dumping at the same time, allocate 20GB of storage for the core files.
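As a quick sanity check of that sizing rule (assuming the formula means RAM plus swap with 10% slack), here is the arithmetic in shell for a hypothetical client with 2 GB of RAM and 1 GB of swap:

```shell
# Hypothetical client: 2048 MB RAM, 1024 MB swap
RAM_MB=2048
SWAP_MB=1024

# (RAM + swap) * 1.1, done in integer arithmetic
NEED_MB=$(( (RAM_MB + SWAP_MB) * 11 / 10 ))
echo "${NEED_MB} MB needed on the netdump server"
```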
The next step is to set the password for the netdump user. Do so with the command:
passwd netdump-user
Be sure to set a strong password for this user.
Client Side Configuration
Currently, only a limited set of hardware is able to send a core to a netdump server. The chosen LAN card for sending the crashdump should support one of the following drivers: 3c59x, e100, e1000, eepro100, pcnet32, tg3, tlan, and tulip.
The next step is to modify /etc/sysconfig/netdump and add the following line:
NETDUMPADDR=10.0.0.222
The address 10.0.0.222 should be the IP address of the machine configured as the netdump server.
Notice, you can also set up the netdump server as a syslog server for messages generated by the client during the crash. Don't worry - the messages will only be logged during a crash and not during the client's normal operation. This is a handy thing to know, since interrupts are disabled on the client during a netdump.
Netdump client will now need to connect to the netdump server and create a set of public/private ssh keys. Enter the command:
service netdump propagate
You should be prompted for a password. Enter the password of the netdump user on the netdump server.
The next step is to start the netdump service. Run the command
service netdump start
Then you need to test-crash your machine. The example given in the Netdump How-To assumes you are using an old 2.4 kernel. For a 2.6 kernel crash module, see this site (http://blog.dkpdev.com) or just copy and paste the code below into a pair of files.
Note that $(PWD) is used in the Makefile, so it would be wise to put these two files in a directory named panic/.
Makefile:
obj-m += panic.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

panic.c:
/*
 * Panic kernel module
 */
#include <linux/module.h>
#include <linux/kernel.h>

#define DRIVER_AUTHOR "Thundarr <[email protected]>"
#define DRIVER_DESC   "A panic module to test NetDump on 2.6 kernels"

MODULE_LICENSE("GPL");
MODULE_AUTHOR(DRIVER_AUTHOR);
MODULE_DESCRIPTION(DRIVER_DESC);

int init_module(void)
{
	printk(KERN_INFO "Panic module inserted to force a crash.\n");
	panic("Panic module inserted to force a crash.\n");
	return 0;
}

void cleanup_module(void)
{
	printk(KERN_INFO "How did we get here? Failed to panic?\n");
}

You can now just insmod panic.ko and watch your box die a painful death. Be sure to stop activity and sync your disks before inserting this module.
The crash should end up in /var/crash/ on the netdump server.
Diskdump
(Yep, I'm still working on this section.)
Supported cards
Cross platform:
* aic7xxx
* aic79xx
* megaraid2
* mpt fusion
* sata_promise
* sym53c8xx
Additionally, ata_piix is supported on the i386, AMD64 and Intel® EM64T architectures. dpt_i2o is supported only on i386.
How do you turn on diskdump?
What DEVICE can you use (in /etc/sysconfig/diskdump)? Can you use a device already in use (/var or swap) or must this be a unused partition?
From what I can see you need an unused partition; you need to format the DEVICE to initialize the service:
service diskdump initialformat
Crash
The crash package can be used to investigate live systems and kernel core dumps created by the netdump or diskdump packages.
xev
`xev` is a small utility that can be used to debug problems with X11. In particular, odd behaviour related to keypresses and mouse clicks can be tracked down.
`xev` just shows all the X11 "events" that get passed to it. For example, if a keypress doesn't seem to be doing what it is supposed to do, you can check to see if X11 is actually getting the keypress, and if so, what value it is getting. For basic troubleshooting, no knowledge of X11 is needed, but `xev` can present a ton of information that only the most diehard X11 hacker cares about.
Related, but more low-level are also the files in /proc/bus/input. If you're having trouble getting an input device to be accepted by X, you can check if you're giving the correct device file/protocols in xorg.conf/xfree86.conf by cross referencing your config file with the information in these proc files.
pmap
`pmap` is part of the "procps" suite of tools. It can be used to display the memory map of a process. It is essentially a wrapper for reading from /proc/PID/maps.
It's useful to be able to see which libraries and modules an app has loaded. `ldd` can show the list of libraries an executable is linked against, but it doesn't know anything about dynamically loaded modules. Many large applications, as well as most scripting languages, make significant use of dynamically loaded modules, so `pmap` can come in handy when trying to diagnose issues that might be related to them.
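Since `pmap` is essentially a formatter over /proc/PID/maps, a rough equivalent for the current shell can be sketched directly. This only lists the file-backed mappings, without the sizes and permissions `pmap` would show:

```shell
# List the unique files mapped into this shell's address space;
# column 6 of /proc/PID/maps is the backing path, when there is one
MAPPED=$(awk '$6 ~ /^\// { print $6 }' /proc/$$/maps | sort -u)
echo "$MAPPED"
```

On a typical Linux system this shows the shell binary itself plus the shared libraries (libc and friends) it has mapped.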
Scripting languages and shell programming
Shell scripting and scripting languages are what make Unix and Linux work. They are everywhere, so knowing how to track down problems with scripts is a handy skill.
For more information, see Scripting Languages
Logs
The key to troubleshooting is knowing what is going on. For core system services, there is a significant amount of logging turned on by default, especially for error cases. The trick is knowing where to look.
For more info, see Log Files
Environment settings
Allowing Core Files
"core" files are dumps of a process's memory. When a program crashes, it can leave behind a core file; loading that file in a debugger can help determine the cause of the crash.
By default, most Linux distributions turn off core file support by setting the maximum allowed core file size to 0.
In order to allow a segfaulting application to leave a core, you need to raise this limit. This is done via `ulimit`. To allow core files of unlimited size, issue:
ulimit -c unlimited
See the section on GDB for more information on what to do with core files.
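A quick way to check what the current shell allows (the soft limit is what a crashing child process will inherit):

```shell
# Raise the soft core-file limit for this shell and its children;
# this may fail if the hard limit is lower than "unlimited"
ulimit -S -c unlimited 2>/dev/null || echo "hard limit prevents raising the core size"

# Show the limit now in effect
CORE_LIMIT=$(ulimit -S -c)
echo "core file size limit: $CORE_LIMIT"
```

Note that `ulimit` settings only apply to the current shell and anything started from it, which is why init scripts sometimes need the setting added explicitly for a daemon to dump core.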
LD_ASSUME_KERNEL
LD_ASSUME_KERNEL is an environment variable used by the dynamic linker to decide which implementation of certain libraries is used. In most cases the important library is the C library, "libc" or "glibc".
The reason glibc is important is that it contains the thread implementation for a system.
The values you can set LD_ASSUME_KERNEL to correspond to Linux kernel versions. Since glibc and the kernel are tightly bound, it's necessary for glibc to change its behaviour based on which kernel version is running.
For properly written apps, there should be no reason to use this setting. However, for some legacy apps that depend on a particular thread implementation in glibc, LD_ASSUME_KERNEL can be used to force the app to use an older implementation.
LD_ASSUME_KERNEL=2.4.20 selects the NPTL thread library. LD_ASSUME_KERNEL=2.4.1 uses the implementation in /lib/i686 (newer LinuxThreads). LD_ASSUME_KERNEL=2.2.5 or older uses the implementation in /lib (old LinuxThreads).
For an app that requires the old thread implementation, it can be launched as:
LD_ASSUME_KERNEL=2.2.5 ./some-old-app
See http://people.redhat.com/drepper/assumekernel.html for more details.
glibc environment variables
There's a wide variety of environment variables that glibc uses to alter its behaviour, many of which are useful for debugging or troubleshooting purposes.
A good reference on these variables is at http://www.scratchbox.org/documentation/general/tutorials/glibcenv.html
Some interesting ones:
LANG and LANGUAGE
LANG sets the default locale, which controls what message catalog is used; the individual LC_* variables override it per category, and LC_ALL overrides everything. LANGUAGE, a GNU extension, sets the list of preferred languages for messages. These variables control the locale-specific parts of glibc.
Lots of programs are written expecting to run in one locale and can break in other locales. Since locale settings can change things like sort order (LC_COLLATE) and time formats (LC_TIME), shell scripts are particularly prone to problems from this.
A script that assumes the sort order of something is a good example.
A common way to test this is to try running the troublesome app with the locale set to "C" or the default locale.
LC_ALL=C ls -al
If the app starts behaving when run that way, there is probably something in the code that assumes the "C" locale (sorted lists and time formats are strong candidates).
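Sort order is an easy place to see the effect. In the C locale, comparison is by byte value, so all uppercase ASCII letters sort before lowercase ones, which can surprise scripts written under a different locale. A two-line example:

```shell
# In the C locale, "B" (byte 0x42) sorts before "a" (byte 0x61)
FIRST=$(printf 'a\nB\n' | LC_ALL=C sort | head -n 1)
echo "$FIRST"    # B
```

Under a typical en_US locale the same input usually sorts "a" first, since collation there is roughly case-insensitive dictionary order.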
glibc malloc stuff
Recent (>5.4.23 for libc/>2.0 for glibc) libc implementations offer a small scale malloc debugger by way of the MALLOC_CHECK_ environment variable. MALLOC_CHECK_ can be set to 3 different values:
- 0: ignore any heap corruption
- 1: print diagnostics on stderr
- 2: call abort(3) as soon as memory corruption is detected
This will help with the kind of memory corruption that can't be found with the tried and proven software engineering method of "staring at the code", but where electric fence or valgrind would be overkill.
Types Of Problems
Software is complicated and there can be a wide variety of problems that occur. But there are categories of problems that come up often, and it's useful to have tools and techniques for solving them.
For more info, see Types Of Problems
App specific troubleshooting info
apache
mod_status
mod_status is an Apache module that can show an HTML page representing various information about the internal status of Apache. This includes number of httpds, their current status, network connections, amount of traffic, etc.
Very useful when trying to track down performance related issues.
module debugging
Some Apache httpd modules include options to enable extra debugging info. Unfortunately, this seems to depend on the module.
log files
Log files, the httpd error logs in particular (typically /var/log/httpd/error_log), are often the best place to look when troubleshooting. They are also where any module debugging information is logged.
Testing the configuration file for syntax errors
Apache comes with an executable called apachectl(8). This program can run a configuration check on Apache's configuration files by issuing the command
apachectl configtest
Some distros (like Red Hat/Fedora) also include this check in Apache's init script, invoking apachectl in the background.
-X debug mode
One of the biggest problems with trying to track down problems with Apache httpd is its multiprocess nature, which makes it difficult to strace the right process or to attach gdb.
To force httpd to run in a single process mode start it with:
httpd -X
Note that on Red Hat Linux boxes you probably need to include the command-line arguments that the init scripts start httpd with. The easiest way to do this is to start httpd normally, run `ps auxwwww`, and cut and paste one of the httpd command lines.
PHP
The following assumes that you know PHP coding.
The most informative (but also most disruptive in a visual sense) thing to do is set
error_reporting = E_ALL
in your php.ini (under Debian: /etc/php/<calling entity>/php.ini). Remember to restart your webserver/calling entity after changing this setting. If you come from the C corner of things, you'll know that good programming style dictates treating warnings and notices as errors. So off you go, clean up that code!
Back and still not working? Ok, now it gets ugly. PHP doesn't come with a debugger like `gdb`. Such things exist, but usually they will be embedded in an IDE that also emulates a web server and costs $$$. So basically you get to do stuff just as in regular shell scripts: debug echos. Echo early, echo often. Hand in hand with echo statements comes the print_r function, which will print arrays/hashes (same thing in PHP) recursively. Drawback here: print_r formats in plain ASCII, not HTML. So you'll either have to look at the page source to see a clean version of the output, or do something ugly like
echo str_replace( "\n", "<br>", print_r($myarray, true) );
(The second argument makes print_r return its output as a string instead of printing it; without it, print_r just returns true and there is nothing to reformat.)
FIXME: can you turn on warnings about variables only used once, like in perl? One of my most frequent errors...
iptables
I have a Windows VPN client behind a Linux gateway doing NAT and I can't connect to the server
First things first, you'll want to know what kind of Windows VPN tunnel you're building. The following will assume the standard PPTP tunnel.
FIXME: What about l2tp tunnels?
You need rules that allow the forwarding of the connections involved, and rules for NATing. The tricky part here is that the PPTP tunnel uses two connections: one going to tcp/1723 on the server, and one GRE tunnel (meaning you can only have one PPTP NATting session active on the gateway at a time). So you'll need the following rules to allow the forwarding:
iptables -A FORWARD -p tcp --dport 1723 -d vpn-server-address -j ACCEPT
iptables -A FORWARD -p gre -d vpn-server-address -j ACCEPT
and the NATing is handled by these rules:
iptables -t nat -A PREROUTING -p tcp --sport 1723 -s vpn-server-address -j DNAT --to-dest vpn-client-ip:1723
iptables -t nat -A PREROUTING -p gre -s vpn-server-address -j DNAT --to-dest vpn-client-ip
iptables -t nat -A POSTROUTING -p tcp --dport 1723 -d vpn-server-address -j SNAT --to-source gateway-public-ip
iptables -t nat -A POSTROUTING -p gre -d vpn-server-address -j SNAT --to-source gateway-public-ip
If you're still having trouble connecting, and Windows is giving you an error 721 (or, if you're watching the data flow with tcpdump, you see the 1723/tcp connection working fine but the GRE tunnel failing because the source IP of the GRE packets is the private IP of the machine running the VPN client), you will need to build the PPTP connection tracking module for the Linux kernel (as of 2.6.x?) and load the following two modules:
modprobe ip_conntrack_pptp
modprobe ip_nat_pptp
Now everything should work as expected.
SSH
Most problems here occur when you're trying to set up logins via RSA/DSA keys (and probably without passwords, too). It usually comes down to basics: make sure that your ~/.ssh directory is owned by your user and set to mode 700, and that ~/.ssh/authorized_keys is set to mode 0600. If these basic conditions aren't met, sshd will refuse to even look at your authorized_keys file and drop you back to password logins.
Another word about the format of the authorized_keys file: it's one key per line. Make sure each added key is on a single line! vi is notorious for adding line breaks if you have 'tw' set in your ~/.vimrc and use copy and paste to add a new key to the file. Use cat or ssh-copy-id instead.
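The permission fixes described above, demonstrated in a scratch directory so the commands can be run anywhere; on a real account the target is of course ~/.ssh:

```shell
# Stand-in for a home directory (real target: $HOME/.ssh)
D=$(mktemp -d)
mkdir -p "$D/.ssh"
touch "$D/.ssh/authorized_keys"

# Directory must not be group/world accessible; the key file stays private
chmod 700 "$D/.ssh"
chmod 600 "$D/.ssh/authorized_keys"

DIR_PERMS=$(ls -ld "$D/.ssh" | cut -c1-10)
KEY_PERMS=$(ls -l "$D/.ssh/authorized_keys" | cut -c1-10)
echo "$DIR_PERMS $KEY_PERMS"    # drwx------ -rw-------

rm -rf "$D"
```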
You can run
ssh -v fred@godot
to see what SSH is up to and where things start hiccupping. You can go all the way up to
ssh -vvv fred@godot
if you really want to know how modulo groups are being prodded. Usually -vv suffices.
I just updated my openssh packages and now I can't login
If the error message is something like "Unsupported Protocol - Remote host closed the connection", it's probably due to an incompatibility between OpenSSH 4.2 and anything pre-4.2. If you have the server under your control, the solution is easy: update the server to 4.2 as well (recommended anyway, as there are some nasty zlib buffer overruns in pre-4.2).
FIXME: What other solutions are there?
Kerberos
When something goes wrong with Kerberos, it's usually down to a few things:
- Something in the network topology changed, mandating that you re-check your /etc/krb5.conf
- Your Kerberos server is unreachable
- You entered a wrong password while generating a keytab file, or the associated user/service name is not known to the server
Unfortunately, while tools like kinit(1) do have a -v option for verbose output, they only start printing useful information after they acquire a TGT from the KDC. It's more useful to watch the logs of the KDC and see what (if anything) actually happens there.
/etc/krb5.conf
This configuration file is read and used by the Kerberos libraries, so any settings here affect everything on your system that uses Kerberos. The most important setting is
[realms]
 <YOUR DEFAULT REALM> = {
  kdc = <IP of your KDC>
 }
There may be several realm definitions within the [realms] section. Be sure that you set the correct IP here; otherwise your Kerberos requests will just hang and time out after a while.
The second most important setting is
[libdefaults]
 default_realm = <YOUR DEFAULT REALM>
This specifies which realm Kerberos tools will use if no explicit realm is given for a request.
Finally, if you're fooling around with a KDC that resides on a Windows 2003 server, be sure that you've enabled arcfour-hmac-md5 and des-cbc-crc as crypto algorithms for the settings default_tgs_enctypes, default_tkt_enctypes and permitted_enctypes in the [libdefaults] section. Otherwise your keytab files will be unreadable.
OpenSwan/IPSEC
Desktop Enviroments
Gnome
- http://dcs.nac.uci.edu/~strombrg/Troubleshooting-a-gnome-problem-early-in-the-login.html
- http://docs.sun.com/app/docs/doc/817-1740
Links
- Linux server system tuning (http://people.redhat.com/~alikins/system_tuning.html) Similar concepts.
- glibc env variables explained (http://www.scratchbox.org/documentation/general/tutorials/glibcenv.html)
- Mac OSX debugging (http://developer.apple.com/technotes/tn2004/tn2124.html)
- Unix and Linux Troubleshooting (http://aplawrence.com/Unixart/troubleshooting.html)
- Linux Troubleshooting Tutorials - Solving Problems (http://www.tutorialized.com/tutorial/Solving-Problems/4521)
- Unix Debugging Tips at sial.org (http://sial.org/howto/debug/unix/)
- How To Be A Programmer (http://samizdat.mines.edu/howto/HowToBeAProgrammer.html) Info on debugging strategies
Credits
Comments, suggestions, hints, ideas, criticisms, pointers, and other useful info from various folks were used to create the original version of this document. Check the history for more.
- Adrian Likins
- Mihai Ibanescu
- Chip Turner
- Chris MacLeod
- Todd Warner
- Nicholas Hansen
- Sven Riedel
- Jacob Frelinger
- James Clark
- Brian Naylor
- Drew Puch
- Ted Johnson
License
This work is licensed under a Creative Commons Attribution 2.5 License (http://creativecommons.org/licenses/by/2.5/). If folks are interested in also applying other licenses (GNU FDL, etc.), let Adrian know.
How to Help
See How To Help for more info.
Retrieved from "http://www.linuxtroubleshooting.com/wiki/index.php?title=Main_Page"
Dec 23, 2018 | hexmode.com
A while back I mentioned Atul Gawande 's book The Checklist Manifesto . Today, I got another example of how to improve my checklists.
The book talks about how checklists reduce major errors in surgery. Hospitals that use checklists are drastically less likely to amputate the wrong leg .
So, the takeaway for me is this: any checklist should start off verifying that what you "know" to be true is true . (Thankfully, my errors can be backed out with very little long term consequences, but I shouldn't use this as an excuse to forego checklists.)
Before starting, ask the "Is it plugged in?" question first. What happened today was an example of when asking "Is it plugged in?" would have helped.
Today I was testing the thumbnailing of some MediaWiki code and trying to understand the $wgLocalFileRepo variable. I copied part of an /images/ directory over from another wiki to my test wiki. I verified that it thumbnailed correctly. So far so good.
Then I changed the directory parameter and tested. No thumbnail. Later, I realized this is to be expected because I didn't copy over the original images. So that is one issue.
I erased (what I thought was) the thumbnail image and tried again on the main repo. It worked again–I got a thumbnail.
I tried copying over the images directory to the new directory, but the new thumbnailing directory structure didn't produce a thumbnail.
I tried over and over with the same thumbnail and was confused because it kept telling me the same thing.
I added debugging statements and still got nowhere.
Finally, I just did an ls on the directory to verify it was there. It was. And it had files in it. But not the file I was trying to produce a thumbnail of.
The system that "worked" had the thumbnail, but not the original file.
So, moral of the story: Make sure that your understanding of the current state is correct. If you're a developer trying to fix a problem, make sure that you are actually able to understand the problem first.
Maybe your perception of reality is wrong. Mine was. I was sure that the thumbnails were being generated each time until I discovered that I hadn't deleted the thumbnails, I had deleted the original.
Back in 2005, I worked on Linux-branded Zones, Solaris containers that contained a Linux user environment. I wrote a coyly-titled blog post about examining Linux applications using DTrace. The subject was honest - we used precisely the same techniques to bring the benefits of DTrace to Linux applications - but the title wasn't completely accurate. That wasn't exactly "DTrace for Linux", it was more precisely "The Linux user-land for Solaris where users can reap the benefits of DTrace"; I chose the snappier title.
I also wrote about DTrace knockoffs in 2007 to examine the Linux counter-effort. While the project is still in development, it hasn't achieved the functionality or traction of DTrace. Suggesting that Linux was inferior brought out the usual NIH reactions which led me to write a subsequent blog post about a theoretical port of DTrace to Linux. While a year later Paul Fox started exactly such a port, my assumption at the time was that the primary copyright holder of DTrace wouldn't be the one porting DTrace to Linux. Now that Oracle is claiming a port, the calculus may change a bit.
What is Oracle doing? Even among Oracle employees, there's uncertainty about what was announced. Ed Screven gave us just a couple of bullet points in his keynote; Sergio Leunissen, the product manager for OEL, didn't have further details in his OpenWorld talk beyond it being a beta of limited functionality; and the entire Solaris team seemed completely taken by surprise.
What is in the port? Leunissen stated that only the kernel components of DTrace are part of the port. It's unclear whether that means just fbt or includes sdt and the related providers. It sounds certain, though, that it won't pass the DTrace test suite which is the deciding criterion between a DTrace port and some sort of work in progress.
What is the license? While I abhor GPL v. CDDL discussions, this is a pretty interesting case. According to the release manager for OEL, some small kernel components and header files will be dual-licensed while the bulk of DTrace - the kernel modules, libraries, and commands - will use the CDDL as they had under (the now defunct) OpenSolaris (and to the consternation of Linux die-hards, I'm sure). Oracle already faces an interesting conundrum with their CDDL-licensed files: they can't take the fixes that others have made to, for example, ZFS without needing to release their own fixes. The DTrace port to Linux is interesting in that Oracle apparently thinks that the CDDL license will make DTrace too toxic for other Linux vendors to touch.
October 5, 2011 | Datamation
Oracle is now updating that kernel to version 2, delivering even more performance thanks to an improved scheduler for high thread count applications like Java. The Unbreakable Enterprise Kernel 2 release also provides transmit packet steering across CPUs, which Screven said delivers lower network latency. There is also a virtual switch that enables VLAN isolation as well as Quality of Service (QoS) and monitoring.
The new kernel also provides Linux containers, which are similar to the Solaris containers, for virtualization isolation.
"Linux containers give you low-overhead operating system isolation," Screven said.
In another nod to Solaris, Oracle is now also bringing Solaris' DTrace to Linux. DTrace is one of the primary new features that debuted in Solaris 10; it gives administrators better visibility into their system's performance.
See also
Verifying that there are no configuration errors in xorg.conf
The xorg.conf file might contain a wrong value or entry. You can try the following procedures:
init 3
cd /etc/X11
mv xorg.conf xorg.conf.old
cp xorg.conf.install xorg.conf
init 5
retest
If the previous method didn't help, try the following:
init 3
sax2 -r
Check the keyboard, mouse and resolution configuration. Save the file, issue init 5 and retest.
I've seen cases where switching runlevels didn't activate the new configuration, so if you don't see any difference you might need to reboot the server to validate the changes from steps 1 and 2.
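The back-up-then-replace pattern from step 1 can be rehearsed safely against a scratch directory instead of the real /etc/X11, so you can see exactly what the file shuffle does before running it as root (the directory and file contents below are placeholders):

```shell
# Stand-in for /etc/X11 so the sketch can run unprivileged.
X11_DIR="./x11-demo"
mkdir -p "$X11_DIR"
echo 'suspect config'    > "$X11_DIR/xorg.conf"
echo 'installer default' > "$X11_DIR/xorg.conf.install"

# Step 1 from above: keep the suspect file under a new name,
# then fall back to the installer-generated configuration.
mv "$X11_DIR/xorg.conf" "$X11_DIR/xorg.conf.old"
cp "$X11_DIR/xorg.conf.install" "$X11_DIR/xorg.conf"

# The old file is preserved so it can be diffed against the working
# config once X is up again.
diff "$X11_DIR/xorg.conf.old" "$X11_DIR/xorg.conf" || true
```

Keeping xorg.conf.old around is the important part: once the installer default works, a diff against the broken file usually points straight at the bad entry.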
July 2, 2008
Linux can be configured to log dmesg output to another system over the network using syslog. This is done with kernel-level networking via UDP port 514. There is a module called netconsole which logs kernel printk messages over UDP, allowing debugging of problems where disk logging fails and serial consoles are impractical. Most modern distros have netconsole as a built-in module. netconsole initializes immediately after the NIC drivers. There are two steps to configure netconsole:
- Syslogd server - let us assume the IP 192.168.1.100 with the FQDN syslogd.nixcraft.in. Please note that the remote host can run either 'netcat -u -l -p <port>' or syslogd.
- All other systems running the netconsole module in the kernel
Step # 1: Configure Centralized syslogd
Login to the syslogd.nixcraft.in server and open the syslogd configuration file. Different UNIX / Linux variants have different configuration files.
Red Hat / CentOS / Fedora Linux Configuration
If you are using Red Hat / CentOS / Fedora Linux, open the /etc/sysconfig/syslog file and set the SYSLOGD_OPTIONS option for UDP logging.
# vi /etc/sysconfig/syslog
Configure syslogd option as follows:
SYSLOGD_OPTIONS="-m 0 -r -x"
Save and close the file. Restart syslogd, enter:
# service syslog restart
Debian / Ubuntu Linux Configuration
If you are using Debian / Ubuntu Linux, open the file /etc/default/syslogd and set the SYSLOGD option for UDP logging.
# vi /etc/default/syslogd
Configure the syslogd option as follows:
SYSLOGD="-r"
Save and close the file. Restart sysklogd, enter:
# /etc/init.d/sysklogd restart
FreeBSD configuration
If you are using FreeBSD, open /etc/rc.conf and set the syslogd_flags option for UDP logging. Please note that FreeBSD accepts network connections by default. Refer to the syslogd man page for more information.
Firewall configuration
You may need to open UDP port 514 to allow network logging. Sample iptables rules to open UDP port 514:
MYNET="192.168.1.0/24"
SLSERVER="192.168.1.100"
iptables -A INPUT -p udp -s $MYNET --sport 1024:65535 -d $SLSERVER --dport 514 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A OUTPUT -p udp -s $SLSERVER --sport 514 -d $MYNET --dport 1024:65535 -m state --state ESTABLISHED -j ACCEPT
Step # 2: Configure Linux Netconsole
You need to configure the netconsole service. Once this service is started, a remote syslog daemon is allowed to record console output from the local system. The local port number that the netconsole module uses is 6666 by default. You need to set the IP address of the remote syslog server to which messages are sent.
Open /etc/sysconfig/netconsole file under CentOS / RHEL / Fedora Linux, enter:
# vi /etc/sysconfig/netconsole
Set SYSLOGADDR to 192.168.1.100 (the IP address of the remote syslog server):
SYSLOGADDR=192.168.1.100
Save and close the file. Restart netconsole service, enter:
# /etc/init.d/netconsole restart
A note about Debian / Ubuntu Linux
Red Hat ships a netconsole init script. Under Debian / Ubuntu Linux, however, you need to configure netconsole manually. Type the following command to start netconsole by loading the kernel netconsole module:
# modprobe netconsole netconsole=6666@192.168.1.5/eth0,514@192.168.1.100/00:19:D1:2A:BA:A8
Where,
- 6666 - Local port
- 192.168.1.5 - Local system IP
- eth0 - Local system interface
- 514 - Remote syslogd udp port
- 192.168.1.100 - Remote syslogd IP
- 00:19:D1:2A:BA:A8 - Remote syslogd Mac
You can add the above modprobe line to /etc/rc.local to load the module automatically. Another recommended option is to create an /etc/modprobe.d/netconsole file with the following contents:
# echo 'options netconsole netconsole=6666@192.168.1.5/eth0,514@192.168.1.100/00:19:D1:2A:BA:A8' > /etc/modprobe.d/netconsole
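The netconsole= string packs six values into one line, so it is easy to transpose them. One way to hedge against typos is to assemble the string from named variables and eyeball the result before loading the module (the addresses below are this article's sample values):

```shell
# Sample values from this article; substitute your own.
LOCAL_PORT=6666
LOCAL_IP=192.168.1.5
LOCAL_DEV=eth0
REMOTE_PORT=514
REMOTE_IP=192.168.1.100
REMOTE_MAC=00:19:D1:2A:BA:A8

# Format: <local-port>@<local-ip>/<dev>,<remote-port>@<remote-ip>/<remote-mac>
OPTS="netconsole=${LOCAL_PORT}@${LOCAL_IP}/${LOCAL_DEV},${REMOTE_PORT}@${REMOTE_IP}/${REMOTE_MAC}"
echo "$OPTS"

# Then, as root:
#   modprobe netconsole "$OPTS"
# or persist it:
#   echo "options netconsole $OPTS" > /etc/modprobe.d/netconsole
```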
How do I verify netconsole is logging messages over UDP network?
Login to the remote syslog UDP server (i.e. 192.168.1.100, our sample syslogd system), enter:
# tail -f /var/log/messages
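On a busy server it helps to filter for the sending host rather than watch the whole stream. A sketch against a fabricated sample of /var/log/messages (the log lines below are invented for illustration; the real field layout varies by distribution):

```shell
# Fabricated sample of what netconsole traffic may look like once
# syslogd records it; only the host field matters for the filter.
cat > messages.sample <<'EOF'
Jul  2 10:15:01 syslogd crond[1234]: session opened
Jul  2 10:15:03 192.168.1.5 kernel: eth0: link up
Jul  2 10:15:04 192.168.1.5 kernel: Oops: 0002 [#1] SMP
EOF

# Keep only kernel messages that arrived from the monitored host.
grep ' 192.168.1.5 kernel: ' messages.sample
```

The same pattern works live with `tail -f /var/log/messages | grep ' 192.168.1.5 kernel: '`.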
/var/log/messages is the default log file under many distributions. Refer to /etc/syslog.conf for the exact location of your file.
How do I use nc / netcat instead of messing with syslogd?
This is a one-minute configuration. You can easily get output on 192.168.1.100 without using syslogd. All you have to do is run the netcat (nc) command on 192.168.1.100:
$ nc -l -p 30000 -u
Login to any other box, enter command:
# modprobe netconsole netconsole=6666@192.168.1.5/eth0,30000@192.168.1.100/00:19:D1:2A:BA:A8
Output should start to appear on 192.168.1.100 from 192.168.1.5 without configuring syslogd or anything else. Note that the remote port in the netconsole= string (30000 here) must match the port nc is listening on.
Further readings:
- nc / netcat command
- modprobe command
- netconsole documentation
- man pages nc, modprobe
Course outline
Section 1
Troubleshooting methodology
Section 2
Tools
- common troubleshooting tools
- RPM queries and verification
- src packages and spec files
- strace, ltrace, lsof, and fuser
- ipcs and ipcrm
- vmstat, iostat, mpstat, and sar
- ifconfig, ip, arp, and route
- name resolution
- netstat and rpcinfo
- nmap and nc
- tcpdump and ethereal
Lab
- exploring and documenting current system configuration state
- troubleshooting techniques with RPM, process related tools, and network related tools
Section 3
Rescue environments
- rescue procedures
- recovery examples
Lab
- using rescue disk
- using mount and chroot to access hard disk
- reinstalling the Master Boot Record (MBR) with grub-install
- Setting up networking statically
- mounting an NFS share
- installing an RPM using the root option
Section 4
- Linux boot process
- booting Linux
- boot process troubleshooting
- process management and troubleshooting
- file systems concepts and troubleshooting
- backups concepts and troubleshooting
Lab
- troubleshooting common system and daemon errors
- restoring files from backup
- booting scenarios: six exercises
- process scenarios: three exercises
- backup scenarios: one exercise
Section 5
- networking commands review and troubleshooting
- Internet Protocol (IP) aliases versus virtual interfaces
- xinetd concepts and troubleshooting
- Transmission Control Protocol (TCP) wrappers concepts and troubleshooting
- iptables concepts and troubleshooting
Lab
- iptables scenario: two exercises
- networking scenarios: four exercises
- TCP wrappers scenarios: two exercises
- xinetd scenarios: four exercises
Section 6
- X11 concepts, troubleshooting, and server operation
- X11 concepts and troubleshooting
- syslog concepts and troubleshooting
- RPM concepts and troubleshooting
- Common UNIX Printing System (CUPS) troubleshooting
- at and cron troubleshooting
Lab
- at and cron scenarios: four exercises
- CUPS scenarios: two exercises
- RPM scenarios: four exercises
- syslog scenarios: three exercises
- X scenarios: seven exercises
Section 7
- users and groups troubleshooting
- Pluggable Authentication Module (PAM) concepts and troubleshooting
- filesystem quotas and quotas troubleshooting
- File Access Control Lists (FACL) and Access Control Lists (ACL) for users or groups
- FACLs and troubleshooting
Lab
- filesystem scenarios: six exercises
- PAM scenarios: four exercises
- quota scenarios: five exercises
- user and group scenarios: five exercises
Section 8
- DNS concepts and troubleshooting
- Apache concepts and troubleshooting
- FTP concepts and troubleshooting
- Squid concepts and troubleshooting
Lab
- Apache scenarios: five exercises
- DNS scenarios: four exercises
- FTP scenarios: two exercises
- Squid scenarios: four exercises
Section 9
- Samba concepts and troubleshooting
- Sendmail concepts and troubleshooting
- Postfix concepts and troubleshooting
- Internet Message Access Protocol (IMAP) and Post Office Protocol (POP) concepts and troubleshooting
Lab
- IMAP/POP scenarios: three exercises
- Postfix scenarios: five exercises
- Samba scenarios: three exercises
- Sendmail scenarios: four exercises
Section 10
- Kernel modules and troubleshooting
- logical volume management and creating logical volumes
- Logical Volume Manager (LVM) deployment issues and troubleshooting
- Redundant Array of Independent Disks (RAID) concepts and troubleshooting
- LDAP and LDAP troubleshooting
Lab
- Kernel module scenarios: three exercises
- LDAP scenarios: three exercises
- LVM scenario: one exercise
- Network Information Service (NIS) scenarios: two exercises
- RAID scenarios: three exercises
The last but not least: Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand. ~ Archibald Putt, Ph.D.
Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.
This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...
Disclaimer:
The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense, so you need to be aware of the Google privacy policy. If you do not want to be tracked by Google, please disable JavaScript for this site. This site is perfectly usable without JavaScript.
Last modified: February 19, 2020