Main Entry: trou-ble-shoot-er  Pronunciation: -ˌshü-tər  Function: noun  Date: 1905
From the Merriam-Webster online dictionary
Troubleshooting, problem analysis, and root cause determination require patience, determination, and experience. It is important to fully investigate the problem and collect relevant data in order to get onto the correct path. You might start on one path and end up on another. In any case, the most important skill is patience. Keep a log while solving a complex problem, as you can become distracted and forget important facts or findings.
|
The Merriam-Webster online dictionary provides a fitting definition of troubleshooting, and by extension a definition of a Unix system administrator in general: making repairs, dealing with people, and trying to anticipate as well as prevent problems. Like program debugging, Linux troubleshooting is very similar to the investigation of a crime scene. Some "obvious" leads can be, and often are, false. Finding relevant information is not easy; it can take a tremendous amount of time and requires maintaining well-organized documentation about the problem. Some highly suspicious suspects without an alibi are actually innocent. You need a plan, and the ability to see the big picture, so as not to be led off track. You need clear analytical thinking and experience to get to the root cause.
Linux is a very complex system, and for many subsystems the administrator has little or no understanding of the internals. Several lines of fault analysis can help in this situation. The kernel ring buffer, viewable with /bin/dmesg, and the system log files serve similar purposes: /bin/dmesg provides more detailed information in near real time, while the log files keep less information but preserve it for historical purposes. On systems with SELinux enabled, sestatus shows the current enforcement status.
The most general strategy is to compare the problematic system to a working system, or to a backup of the current system taken before the problem appeared. Often the problem is the result of miscommunication when several sysadmins implement changes on the same system, or when another sysadmin implements changes in the absence of the primary sysadmin. Here configuration management is critical.
As with any complex system, creating a baseline for a Linux system is critically important. On a modern system this is not a time-consuming operation and can be done when you log in to the system (run a script that creates the baseline from your .profile). The most primitive way is to tar the /etc and /root directories, as well as a couple of others that you know contain important configuration files.
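A minimal baseline script along these lines might look as follows; the destination directory and the list of directories to archive are illustrative, not prescriptive:

```shell
#!/bin/sh
# Create a date-stamped tarball of important configuration directories so a
# misbehaving system can later be compared against a known-good state.
# BASEDIR and CONF_DIRS are examples; on a real system CONF_DIRS would
# usually include /etc and /root.
BASEDIR=${BASEDIR:-$HOME/baselines}
CONF_DIRS=${CONF_DIRS:-/etc}
STAMP=$(date +%Y%m%d-%H%M%S)
mkdir -p "$BASEDIR"
# Some files may be unreadable to a non-root user; tar archives what it can.
tar czf "$BASEDIR/baseline-$STAMP.tar.gz" $CONF_DIRS 2>/dev/null || true
echo "baseline written to $BASEDIR/baseline-$STAMP.tar.gz"
```

Calling such a script from .profile, as suggested above, makes baseline creation automatic on every login; old tarballs can then be extracted and compared against the live system with diff -r.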
Supportconfig is the standard tool for collecting all information relevant to troubleshooting on SUSE. A companion tool, the Supportconfig Health Check Report Tool (schealth), helps analyze the collected information. Several similar tools are available for Red Hat. See Baseliners
When an event occurs that will cause system or application downtime, the number one priority is to get it working again as quickly as possible. But excessive zeal here is counterproductive. Much depends on the context:
Confidence comes with experience; the techniques used are generally a combination of proven solutions. If the solution is documented by the application or system vendor, it has been tested and will most likely bring the system or application online with the least downtime.
Collaboration with colleagues is important when the issue falls outside the realm of familiar territory. It never hurts to ask a question or two of those who might have seen a similar situation and have more current experience. Finally, stubborn, bulldog-like determination is the essential quality for anyone facing a troubleshooting task, even when the situation is unfamiliar.
Additional useful information can be found in Linux Troubleshooting Wiki
ltrace
ltrace is very similar to strace, except ltrace focuses on tracing library calls.
For apps that use a lot of libs, this can be a very powerful debugging tool. However, because most modern apps use libraries very heavily, the output from ltrace can sometimes be painfully verbose.
There is a distinction between what makes a system call and a call to a library function. Sometimes the line between the two is blurry, but the basic difference is that system calls are communicating to the kernel, and library calls are just running more userland code. System calls are usually required for things like I/O, process control, memory management issues, and other kernel things.
Library calls are, in bulk, generally calls to the standard C library (glibc), but can of course be calls to any library, for example Gtk, libjpeg, libnss, etc. Luckily most glibc functions are well documented and have either man or info pages. Documentation for other libraries varies greatly.
ltrace supports the -r, -tt, -p, and -c options the same as strace. In addition it supports the -S option which tells it to print out system calls as well as library calls.
One of the more useful options is "-n 2" which will indent 2 spaces for each nested call. This can make it much easier to read.
Another useful option is the "-l" option, which allows you to specify a specific library to trace, potentially cutting down on the rather verbose output.
gdb
`gdb` is the GNU debugger. A debugger is typically used by developers to debug applications in development. It allows for a very detailed examination of exactly what a program is doing.
That said, gdb isn't as useful as strace/ltrace for troubleshooting/sysadmin types of issues, but occasionally it comes in handy.
For troubleshooting, it's useful for determining which application created a core file (`file core` will typically show this information too). But gdb can also show you "where" the program crashed. Once you determine the name of the app that caused the failure, you can start gdb with:

gdb filename corefile

The unfortunate thing is that binaries are typically stripped of debugging symbols to make them smaller, so this often returns less than useful information. However, starting with Red Hat Enterprise Linux 3, and also in Fedora, there are "debuginfo" packages. These packages include all the debugging symbols. You can install them the same as any other rpm, so `rpm`, `up2date`, and `yum` all work.
The only difficult part about debuginfo rpms is figuring out which ones you need. Generally, you want the debuginfo package for the src rpm of the package that's crashing.
rpm -qif /path/to/app

will tell you the info for the binary package the app is part of. Part of that info includes the src.rpm. Just use the package name of the src rpm plus "-debuginfo".
FIXME: insert info about debug packages for other systems

top
`top` is a simple text-based system monitoring tool. It packs a lot of information onto the screen, which can be helpful when troubleshooting problems, particularly performance-related ones.
The top of the "top" output includes a basic summary of the system. The top line is the current time, uptime since the last reboot, users logged in, and the load average. The load average values here are the load for the last 1, 5, and 15 minutes. A load of 1.0 is considered 100% utilization, so loads over 1 typically mean processes are having to wait. There is a lot of leeway and approximation in these load values, however.
The memory line shows the total physical ram available on the system, how much of it is used, how much is free, and how much is shared, along with the amount of ram in buffers. These buffers are typically file system caching, but can be other things. On a system with a significant uptime, expect the buffer value to take up all free physical ram not in use by a process. The swap line is similar.
Each of the entries viewable in the system contain several fields by default. The most interesting are RES, %CPU, and time. RES shows the amount of physical ram the process is consuming. %CPU shows the percentage of the available processor time a process is taking, and time shows the total amount of processor time the process has had. A processor intensive program can easily have more "time" in just a few seconds than a long running low cpu process.
Sorting the output
- M : sorts the output by memory usage. Pretty handy for figuring out which version of openoffice.org to kill.
- P : sorts the process by the percentage of cpu time they are using.
- T : sorts by cumulative cpu time used
- A : sorts by age of the process, newest process first
Command line options
The only really useful command line options are:
- b [batch mode] writes the standard top output to stdout. Useful for a quick "system monitoring hack". For example:

top -b -d 360 >> foo.output

will append a snapshot of the system to foo.output every six minutes (the -d option sets the delay between updates, in seconds).
ps
`ps` can be thought of as a one shot `top`. But it's a bit more flexible in its output than top.
As far as `ps` commandline options go, it can get pretty hairy. The Linux version of `ps` inherits ideas from both the BSD version, and the SYSV version. So be warned.
The `ps` man page does a pretty good job of explaining this, so look there for more examples.
One thing to be aware of is that ps behaves differently depending on if a - is prepended to the options:
ps ef

and

ps -ef

are two very different things (BSD versus System V option formatting).
examples
ps aux

shows all the processes on the system in a "user"-oriented format, meaning the username of the owner of each process is shown in the first column.
ps auxww

The "w" option, when used twice, allows the output to be of unlimited width. For apps started with lots of command-line options, this will allow you to see all the options.
ps auxf

The "f" option, for "forest", tries to present the list of processes in a tree format. This is a quick and easy way to see which processes are child processes of which.
ps -eo pid,%cpu,vsz,args,wchan

This is an interesting example of the -eo option, which allows you to customize the output of `ps`. In this case, the interesting bit is the "wchan" field, which attempts to show the kernel function (wait channel) the process is currently sleeping in.
For things like apache httpds, this can be useful to get an idea of what all the processes are doing at one time. See the info in the strace section on understanding system call info for more info.
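As a further sketch of customized output (the format keywords below are standard `ps` keywords; --sort is a GNU/procps option):

```shell
# List every process's pid, parent pid, resident memory (KB), CPU share and
# command name, with the heaviest CPU consumers first.
ps -eo pid,ppid,rss,%cpu,comm --sort=-%cpu | head -10
```

This is often a faster way to answer "what is eating the CPU right now?" than watching `top` scroll.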
sysstat/sar
Sysstat works with two steps, a daemon process that collects information, and a "monitoring" tool.
The start script is typically called "sysstat", and the monitoring tool is called `sar`, which will normally perform its monitoring via the `sadc` command.
To start it, start the sysstat daemon:

sysstat start

To see a list of `sar` options, just try `sar --help`
examples
Things to note: there are lots of command-line options. In the examples below, the trailing number is the interval, i.e. the time in seconds between updates.
sar 3

will run the default sar report every three seconds.
For a complete summary, try:
sar -A

This generates a very large pile of info ;->
To get a good idea of disk i/o activity:
sar -b 3

For something like a heavily used web server, you may want to get a good idea how many processes are being created per second:

sar -c 2

It is kind of surprising to see how many processes can be created.
There's also some degree of hardware monitoring built in. Monitoring how many times an IRQ is triggered can also provide good hints at what's causing system performance problems.
Show the total number of system interrupts
sar -I SUM 3

To watch the standard IDE controller IRQ every two seconds:

sar -I 14 2

Network monitoring is in here too. Show the number of packets sent/received, number of bytes transferred, etc.:

sar -n DEV 2

Show stats on network errors:

sar -n EDEV 2

Memory usage can be monitored with something like:
sar -r 2

This is similar to the output from `free`, except more easily parsed.
For SMP machines, you can monitor per CPU stats with:
sar -U 0

(If your version of sar doesn't support the -U flag, try -P or -u.) Here 0 is the first processor; the keyword ALL will show all of them.
A really useful one on web servers and other configurations that use lots and lots of open files is:
sar -v

This will show the number of used file handles, the percentage of available file handles in use, and the same for inodes.
To show the number of context switches (a good indication of how much time a process is wasting):

sar -w 2

vmstat
This util is part of the procps package and can provide lots of useful information when diagnosing performance problems.
Here's a sample vmstat output on a lightly used desktop:
procs                      memory      swap          io     system      cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs us sy id
 1  0  0   5416   2200   1856  34612   0   1     2     1  140   194  2  1 97

And here's some sample output on a heavily used server:
procs                      memory      swap          io     system      cpu
 r  b  w   swpd    free    buff  cache  si  so    bi    bo    in     cs us sy id
16  0  0   2360  264400   96672   9400   0   0     0     1    53     24  3  1 96
24  0  0   2360  257284   96672   9400   0   0     0     6  3063  17713 64 36  0
15  0  0   2360  250024   96672   9400   0   0     0     3  3039  16811 66 34  0

The interesting numbers here are the first ones, in the "r" column: the number of processes in the run queue. This value shows how many processes are ready to be executed but cannot run at the moment because other processes need to finish. For lightly loaded systems, this is almost never above 1-3; numbers consistently higher than 10 indicate the machine is getting pounded.
Other interesting values include the "system" numbers for in and cs. The in value is the number of interrupts per second the system is receiving. A system doing a lot of network or disk I/O will have high values here, as interrupts are generated every time something is read from or written to the disk or network.
The cs value is the number of context switches per second. A context switch is when the kernel suspends the currently running process and switches the CPU to another one. It's actually _way_ more complicated than that, but that's the basic idea. Lots of context switches are bad, since each one costs a fairly large number of cycles; if you are doing lots of them, you are spending all your time changing jobs and not actually doing any work. I think we can all understand that concept.
A note on Linux memory management
This is one area where the saying "Linux is not Unix" is accurate.
Linux does not manage memory like a traditional Unix (such as HP-UX). Linux frees memory on demand: pages are not actually freed unless there is demand for them. The kernel will use all otherwise-free memory for buffer cache until there is memory pressure.
So, if you are coming from a proprietary Unix to Linux, don't freak out when you see only 10 MB of that 4 GB free: the rest is being used for file system buffer cache.
tcpdump/ethereal
Ethereal will display all the connections it traced during the capture. There are a couple ways to look for bandwidth hogs.
The "Statistics" menu has a couple of useful options. The "Protocol Hierarchy" screen shows what percentage of packets in the trace belongs to each protocol. In the case of a bandwidth hog, at least the protocol of the culprit should be easy to spot here.
The "Conversations" screen is also helpful for looking for bandwidth hogs. Since you can sort the "conversations" by number of packets, the culprit is likely to hop to the top. This isn't always the case, as it could easily be many small connections killing the bandwidth, not one big heavy connection.
As far as tcpdump goes, the best way to spot bandwidth hogs is just to start it up. Since it pretty much dumps all traffic to the screen in a text format, just keep your eyes peeled for whatever seems to come up a lot.
tcpdump can also be used to see if a given service may be unresponsive because your packets are simply not reaching the remote machine. Since tcpdump is a commandline tool, you'll very probably need to add filters - especially when you're firing tcpdump up on a remote machine, where you're logged in via SSH. Otherwise you'll get lots of packet dumps of SSH packets that are telling you of packets dumped that belong to ssh telling you of packets dumped...
tcpdump -l -i eth0 port 25

This will dump all packets aimed at, or originating from, TCP or UDP port 25. The '-l' enables line buffering, so we'll actually see each packet as it crosses the wire.
If you're debugging network connections over an SSH connection, the following will probably be the most frequent way that you'll invoke tcpdump:
tcpdump -l not port 22

And to monitor the communication between server A.local.net (running tcpdump) and the remote server B.remote.net:

tcpdump -l src or dst B.remote.net

The tcpdump filter syntax is actually surprisingly powerful; take 5 minutes and grab your nearest manpage on tcpdump if you need a better filter.
netstat
Netstat is an app for getting general information about the status of network connections to the machine.
netstat

will just show all the currently open sockets on the machine. This includes UNIX domain sockets, TCP sockets, UDP sockets, etc.
One of the more useful options is:
netstat -pa

The `-p` option tells it to try to determine which program has each socket open, which is often very useful info. For example, someone nmaps their system and wants to know what is using port 666. Running netstat -pa will show you which daemon is listening on that TCP port.
One of the most twisted, but useful invocations is:
netstat -t -n | cut -c 68- | sort | uniq -c | sort -n

This will show you a sorted list of how many sockets are in each connection state. For example:

      9 LISTEN
     21 ESTABLISHED
In short, netstat can show:
- what process is doing what and to whom over the network
- the number of sockets open
- socket status
A quick and dirty way to see what daemons are running and accepting connections on your machine is
netstat -tlpn

for TCP services and

netstat -ulpn

for UDP services. Unix domain sockets are usually more abundant than either of these two, and a lot less interesting.
If you're having trouble with network throughput for some reason, try
netstat -s

This will print a summary of the network stack's state counters, going into far more detail than the RX/TX frames-dropped counters of ifconfig. By looking at which counters are rapidly increasing, you may be able to find out why your network throughput is misbehaving.
lsof
/usr/sbin/lsof is a utility that checks to see what all open files are on the system. There's a ton of options, almost none of which you ever need.
This is mostly useful for seeing what processes have what file open. Useful in cases where you need to unmount a partition or perhaps you have deleted some file, but its space wasn't reclaimed and you want to know why.
The EXAMPLES section of the lsof man page includes many useful examples. One of the more common usages is to see which services are accepting network connections over TCP:
lsof -i tcp

fuser
Displays PIDs of processes that are using some filesystem object. Kind of like the small brother of lsof.
The most frequent use is the '-m' option, when you're trying to unmount a filesystem and get an error message telling you that the specified device is busy:

turing:/home/sr# umount /usr
umount: /usr: device is busy
turing:/home/sr# fuser -m /usr
/usr:  2522e  2604e  2646e  2652e  2662e  2761e  2764e  2775e  2798e  2804e
       2843e  2846e  2849e  2988m  3018m  3740e  3741e  3759m  3772m  3773e
       3776e  3779e  3782e  3785e  3789e  3791e  3793e  3828e  3832e  3833m
       3869e  3893e  3907e  3908m  3915e  3999e  4124m  4125m  4127m

These are the PIDs of all the processes working within the '/usr' mountpoint and keeping you from unmounting the filesystem. Check who's what with 'ps ax | grep [PID]' and kill them gently.
ldd
ldd prints out shared library dependencies.
For apps that are reporting missing libraries, this is a handy utility. It shows all the libraries a given app or library is linked to.
For most cases, what you will be looking for is missing libs. In the ldd output, they will show something like:
libpng.so.3 => (file not found)

In this case, you need to figure out why libpng.so.3 isn't being found. It might not be in the standard lib paths, or not in a path listed in /etc/ld.so.conf. Or you may need to run `ldconfig` again to update the ld cache.
ldd can also be useful when tracking down cases where an app is finding a library, but it's finding the wrong library. This can happen if there are two libraries with the same name installed on a system in different paths.
Since the `ldd` output includes the full path to the lib, you can see if anything is pointing at a wrong path. One thing to look for when scanning for this, is one lib that's in a different lib path than the rest. If an app uses libs from /usr/lib, except for one from /usr/local/lib, there's a good chance that's your culprit.
If you are missing a library, be sure to edit your ld config file (typically /etc/ld.so.conf) and re-run ldconfig.
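A quick scan for unresolved libraries can be scripted; /bin/ls here is just a stand-in for whatever binary you are chasing:

```shell
# Report any shared libraries that the dynamic linker cannot resolve
# for a given binary.
bin=/bin/ls
if ldd "$bin" 2>/dev/null | grep -q "not found"; then
    echo "missing libraries for $bin:"
    ldd "$bin" | grep "not found"
else
    echo "all libraries resolved for $bin"
fi
```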
nm
`nm` is a utility that shows all the library symbols an application expects to find. It can be used in combination with `ldd` and `ldconfig` to try to track down library linking problems.
A common case would be a binary compiled against a newer version of a library, which expects symbols that the version of the library the app is dynamically linking against does not have.
file
`file` is a simple utility that tries to figure out what kind of file a given file is. It does this by magic(5).
Where this sometimes comes in handy for troubleshooting is looking for rogue files. A .jpg file that is actually a .html file. A tar.gz that's not actually compressed. Cases like those can sometimes cause apps to behave very strangely.
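A throwaway demonstration of catching such a rogue file (the filename is invented):

```shell
# Create a "JPEG" that is really HTML, then let file(1) expose the mismatch.
printf '<html><body>not an image</body></html>\n' > /tmp/fake.jpg
file /tmp/fake.jpg    # reports HTML or ASCII text, not JPEG image data
```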
netcat / nc
Ah, netcat. That wonderful utility which functions much like the normal cat command, but uses a given host:port for stdin or stdout.
One common usage is to clone a system over a network. Using only a set of commands similar to "dd | netcat", you can clone a system disk at the bit level. Here's what you do (actual commands to follow, but for now...)
On the slave system, boot from a CD (like Knoppix) and issue a command such as
netcat -l -p 5678 | dd of=/dev/sda

Then, on the master, start sending a bit image over the network:

dd if=/dev/sda | netcat <slave_IP> 5678

CHECK THESE COMMANDS - MAY NOT BE 100% ACCURATE
md5sum
`md5sum` is a utility that calculates a checksum of a file. For troubleshooting purposes, you can assume every unique file will have a unique checksum. md5sum is not 100% secure, as it is subject to hash collisions, so for added security use `sha1sum` in addition to md5sum; a simultaneous collision in both is currently considered practically impossible. The `sha1sum` command functions exactly like `md5sum` in these examples.
verifying files
Since an MD5 sum will change if any part of a file changes, it can also be used to verify that a file has not changed. Systems like `tripwire` use this to detect if a file has been compromised in a security breach.
This can be used to see if a file has been modified or corrupted if you know what the MD5 sum is supposed to be.
You can also use it to see if two files are exactly the same or not. A common case is to check to see if a config file has been modified or if it's different from what's in a config management system.
verifying ISOs
Linux distributions are often distributed as CD images or ISOs. An MD5 sum of these images is always provided to verify the integrity of the downloaded ISOs. A few bits missing here and there is enough to make an install a painful experience.
Check the location the ISOs were downloaded to for a text file containing the MD5 sums of the ISOs. It will typically look something like:
2af10158545bc24477381e80412ff209  bar.iso
9761d6ce118a1230bc48b0a59f7b5639  foo.iso

You can run `md5sum` directly on the ISOs:

bash# md5sum bar.iso
2af10158545bc24477381e80412ff209  bar.iso

Or you can often use the md5sums text file as input to `md5sum` to tell it what to check and verify. If the above example was in a file called "iso.md5s":

md5sum -c iso.md5s

That command will check both ISOs, comparing the computed checksum against what the file lists as correct.
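The whole cycle can be sketched with throwaway files in place of real ISOs:

```shell
# Generate checksums for two files, then verify them in one pass.
cd /tmp
echo "alpha" > foo.txt
echo "beta"  > bar.txt
md5sum foo.txt bar.txt > iso.md5s
md5sum -c iso.md5s    # prints "foo.txt: OK" and "bar.txt: OK"
```

If a file is later modified or corrupted, the corresponding line switches to FAILED and md5sum -c exits non-zero.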
md5sum is also a good way to verify a burned CD. Something like:

find /mnt/cdrom -type f -exec md5sum {} \;

will run md5sum on every file on the CD mounted at /mnt/cdrom. Since md5sum checks every bit (literally) of a file, if the CD is bad, there's a good chance this will find it. If the above command causes any errors about the media, chances are the CD is bad. Better to find out now than later.
For recent Red Hat and Fedora based distros, the installer includes an option to perform a mediacheck. This is essentially the same as verifying the ISO MD5 sum by hand; if you have already done that, you can skip the media check.

diff
diff compares two files and shows the difference between them.
For troubleshooting, this is most often used on config files. If one version of a config file works, but another does not, a `diff` of the two files can often be enlightening. Since it can be very easy to miss a small difference in a file, being able to see just the differences is useful.
For debugging during development, diff (especially the versions built into revision control systems like cvs) is invaluable. Seeing exactly what changed between two versions is a great help.
For example, if foo-2.2 is acting weird, where foo-2.1 worked fine, it's not uncommon to `diff` the source code between the two versions to see if anything related to your problem changed.
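For instance, a one-line config change between a working and a broken copy shows up immediately (the filenames and directive here are invented for illustration):

```shell
# Two versions of a config file that differ by a single directive.
printf 'Port 22\nUseDNS no\n'   > /tmp/sshd_config.works
printf 'Port 2222\nUseDNS no\n' > /tmp/sshd_config.broken
diff /tmp/sshd_config.works /tmp/sshd_config.broken
# Output pinpoints just the changed line:
# 1c1
# < Port 22
# ---
# > Port 2222
```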
find
For troubleshooting a system that seems to have suddenly stopped working, find has a few tricks up its sleeve.
When a system stops working suddenly, the first question to ask is "what changed?".
find / -mtime -1

That command will recursively list all files under / that have changed in the last day.
To list all the files in /usr/lib that changed in the last 30 minutes:

find /usr/lib -mmin -30
Similar options exist for ctime and atime. To show all the files in /tmp that have been accessed in the last 30 minutes:

find /tmp -amin -30
The -atime/-amin options are useful when trying to determine if an app is actually reading the files it is supposed to. If you run the app, then run that command where the files are, and nothing has been accessed, something is wrong. If no "+" or "-" is given for the time value, find will match only exactly that time. This is handy in several cases: you can determine which files were modified or created at the same time.
A good example of this is cleaning up from a tar package that was unpacked into the wrong directory. Since all the files will have the same access time, you can use find and -exec to delete them all.
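One robust variant of that cleanup, sketched here with stand-in files, is to use a reference-file timestamp rather than matching an exact minute:

```shell
# Mark the time, "unpack" (simulated here by touch), then list everything
# newer than the mark. Once the list looks right, the same find plus
# -exec rm or -delete can remove the strays.
mkdir -p /tmp/unpack-demo && cd /tmp/unpack-demo
touch /tmp/before-unpack
sleep 1
touch stray1.txt stray2.txt     # stand-ins for a mis-unpacked tarball
find . -type f -newer /tmp/before-unpack
```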
`find` can also find files with particular permissions set. To find all world-writable files from / down:

find / -perm -0002

To find all files in /tmp owned by "alikins":

find /tmp -user alikins

Using find in combination with grep to find markers (errors, filenames, etc.)
When troubleshooting, there are plenty of cases where you want to find all instances of a filename, or a hostname, etc.
To recursively grep a large number of files, you can use find and its exec options. This will grep for "foo" on all files down from the current working directory:
find . -exec grep foo {} \;

Note that in many cases, you can also use `grep -r` to do this as well. Another common usage is with xargs:
find / -print | xargs grep "look for this"

ls/stat
While `ls` is one of the first commands Linux users learn, do not overlook its utility in troubleshooting. It's the easiest way to see what's on the file system.
finding sym links and hard links
A simple `ls -al` will show the contents of a directory. But it will also indicate what files are symlinks.
Normally, having a file being a symlink is fine, but some apps, especially security sensitive apps, are picky about what can and can not be a symlink.
The other thing to look for is dangling or broken symlinks. Some apps don't expect to get handed a symlink that doesn't go anywhere.
file system usage
Some simple `ls` invocations useful for troubleshooting.
Show a detailed view of all files, sorted by the last modified time. Quick, easy way to see if an app is modifying files:
`ls -lart`

Show a detailed view of all files in the current directory, sorted by file size. A quick, easy way to see which files are consuming all of your precious disk space:

`ls -larS`

Show some basic info about what type of file each file is. Maybe that directory the app is looking for is a file, or vice versa?
`ls -F`
df
Running out of disk space causes so many apps to fail in weird and bizarre ways. A quick `df -h` is a pretty good troubleshooting starting point.
Using it is easy; look for any volume that is 100% full. Or in the case of apps that might be writing lots of data at once, reasonably close to being filled.
It's pretty common to spend more time than anyone would like to admit debugging a problem, only to suddenly hear someone yell "Damnit! It's out of disk space!".
A quick check avoids that problem.
In addition to running out of space, it's possible to run out of file system inodes. A `df -h` will not show this, but a `df -i` will show the number of inodes available on each filesystem.
Being out of inodes can cause even more obscure failures than being out of space, so something to keep in mind.
watch
`watch` is a command that executes another command, displays its output, then repeats. It is most useful for repeatedly watching some reporting command. There is also a "-d" option that will highlight any output that changes between invocations.
For example, to watch disk space usage:

watch -d df

Another example is to simply watch `ls -al` output, to look for any tmp files that get created:
watch -d "ls -al"
Note that the above example only runs `ls -al` every two seconds (the default interval), so it will not catch all file creations. "watch" is often used in combination with commands like "ls", "df", "netstat", and "ps".
ipcs/ipcrm
- anything that uses shm/ipc
- oracle/apache/etc
A lot of apps make fairly extensive use of SysV shm and IPC (oracle, apache, gimp, etc). Most of the time, on current Linux systems, this works pretty well. But it's occasionally useful to be able to take a look at what shm is being used and how it's being used. `ipcs` is the tool for that.
One common usage is to check for Oracle's usage of "shared memory glue" (typically noticed as shm_glue), which is the method they use for large SGA creation when they cannot obtain a single shm segment large enough for their needs. A good rule of thumb is that if you see Oracle with a large number of maximum sized shared memory segments, then you have a problem and need to tune your shm sizes and restart Oracle. shm_glue is a performance killer.
Typically, you will use ipcs -ma on Linux to see shared memory, semaphores, and message queues. Here's an example from a lightly loaded system.
# ipcs -ma

------ Shared Memory Segments --------
key        shmid     owner  perms  bytes   nattch  status
0x00000000 8093696   root   600    393216  2       dest
0x00000000 8126465   root   600    393216  2       dest
0x00000000 19759106  root   666    262080  1       dest
0x00000000 19529731  root   600    393216  2       dest
0x00000000 19562500  root   600    393216  2       dest

------ Semaphore Arrays --------
key        semid     owner  perms  nsems

------ Message Queues --------
key        msqid     owner  perms  used-bytes  messages

Searching the web for error messages
A pretty common and often very effective approach to tracking down the cause of errors or problems is searching the web. Using search engines like Google or Yahoo can find documentation, FAQ's, web forum posts, mailing list archives, Usenet posts, and other useful resources.
Start by quoting the entire error message exactly and searching for it. Be sure to put the message in quotes. If it's a common problem, there's a good chance you will get some hits. Anything that looks like a FAQ is a good start; mailing list archives can also be a good source. Just be sure to check the archive indexes for other messages in the discussion.
If you are using a commercial distribution, you could also consider looking up their knowledgebase. Both Red Hat and Suse have useful documents for assisting in troubleshooting in their knowledgebase.
source code
For most Linux distros, you have the source code, so it can often be useful to search through the code for error messages, filenames, or other markers related to the problem. In many cases, you don't really need to be able to understand the programming language to get some useful info.
Kernel drivers are a great example for this, since they often include very detailed info about which hardware is supported, what's likely to break, etc.
On RPM based systems, to install the source code, you want to install the source RPM. To see which source RPM corresponds to a given file or utility, use the command:
rpm -qif /path/to/file
There will be a Source RPM field with the name of the source RPM. If you have the source CD, you can install it from there.
Alternatively, you can use up2date or other package tools to fetch the source RPM.
up2date --get-source packagename
will download the source RPM to /var/spool/up2date.
To install a source RPM, just issue the command:
rpm -Uvh /path/to/package.src.rpm
The source will be installed in /usr/src/redhat/SOURCES, with a spec file in /usr/src/redhat/SPECS, on Red Hat Linux systems. Other distros will be similar.
The easiest way to extract the source is:
rpmbuild -bp /usr/src/redhat/SPECS/package.spec
where package.spec is the spec file of the installed source package.
`find` and `grep` are good tools for searching for the markers of interest.
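For instance, a minimal sketch of that kind of search. The error text, file name, and directory below are made up so the commands can run anywhere; on a real Red Hat system you would point SRC at the tree under /usr/src/redhat/BUILD after running rpmbuild -bp:

```shell
# Fabricate a tiny stand-in "source tree"; in practice SRC would be
# the real unpacked source directory.
SRC=$(mktemp -d)
mkdir -p "$SRC/drivers/net"
printf 'printk("eth0: transmit timed out");\n' > "$SRC/drivers/net/demo.c"

# Find the candidate files, then grep them for the error message
MATCHES=$(find "$SRC" -name '*.c' -exec grep -l "transmit timed out" {} +)
echo "$MATCHES"

rm -rf "$SRC"
```

On a real tree you would typically search for the fixed part of the message (skipping the variable parts like device names or numbers).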
strings
`strings` is a utility that will search through a file and try to find text strings. For troubleshooting sometimes it is handy to be able to look for strings in an executable.
For example, you can run `strings` on a binary to see if it has any hard coded paths to helper utilities. If those utils are in the wrong place, that app may fail.
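As an illustration of the idea, here is a tiny stand-in: a file with binary junk around an embedded path, and a tr/grep pipeline that mimics what `strings somebinary | grep '^/'` does on a real executable (the path /usr/libexec/helper is made up):

```shell
# Build a small "binary" containing an embedded path between junk bytes
F=$(mktemp)
printf '\001\002/usr/libexec/helper\000\003\004' > "$F"

# Poor man's strings: turn non-printable bytes into newlines, then keep
# only the lines that look like absolute paths
FOUND=$(tr -c '[:print:]' '\n' < "$F" | grep '^/')
echo "$FOUND"    # /usr/libexec/helper

rm -f "$F"
```

With the real `strings` utility you would simply run `strings /path/to/binary | grep '^/'`.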
Searching for error messages can help as well, especially in cases where you are not sure what binary is reporting an error message.
In some ways, it's a bit like grep'ing through source code for error messages, but a bit easier. Unfortunately, it also provides far less info.
syslog/log levels
Syslog is a daemon that evolved from a sendmail debugging aid into the catch-all logging facility for Unix. A lot of applications send their log output to syslog, but an application has to do so explicitly; syslog only knows about messages that are handed to it. To keep logs apart, facilities (essentially "categories" in syslog-speak) and severities were introduced during syslog's evolution. The actual filtering of what gets written where is defined in syslog's /etc/syslog.conf(5) file.
Getting stuff into Syslog
Syslog generally can receive messages in three ways:
- Through the syslog() function most languages provide (after an appropriate call to openlog())
- Through named sockets such as /dev/log, which is enabled by default on most distributions
- Via UDP on port 514, if syslogd is running with the -r option (this can be a security hole, since the standard syslog protocol implements no authentication or authorization. Caveat emptor!)
Defining Filters in /etc/syslog.conf
The basic syntax of this file is easy, but it contains some subtleties that can lead you into a long, slow suffering (when using synchronous writes on logfiles, more about that below).
- Empty lines and everything behind a hash mark (#) are ignored
- Rules are of the format
<What> <Goes Where>
What
Your basic "what" is a specification of a facility and a severity delimited by a period:
<facility>.<severity>
This will catch all messages belonging to the given facility that have the given severity or higher.
If you only want to catch messages of exactly the given severity, prefix the severity with an equals sign (=):
<facility>.=<severity>
You can also negate the severity selection by prepending an exclamation mark (!):
<facility>.!<severity>
This selects all messages belonging to the given facility that have a severity lower than the one specified. Note that this also weeds out messages of the given severity itself, which is logical, since the opposite of >= is <.
Of course this can make things tedious if you have to list all combinations of the 20 facilities and 9 severities by hand. So there are shortcuts, such as specifying an asterisk (*) as a catchall:
<facility>.*   -> All messages belonging to <facility>
*.<severity>   -> All messages of the given <severity>
*.*            -> All messages
And then, you can specify lists of "whats", where the "whats" are delimited by semicolons (;):
<facility>.<severity>;<facility>.<severity>
Or, if you want to process the same severities of different facilities, list the facilities delimited by commas (,) first:
<facility>,<facility>.<severity>
To make matters interesting, there is also a special severity called "none", which means that no messages of the given facility are to be logged by this rule:
*.*;<facility>.none -> Log all messages except those of the given facility
Goes Where
After the "What" part with all its twists and turns, the "Where" is actually pretty simple:
</path/to/logfile>
will log everything to the given logfile.
Asynchronously
By default this logging is done with synchronous writes, which means that after each log entry, syslog waits for the operating system kernel to acknowledge that the data has indeed been written to the disk before writing its next entry. This can slow down your system 10-fold for services with extensive logging (especially mail servers!), a factor that has been verified in the wild. So if you can afford to lose the last few log entries in a crash, write your logs asynchronously.
To indicate to syslog that you want log entries to be written asynchronously, prepend a minus (-) to the logfile:
-</path/to/logfile>
This is basically what is needed in 99% of everyday life.
Note that you can specify the same "What" multiple times pointing to different "wheres" for each. The messages will then be logged to all "wheres" given.
Goes Where Again?
Ok, the "Where" part isn't actually all that simple. You have a couple of other choices:
- Remote machines:
@<hostname>
- Named pipes:
|<path to fifo>
- Terminals, by giving their device files as logfiles
- Specific users (if they're logged on), using write:
<user>,<user>
- All users logged on:
*
But again, these are things you don't need that often, and if you do, you'd better read up on them in the manpage first!
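Putting the "What" and "Where" pieces together, a small /etc/syslog.conf might look like this. The file names are the common Red Hat defaults, and the loghost name is made up:

```
# everything at info and above, except mail, synchronously to the main log
*.info;mail.none                /var/log/messages
# mail is chatty: its own file, written asynchronously
mail.*                          -/var/log/maillog
# emergencies go to every logged-in user
*.emerg                         *
# authpriv messages also go to a central loghost via UDP
authpriv.*                      @loghost.example.com
```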
RPM
RPM is the RPM Package Manager. It's a package tool widely used on many Linux distributions, including Red Hat Enterprise Linux, Fedora, Novell, and Mandriva.
It's commonly used to install, update, and remove software and to keep track of software dependencies. The RPM database also includes a lot of information about the software currently installed, and can often be a useful resource for troubleshooting.
using rpm to verify package contents
`rpm` includes support for verifying a file's contents, size, permissions, mtime, user and group ownership, and SELinux context.
If you are having problems with "gaim", you might want to verify if all of the files are correct:
rpm -V gaim
That command checks the on-disk files against the expected values in the RPM database. If a file has been modified, it will show up. See the `rpm` man page for info on decoding the string of characters at the left of the output. But if a file shows up at all, `rpm` thinks something about it has changed, which is often all you need to know without decoding the details.
Also useful is verifying all packages. Sometimes you just don't know what's changed and want an overview of files that have been edited or modified from the original:
rpm -Va
That will take a while on most systems, but it prints a list of all files `rpm` thinks have been modified. Note that on most systems some files will show up that are perfectly acceptable (edited config files, for example).
using rpm to find config files
A good place to start looking when some software is having trouble is the config files. To see a list of the config files for package "up2date":
rpm -q --configfiles up2date
using rpm to see what was installed recently
One of the bits of information `rpm` keeps track of is when a package was installed. Since most software problems originate when software is updated or installed, this is useful information.
To get a list of all installed RPM packages, in order, with their installation dates:
rpm -qa --last
The list is sorted so that the newest packages are at the top. If you are troubleshooting a problem that recently appeared, that's a good place to start looking for clues.
resetting file permissions and user/group info
If you think a file from a package has had its permissions or ownership changed, an easy way to reset them is:
rpm --setperms packagename
For file ownership there is a matching command, rpm --setugids packagename.
ksymoops
To quote from the ksymoops web page, "The Linux kernel produces error messages that contain machine specific numbers which are meaningless for debugging. 'ksymoops' reads machine specific files and the error log and does its best to converts the code to instructions and map addresses to kernel symbols. "
See the man page for more info.
Kernel core dumps (netdump, diskdump and crash)
Netdump and diskdump are utilities for logging kernel crashes. `netdump` sends the core image of the kernel (vmcore) across the network to a netdump server, while `diskdump` writes it to disk. The image can be examined with the `crash` utility.
Netdump and diskdump create a vmcore, a representation of what was in the system's memory when the crash occurred. The `crash` utility is a modified version of gdb which automates the basic steps required to analyse a vmcore.
At the time of writing, `netdump` does not work on Itanium or Itanium II architecture systems.
Netdump
Netdump requires another machine to capture the crash from the crashing kernel. The machine that is crashing is the netdump client; the machine that is going to host the core is the netdump server. One netdump server can capture crashes from multiple clients.
Server Side Configuration
The netdump server does not require any specific network card. The server and client must be on the same subnet, and there must be a clear path (no Network Address Translation or packet modification) between them.
Start the service with the command
service netdump-server start
The server saves the vmcore file in /var/crash. Ensure that there is enough space for the server to receive the file. The following formula approximates the amount of space necessary:
(RAM on client + swap on client) * 1.1
Also note that there is a RAM limit; only the first 4GB of RAM is dumped, so you can feel safe in allocating 5GB per concurrent client dump on your server. For example, if you wanted to have 4 clients dumping at the same time, allocate 20GB of storage for the core files.
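As a quick sanity check of that sizing rule (assuming the formula means RAM plus swap with 10% slack), here is the arithmetic in shell for a hypothetical client with 2 GB of RAM and 1 GB of swap:

```shell
# Hypothetical client: 2048 MB RAM, 1024 MB swap
RAM_MB=2048
SWAP_MB=1024

# (RAM + swap) * 1.1, done in integer arithmetic
NEED_MB=$(( (RAM_MB + SWAP_MB) * 11 / 10 ))
echo "${NEED_MB} MB needed on the netdump server"
```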
The next step is to set the password for the netdump user. Do so with the command:
passwd netdump-user
Be sure to set a strong password for this user.
Client Side Configuration
Currently, only a limited set of hardware is able to send a core to a netdump server. The chosen LAN card for sending the crashdump should support one of the following drivers: 3c59x, e100, e1000, eepro100, pcnet32, tg3, tlan, and tulip.
The next step is to modify /etc/sysconfig/netdump and add the following line:
NETDUMPADDR=10.0.0.222
The address 10.0.0.222 should be the IP address of the machine configured as the netdump server.
Notice, you can also set up the netdump server as a syslog server for messages generated by the client during the crash. Don't worry - the messages will only be logged during a crash and not during the client's normal operation. This is a handy thing to know, since interrupts are disabled on the client during a netdump.
Netdump client will now need to connect to the netdump server and create a set of public/private ssh keys. Enter the command:
service netdump propagate
You should be prompted for a password. Enter the password of the netdump user on the netdump server.
The next step is to start the netdump service. Run the command
service netdump start
Then you need to test-crash your machine. The example given in the Netdump How-To assumes you are using an old 2.4 kernel. For a 2.6 kernel crash module, see this site (http://blog.dkpdev.com) or just copy and paste the code below into a pair of files.
Note that $(PWD) is used in the Makefile, so it would be wise to put these two files in a directory named panic/.
Makefile:
obj-m += panic.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

panic.c:
/*
 * Panic kernel module
 */
#include <linux/module.h>
#include <linux/kernel.h>

#define DRIVER_AUTHOR "Thundarr <[email protected]>"
#define DRIVER_DESC   "A panic module to test NetDump on 2.6 kernels"

MODULE_LICENSE("GPL");
MODULE_AUTHOR(DRIVER_AUTHOR);
MODULE_DESCRIPTION(DRIVER_DESC);

int init_module(void)
{
	printk(KERN_INFO "Panic module inserted to force a crash.\n");
	panic("Panic module inserted to force a crash.\n");
	return 0;
}

void cleanup_module(void)
{
	printk(KERN_INFO "How did we get here? Failed to panic?\n");
}

You can now just insmod panic.ko and watch your box die a painful death. Be sure to stop activity and sync your disks before inserting this module.
The crash should end up in /var/crash/ on the netdump server.
Diskdump
(Yep, I'm still working on this section.)
Supported cards
Cross platform:
* aic7xxx
* aic79xx
* megaraid2
* mpt fusion
* sata_promise
* sym53c8xx
Additionally, ata_piix is supported on the i386, AMD64 and Intel® EM64T architectures. dpt_i2o is supported only on i386.
How do you turn on diskdump?
What DEVICE can you use (in /etc/sysconfig/diskdump)? Can you use a device already in use (/var or swap) or must this be a unused partition?
From what I can see you need an unused partition; you need to format the DEVICE to initialize the service:
service diskdump initialformat
Crash
The crash package can be used to investigate live systems and kernel core dumps created by the netdump or diskdump packages.
xev
`xev` is a small utility that can be used to debug problems with X11. In particular, odd behaviour related to keypresses and mouse clicks can be tracked down.
`xev` just shows all the X11 "events" that get passed to it. For example, if a keypress doesn't seem to be doing what it is supposed to do, you can check to see if X11 is actually getting the keypress, and if so, what value it is getting. For basic troubleshooting, no knowledge of X11 is needed, but `xev` can present a ton of information that only the most diehard X11 hacker cares about.
Related, but more low-level are also the files in /proc/bus/input. If you're having trouble getting an input device to be accepted by X, you can check if you're giving the correct device file/protocols in xorg.conf/xfree86.conf by cross referencing your config file with the information in these proc files.
pmap
`pmap` is part of the "procps" suite of tools. It can be used to display the memory map of a process. It is essentially a wrapper for reading from /proc/PID/maps.
It's useful to be able to see which libraries and modules an app has loaded. `ldd` can show the list of libraries an executable is linked against, but it doesn't know anything about dynamically loaded modules. Many large applications, as well as most scripting languages, make significant use of dynamically loaded modules, so `pmap` can come in handy when trying to diagnose issues that might be related to them.
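Since `pmap` is essentially a formatter over /proc/PID/maps, a rough equivalent for the current shell can be sketched directly. This only lists the file-backed mappings, without the sizes and permissions `pmap` would show:

```shell
# List the unique files mapped into this shell's address space;
# column 6 of /proc/PID/maps is the backing path, when there is one
MAPPED=$(awk '$6 ~ /^\// { print $6 }' /proc/$$/maps | sort -u)
echo "$MAPPED"
```

On a typical Linux system this shows the shell binary itself plus the shared libraries (libc and friends) it has mapped.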
Scripting languages and shell programming
Shell scripting and scripting languages are what make Unix and Linux work. They are everywhere, so knowing how to track down problems with scripts is a handy skill.
For more information, see Scripting Languages
Logs
The key to troubleshooting is knowing what is going on. For core system services, there is a significant amount of logging turned on by default, especially for error cases. The trick is knowing where to look.
For more info, see Log Files
Environment settings
Allowing Core Files
"core" files are dumps of a process's memory. When a program crashes, it can leave behind a core file; loading that file in a debugger can help determine the cause of the crash.
By default, most Linux distributions turn off core file support by setting the maximum allowed core file size to 0.
In order to allow a segfaulting application to leave a core, you need to raise this limit. This is done via `ulimit`. To allow core files of unlimited size, issue:
ulimit -c unlimited
See the section on GDB for more information on what to do with core files.
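A quick way to check what the current shell allows (the soft limit is what a crashing child process will inherit):

```shell
# Raise the soft core-file limit for this shell and its children;
# this may fail if the hard limit is lower than "unlimited"
ulimit -S -c unlimited 2>/dev/null || echo "hard limit prevents raising the core size"

# Show the limit now in effect
CORE_LIMIT=$(ulimit -S -c)
echo "core file size limit: $CORE_LIMIT"
```

Note that `ulimit` settings only apply to the current shell and anything started from it, which is why init scripts sometimes need the setting added explicitly for a daemon to dump core.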
LD_ASSUME_KERNEL
LD_ASSUME_KERNEL is an environment variable used by the dynamic linker to decide which implementation of certain libraries is used. In most cases the important library is the C library, "libc" or "glibc".
The reason glibc is important is that it contains the thread implementation for a system.
The values you can set LD_ASSUME_KERNEL to correspond to Linux kernel versions. Since glibc and the kernel are tightly bound, it's necessary for glibc to change its behaviour based on which kernel version is running.
For properly written apps, there should be no reason to use this setting. However, for some legacy apps that depend on a particular thread implementation in glibc, LD_ASSUME_KERNEL can be used to force the app to use an older implementation.
LD_ASSUME_KERNEL=2.4.20 selects the NPTL thread library. LD_ASSUME_KERNEL=2.4.1 uses the implementation in /lib/i686 (newer LinuxThreads). LD_ASSUME_KERNEL=2.2.5 or older uses the implementation in /lib (old LinuxThreads).
For an app that requires the old thread implementation, it can be launched as:
LD_ASSUME_KERNEL=2.2.5 ./some-old-app
See http://people.redhat.com/drepper/assumekernel.html for more details.
glibc environment variables
There's a wide variety of environment variables that glibc uses to alter its behaviour, many of which are useful for debugging or troubleshooting purposes.
A good reference on these variables is at http://www.scratchbox.org/documentation/general/tutorials/glibcenv.html
Some interesting ones:
LANG and LANGUAGE
LANG sets the default locale, which controls what message catalog is used; the individual LC_* variables override it per category, and LC_ALL overrides everything. LANGUAGE, a GNU extension, sets the list of preferred languages for messages. These variables control the locale-specific parts of glibc.
Lots of programs are written expecting to run in one locale and can break in other locales. Since locale settings can change things like sort order (LC_COLLATE) and time formats (LC_TIME), shell scripts are particularly prone to problems from this.
A script that assumes the sort order of something is a good example.
A common way to test this is to try running the troublesome app with the locale set to "C" or the default locale.
LC_ALL=C ls -al
If the app starts behaving when run that way, there is probably something in the code that assumes the "C" locale (sorted lists and time formats are strong candidates).
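Sort order is an easy place to see the effect. In the C locale, comparison is by byte value, so all uppercase ASCII letters sort before lowercase ones, which can surprise scripts written under a different locale. A two-line example:

```shell
# In the C locale, "B" (byte 0x42) sorts before "a" (byte 0x61)
FIRST=$(printf 'a\nB\n' | LC_ALL=C sort | head -n 1)
echo "$FIRST"    # B
```

Under a typical en_US locale the same input usually sorts "a" first, since collation there is roughly case-insensitive dictionary order.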
glibc malloc stuff
Recent (>5.4.23 for libc/>2.0 for glibc) libc implementations offer a small scale malloc debugger by way of the MALLOC_CHECK_ environment variable. MALLOC_CHECK_ can be set to 3 different values:
- 0: ignore any heap corruption
- 1: print diagnostics on stderr
- 2: call abort(3) as soon as memory corruption is detected
This will help with the kind of memory corruption that can't be found with the tried and proven software engineering method of "staring at the code", but where electric fence or valgrind would be overkill.
Types Of Problems
Software is complicated and there can be a wide variety of problems that occur. But there are categories of problems that come up often, and it's useful to have tools and techniques for solving them.
For more info, see Types Of Problems
App specific troubleshooting info
apache
mod_status
mod_status is an Apache module that can show an HTML page representing various information about the internal status of Apache. This includes number of httpds, their current status, network connections, amount of traffic, etc.
Very useful when trying to track down performance related issues.
module debugging
Some Apache httpd modules include options to enable extra debugging info. Unfortunately, this seems to depend on the module.
log files
Log files, the httpd error logs in particular (typically /var/log/httpd/error_log), are often the best place to look when troubleshooting. They are also where any module debugging information is logged.
Testing the configuration file for syntax errors
Apache comes with an executable called apachectl(8). This program can run a configuration check on Apache's configuration files by issuing the command
apachectl configtest
Some distros (like Red Hat/Fedora) also include this check in Apache's init script, invoking apachectl in the background.
-X debug mode
One of the biggest problems with trying to track down problems with Apache httpd is its multiprocess nature, which makes it difficult to strace the right process or to attach gdb.
To force httpd to run in a single process mode start it with:
httpd -X
Note that on Red Hat Linux boxes you probably need to include the command-line arguments that the init scripts start httpd with. The easiest way to do this is to start httpd normally, run `ps auxwwww`, and cut and paste one of the httpd command lines.
PHP
The following assumes that you know PHP coding.
The most informative (but also most disruptive in a visual sense) thing to do is set
error_reporting = E_ALL
in your php.ini (under Debian: /etc/php/<calling entity>/php.ini). Remember to restart your webserver/calling entity after changing this setting. If you come from the C corner of things, you'll know that good programming style dictates treating warnings and notices as errors. So off you go, clean up that code!
Back and still not working? Ok, now it gets ugly. PHP doesn't come with a debugger like `gdb`. Such things exist, but usually they will be embedded in an IDE that also emulates a web server and costs $$$. So basically you get to do stuff just as in regular shell scripts: debug echos. Echo early, echo often. Hand in hand with echo statements comes the print_r function, which will print arrays/hashes (same thing in PHP) recursively. Drawback here: print_r formats in plain ASCII, not HTML. So you'll either have to look at the page source to see a clean version of the output, or do something ugly like
echo str_replace( "\n", "<br>", print_r($myarray, true) );
(The second argument makes print_r return its output as a string instead of printing it; without it, print_r just returns true and there is nothing to reformat.)
FIXME: can you turn on warnings about variables only used once, like in perl? One of my most frequent errors...
iptables
I have a Windows VPN client behind a Linux gateway doing NAT and I can't connect to the server
First things first, you'll want to know what kind of Windows VPN tunnel you're building. The following will assume the standard PPTP tunnel.
FIXME: What about l2tp tunnels?
You need rules that allow the forwarding of the connections involved, and rules for NATing. The tricky part here is that the PPTP tunnel uses two connections: one going to tcp/1723 on the server, and one GRE tunnel (meaning you can only have one PPTP NATting session active on the gateway at a time). So you'll need the following rules to allow the forwarding:
iptables -A FORWARD -p tcp --dport 1723 -d vpn-server-address -j ACCEPT
iptables -A FORWARD -p gre -d vpn-server-address -j ACCEPT
and the NATing is handled by these rules:
iptables -t nat -A PREROUTING -p tcp --sport 1723 -s vpn-server-address -j DNAT --to-dest vpn-client-ip:1723
iptables -t nat -A PREROUTING -p gre -s vpn-server-address -j DNAT --to-dest vpn-client-ip
iptables -t nat -A POSTROUTING -p tcp --dport 1723 -d vpn-server-address -j SNAT --to-source gateway-public-ip
iptables -t nat -A POSTROUTING -p gre -d vpn-server-address -j SNAT --to-source gateway-public-ip
If you're still having trouble connecting, and Windows is giving you an error 721 (or, if you're watching the data flow with tcpdump, you see the 1723/tcp connection working fine but the GRE tunnel failing because the source IP of the GRE packets is the private IP of the machine running the VPN client), you will need to build the PPTP connection tracking module for the Linux kernel (as of 2.6.x?) and load the following two modules:
modprobe ip_conntrack_pptp
modprobe ip_nat_pptp
Now everything should work as expected.
SSH
Most problems here occur when you're trying to set up logins via RSA/DSA keys (and probably without passwords, too). It usually comes down to basics: make sure that your ~/.ssh directory is owned by your user and set to mode 700, and that ~/.ssh/authorized_keys is set to mode 0600. If these basic conditions aren't met, sshd will refuse to even look at your authorized_keys file and drop you back to password logins.
Another word about the format of the authorized_keys file: it's one key per line. Make sure each added key is on a single line! vi is notorious for adding line breaks if you have 'tw' set in your ~/.vimrc and use copy and paste to add a new key to the file. Use cat or ssh-copy-id instead.
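The permission fixes described above, demonstrated in a scratch directory so the commands can be run anywhere; on a real account the target is of course ~/.ssh:

```shell
# Stand-in for a home directory (real target: $HOME/.ssh)
D=$(mktemp -d)
mkdir -p "$D/.ssh"
touch "$D/.ssh/authorized_keys"

# Directory must not be group/world accessible; the key file stays private
chmod 700 "$D/.ssh"
chmod 600 "$D/.ssh/authorized_keys"

DIR_PERMS=$(ls -ld "$D/.ssh" | cut -c1-10)
KEY_PERMS=$(ls -l "$D/.ssh/authorized_keys" | cut -c1-10)
echo "$DIR_PERMS $KEY_PERMS"    # drwx------ -rw-------

rm -rf "$D"
```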
You can run
ssh -v fred@godot
to see what SSH is up to and where things start hiccupping. You can go all the way up to
ssh -vvv fred@godot
if you really want to know how modulo groups are being prodded. Usually -vv suffices.
I just updated my openssh packages and now I can't login
If the error message is something like "Unsupported Protocol - Remote host closed the connection", it's probably due to an incompatibility between OpenSSH 4.2 and anything pre-4.2. If you have the server under your control, the solution is easy: update the server to 4.2 as well (recommended anyway, as there are some nasty zlib buffer overruns in pre-4.2).
FIXME: What other solutions are there?
Kerberos
When something goes wrong with Kerberos, it's usually down to a few things:
- Something in the network topology changed, mandating that you re-check your /etc/krb5.conf
- Your Kerberos server is unreachable
- You entered a wrong password while generating a keytab file, or the associated user/service name is not known to the server
Unfortunately, while tools like kinit(1) do have a -v option for verbose output, they only start printing useful information after they acquire a TGT from the KDC. It's more useful to watch the logs of the KDC and see what (if anything) actually happens there.
/etc/krb5.conf
This configuration file is read and used by the Kerberos libraries, so any settings here affect everything on your system that uses Kerberos. The most important setting is
[realms]
 <YOUR DEFAULT REALM> = {
  kdc = <IP of your KDC>
 }
There may be several realm definitions within the [realms] section. Be sure that you set the correct IP here; otherwise your Kerberos requests will just hang and time out after a while.
The second most important setting is
[libdefaults]
 default_realm = <YOUR DEFAULT REALM>
This specifies which realm Kerberos tools will use if no explicit realm is given for a request.
Finally, if you're fooling around with a KDC that resides on a Windows 2003 server, be sure that you've enabled arcfour-hmac-md5 and des-cbc-crc as crypto algorithms for the settings default_tgs_enctypes, default_tkt_enctypes and permitted_enctypes in the [libdefaults] section. Otherwise your keytab files will be unreadable.
OpenSwan/IPSEC
Desktop Enviroments
Gnome
- http://dcs.nac.uci.edu/~strombrg/Troubleshooting-a-gnome-problem-early-in-the-login.html
- http://docs.sun.com/app/docs/doc/817-1740
Links
- Linux server system tuning (http://people.redhat.com/~alikins/system_tuning.html) Similar concepts.
- glibc env variables explained (http://www.scratchbox.org/documentation/general/tutorials/glibcenv.html)
- Mac OSX debugging (http://developer.apple.com/technotes/tn2004/tn2124.html)
- Unix and Linux Troubleshooting (http://aplawrence.com/Unixart/troubleshooting.html)
- Linux Troubleshooting Tutorials - Solving Problems (http://www.tutorialized.com/tutorial/Solving-Problems/4521)
- Unix Debugging Tips at sial.org (http://sial.org/howto/debug/unix/)
- How To Be A Programmer (http://samizdat.mines.edu/howto/HowToBeAProgrammer.html) Info on debugging strategies
Credits
Comments, suggestions, hints, ideas, criticisms, pointers, and other useful info from various folks were used to create the original version of this document. Check the history for more.
- Adrian Likins
- Mihai Ibanescu
- Chip Turner
- Chris MacLeod
- Todd Warner
- Nicholas Hansen
- Sven Riedel
- Jacob Frelinger
- James Clark
- Brian Naylor
- Drew Puch
- Ted Johnson
License
This work is licensed under a Creative Commons Attribution 2.5 License (http://creativecommons.org/licenses/by/2.5/). If folks are interested in also applying other licenses (GNU FDL, etc.), let Adrian know.
How to Help
See How To Help for more info.
Retrieved from "http://www.linuxtroubleshooting.com/wiki/index.php?title=Main_Page"
Dec 23, 2018 | hexmode.com
A while back I mentioned Atul Gawande 's book The Checklist Manifesto . Today, I got another example of how to improve my checklists.
The book talks about how checklists reduce major errors in surgery. Hospitals that use checklists are drastically less likely to amputate the wrong leg .
So, the takeaway for me is this: any checklist should start off verifying that what you "know" to be true is true . (Thankfully, my errors can be backed out with very little long term consequences, but I shouldn't use this as an excuse to forego checklists.)
Before starting, ask the "Is it plugged in?" question first. What happened today was an example of when asking "Is it plugged in?" would have helped.
Today I was testing the thumbnailing of some MediaWiki code and trying to understand the $wgLocalFileRepo variable. I copied part of an /images/ directory over from another wiki to my test wiki. I verified that it thumbnailed correctly. So far so good.
Then I changed the directory parameter and tested. No thumbnail. Later, I realized this is to be expected because I didn't copy over the original images. So that is one issue.
I erased (what I thought was) the thumbnail image and tried again on the main repo. It worked again–I got a thumbnail.
I tried copying over the images directory to the new directory, but the new thumbnailing directory structure didn't produce a thumbnail.
I tried over and over with the same thumbnail and was confused because it kept telling me the same thing.
I added debugging statements and still got nowhere.
Finally, I just did an ls on the directory to verify it was there. It was. And it had files in it. But not the file I was trying to produce a thumbnail of.
The system that "worked" had the thumbnail, but not the original file.
So, moral of the story: Make sure that your understanding of the current state is correct. If you're a developer trying to fix a problem, make sure that you are actually able to understand the problem first.
Maybe your perception of reality is wrong. Mine was. I was sure that the thumbnails were being generated each time until I discovered that I hadn't deleted the thumbnails, I had deleted the original.
Back in 2005, I worked on Linux-branded Zones, Solaris containers that contained a Linux user environment. I wrote a coyly-titled blog post about examining Linux applications using DTrace. The subject was honest - we used precisely the same techniques to bring the benefits of DTrace to Linux applications - but the title wasn't completely accurate. That wasn't exactly "DTrace for Linux", it was more precisely "The Linux user-land for Solaris where users can reap the benefits of DTrace"; I chose the snappier title.
I also wrote about DTrace knockoffs in 2007 to examine the Linux counter-effort. While the project is still in development, it hasn't achieved the functionality or traction of DTrace. Suggesting that Linux was inferior brought out the usual NIH reactions which led me to write a subsequent blog post about a theoretical port of DTrace to Linux. While a year later Paul Fox started exactly such a port, my assumption at the time was that the primary copyright holder of DTrace wouldn't be the one porting DTrace to Linux. Now that Oracle is claiming a port, the calculus may change a bit.
What is Oracle doing? Even among Oracle employees, there's uncertainty about what was announced. Ed Screven gave us just a couple of bullet points in his keynote; Sergio Leunissen, the product manager for OEL, didn't have further details in his OpenWorld talk beyond it being a beta of limited functionality; and the entire Solaris team seemed completely taken by surprise.
What is in the port? Leunissen stated that only the kernel components of DTrace are part of the port. It's unclear whether that means just fbt or includes sdt and the related providers. It sounds certain, though, that it won't pass the DTrace test suite which is the deciding criterion between a DTrace port and some sort of work in progress.
What is the license? While I abhor GPL v. CDDL discussions, this is a pretty interesting case. According to the release manager for OEL, some small kernel components and header files will be dual-licensed while the bulk of DTrace - the kernel modules, libraries, and commands - will use the CDDL as they had under (the now defunct) OpenSolaris (and to the consternation of Linux die-hards, I'm sure). Oracle already faces an interesting conundrum with their CDDL-licensed files: they can't take the fixes that others have made to, for example, ZFS without needing to release their own fixes. The DTrace port to Linux is interesting in that Oracle apparently thinks that the CDDL license will make DTrace too toxic for other Linux vendors to touch.
October 5, 2011 | Datamation
Oracle is now updating that kernel to version 2, delivering even more performance thanks to an improved scheduler for high thread count applications like Java. The Unbreakable Enterprise Kernel 2 release also provides transmit packet steering across CPUs, which Screven said delivers lower network latency. There is also a virtual switch that enables VLAN isolation as well as Quality of Service (QoS) and monitoring.
The new kernel also provides Linux containers, which are similar to the Solaris containers, for virtualization isolation.
"Linux containers give you low-overhead operating system isolation," Screven said.
In another nod to Solaris, Oracle is now also bringing Solaris' DTrace to Linux. DTrace is one of the primary new features that debuted in Solaris 10; it gives administrators better visibility into their system's performance.
See also
Verifying that there are no configuration errors in xorg.conf
The xorg.conf file might contain a wrong value or entry. You can try the following procedures:
init 3
cd /etc/X11
mv xorg.conf xorg.conf.old
cp xorg.conf.install xorg.conf
init 5
retest
If the previous method didn't help, try the following:
init 3
sax2 -r
Check the keyboard, mouse and resolution configuration. Save the file, issue init 5 and retest.
I've seen cases where switching runlevels didn't activate the new configuration, so if you don't see any difference you might need to reboot the server to validate the changes from steps 1 and 2.
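The back-up-then-replace pattern from step 1 can be rehearsed safely against a scratch directory instead of the real /etc/X11, so you can see exactly what the file shuffle does before running it as root (the directory and file contents below are placeholders):

```shell
# Stand-in for /etc/X11 so the sketch can run unprivileged.
X11_DIR="./x11-demo"
mkdir -p "$X11_DIR"
echo 'suspect config'    > "$X11_DIR/xorg.conf"
echo 'installer default' > "$X11_DIR/xorg.conf.install"

# Step 1 from above: keep the suspect file under a new name,
# then fall back to the installer-generated configuration.
mv "$X11_DIR/xorg.conf" "$X11_DIR/xorg.conf.old"
cp "$X11_DIR/xorg.conf.install" "$X11_DIR/xorg.conf"

# The old file is preserved so it can be diffed against the working
# config once X is up again.
diff "$X11_DIR/xorg.conf.old" "$X11_DIR/xorg.conf" || true
```

Keeping xorg.conf.old around is the important part: once the installer default works, a diff against the broken file usually points straight at the bad entry.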
July 2, 2008
Linux can be configured to log dmesg output to another system over the network using syslog. This is done with kernel-level networking via UDP port 514. There is a module called netconsole which logs kernel printk messages over UDP, allowing debugging of problems where disk logging fails and serial consoles are impractical. Most modern distros have netconsole as a built-in module. netconsole initializes immediately after the NIC drivers. There are two steps to configure netconsole:
- Syslogd server - let us assume the IP 192.168.1.100 with the FQDN syslogd.nixcraft.in. Please note that the remote host can run either 'netcat -u -l -p <port>' or syslogd.
- All other systems running the netconsole module in the kernel
Step # 1: Configure Centralized syslogd
Login to the syslogd.nixcraft.in server and open the syslogd configuration file. Different UNIX / Linux variants have different configuration files.
Red Hat / CentOS / Fedora Linux Configuration
If you are using Red Hat / CentOS / Fedora Linux, open the /etc/sysconfig/syslog file and set the SYSLOGD_OPTIONS option for UDP logging.
# vi /etc/sysconfig/syslog
Configure syslogd option as follows:
SYSLOGD_OPTIONS="-m 0 -r -x"
Save and close the file. Restart syslogd, enter:
# service syslog restart
Debian / Ubuntu Linux Configuration
If you are using Debian / Ubuntu Linux, open the file /etc/default/syslogd and set the SYSLOGD option for UDP logging.
# vi /etc/default/syslogd
Configure the syslogd option as follows:
SYSLOGD="-r"
Save and close the file. Restart sysklogd, enter:
# /etc/init.d/sysklogd restart
FreeBSD configuration
If you are using FreeBSD, open /etc/rc.conf and set the syslogd_flags option for UDP logging. Please note that FreeBSD accepts network connections by default. Refer to the syslogd man page for more information.
Firewall configuration
You may need to open UDP port 514 to allow network logging. Sample iptables rules to open UDP port 514:
MYNET="192.168.1.0/24"
SLSERVER="192.168.1.100"
iptables -A INPUT -p udp -s $MYNET --sport 1024:65535 -d $SLSERVER --dport 514 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A OUTPUT -p udp -s $SLSERVER --sport 514 -d $MYNET --dport 1024:65535 -m state --state ESTABLISHED -j ACCEPT
Step # 2: Configure Linux Netconsole
You need to configure the netconsole service. Once this service is started, a remote syslog daemon is allowed to record console output from the local system. The local port number that the netconsole module uses is 6666 by default. You need to set the IP address of the remote syslog server to which messages are sent.
Open /etc/sysconfig/netconsole file under CentOS / RHEL / Fedora Linux, enter:
# vi /etc/sysconfig/netconsole
Set SYSLOGADDR to 192.168.1.100 (the IP address of the remote syslog server):
SYSLOGADDR=192.168.1.100
Save and close the file. Restart netconsole service, enter:
# /etc/init.d/netconsole restart
A note about Debian / Ubuntu Linux
Red Hat ships a netconsole init script. Under Debian / Ubuntu Linux, however, you need to configure netconsole manually. Type the following command to start netconsole by loading the kernel netconsole module:
# modprobe netconsole netconsole=6666@192.168.1.5/eth0,514@192.168.1.100/00:19:D1:2A:BA:A8
Where,
- 6666 - Local port
- 192.168.1.5 - Local system IP
- eth0 - Local system interface
- 514 - Remote syslogd udp port
- 192.168.1.100 - Remote syslogd IP
- 00:19:D1:2A:BA:A8 - Remote syslogd Mac
You can add the above modprobe line to /etc/rc.local to load the module automatically. Another recommended option is to create an /etc/modprobe.d/netconsole file with the following contents:
# echo 'options netconsole netconsole=6666@192.168.1.5/eth0,514@192.168.1.100/00:19:D1:2A:BA:A8' > /etc/modprobe.d/netconsole
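The netconsole= string packs six values into one line, so it is easy to transpose them. One way to hedge against typos is to assemble the string from named variables and eyeball the result before loading the module (the addresses below are this article's sample values):

```shell
# Sample values from this article; substitute your own.
LOCAL_PORT=6666
LOCAL_IP=192.168.1.5
LOCAL_DEV=eth0
REMOTE_PORT=514
REMOTE_IP=192.168.1.100
REMOTE_MAC=00:19:D1:2A:BA:A8

# Format: <local-port>@<local-ip>/<dev>,<remote-port>@<remote-ip>/<remote-mac>
OPTS="netconsole=${LOCAL_PORT}@${LOCAL_IP}/${LOCAL_DEV},${REMOTE_PORT}@${REMOTE_IP}/${REMOTE_MAC}"
echo "$OPTS"

# Then, as root:
#   modprobe netconsole "$OPTS"
# or persist it:
#   echo "options netconsole $OPTS" > /etc/modprobe.d/netconsole
```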
How do I verify netconsole is logging messages over UDP network?
Login to the remote syslog UDP server (i.e. 192.168.1.100, our sample syslogd system), enter:
# tail -f /var/log/messages
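On a busy server it helps to filter for the sending host rather than watch the whole stream. A sketch against a fabricated sample of /var/log/messages (the log lines below are invented for illustration; the real field layout varies by distribution):

```shell
# Fabricated sample of what netconsole traffic may look like once
# syslogd records it; only the host field matters for the filter.
cat > messages.sample <<'EOF'
Jul  2 10:15:01 syslogd crond[1234]: session opened
Jul  2 10:15:03 192.168.1.5 kernel: eth0: link up
Jul  2 10:15:04 192.168.1.5 kernel: Oops: 0002 [#1] SMP
EOF

# Keep only kernel messages that arrived from the monitored host.
grep ' 192.168.1.5 kernel: ' messages.sample
```

The same pattern works live with `tail -f /var/log/messages | grep ' 192.168.1.5 kernel: '`.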
/var/log/messages is the default log file under many distributions. Refer to /etc/syslog.conf for the exact location of your file.
How do I use nc / netcat instead of messing with syslogd?
This is a one-minute configuration. You can easily get output on 192.168.1.100 without using syslogd. All you have to do is run the netcat (nc) command on 192.168.1.100:
$ nc -l -p 30000 -u
Login to any other box, enter command:
# modprobe netconsole netconsole=6666@192.168.1.5/eth0,30000@192.168.1.100/00:19:D1:2A:BA:A8
Output should start to appear on 192.168.1.100 from 192.168.1.5 without configuring syslogd or anything else. Note that the remote port in the netconsole= string (30000 here) must match the port nc is listening on.
Further readings:
- nc / netcat command
- modprobe command
- netconsole documentation
- man pages nc, modprobe
Course outline
Section 1
Troubleshooting methodology
Section 2
Tools
- common troubleshooting tools
- RPM queries and verification
- src packages and spec files
- strace, ltrace, lsof, and fuser
- ipcs and ipcrm
- vmstat, iostat, mpstat, and sar
- ifconfig, ip, arp, and route
- name resolution
- netstat and rpcinfo
- nmap and nc
- tcpdump and ethereal
Lab
- exploring and documenting current system configuration state
- troubleshooting techniques with RPM, process related tools, and network related tools
Section 3
Rescue environments
- rescue procedures
- recovery examples
Lab
- using rescue disk
- using mount and chroot to access hard disk
- reinstalling the Master Boot Record (MBR) with grub-install
- Setting up networking statically
- mounting an NFS share
- installing an RPM using the root option
Section 4
- Linux boot process
- booting Linux
- boot process troubleshooting
- process management and troubleshooting
- file systems concepts and troubleshooting
- backups concepts and troubleshooting
Lab
- troubleshooting common system and daemon errors
- restoring files from backup
- booting scenarios: six exercises
- process scenarios: three exercises
- backup scenarios: one exercise
Section 5
- networking commands review and troubleshooting
- Internet Protocol (IP) aliases versus virtual interfaces
- xinetd concepts and troubleshooting
- Transmission Control Protocol (TCP) wrappers concepts and troubleshooting
- iptables concepts and troubleshooting
Lab
- iptables scenario: two exercises
- networking scenarios: four exercises
- TCP wrappers scenarios: two exercises
- xinetd scenarios: four exercises
Section 6
- X11 concepts, troubleshooting, and server operation
- X11 concepts and troubleshooting
- syslog concepts and troubleshooting
- RPM concepts and troubleshooting
- Common UNIX Printing System (CUPS) troubleshooting
- at and cron troubleshooting
Lab
- at and cron scenarios: four exercises
- CUPS scenarios: two exercises
- RPM scenarios: four exercises
- syslog scenarios: three exercises
- X scenarios: seven exercises
Section 7
- users and groups troubleshooting
- Pluggable Authentication Module (PAM) concepts and troubleshooting
- filesystem quotas and quotas troubleshooting
- File Access Control Lists (FACL) and Access Control Lists (ACL) for users or groups
- FACLs and troubleshooting
Lab
- filesystem scenarios: six exercises
- PAM scenarios: four exercises
- quota scenarios: five exercises
- user and group scenarios: five exercises
Section 8
- DNS concepts and troubleshooting
- Apache concepts and troubleshooting
- FTP concepts and troubleshooting
- Squid concepts and troubleshooting
Lab
- Apache scenarios: five exercises
- DNS scenarios: four exercises
- FTP scenarios: two exercises
- Squid scenarios: four exercises
Section 9
- Samba concepts and troubleshooting
- Sendmail concepts and troubleshooting
- Postfix concepts and troubleshooting
- Internet Message Access Protocol (IMAP) and Post Office Protocol (POP) concepts and troubleshooting
Lab
- IMAP/POP scenarios: three exercises
- Postfix scenarios: five exercises
- Samba scenarios: three exercises
- Sendmail scenarios: four exercises
Section 10
- Kernel modules and troubleshooting
- logical volume management and creating logical volumes
- Logical Volume Manager (LVM) deployment issues and troubleshooting
- Redundant Array of Independent Disks (RAID) concepts and troubleshooting
- LDAP and LDAP troubleshooting
Lab
- Kernel module scenarios: three exercises
- LDAP scenarios: three exercises
- LVM scenario: one exercise
- Network Information Service (NIS) scenarios: two exercises
- RAID scenarios: three exercises
The last but not least: Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand. ~ Archibald Putt, Ph.D.
Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.
This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...
Disclaimer:
The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense, so you need to be aware of the Google privacy policy. If you do not want to be tracked by Google, please disable JavaScript for this site. This site is perfectly usable without JavaScript.
Last modified: February 19, 2020