Hack and / - Linux Troubleshooting, Part III: Remote Networks
This column is the third in a series dedicated to one of my favorite subjects: troubleshooting. Because my column generally is aimed more at tips and tricks and less on philosophy and design, I don't talk much about overall approaches to problem solving. Instead, in this series, I describe some general classes of problems you might find on a Linux system, and then I discuss how to use common tools, most of which are probably already on your system, to isolate and resolve each class of problem.
In my previous column, I introduced some ways to troubleshoot network problems on your local network. Many network problems extend past your local network and either onto other local subnets or onto the Internet itself. In this column, I provide you with the tools and techniques for answering that immortal question: is the Internet down, or is it just me?
The scenario I use here to test troubleshooting skills is one that everyone has run into at one point or another—you try to load a Web site, perhaps even a reliable site like Google, and it won't come up. Because I covered local network troubleshooting in my last column, I'm assuming you already have gone through those steps and are ready to proceed past the local network. Even though this example deals with testing access to the Internet, you can use the same steps to troubleshoot problems accessing any remote network.
For your computer to communicate with any other computer outside your local network, you must have a gateway (router) configured on your local network, and you must be able to reach it. Without getting into heavy-duty network theory, a router connects two or more networks and knows how to route packets between those networks. Your Linux computer has a list of all of the routers it knows about for each network of which it is a member and when it should use those routers all stored in its routing table. You can use the route command to show your computer's current routing table:
$ route -n Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface 10.1.1.0 * 255.255.255.0 U 0 0 0 eth0 default 10.1.1.1 0.0.0.0 UG 100 0 0 eth0
In the above example, I have one gateway defined: 10.1.1.1. It is listed as my default gateway, which is the router it will use whenever it doesn't have any other routers defined for that network. In my case, it's also the only router in my routing table. That means any time my machine wants to communicate with a remote network (in my example, anything that's not within 10.1.1.0/255.255.255.0 or 10.1.1.1–10.1.1.254), it's going to send the packet to 10.1.1.1 to forward on.
So now that I know my default gateway, I use ping to test whether it's available:
$ ping -c 5 10.1.1.1 PING 10.1.1.1 (10.1.1.1) 56(84) bytes of data. 64 bytes from 10.1.1.1: icmp_seq=1 ttl=64 time=3.13 ms 64 bytes from 10.1.1.1: icmp_seq=2 ttl=64 time=1.43 ms 64 bytes from 10.1.1.1: icmp_seq=3 ttl=64 time=1.79 ms 64 bytes from 10.1.1.1: icmp_seq=5 ttl=64 time=1.50 ms --- 10.1.1.1 ping statistics --- 5 packets transmitted, 4 received, 20% packet loss, time 4020ms rtt min/avg/max/mdev = 1.436/1.966/3.132/0.686 ms
In this example, four out of five ping packets were received, so I can be reasonably sure my gateway works. If I couldn't ping the gateway, either my network admin is blocking ICMP packets (I hate when people do that), my switch port is set to the wrong VLAN, or my gateway is truly down. If the gateway is down, fixing the problem might mean rebooting your DSL or wireless router (if that's how you connect to the Internet) or moving your troubleshooting to whatever device is acting as your gateway.
In my case, I was able to ping the gateway, so I'm ready to move on to DNS. Because most of us don't browse the Web by IP address, we need DNS to resolve the hostnames we type into IP addresses. If DNS isn't working correctly, even if we technically can reach that remote IP address, we never will know what the IP address is.
A basic way to test DNS is via the nslookup command:
$ nslookup www.linuxjournal.com Server: 10.2.2.2 Address: 10.2.2.2#53 Non-authoritative answer: Name: www.linuxjournal.com Address: 76.74.252.198
In this example, DNS is functioning correctly as far as I can tell. I say as far as I can tell, because I'm assuming that 76.74.252.198 is the correct IP address for www.linuxjournal.com. If it were the wrong address, that very well could be the cause of the problem! The DNS server in this case is 10.2.2.2, but in some environments, it could be the same IP address as your gateway.
Even though the DNS server worked, because I want to show how to troubleshoot DNS, I need some examples of how it can fail. To illustrate this, let me show a few different nslookup commands that have failed:
$ nslookup www.linuxjournal.com ;; connection timed out; no servers could be reached
This error tells me that nslookup couldn't communicate with my DNS server. That could be because either I don't have any name servers configured on my system or I just can't reach them. To see whether I have any name servers configured, I would check my /etc/resolv.conf file. This file keeps track of what name servers I should use. In my case, it would look like this:
search example.net nameserver 10.2.2.2
If your resolv.conf file doesn't have a name server entry, you have found the problem. You need to add the IP address of your name server here. Because I do have a name server defined in resolv.conf, the next step is to attempt to ping the name server's IP with the same ping command that I used for the gateway above. If you can't ping the name server, either a firewall is blocking ICMP (those pesky network administrators!) or there's a routing problem between you and the name server. To rule out the latter, use a tool called traceroute. Traceroute tests the route between you and a remote IP address. To use it, type traceroute followed by the IP address you want to reach. In my case, I would use 10.2.2.2:
$ traceroute 10.2.2.2 traceroute to 10.2.2.2 (10.2.2.2), 30 hops max, 40 byte packets 1 10.1.1.1 (10.1.1.1) 5.432 ms 5.206 ms 5.472 ms 2 10.2.2.2 (10.2.2.2) 8.039 ms 8.348 ms 8.643 ms
In this example, I can route to 10.2.2.2 successfully. To get there, my packets first go to 10.1.1.1 and then move straight to 10.2.2.2. This tells me that 10.1.1.1 is likely the gateway for both networks. If there are more routers between you and your remote server, you will have more hops in between. On the other hand, if you do have a routing problem, your output might look more like the following:
$ traceroute 10.2.2.2 traceroute to 10.2.2.2 (10.2.2.2), 30 hops max, 40 byte packets 1 10.1.1.1 (10.1.1.1) 5.432 ms 5.206 ms 5.472 ms 2 * * * 3 * * *
If you start seeing asterisks in the output, you know the problem likely begins on the last router on the list, so you would need to start troubleshooting from that router. Instead, you might see output like this:
$ traceroute 10.1.2.5 traceroute to 10.1.2.5 (10.1.2.5), 30 hops max, 40 byte packets 1 10.1.1.1 (10.1.1.1) 5.432 ms 5.206 ms 5.472 ms 1 10.1.1.1 (10.1.1.1) 3006.477 ms !H 3006.779 ms !H 3007.072 ms
This means your ping timed out at the gateway, so the remote host could be down, unplugged or otherwise inaccessible, so you would need to troubleshoot its connection to the network.
Note: traceroute relies on ICMP, so if ICMP is blocked on your network, install a tool called tcptraceroute to perform a similar test over TCP (the syntax is the same, you just type tcptraceroute instead of traceroute).
If you can ping the name server but it isn't responding to you, go back to my previous column and perform all the troubleshooting steps to test whether the remote port is open and accessible on the remote host. Keep in mind though that DNS servers use port 53 on TCP and UDP. Again, if you aren't sure what port a service uses, check the /etc/services file on your system. It lists most of the common services you will use.
Another common nslookup error you might run into is this:
$ nslookup web1 Server: 10.2.2.2 Address: 10.2.2.2#53 ** server can't find web1: NXDOMAIN
Here my name server at 10.2.2.2 responded to me but told me it couldn't find the record for server web1. This error could mean that I don't have web1's proper domain name in my DNS search path. If you don't specify a host's fully qualified domain name (for instance, web1.mysite.com) but instead use the shorthand form of the hostname, your system will check /etc/resolv.conf for domains in your DNS search path. It then will add those domains one by one to the end of your hostname to see if it resolves. The DNS search path is the line in /etc/resolv.conf that starts with the word search:
search example.net example2.net nameserver 10.2.2.2
In my case, when I search for web1's IP address, my system will first search for web1.example.net, and if that has no records, it will search for web1.example2.net. If you want to test whether this is the problem, simply run nslookup again but with the fully qualified domain name (such as web1.mysite.com). If it resolves, either make sure you always use the fully qualified domain name when you access that server, or add that domain to the search path in /etc/resolv.conf.
If you try nslookup against the fully qualified domain name and you still get the same NXDOMAIN error above, your problem is with the name server itself. Troubleshooting the full range of DNS server problems is a bit beyond what I could reasonably fit in this column, but here are a few steps to get you started. If you know your DNS server is configured to have the record you are looking for itself, you need to examine its zone records to make sure that particular hostname exists. If, on the other hand, you are searching for a domain for which you know it doesn't have a record (say, www.linuxjournal.com), it's possible your DNS server isn't allowing recursive queries from your host or at all. You can test that by trying to resolve some other remote host on the Internet. If it doesn't resolve, it's probably a recursion setting. If it does resolve, the problem might very well be with that remote site's DNS server.
If after all these tests you find that your DNS servers are working fine, but you still can't access the remote server, the final step is to perform another traceroute like above, only directly against the remote server. So for instance, if you wanted to test your route to www.linuxjournal.com, the traceroute might look like the following:
$ traceroute www.linuxjournal.com traceroute to www.linuxjournal.com (76.74.252.198), 30 hops max, ↪60 byte packets 1 10.1.1.1 (10.1.1.1) 1.016 ms 2.222 ms 2.308 ms 2 75-101-46-1.dsl.static.sonic.net (75.101.46.1) 6.916 ms ↪7.389 ms 8.386 ms 3 921.gig0-3.gw.sjc2.sonic.net (75.101.33.221) 11.265 ms ↪12.435 ms 13.050 ms 4 108.ae0.gw.equinix-sj.sonic.net (64.142.0.73) 13.846 ms ↪15.233 ms 15.390 ms 5 GIG2-0.sea-dis-2.peer1.net (206.81.80.38) 35.149 ms ↪36.272 ms 36.944 ms 6 oc48.so-2-1-0.sea-coloc-dis-1.peer1.net (216.187.89.190) ↪37.340 ms 27.884 ms 27.266 ms 7 10ge.ten1-2.sj-mkp16-dis-1.peer1.net (216.187.88.202) ↪28.421 ms 29.014 ms 29.688 ms 8 10ge.ten1-2.sj-mkp2-dis-1.peer1.net (216.187.88.134) ↪30.903 ms 31.015 ms 31.804 ms 9 10ge-ten1-3.la-600w-cor-1.peer1.net (216.187.88.130) ↪40.840 ms 41.279 ms 42.069 ms 10 10ge.ten1-1.la-600w-cor-2.peer1.net (216.187.88.146) ↪42.587 ms 43.710 ms 44.921 ms 11 10ge-ten1-2.dal-eqx-cor-1.peer1.net (216.187.124.122) ↪81.702 ms 82.959 ms 83.934 ms 12 10ge-ten1-1.dal-eqx-cor-2.peer1.net (216.187.124.134) ↪74.876 ms 72.454 ms 72.798 ms 13 10ge-ten1-3.sat-8500v-cor-2.peer1.net (216.187.124.178) ↪80.224 ms 81.872 ms 82.569 ms 14 216.187.124.110 (216.187.124.110) 83.499 ms 84.162 ms ↪85.048 ms 15 www.linuxjournal.com (76.74.252.198) 85.484 ms 86.461 ms ↪87.153 ms
In this example, I'm 15 hops (or routers) away from the www.linuxjournal.com server. This is an example of a successful query, but if you ran the same query and noticed a number of rows of asterisks that never made it to your destination and you couldn't ping www.linuxjournal.com directly, the problem could be an Internet routing issue between you and the remote network. Unfortunately, it's probably something outside your control, but fortunately, these sorts of problems tend to resolve themselves pretty quickly, so just keep trying.
If, on the other hand, your traceroute command was successful, but the remote site still didn't work, go back to the steps I discussed in my previous column on how to use telnet and nmap to test whether a remote port is open. It actually could be that the remote server is down (hey, it happens to the best of us) or that someone has configured a firewall to block you from that remote server.
I hope this series has kindled (or rekindled) your interest in troubleshooting under Linux. One of the things I love about Linux is how little it hides from you about how it works and how many troubleshooting tools it provides when things do go wrong. If this has piqued your interest, there are many more troubleshooting avenues for you to explore—from DNS servers like I mentioned above, to troubleshooting just about any type of service. Also, if you have any other great tools or techniques you use to track down these problems, drop me a line. I'm always on the lookout for tools to solve problems faster.
Kyle Rankin is a Systems Architect in the San Francisco Bay Area and the author of a number of books, including The Official Ubuntu Server Book, Knoppix Hacks and Ubuntu Hacks. He is currently the president of the North Bay Linux Users' Group.