August 21st, 2007
The Application is down! No, wait! Our Unix administrators just checked the server and it is running. They swear by it, and say it is the network team’s equipment dropping packets. The network team checks its systems and swears it is passing the traffic, so it must be an application configuration issue. The application folks, who originally reported the problem, throw up their hands and go back to the systems administrators for more help. Before long the teams are in a quagmire of blame and nothing is getting done towards fixing the issue. How can we step around that? What should we be doing to fix the Application? If you are the systems administrator, then let us try to verify what the network team is saying. If they really are dropping packets, then let us bring some confirmation of that to the table.
Firstly, check your routing tables and your basic network configuration. More than one Unix administrator on a rampage of righteous indignation against the network team has ended up swallowing their pride when it turned out that the error was in their own routing table. So check your IP address, your netmask, and your default router first. All good? Now check if you need or have any other routes in your routing table. All good here, too? Excellent! If you are new to Unix and need help finding this information, read the man pages for ifconfig and netstat to get started. I would love to give some examples, but the details differ considerably from one Unix variant to the next. In general, though, use “netstat -nr” for viewing the routing table and “ifconfig -a” for seeing all of the network interface configuration information.
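One sanity check that is easy to script is whether the default router actually sits inside the subnet implied by your IP address and netmask, which is usually what you want. The following is only a minimal sketch in Python using the standard ipaddress module; the function name gateway_on_link and the addresses are made up for illustration, and you would feed it the values you read from “ifconfig -a” and “netstat -nr”.

```python
import ipaddress


def gateway_on_link(address: str, netmask: str, gateway: str) -> bool:
    """Return True if the gateway lies inside the interface's own subnet."""
    # strict=False lets us pass a host address instead of the network address.
    network = ipaddress.ip_network(f"{address}/{netmask}", strict=False)
    return ipaddress.ip_address(gateway) in network


if __name__ == "__main__":
    # Hypothetical values; substitute what your own system reports.
    print(gateway_on_link("192.168.1.10", "255.255.255.0", "192.168.1.1"))
    print(gateway_on_link("192.168.1.10", "255.255.255.0", "192.168.2.1"))
```

A result of False does not always mean the configuration is broken, but it is a good reason to look at the netmask and default route again before blaming anyone else.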
Now that your IP configuration is confirmed, let us move on to the connection in question. Firstly, you have to identify the port and protocol of the connection. If you are running a standard protocol, such as HTTP, then you can likely assume port 80, but perhaps you do not know? We have a couple of options.
There are a myriad of tools to see what connections and proto-connections exist on the system. I like lsof because I can view the connections for each individual process. So let us say I have that HTTP server I mentioned above running on my Unix server, but I am not sure if it is running on port 80 or some other port. I can start the web server process, note what its process ID is with ps, and then use lsof to view its listening sockets to see what port the clients should connect to.
I am using Mac OS X today, which has a BSD style ps command:
% ps auxw | grep httpd
root 13888 0.0 -0.3 41732 5348 ?? Ss Tue05PM 0:15.24 /usr/sbin/httpd
www 14549 0.0 -0.0 32636 796 ?? S Wed09AM 0:00.00 /usr/sbin/httpd
When using a System V style ps I use “ps -ef” instead of “ps auxw”. The process ID here for httpd is 13888 (I could also use 14549 because they share the port, in a sense, but that may not always be the case). So let us run lsof to see what ports are open:
% sudo lsof -Pnp 13888 | egrep 'TCP|UDP'
httpd 13888 root 16u IPv4 0x76fe710 0t0 TCP *:80 (LISTEN)
It is port 80! And in particular it is TCP port 80. Note that this is opposed to UDP port 80, which is another beast altogether that we will address below. For now let us move on to testing this connection.
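If you would rather script this lookup than read the lsof output by eye, the listening ports can be pulled out with a small parser. This is a rough sketch in Python that assumes the “lsof -Pn” line layout shown above; the function name listening_tcp_ports and the regular expression are my own, and lsof output can vary between versions and platforms.

```python
import re
from typing import List, Tuple

# Matches the tail of an "lsof -Pn" line for a listening TCP socket,
# for example: "TCP *:80 (LISTEN)".
LISTEN_RE = re.compile(r"\bTCP\s+(\S+):(\d+)\s+\(LISTEN\)")


def listening_tcp_ports(lsof_output: str) -> List[Tuple[str, int]]:
    """Return (bind address, port) pairs for every listening TCP socket."""
    ports = []
    for line in lsof_output.splitlines():
        match = LISTEN_RE.search(line)
        if match:
            ports.append((match.group(1), int(match.group(2))))
    return ports


# The line from the lsof run above, used as sample input.
sample = "httpd 13888 root 16u IPv4 0x76fe710 0t0 TCP *:80 (LISTEN)"

if __name__ == "__main__":
    print(listening_tcp_ports(sample))
```

A bind address of “*” means the process is listening on every interface, which is worth noting when a service turns out to be bound only to the loopback address.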
Testing TCP Connectivity
The most widespread tool for checking if there is a firewall between two systems blocking a particular TCP port is telnet. It is not the best utility, but it does the job and is ubiquitous; you will find it on almost all Unix and Windows systems, and sometimes even VMS and mainframes.
To use telnet, just get to a command line and type “telnet” followed by the target system and port number:
telnet my-web-server.example.com 80
If all is well and you can reach the server, you should see something like this:
% telnet my-web-server.example.com 80
Trying 192.168.1.10...
Connected to my-web-server.example.com.
Escape character is '^]'.
If the service is down, but there is no firewall intervention, you will see something along the lines of “connection refused” like below:
% telnet my-web-server.example.com 80
Trying 192.168.1.10...
telnet: connect to address 192.168.1.10: Connection refused
telnet: Unable to connect to remote host
But what if there is a firewall in between dropping all the packets? You will see the “Trying” message as telnet attempts the connection, and then telnet will seem to just freeze for many seconds. After some time it will report a failure on its own, but once it has been frozen for more than ten seconds you can terminate the connection attempt with control-C. Here is what it looks like:
% telnet my-web-server.example.com 80
Trying 192.168.1.10...
(At this point telnet will stop and wait for a few minutes.)
telnet: connect to address 192.168.1.10: Operation timed out
Note that even if it looks like a firewall is dropping the packets, there may be any number of other causes:
- Host-based firewall on source host blocking outbound traffic.
- Host-based firewall on destination host blocking inbound traffic.
- Target host is down.
- Either host has an error in its routing table.
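The three telnet outcomes above (connected, refused, timed out) can also be checked from a script, which is handy when telnet is not installed or you want to test many hosts. Here is a minimal sketch in Python using only the standard socket module; the function name probe_tcp and its return strings are my own invention, and the ten second default simply mirrors the rule of thumb above.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 10.0) -> str:
    """Classify a TCP connection attempt the way we read telnet's output."""
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all: packets dropped, host down, or a routing problem.
        return "timed out"
    except OSError as exc:
        # Anything else, such as a name that does not resolve.
        return f"error: {exc}"
```

Calling probe_tcp("my-web-server.example.com", 80) would return one of those strings. As with telnet, “timed out” only tells you that nothing came back; it cannot tell you which of the causes listed above is to blame.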
Testing UDP Connectivity
Sometimes applications use UDP instead of TCP for their communications. This makes it trickier to determine whether you have a network issue between nodes, because UDP does not require the other host to reply: a packet that is received looks exactly the same to the sender as a packet that never reaches its destination. There is one exception. If your UDP packet arrives at a host where nothing is listening on that port, the host will normally notify you that it is not accepting packets there (with an ICMP “port unreachable” message), and that is a behavior you can measure.
So first make a little extra effort to ensure both hosts are up and that the routing table works between them. You can do this by probing any TCP port (open or closed) or with an ICMP ping. As long as those packets can get through you know your routing tables are okay (note this does not rule out host-based firewalls as a problem as specific ports may be filtered).
We can try to send a generic packet to the other host with either ping or traceroute—those two are widely available—or netcat if you have it. Netcat is the best if it is available:
% nc -vvuz my-cifs-server.example.com 137
my-cifs-server.example.com [192.168.1.20] 137 (netbios-ns) open
sent 0, rcvd 0
The means of getting ping or traceroute to send a single UDP packet to a specific port varies per system and is even impossible in some (all?) versions of Windows. Because of these differences, I recommend installing netcat and using it if you need to debug UDP connections (it is also better than telnet for debugging TCP connections).
Netcat reports “open” whenever no error comes back, which is exactly what a silently dropped packet looks like too. We only know for certain that we do not have a network issue if the port shows up as “closed”, so an “open” result means we have more research to do before we can confirm a network problem. Note that if you have the luxury of being able to shut down the UDP service on the destination host, then you should do so for testing so you can observe the closed state.
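The same three-way distinction can be reproduced without netcat. The following is a minimal sketch in Python, assuming IPv4 and a Unix-like system that reports the ICMP “port unreachable” error back to a connected UDP socket; the function name probe_udp, the one-byte payload, and the return strings are made up for illustration.

```python
import socket


def probe_udp(host: str, port: int, payload: bytes = b"\x00",
              timeout: float = 3.0) -> str:
    """Send one UDP datagram and report what, if anything, came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        # connect() on a UDP socket sends nothing. It only fixes the peer
        # so the kernel can hand ICMP errors for it back to this socket.
        sock.connect((host, port))
        try:
            sock.send(payload)
            sock.recv(4096)
            return "replied"
        except (ConnectionRefusedError, ConnectionResetError):
            # An ICMP port unreachable came back, so the network path works.
            # (Windows tends to report this as a connection reset.)
            return "closed"
        except socket.timeout:
            # Delivered and ignored, or dropped on the way: we cannot tell.
            return "no reply"
```

Both “replied” and “closed” prove that packets make the round trip. Only “no reply” is ambiguous, and that is the case that needs the packet sniffer.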
The next tool you need to fully debug a broken UDP connection is a packet sniffer. I recommend Wireshark if you are on Windows, or tcpdump if you are on any non-Solaris Unix system. On Solaris, use the built-in “snoop” command. Each of these has a somewhat different usage:
tcpdump -i en1 'host my-cifs-server.example.com and udp and port 137'
snoop -d ce0 'host my-cifs-server.example.com and udp and port 137'
Wireshark is usually run in a GUI, particularly on Windows.
Get your packet sniffer running and start talking to your port. If you do not see your packets going out, you are doing something wrong. Go back and figure out if you are sniffing the wrong interface or have an error in your usage. If you do see your packets going out but none returning, then it is still inconclusive, but at least you know you are sniffing the right thing. If you do see return packets, your connection is open and you can proceed with troubleshooting your application instead of the network.
The key here is to get the other server to send packets back in response to packets sent. If you can do that then you know the connection works. Unfortunately, many servers just ignore stuff they consider garbage, like the packets we were sending with netcat.
In our case, we are trying to connect to a NetBIOS name service running on UDP port 137. Some such servers reply to garbage with garbage (success for us!) and some just drop the packets, leaving us confused. To get these guys stirred up we need to generate some real NetBIOS name service traffic from our workstations or servers. In this case, we would log into our Windows workstation and try to connect to a file share on the server; we could also run the nbtstat command instead. In either case, with our sniffer armed and ready we will hope it is enough to generate a reply. If there is truly a NetBIOS name service running on the remote port, we should see evidence of it in the sniffer output. If not, then it is at this point that we can turn towards the network and in good conscience assume all is well from the server side (because you did check your host-based firewall settings, right?).
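If no Windows workstation is handy, a script can build a real name service query for you. The sketch below, in Python, assembles a NetBIOS node status request for the wildcard name “*” as I understand the packet layout in RFC 1002, which is roughly the question that “nbtstat -A” asks. The function names and the transaction ID are arbitrary, so treat this as an illustration and confirm what actually goes out with your sniffer.

```python
import struct


def encode_netbios_name(name: bytes) -> bytes:
    """First-level encoding: each half-byte becomes a letter 'A' to 'P'."""
    # The wildcard name is padded with NUL bytes to 16 bytes (ordinary
    # NetBIOS names are padded with spaces instead).
    padded = name.ljust(16, b"\x00")[:16]
    encoded = bytearray()
    for byte in padded:
        encoded.append(ord("A") + (byte >> 4))
        encoded.append(ord("A") + (byte & 0x0F))
    return bytes(encoded)


def build_node_status_query(transaction_id: int = 0x1234) -> bytes:
    """Return the bytes of a node status (NBSTAT) query for the name '*'."""
    # Header: ID, flags (plain query), one question, no other records.
    header = struct.pack(">HHHHHH", transaction_id, 0x0000, 1, 0, 0, 0)
    # Question name: length byte 0x20, 32 encoded bytes, zero terminator.
    question_name = b"\x20" + encode_netbios_name(b"*") + b"\x00"
    # Question type 0x0021 (NBSTAT) and class 0x0001 (Internet).
    question_tail = struct.pack(">HH", 0x0021, 0x0001)
    return header + question_name + question_tail
```

Sending the result of build_node_status_query() to UDP port 137 on the server with an ordinary UDP socket should provoke a reply from a working name service, and that reply is the evidence we are hoping to see in the sniffer.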
Debugging network connectivity issues does not have to be a nightmare. With the right tools, knowledge, and experience you can be a pro at determining whether an issue needs resolution in your application, in the OS it is running on, or in the network. Be the hero who knows, instead of guesses, what is happening out there, because the resolution often follows.