\documentclass[a4paper]{article}
\usepackage[scale=0.9]{geometry}
\usepackage{multicol}
\begin{document}
\begin{multicols}{2}
The reader should be able to list the stages of the encapsulated data flow process. The reader should also be able to compare the logical layers of the OSI and TCP/IP networking models and identify the logical layers used by devices on a network. The reader should understand:
\begin{itemize}
\item The stages of the general troubleshooting process
\item The bottom-up troubleshooting approach
\item The top-down troubleshooting approach
\item The divide and conquer troubleshooting approach
\item How to select an effective troubleshooting approach based on a specific situation
\item The process of gathering symptoms from a network
\item Guidelines for gathering symptoms from a user
\item The process of gathering symptoms from an end-system
\end{itemize}
\section{Overview}
Troubleshooting networks is more important than ever. As time goes on, services continue to be added to networks, and with each added service come more variables. This adds to the complexity of network troubleshooting as well as of the network itself. Organizations increasingly depend on network administrators and network engineers having strong troubleshooting skills.

Troubleshooting begins by looking at a methodology that breaks down the process of troubleshooting into manageable pieces. This permits a systematic approach, minimizes confusion, and cuts down on time otherwise wasted with trial-and-error troubleshooting.

Network engineers, administrators, and support personnel realize that troubleshooting takes the greatest percentage of their time. One of the primary goals of this module is to present efficient troubleshooting techniques in order to shorten overall troubleshooting time when working in a production environment.

Two extreme approaches to troubleshooting almost always result in disappointment, delay, or failure.
On one extreme is the theorist, or rocket scientist, approach. On the other is the practical, or caveman, approach. Since both of these approaches are extremes, the better approach is somewhere in the middle, using elements of both.

The rocket scientist analyzes and re-analyzes the situation until the exact cause at the root of the problem has been identified and corrected with surgical precision. This sometimes requires taking a high-end protocol analyzer and collecting a huge sample, possibly megabytes, of the network traffic while the problem is present. The sample is then inspected in minute detail. While this process is fairly reliable, few companies can afford to have their networks down for the hours, or days, it can take for this exhaustive analysis.

The caveman's first instinct is to start swapping cards, cables, hardware, and software until miraculously the network begins operating again. This does not mean that the network is working properly, just that it is operating. Unfortunately, the troubleshooting section in some manuals actually recommends caveman-style procedures as a way to avoid providing more technical information. While this approach may change the symptoms faster, it is not very reliable and the root cause of the problem may still be present. In fact, the parts used for swapping may include marginal or failed parts swapped out during prior troubleshooting episodes.

Analyze the network as a whole rather than in a piecemeal fashion. One technician following a logical sequence will almost always be more successful than a gang of technicians, each with their own theories and methods for troubleshooting.
\section{2.1 Using a Layered Architectural Model to Describe Data Flow}
\subsection{2.1.1 Encapsulating data}
Logical networking models separate network functionality into modular layers. These modular layers are applied to the physical network to isolate network problems and even create divisions of labor.
For example, if the symptoms of a communications problem suggest a physical connection problem, the telephone company service person can focus on troubleshooting the T1 circuit that operates at the physical layer. The repair person does not have to know anything about TCP/IP, which operates at the network and transport layers, or attempt to make changes to devices operating outside of the realm of the suspected logical layer. The repair person can concentrate on the physical circuit. If it functions properly, then either the repair person or a different specialist looks at areas in another layer that could be causing the problem.

The Open Systems Interconnection (OSI) model provides a common language for network engineers. Whether the topic is a systematic approach, documentation, or network architecture, the OSI model is pervasive in troubleshooting networks. The model allows troubleshooting to be described in a structured fashion, and problems are typically described in terms of a given OSI model layer. At this stage, the reader is assumed to be thoroughly familiar with the model, but a quick look at it helps clarify its role in troubleshooting methodology.

The OSI reference model describes how information from a software application in one computer moves through a network medium to a software application in another computer. The OSI reference model is a conceptual model composed of seven layers, each specifying particular network functions. The model was developed by the International Organization for Standardization (ISO) in 1984, and it is now considered the primary architectural model for intercomputer communications. The OSI model divides the tasks involved with moving information between networked computers into seven smaller, more manageable task groups. A task, or group of tasks, is then assigned to each of the seven OSI layers.
Each layer is reasonably self-contained, so that the tasks assigned to each layer can be implemented independently. This enables the solutions offered by one layer to be updated without adversely affecting the other layers. The figure details the seven layers of the Open Systems Interconnection reference model.

The OSI model provides a logical framework and a common language used by network engineers to articulate network scenarios. The Layer 1 through Layer 7 terminology is so common that most engineers do not think twice about it any more.

The upper layers (5-7) of the OSI model deal with application issues and generally are implemented only in software. The application layer is closest to the end user. Both users and application layer processes interact with software applications that contain a communications component.

The lower layers (1-4) of the OSI model handle data-transport issues. The physical layer and data-link layer are implemented in hardware and software. The other lower layers generally are implemented only in software. The physical layer is closest to the physical network medium, such as the network cabling, and is responsible for actually placing information on the medium.

When sending data from an application in one host to an application in a second, the network software on the source host takes data from an application and converts it as needed for transmission over a physical network. The process involves:
\begin{itemize}
\item Converting data into segments
\item Encapsulating segments with header information that includes logical network addressing information, and converting segments into packets
\item Encapsulating packets with a header that includes physical addressing information, and converting packets into frames
\item Encoding frames into bits
\end{itemize}
The data is now ready to travel over the physical medium as bits. The encapsulation process as a whole represents the initial stage in transferring data between two end systems.
\subsection{2.1.2 Bits on the physical medium}
The Ethernet receiver derives the clock rate from the incoming data stream. Using a direct signal encoding of 0 volts for a logic 0 value and 5 volts for a logic 1 value could lead to timing problems. Specifically, a long string of 1s or 0s could cause the receiver to lose synchronization with the data. Further, the recipient would be unable to determine the difference between an idle sender (0 volts) and a string of 0s (again 0 volts).

The solution for this dilemma is found in the Ethernet encoding scheme. Rather than transmitting the logic level directly, Manchester encoding is used. With this technique, one transition is guaranteed for each bit cycle, or bit time. In a Manchester encoded signal, a binary 1 is represented by a change of amplitude from low to high during the middle of a bit time. Conversely, a binary 0 is represented by a change of amplitude from high to low during the middle of a bit time.

However, the trade-off for this synchronization technique is that twice the signaling bandwidth is required, since there must be two pulses for every bit transmitted. As a result, 10-Mbps Ethernet actually works with a 20 MHz serial data signal.

Data moving through the physical layer medium from the source to the destination is the end product of the encapsulation process.
\subsection{2.1.3 Network devices utilize control information}
Layer 2 network devices utilize the control information within a frame to assess where the frame is physically destined on a local network segment. The physical address, or MAC address, of the destination network adapter, or interface, is read so that the proper decision on switching to an appropriate port can be made.
In addition to addressing information, the Layer 2 device can check the validity of the frame by recalculating the frame check sequence (FCS) and matching it with the FCS included as part of the encapsulation process at the data-link layer.

Layer 3 network devices are responsible for determining logical paths between networks through an internetwork. Layer 3 devices read the network address of a destination contained within the control information of packets, and then forward the packets to an appropriate interface. Layer 3 addressing is hierarchical, so intermediate devices need only know which network the destination device is a member of in order to deliver the packet to the correct location.

Data flow alternates between the physical medium, which is stage two of data flow, and Layer 2 and Layer 3 devices, which represent the third stage in the flow of data from a source to a target end-system.
\subsection{2.1.4 Decapsulation}
When the interface of an end-system receives data from the physical medium, frames must be extracted from the bit stream so that the end-system can verify that the destination physical address of the frame equals its own. When the physical address is verified, the packet is decapsulated from the frame control information and the packet's logical control information is examined. Data is further decapsulated from packets as needed for use with the target application. This represents the fourth stage in the layered model of data flow.
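
The four stages described in this section can be traced in a few lines of code. The following is a minimal Python sketch under simplifying assumptions, not a real Ethernet implementation: the frame holds only a destination address, a source address, the packet, and a 4-byte FCS; bits are sent most significant bit first; \texttt{zlib.crc32} stands in for the Ethernet FCS calculation; and all function names and MAC addresses are hypothetical. The Manchester convention follows the description in section 2.1.2, where a 1 is a low-to-high transition and a 0 is a high-to-low transition.
\begin{verbatim}
import zlib

def build_frame(dst_mac, src_mac, packet):
    """Stage 1: add a simplified Layer 2
    header and append an FCS."""
    body = dst_mac + src_mac + packet
    fcs = zlib.crc32(body).to_bytes(4, "big")
    return body + fcs

def manchester_encode(frame):
    """Stage 2: two signal levels per bit.
    1 -> low, high (0, 1)
    0 -> high, low (1, 0)"""
    levels = []
    for byte in frame:
        for shift in range(7, -1, -1):
            bit = (byte >> shift) & 1
            pair = (0, 1) if bit else (1, 0)
            levels.extend(pair)
    return levels

def manchester_decode(levels):
    """Rebuild frame bytes from level pairs."""
    bits = []
    pairs = zip(levels[0::2], levels[1::2])
    for first, second in pairs:
        if first == second:
            raise ValueError("no transition")
        bits.append(1 if second > first else 0)
    frame = bytearray()
    for i in range(0, len(bits), 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        frame.append(value)
    return bytes(frame)

def receive_frame(frame, my_mac):
    """Stages 3 and 4: check the FCS and the
    destination address, then decapsulate."""
    body, fcs = frame[:-4], frame[-4:]
    check = zlib.crc32(body).to_bytes(4, "big")
    if check != fcs:
        return None  # damaged in transit
    if body[:6] != my_mac:
        return None  # for another interface
    return body[12:]  # packet only

# Hypothetical locally administered addresses
host_a = bytes.fromhex("020000000001")
host_b = bytes.fromhex("020000000002")
frame = build_frame(host_b, host_a, b"hello")
signal = manchester_encode(frame)
received = manchester_decode(signal)
print(receive_frame(received, host_b))
\end{verbatim}
In this sketch every bit becomes two signal levels, which is the doubling of signaling bandwidth noted for Manchester encoding, and the receiving side hands the packet upward only when both the FCS and the destination address match.
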
Data returned to the original sender goes through the same process:
\begin{description}
\item[Stage 1] Encapsulation
\item[Stage 2] Transmission over the physical medium
\item[Stage 3] Network devices utilizing control information to deliver data to the appropriate end-system
\item[Stage 4] Decapsulation of data as needed for use with the target application
\end{description}
\subsection{2.1.5 OSI model versus TCP/IP model}
Similar to the OSI networking model, the TCP/IP networking model divides networking architecture into modular layers. The figure shows how the TCP/IP networking model maps to the layers of the OSI networking model. It is this close mapping that allows the TCP/IP suite of protocols to successfully communicate with so many networking technologies.

The TCP/IP network access layer corresponds to the OSI physical and data-link layers. The network access layer communicates directly with the network media and provides an interface between the architecture of the network and the Internet layer.

The TCP/IP Internet layer corresponds to the OSI network layer. The Internet layer of the TCP/IP protocol model is responsible for placing messages in a fixed format that allows devices to handle them.

The transport layers of TCP/IP and OSI directly correspond in function. The transport layer is responsible for exchanging segments between devices on a TCP/IP network.

The application layer in the TCP/IP suite combines the functions of three OSI model layers: session, presentation, and application. The application layer provides communication between applications such as FTP, HTTP, and SMTP on separate hosts.
\subsection{2.1.6 Position of network devices in layered model}
The ability to identify which layers pertain to a networking device gives a troubleshooter the ability to minimize the complexity of a problem by dividing the problem into manageable parts.
For instance, knowing that Layer 3 issues are of no importance to a switch, aside from multilayer switches, confines the task to Layer 1 and Layer 2. There is still plenty to consider at these two layers, but this simple knowledge prevents time being wasted on irrelevant possibilities and significantly reduces the time spent attempting to correct a problem. However, it is still important to note that some network applications that are part of these devices operate at Layers 4-7.
\section{2.2 Troubleshooting Approaches}
\subsection{2.2.1 General troubleshooting process}
The stages of the general troubleshooting process are:
\begin{description}
\item[Step 1] Gather symptoms
\item[Step 2] Isolate the problem
\item[Step 3] Correct the problem
\end{description}
The stages are not mutually exclusive. At any point in the process, it may be necessary to return to previous steps. For instance, it may be required to gather more symptoms while isolating a problem. Additionally, when attempting to correct a problem, another unidentified problem could be created. As a result, it would be necessary to gather the symptoms of, isolate, and correct the new problem.

A troubleshooting policy should be established for each stage. A policy provides a consistent manner in which to perform each stage. Part of the policy should include documenting every important piece of information.

Gather Symptoms - To perform the ``Gather Symptoms'' stage of the general troubleshooting process, the troubleshooter gathers and documents symptoms from the network, end systems, or users. In addition, the troubleshooter determines what network components have been affected and how the functionality of the network has changed compared to the baseline. Symptoms may appear in many different forms. These forms include alerts from the network management system, console messages, and user complaints.
While gathering symptoms, questions should be used as a method of localizing the problem to a smaller range of possibilities. However, the problem is not truly isolated until a single problem, or a set of related problems, is identified.

Isolate the Problem - To perform the ``Isolate the Problem'' stage of the general troubleshooting process, the troubleshooter identifies the characteristics of problems at the logical layers of the network so that the most likely cause can be selected. At this stage, the troubleshooter may gather and document more symptoms depending on the problem characteristics that are identified.

Correct the Problem - To perform the ``Correct the Problem'' stage, the troubleshooter corrects an identified problem by implementing, testing, and documenting a solution. If the troubleshooter determines that the corrective action has created another problem, the attempted solution is documented, the changes are removed, and the troubleshooter returns to gathering symptoms and isolating the problem.
\subsection{2.2.2 Bottom-up}
When applying a bottom-up approach to troubleshooting a networking problem, the examination starts with the physical components of the network and works up through the layers of the OSI model until the cause of the problem is identified. It is a good approach to use when the problem is suspected to be physical. Most networking problems reside at the lower levels, so the bottom-up approach is often effective.

The downside of this approach is that it requires checking every device and interface on the network until the possible cause of the problem is found. Each conclusion and possibility must be documented. The challenge is to determine which devices to start with.

In many cases, problems within the first four layers can be determined by entering a ping or traceroute command.
If the connection is successful, then the cause is likely at the application level. Otherwise, a closer look at the lower levels will be needed to locate the problem.

Verify that Internet Control Message Protocol (ICMP) echo request and echo reply are enabled on the network so that commands such as ping and traceroute can work. If ping has been disabled on the network, that is the result of a policy decision, so enabling it requires authorization from the network administrator and documentation of that authorization. Document in a station log or your personal work log that ping, or any command that was initially disabled, was enabled for network testing and subsequently disabled. This record is important should there be an unauthorized intrusion into the network while you are troubleshooting it. If ICMP is disabled, the failure of a ping or traceroute command can easily be mistaken for a loss of connectivity.
\subsection{2.2.3 Top-down}
When applying a top-down approach to troubleshooting a networking problem, the end user application is examined first. The troubleshooter then works down from the upper layers of the OSI model until the cause of the problem has been identified. With this approach, the applications of an end system are tested before tackling the more specific networking pieces. A troubleshooter would most likely select this approach for simpler problems or when the problem is thought to be with a piece of software.

The disadvantage of this approach is that it requires checking every network application until the possible cause of the problem is found. Each conclusion and possibility must be documented. As with the bottom-up approach, the challenge is to determine which application to start with.
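
The bottom-up and top-down approaches differ mainly in the order in which layers are examined. The following is a minimal Python sketch of that difference, not a real diagnostic tool: each layer is represented by a hypothetical health-check function supplied by the caller, the function name \texttt{locate\_fault} is invented for illustration, and the top-down branch assumes that a layer found to be working implies working layers beneath it. Every check performed is recorded, reflecting the requirement to document each conclusion.
\begin{verbatim}
OSI_LAYERS = [
    "physical", "data link", "network",
    "transport", "session",
    "presentation", "application",
]

def locate_fault(checks, approach):
    """Return (suspect_layer, log).

    checks maps each layer name to a
    function returning True when that
    layer appears healthy."""
    if approach == "bottom-up":
        order = OSI_LAYERS
    elif approach == "top-down":
        order = OSI_LAYERS[::-1]
    else:
        raise ValueError(approach)
    log = []      # record of every check
    suspect = None
    for layer in order:
        healthy = checks[layer]()
        log.append((layer, healthy))
        if approach == "bottom-up":
            if not healthy:
                # lowest failing layer
                return layer, log
        else:
            if healthy:
                # assume lower layers work
                break
            suspect = layer  # keep descending
    return suspect, log
\end{verbatim}
With a fault at the network layer, the bottom-up order reaches the suspect layer after three checks, while the top-down order needs six. With a fault only at the application layer the cost is reversed, which is one way to see why the choice of approach depends on where the problem is suspected to be.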
\subsection{2.2.4 Divide and conquer}
When the divide and conquer approach is applied to troubleshooting a networking problem, a layer is selected and testing proceeds in either direction from that starting layer. The starting layer is chosen based on the troubleshooter's experience and the symptoms gathered about the problem. Once the direction of the problem is identified, troubleshooting follows that direction until the cause of the problem is identified. If it can be verified that a layer is functioning, it is typically a safe assumption that the layers below it are functioning as well, so troubleshooting continues upward. If a layer is not functioning properly, gather symptoms of the problem at that layer and work downward to lower layers.
\subsection{2.2.5 Guidelines for selecting an approach}
When an effective troubleshooting approach is selected, a network problem is usually resolved more quickly and more cost-effectively. Consider the following when selecting a troubleshooting approach.
\subsubsection{Determine the scope of the problem}
A troubleshooting approach is often selected based on the complexity of the problem. A bottom-up approach typically works better for complex problems. Using a bottom-up approach for a simple problem may be overkill and inefficient. Typically, if symptoms come from users, then a top-down approach is used. If symptoms come from the network, a bottom-up approach will likely be more effective.
\subsubsection{Apply previous experiences}
If a particular problem has been experienced previously, then the troubleshooter may know of a way to shorten the troubleshooting process. A less experienced troubleshooter will likely implement a bottom-up approach, while a skilled troubleshooter may be able to jump into a problem at a different layer using the divide and conquer approach.
\subsubsection{Analyze the symptoms}
The more that is known about a problem, the better the chance that it can be solved.
It may be possible to correct a problem immediately simply by analyzing the symptoms.
\paragraph{Example}
Two IP routers in a network have connectivity but are not exchanging routing information. Before attempting to solve the problem, a troubleshooting approach needs to be selected. Similar symptoms have been seen previously, and they point to a likely protocol issue. Since there is connectivity between the routers, the problem is not likely to be at the physical or data-link layer. Based on this past experience, it is decided to use the divide and conquer approach, and the troubleshooter begins testing the TCP/IP-related functions at the network layer.
\section{2.3 Gathering Symptoms}
\subsection{2.3.1 Gathering symptoms for a network problem}
The stages for gathering symptoms for a network problem are as follows:
\paragraph{Stage 1}
The troubleshooter analyzes symptoms gathered from the trouble ticket, users, or end systems affected by the problem to form a definition of the problem.
\paragraph{Stage 2}
If the problem is in the troubleshooter's system, the troubleshooter moves on to Stage 3. If the problem is outside the boundary of the troubleshooter's control, it will be necessary to contact an administrator for the external system before gathering additional network symptoms.
\paragraph{Stage 3}
The troubleshooter determines whether the problem is at the core, distribution, or access layer of the network. At the identified layer, an analysis of existing symptoms and knowledge of the network topology are used to determine which piece or pieces of equipment are the most likely cause.
\paragraph{Stage 4}
Using a layered troubleshooting approach, the troubleshooter gathers hardware and software symptoms from the suspect devices. The troubleshooter starts with the most likely possibility and uses knowledge and experience to determine whether the problem is more likely a hardware or a software configuration problem.
\paragraph{Stage 5}
Document any hardware or software symptoms. If the problem can be solved using the documented symptoms, the troubleshooter solves the problem and documents the solution. If the problem cannot be solved, the troubleshooter begins the isolating phase of the general troubleshooting process.

Be prudent with use of the debug command on a network. It generates enough console message traffic that the performance of a network device can be noticeably affected. Be sure to disable debugging when its capabilities are no longer needed.
\subsection{2.3.2 Gathering symptoms from an end-user: hardware}
When gathering symptoms for perceived hardware problems, a troubleshooter should physically inspect, or ask for physical inspection of, the devices using the senses of hearing, sight, smell, and touch. Physical symptoms may be related, but not limited, to the following:
\begin{itemize}
\item Electromagnetic interference (EMI) from radio and television transmitters, or the introduction of portable devices that create EMI to the area
\item Indicator lights of a NIC or networking device
\item Cable connections, the crimping of connectors, and the physical state of connection sockets
\item Incorrect seating of modules and cards
\item Burning smells from insulating material that has melted, or from burnt-out components
\item Overheating due to cooling fan malfunction
\end{itemize}
\subsection{2.3.3 Gathering symptoms from an end-user: software}
When gathering symptoms for probable software configuration problems, a troubleshooter should start at the last known point where the network functioned correctly. If an end-user station can successfully ping the gateway but not the DNS server on another network segment, then an entire set of potential problems associated with the physical layer at the user site can be eliminated. Effective questioning techniques can discover this type of information without requiring a trip to the end-user location.
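
The gateway and DNS server example can be expressed as a small decision procedure. The following is a minimal Python sketch, assuming the operating system provides a \texttt{ping} command that accepts \texttt{-c} (or \texttt{-n} on Windows) for the packet count; the function names, the wording of the conclusions, and the two addresses are hypothetical, and the exit code of \texttt{ping} is treated as a rough indicator only, since its meaning varies between platforms.
\begin{verbatim}
import platform
import subprocess

def ping(host, timeout_s=2):
    """Send one ICMP echo request using
    the system ping command."""
    windows = platform.system() == "Windows"
    count_flag = "-n" if windows else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired,
            OSError):
        return False
    return result.returncode == 0

def interpret(gateway_ok, dns_ok):
    """Turn two reachability results
    into a symptom to document."""
    if gateway_ok and dns_ok:
        return ("Path to DNS server works; "
                "look at higher layers.")
    if gateway_ok:
        return ("Local physical layer "
                "eliminated; suspect the "
                "path beyond the gateway.")
    return ("Gateway unreachable; check "
            "the local link, or ICMP may "
            "be filtered.")

if __name__ == "__main__":
    # Hypothetical documentation addresses
    gateway = "192.0.2.1"
    dns_server = "198.51.100.53"
    print(interpret(ping(gateway),
                    ping(dns_server)))
\end{verbatim}
Keeping the interpretation separate from the probing means the same reasoning can be applied to results that a user reports over the telephone, without a trip to the end-user location.
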
The commands shown in the figure can be used to check the status of various devices and to determine which configuration aspects to inspect.

The troubleshooter should use effective questioning techniques to document the symptoms of a problem:
\begin{itemize}
\item Ask questions that are pertinent to the problem.
\item Use each question as a means to either eliminate or discover possible problems.
\item Speak at a technical level that the user can understand.
\item Ask the user when the problem was first noticed.
\item Ask the user to re-create the problem, if possible.
\item Determine the sequence of events that took place before the problem happened.
\item Match the symptoms that the user describes with common problem causes.
\end{itemize}
\subsection{2.3.4 Questions to ask an end-user}
When asking an end user questions, it is important to follow a specific sequence so that the troubleshooter gains the knowledge necessary to reach a solution. A typical format for interviewing an end user concerning their problem is:
\begin{itemize}
\item What does not work?
\item What does work?
\item Are the things that do and do not work related?
\item Has the thing that does not work ever worked?
\item When was the problem first noticed?
\item What has changed since the last time it did work?
\item Did anything unusual happen since the last time it worked?
\item When exactly does the problem occur?
\item Can the problem be reproduced and, if so, how can it be reproduced?
\end{itemize}
\paragraph{Question criteria}
\begin{itemize}
\item Ask questions that are pertinent to the problem.
\item Use questions to either eliminate or discover possible problems.
\item Speak at a technical level the user can understand.
\item Match user symptoms with common problem causes.
\end{itemize}
\paragraph{Questions to the end-user}
\begin{itemize}
\item When did the user first notice the problem?
\item Can the user re-create the problem?
\item What sequence of events took place before the problem happened?
\end{itemize}
\subsection{2.3.5 Flow charts for gathering network and end-user symptoms}
\begin{description}
\item[Stage 1 Interview user] If possible, the troubleshooter gathers initial symptoms from the user and uses these symptoms as a basis for additional troubleshooting.
\item[Stage 2 Analyze symptoms] The troubleshooter forms a description of the problem by analyzing the symptoms gathered from the user.
\item[Stage 3 Determine symptoms] Using a layered troubleshooting approach, the troubleshooter gathers hardware and software symptoms from the end system, starting with the most likely cause. The troubleshooter should rely on previous experience, if possible, to decide whether the problem is more likely a hardware or a software problem.
\item[Stage 4 Document symptoms] Document any hardware and software symptoms. If the problem can be solved using the documented symptoms, the troubleshooter solves the problem and documents the solution. If the problem cannot be solved at this point, then the isolating phase of the general troubleshooting process is initiated.
\end{description}
\end{multicols}
\end{document}