This link has been bookmarked by 91 people . It was first bookmarked on 31 Oct 2007, by xinran zhou.
-
29 Nov 12
-
31 Oct 12
-
10 Sep 12
-
29 Aug 12
-
26 Apr 12
-
13 Apr 12
-
27 Feb 12
-
08 Oct 11
-
- Fault-Tolerant: It can recover from component failures without performing incorrect actions.
- Highly Available: It can restore operations, permitting it to resume providing services even when some components have failed.
- Recoverable: Failed components can restart themselves and rejoin the system, after the cause of failure has been repaired.
- Consistent: The system can coordinate actions by multiple components often in the presence of concurrency and failure. This underlies the ability of a distributed system to act like a non-distributed system.
- Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size. For example, we might increase the size of the network on which the system is running. This increases the frequency of network outages and could degrade a "non-scalable" system. Similarly, we might increase the number of users or servers, or overall load on the system. In a scalable system, this should not have a significant effect.
- Predictable Performance: The ability to provide desired responsiveness in a timely manner.
- Secure: The system authenticates access to data and services [1]
To be truly reliable, a distributed system must have the following characteristics:
-
-
17 Jun 11
-
a binding occurs when a process that needs to access a service becomes associated with a particular server which provides the service
-
data replication, where a service maintains multiple copies of data to permit local access at multiple locations, or to increase availability when a server process may have crashed
-
Caching is a related concept and very common in distributed systems
-
If a cache is actively refreshed by the primary service, caching is identical to replication
-
doesn't provide any sort of guarantee that the receiver will receive the packets that are sent in the right order
-
The normal maximum disconnection time is between 30 and 90 seconds
-
when an RPC is made, the arguments are passed to the remote procedure and the caller waits for a response to be returned
-
ach incoming request to a server typically spawns a new thread
-
A thread in the client typically issues an RPC and then blocks (waits)
-
binding error
-
Version mismatches
-
Fault-tolerant systems, however, have alternate sources for critical services and fail-over from a primary server to a backup server
-
Network data loss resulting in retransmit
-
Server process crashes during RPC operation
-
Client process crashes before receiving response
-
to design distributed systems with the expectation of failure
-
Explicitly define failure scenarios and identify how likely each one might occur
-
Both clients and servers must be able to deal with unresponsive senders/receivers
-
Minimize traffic as much as possible
-
The way to make this decision is to experiment
-
State is something held in one place on behalf of a process that is in another place, something that cannot be reconstructed by any other component
-
Be sensitive to speed and performance
-
Acks are expensive and tend to be avoided in distributed systems wherever possible
-
Retransmission is costly
-
-
05 Jan 11
-
What is a distributed system? It's one of those things that's hard to define without first defining many other things. Here is a "cascading" definition of a distributed system:
- A program
- is the code you write.
- A process
- is what you get when you run it.
- A message
- is used to communicate between processes.
- A packet
- is a fragment of a message that might travel on a wire.
- A protocol
- is a formal description of message formats and the rules that two processes must follow in order to exchange those messages.
- A network
- is the infrastructure that links computers, workstations, terminals, servers, etc. It consists of routers which are connected by communication links.
- A component
- can be a process or any piece of hardware required to run a process, support communications between processes, store data, etc.
- A distributed system
- is an application that executes a collection of protocols to coordinate the actions of multiple processes on a network, such that all components cooperate together to perform a single or small set of related tasks.
-
scalable, we mean the system can easily be altered to accommodate changes in the number of users, resources and computing entities.
-
- Fault-Tolerant: It can recover from component failures without performing incorrect actions.
- Highly Available: It can restore operations, permitting it to resume providing services even when some components have failed.
- Recoverable: Failed components can restart themselves and rejoin the system, after the cause of failure has been repaired.
- Consistent: The system can coordinate actions by multiple components often in the presence of concurrency and failure. This underlies the ability of a distributed system to act like a non-distributed system.
- Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size. For example, we might increase the size of the network on which the system is running. This increases the frequency of network outages and could degrade a "non-scalable" system. Similarly, we might increase the number of users or servers, or overall load on the system. In a scalable system, this should not have a significant effect.
- Predictable Performance: The ability to provide desired responsiveness in a timely manner.
- Secure: The system authenticates access to data and services [1]
To be truly reliable, a distributed system must have the following characteristics:
-
Handling failures is an important theme in distributed systems design. Failures fall into two obvious categories: hardware and software.
-
- Heisenbug: A bug that seems to disappear or alter its characteristics when it is observed or researched. A common example is a bug that occurs in a release-mode compile of a program, but not when researched under debug-mode. The name "heisenbug" is a pun on the "Heisenberg uncertainty principle," a quantum physics term which is commonly (yet inaccurately) used to refer to the way in which observers affect the measurements of the things that they are observing, by the act of observing alone (this is actually the observer effect, and is commonly confused with the Heisenberg uncertainty principle).
- Bohrbug: A bug (named after the Bohr atom model) that, in contrast to a heisenbug, does not disappear or alter its characteristics when it is researched. A Bohrbug typically manifests itself reliably under a well-defined set of conditions. [6]
Software failures are a significant issue in distributed systems. Even with rigorous testing, software bugs account for a substantial fraction of unplanned downtime (estimated at 25-35%). Residual bugs in mature systems can be classified into two main categories [5].
-
Heisenbugs tend to be more prevalent in distributed systems than in local systems. One reason for this is the difficulty programmers have in obtaining a coherent and comprehensive view of the interactions of concurrent processes.
-
- Halting failures: A component simply stops. There is no way to detect the failure except by timeout: it either stops sending "I'm alive" (heartbeat) messages or fails to respond to requests. Your computer freezing is a halting failure.
- Fail-stop: A halting failure with some kind of notification to other components. A network file server telling its clients it is about to go down is a fail-stop.
- Omission failures: Failure to send/receive messages primarily due to lack of buffering space, which causes a message to be discarded with no notification to either the sender or receiver. This can happen when routers become overloaded.
- Network failures: A network link breaks.
- Network partition failure: A network fragments into two or more disjoint sub-networks within which messages can be sent, but between which messages are lost. This can occur due to a network failure.
- Timing failures: A temporal property of the system is violated. For example, clocks on different computers which are used to coordinate processes are not synchronized; when a message is delayed longer than a threshold period, etc.
- Byzantine failures: This captures several types of faulty behaviors including data corruption or loss, failures caused by malicious programs, etc. [1]
Let's get a little more specific about the types of failures that can occur in a distributed system:
-
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology doesn't change.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous. [3]
Everyone, when they first build a distributed system, makes the following eight assumptions. These are so well-known in this field that they are commonly referred to as the "8 Fallacies".
Bandwidth: A measure of the capacity of a communications channel. The higher a channel's bandwidth, the more information it can carry.
Topology: The different configurations that can be adopted in building networks, such as a ring, bus, star or meshed.
Homogeneous network: A network running a single network protocol. -
There are many types of servers we encounter in a distributed system. For example, file servers manage disk storage units on which file systems reside. Database servers house databases and make them available to clients. Network name servers implement a mapping between a symbolic name or a service description and a value such as an IP address and port number for a process that provides the service.
-
- Application Layer : The application layer is used by most programs that require network communication. Data is passed down from the program in an application-specific format to the next layer, then encapsulated into a transport layer protocol. Examples of applications are HTTP, FTP or Telnet.
- Transport Layer : The transport layer's responsibilities include end-to-end message transfer independent of the underlying network, along with error control, fragmentation and flow control. End-to-end message transmission at the transport layer can be categorized as either connection-oriented (TCP) or connectionless (UDP). TCP is the more sophisticated of the two protocols, providing reliable delivery. First, TCP ensures that the receiving computer is ready to accept data. It uses a three-packet handshake in which both the sender and receiver agree that they are ready to communicate. Second, TCP makes sure that data gets to its destination. If the receiver doesn't acknowledge a particular packet, TCP automatically retransmits the packet typically three times. If necessary, TCP can also split large packets into smaller ones so that data can travel reliably between source and destination. TCP drops duplicate packets and rearranges packets that arrive out of sequence.
There are four layers in the IP suite:
-
UDP is similar to TCP in that it is a protocol for sending and receiving packets across a network, but with two major differences. First, it is connectionless. This means that one program can send off a load of packets to another, but that's the end of their relationship. The second might send some back to the first and the first might send some more, but there's never a solid connection. UDP is also different from TCP in that it doesn't provide any sort of guarantee that the receiver will receive the packets that are sent in the right order. All that is guaranteed is the packet's contents. This means it's a lot faster, because there's no extra overhead for error-checking above the packet level. For this reason, games often use this protocol. In a game, if one packet for updating a screen position goes missing, the player will just jerk a little. The other packets will simply update the position, and the missing packet - although making the movement a little rougher - won't change anything.
-
servers evolved called RPC, which means remote procedure cal
-
- Specifies the protocol for client-server communication
- Develops the client program
- Develops the server program
A programmer writing RPC-based code does three things:
-
communication protocol is created by stubs generated by a protocol compiler. A stub is a routine that doesn't actually do much other than declare itself and the parameters it accepts. The stub contains just enough code to allow it to be compiled and linked.
-
RPC introduces a set of error cases that are not present in local procedure programming. For example, a binding error can occur when a server is not running when the client is started. Version mismatches occur if a client was compiled against one version of a server, but the server has now been updated to a newer version. A timeout can result from a server crash, network problem, or a problem on a client computer.
-
-
-
-
26 Nov 10
Radoslaw Tomaszewski"Introduction to Distributed System Design"
computing architecture google rpc corba protocols distributed concurrency programming parallel
-
15 Sep 10
-
stubs
-
-
09 Aug 10
-
10 Feb 10
-
03 Feb 09
-
Acks are expensive and tend to be avoided in distributed systems wherever possible.
-
-
30 Jan 09
-
14 Dec 08
-
28 Oct 08
-
16 Sep 08
-
24 Aug 08
Alan DeanThis tutorial covers the basics of distributed systems design. The pre-requisites are significant programming experience with a language such as C++ or Java, a basic understanding of networking, and data structures & algorithms.
-
10 May 08
-
24 Mar 08
-
30 Jan 08
-
18 Dec 07
-
03 Dec 07
-
21 Nov 07
-
19 Nov 07
-
18 Nov 07
-
17 Nov 07
-
15 Nov 07
-
14 Nov 07
-
13 Nov 07
-
08 Nov 07
-
31 Oct 07
-
30 Oct 07
-
29 Oct 07
-
tvaananenGoogle's educational program contains little gems like this; how to build distributed systems.
-
22 Oct 07
-
21 Oct 07
-
11 Oct 07
Jeff StewartThis tutorial covers the basics of distributed systems design. The pre-requisites are significant programming experience with a language such as C++ or Java, a basic understanding of networking, and data structures & algorithms.
distributed systems parallel computing google education resources tutorials primers
-
18 Sep 07
-
14 Sep 07
-
13 Sep 07
mandarineThis tutorial covers the basics of distributed systems design. The pre-requisites are significant programming experience with a language such as C++ or Java, a basic understanding of networking, and data structures & algorithms.
lectures distributed server programming tutorial google education
-
pickerel yeeIntroduction to Distributed System Design
program programming develop distributed google tutorial development
-
12 Sep 07
Would you like to comment?
Join Diigo for a free account, or sign in if you are already a member.