Google Code for Educators - Introduction to Distributed System Design

This link has been bookmarked by 91 people. It was first bookmarked on 31 Oct 2007, by xinran zhou.

29 Nov 12

Alex Farcas
"heart"

distributed
31 Oct 12

hao yang
distributed
10 Sep 12

Phil Garcia
distributed
29 Aug 12

Abhishek Amberkar
distributed google education tutorial architecture distributed-systems
26 Apr 12

Bernd Prager
distributed programming google architecture development computing
13 Apr 12

nikerym
google code university online
27 Feb 12

web2bruno
iteszliso26a12 operating systems processor multiprocessor Concepts programming architecture computing distributed education
08 Oct 11

fullofgrace88
- To be truly reliable, a distributed system must have the following characteristics:
  
  Fault-Tolerant: It can recover from component failures without performing incorrect actions.
  Highly Available: It can restore operations, permitting it to resume providing services even when some components have failed.
  Recoverable: Failed components can restart themselves and rejoin the system, after the cause of failure has been repaired.
  Consistent: The system can coordinate actions by multiple components often in the presence of concurrency and failure. This underlies the ability of a distributed system to act like a non-distributed system.
  Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size. For example, we might increase the size of the network on which the system is running. This increases the frequency of network outages and could degrade a "non-scalable" system. Similarly, we might increase the number of users or servers, or overall load on the system. In a scalable system, this should not have a significant effect.
  Predictable Performance: The ability to provide desired responsiveness in a timely manner.
  Secure: The system authenticates access to data and services [1]
17 Jun 11

Shawn Chen
- a binding occurs when a process that needs to access a service becomes associated with a particular server which provides the service
- data replication, where a service maintains multiple copies of data to permit local access at multiple locations, or to increase availability when a server process may have crashed
- Caching is a related concept and very common in distributed systems
- If a cache is actively refreshed by the primary service, caching is identical to replication
- (UDP) doesn't provide any sort of guarantee that the receiver will receive the packets that are sent, or that they will arrive in the right order
- The normal maximum disconnection time is between 30 and 90 seconds
- when an RPC is made, the arguments are passed to the remote procedure and the caller waits for a response to be returned
- Each incoming request to a server typically spawns a new thread
- A thread in the client typically issues an RPC and then blocks (waits)
- binding error
- Version mismatches
- Fault-tolerant systems, however, have alternate sources for critical services and fail-over from a primary server to a backup server
- Network data loss resulting in retransmit
- Server process crashes during RPC operation
- Client process crashes before receiving response
- to design distributed systems with the expectation of failure
- Explicitly define failure scenarios and identify how likely each one is to occur
- Both clients and servers must be able to deal with unresponsive senders/receivers
- Minimize traffic as much as possible
- The way to make this decision is to experiment
- State is something held in one place on behalf of a process that is in another place, something that cannot be reconstructed by any other component
- Be sensitive to speed and performance
- Acks are expensive and tend to be avoided in distributed systems wherever possible
- Retransmission is costly
22 more annotations...
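
The highlights above on blocking RPC calls, timeouts, and costly retransmission can be made concrete with a small sketch. It is not taken from the tutorial: `RpcTimeout`, `call_with_retries`, and `make_flaky_add` are hypothetical names, and the "remote" procedure is an in-process stand-in, not a real network call. It shows a caller that bounds the number of retransmissions and backs off between attempts.

```python
import time


class RpcTimeout(Exception):
    """Raised when a (simulated) remote call gets no reply in time."""


def call_with_retries(rpc, *args, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Invoke rpc(*args), retrying on RpcTimeout with exponential backoff.

    Each retry is a retransmission, so the number of attempts is bounded.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return rpc(*args)
        except RpcTimeout as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Wait longer after each failure: base, 2*base, 4*base, ...
                sleep(base_delay * (2 ** attempt))
    raise last_error


def make_flaky_add(failures):
    """Return a stand-in for a remote 'add' that times out `failures` times."""
    state = {"remaining": failures, "calls": 0}

    def add(a, b):
        state["calls"] += 1
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RpcTimeout("no response from server")
        return a + b

    return add, state
```

Blind retries like this are only safe when the remote operation is idempotent: if the server crashed after doing the work but before replying, a retry repeats the work.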
05 Jan 11

Marvin Cheng
- What is a distributed system? It's one of those things that's hard to define without first defining many other things. Here is a "cascading" definition of a distributed system:
  
  A program
  is the code you write.
  A process
  is what you get when you run it.
  A message
  is used to communicate between processes.
  A packet
  is a fragment of a message that might travel on a wire.
  A protocol
  is a formal description of message formats and the rules that two processes must follow in order to exchange those messages.
  A network
  is the infrastructure that links computers, workstations, terminals, servers, etc. It consists of routers which are connected by communication links.
  A component
  can be a process or any piece of hardware required to run a process, support communications between processes, store data, etc.
  A distributed system
  is an application that executes a collection of protocols to coordinate the actions of multiple processes on a network, such that all components cooperate together to perform a single or small set of related tasks.
- scalable, we mean the system can easily be altered to accommodate changes in the number of users, resources and computing entities.
- Handling failures is an important theme in distributed systems design. Failures fall into two obvious categories: hardware and software.
- Software failures are a significant issue in distributed systems. Even with rigorous testing, software bugs account for a substantial fraction of unplanned downtime (estimated at 25-35%). Residual bugs in mature systems can be classified into two main categories [5].
  
  Heisenbug: A bug that seems to disappear or alter its characteristics when it is observed or researched. A common example is a bug that occurs in a release-mode compile of a program, but not when researched under debug-mode. The name "heisenbug" is a pun on the "Heisenberg uncertainty principle," a quantum physics term which is commonly (yet inaccurately) used to refer to the way in which observers affect the measurements of the things that they are observing, by the act of observing alone (this is actually the observer effect, and is commonly confused with the Heisenberg uncertainty principle).
  
  Bohrbug: A bug (named after the Bohr atom model) that, in contrast to a heisenbug, does not disappear or alter its characteristics when it is researched. A Bohrbug typically manifests itself reliably under a well-defined set of conditions. [6]
- Heisenbugs tend to be more prevalent in distributed systems than in local systems. One reason for this is the difficulty programmers have in obtaining a coherent and comprehensive view of the interactions of concurrent processes.
- Let's get a little more specific about the types of failures that can occur in a distributed system:
  
  Halting failures: A component simply stops. There is no way to detect the failure except by timeout: it either stops sending "I'm alive" (heartbeat) messages or fails to respond to requests. Your computer freezing is a halting failure.
  Fail-stop: A halting failure with some kind of notification to other components. A network file server telling its clients it is about to go down is a fail-stop.
  Omission failures: Failure to send/receive messages primarily due to lack of buffering space, which causes a message to be discarded with no notification to either the sender or receiver. This can happen when routers become overloaded.
  Network failures: A network link breaks.
  Network partition failure: A network fragments into two or more disjoint sub-networks within which messages can be sent, but between which messages are lost. This can occur due to a network failure.
  Timing failures: A temporal property of the system is violated. For example, clocks on different computers that are used to coordinate processes drift out of synchronization, or a message is delayed longer than a threshold period.
  Byzantine failures: This captures several types of faulty behaviors including data corruption or loss, failures caused by malicious programs, etc. [1]
- Everyone, when they first build a distributed system, makes the following eight assumptions. These are so well-known in this field that they are commonly referred to as the "8 Fallacies".
  
  The network is reliable.
  Latency is zero.
  Bandwidth is infinite.
  The network is secure.
  Topology doesn't change.
  There is one administrator.
  Transport cost is zero.
  The network is homogeneous. [3]
  
  Latency: The time between initiating a request for data and the beginning of the actual data transfer.
  Bandwidth: A measure of the capacity of a communications channel. The higher a channel's bandwidth, the more information it can carry.
  Topology: The different configurations that can be adopted in building networks, such as a ring, bus, star or meshed.
  Homogeneous network: A network running a single network protocol.
- There are many types of servers we encounter in a distributed system. For example, file servers manage disk storage units on which file systems reside. Database servers house databases and make them available to clients. Network name servers implement a mapping between a symbolic name or a service description and a value such as an IP address and port number for a process that provides the service.
- There are four layers in the IP suite:
  
  Application Layer: The application layer is used by most programs that require network communication. Data is passed down from the program in an application-specific format to the next layer, then encapsulated into a transport layer protocol. Examples of application-layer protocols are HTTP, FTP, and Telnet.
  
  Transport Layer: The transport layer's responsibilities include end-to-end message transfer independent of the underlying network, along with error control, fragmentation and flow control. End-to-end message transmission at the transport layer can be categorized as either connection-oriented (TCP) or connectionless (UDP). TCP is the more sophisticated of the two protocols, providing reliable delivery. First, TCP ensures that the receiving computer is ready to accept data. It uses a three-way handshake in which both the sender and receiver agree that they are ready to communicate. Second, TCP makes sure that data gets to its destination. If the receiver doesn't acknowledge a particular packet, TCP automatically retransmits the packet, up to an implementation-dependent number of attempts. If necessary, TCP can also split large packets into smaller ones so that data can travel reliably between source and destination. TCP drops duplicate packets and rearranges packets that arrive out of sequence.
- UDP is similar to TCP in that it is a protocol for sending and receiving packets across a network, but with two major differences. First, it is connectionless. This means that one program can send off a load of packets to another, but that's the end of their relationship. The second program might send some back to the first and the first might send some more, but there's never a solid connection. UDP is also different from TCP in that it doesn't guarantee that the packets that are sent will arrive, or that they will arrive in the right order. All that is guaranteed is the packet's contents. This means it's a lot faster, because there's no extra overhead for error-checking above the packet level. For this reason, games often use this protocol. In a game, if one packet for updating a screen position goes missing, the player will just jerk a little. The other packets will simply update the position, and the missing packet - although making the movement a little rougher - won't change anything.
- servers evolved called RPC, which means remote procedure call
- A programmer writing RPC-based code does three things:
  
  Specifies the protocol for client-server communication
  Develops the client program
  Develops the server program
- communication protocol is created by stubs generated by a protocol compiler. A stub is a routine that doesn't actually do much other than declare itself and the parameters it accepts. The stub contains just enough code to allow it to be compiled and linked.
- RPC introduces a set of error cases that are not present in local procedure programming. For example, a binding error can occur when a server is not running when the client is started. Version mismatches occur if a client was compiled against one version of a server, but the server has now been updated to a newer version. A timeout can result from a server crash, network problem, or a problem on a client computer.
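
The annotations above on stubs, binding errors, and version mismatches can be sketched roughly as follows. This is an illustration under stated assumptions, not the output of any real protocol compiler: `ClientStub`, `ServerStub`, `BindingError`, `VersionMismatch`, and `PROTOCOL_VERSION` are hypothetical names, JSON stands in for the wire format, and the "transport" is a plain in-process function call.

```python
import json

PROTOCOL_VERSION = 2  # hypothetical version number shared by both stubs


class BindingError(Exception):
    """No server is available for the client to bind to."""


class VersionMismatch(Exception):
    """Client and server stubs were built from different protocol versions."""


class ServerStub:
    """Unmarshals requests and dispatches them to the real procedures."""

    def __init__(self, procedures, version=PROTOCOL_VERSION):
        self.procedures = procedures
        self.version = version

    def handle(self, raw_request):
        request = json.loads(raw_request)
        if request["version"] != self.version:
            return json.dumps({"error": "version mismatch"})
        result = self.procedures[request["proc"]](*request["args"])
        return json.dumps({"result": result})


class ClientStub:
    """Marshals a local-looking call into a message and waits for the reply."""

    def __init__(self, transport, version=PROTOCOL_VERSION):
        if transport is None:
            # Stand-in for "the server is not running when the client starts".
            raise BindingError("server is not running")
        self.transport = transport
        self.version = version

    def call(self, proc, *args):
        raw = json.dumps(
            {"version": self.version, "proc": proc, "args": list(args)}
        )
        reply = json.loads(self.transport(raw))  # caller blocks here
        if reply.get("error") == "version mismatch":
            raise VersionMismatch(proc)
        return reply["result"]
```

For example, `ClientStub(ServerStub({"add": lambda a, b: a + b}).handle).call("add", 2, 3)` goes through marshalling, dispatch, and unmarshalling before returning the sum. A client constructed with a different `version` raises `VersionMismatch` instead.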
14 more annotations...
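
The halting-failure description in the annotations above says such a failure can only be detected by timeout, for instance when "I'm alive" (heartbeat) messages stop arriving. A minimal sketch of that idea follows. `HeartbeatMonitor` and its method names are hypothetical, and the clock is injectable only so the behaviour can be checked without real waiting.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per component and flags silent ones."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence tolerated
        self.clock = clock          # injectable time source
        self.last_seen = {}         # component name -> time of last heartbeat

    def heartbeat(self, component):
        """Record that `component` just reported it is alive."""
        self.last_seen[component] = self.clock()

    def suspected(self):
        """Return, sorted, the components silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

The method is named `suspected` rather than `failed` on purpose: a timeout cannot distinguish a halted component from a slow one or one cut off by a network partition.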
26 Nov 10

Radoslaw Tomaszewski
"Introduction to Distributed System Design"

computing architecture google rpc corba protocols distributed concurrency programming parallel
15 Sep 10

owen2785
- stubs
09 Aug 10

dotysan
HA_HS Google network-design from-delicious
10 Feb 10

rohitjoshi
distributed google programming tutorial
03 Feb 09

Rolf B
- Acks are expensive and tend to be avoided in distributed systems wherever possible.
30 Jan 09

timwee
imported Bookmarks freeTechBooks distributed programming google tutorial architecture development education
14 Dec 08

Ahmed Diaa
distributed systems
28 Oct 08

Tarun Pant
From Internet Explorer Google Learning Center
16 Sep 08

x x
software
24 Aug 08

Alan Dean
This tutorial covers the basics of distributed systems design. The pre-requisites are significant programming experience with a language such as C++ or Java, a basic understanding of networking, and data structures & algorithms.

Distributed Architecture
10 May 08

mapReduce
24 Mar 08

Carlos Junior
architecture java network programming
30 Jan 08

Mat Schaffer
development education
18 Dec 07

Sergey Gordienko
google distributed
03 Dec 07

Jerry Wong
distributed google architecture solution
21 Nov 07

n b
distributed programming
19 Nov 07

Carlos Pérez
tutorial distributed CASO delicious
18 Nov 07

Arrix Z
17 Nov 07

Antoine Fortuin
google design tutorial programming
15 Nov 07

hg ye
google code dsd tutorial concurrency architecture education
14 Nov 07

Ken Wei
design distributed-system google
fei dong
distributed programming google architecture
13 Nov 07

stateless
distributed architecture programming google development education design systems
Steve Gabriel
developer computer Google
08 Nov 07

flyingbrick
Distributed_Computing
31 Oct 07

Mike Griffith
distributed concurrency Google
xinran zhou
coding
30 Oct 07

chew barkla
architecture concurrency distributed programming google tutorial education development scalability
Alexander Lockshyn
architecture development programming
jpoetker
tutorial google development architecture hdportal
29 Oct 07

Fabien Schwob
google distributed design system
tvaananen
Google's educational program contains little gems like this; how to build distributed systems.

architecture design
22 Oct 07

dryganets
distributed toread
21 Oct 07

oulipo
distributed programming
11 Oct 07

Jeff Stewart
This tutorial covers the basics of distributed systems design. The pre-requisites are significant programming experience with a language such as C++ or Java, a basic understanding of networking, and data structures & algorithms.

distributed systems parallel computing google education resources tutorials primers
18 Sep 07

Gautham
distributed-computing delicious-import
14 Sep 07

clementi1832
academic computing development education parallel processing scalability science computer cs systems software toread tutorial programming distributed
papasasha
google programming education
13 Sep 07

mandarine
This tutorial covers the basics of distributed systems design. The pre-requisites are significant programming experience with a language such as C++ or Java, a basic understanding of networking, and data structures & algorithms.

lectures distributed server programming tutorial google education
Miron Brezuleanu
toread
romanzeyde
distributed development
pickerel yee
Introduction to Distributed System Design

program programming develop distributed google tutorial development
andreys
programming distributed Google tutorial development education concurrency
programmation ressources
craig hancock
academic computing concurrency course development distributed education toread tutorial google programming delicious
12 Sep 07

Carlos Santos
ComputerScience reference software systems tutorial concurrency distributed google lectures programming course computing education
Eric Blue
toread

Google Code for Educators - Introduction to Distributed System Design - The Diigo Meta page
