This link has been bookmarked by 136 people . It was first bookmarked on 07 Aug 2007, by TooManySecrets.
-
22 Apr 17
-
14 Sep 16
-
- High availability. If one box goes down the others still operate.
- Faster queries. Smaller amounts of data in each user group mean faster querying.
- More write bandwidth. With no master database serializing writes you can write in parallel which increases your write throughput. Writing is major bottleneck for many websites.
-
- High availability. If one box goes down the others still operate.
- Faster queries. Smaller amounts of data in each user group mean faster querying.
- More write bandwidth. With no master database serializing writes you can write in parallel which increases your write throughput. Writing is major bottleneck for many websites.
- You can do more work. A parallel backend means you can do more work simultaneously. You can handle higher user loads, especially when writing data, because there are parallel paths through your system. You can load balance web servers, which access shards over different network paths, which are processed by separate CPUs, which use separate caches of RAM and separate disk IO paths to process work. Very few bottlenecks limit your work.
The advantages are:
-
Replicating data from a master server to slave servers is a traditional approach to scaling. Data is written to a master server and then replicated to one or more slave servers. At that point read operations can be handled by the slaves, but all writes happen on the master.
Obviously the master becomes the write bottleneck and a single point of failure. And as load increases the cost of replication increases. Replication costs in CPU, network bandwidth, and disk IO. The slaves fall behind and have stale data. The folks at YouTube had a big problem with replication overhead as they scaled. -
Some Problems With Sharding
-
-
07 Sep 15
-
29 Aug 15
-
04 Jun 15
-
01 Jun 15
-
14 Nov 13
-
10 Oct 13
-
08 Jul 13
-
Data are denormalized. Traditionally we normalize data. Data are splayed out into anomaly-less tables and then joined back together again when they need to be used. In sharding the data are denormalized. You store together data that are used together.
-
Data are more highly available. Since the shards are independent a failure in one doesn't cause a failure in another. And if you make each shard operate at 50% capacity it's much easier to upgrade a shard in place. Keeping multiple data copies within a shard also helps with redundancy and making the data more parallelized so more work can be done on the data. You can also setup a shard to have a master-slave or dual master relationship within the shard to avoid a single point of failure within the shard. If one server goes down the other can take over.
-
Obviously the master becomes the write bottleneck and a single point of failure. And as load increases the cost of replication increases. Replication costs in CPU, network bandwidth, and disk IO. The slaves fall behind and have stale data. The folks at YouTube had a big problem with replication overhead as they scaled.
-
Rebalancing data. What happens when a shard outgrows your storage and needs to be split? Let's say some user has a particularly large friends list that blows your storage capacity for the shard. You need to move the user to a different shard.
-
On some platforms I've worked on this is a killer problem. You had to build out the data center correctly from the start because moving data from shard to shard required a lot of downtime.
-
Joining data from multiple shards. To create a complex friends page, or a user profile page, or a thread discussion page, you usually must pull together lots of different data from many different sources. With sharding you can't just issue a query and get back all the data. You have to make individual requests to your data sources, get all the responses, and the build the page. Thankfully, because of caching and fast networks this process is usually fast enough that your page load times can be excellent.
-
mplementing shards is not well supported. Sharding is currently mostly a roll your own approach. LiveJournal makes their tool chain available. Hibernate has a library under development. MySQL has added support for partioning. But in general it's still something you must implement yourself.
-
-
23 May 13
-
16 Mar 13
-
12 Feb 13
-
18 Nov 12
-
27 Jun 12
-
26 May 12
-
02 Jan 12
-
18 Aug 11
Denis GuerreroArticle discussing DB sharding. Technique used by the big boys Facebook, Google, etc...
-
13 Aug 11
-
07 Aug 11
-
08 Jun 11
-
02 May 11
-
24 Dec 10
-
28 Jul 10
-
16 May 10
-
20 Apr 10
-
03 Mar 10
-
20 Nov 09
-
10 Aug 09
-
16 Jun 09
-
18 May 09
-
14 May 09
-
10 May 09
Dante-Gabryell MonsonWhat is sharding?
While working at Auction Watch, Dathan got the idea to solve their scaling problems by creating a database server for a group of users and running those servers on cheap Linux boxes. In this scheme the data for User A is stored on one s -
20 Apr 09
-
20 Mar 09
-
21 Feb 09
-
11 Feb 09
-
04 Feb 09
-
27 Jan 09
-
22 Jan 09
-
21 Jan 09
-
07 Jan 09
-
12 Nov 08
-
09 Nov 08
-
16 Oct 08
-
30 Aug 08
-
28 Aug 08
-
27 Aug 08
-
25 Aug 08
-
-
Flickr now handles more than 1 billion transactions per day, responding in less then a few seconds and can scale linearly at a low cost.
-
You can keep a user's profile data separate from their comments, blogs, email, media, etc, but the user profile data would be stored and retrieved as a whole. This is a very fast approach. You just get a blob and store a blob. No joins are needed and it can be written with one disk write.
-
-
09 Aug 08
-
29 Jul 08
-
08 Jul 08
-
23 Jun 08
-
15 Jun 08
-
13 Jun 08
-
03 Jun 08
-
27 May 08
-
21 May 08
-
06 May 08
-
03 May 08
-
24 Apr 08
-
22 Apr 08
-
05 Apr 08
-
01 Apr 08
-
31 Jan 08
-
09 Oct 07
-
In sharding the data are denormalized. You store together data that are used together.
-
-
04 Oct 07
-
03 Oct 07
-
23 Sep 07
-
15 Sep 07
-
08 Sep 07
-
28 Aug 07
-
16 Aug 07
Nicolas PerriaultWhat is sharding and how has it come to be the answer to large website scaling problems?
database scalability sharding architecture performance mysql cleverplanet
-
What is sharding and how has it come to be the answer to large website scaling problems?
database scalability sharding architecture performance mysql cleverplanet
-
15 Aug 07
Steve WillerWell, I have a name to put to the "mod 12" row-based data scaling technique, at least.
-
13 Aug 07
-
Olifante *"Dathan Pattishall explains his motivation for a revolutionary new database architecture - sharding - that he began thinking about even before he worked at Friendster, and fully implemented at Flickr. Flickr now handles more than 1B transactions per day"
-
12 Aug 07
-
08 Aug 07
-
07 Aug 07
-
TooManySecretsMagnífico documento donde se describe la técnica del "sharding", como una alternativa para una escalabilidad óptima en base de datos.
shardingsharding database shard highscalabitity altaescalabilidad sysadmin documentacion
-
sharding
-
-
06 Aug 07
-
05 Aug 07
-
04 Aug 07
-
ken ."Solving" performance with horizontal and vertical scaling (bigger, more boxes) has limits - Dathan Pattishall on Sharding, linear costs, avoiding bottleneck of a single master writer and replication, using federated space, grouped/partitioned data
architecture computer database design flickr growth strategy
Would you like to comment?
Join Diigo for a free account, or sign in if you are already a member.