This link has been bookmarked by 37 people. It was first bookmarked on 15 Sep 2009, by mailforlen yahoo.
-
In modern systems, as demonstrated in the figure, random access to memory is typically slower than sequential access to disk.
-
SSD improves on this ratio by less than one order of magnitude. In a very real sense, all of the modern forms of storage improve only in degree, not in their essential nature, upon that most venerable and sequential of storage media: the tape.
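That gap is straightforward to reproduce. Below is a minimal Python sketch of my own (not from the article; it assumes NumPy is available) that sums the same in-memory array once in sequential order and once in shuffled order; once the array outgrows the CPU caches, the shuffled pass is typically several times slower.

# Minimal sketch: sequential vs. random access to the same in-memory array.
# Assumes NumPy; the array size is chosen to exceed typical CPU cache sizes.
import time
import numpy as np

N = 20_000_000                       # ~160 MB of int64, well beyond cache
data = np.ones(N, dtype=np.int64)
seq_idx = np.arange(N)               # indices in sequential order
rand_idx = np.random.permutation(N)  # the same indices, randomly shuffled

for label, idx in [("sequential", seq_idx), ("random", rand_idx)]:
    t0 = time.perf_counter()
    total = data[idx].sum()          # gather in the given order, then sum
    print(f"{label:>10}: {time.perf_counter() - t0:.3f} s (sum={total})")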
-
the highest-speed local network technologies have now surpassed most locally attached disk systems with respect to bandwidth, and network latency is naturally much lower than disk latency.
As a result, the performance cost of storing and retrieving data on other nodes in a network is comparable to (and in the case of random access, potentially far less than) the cost of using disk.
-
100 gigabytes is enough to store at least the basic demographic information—age, sex, income, ethnicity, language, religion, housing status, and location, packed in a 128-bit record—for every living human being on the planet.
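The arithmetic behind that figure is easy to check. A back-of-envelope in Python (the ~6.75 billion population is my assumption for the article's 2009 vintage, not a number given in the excerpt):

# Back-of-envelope check of the "100 gigabytes" claim.
population = 6.75e9          # people alive circa 2009 (assumed figure)
record_bytes = 128 // 8      # one 128-bit demographic record = 16 bytes
total_gb = population * record_bytes / 1e9
print(f"{total_gb:.0f} GB")  # ~108 GB, i.e. on the order of 100 gigabytes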
-
The Web log records millions of visits a day to a handful of pages; the cellphone database stores time and location every 15 seconds for each of a few million phones;
-
What makes most big data big is repeated observations over time and/or space.
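A rough estimate for the cellphone example above shows how quickly repeated observations add up; the phone count and per-record size here are illustrative assumptions, not figures from the article.

# Rough daily volume for "time and location every 15 seconds" per phone.
phones = 5_000_000                 # "a few million phones" (assumed count)
interval_s = 15                    # one observation every 15 seconds
record_bytes = 32                  # assumed: phone id + timestamp + lat/lon
obs_per_day = 86_400 / interval_s  # 5,760 observations per phone per day
gb_per_day = phones * obs_per_day * record_bytes / 1e9
print(f"{gb_per_day:.0f} GB/day")  # ~922 GB/day: repetition makes it big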
-
The fact that most large datasets have inherent temporal or spatial dimensions, or both, is crucial to understanding one important way that big data can cause performance problems, especially when databases are involved.
-
Distributed Computing as a Strategy for Big Data
-
The beauty of today’s mainstream computer hardware, though, is that it’s cheap and almost infinitely replicable. Today it is much more cost-effective to purchase eight off-the-shelf, “commodity” servers with eight processing cores and 128 GB of RAM each than it is to acquire a single system with 64 processors and a terabyte of RAM.
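The two configurations are equivalent in aggregate capacity, which is what makes the price gap striking; a quick check (no cost figures assumed, since the excerpt gives none):

# Aggregate capacity of the commodity cluster vs. the monolithic system.
servers, cores_each, ram_gb_each = 8, 8, 128
print(servers * cores_each, "cores,", servers * ram_gb_each, "GB RAM")
# -> 64 cores, 1024 GB: the same totals as the single 64-processor, 1-TB box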
-
18 May 11 Nick Gall
The Pathologies of Big Data (2009) http://bit.ly/jUXcKD /via @srisatish
I will take a stab at a meta-definition: big data should be defined at any point in time as “data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time.” In the early 1980s, it was a dataset that was so large that…
via_delicious_20101217 bigdata definition pinboardimport20141106
-
18 Feb 11 Neil Saunders
What is “big data” anyway? Gigabytes? Terabytes? Petabytes?
-
28 Oct 10 koranteng
tradeoffs of big data in a human scale world
data scalability technology systems database performance scaling software modeling constraints architecture design strategy BestPractices
-
15 Jan 10 Antti Poikola
A database on the order of 100 GB would not be considered trivially small even today, although hard drives capable of storing 10 times as much can be had for less than $100 at any computer store.
-
The Pathologies of Big Data
-
by Adam Jacobs | July 6, 2009
-
Scale up your datasets enough and all your apps will come undone
-
What are the typical problems and where do the bottlenecks generally surface?
-
Dealing with Big Data
-
Hard Limits
-
Distributed Computing as a Strategy for Big Data
-
A Meta-definition
-
References
-
Originally published in Queue vol. 7, no. 6