
alex band's List: Cloud Database

    • Cassandra has several optimizations to make writes cheaper. When a write operation occurs, it doesn't immediately cause a write to the disk. Instead, the record is updated in memory and the write operation is added to the commit log. Periodically the list of pending writes is processed and write operations are flushed to disk. As part of the flushing process, the set of pending writes is analyzed and redundant writes are eliminated. Additionally, the writes are sorted so that the disk is written to sequentially, which significantly reduces seek time on the hard drive and the impact of random writes on the system. How important is reducing seek time when accessing data on a hard drive? It can make the difference between taking hours versus days to flush a hundred gigabytes of writes to a disk. Disk is the new tape.
    • The Cassandra data model is fairly straightforward. The entire system is a giant table with lots of rows. Each row is identified by a unique key. Each row has a column family, which can be thought of as the schema for the row. A column family can contain thousands of columns, each of which is a tuple of {name, value, timestamp}, and/or super columns, each of which is a tuple of {name, column+}, where column+ means one or more columns. This is very similar to the data model behind Google's BigTable.
    • Facebook Cassandra - alex band on 2008-07-30
    • Disk is the new tape - alex band on 2008-07-30
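
The write path quoted above (update in memory, append to a commit log, then flush a deduplicated, sorted batch) can be sketched roughly as follows. This is a toy model for illustration only, not Cassandra's implementation; the class name `WriteBuffer` and its attributes are hypothetical.

```python
class WriteBuffer:
    """Toy model of a buffered write path (hypothetical, not Cassandra's code)."""

    def __init__(self):
        self.commit_log = []  # append-only record of every pending write
        self.memtable = {}    # in-memory view: key -> latest value
        self.disk = []        # flushed segments, each a key-sorted list

    def write(self, key, value):
        # No disk write happens here: log the operation, update memory.
        self.commit_log.append((key, value))
        # Overwriting the key makes any earlier write to it redundant.
        self.memtable[key] = value

    def flush(self):
        # Only the latest value per key survives, and the batch is sorted
        # by key so it can be written out in one sequential pass.
        segment = sorted(self.memtable.items())
        self.disk.append(segment)
        self.memtable.clear()
        self.commit_log.clear()
        return segment


buf = WriteBuffer()
buf.write("b", 1)
buf.write("a", 2)
buf.write("b", 3)     # supersedes the first write to "b"
print(buf.flush())    # [('a', 2), ('b', 3)]
```

Three logged writes collapse to two sorted entries at flush time, which is the saving the quoted passage describes.
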
    • The services layer uses Jini for service registration and discovery, but SCA and OSGi integrations are being considered.
    • bigdata is a 100% Java project providing scale-out (distributed) indices, map/reduce style computing, a sparse row store (à la Hadoop’s HBase, Google’s bigtable, or CouchDB), a distributed file system (à la Hadoop’s HDFS or Google’s GFS), a high performance RDF database, and a flexible generic object model (GOM) database.

    • bigtable semantic and DFS - alex band on 2008-07-30
    • The components of bigdata - alex band on 2008-07-30
    • Now we are working on the 10gen database, named Mongo
    • Scalability: object databases are easier to scale than relational databases; sharding is easier. In a relational database, distributed joins are a complex problem that must be solved if one desires true plug-and-play scalability without limits.

    • Mongo, one of the prototypes of cloud databases - alex band on 2008-07-29
    • Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.

      CouchDB is written in Erlang, but can be easily accessed from any environment that provides means to make HTTP requests. There are a multitude of third-party client libraries that make this even easier for a variety of programming languages and environments.

      What CouchDB is

        • A document database server, accessible via a RESTful JSON API.
        • Ad-hoc and schema-free with a flat address space.
        • Distributed, featuring robust, incremental replication with bi-directional conflict detection and management.
        • Queryable and indexable, featuring a table-oriented reporting engine that uses JavaScript as a query language.

      What it is not

        • A relational database.
        • A replacement for relational databases.
        • An object-oriented database or, more specifically, a seamless persistence layer for an OO programming language.
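
Since CouchDB is described above as reachable from anything that can make HTTP requests, a document write is just a `PUT` with a JSON body. The sketch below only builds such a request with Python's standard library; the helper name `build_put_document`, the database name, and the document contents are hypothetical, and the base URL assumes a local server on CouchDB's default port 5984.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_put_document(base_url, database, doc_id, document):
    """Build (but do not send) a PUT request that stores a JSON document.

    Hypothetical helper for illustration; the path layout assumed here
    is /<database>/<document id>.
    """
    url = "{}/{}/{}".format(
        base_url.rstrip("/"),
        quote(database, safe=""),
        quote(doc_id, safe=""),  # percent-encode spaces, slashes, etc.
    )
    body = json.dumps(document).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_put_document(
    "http://localhost:5984",          # assumed local server
    "bookmarks",                      # hypothetical database name
    "cloud db",                       # hypothetical document id
    {"title": "Cloud Database", "tags": ["couchdb"]},
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it, provided a server is actually running at that address. Because the document is schema-free, the dictionary can hold any JSON-serializable structure.
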
    • Nimbus’ state-of-the-art Breeze unified iSCSI SAN and NAS storage systems, featuring the HALO storage operating system and 10 Gigabit Ethernet technology, provide a scalable, easy-to-use storage infrastructure for midsize enterprises focused on storage consolidation, server virtualization, and digital content management. With over 10,000 installations, Nimbus’ MySAN software is the world’s most popular open iSCSI target for Microsoft Windows servers.
    • Nimbus IP storage - alex band on 2008-07-30
    • Data Model

        

      HBase uses a data model very similar to that of Bigtable. Users store data rows in labelled tables. A data row has a sortable key and an arbitrary number of columns. The table is stored sparsely, so that rows in the same table can have crazily-varying columns, if the user likes.

       

      A column name has the form "<family>:<label>" where <family> and <label> can be any string you like. A single table enforces its set of <family>s (called "column families"). You can only adjust this set of families by performing administrative operations on the table. However, you can use new <label> strings at any write without preannouncing it. HBase stores column families physically close on disk. So the items in a given column family should have roughly the same write/read behavior.

       

      Writes are row-locked only. You cannot lock multiple rows at once. All row-writes are atomic by default.

       

      All updates to the database have an associated timestamp. HBase stores a configurable number of versions of a given cell. Clients can get data by asking for the "most recent value as of a certain time". Or, clients can fetch all available versions at once.

    • Conceptual View

        

      Conceptually a table may be thought of as a collection of rows that are located by a row key (and optional timestamp) and where any column may not have a value for a particular row key (sparse). The following example is a slightly modified form of the one on page 2 of the Bigtable Paper.

    • Just as Google's Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop Core. Data is organized into tables, rows and columns. An Iterator-like interface is available for scanning through a row range (and of course there is the ability to retrieve a column value for a specific key). Any particular column may have multiple versions for the same row key.
    • HBase is the Hadoop database. It's an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop.
    • Bigtable and HBase - alex band on 2008-07-30
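
The HBase data model quoted above (fixed column families, free-form labels, sparse rows, timestamped versions) can be mimicked with nested dictionaries. This is a minimal in-memory sketch, not HBase's API; `SparseTable`, its methods, and the sample row and column names are hypothetical.

```python
from collections import defaultdict


class SparseTable:
    """In-memory imitation of the data model (hypothetical, not HBase code)."""

    def __init__(self, families, max_versions=3):
        self.families = set(families)     # fixed when the table is created
        self.max_versions = max_versions  # versions kept per cell
        # row key -> "family:label" -> [(timestamp, value)], newest first
        self.rows = defaultdict(dict)

    def put(self, row, column, value, timestamp):
        family, _, _label = column.partition(":")
        if family not in self.families:
            # New labels are free; new families need an administrative change.
            raise KeyError("unknown column family: {!r}".format(family))
        versions = self.rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # drop the oldest beyond the limit

    def get(self, row, column, as_of=None):
        """Most recent value, optionally as of a given timestamp."""
        for timestamp, value in self.rows.get(row, {}).get(column, []):
            if as_of is None or timestamp <= as_of:
                return value
        return None  # sparse: an absent cell simply has no value

    def versions(self, row, column):
        """All stored versions of a cell, newest first."""
        return list(self.rows.get(row, {}).get(column, []))


table = SparseTable(["contents", "anchor"], max_versions=2)
table.put("com.example.www", "contents:html", "<html>v1</html>", timestamp=1)
table.put("com.example.www", "contents:html", "<html>v2</html>", timestamp=5)
print(table.get("com.example.www", "contents:html"))           # newest value
print(table.get("com.example.www", "contents:html", as_of=3))  # value as of t=3
print(table.get("com.example.www", "anchor:home"))             # None (sparse)
```

Only cells that were written occupy space, so rows in the same table can carry entirely different columns.
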
    • Dynamo and similar Amazon technologies are used to power parts of our Amazon Web Services, such as S3.
    • This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience.  To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
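
The combination of object versioning and application-assisted conflict resolution mentioned above can be illustrated with vector clocks: concurrent writes are kept side by side as siblings, and the application merges them on a later read. The sketch below models a single store under simplifying assumptions (every writer passes the context from a prior read, and node names are supplied by the caller); `VersionedStore`, `descends`, and `merge_clocks` are hypothetical names, not Dynamo's interface.

```python
def descends(a, b):
    """True if vector clock a has seen every event recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def merge_clocks(clocks):
    """Element-wise maximum of several vector clocks."""
    merged = {}
    for clock in clocks:
        for node, count in clock.items():
            merged[node] = max(merged.get(node, 0), count)
    return merged


class VersionedStore:
    """Keeps conflicting versions instead of silently overwriting them."""

    def __init__(self):
        self.data = {}  # key -> list of (clock, value) siblings

    def get(self, key):
        return list(self.data.get(key, []))

    def put(self, key, value, context, node):
        # The new version's clock is the caller's context plus one event
        # on the coordinating node.
        clock = dict(context)
        clock[node] = clock.get(node, 0) + 1
        # Drop versions the new one supersedes; keep concurrent ones.
        siblings = [(c, v) for c, v in self.data.get(key, [])
                    if not descends(clock, c)]
        siblings.append((clock, value))
        self.data[key] = siblings
        return clock


store = VersionedStore()
base = store.put("cart", {"book"}, context={}, node="A")

# Two clients update from the same starting version via different nodes.
store.put("cart", {"book", "pen"}, context=base, node="B")
store.put("cart", {"book", "mug"}, context=base, node="C")
siblings = store.get("cart")          # two concurrent versions survive

# The application resolves the conflict, here by taking the set union.
merged_value = set().union(*(value for _, value in siblings))
merged_context = merge_clocks(clock for clock, _ in siblings)
store.put("cart", merged_value, context=merged_context, node="A")
print(store.get("cart"))
```

The set union is only one possible merge policy; the point is that the store surfaces both versions and leaves the choice to the application.
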

        
