
Rhm2ktmi's List: Big Data

    • Oracle recently packed four big data-related products into a single announcement, headlined by the launch of Oracle Big Data Discovery (BDD), a new Hadoop data discovery tool. For the other three offerings – Oracle GoldenGate for Big Data, Oracle Big Data SQL and Oracle NoSQL Database 3.2.5 – Oracle hit the refresh button.
    • Oracle BDD is a new data-discovery, preparation and analysis tool that runs as a native application on the Oracle Big Data Appliance (BDA). Cloudera's Distribution Including Apache Hadoop (CDH) comes packaged with BDA, so CDH is currently the distribution compatible with Oracle BDD; a Hortonworks-compatible version is in development. Oracle BDD is not tied to BDA, however, and can also be installed on a stand-alone Cloudera cluster.
    • DataStax, the company that delivers Apache Cassandra™ to the enterprise, today announced the acquisition of Aurelius LLC, the innovators behind the open source graph database, Titan.
    • As the leading experts in graph database technology, the Aurelius team will join DataStax to build DataStax Enterprise (DSE) Graph, adding graph database capabilities into DSE alongside Apache Cassandra, DSE Search and Analytics. The addition of graph technology to DSE will empower enterprises with true ‘multi-model’ capabilities that deliver new levels of power and flexibility to transactional applications.
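To make the 'multi-model' idea concrete, here is a minimal sketch of the vertex/edge data model a graph layer adds alongside a tabular or key-value store. This is an invented, in-memory illustration in plain Python, not the DataStax or Titan API; all class and method names are hypothetical.

```python
# Hypothetical sketch of a property graph: vertices carry properties,
# edges carry labels, and traversal follows labeled edges. This is the
# query shape that is awkward to express over plain tables.

class Graph:
    def __init__(self):
        self.vertices = {}   # vertex id -> properties dict
        self.edges = {}      # vertex id -> list of (label, neighbor id)

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props
        self.edges.setdefault(vid, [])

    def add_edge(self, src, label, dst):
        self.edges[src].append((label, dst))

    def out(self, vid, label):
        """Traverse outgoing edges with a given label."""
        return [dst for (lbl, dst) in self.edges.get(vid, []) if lbl == label]

g = Graph()
g.add_vertex("alice", kind="user")
g.add_vertex("bob", kind="user")
g.add_vertex("cassandra", kind="product")
g.add_edge("alice", "follows", "bob")
g.add_edge("alice", "bought", "cassandra")

print(g.out("alice", "follows"))   # ['bob']
```

A real graph layer adds indexing, distribution and a traversal language on top, but the data model is essentially this adjacency structure.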
    • Druid, an open source database designed for real-time analysis, is moving to the Apache 2 software license in the hope of spurring more use of, and innovation around, the project. It was open sourced in late 2012 under the GPL, which is generally considered more restrictive than the Apache license in terms of how software can be reused.
    • Information technology -- Security techniques -- Code of practice for protection of personally identifiable information (PII) in public clouds acting as PII processors (ISO/IEC 27018)
    • In this blog, we will focus on one of those data processing engines—Apache Storm—and its relationship with Apache Kafka. I will describe how Storm and Kafka form a multi-stage event processing pipeline, discuss some use cases, and explain Storm topologies.
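The Kafka-to-Storm shape described above can be sketched, very loosely, as a buffered stream of events flowing through successive processing stages. The toy below is plain Python standing in for that pipeline; the queue plays the role of a Kafka topic and the two functions play the role of Storm bolts. All names and data are invented for illustration; this is not the Storm or Kafka API.

```python
# Hedged sketch of a multi-stage event pipeline: buffer -> parse -> aggregate.
from collections import Counter, deque

# Stage 0: a queue standing in for a Kafka topic of raw log lines.
topic = deque(["GET /home", "GET /cart", "POST /cart", "GET /home"])

def parse(raw):
    """Stage 1 (a 'bolt'): split a raw log line into (method, path)."""
    method, path = raw.split(" ", 1)
    return method, path

def count_paths(events):
    """Stage 2 (a second 'bolt'): aggregate a count per path."""
    counts = Counter()
    for _, path in events:
        counts[path] += 1
    return counts

parsed = [parse(topic.popleft()) for _ in range(len(topic))]
counts = count_paths(parsed)
print(counts["/home"])   # 2
```

In a real Storm topology each stage runs as distributed tasks and the "queue" is a durable, partitioned Kafka topic, but the staged dataflow is the same idea.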
    • Hortonworks has launched an initiative to improve data governance in Apache Hadoop by providing a single data-governance foundation for the Hadoop stack. Data governance was not initially a major concern for Hadoop, but it has become increasingly relevant as large enterprises want to use Hadoop in more mission-critical production deployments and deploy multipurpose Hadoop clusters (so-called 'data lakes').
    • The result will be a new Apache incubator project (or projects; the fine details are still being worked on) to create new knowledge-store, policy-engine and audit-store functionality that will plug into Apache Falcon and Apache Ranger (the evolution of the security policy engine that Hortonworks acquired with XA Secure).
  • Jan 13, 15

    "Data protection law – the bundle of statutory duties on those who handle personal data about individuals and the corresponding rights for the individuals concerned – sits plumb in the centre of data law, an increasingly broad and complex amalgam of contract law, intellectual property and regulation.

    An important area of looming challenge for data protection lawyers at the moment is Big Data, the aggregation and analysis of datasets of great volume, variety and velocity for the purpose of competitive advantage, where the business world is just at the start of a period of rapid adoption."


  • Jan 11, 15

    • With YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it for batch, interactive and real-time streaming use cases. As more data flows into and through a Hadoop cluster to feed these engines, Apache Falcon is a crucial framework for simplifying data management and pipeline processing.
      • Among these many bug fixes, improvements and new features, four stand out as particularly important:

         
           
        • Authorization with ACLs for entities
        • Enhancements to lineage metadata
        • Cloud archival
        • Falcon recipes

    • The StreamFlow™ software project is designed to make working with Apache Storm, a free and open source distributed real-time computation system, easier and more productive. A Storm application ingests large amounts of data through topologies – directed graphs of spouts (data sources) and bolts (processing steps) that define how streams flow through the computation. These topologies organize the data streams into understandable pipelines.

        

    • One key feature of Kafka is its functional simplicity. While there is a lot of sophisticated engineering under the covers, Kafka’s general functionality is relatively straightforward. Part of this simplicity comes from its independence from any other applications (excepting Apache ZooKeeper). As a consequence however, the responsibility is on the developer to write code to either produce or consume messages from Kafka. While there are a number of Kafka clients that support this process, for the most part custom coding is required.
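The "functional simplicity" point can be illustrated with a toy model of Kafka's core abstraction: a topic is an append-only log, producers append messages, and each consumer tracks only its own read offset. The sketch below is pure Python with no broker, and the class and method names are invented; it shows the idea, not the Kafka client API.

```python
# Hypothetical in-memory model of a single-partition Kafka topic.

class TopicLog:
    def __init__(self):
        self.log = []                 # append-only sequence of messages

    def produce(self, msg):
        self.log.append(msg)
        return len(self.log) - 1      # offset of the new message

    def consume(self, offset, max_msgs=10):
        """Read from a caller-supplied offset. The 'broker' keeps no
        per-consumer state -- tracking position is the consumer's job,
        which is exactly the responsibility the text describes."""
        batch = self.log[offset:offset + max_msgs]
        return batch, offset + len(batch)

t = TopicLog()
for m in ("a", "b", "c"):
    t.produce(m)

batch, next_offset = t.consume(0, max_msgs=2)
print(batch, next_offset)   # ['a', 'b'] 2
```

Because the log is immutable and position lives with the consumer, many independent consumers can read the same topic at their own pace, which is much of what makes Kafka simple to reason about.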
    • Cloudera engineers and other open source community members have recently committed code for Kafka-Flume integration, informally called “Flafka,” to the Flume project. Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store. Flume provides a tested, production-hardened framework for implementing ingest and real-time processing pipelines. Using the new Flafka source and sink, now available in CDH 5.2, Flume can both read and write messages with Kafka.
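A Flume agent using Kafka on both ends, in the Flafka style, might look like the fragment below. The agent, channel, topic and host names are placeholders, and the exact property keys should be verified against the Flume documentation for your release; this is a sketch, not a tested configuration.

```properties
# Hypothetical Flume agent wiring Kafka as both source and sink ("Flafka").
# All names (tier1, kafka-in, topics, hosts) are invented placeholders.
tier1.sources  = kafka-in
tier1.channels = mem
tier1.sinks    = kafka-out

tier1.sources.kafka-in.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.kafka-in.zookeeperConnect = zk1:2181
tier1.sources.kafka-in.topic = raw-events
tier1.sources.kafka-in.channels = mem

tier1.channels.mem.type = memory

tier1.sinks.kafka-out.type = org.apache.flume.sink.kafka.KafkaSink
tier1.sinks.kafka-out.brokerList = broker1:9092
tier1.sinks.kafka-out.topic = enriched-events
tier1.sinks.kafka-out.channel = mem
```

Events read from the `raw-events` topic flow through the Flume channel (where interceptors could transform them) and are written back to a second topic, which is the ingest-plus-light-processing pipeline the excerpt describes.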
    • DeepDive is a new type of system that enables developers to analyze data on a deeper level than ever before. DeepDive is a trained system: it uses machine learning techniques to leverage domain-specific knowledge, and it incorporates user feedback to improve the quality of its analysis.

        

    • Glassbeam says that with the latest version of its SCALAR data processing engine, it is prepared for the IoT, which will require real-time data analytics able to handle billions of sensor readings. It already had a fast analytics platform, but integration with Apache Spark enables real-time analytics, as well as predictive analytics and machine learning.
    • The company started with Glassbeam Analytics for standard and custom analytics on machine-generated data, and then introduced Glassbeam Explorer for search and exploratory analysis in 2013.

    6 more annotations...

  • Cask 3

    Nov 16, 14

    • Deliver the Cask Data Application Platform (CDAP), an open source application development platform for the Hadoop ecosystem that provides developers with data and application virtualization to accelerate application development, address a broader range of real-time and batch use cases, and deploy applications into production while satisfying enterprise requirements.
    • Data Virtualization

       

      Logical representations of data.

    1 more annotation...

    • It has now gone one step further, because the core technology has been proposed as an Apache Software Foundation project.
    • GridGain announced the release of its in-memory data processing technology using the Apache License in March.

    3 more annotations...

    • Paxata has released the Fall 2014 version of its cloud service, which is designed to provide a single metadata layer as well as process and execution models for data preparation tasks. The latest release marks one of the first major makeovers of the startup's multi-tenant service for data integration, quality, enrichment, governance and collaboration.
    • Paxata has a fresh cut of its data preparation service on the market following the general availability of Fall 2014 in October. The latest version includes numerous enhancements to bolster performance, flexibility and connectivity. It also contains improvements to the front-end application used by business analysts to wrangle data in a self-service fashion.

    9 more annotations...

    • Netflix has long been a proponent of the microservices model. This model offers higher availability, resiliency to failure and loose coupling. The downside of such an architecture is the potential for a slow, latency-bound user experience.
    • Most of these microservices use some kind of stateful system to store and serve data. A few milliseconds here and there can add up quickly and result in a multi-second response time.
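The arithmetic behind "a few milliseconds here and there can add up" is easy to make concrete. The sketch below uses invented latency numbers purely for illustration: when a request chains through many sequential microservice calls, per-hop latencies sum.

```python
# Back-of-the-envelope sketch: sequential microservice hops sum their
# latencies. (Parallel fan-out would instead be bounded by the slowest
# hop, which is why services parallelize calls where they can.)

def total_latency_ms(hop_latencies):
    """Total latency of a strictly sequential call chain."""
    return sum(hop_latencies)

hops = [8, 12, 40, 25, 15] * 10   # 50 hypothetical sequential service calls
print(total_latency_ms(hops))     # 1000 ms -- a full second before rendering
```

Fifty calls averaging 20 ms each already costs a second, which is how "a few milliseconds here and there" becomes a multi-second response time.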