bigdata | PlusOne

Apache Big Data Seville 2016 – Unified Benchmarking of Big Data Platforms – Axel-Cyrille Ngonga Ngomo

January 12, 2017
rbowen

Unified Benchmarking of Big Data Platforms – Axel-Cyrille Ngonga Ngomo

Which Big Data platform should I use for my problem? This question remains one of the most important questions for practitioners. In this talk, we will present HOBBIT (http://project-hobbit.eu), the universal benchmarking platform for Big Data. The platform provides a unified approach for benchmarking Big Data frameworks. Mimicking algorithms generated from real data ensure that the datasets used for benchmarking resemble real data but are open for all to use, thereby circumventing the issues that arise when using company-bound data. The core of the platform implements industry-relevant KPIs gathered from more than 70 Big-Data-driven organizations. The results are generated in machine-readable formats so that they can be analyzed and used for improving tools and frameworks. In the talk, I will present the architecture of the framework and some preliminary results.

More information about this talk

Apache Big Data Seville 2016 – Building and Running a Solr-as-a-Service for IBM Watson – Shai Erera

January 12, 2017
rbowen

Building and Running a Solr-as-a-Service for IBM Watson – Shai Erera

Running a managed Solr service brings fun challenges with it, for both the users and the service itself. Users typically do not have access to all components of the Solr system (e.g. the ZK ensemble, the actual nodes that Solr runs on, etc.). On the other hand, the service must ensure high availability at all times and handle what are often user-driven tasks, such as version upgrades, taking nodes offline for maintenance and more.

In this talk I will describe how we tackle these challenges to build a managed Solr service on the cloud, which currently hosts a few thousand Solr clusters. I will focus on the infrastructure that we chose to run the Solr clusters on, as well as how we ensure high availability, cluster balancing and version upgrades.

More information about this talk

Apache Big Data Seville 2016 – Create a Hadoop Cluster and Migrate 39PB Data Plus 150000 Jobs/Day – Stuart Pook

January 12, 2017
rbowen

Create a Hadoop Cluster and Migrate 39PB Data Plus 150000 Jobs/Day – Stuart Pook

Criteo had a Hadoop cluster with 39 PB of raw storage, 13404 CPUs, 105 TB of RAM, 40 TB of data imported per day and >100000 jobs per day. This cluster was critical for both storage and compute, but it had no backups. This talk describes:
0/ the different options considered when deciding how to protect our data and compute capacity
1/ the criteria established for the 800 new computers and comparison tests between suppliers’ hardware
2/ the non-blocking network infrastructure with 10 Gb/s endpoints, scalable to 5000 machines
3/ the installation and configuration, using Chef, of a cluster on new hardware
4/ the problems encountered in moving our jobs and data from the old CDH4 cluster to the new CDH5 cluster 600 km away
5/ running the two clusters in parallel and feeding both with data
6/ failover plans
7/ operational issues
8/ the performance of the CDH5 cluster with 16800 cores, 200 TB of RAM and 60 PB of disk.

More information about this talk

Apache Big Data Seville 2016 – Smart Manufacturing with Apache Spark Streaming and Deep Learning – Prajod Vettiyattil

January 12, 2017
rbowen

Smart Manufacturing with Apache Spark Streaming and Deep Learning – Prajod Vettiyattil

Even a century after the Industrial Revolution, manufacturing processes, even within assembly lines, involve manual steps requiring costly human intervention, for example product quality inspection. With the advent of machine learning and big data tools, it has become possible to automate many of these manual processes. What is more, such solutions can surpass human capability for manual quality inspection. In this session we will look at a few examples of how products on assembly lines can be monitored for quality, using image processing techniques combined with machine learning. The solution to be presented is built using a combination of machine learning and deep learning techniques running on Apache Spark Streaming.

The presentation will also explain the steps involved in creating such a solution: mapping a business need to an ML-based technical solution.

More information about this talk

Apache Big Data Seville 2016 – Meerkat: Anomaly Detection as a Service – Julien Herzen

January 5, 2017
rbowen

Meerkat: Anomaly Detection as a Service – Julien Herzen

Julien will present Meerkat, a system built at Swisscom to do real-time anomaly detection on time series. Meerkat uses a combination of machine learning and big data technologies to trigger alerts when problems occur in the Swisscom network.

Meerkat monitors arbitrary time series and trains statistical models that can be used to spot anomalies in both batch (historical) and streaming (live) data. It is composed of Python modules for anomaly detection and data ingestion from Druid, as well as Scala modules using Apache Spark for ingesting from Apache Kafka and Apache Hadoop’s HDFS.
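To make the idea of spotting anomalies with a statistical model concrete, here is a minimal sketch in Python. It is not Meerkat's actual model, which the abstract does not describe; it assumes a simple trailing-window z-score, and the function name, window size, threshold and sample data are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(series, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` standard deviations (illustrative only)."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if series[i] != mu:
                anomalies.append(i)
            continue
        if abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical per-minute call counts with one sudden drop at index 7.
calls_per_minute = [100, 102, 98, 101, 99, 100, 103, 20, 101, 100]
print(find_anomalies(calls_per_minute))
```

The same function could be applied to a historical batch or called on each new point of a live stream, which is the batch/streaming split the abstract mentions.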

Meerkat is currently used successfully at Swisscom to trigger alerts in case of problems with VoIP calls, which represent more than 3 million phone calls per day.

This is joint work with Khue Vu, who worked on Meerkat for his MSc thesis at EPFL, and the network intelligence team of Swisscom Innovation.

More information about this talk

Apache Big Data Seville 2016 – Classifying Unstructured Text – Deterministic and Machine Learning Approaches – Christian Winkler & Stephanie Fischer

January 5, 2017
rbowen

Classifying Unstructured Text – Deterministic and Machine Learning Approaches – Christian Winkler & Stephanie Fischer

Text is one of the most used forms of communication and is ubiquitous on the Internet. Social networks like Facebook and Twitter mainly contain unstructured text; the same is true for content-driven websites.

For humans it is easy to grasp the meaning of text; for computers it is much more difficult. Used correctly, computers can help humans tremendously in structuring and classifying huge amounts of text. This “symbiosis” can help humans work more efficiently, reduce repetitive work and use the uncovered structure.

Our talk starts with visualizations that give us ideas on how to classify texts automatically. Then we will demonstrate that manual intervention is sometimes necessary and how this can be used as a basis for machine learning. This helps significantly in classifying more complicated cases.

As software tools we use R, Apache Solr, D3.js, and several NLP and ML tools from the ASF.

More information about this talk

Apache Big Data Seville 2016 – Power Pig with Spark – Liyun Zhang

January 5, 2017
rbowen

Power Pig with Spark – Liyun Zhang

Apache Pig is a popular scripting platform for processing and analyzing large data sets in the Hadoop ecosystem. With its open architecture and backend neutrality, Pig scripts can currently run on MapReduce and Tez. Apache Spark is an open-source cluster computing framework for data analytics that has gained significant momentum recently. Besides offering performance advantages, Spark is also a more natural fit for the query plan produced by Pig. Pig on Spark enables improved ETL performance while also supporting users who intend to standardize on Spark as the execution engine.

More information about this talk

Apache Big Data Seville 2016 – Apache Ignite – Path to Converged Data Platform – Dmitriy Setrakyan

January 5, 2017
rbowen

Apache Ignite – Path to Converged Data Platform – Dmitriy Setrakyan

Apache Ignite is one of the fastest growing Apache projects. The presentation will take the audience through the roadmap of Ignite moving to a converged storage model that supports both analytical and transactional data sets. We will go over the differences between Fast Data and Big Data and cover the projects supporting both technologies. We will discuss the reasons, real-life use cases and technology approaches for merging Fast Data and Big Data in order to deliver a consistent and universal data processing platform, regardless of whether data resides on HDD, flash or DRAM.

More information about this talk

Apache Big Data Seville 2016 – Interactive Analytics at Scale in Apache Hive Using Druid – Jesús Camacho Rodríguez

January 5, 2017
rbowen

Interactive Analytics at Scale in Apache Hive Using Druid – Jesús Camacho Rodríguez

Druid is an open-source analytics data store specially designed to execute OLAP queries on event data. Its speed, scalability and efficiency have made it a popular choice to power user-facing analytic applications. However, it does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid's indexing and querying capabilities using Apache Hive. In particular, our solution allows users to index complex query results in Druid using Hive, query Druid data sources from Hive using SQL, and execute complex Hive queries on top of Druid data sources. We describe how we built an extension that benefits both systems, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries.
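As a rough illustration of what a "Druid JSON query" looks like, the Python sketch below builds a native Druid timeseries query for a simple aggregation. It is hand-written, not the output of the Hive/Calcite integration described in the talk; the helper name, the `events` data source and the `clicks` metric are hypothetical.

```python
import json


def timeseries_query(data_source, interval, metric, granularity="all"):
    """Build a Druid native timeseries query that sums one metric.

    Illustrative only: a real SQL-to-Druid translation handles far more
    cases (filters, group-bys, different aggregators).
    """
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": granularity,
        "intervals": [interval],
        "aggregations": [
            {"type": "longSum", "name": f"sum_{metric}", "fieldName": metric}
        ],
    }


# Roughly corresponds to: SELECT SUM(clicks) FROM events, restricted to 2016.
query = timeseries_query("events", "2016-01-01/2017-01-01", "clicks")
print(json.dumps(query, indent=2))
```

The point of the comparison is that the SQL form is one line while the JSON form must spell out the query type, interval and aggregator explicitly, which is the translation the abstract says is generated transparently.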

More information about this talk

Apache Big Data Seville 2016 – Hadoop, Hive, Spark and Object Stores – Steve Loughran

January 5, 2017
rbowen

Hadoop, Hive, Spark and Object Stores – Steve Loughran

Cloud deployments of Apache Hadoop are becoming more commonplace. Yet Hadoop and its applications don’t integrate that well – something which starts right down at the file IO operations.

This talk looks at how to make use of cloud object stores in Hadoop applications, including Hive and Spark. It will go from the foundational “what’s an object store?” to the practical “what should I avoid?” and the timely “what’s new in Hadoop?” – the latter covering the improved S3 support in Hadoop 2.8+.

I’ll explore the details of benchmarking and improving object store IO in Hive and Spark, showing what developers can do in order to gain performance improvements in their own code – and equally, what they must avoid.
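One commonly cited difference between a filesystem and an object store, offered here as background rather than as content taken from the talk, is that an object store is a flat key-to-object map with no real directories. Renaming a "directory" therefore means copying and deleting every object under a prefix. The toy Python class below is a hypothetical in-memory stand-in that counts those copies; it does not use any real S3 or Hadoop API.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store: a flat map of key -> bytes."""

    def __init__(self):
        self.objects = {}
        self.copies = 0  # number of per-object copy operations performed

    def put(self, key, data):
        self.objects[key] = data

    def rename_prefix(self, src, dst):
        """Emulate a directory rename: copy every object under `src`
        to the matching key under `dst`, then delete the original."""
        for key in [k for k in self.objects if k.startswith(src)]:
            self.objects[dst + key[len(src):]] = self.objects[key]
            self.copies += 1
            del self.objects[key]


store = ToyObjectStore()
for part in range(3):
    store.put(f"output/_temporary/part-{part}", b"rows")

# Committing job output by "renaming" the temporary directory.
store.rename_prefix("output/_temporary/", "output/")
print(sorted(store.objects), store.copies)
```

On a real filesystem the same rename is a single metadata operation; here the cost grows with the number (and, in a real store, the size) of the objects under the prefix.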

Finally, I’ll look at ongoing work, especially “S3Guard” and what its fast and consistent file metadata operations promise.

More information about this talk