PlusOne

Telling the ASF's Stories

(Apache) Drill-ing into Collections of HDF5 Files Hyo-Kyung Joe Lee

September 12, 2019
timothyarthur

This talk is about building bridges between two ecosystems, Apache and HDF5. HDF5 is a widely used storage container for complex data from embedded devices to supercomputers, and is developed and maintained as FOSS by The HDF Group. While it is easy to imagine the potential benefits of making HDF5 containers accessible from the various Apache frameworks, there are several technical challenges to overcome, and what makes a ‘good’ integration is by no means obvious. It requires the input and collaboration of experts from both sides. The core of this presentation will be an overview of a joint project between The HDF Group and Apache Drill contributor Charles Givre. We will show how to use Apache Drill to explore collections of HDF5 containers and discuss the underlying design decisions and a few technicalities. We will use this opportunity to give the Apache community an update on the support for collections of HDF5 files on HDFS, in cloud storage (such as S3), our Spark data source and our new ‘HDF5 in the cloud’ platform, HDF KITA. The intended audience is ‘bridge-builders’ and curious members of the Apache community, and anyone interested in ‘HDF5 demystified.’
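As a hedged sketch of what querying an HDF5 container from Drill can look like (the file path, dataset path, and column names below are invented; this assumes Drill's HDF5 format plugin is configured on the `dfs` storage plugin):

```sql
-- Hypothetical example: query an HDF5 file's metadata, then project
-- one dataset as a table via a table function. Names are made up.
SELECT path, data_type, dimensions
FROM dfs.`/data/sensors/readings.h5`;

-- The defaultPath format option selects a single dataset to read.
SELECT temperature, pressure
FROM table(dfs.`/data/sensors/readings.h5`
           (type => 'hdf5', defaultPath => '/group1/measurements'))
LIMIT 10;
```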

Real time performance diagnosis in distributed databases Partha Kanuparthy Amazon Alolita Sharma Amazon

September 12, 2019
timothyarthur

Good and predictable performance is a common expectation across customers of Elasticsearch, an unstructured, distributed database. Elasticsearch, however, is not built with performance diagnosis as a design tenet and this makes it hard for customers to troubleshoot performance problems or provision a cluster. Attend this talk to learn about Performance Analyzer, a feature we built on Open Distro for Elasticsearch, to enable real time performance diagnosis on your Elasticsearch clusters. We will cover the abstractions for observability and root cause analysis, and walk through the design choices we made when building Performance Analyzer on Elasticsearch and the JVM. We will also show a live demo of how Performance Analyzer can help you troubleshoot an issue on your Elasticsearch cluster.

What’s Surprising about Apache Drill and Why That’s a Challenge Ellen Friedman

September 12, 2019
timothyarthur

Apache Drill has some very surprising characteristics and, more importantly, it enables Drill users to do some surprising things. It’s no longer surprising to be able to do standard SQL in a highly distributed and large-scale system – there is an entire class of modern tools that do this, including Apache Hive, Presto or Spark SQL. But Drill has other capabilities that are surprising and make it stand apart from its class. For one thing, Drill provides an extraordinary degree of flexibility in several ways, including:

* Support for a wide variety of file formats, including semi-structured and nested data (such as Parquet, JSON, Avro), and non-file data sources – not just data access but the ability to fully use these data sources with high performance
* Schema discovery – a capability that opens up data exploration in unexpected ways for Drill users and allows progressive data modelling
* Easy extensibility with high performance – you don’t have to trade one for the other

These are valuable if surprising capabilities. Why, then, is that a challenge? Because people don’t expect them, they also may not come looking for a tool that can do these things. The challenge comes in how to make potential users aware of the opportunities that Drill offers. This talk will explore some of Drill’s surprising capabilities, how it’s able to do these things and what impact that has for Drill users. In addition, we will open a discussion about how best to inform and engage a broader user community. This latter issue is not only important for Drill but for other Apache projects as well.
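A hedged illustration of the schema-discovery point: Drill can query a raw JSON file directly, with no table definition, and reach into nested fields with dotted paths (the file and field names here are invented):

```sql
-- No CREATE TABLE needed: Drill discovers the schema at read time.
-- File and field names are hypothetical.
SELECT t.`user`.`name` AS name,
       t.`user`.`address`.`city` AS city
FROM dfs.`/logs/events.json` AS t
WHERE t.`type` = 'signup'
LIMIT 5;
```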

Managing Trillions of Rows with Aplomb (well, actually with Drill) Ted Dunning

September 12, 2019
timothyarthur

Ingesting lots of data isn’t very hard any more. Ingesting it on a critical schedule, within strict time bounds, while minimizing the risk of bogus data showing up is much harder. In practice, grownup data ingestion and access requires the following capabilities:

* Incoming data can be fully ingested into our working dataset but hidden from users until all quality checks are completed
* Individual batches of data can be released atomically
* Any indexing updates should also appear atomically
* Expiring data should disappear atomically, either according to ingest batch or precise time bounds

Apache Drill provides several capabilities that make it much easier to meet these goals. You can handle large volumes of data while allowing in-situ quality controls and while controlling the visibility of unverified data. I will describe a worked example that shows how Drill helps make this happen. (with aplomb)
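One common way to get this kind of atomic visibility control with Drill is a view over only the verified batches; a minimal sketch, assuming a hypothetical directory-per-batch layout (this is an illustration, not the talk's actual worked example):

```sql
-- Hypothetical layout: one directory per ingest batch.
-- Users query the view, never the raw directories.
CREATE OR REPLACE VIEW dfs.tmp.`events` AS
SELECT * FROM dfs.`/data/events/batch_001`
UNION ALL
SELECT * FROM dfs.`/data/events/batch_002`;
-- batch_003 can be fully ingested yet stay invisible until its
-- quality checks pass; re-running CREATE OR REPLACE VIEW to add it
-- then releases the batch to users in a single atomic step.
```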

 

Consistent Cassandra schema changes in Elassandra Vincent Royer

September 12, 2019
timothyarthur

As described in CASSANDRA-10699 (Make schema alterations strongly consistent), concurrent schema changes can still lead to schema disagreement in Cassandra 3.0. In order to properly support Elasticsearch dynamic mapping in Elassandra, we will see how multiple schema changes are validated on a working copy of the Cassandra schema, and applied in an atomic update to all nodes if a lightweight transaction succeeds, thus avoiding concurrent schema change issues. I will also explain how we have taken advantage of Cassandra table extensions to store the Elasticsearch mapping directly into the CQL schema, with several benefits.
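For readers unfamiliar with lightweight transactions: an LWT runs a Paxos round, so when several writers race on the same condition, at most one wins. A minimal CQL sketch of that agreement primitive (the table and columns below are invented for illustration, not Elassandra's actual internals):

```sql
-- Hypothetical coordination table; not Elassandra's real schema.
CREATE TABLE IF NOT EXISTS schema_lock (
    keyspace_name text PRIMARY KEY,
    owner uuid,
    version int
);

-- Of several concurrent proposers, only one IF condition can succeed,
-- so exactly one node gets to apply and publish the schema change.
UPDATE schema_lock
SET owner = 123e4567-e89b-12d3-a456-426614174000, version = 8
WHERE keyspace_name = 'products'
IF version = 7;
```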

Day to day with Cassandra: The weirdest and most complex situations we found! Carlos Rolo

September 12, 2019
timothyarthur

Every Cassandra operator has been hit with a couple of weird/complex cases that don’t fit the normal expected failure situations. It can be a problem in hardware, software, networking, an operator mistake, or a mix of it all. In this talk we will go through a compilation of such cases that we faced: how they appeared, how we debugged them, and how we fixed them. We expect this to be a walk through weird, fun cases, sharing knowledge on the situations and on the fixing of such problems.

Declarative Benchmarking of Cassandra and Its Data Models Monal Daxini

September 12, 2019
timothyarthur

You have made changes to the Cassandra code base. How do you easily benchmark these changes for scalability and correctness, including different data models (schemas)? You have created a Cassandra schema for your service. How do you ensure it is scalable? How can you emulate application-specific CQL queries, with a specified distribution, to validate the scalability of your schema and associated data without having to code your whole application? I am the author of the NDBench CQL Plugin, a tool built at Netflix to address these needs and more, declaratively. One of the Cassandra committers has called this tool ‘game changing’. This plugin is currently (June 2019) being prepped to be open sourced ahead of this talk.

This talk presents:

1. An example of how we used declarative benchmarking to achieve 1 million requests per second for a user-specified data model and query distribution backing a critical service at Netflix.
2. For users:
 a. How to certify scalability of new or existing data models on a new version of Cassandra, for confident upgrades.
 b. How to certify scalability of new complex data models.
3. For committers:
 a. Define various profiles to emulate real-world use cases to build confidence in changes and certify new releases.
 b. Compare scalability of the same data model across different Cassandra versions.
4. The philosophy of the tool, its internal architecture, and future enhancements.


Supporting Cassandra In-House – Our Story!! Laxmikant Upadhyay Anuj Wadhera

September 12, 2019
timothyarthur

In this presentation, senior Cassandra architects at Ericsson will share their extensive experience supporting around 100 Cassandra deployments in production. They will cover the key challenges and best practices in Cassandra operations, maintenance and support. There is plenty to learn as the team talks through the many problems they faced in production and the interesting solutions with which they successfully fixed each one. Audience: all Cassandra users, especially Cassandra operators and administrators.

Using the TLP toolchain as a crystal ball for your cluster Anthony Grasso

September 12, 2019
timothyarthur

Cassandra cluster management is hard. Understanding how your Cassandra data model will hold up in production over time can be tricky. If that is not tough enough, understanding how a change to a Cassandra setting will affect your cluster can be difficult. Knowing how your data model or setting change will perform under a production data load can prevent performance degradation or, worse, nodes going down. In the last year, The Last Pickle has invested the time to develop the tooling necessary to create test clusters in AWS, as well as a scalable stress tool which can run pre-configured workloads. These tools are designed to take the guesswork out of data modelling and configuration changes in Cassandra. In this talk we will introduce our toolchain, look at what each tool does, how they work, and where you would use them. In addition, we will show how the toolchain can be used to quickly test a data model or feature in Cassandra.

Apache Camel K: a cloud-native integration platform Nicola Ferraro Andrea Tarocchi

September 12, 2019
timothyarthur

In this session we are going to introduce the latest innovation from the Apache Camel community: Camel K, a lightweight integration platform, born on Kubernetes, with serverless superpowers. Camel K enables developers who want to integrate systems to write Camel DSL code directly in the cloud, with a great developer experience and really fast turnaround times. You’ll see Camel K in action with a live coding demo that will explore the main features it provides. You’ll also learn how Camel K works under the hood and get a glimpse of the future roadmap.
