PlusOne

Telling the ASF's Stories

From Postgres to an In-Memory Grid with Apache Ignite Vitor Wakim

September 12, 2019
timothyarthur

This talk describes how a food tech company with 50k merchants reached a critical point delivering events to devices by polling a Postgres database, and how it solved the problem by moving the event repository to Apache Ignite's distributed, in-memory SQL. The change also put the company on track to support more than 500k simultaneous merchants.
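The polling pattern described above can be sketched in a few lines. This is an illustrative stand-in only: it uses SQLite in place of Postgres, and the table, columns, and function names are hypothetical, not taken from the talk.

```python
import sqlite3

# Stand-in for the Postgres events table; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "id INTEGER PRIMARY KEY, merchant_id INTEGER, "
    "payload TEXT, delivered INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO events (merchant_id, payload) VALUES (1, 'order_created')")
conn.commit()

def poll_undelivered(merchant_id):
    """The polling pattern: every device repeatedly runs a query like
    this, which scales poorly as the number of merchants grows."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE merchant_id = ? AND delivered = 0",
        (merchant_id,),
    ).fetchall()
    # Mark the fetched events as delivered so they are not re-sent.
    conn.executemany(
        "UPDATE events SET delivered = 1 WHERE id = ?",
        [(r[0],) for r in rows],
    )
    conn.commit()
    return [r[1] for r in rows]

print(poll_undelivered(1))  # ['order_created']
print(poll_undelivered(1))  # [] -- already delivered
```

Every connected device repeating a query like this against a single database is what stops scaling; moving the event repository into a distributed in-memory grid such as Ignite spreads that query load across nodes.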

Building S3 over Ozone: Making a Cloud Native File System Bharat Viswanadham Anu Engineer

September 12, 2019
timothyarthur

The AWS S3 protocol is the de facto interface for modern object stores, and Ozone supports it as a first-class concept. For all practical purposes, a user of S3 can start using Ozone without any change to code or tools. The S3 protocol support offered by Ozone is strongly consistent, so users don’t need to run companion tools like S3Guard when running big data applications like Apache Spark, Apache YARN, or Apache Hive. This talk is a deep dive into how Ozone supports S3 and how easy it is to use cloud tools against Ozone. We will demo S3 over Ozone with different off-the-shelf tools such as Goofys, Kubernetes CSI, and the AWS CLI, and also demonstrate how to use applications like Apache Spark, Apache Hive, and YARN. We will also discuss some of the challenges faced while building an S3-compatible REST server: for example, how we mapped S3 semantics onto Ozone semantics, and how we solved S3 security issues and mapped them into the Ozone world. With S3 support, Ozone has become a generic object store: it can be used with thousands of S3-compatible tools while also providing Hadoop-compatible file system access.
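As a rough illustration of "no change to tools": standard AWS CLI commands can be pointed at Ozone's S3 gateway via an endpoint URL. The bucket name and file below are hypothetical, and the port (9878 is a common gateway default) depends on your deployment.

```
# Create a bucket and copy a file through Ozone's S3 gateway
aws s3api --endpoint-url http://localhost:9878 create-bucket --bucket demo
aws s3 --endpoint-url http://localhost:9878 cp ./data.csv s3://demo/data.csv
```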

Protect your Private Data in your Hadoop Clusters with ORC Column Encryption Owen O’Malley

September 12, 2019
timothyarthur

Fine-grained data protection at the column level in data lake environments has become a mandatory requirement for demonstrating compliance with multiple local and international regulations across many industries today. ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads that provides optimized streaming reads with integrated support for finding required rows quickly. In this talk, we will outline the progress made in the Apache community toward adding fine-grained column-level encryption natively to the ORC format, which will also provide capabilities to mask or redact data on write while protecting sensitive column metadata, such as statistics, to avoid information leakage. The column encryption capabilities will be fully compatible with the Hadoop Key Management Server (KMS) and will use the KMS to manage master keys, providing the additional flexibility to use and manage keys per column centrally. We will also demonstrate an end-to-end scenario showing how this capability can be leveraged.
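As a hedged sketch of what this looks like to an end user, ORC exposes column encryption through table properties. The table, column names, and key name "pii" below are hypothetical; the `orc.encrypt` and `orc.mask` properties follow the ORC documentation, but verify exact syntax against your ORC and Hive versions.

```sql
-- Illustrative Hive DDL (table, columns, and key name are hypothetical)
CREATE TABLE customers (
  name  STRING,
  ssn   STRING,
  email STRING
) STORED AS ORC
TBLPROPERTIES (
  -- encrypt ssn and email with the KMS-managed master key "pii"
  "orc.encrypt" = "pii:ssn,email",
  -- readers without key access see a nullified ssn rather than an error
  "orc.mask" = "nullify:ssn"
);
```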

Schema-Managed Big Data Column Access Control and Use Cases Mohammad Islam Xinli Shang Pavi Subenderan

September 12, 2019
timothyarthur

Motivation: Access control via encryption improves security coverage compared with traditional enforcement in the access path, because encryption can prevent invalid accesses from any angle. Finer-grained access control at the column level is needed because, in a typical big dataset, only a few columns are sensitive and need to be protected, and different columns can have different sensitivities and different sets of eligible readers.

Design: With the encryption features in a columnar file format like Apache Parquet, column access control via encryption becomes possible. But adopting these features in existing analytic frameworks that might use Apache Hive, Apache Spark, Apache Presto, etc. is a challenge, because significant changes would be needed in those query engines to control the encryption.

To avoid massive changes in existing frameworks, the Apache Parquet community designed a schema-controlled column encryption mechanism. A system architect can use the schema of a data table to define the sensitivity of a column; that sensitivity is propagated through the stack and eventually triggers encryption of the column in the Apache Parquet writing layer. This solution is applicable in many analytic frameworks via transparent plug-in invocation, which avoids massive changes in the frameworks and makes Apache Parquet encryption easy to adopt. The mechanism could be extended to support Apache ORC as well.

Use Cases: 1. HDFS ingestion pipelines with Spark and Hudi encrypt sensitive columns automatically with schema-controlled crypto settings. In this use case, we will talk about how to use the schema to control column encryption in Apache Parquet, and how the pipeline can automatically encrypt columns once the schema marks them as sensitive.

2. Column access control scalability and performance analysis for analytic frameworks with Apache Hive and Presto on Apache Parquet. Like Apache Spark, Apache Hive and Presto are popular query engines. We will show our analysis of the scalability and performance of column access control in an analytic framework.

Summary: In this talk, we will present the motivation and design of schema-controlled column encryption at the file-format level, and how it can ease the adoption of encryption features in popular query engines such as Apache Hive, Apache Spark, and Presto. Use cases will show how to use schema-controlled column encryption in analytic pipelines and frameworks.
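For context, Parquet's modular encryption can be driven by properties rather than engine changes. The fragment below is a hedged sketch of the kind of settings involved: the key labels and column names are hypothetical, and exact property names should be checked against your Parquet version.

```
# Illustrative Hadoop/Spark configuration fragment
parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory
parquet.encryption.kms.client.class=<your KMS client implementation>
parquet.encryption.column.keys=keyA:ssn,email;keyB:card_number
parquet.encryption.footer.key=keyF
```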

Apache Hadoop 3.x State of The Union and Upgrade Guidance Anu Engineer Suma Shivaprasad

September 12, 2019
timothyarthur

Apache Hadoop YARN is the modern distributed operating system for big data applications. It morphed the Hadoop compute layer into a common resource-management platform that can host a wide variety of applications. Many organizations leverage YARN to build their applications on top of Hadoop without repeatedly worrying about resource management, isolation, multi-tenancy, and so on. The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.

In this talk, we’ll start with the current status of Apache Hadoop 3.x – how it is used today in deployments large and small. We’ll then move on to the exciting present and future of Hadoop 3.x – features that are further strengthening Hadoop as the primary resource-management platform and storage system for enterprise data centers. We’ll discuss the current status as well as the future promise of features and initiatives for both YARN and HDFS in Hadoop 3.x.

For YARN 3.x: powerful container placement, global scheduling, support for machine learning (Spark) and deep learning (TensorFlow) workloads through GPU and FPGA scheduling and isolation, extreme scale with YARN federation, containerized apps on YARN, native support for long-running services (alongside applications) without any changes, seamless application/service upgrades, powerful scheduling features such as application priorities and intra-queue preemption across applications, and operational enhancements including insights through Timeline Service v2, a new web UI, better queue management, and more.

For HDFS: HDFS 3.0 announced GA for erasure coding, which doubles the storage efficiency of data and thus reduces the cost of storage for enterprise use cases. HDFS added support for multiple standby NameNodes for better availability. For better metadata reliability and easier operations, JournalNodes have been enhanced to sync edit log segments to protect against rolling failures. Disk balancing within a DataNode was another important addition: it ensures disks are evenly utilized within a DataNode, which improves aggregate throughput and prevents lopsided utilization when new disks are added or replaced. The HDFS team is currently driving the Ozone initiative, which lays the foundation of the next generation of storage architecture for HDFS, where data blocks are organized in storage containers for higher scale and better handling of small objects. The Ozone project also includes an object store implementation to support new use cases. Finally, since more and more users are planning to upgrade from 2.x to 3.x to get all the benefits mentioned above, we will also briefly cover upgrade guidance from Hadoop 2.x to 3.x.

Testing Contributions at Scale Allen Wittenauer

September 12, 2019
timothyarthur

Time may be one of the most valuable resources in a project. Automating code reviews to allow for other tasks is a crucial goal for many communities. This talk will cover one way many Apache projects have significantly increased code quality and contribution feedback while simultaneously doing more with less.

Unleashing Apache Kafka and TensorFlow in Hybrid Cloud Architectures Kai Waehner

September 12, 2019
timothyarthur

How can you leverage the flexibility and extreme scale of the public cloud, combined with your Apache Kafka ecosystem, to build scalable, mission-critical machine learning infrastructures that span multiple public clouds or bridge your on-premise data centre to the cloud?

This talk will discuss and demo how you can leverage machine learning technologies such as TensorFlow with your Kafka deployments in the public cloud to build a scalable, mission-critical machine learning infrastructure for data ingestion and processing, model training, deployment, and monitoring.

The discussed architecture includes capabilities such as scalable data preprocessing for training and predictions, combination of different deep learning frameworks, data replication between data centres, intelligent real-time microservices running on Kubernetes, and local deployment of analytic models for offline predictions.
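The consume-preprocess-predict loop at the heart of such an architecture can be sketched as follows. This is a minimal, self-contained illustration: the in-memory "topic", the feature scaling, and the toy scoring function are stand-ins for a real Kafka consumer and a TensorFlow model.

```python
from collections import deque

# Stand-in for a Kafka topic of merchant events; fields are illustrative.
topic = deque([
    {"clicks": 3, "seconds_on_page": 40},
    {"clicks": 0, "seconds_on_page": 2},
])

def preprocess(event):
    # Scale raw features into [0, 1] -- a stand-in for the scalable
    # data preprocessing step the talk describes.
    return [event["clicks"] / 10, event["seconds_on_page"] / 60]

def predict(features):
    # Stand-in for model.predict() on a deployed TensorFlow model:
    # returns a pseudo "engagement score" in [0, 1].
    return sum(features) / len(features)

# Drain the topic and score each event, as a streaming consumer would.
scores = [predict(preprocess(event)) for event in topic]
```

In a real deployment the deque would be a Kafka consumer, the scoring function a TensorFlow SavedModel, and the whole loop a microservice that can run in any cloud or on-premise, since it only depends on reaching the Kafka brokers.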

Jena, Kafka, Mina: Configuration Management Using Apache Components Claude Warren

September 12, 2019
timothyarthur

This talk describes how we utilized Jena (an RDF data store), Mina (an SSH implementation), and Kafka to build a system that monitors and verifies the configurations of devices in a protected network. We discuss how we implemented an application framework that gathers information about devices (servers, IoT, applications, etc.) in secured networks and passes that data to a processing environment, where the configurations are analyzed to determine whether any have become non-compliant. We will touch on the use of RDF over Kafka and RDF datastores to create schema-less data repositories that support reasoning and advanced machine learning strategies.
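The schema-less idea can be sketched with plain triples. In RDF, each configuration fact is a (subject, predicate, object) triple, so new device attributes need no schema migration; the predicates, device names, and compliance rule below are illustrative, not from the talk.

```python
# Device configuration facts as (subject, predicate, object) triples --
# a toy version of what an RDF store like Jena holds.
triples = {
    ("server-01", "hasOpenPort", "22"),
    ("server-01", "hasOpenPort", "8080"),
    ("server-02", "hasOpenPort", "22"),
}

def match(s=None, p=None, o=None):
    """Triple-pattern query: None acts as a wildcard. This is the basic
    operation an RDF store provides (SPARQL generalizes it)."""
    return sorted(
        (ts, tp, to) for (ts, tp, to) in triples
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    )

# Hypothetical compliance rule: no device may expose port 8080.
violations = match(p="hasOpenPort", o="8080")
```

Because facts are just triples, a new predicate (say, a firmware version) can be ingested and queried immediately, which is what makes the repository friendly to reasoning and later machine learning over the accumulated configuration history.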

Peeking Behind the Curtain: The Anatomy of a Real Major Incident George Miranda

September 12, 2019
timothyarthur

Failures are inevitable. But when they occur, our goal should be to resolve them as quickly and efficiently as possible. PagerDuty has developed an open-source Incident Response framework based on the Incident Command System (ICS). That product-independent process has helped many organizations set up incident response processes that resolve technical issues as quickly and effectively as possible. There’s a lot of documentation you can follow, but how do incidents actually play out in real time? In this talk, you get to peek behind the curtain to see what happens inside the walls of the company that’s known for waking you up to tell you there’s a technical incident you need to deal with. We will walk through a cascading failure that began when unexpected errors appeared in one of our Kafka clusters. I’ll share all the gritty details of this actual incident as they occurred, and use them to demonstrate how we apply ICS to resolving technical problems. Attendees will walk away with an understanding of how to effectively manage complex technical incidents across multiple teams, the additional roles needed to support technical responders, how to structure a blameless post-mortem, and how to start developing a similar process in their own organizations. This talk delves into technical concepts due to the nature of the problem, but it is mostly focused on the mechanics of managing any technical incident.

How Netflix manages petabyte scale Apache Cassandra in the cloud Joey Lynch Vinay Chella

September 12, 2019
timothyarthur

At Netflix, we manage petabytes of data in Apache Cassandra which must be reliably accessible to users in mere milliseconds. To achieve this, we have built sophisticated control planes that turn our Cassandra-based persistence layer into a truly self-driving system. We will start with the user interface that Netflix developers use to interact with their Cassandra databases and dive deep into the automation that powers it all. From cluster creation, through scaling up, to cluster death, complex automation drives large fleets of virtual machines hosted on the AWS cloud. First, we will cover the basics of how Netflix deploys Apache Cassandra. In particular, this begins with how we mold Cassandra to the Netflix philosophy of immutable infrastructure, including managing software and hardware upgrades in the face of ever-failing hardware. Then we will explore the concrete techniques needed for such a massive deployment, specifically pull-based control planes and auto-healing strategies. Next, we will cover how Netflix has automated complex but critical Cassandra maintenance tasks such as continuous snapshot backups and always-on anti-entropy repair, which keep our datasets safe and consistent. Both of these systems have gone through multiple architectural evolutions, and we have learned many lessons along the way. Lastly, we will share some of the ways this has gone wrong, and what you can do to avoid them. We will cover a few case studies of major Cassandra outages at Netflix, their root causes, and what we learned from those incidents. At the end of this talk, we hope participants leave with a concrete understanding of the challenges of running Apache Cassandra at massive scale, as well as solid advice and techniques for building their own self-driving data persistence layer.
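A pull-based control plane, one of the techniques mentioned above, boils down to a reconciliation loop: each node periodically pulls desired state and derives its own corrective actions, instead of waiting for a central pusher. The sketch below is illustrative only; the node names, states, and action strings are hypothetical.

```python
# Desired state (what the control plane wants) vs observed state
# (what the fleet actually reports); values are illustrative.
desired = {"node-a": "running", "node-b": "running", "node-c": "running"}
observed = {"node-a": "running", "node-b": "down", "node-c": "running"}

def reconcile(desired, observed):
    """One reconciliation pass: compare desired and observed state and
    emit the corrective actions needed to converge. A real auto-healing
    system would run this continuously on each node."""
    actions = []
    for node, want in desired.items():
        have = observed.get(node)
        if have != want:
            actions.append((node, f"transition {have} -> {want}"))
    return actions

actions = reconcile(desired, observed)
```

The appeal of the pull model at fleet scale is that the central system only has to serve desired state; the per-node agents absorb the work of detecting drift and healing it, so a control-plane outage degrades to "no new changes" rather than "no healing at all".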
