PlusOne

Telling the ASF's Stories

Apache Big Data Seville 2016 – Deep Neural Network Regression at Scale in Spark MLlib – Jeremy Nixon

November 29, 2016
The Apache Software Foundation

Deep Neural Network Regression at Scale in Spark MLlib – Jeremy Nixon

This talk will focus on the engineering and applications of a new algorithm in MLlib. It will cover the methods the algorithm uses to automatically generate features that capture nonlinear structure in data, as well as the process by which it is trained. Major aspects of that are the compositional transformations over the data, the advantages of the various activation functions, the final linear layer, the cost function, and training via backpropagation. The applications section will look at how to use neural network regression to model data in computer vision, finance, and the environment. Details around optimal preprocessing, the type of structure that can be found, and managing the model's ability to generalize will inform developers looking to apply nonlinear modeling tools to the problems they face.
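As a minimal sketch of the model family described above — nonlinear hidden transformations feeding a final linear layer, trained by backpropagation on a squared-error cost — here is a single-hidden-layer regression network in plain Python. The architecture, activation choice, and hyperparameters are illustrative only, not taken from the MLlib implementation:

```python
import math, random

random.seed(0)

# One-hidden-layer regression network: y_hat = w2 . tanh(W1*x + b1) + b2
H = 8  # hidden units
W1 = [random.uniform(-1, 1) for _ in range(H)]
b1 = [0.0] * H
w2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

def forward(x):
    h = [math.tanh(W1[j] * x + b1[j]) for j in range(H)]  # nonlinear feature layer
    return sum(w2[j] * h[j] for j in range(H)) + b2, h    # final linear layer

# Training data for a nonlinear target, y = x^2
data = [(x / 10.0, (x / 10.0) ** 2) for x in range(-10, 11)]

def mse():
    return sum((forward(x)[0] - y) ** 2 for x, y in data) / len(data)

lr = 0.05
loss_before = mse()
for _ in range(500):
    for x, y in data:
        y_hat, h = forward(x)
        err = y_hat - y  # error signal (squared-error gradient, up to a constant)
        for j in range(H):
            grad_h = err * w2[j] * (1 - h[j] ** 2)  # backprop through tanh
            W1[j] -= lr * grad_h * x
            b1[j] -= lr * grad_h
            w2[j] -= lr * err * h[j]
        b2 -= lr * err
loss_after = mse()
```

The tanh layer plays the role of the automatically generated nonlinear features; the output layer stays linear so the network produces an unbounded regression value.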

More information about this talk

Apache Big Data Seville 2016 – Open Source Operations: Building on Apache Spark with InsightEdge, TensorFlow, Apache Zeppelin, and/or Apache – Samuel Cozannet

November 29, 2016
The Apache Software Foundation

Open Source Operations: Building on Apache Spark with InsightEdge, TensorFlow, Apache Zeppelin, and/or Apache – Samuel Cozannet

As software becomes more free and open, it is also becoming more complex and expensive to operate. How can we as an Open Source community distill best practices and recommended operations for modeling complex interconnected services, so that users can focus on their ideas? How can we as developers deliver recommended best practices in our applications, and when they are connected to other applications, so that users are free to contribute to and use the project on their choice of substrate (laptop, cloud, or bare metal [x86, ARM, ppc64el, s390x])?

In this talk we explore how Juju can provide an Open Source method to model a multi-node Apache Spark cluster across a diverse set of substrates, and start adding other services to build additional solutions. This talk will include a demo, and users should be able to take all software shown to try for themselves in a free and Open Source manner.
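For flavor, a Juju model of the kind described is typically expressed as a deployment bundle. The bundle below is a hypothetical sketch — the charm names, series, and unit counts are placeholders, not the exact bundle demoed in the talk:

```yaml
# Hypothetical Juju bundle: a three-node Spark cluster with a Zeppelin
# front end, deployable with `juju deploy ./bundle.yaml` on any
# supported substrate (local, cloud, or bare metal).
series: xenial
services:
  spark:
    charm: cs:xenial/apache-spark
    num_units: 3
  zeppelin:
    charm: cs:xenial/apache-zeppelin
    num_units: 1
relations:
  - ["spark", "zeppelin"]
```

Because the bundle names only services and relations, the same file can target a laptop, a public cloud, or bare metal without modification.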

More information about this talk

Apache Big Data Seville 2016 – Apache S2Graph (incubating) as a User Event Hub – Hyunsung Jo, Daewon Jeong & Hwansung Yu

November 29, 2016
The Apache Software Foundation

Apache S2Graph (incubating) as a User Event Hub – Hyunsung Jo, Daewon Jeong & Hwansung Yu

S2Graph is a graph database designed to handle transactional graph processing at scale.

Its API allows you to store, manage and query relational information using edge and vertex representations in a fully asynchronous and non-blocking manner.

However, at Kakao Corp., where the project was originally started, we believe that it could be so much more.

There have been efforts to utilize S2Graph as the centerpiece of Kakao's event delivery system, taking advantage of its strengths such as

– the flexibility of its seamless bulk loading, A/B testing, and stored procedure features,

– multitenancy that allows interoperability among different services within the company,

– and most of all, the ability to run operations ranging from basic CRUD to multi-step graph traversal queries in real time at large volumes.
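To make those operation classes concrete, here is a toy in-memory graph in Python with edge CRUD and a multi-step traversal. This only illustrates the query shapes the abstract describes; it is not the S2Graph API:

```python
from collections import defaultdict

class TinyGraph:
    """Illustrative stand-in for a graph store: edge CRUD + traversal."""

    def __init__(self):
        self.out_edges = defaultdict(dict)  # src -> {dst: label}

    def insert_edge(self, src, dst, label):   # Create
        self.out_edges[src][dst] = label

    def get_edges(self, src):                 # Read
        return dict(self.out_edges[src])

    def delete_edge(self, src, dst):          # Delete
        self.out_edges[src].pop(dst, None)

    def traverse(self, src, steps):           # multi-step graph traversal
        frontier = {src}
        for _ in range(steps):
            frontier = {dst for v in frontier for dst in self.out_edges[v]}
        return frontier

g = TinyGraph()
g.insert_edge("alice", "bob", "friend")
g.insert_edge("bob", "carol", "friend")
g.insert_edge("bob", "dave", "friend")
friends_of_friends = g.traverse("alice", 2)  # two-hop traversal from alice
```

In a real deployment the traversal would run against the distributed backend rather than a local dict, but the query shape — expand a frontier one labeled hop at a time — is the same.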

More information about this talk

Apache Big Data Seville 2016 – Graph Processing with Apache Tinkerpop on Apache S2Graph – Doyung Yoon

November 29, 2016
The Apache Software Foundation

Graph Processing with Apache Tinkerpop on Apache S2Graph – Doyung Yoon

Since the last conference, the Apache S2Graph community has been working on integration with Apache TinkerPop. TinkerPop users are now able to use S2Graph as a graph database without changing their TinkerPop code, and can also execute OLAP graph queries over their data in HDFS. We will share our experience integrating TinkerPop as a graph database API, and comment on our current limitations and future plans. We will also present benchmark results comparing S2Graph with existing graph databases such as Neo4j, Titan, and OrientDB. We focus our benchmarks on “neighbors of neighbors” queries and basic CRUD operations. Like Titan, S2Graph supports multiple storage backends, such as HBase, Cassandra, MySQL, PostgreSQL, and RocksDB, and S2Graph's performance on each backend will be presented.

More information about this talk

Apache Big Data Seville 2016 – How Big Data/IoT Leverage the Power of OpenSource to Solve Healthcare Use Cases – Manidipa Mitra

November 29, 2016
The Apache Software Foundation

How Big Data/IoT Leverage the Power of OpenSource to Solve Healthcare Use Cases – Manidipa Mitra

This session will discuss how a digital healthcare management platform can be built, using open source technologies such as Kafka, Spark Streaming, HBase, Hive, PySpark, and Mirth, to collect patient data, clinical data (HL7), claims data, and real-time wearables data, and to create a 360-degree view of, and insights into, a patient's health risks and conditions. It will also cover how to build a generic platform for finding a Key Opinion Leader in social media discussions of a particular disease: scraping blogs, message boards, and articles with the open source tool Scrapy; ingesting Facebook and Twitter data; and storing, analyzing, and indexing that data with Spark, HBase, Hive, Python, and Solr to derive social sentiment, create word clouds, and segment messages. The platform can then provide insights, sentiment, and search capabilities over the medicines used for a particular disease or treatment, to gather feedback on medicines or for research purposes.
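As a toy illustration of the 360-degree patient view described above — merging events from several feeds and deriving a risk flag — here is a minimal Python sketch. The field names, sources, and risk rule are invented for illustration and are not from the talk:

```python
from collections import defaultdict

# Events as they might arrive from the different feeds (hypothetical schema).
events = [
    {"patient": "p1", "source": "hl7",      "heart_rate": 88},
    {"patient": "p1", "source": "wearable", "heart_rate": 132},
    {"patient": "p1", "source": "claims",   "diagnosis": "hypertension"},
    {"patient": "p2", "source": "wearable", "heart_rate": 71},
]

def build_360_view(events):
    """Fold per-patient events from all feeds into one summary record."""
    view = defaultdict(lambda: {"sources": set(), "max_heart_rate": 0,
                                "diagnoses": set()})
    for e in events:
        v = view[e["patient"]]
        v["sources"].add(e["source"])
        if "heart_rate" in e:
            v["max_heart_rate"] = max(v["max_heart_rate"], e["heart_rate"])
        if "diagnosis" in e:
            v["diagnoses"].add(e["diagnosis"])
    # Naive illustrative risk rule: elevated heart rate or any diagnosis.
    for v in view.values():
        v["at_risk"] = v["max_heart_rate"] > 120 or bool(v["diagnoses"])
    return dict(view)

views = build_360_view(events)
```

In the platform described, the fold would run continuously over Kafka/Spark Streaming feeds and land in HBase/Hive, but the shape of the merge is the same.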

More information about this talk

Apache Big Data Seville 2016 – Fighting Identity Theft: Big Data Analytics to the Rescue – Seshika Fernando

November 29, 2016
The Apache Software Foundation

Fighting Identity Theft: Big Data Analytics to the Rescue – Seshika Fernando

Identity Theft is no longer just a consumer's problem. Attackers are now targeting Enterprises for bigger financial gains and greater damage, not just to the organization's infrastructure but, more importantly, to its corporate image.

While Enterprise Identity Theft Analytics Tools do exist, most organizations find it economically prohibitive to invest in expensive proprietary software. In this session, Seshika will show how a comprehensive Identity Theft Analytics Solution can be built using Open Source Technologies. She will demonstrate how Big Data Analytics can be used to safeguard any Enterprise by covering the 4 A's of Identity Analytics.

More information about this talk

Apache Big Data Seville 2016 – Uber – Your Realtime Data Pipeline is Arriving Now! – Ankur Bansal

November 29, 2016
The Apache Software Foundation

Uber – Your Realtime Data Pipeline is Arriving Now! – Ankur Bansal

Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder.

Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.

More information about this talk

Apache Big Data Seville 2016 – ETL Pipelines with OODT, Solr and Stuff – Tom Barber

November 28, 2016
The Apache Software Foundation

ETL Pipelines with OODT, Solr and Stuff – Tom Barber

Discover a number of Apache projects you may not have heard of, and how they can help you process both clinical and non-clinical data. Apache OODT, developed at NASA, allows users to ingest and store files and metadata along with processing workflows. OODT combined with Apache cTAKES lets us extract clinical information from files, process it, and give end users access to the extracted data.

We can then take these sources and manipulate them further, creating a highly flexible ETL pipeline that offers reliability and scalability. Backed by Apache Solr, users can then interrogate the data via web interfaces and instigate further post-processing and investigation.
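A toy rendering of that ingest-extract-index-query flow in Python, with an in-memory list standing in for Solr and naive term extraction standing in for the OODT/cTAKES stages (documents, terms, and the query function are all invented for illustration):

```python
# Toy ETL pipeline: ingest documents, extract metadata, index, query.
documents = [
    {"id": "note-1", "text": "patient reports chest pain and nausea"},
    {"id": "note-2", "text": "routine follow up, no pain reported"},
]

def extract_metadata(doc):
    """Stand-in for clinical information extraction (cTAKES's role)."""
    return {"id": doc["id"], "terms": set(doc["text"].split())}

# "Ingest" step: run extraction over every incoming file (OODT's role).
index = [extract_metadata(d) for d in documents]

def search(term):
    """Stand-in for a Solr query over the extracted metadata."""
    return [m["id"] for m in index if term in m["terms"]]

matches = search("pain")
```

Each stage is independent of the others, which is what makes the real pipeline easy to repurpose for non-clinical data: swap the extraction step and the rest is unchanged.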

Of course, you may not have a clinical use case, but the platforms can be repurposed, allowing you to go away and build your own scalable data pipeline for processing and ingestion.

More information about this talk

Apache Big Data Seville 2016 – Large Scale SolrCloud Cluster Management via APIs – Anshum Gupta

November 28, 2016
The Apache Software Foundation

Large Scale SolrCloud Cluster Management via APIs – Anshum Gupta

Apache Solr is widely used by organizations to power their search platforms, often supporting multiple users. A number of cluster management APIs have been introduced over the last few releases, allowing users to manage operations ranging from replica placement to forcing leader elections via API calls. By the end of this talk, intermediate Solr users will understand what's available and when they can avoid direct interference with the system, leading to more stable clusters and lower chances of nodes going down. Attendees will also be much better equipped to build their own SolrCloud cluster management tools. The talk will also cover when not to use these APIs, and what's planned in the near future to handle specific operational use cases.
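The cluster management APIs referred to here are plain HTTP calls to the SolrCloud Collections API. A minimal Python helper to build such requests might look like this — the host, collection, and parameter values are placeholders, while CREATE, ADDREPLICA, and FORCELEADER are real Collections API actions:

```python
from urllib.parse import urlencode

# Placeholder Solr base URL; point this at your own cluster.
SOLR = "http://localhost:8983/solr"

def collections_api_url(action, **params):
    """Build a Collections API request URL for the given action."""
    query = urlencode({"action": action, **params})
    return f"{SOLR}/admin/collections?{query}"

# Create a 2-shard collection with 2 replicas per shard.
create_url = collections_api_url("CREATE", name="logs",
                                 numShards=2, replicationFactor=2)

# Add a replica to a specific shard (e.g. after replacing a node).
add_replica_url = collections_api_url("ADDREPLICA",
                                      collection="logs", shard="shard1")

# Force a leader election for a shard stuck without a leader.
force_leader_url = collections_api_url("FORCELEADER",
                                       collection="logs", shard="shard1")
```

Issuing these URLs with any HTTP client is all a homegrown cluster management tool needs to do; the point of the talk is knowing which action to reach for, and when to let Solr handle things itself.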

More information about this talk

Apache Big Data Seville 2016 – Fast & Scalable Email System with Apache Solr – Strategies, Tradeoffs and Optimizations – Arnon Yogev

November 28, 2016
The Apache Software Foundation

Fast & Scalable Email System with Apache Solr – Strategies, Tradeoffs and Optimizations – Arnon Yogev

Email interaction has its own unique characteristics and differs from traditional web search (for example, users search their own private mailboxes and are often more interested in recent emails than in the archive).

Taking advantage of these characteristics, we were able to optimize our infrastructure in terms of indexing strategy and query optimization and achieve a significant gain in scalability and performance.

Arnon will present the various tradeoffs that were explored, including multi-tiered indexes, sorted indexes, query optimizations and more.

Arnon will then present benchmark results that stress the importance of correctly designing a Solr infrastructure and tailoring it to one's specific use case.
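A toy sketch of the recency-tiered idea: keep a small "recent" index that answers most mailbox searches, and fall through to the larger archive tier only when needed. The data, routing rule, and tier sizes are invented for illustration, not the system's actual design:

```python
# Two index tiers, each a list of messages sorted data; in a real system
# these would be separate Solr indexes/shards.
recent_tier = [
    {"id": 3, "ts": 300, "subject": "lunch tomorrow"},
    {"id": 4, "ts": 400, "subject": "quarterly report draft"},
]
archive_tier = [
    {"id": 1, "ts": 100, "subject": "old quarterly report"},
    {"id": 2, "ts": 200, "subject": "team offsite photos"},
]

def search(term, limit=2):
    """Query the recent tier first; hit the archive only if short of results."""
    hits = [m for m in recent_tier if term in m["subject"]]
    if len(hits) < limit:  # fall through to the (more expensive) archive tier
        hits += [m for m in archive_tier if term in m["subject"]]
    return sorted(hits, key=lambda m: m["ts"], reverse=True)[:limit]

results = search("quarterly")
```

Since most mailbox queries target recent mail, most searches never touch the big archive index at all, which is where the scalability gain comes from.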

More information about this talk
