
Data Engineering Outline

The ADOS project monitored 30,000 sensors on a nuclear reactor, measuring temperatures, pressures and mass flows at discrete points throughout the cores and associated equipment (boilers, heat exchangers, condensers, etc.).

Challenges of Big Data Systems

When dealing with huge volumes of data derived from multiple independent sources, it is a significant undertaking to connect, link, match, clean and transform data across systems. It is nevertheless necessary to connect and correlate relationships, hierarchies and multiple data linkages, or your data can quickly spiral out of control. Data governance can help you determine how disparate data relates to common definitions and how to systematically integrate structured and unstructured data assets to produce high-quality information that is useful, appropriate and up-to-date.
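The connect-match-clean work described above usually starts with key normalization, so that the same entity can be linked across sources that spell it differently. A minimal sketch in Python; all record names, fields and values here are hypothetical illustrations, not a real pipeline:

```python
import re

def normalize_key(name: str) -> str:
    """Lower-case, strip punctuation and collapse whitespace to build a match key."""
    cleaned = re.sub(r"[^\w\s]", "", name).casefold()
    return " ".join(cleaned.split())

# Two independent sources describing the same customers with different spellings.
crm_records = [
    {"name": "ACME Ltd.", "region": "UK"},
    {"name": "Globex  Corp", "region": "US"},
]
billing_records = [
    {"name": "acme ltd", "balance": 1200},
    {"name": "Globex Corp", "balance": 340},
]

# Index one source by its normalized key, then join the other against it.
# The CRM record wins on shared fields, so its canonical name is kept.
by_key = {normalize_key(r["name"]): r for r in crm_records}
linked = [
    {**b, **by_key[key]}
    for b in billing_records
    if (key := normalize_key(b["name"])) in by_key
]
```

Real record linkage adds fuzzy matching, survivorship rules and audit trails on top, but the normalize-then-join shape is the same.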

Big data technologies not only provide the infrastructure to collect large amounts of data, they provide the analytics to understand it and take advantage of its value. The goal of all organizations with access to large data collections should be to harness the most relevant data and use it for optimized decision making.
Some examples:

  • Send tailored recommendations to mobile devices at just the right time, while customers are in the right location to take advantage of offers.
  • Recalculate and revalue entire risk portfolios, with supplementary analysis suggesting strategies to reduce risk and mitigate its impact.
  • Mine customer data for insights that drive new strategies for customer acquisition, retention, campaign optimization and next best offers.
  • Generate special offers at the point of sale based on the customer’s current and past purchases, ensuring a higher customer retention rate.
  • Analyze data from social media to detect new market trends, changing customer perceptions and predict changes in demand.
  • Use pattern matching, fuzzy logic and deep layer data mining of the Internet click-stream to detect fraudulent behavior.
  • Identify and log root causes of failures, issues and defects by investigating user sessions, network logs and machine sensors.
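The pattern-matching and fuzzy-comparison use case above can be sketched with Python's standard-library difflib. The session strings, fraud pattern and 0.8 threshold are purely illustrative assumptions, not a real detection rule:

```python
from difflib import SequenceMatcher

# Hypothetical click-stream event sequences, flattened to strings.
known_fraud_pattern = "login fail login fail login success transfer max"
sessions = {
    "user_a": "login success browse browse logout",
    "user_b": "login fail login fail login success transfer max amount",
}

def similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] based on longest matching subsequences."""
    return SequenceMatcher(None, a, b).ratio()

# Flag sessions that are close to the known pattern even when not identical.
flagged = [u for u, s in sessions.items() if similarity(s, known_fraud_pattern) > 0.8]
```

Production systems would use dedicated stream-processing and scoring engines, but the flag-on-near-match idea is the same.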

Apache Spark speeds up big data processing by a factor of 10 to 100

Apache Spark speeds up big data processing by a factor of 10 to 100 and simplifies app development to such a degree that developers call it a “game changer.”

Apache Spark has been called a game changer and perhaps the most significant open source project of the next decade, and it’s been taking the big data world by storm since it was open sourced in 2010. Apache Spark is an open source data processing engine built for speed, ease of use and sophisticated analytics. Spark is designed to perform both batch processing and new workloads like streaming, interactive queries, and machine learning. “Spark is undoubtedly a force to be reckoned with in the big data ecosystem,” said Beth Smith, general manager of the Analytics Platform for IBM Analytics. IBM has invested heavily in Spark. Meanwhile, in a talk at the Spark Summit East 2015, Matthew Glickman, a managing director at Goldman Sachs, said he realized Spark was something special when he attended last year’s Strata + Hadoop World conference in New York.

He said he went back to Goldman and “posted on our social media that I’d seen the future and it was Apache Spark. What did I see that was so game-changing? It was sort of to the same extent [as] when you first held an iPhone or when you first see a Tesla. It was completely game-changing.”

Matei Zaharia, co-founder and CTO of Databricks and the creator of Spark, told eWEEK that Spark started out in 2009 as a research project at the University of California Berkeley, where he was working with early users of MapReduce and Hadoop, including Facebook and Yahoo. He said he found some common problems among those users, chief among them being that they all wanted to run more complex algorithms that couldn’t be done with just one MapReduce step. “MapReduce is a simple way to scan through data and aggregate information in parallel and not every algorithm can be done with it,” Zaharia said. “So we wanted to create a more general programming model for people to write cluster applications that would be fast and efficient at these more complex types of algorithms.” Zaharia noted that the researchers he worked with also said MapReduce was not only slow for what they wanted to do, but they also found the process for writing applications “clumsy.” So he set out to deliver something better.
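Zaharia's description of MapReduce (scan through data, emit key/value pairs, aggregate in parallel) can be sketched as a toy single-machine word count. This illustrates the programming model only, not Hadoop's actual API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line: str):
    # Map: emit a (key, value) pair per word.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key (done by the framework in real Hadoop).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently.
    return {key: sum(values) for key, values in groups.items()}

lines = ["Spark speeds up big data", "big data needs big tools"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(l) for l in lines)))
```

Anything that fits this one-pass scan-and-aggregate shape parallelizes well; iterative algorithms need repeated jobs, which is exactly the limitation Zaharia describes.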


Read more at eWEEK: http://www.eweek.com/enterprise-apps/how-apache-spark-is-transforming-big-data-processing-development.html


Microsoft has unveiled its integrated big data suite

Microsoft announced the Cortana Analytics Suite. It takes the company’s machine learning, big data and analytics products and packages them together in one huge, monolithic suite.

Microsoft has put together the suite with the hope of providing a one-stop, big data and analytics solution for enterprise customers.

“Our goal was to bring integration of these pieces so customers have a comprehensive platform to build intelligent solutions,” Joseph Sirosh, the Microsoft corporate vice president in charge of Azure ML, told TechCrunch.

As for Cortana, which is the Microsoft voice-driven personal assistant tool in Windows 10, it’s a small part of the solution, but Sirosh says Microsoft named the suite after it because it symbolizes the contextualized intelligence that the company hopes to deliver across the entire suite.

It includes pieces like Azure ML, the company’s cloud machine learning product, PowerBI, its data visualization tool, and Azure Data Catalog, a service announced just last week designed for sharing and surfacing data stores inside a company, among others. It aims to take advantage of a range of technologies such as face and speech recognition to generate solutions like recommendation engines and churn forecasting.


It’s All About Integration

Microsoft expects that by providing an integrated solution, third parties and systems integrators will build packaged solutions based on the suite, and that customers will be attracted by a product with pieces designed to play nicely together. It is building in integration, thereby reducing the complexity of making these types of tools work together — at least that’s the theory.

“Where the suite provides value is the great interoperability, finished solutions, recipes and cookbooks,” Sirosh explained.

As an example, Microsoft talked about a coordinated medical care project at Dartmouth-Hitchcock Medical Center. The program, called ImagineCare, is built on top of the Cortana Analytics Suite and the Microsoft Dynamics CRM tool.

Tendron Systems technical director Alan Brown said that time will tell whether customers adopt the product; it may be late to the party, but it has a strong specification.

Read more at: http://techcrunch.com/2015/07/13/microsoft-unifies-big-data-and-analytics-in-newly-launched-suite/

Next post is on Apache Spark, which is making significant inroads into the big data arena.

James Goode, Tendron Systems

Tendron Systems
Tendron Systems Ltd, Regent Street, London, W1B

Why Cloudera is saying ‘Goodbye, MapReduce’ and ‘Hello, Spark’

Cloudera, a company that helped popularize Hadoop as a platform for analyzing huge amounts of data when it was founded in 2008, is overhauling its core technology. The One Platform Initiative the company announced Wednesday lays out Cloudera’s plan to officially replace MapReduce with Apache Spark as the default processing engine for Hadoop.

datacenter-blinking-lights-lg

Cloudera chief technologist Eli Collins said the company is “at best” halfway through the process from a technology standpoint and should be done in about a year. When complete, Spark should have similar levels of security, manageability, and scalability as MapReduce, and should be equally integrated with the rest of the technologies that comprise the ever-expanding Hadoop platform.

Collins said Spark’s existing weaknesses are “OK for early adopters, but really not acceptable to our customer base” as a whole. Cloudera says it has more than 100 customers running Spark in production—including Equifax, Experian, and CSC—but realizes that broader adoption and an improved Spark experience are a chicken-or-egg type of problem.

The history of the move to Spark is in some ways as old as Hadoop itself. Google created MapReduce in the early 2000s as a faster, easier implementation of existing parallel processing approaches, and the creators of Hadoop developed an open source version of Google’s work. However, while MapReduce proved revolutionary for early big data workloads (nearly every major web company is a heavy Hadoop user), its limitations became clearer as Hadoop and big data became mainstream technology movements.

Large enterprises, technology startups and other potential Hadoop users saw the potential in storing lots of data using the Hadoop file system and in analyzing that data, but they wanted something faster and more flexible than MapReduce. It was designed for indexing the web at places like Google and Yahoo, a batch-processing job where latency was measured in hours rather than milliseconds. MapReduce is also notoriously difficult to program, a problem that helped exacerbate the “big data skills gap” that analyst firms and consultants have pointed to for years.

When Spark was created a few years ago at the University of California, Berkeley, it was the solution Hadoop vendors, Hadoop users, and venture capitalists alike needed to resolve their MapReduce woes. Spark is significantly faster and easier to program than MapReduce, meaning it can handle a much broader array of jobs. In fact, the project includes libraries for real-time data analysis, interactive SQL analysis, and machine learning, in addition to its core MapReduce-style engine.
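The contrast with the multi-phase MapReduce style is easiest to see in code. Below is a toy, in-memory stand-in for Spark's RDD chaining API, written in plain Python so it runs without a cluster; real Spark distributes these transformations across machines and evaluates them lazily:

```python
import functools
from collections import defaultdict

class ToyRDD:
    """Illustrative local mimic of Spark's RDD transformation chaining."""

    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        return ToyRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return ToyRDD(f(x) for x in self.data)

    def reduceByKey(self, f):
        groups = defaultdict(list)
        for key, value in self.data:
            groups[key].append(value)
        return ToyRDD((k, functools.reduce(f, vs)) for k, vs in groups.items())

    def collect(self):
        return self.data

# Word count as one fluent chain rather than separate map/shuffle/reduce jobs.
lines = ToyRDD(["spark is fast", "spark is general"])
counts = dict(
    lines.flatMap(str.split)
         .map(lambda w: (w, 1))
         .reduceByKey(lambda a, b: a + b)
         .collect()
)
```

A multi-step algorithm that would need several chained MapReduce jobs collapses into one expression, which is much of what "easier to program" means in practice.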

When Tendron Systems evaluated Apache Spark, they were impressed with the performance improvements achieved, especially for parallel scientific algorithms in numerical analytics. Together with the Spark Streaming module, this makes an impressive addition to the Apache big data stack.

And better yet, Spark is designed to integrate with Hadoop’s native file system. This means Hadoop users don’t have to move their terabytes or even petabytes of data elsewhere in order to take advantage of Spark. By 2013, major VC firms had begun putting millions of dollars into Databricks, a startup founded by the creators of Spark, and the major Hadoop vendors Cloudera, MapR and Hortonworks were beginning to integrate Spark into their Hadoop distributions.

Read more:

http://fortune.com/2015/09/09/cloudera-spark-mapreduce/

Alan

Tendron Systems Ltd

Facebook – Big Data London Group

Apache Spark – Executive Summary

Using World Bank Data in R with Shiny Dashboards

Introduction to Spark SQL


Installing Apache Spark on Ubuntu


Google search: apache spark 1.4 install ubuntu 14.04 linux

http://www.philchen.com/2015/02/16/how-to-install-apache-spark-and-cassandra-stack-on-ubuntu

http://stackoverflow.com/questions/30814484/spark-1-4-with-zeppelin-installation

http://blog.prabeeshk.com/blog/2014/10/31/install-apache-spark-on-ubuntu-14-dot-04/

https://spark.apache.org/docs/latest/

http://www.ibm.com/developerworks/library/os-spark/os-spark-pdf.pdf

Worked first time, but it took 1.6 seconds to compute pi, and the result was not exactly accurate either.
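For context, the bundled SparkPi demo estimates pi by Monte Carlo sampling, which accounts for both the run time and the imprecision. A plain-Python version of the same idea; the sample count and seed here are arbitrary choices:

```python
import random

def estimate_pi(samples: int, seed: int = 42) -> float:
    """Monte Carlo pi: throw random points at the unit square and count
    how many land inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = sum(
        1
        for _ in range(samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / samples

pi_est = estimate_pi(100_000)  # close to 3.14, but only to a couple of decimal places
```

The error shrinks only with the square root of the sample count, so a quick demo run will always be a little off; Spark's version simply spreads the sampling across the cluster.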

IBM Backs Apache Spark For Big Data Applications

Technology giant IBM has thrown its full weight behind Spark, Apache’s open-source cluster computing framework.

Spark will form the basis of all of Big Blue’s analytics and commerce platforms and its Watson Health Cloud. The framework will also be sold as a service on its Bluemix cloud.

IBM will commit more than 3,500 of its researchers and developers to Spark-related projects and has promised a Spark Technology Center in San Francisco, California, where data scientists and developers can work with IBM designers and architects.

Spark began life as a research project at UC Berkeley in California, quickly delivering in-memory performance as much as 100 times that of the MapReduce framework that originally underpinned Apache Hadoop. Hadoop has moved on since then, adopting other, faster and more flexible ways of working. Spark has also progressed, promoting increasingly capable disk-based performance to complement its in-memory strengths, and establishing itself as a strong contender particularly for machine learning tasks.

Spark moved to the Apache Software Foundation in 2013, becoming a top-level project in 2014. That same year, members of the original Berkeley team established the company now known as Databricks to build a business around Apache Spark. The company launched with almost $14 million from Andreessen Horowitz and others, and secured a further $33 million a year ago.

Nevertheless, Spark is not without competitors of its own. Flink, also a top-level project of the Apache Software Foundation, has just begun to attract many of the same admiring comments directed Spark’s way 12 to 18 months ago. Despite sound technical credentials, ongoing development, big investments and today’s high-profile endorsement from IBM, it would be premature to crown Spark the winner just yet.

Written in Scala, with APIs for Java, Scala and Python, Spark is an in-memory system for processing large data sets. It consists of a scheduling and dispatching layer, an SQL-style query interface, a machine-learning library and a distributed graph-processing framework.

Several key technology companies are likely to invest in their Spark infrastructure as a direct result of IBM’s initiative, including Databricks, Tendron Systems and major consultancies.

Spark can scale to more than 8,000 production nodes and, while it works with Hadoop and MapReduce, is claimed to also be substantially faster on certain workloads.

Read more: http://www.forbes.com/sites/paulmiller/2015/06/15/ibm-backs-apache-spark-for-big-data-analytics/