Big Data, BI, and SQL Server at SQLSaturday Silicon Valley

Are you a business analytics or SQL Server professional in the Bay Area? If you’re trying to keep up with cutting-edge developments in the data world, or just hone your skills, join us at SQLSaturday Silicon Valley on March 28, 2015.

It’s a free, all-day event at the Microsoft Technology Center in Mountain View featuring 30+ in-depth sessions with top SQL Server experts.  You’ll learn valuable skills and best practices for analyzing, visualizing, and reporting on data, as well as unlocking the power and promise of Big Data.

Top experts will cover Power BI, DBA, data visualization, DocumentDB, Excel, new developments with Flash Storage, and Power Pivot. They’ll share how to get the most from your business data and make better data-driven decisions.

Here are just a few of the sessions you can attend:

  • Intro to Time Series Forecasting – Peter Myers
  • A Practical Guide To Using Charts & Graphs – Dan Bulos
  • Storage For the DBA – Denny Cherry
  • Automating Power BI Creations – Angel Abundez
  • Becoming a Top DBA – Automation in SQL Server – Joseph D’Antoni
  • Roles on a Big Data Team – Andrew Eichenbaum
  • Intro to Azure DocumentDB – Ike Ellis
  • Automating Your Database Deployments – Grant Fritchey
  • Flash & SQL Server – Re-Thinking Best Practices – Jimmy May
  • Common SQL Server Mistakes and How to Avoid Them – Tim Radney

You can see all the sessions here.  Come join us on the 28th – and tell your SQL Server and BI colleagues about this incredible day of learning and networking.

Big Data: How To Do It Right

If you’re beginning your first foray into analyzing your organization’s Big Data, you need to spend some time thinking about the big picture. The payoffs of effectively analyzing your data can be enormous, but you need to plan carefully in order to achieve the optimal outcome.

Here’s a checklist of questions to consider:

  • What data do you have, and what can you obtain?
  • What problems do you have that might be solvable, given sufficient understanding of your data in a perfect-world scenario (without yet trying to determine what is technically possible with today’s tools)?
  • Do you have the resources and executive buy-in to pursue high ROI opportunities uncovered by your initiative?
  • Have you evaluated the various Hadoop vendors such as Cloudera, MapR, Qubole, or Hortonworks, to see what each has that sets them apart from their competitors?
  • Do you have the internal resources to oversee your big data project and continue to reap the benefits going forward? Read our whitepaper on Building Your Big Data Team to see what kind of manpower you’ll need.

Big Data is a popular buzzword these days, but it really does matter: companies that know how to extract valuable information from their data have a huge competitive advantage. However, without proper planning, a Big Data project can become a money pit. You should definitely do your homework first!

Hiring a Data Scientist: Interviewing Basics

With a set of candidate resumes in hand, you now have the task of the interview…

Let me start with a personal view that the hiring system is completely screwed up. A candidate is judged fit or unfit for a position with only about one day of interaction. These interactions are split up amongst a handful of people, so each person has about an hour to say if they want to spend more waking time with this person than they do with their own family.

First, you must agree on the interviewing basics.  Here’s the basic advice for all tech hiring that I gave my team here at DesignMind in San Francisco:

1. You should have a mid-size group of people interviewing the candidate. Five to eight is a good range of formal interviewers. There can be more if the candidate goes out to lunch with a group, or if you do some pair interviewing. But after 6-8 interview sessions, almost any candidate will burn out.

2. A range of people needs to interview the candidate. Having people who do similar work is definitely necessary, but people outside of the core group should also be on the interviewing team. Knowing if a candidate can talk to people of different backgrounds is a requirement if this person will be on cross-functional teams. Also, knowing how a candidate will interact with a perceived “subordinate” is great insight into how that person works inside of an organization.

3. Interview feedback should happen in an interview team pow-wow within a day of the interview. The first round of responses should be “yes”, “no”, or “maybe”, where:

  •   No means no
  •   Maybe means no
  •   Yes means maybe

There can be mitigating circumstances where a maybe can be turned into a yes. But if not, you need to pass on the candidate.

4. If the hiring team gives the candidate a yes, it is time to check references. If there are any dubious responses, you need to dig into them and find the reason. Often, finding someone in your extended network who has worked with this person is a great way to get an honest, unbiased answer.

5. Last, and most important, the candidate is interviewing you and your company during the interview process. Don’t forget to sell your company and its people during the interview.

Andrew Eichenbaum is VP, Data Science Solutions at DesignMind. He specializes in data mining, data modeling, and artificial intelligence. Andrew heads DesignMind’s Data Science division.

Hiring a Data Science Team: Hiring Your First Data Scientist

In the first blog post, Hiring a Data Science Team: Types of Data Scientists, we discussed the types of Data Scientists. Now you’ve decided that you need your first Data Scientist, and you need to decide whom to hire.

Most groups who are hiring their first Data Scientist are in one of two situations:

1. You have data and you have questions, but you do not know how to answer the questions given your data.
2. You need help understanding your business in a data-centric format.

If you’re in group one, I suggest looking for an Algorithms Expert or Statistician. They’re very good at finding known or novel ways of answering the questions you have at your fingertips.

However, if you’re in group two, a bit more thought needs to go into your selection. Let’s start with three simple yes/no questions:

1. Do you have a good handle on your data flows, storage, and reporting?
2. Do you fully understand the data you’re currently acquiring and have stored in your databases?
3. Do you have a set of well-defined questions you want answered?

If you answered “no” to #1, you need a Data Wrangler or Data Miner/Algorithms expert with significant data management experience. This answer overrides both other answers, because if you don’t have a good handle on the raw data, you won’t have a handle on anything downstream.

If you answered yes to #1 and no to #2, you need a Data Miner. The first job of your new Data Scientist will be to come in and validate all of your raw data and base assumptions.

If you answered yes to #1 and #2, but no to #3, you need an experienced Algorithms Expert or Data Miner. Your Data Scientist is there to help you define your data-driven path. Their first job is to understand the current status of your analytics systems and suggest where and how to improve them.

Finally, if you answered no to #1 and yes to #2, you are lying to yourself. When you don’t have a good understanding of what data is coming into your system and how it gets there, you can never be sure of the quality of your results.
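For readers who prefer to see the three questions as a decision procedure, here is a minimal sketch in Python. The function name `recommend_first_hire` and its parameter names are made up for illustration. The case where all three answers are yes is not covered above, so the sketch simply returns an empty list for it.

```python
def recommend_first_hire(handle_on_data_flows, understand_data, defined_questions):
    """Map the three yes/no answers (True = yes) to the suggested roles.

    Hypothetical helper for illustration only; the role names are taken
    from the text above.
    """
    # A "no" to question 1 overrides the other two answers.
    if not handle_on_data_flows:
        return [
            "Data Wrangler",
            "Data Miner/Algorithms Expert with significant data management experience",
        ]
    # Yes to #1, no to #2: validate raw data and base assumptions first.
    if not understand_data:
        return ["Data Miner"]
    # Yes to #1 and #2, no to #3: help define the data-driven path.
    if not defined_questions:
        return ["experienced Algorithms Expert", "experienced Data Miner"]
    # All three "yes": not addressed in the post, so no recommendation here.
    return []


if __name__ == "__main__":
    print(recommend_first_hire(True, False, True))
```

Note that answering no to #1 and yes to #2 lands in the first branch, which matches the point above: without a handle on the raw data, the other answers cannot be trusted.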

In the next installment, we’ll talk about the interview process for hiring a Data Scientist.

Andrew Eichenbaum is VP, Data Science Solutions at DesignMind. He specializes in data mining, data modeling, and artificial intelligence. Andrew heads DesignMind’s Data Science division.

Hiring a Data Science Team: Types of Data Scientists

Hiring a new member to your team is always a daunting task. Now combine that with looking to fill “The Sexiest Job of the Century”, Data Scientist, and you have quite the conundrum on your hands:

  • Who and what is a Data Scientist?
  • How do you hire one?
  • Where do you find them?
  • How do you vet them?

Over the next few weeks we’ll discuss these topics in a series on the DesignMind blog. To start, let’s discuss the four major categories of Data Scientists:

Algorithms Expert These are the people who will ask you what questions you want answered. They then try to answer these questions by matching the form and format of your available data to a set of Machine Learning or Optimization techniques. Algorithms Experts usually come from a Computer Science, Electrical Engineering, or Mathematics background.

Data Miner Data Miners are “why” scientists who ask why you’re asking the questions you are asking. They then try to find patterns in the data and build individual or derived Performance Metrics that will help focus the business in their direction and outcomes. Data Miners usually come from a science-based background like Physics, Biology, or Chemistry.

Data Wrangler Just like a cowboy, a Data Wrangler will manage your data flows and make sure data is internally consistent. They look at your raw data and ask what you need from it and what you are missing, then architect and build the systems to deliver it. Data Wranglers come from a diverse set of backgrounds.

Statistician This is a mathematician who looks for patterns in your raw data. They are the classic actuary: given a set of possible outcomes, they look for patterns in your data stream that help predict any of those outcomes. Statisticians usually come from an Applied Math or Statistics background.

It’s interesting to note that most Data Scientists are a blend of more than one category, and that’s a good thing, as Data Scientists are required to fill multiple roles.

In the next installment, we’ll talk about hiring your first Data Scientist and how they fit into your team.

For further reading, I suggest:

Andrew Eichenbaum is Principal Data Science Consultant at DesignMind. He specializes in data mining, data modeling, and artificial intelligence. Andrew heads DesignMind’s Data Science division.

Hadoop Speeds Data Delivery at Bloomberg

Many organizations are adopting Hadoop as their data platform because of two fundamental issues:

  1. They have a lot of data that they need to store, analyze, and make sense of, on the order of tens of terabytes or greater.
  2. The cost of doing the above in Hadoop is significantly lower than the alternatives.

But those organizations are finding there are other good reasons for using Hadoop and other NoSQL data stores (HBase, Cassandra). Hadoop has rapidly become the dominant distributed data platform, much as Linux quickly dominated the Unix operating system market. With that platform comes a rich ecosystem of applications for building data products, whether it’s for the growing SQL on Hadoop movement or real-time data access with HBase.

At the latest Hadoop SF meetup at Bloomberg’s office, two presenters discussed how Bloomberg was taking advantage of this converged platform to power their data products. Bloomberg is the leading provider of securities data to financial companies, but they describe their data problem as “medium data” – they don’t have that much data to deal with, but they do have strong requirements around how quickly they need to deliver it to their users. They have thousands of developers working on all aspects of these data products, but especially custom low-latency data delivery systems.

When Bloomberg explored the use of HBase as the backend of their portfolio pricing lookup tool, they had quite a challenge – support an average query lookup of 10M+ cells in around 60 ms. Initial efforts to use HBase were promising, but not quite fast enough. Through several iterations of optimization, including parallel client queries, scheduling garbage collection, and even enhancing high availability further to minimize the impact of a failed server (HBASE-10070), they were able to hit their targets and move from custom data server products to HBase.
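To give a feel for what “parallel client queries” means in general, here is a minimal Python sketch. It is not Bloomberg’s code and it does not use the HBase client API: the in-memory dictionary `FAKE_STORE`, the function `fetch_chunk`, and the chunk and worker sizes are all hypothetical stand-ins. The idea is simply to split one large lookup into smaller batches and issue them concurrently rather than one after another.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for a remote key-value store.
FAKE_STORE = {f"sec{i:04d}": i * 1.5 for i in range(1000)}


def fetch_chunk(keys):
    """Pretend to fetch one batch of keys from the store (assumed helper)."""
    return {k: FAKE_STORE[k] for k in keys if k in FAKE_STORE}


def chunked(keys, size):
    """Split a list of keys into consecutive batches of at most `size`."""
    return [keys[i:i + size] for i in range(0, len(keys), size)]


def parallel_lookup(keys, chunk_size=100, max_workers=8):
    """Issue the batches concurrently and merge the partial results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(fetch_chunk, chunked(keys, chunk_size)):
            results.update(partial)
    return results


if __name__ == "__main__":
    wanted = [f"sec{i:04d}" for i in range(250)]
    prices = parallel_lookup(wanted)
    print(len(prices))
```

With a real remote store, each batch would be a network round trip, so running the batches side by side lets the total latency approach that of the slowest batch instead of the sum of all of them.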

With the move to Hadoop, Bloomberg also needed better cluster management capabilities. Open-source tools are already dominant in this space, and while Bloomberg leverages a combination of Apache Bigtop for Hadoop, Chef for configuration management, and Zabbix for monitoring, many other good tools exist (personally, I’m most fond of Ansible, Monit, and the proprietary Cloudera Manager). Combining the abilities of the Hadoop platform for developing and running large-scale data products with more efficient provisioning and operational models gives Bloomberg exactly what they need. It’s a model that’s going to play out repeatedly in the coming years at many organizations as Hadoop proves its capabilities as a modern data platform.

Mark Kidwell is Principal Big Data Consultant at DesignMind. He specializes in Hadoop, Data Warehousing, and Technical and Project Leadership. 

 

Online Learning: Next-Gen Education

Technologists, venture capitalists, educators, policy makers, investors, and edtech entrepreneurs gathered in San Francisco on June 24th to discuss the productive use of technology to transform education – from pre-K to life-long learning.  Sponsored by SVForum, law firm Orrick, Herrington, and Sutcliffe, and Microsoft, the Next-Gen Education conference brought together some of the most progressive education companies in the world.

Discussions about the K-12 world included experts from Kidaptive, KIPP, Rethink Education, Clever, and EdSurge.  Higher education experts were from NovoEd, Pathbrite, InsideTrack, Learn Capital, Minerva Project, and Udemy.

Using leading edge Big Data and Business Intelligence technologies, these organizations are able to bring online learning to students around the world. Just one example of these groundbreaking companies is NovoEd. Founded in 2012 by Stanford University professor Amin Saberi, NovoEd creates online courses that foster more social interactions between students and teachers. NovoEd’s list of partners now includes Stanford, Princeton, University of Michigan, University of Virginia Darden School of Business, Wharton, and the Carnegie Foundation, among others.

Worldwide, what are the major markets for online classes? The United States leads by a landslide, followed by India, the United Kingdom, Canada, and China.

 

Joy Mundy of Kimball Group on Dimensional Modeling

Joy Mundy of the renowned IT consulting firm Kimball Group focuses on Data Warehouse and Business Intelligence solutions.

She spoke at SQLSaturday Silicon Valley on designing dimensional models. Joy also emphasized the importance of consulting with business users during the design process. “Get as close to the business users as possible and make them part of the design team,” she advised. Before joining Kimball Group, Joy worked on Microsoft’s BI Best Practices Team.