Why Microsoft SSAS Should Be Part Of Your Big Data Game Plan

While many organizations are using Hadoop for big data warehousing, it’s also worth considering a reliable alternative: Microsoft SSAS (SQL Server Analysis Services).

Here in Silicon Valley, you can’t have a cup of coffee without overhearing a conversation about big data. “Build-Measure-Learn”, the mantra of today’s corporate world, results in every organization within a company needing a system where data from disparate sources can be easily analyzed and reported on. Enter “data warehousing”.

As familiar a term as "big data" is, there are many misconceptions regarding traditional data warehousing tools such as SQL Server Analysis Services (Microsoft SSAS) and more recent evolutions like Hadoop. In this blog post, we tackle one of these misconceptions – SSAS and its applicability to larger datasets.

Crocodile, crocodile – which color do you want?
Consider this scenario: Company A has generated reams of data about its operations and needs useful insights to streamline its processes. So which data warehousing tool should they choose? The decision to use one type over the other depends on the type of data, its size, and the reporting needs. And no, the answer is not Hadoop every single time. Here's a look at the benefits of an alternative solution, SQL Server Analysis Services.

SQL Server Analysis Services (SSAS)
SQL Server Analysis Services (SSAS) has been around for a long time, and has proven time and again that it can handle really complex data. Complex business logic can be built right into the design of a datamart that uses SSAS. There are many reasons to consider it as your go-to analytical tool:

Multidimensional data models: SSAS works well with multidimensional data models. You can design as many dimensions and measures as you would like to have in your datamart while maintaining a good design that focuses on relevant metrics.

Dimension Hierarchies: SSAS lets you build hierarchies, which let you slice and dice and drill through your dimension data for all reporting needs.

Speed: SSAS cubes can precompute and physically store the aggregations used to summarize the data. When a query is run against the cube, the calculations are not computed at run time; the precomputed aggregations are used instead, so results come back almost immediately. The results of a query that has already been run against the cube remain in the cache until the cache is next updated, and you control when that update runs.

Built-in time calculations: Analysis Services lets you perform many time-based calculations, such as rolling averages and year-to-date metrics (see the sketch after this list).

Historical Data Analysis: The Microsoft SSAS cube lets you store historical data and compare it across periods through its MDX and DMX calculations.

Integrates well with many reporting tools: SSAS cube solutions can be integrated with many front end reporting tools, and no customized software builds are necessary in order to run reports. It works well with Excel, Tableau, and SQL Server Reporting Services, to name a few. The user does not need extensive training to start using the tool for reporting.

Data security: Microsoft SSAS offers some of the most fine-grained data security available, down to the cell level. You can set security on individual dimensions, as well as on dimension attributes, for users with different security levels.

Capacity to handle HUGE data volumes: Despite common misconceptions, it is indeed possible to build multi-terabyte cube solutions. The key to a successful big cube is sound design combined with best practices for data processing and query performance tuning; Microsoft has released several white papers on the subject. With careful design and performance tuning, you can get the most out of an SSAS cube even for very large datasets.
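
To make the built-in time calculations above concrete, here is a minimal sketch of a year-to-date metric expressed in MDX and executed from Python. It assumes a Windows client with the MSOLAP OLE DB provider and the adodbapi package installed, and the server, database, cube, dimension, and measure names ([Sales], [Date].[Calendar], [Measures].[Sales Amount]) are hypothetical placeholders; substitute the names from your own cube.

# Minimal sketch: run a year-to-date MDX calculation against an SSAS cube.
# Assumes a Windows client with the MSOLAP OLE DB provider and the adodbapi
# package installed; all server, cube, dimension, and measure names below
# are hypothetical placeholders.
import adodbapi

CONN_STR = (
    "Provider=MSOLAP;"
    "Data Source=localhost;"        # SSAS instance
    "Initial Catalog=SalesCubeDB;"  # Analysis Services database
)

# Calculated member that aggregates the measure from the start of the year
# up to the current member of the Calendar hierarchy (assumes the hierarchy
# has a proper Year level), returned month by month next to the raw measure.
MDX = """
WITH MEMBER [Measures].[Sales Amount YTD] AS
    AGGREGATE(
        YTD([Date].[Calendar].CurrentMember),
        [Measures].[Sales Amount]
    )
SELECT
    { [Measures].[Sales Amount], [Measures].[Sales Amount YTD] } ON COLUMNS,
    [Date].[Calendar].[Month].MEMBERS ON ROWS
FROM [Sales]
"""

conn = adodbapi.connect(CONN_STR)
try:
    cur = conn.cursor()
    cur.execute(MDX)          # the OLE DB provider flattens the cellset
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()

Because the calculation lives in the cube as a calculated member, any client that can issue MDX (Excel, SSRS, Tableau) sees the same year-to-date numbers without reimplementing them.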

In summary, for data sets ranging from a few gigabytes to tens of terabytes, Microsoft's SSAS could be your tool of choice. The decision about which data warehousing tool to use should be made only after a detailed analysis of the type of data being reported on and the reporting needs.

“If you do not know how to ask the right question, you discover nothing”.
– W. Edwards Deming

Pravina Parthasarthy is a Senior Business Intelligence Consultant in DesignMind's Business Intelligence group. She specializes in data warehouse design and implementation.

Power BI: 5 Things You Need to Know Now

Power BI has been out for a while. It's a reporting solution that Microsoft developed to let analysts, BI developers, and power users create ad hoc analysis in the most popular BI tool in the world: Excel. Recently, however, Microsoft has put serious focus on Power BI, making it more accessible, easier to use, and visually more powerful than any analytics tool they've ever built before.

So what is Power BI? What do you need to buy, and how can you make it work for you? Here are five things you need to know so you can embark on the path of data discovery. This latest version of Power BI is an incredibly useful toolset that helps you ingest data, prepare it, analyze it, and present it effectively to your team or customers.

1. Power BI is available for free with Office 2013 Pro Plus or a business license for Office 365.
To have access to all the Power BI tools in Excel, including Power View, you will need at least the Office Pro Plus version. Excel 2010 users can download Power Pivot for free, but will not have access to the rest of the Power BI tools. It's also worth mentioning that if your company has a business license for Office 365, you not only have the right version of Excel 2013 to run all the Power BI tools, but you also get to install Office 2013 Pro Plus on up to five machines. Christmas may have come early for some of you just after I said that.

2. Don’t have Office? You can still get Power BI Public Preview for free.
The latest Power BI service works without Office, Excel, or Office 365. It includes a free Power BI Dashboard that allows you to import data from databases, web apps, or flat files and then create visually compelling dashboards full of interactivity. You can download it here for free.

3. Power BI for Office 365 can refresh data from Azure or your on-premises servers.
Power BI can refresh data from Azure without any additional setup. If you have on-premises SQL Server, Oracle, or another database, you can set up a Data Management Gateway service on your server to automate data refreshes from your OLTP systems. Whether you use Power Pivot or Power Query to develop your Power BI creation, you can have a hybrid environment that's secure, backed by encryption, and easy to use.

4. You can create impressive dashboards and visualizations with Power BI Designer.
The days of setting up drivers and writing custom code to ingest data from popular SaaS applications are past. The new Power BI Designer can bring in data from Salesforce, Google Analytics, Marketo, and more, and visualize that data instantly. It has easy connectors to pull data, do mashups, and create impressive dashboards in minutes.

5. Don’t have Office 365 or the right version of SharePoint? No problem. We have Power Update.
Now with Power Update, you don't need to worry about refreshing your workbooks. This wonderfully simple utility can refresh your Power BI workbooks without a Power BI tenant in the cloud, or even putting your data in the cloud at all. You can refresh your workbooks right on a file server for your viewers to see.

Angel Abundez is VP, Business Intelligence at DesignMind. He specializes in Microsoft SQL Server BI tools, SharePoint and ASP.NET.  Angel heads DesignMind’s Business Intelligence group.

Big Data: How To Do It Right

If you’re beginning your first foray into analyzing your organization’s Big Data, you need to spend some time thinking about the big picture. The payoffs of effectively analyzing your data can be enormous, but you need to plan carefully in order to achieve the optimal outcome.

Here’s a checklist of questions to consider:

  • What data do you have, and what can you obtain?
  • What problems do you have that might be solvable, given a perfect-world understanding of your data (without yet trying to determine what is technically possible with today's tools)?
  • Do you have the resources and executive buy-in to pursue high ROI opportunities uncovered by your initiative?
  • Have you evaluated the various Hadoop vendors, such as Cloudera, MapR, Qubole, or Hortonworks, to see what sets each apart from its competitors? You can learn more on our Partner page.
  • Do you have the internal resources to oversee your big data project and continue to reap the benefits going forward? Read our whitepaper on Building Your Big Data Team to see what kind of manpower you’ll need.

Big Data is a popular buzzword these days, but it really is important: companies that know how to extract valuable information have a huge competitive advantage. However, without proper planning, a Big Data project can become a money pit. You should definitely do your homework first!

Hiring a Data Scientist: Interviewing Basics

With a set of candidate resumes in hand, you now face the task of the interview…

Let me start with a personal view that the hiring system is completely screwed up. A candidate is judged fit or unfit for a position with only about one day of interaction. These interactions are split up amongst a handful of people, so each person has about an hour to say whether they want to spend more waking time with this person than they do with their own family.

First, you must agree on the interviewing basics.  Here’s the basic advice for all tech hiring that I gave my team here at DesignMind in San Francisco:

1. You should have a mid-size group of people interviewing the candidate. Five to eight is a good range of formal interviewers. There can be more if the candidate goes out to lunch with a group, or if you do some pair interviewing. But after 6-8 interview sessions, almost any candidate will burn out.

2. A range of people need to interview the candidate. Having people who do similar work is definitely necessary, but people outside of the core group should also be on the interviewing team. Knowing whether a candidate can talk to people of different backgrounds is a requirement if this person will be on cross-functional teams. Also, knowing how a candidate interacts with a perceived "subordinate" gives great insight into how that person works inside an organization.

3. Interview feedback should happen in an interview-team pow-wow within a day of the interview. The first round of responses should be "yes", "no", or "maybe", where:

  •   No means no
  •   Maybe means no
  •   Yes means maybe

There can be mitigating circumstances where a maybe can be turned into a yes. But if not, you need to pass on the candidate.

4. If the hiring team gives the candidate a yes, it is time to check references. If there are any dubious responses, you need to dig into them and find the reason. Often, finding someone in your extended network who has worked with this person is a great way to get an honest, unbiased answer.

5. Last, and most important, the candidate is interviewing you and your company during the interview process. Don’t forget to sell your company and its people during the interview.

Andrew Eichenbaum is VP, Data Science Solutions at DesignMind. He specializes in data mining, data modeling, and artificial intelligence. Andrew heads DesignMind’s Data Science division.

Hiring a Data Science Team: Hiring Your First Data Scientist

In the first blog post in this series, Hiring a Data Science Team: Types of Data Scientists, we discussed the major categories of Data Scientists. Now you're at the point where you've decided that you need your first data scientist, and need to decide whom to hire.

Most groups who are hiring their first Data Scientist are in one of two situations:

1. You have data and you have questions, but you do not know how to answer the questions given your data.
2. You need help understanding your business in a data-centric format.

If you’re in group one, I suggest looking for an Algorithms Expert or Statistician. They’re very good at finding known or novel ways of answering the questions you have at your fingertips.

However, if you’re in group two, a bit more thought needs to go into your selection. Let’s start with three simple yes/no questions:

1. Do you have a good handle on your data flows, storage, and reporting?
2. Do you fully understand the data you're currently acquiring and have stored in your databases?
3. Do you have a set of well-defined questions you want answered?

If you answered “no” to #1, you need a Data Wrangler or Data Miner/Algorithms expert with significant data management experience. This answer overrides both other answers, because if you don’t have a good handle on the raw data, you won’t have a handle on anything downstream.

If you answered yes to #1 and no to #2, you need a Data Miner. The first job of your new Data Scientist will be to come in and validate all of your raw data and base assumptions.

If you answered yes to #1 and #2, but no to #3, you need an experienced Algorithms Expert or Data Miner. Your Data Scientist is there to help you define your data driven path. Their first job is to understand the current status of your analytics systems and make suggestions on where and how to make improvements to the current systems.

Finally, if you answered no to #1 and yes to #2, you are lying to yourself. When you don’t have a good understanding of what data is coming into your system and how it gets there, you can never be sure of the quality of your results.
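
Purely as an illustration (not a DesignMind tool), the branching above can be restated as a small helper that maps the three yes/no answers to the suggestion made in each case:

# Illustrative restatement of the decision logic above; each argument is the
# yes/no answer to the corresponding question.
def first_data_scientist(handle_on_data_flows: bool,
                         understand_stored_data: bool,
                         well_defined_questions: bool) -> str:
    if not handle_on_data_flows:
        if understand_stored_data:
            # The inconsistent case: without a handle on the raw data you
            # cannot really trust what you think you know about what is stored.
            return "Re-examine your answers: #2 cannot truly be yes if #1 is no"
        return ("Data Wrangler, or a Data Miner/Algorithms Expert with "
                "significant data management experience")
    if not understand_stored_data:
        return "Data Miner (first job: validate the raw data and base assumptions)"
    if not well_defined_questions:
        return ("Experienced Algorithms Expert or Data Miner (to help define "
                "your data-driven path)")
    return "Algorithms Expert or Statistician (answer the questions you already have)"


print(first_data_scientist(True, False, False))   # -> Data Miner (...)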

In the next installment, we’ll talk about the interview process for hiring a Data Scientist.

Andrew Eichenbaum is VP, Data Science Solutions at DesignMind. He specializes in data mining, data modeling, and artificial intelligence. Andrew heads DesignMind’s Data Science division.

Hiring a Data Science Team: Types of Data Scientists

Hiring a new member to your team is always a daunting task. Now combine that with looking to fill “The Sexiest Job of the Century”, Data Scientist, and you have quite the conundrum on your hands:

  • Who and what is a Data Scientist?
  • How do you hire one?
  • Where do you find them?
  • How do you vet them?

Over the next few weeks we’ll discuss these topics in a series on the DesignMind blog. To start, let’s discuss the four major categories of Data Scientists:

Algorithms Expert: These are the people who will ask you what questions you want answered. They then try to answer these questions by matching the form and format of your available data to a set of Machine Learning or Optimization techniques. Algorithms Experts usually come from a Computer Science, Electrical Engineering, or Mathematics background.

Data Miner: Data Miners are "why" scientists who ask why you're asking the questions you are asking. They then try to find patterns in the data and build individual or derived Performance Metrics that help focus the business on its direction and outcomes. Data Miners usually come from a science-based background like Physics, Biology, or Chemistry.

Data Wrangler: Just like a cowboy, a Data Wrangler manages your data flows and makes sure the data is internally consistent. They look at your raw data, ask what you need from it and what you are missing, and then architect and build the systems to deliver it. Data Wranglers come from a diverse set of backgrounds.

Statistician: This is a mathematician who looks for patterns in your raw data. They are the classic actuary: given a set of possible outcomes, they look for patterns in your data stream that help predict those outcomes. Statisticians usually come from an Applied Math or Statistics background.

It’s interesting to note that most Data Scientists are a blend of more than one category, and that’s a good thing as Data Scientists are required to fill multiple roles.

In the next installment, we’ll talk about hiring your first Data Scientist and how they fit into your team.

Andrew Eichenbaum is Principal Data Science Consultant at DesignMind. He specializes in data mining, data modeling, and artificial intelligence. Andrew heads DesignMind’s Data Science division.

Hadoop Speeds Data Delivery at Bloomberg

Many organizations are adopting Hadoop as their data platform because of two fundamental issues:

  1. They have a lot of data that they need to store, analyze and make sense of, on the order of 10s of terabytes or greater.
  2. The cost of doing the above in Hadoop is significantly less expensive than the alternatives.

But those organizations are finding there are other good reasons for using Hadoop and other NoSQL data stores (HBase, Cassandra). Hadoop has rapidly become the dominant distributed data platform, much as Linux quickly dominated the Unix operating system market. With that platform comes a rich ecosystem of applications for building data products, whether it's for the growing SQL-on-Hadoop movement or real-time data access with HBase.

At the latest Hadoop SF meetup at Bloomberg’s office, two presenters discussed how Bloomberg was taking advantage of this converged platform to power their data products. Bloomberg is the leading provider of securities data to financial companies, but they describe their data problem as “medium data” – they don’t have as much data to deal with, but they do have strong requirements around how quickly they need to deliver it to their users. They have thousands of developers working on all aspects of these data products, but especially custom low-latency data delivery systems.

When Bloomberg explored the use of HBase as the backend of their portfolio pricing lookup tool, they had quite a challenge – support an average query lookup of 10M+ cells in around 60 ms. Initial efforts to use HBase were promising, but not quite fast enough. Through several iterations of optimization, including parallel client queries, scheduling garbage collection, and even enhancing high availability further to minimize the impact of a failed server (HBASE-10070), they were able to hit their targets, and allow the move from custom data server products to HBase.
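
Bloomberg's code isn't public, but the flavor of the parallel-client-query technique can be sketched with the open-source happybase client: split a large batch of row keys into chunks and fetch the chunks concurrently, so several region servers are kept busy at once. The Thrift host, table name, and column family below are hypothetical placeholders.

# Rough sketch of parallel client-side HBase reads (not Bloomberg's code):
# split a large batch of row keys into chunks and fetch them concurrently.
# Assumes an HBase Thrift gateway and the happybase package; the host,
# table, and column family names are hypothetical.
from concurrent.futures import ThreadPoolExecutor

import happybase

HOST = "hbase-thrift.example.com"
TABLE = "portfolio_prices"
CHUNK_SIZE = 1000     # row keys per request
MAX_WORKERS = 16      # concurrent client threads


def fetch_chunk(row_keys):
    # Thrift connections are not thread-safe, so each worker opens its own.
    connection = happybase.Connection(HOST)
    try:
        table = connection.table(TABLE)
        return table.rows(row_keys, columns=[b"px"])  # 'px' column family assumed
    finally:
        connection.close()


def parallel_lookup(row_keys):
    chunks = [row_keys[i:i + CHUNK_SIZE]
              for i in range(0, len(row_keys), CHUNK_SIZE)]
    results = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        for rows in pool.map(fetch_chunk, chunks):
            results.extend(rows)
    return results

Client-side parallelism was only one of the optimizations mentioned in the talk, alongside garbage collection scheduling and the high-availability work in HBASE-10070.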

With the move to Hadoop, Bloomberg also needed better cluster management capabilities. Open-source tools are already dominant in this space, and while Bloomberg leverages a combination of Apache Bigtop for Hadoop, Chef for configuration management, and Zabbix for monitoring, many other good tools exist (I'm personally most fond of Ansible, Monit, and the proprietary Cloudera Manager). Combining the abilities of the Hadoop platform for developing and running large-scale data products with more efficient provisioning and operational models gives Bloomberg exactly what they need. It's a model that's going to play out repeatedly in the coming years at many organizations as Hadoop proves its capabilities as a modern data platform.

Mark Kidwell is Principal Big Data Consultant at DesignMind. He specializes in Hadoop, Data Warehousing, and Technical and Project Leadership. 

Joy Mundy of Kimball Group on Dimensional Modeling

Joy Mundy of the renowned IT consulting firm Kimball Group focuses on Data Warehouse and Business Intelligence solutions.

She spoke at SQLSaturday Silicon Valley on designing dimensional models. Joy also emphasized the importance of consulting with business users during the design process. “Get as close to the business users as possible and make them part of the design team,” she advised. Before joining Kimball Group, Joy worked on Microsoft’s BI Best Practices Team.

“The biggest problem I think is that we tend not to talk enough to the business users about what their requirements are and instead build our designs from a technology perspective rather than from a business perspective. It’s a very old problem, there is nothing new here.

The problem exists mainly because we, the people who are in charge of building these systems, are technical people. There is a lot of technology involved and a lot of moving parts that are difficult to put together, but the underlying problem is a business problem, and it's out of the normal comfort zone and skill set of the technological people who are building the solutions."