Colin White of BI Research and Harriet Fryman of IBM help separate the reality from the hype by taking a look at use cases and the benefits customers are gaining from big data.
Python is an increasingly popular object-oriented, interpreted and interactive programming language used for heavy-duty data analysis. Python is designed for ease of use, readability and speed of development, and it is well suited to data-intensive applications. It supports multiple programming paradigms, including object-oriented, imperative and functional styles. It features a fully dynamic type system and automatic memory management, similar to that of Scheme, Ruby, Perl and Tcl.
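As a small illustration of the paradigms mentioned above, the sketch below adds up the same list three ways and then rebinds a name to a different type. The sample values and the `Accumulator` class are made up for this example.

```python
from functools import reduce

readings = [3, 8, 5, 12]  # made-up sample values

# Imperative style: an explicit loop that updates a running total.
total_imperative = 0
for value in readings:
    total_imperative += value

# Functional style: fold the list with a function, no mutable state.
total_functional = reduce(lambda acc, value: acc + value, readings, 0)


# Object-oriented style: data and behaviour bundled in a class.
class Accumulator:
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value
        return self


accumulator = Accumulator()
for value in readings:
    accumulator.add(value)
total_object = accumulator.total

# Dynamic typing: the same name can be rebound to a value of another type.
label = 28
label = "twenty-eight"

print(total_imperative, total_functional, total_object, type(label).__name__)
```

All three totals come out the same; which style to use is largely a matter of what reads best for the problem at hand.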
You can create customized data tools using Python that can handle large data sets efficiently - it lets you work more quickly and integrate your systems more effectively. You can get more done in less time using Python for manipulating, processing, cleaning, and crunching data.
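To give a feel for what cleaning and crunching data looks like in practice, here is a minimal sketch using the Pandas library. The records and column names are invented for illustration; real data would normally be read from a file or database.

```python
import pandas as pd

# Made-up raw records with typical problems: stray whitespace,
# inconsistent case, a missing value and a duplicate row.
raw = pd.DataFrame({
    "city": [" Boston", "boston ", "Chicago", "Chicago", None],
    "sales": ["100", "250", "75", "75", "40"],
})

clean = (
    raw.dropna(subset=["city"])  # drop rows with no city
    .assign(
        city=lambda d: d["city"].str.strip().str.title(),  # normalise text
        sales=lambda d: pd.to_numeric(d["sales"]),  # strings -> numbers
    )
    .drop_duplicates()  # remove repeated rows
)

# Total sales per city after cleaning.
totals = clean.groupby("city")["sales"].sum()
print(totals)
```

Each step is a single method call, so the whole clean-up reads as one short pipeline.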
Python allows an organization to build a framework that makes it easy to collect data from a myriad of data sources and model it. Instead of spending time writing database connector code, you can use a simple configuration and quickly get off the ground. Python also lets an organization move code from development to production more quickly, because the same code written as a prototype can often be moved into production.
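One way such a configuration-driven approach might look is sketched below, using only the standard library. The loader functions, the `SOURCES` layout and the sample data are all hypothetical; in-memory text stands in for real files or database connections so the sketch runs on its own.

```python
import csv
import io
import json


# Hypothetical loaders, keyed by the format name used in the configuration.
def load_csv(handle):
    return list(csv.DictReader(handle))


def load_json(handle):
    return json.load(handle)


LOADERS = {"csv": load_csv, "json": load_json}

# Hypothetical configuration: each entry names a source, its format and
# a callable that opens it.
SOURCES = [
    {"name": "orders", "format": "csv",
     "open": lambda: io.StringIO("id,amount\n1,20\n2,35\n")},
    {"name": "customers", "format": "json",
     "open": lambda: io.StringIO('[{"id": 1, "name": "Ada"}]')},
]


def collect(sources):
    """Load every configured source into a dict of record lists."""
    data = {}
    for source in sources:
        loader = LOADERS[source["format"]]
        with source["open"]() as handle:
            data[source["name"]] = loader(handle)
    return data


data = collect(SOURCES)
print(data["orders"])
print(data["customers"])
```

Adding a new source then means adding one configuration entry, and adding a new format means registering one more loader function.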
If you like the R language, Python libraries such as SciPy, IPython and Pandas provide much of the mathematical functionality typically found in R. While R offers more packages and visualization capabilities at this time, Python is catching up.
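For readers coming from R, the sketch below shows rough Pandas counterparts of two everyday R operations. The data is made up, and the R calls named in the comments are loose analogies, not exact equivalents.

```python
import pandas as pd

# Made-up measurements for two groups.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "score": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Loosely comparable to summary(df$score) in R: count, mean, quartiles, etc.
summary = df["score"].describe()

# Loosely comparable to aggregate(score ~ group, df, mean) in R.
group_means = df.groupby("group")["score"].mean()

print(summary)
print(group_means)
```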
Simply put, Python is easy to learn, platform neutral and cheap. Python is a tool to build other tools with, including data analysis tools. It grew out of a broad mix of programming paradigms, styles and languages. Python runs on Windows, Linux/Unix and Mac OS X, and has been ported to the Java and .NET virtual machines.
Python is free to use, even for commercial products, because of its OSI-approved open source license. See: http://www.python.org/psf/license/
Pandas is a Python package for doing data transformation and statistical analysis. Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools. See: http://pandas.pydata.org/
While R is the most widely used open source environment for statistical modeling and graphics, Pandas adopts some of the best concepts of R, like the foundational data.frame. Pandas has been described as "R data.frame on steroids". Pandas seeks to remedy some frustrations common to R users:
1. R has only basic data alignment and indexing functionality, leaving much work to the user. Pandas makes it easy and intuitive to work with messy, irregularly indexed data - like time series data. Pandas also provides rich tools, like hierarchical indexing, not found in R;
2. R is not well-suited to general purpose programming and system development. Pandas enables you to do large-scale data processing seamlessly when developing your production applications;
3. Hybrid systems connecting R to a low-productivity systems language like Java, C++, or C# suffer from significantly reduced agility and maintainability, and you’re still stuck developing the system components in a low-productivity language;
4. The "copyleft" GPL license of R can create concerns for commercial software vendors who want to distribute R with their software under another license. Python and Pandas use more permissive licenses.
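The data alignment and hierarchical indexing mentioned in the first point can be sketched as follows. The dates, regions and figures are made up for illustration.

```python
import pandas as pd

# Two made-up daily series with different, partly overlapping dates.
a = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-04"]),
)
b = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-04"]),
)

# Arithmetic aligns on the index labels; dates missing from either side
# become NaN instead of being silently mismatched.
total = a + b

# Hierarchical (multi-level) index: region first, then year.
sales = pd.Series(
    [5, 7, 11, 13],
    index=pd.MultiIndex.from_tuples(
        [("east", 2012), ("east", 2013), ("west", 2012), ("west", 2013)],
        names=["region", "year"],
    ),
)

east = sales.loc["east"]  # every year for one region
by_year = sales.groupby(level="year").sum()  # collapse the region level

print(total)
print(east)
print(by_year)
```

Neither series needs to be re-indexed by hand before the addition; Pandas matches the dates itself.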
Top Python Advantages
- Instant feedback from the interactive interpreter.
- Non-intrusive: You think about the problem, not the tool you are working with. After you learn Python, it gets out of the way.
- Libraries: Whatever you want to do, somebody has written code to help you get there.
- Community: The community is a great source of examples and ideas.
- The philosophy of one-best-way means that Python programmers all tend to do things in sort of the same way. This is a big advantage because it makes it easy to read other people's code - a great way to learn.
Top Python Disadvantages
- No single source of truth / best-practices: It can be hard to learn what is the best library for a particular job. The large number of packages relevant to a particular task can make it difficult to find the one best suited to your exact needs.
- Documentation is substandard: The Python official documentation is seldom the best way to learn a new library. The informal Python community provides the most useful examples. Yet sorting out the wheat from the chaff can be hit-or-miss.
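- Concurrency: Python was designed without concurrency in mind and it shows. In the standard CPython interpreter, for example, a global interpreter lock allows only one thread at a time to execute Python code.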
Selecting the right Business Intelligence (BI) and Analytics Platform for a unique organization is challenging. Each organization - both in the public and private sectors - has a legacy IT ecosystem as well as unique competencies, skills, knowledge / business processes and people. As a result, the selection process should factor in all of the above and include all stakeholders, including data scientists.
Gartner recently released the "2013 Business Intelligence and Analytics Platforms Magic Quadrant" as shown above. I suggest using Gartner's Magic Quadrant as one source of guidance - albeit with a heavy grain of salt. There may be conflicts of interest with Gartner's ratings. It is wise to give more weight to organization requirements and best platform fit than to vendor ratings. In addition, there is no perfect BI / Analytical Platform - each has strengths and weaknesses.
It is advisable to include data scientists in the selection process. They will provide invaluable information about the strengths and weaknesses of various BI / Analytical Platforms. Moreover, they will use different analytical tools for their job and know best how those tools integrate with different BI / Analytical Platforms.
Many organizations are electing to architect and build their own BI / Analytical Platforms using a mixture of open source and proprietary technologies. This has the advantage of greater technology agility and flexibility with the potential for lower costs. The disadvantage is the greater expertise and time required to design, build and maintain the platform. Many organizations value IT agility and flexibility for the future over other considerations. Others assign greater weight to the perceived security of a major vendor's black box.
I strongly suggest using a third-party professional - usually an independent professional IT services firm or systems integrator - to provide counsel in the selection process. They can provide an independent opinion - free of conflicts of interest - on which vendor(s) best meet the needs of the organization. They can also help design and implement an overall Information Management Plan.
It is critical to have an organizational Information Management Plan to get optimal use of the BI / Analytical Platform. This may entail re-engineering knowledge / business processes and painful change management. Yet the short term pain may be worth the long term reward.