Businesses should not deploy open source software for data mining just because it is generally cheaper, an open source consultant has advised.
“Don’t focus solely on cost savings,” said Jos van Dongen, an associate and principal at business intelligence (BI) consultancy DeltIQ Group, at the Predictive Analytics World conference in London yesterday. “It [open source] could turn out more expensive because it could require specialised people and more work. Other benefits could be more important.”
To this end, van Dongen, who is also an independent consultant in open source BI software, said that companies should evaluate open source software as they would any other software. “It doesn’t matter if the software is free if it takes longer to build, manage and deploy solutions to end users, or if it is unstable, or missing key features. Don’t select just because it is open source,” he said.
Van Dongen compared the different benefits of WEKA KnowledgeFlow, an open source tool, and proprietary software SPSS Modeler from IBM, to illustrate his point.
While WEKA is free, extendable and embeddable and covers more than 95 percent of data mining, van Dongen recognised that SPSS had certain advantages over it. “[SPSS] isn’t a cheap solution but it is scriptable and it is very powerful. The types of analyses covered by SPSS are much broader than what you can do with WEKA.
“WEKA is old-school data mining, [for instance] you can’t do text analytics in WEKA. Whereas with SPSS, you can run only part of a model, run different branches or easily compare different models in the same working environment. SPSS is a much more mature interface to work with. So if you want more intuition and power, skip WEKA, go for SPSS.”
Despite this, van Dongen believes that if a business does not have any existing tools for data mining, it should make open source the default option.
For organisations in this situation, he recommended open source data mining system RapidMiner, which provides capabilities such as data integration, data analysis and reporting. RapidMiner was rated this year as the most popular data mining tool in the KDnuggets Data Mining and Analytics Software Poll.
However, van Dongen advised businesses against taking a ‘rip and replace’ approach to implementing open source, and suggested instead that they plan to augment their existing software with it.
“Look at gaps in the BI portfolio and data warehouse stack, and use open source to supplement your systems. Try to work in conjunction with existing solutions,” he said.
In addition, van Dongen said that most organisations are adopting open source in an ad hoc fashion, on a project-by-project basis. He therefore recommended that firms consider developing open source policies in order to standardise the process.