From Data Mining to Statistical Data Mining: Emerging Opportunities for Statisticians and Businesses

Dr. Granville is Chief Scientist at a publicly traded company, and the founder of Analyticbridge, the leading social network for analytic professionals, with more than 35,000 members. He holds several patents related to web traffic quality scoring, and he is an invited speaker at leading international data mining conferences. Vincent has consulted with Visa, eBay, Wells Fargo, Microsoft, CNET, LowerMyBills, InfoSpace and a number of startups on projects such as fraud detection, user experience, core KPIs, metric selection, change point detection, multivariate testing, competitive intelligence, keyword bidding optimization, taxonomy creation, scoring technology and web crawling.

Web analytics and business analytics are becoming increasingly popular. While both areas have benefited from significant computer science advances, such as cloud computing, programmable APIs, SaaS, modern programming languages (Python) and architectures (Map/Reduce), the true revolution has yet to come.

We will reach limits in terms of hardware and architecture scalability. Also, cloud computing can only be applied to problems that can easily be partitioned, such as search (web crawling). Soon, a new type of statistician will be critical to further optimize “big data” business applications. They might be called data mining statisticians, statistical engineers, business analytics statisticians, or data or modeling scientists. Essentially, they will have a strong background in:

  • Design of experiments: multivariate testing is critical in web analytics
  • Fast, efficient, unsupervised clustering and algorithms to solve taxonomy and text clustering problems involving billions of search queries
  • Advanced scoring technology for fraud detection and credit or transaction scoring, or to assess whether a click or Internet traffic conversion is real or botnet-generated. The models could involve sophisticated versions of constrained or penalized logistic regression and unusual, robust decision trees such as hidden decision trees, and should also provide confidence intervals for individual scores
  • Robust cross-validation, model selection and fitting without over-fitting, as opposed to traditional back-testing
  • Integration of time series cross-correlations with time lags, spatial data, and event categorization and weighting, e.g. to better predict stock prices
  • Monte Carlo, bootstrap and data-driven, model-free, robust statistical techniques used in high-dimensional spaces
  • Fuzzy merging to integrate corporate data with data gathered on social networks, and other external data sources
  • Six Sigma concepts and Pareto analyses to accelerate the software development lifecycle
  • Models that detect causes rather than correlations
  • Statistical metrics to measure lift, yield and other critical KPIs (key performance indicators)
  • Visualization skills, including presenting data summaries not just in charts but also in videos
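
To make two of the items above concrete (bootstrap techniques and lift measurement), here is a minimal sketch of a percentile bootstrap confidence interval for the relative lift in conversion rate between a control and a test variant. The function name, the sample sizes and the conversion counts are illustrative assumptions, not figures from a real test.

```python
import random


def bootstrap_lift_ci(control, treatment, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for relative lift in conversion rate.

    control, treatment: lists of 0/1 outcomes (1 = converted).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    def rate(xs):
        return sum(xs) / len(xs)

    point = rate(treatment) / rate(control) - 1.0

    lifts = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping group sizes fixed.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        rc = rate(c)
        if rc == 0:
            continue  # lift is undefined when the resampled control has no conversions
        lifts.append(rate(t) / rc - 1.0)

    lifts.sort()
    lo = lifts[int((alpha / 2) * len(lifts))]
    hi = lifts[min(len(lifts) - 1, int((1 - alpha / 2) * len(lifts)))]
    return point, lo, hi


# Hypothetical data: 50 conversions out of 1,000 visits vs. 65 out of 1,000.
control = [1] * 50 + [0] * 950
treatment = [1] * 65 + [0] * 935

point, lo, hi = bootstrap_lift_ci(control, treatment)
print(f"lift = {point:.1%}, 95% CI = [{lo:.1%}, {hi:.1%}]")
```

The point estimate alone (a 30% lift on this made-up data) says nothing about its reliability; the interval shows how wide the uncertainty is at these sample sizes.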

An example of a web analytics application that will benefit from statistical technology is estimating the value (CPC, or cost-per-click) and volume of a search keyword depending on market, position and match type, a critical problem for Google and Bing advertisers, as well as publishers. Currently, if you use the Google API to get CPC estimates, Google returns no value more than 50% of the time. This is a classic example of a problem that was addressed by smart engineers and computer scientists but that truly lacks a statistical component, even one as simple as naïve Bayes, to provide a CPC estimate for any keyword, even those that are brand new. Statisticians with experience in imputation methods should easily solve this problem, and help their company sell CPC and volume estimates (with confidence intervals, something Google does not offer) for all keywords.
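
As a rough illustration of the imputation idea, the sketch below estimates a CPC for an unseen keyword from the CPCs of known keywords that share its words, falling back to the global mean when nothing matches. This is a simple token-averaging back-off with shrinkage, not naïve Bayes proper and not any search engine's method; the function names, the `prior_weight` parameter and the keyword data are all hypothetical.

```python
from collections import defaultdict


def fit_token_cpc(observed):
    """Aggregate CPC per token from keywords with a known CPC.

    observed: dict mapping keyword -> CPC.
    Returns (token_stats, global_mean), where token_stats maps
    token -> [sum_of_cpc, count].
    """
    totals = defaultdict(lambda: [0.0, 0])
    for keyword, cpc in observed.items():
        for token in set(keyword.lower().split()):
            totals[token][0] += cpc
            totals[token][1] += 1
    global_mean = sum(observed.values()) / len(observed)
    return dict(totals), global_mean


def estimate_cpc(keyword, token_stats, global_mean, prior_weight=1.0):
    """Impute a CPC for any keyword, including brand-new ones.

    Each known token contributes its mean CPC, shrunk toward the global
    mean (prior_weight acts like that many pseudo-observations).
    Unknown tokens are ignored; if no token is known, return global_mean.
    """
    estimates = []
    for token in set(keyword.lower().split()):
        if token in token_stats:
            total, count = token_stats[token]
            estimates.append((total + prior_weight * global_mean) / (count + prior_weight))
    if not estimates:
        return global_mean
    return sum(estimates) / len(estimates)


# Hypothetical observed CPCs (in dollars).
observed = {
    "car insurance quote": 12.0,
    "cheap car insurance": 10.0,
    "car wash": 1.0,
    "free wallpaper": 0.2,
}
token_stats, global_mean = fit_token_cpc(observed)

print(estimate_cpc("insurance comparison", token_stats, global_mean))
print(estimate_cpc("quantum gardening", token_stats, global_mean))
```

A production version would also condition on market, position and match type, and would attach a confidence interval to each estimate, for instance by bootstrapping the observed keywords.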

Another example is spam detection in social networks. The most profitable networks will be those where content, be it messages posted by users or commercial ads, is highly relevant to users without invading their privacy. Those familiar with Facebook know how much progress still needs to be made; improvements will rely on better statistical models. Spam detection is still largely addressed using naïve Bayes techniques, which are notoriously flawed due to their inability to take into account rule interactions. It is like running a regression model where all independent variables are highly… dependent on each other.
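
The flaw can be shown with a few lines of arithmetic. In the sketch below, a single spam rule with a likelihood ratio of 4 moves a 20% prior to a 50% posterior. If a second rule that always fires together with the first is added, it carries no new information, yet naïve Bayes multiplies the evidence again and reports 80%. The prior and the likelihood ratio are made-up numbers chosen for illustration.

```python
def naive_bayes_posterior(prior_spam, likelihood_ratios):
    """Posterior P(spam | rules fired) under the naive independence assumption.

    likelihood_ratios: one value P(rule fires | spam) / P(rule fires | not spam)
    per rule that fired. Naive Bayes simply multiplies them into the prior odds.
    """
    odds = prior_spam / (1.0 - prior_spam)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


prior = 0.2            # assumed share of spam among all messages
rule_ratio = 0.8 / 0.2  # rule fires on 80% of spam and 20% of legitimate messages

# One rule fired: this posterior is correct.
one_rule = naive_bayes_posterior(prior, [rule_ratio])

# The same rule counted twice (two perfectly correlated rules):
# the true posterior is unchanged, but naive Bayes double-counts the evidence.
duplicated_rule = naive_bayes_posterior(prior, [rule_ratio, rule_ratio])

print(f"one rule:        {one_rule:.2f}")
print(f"duplicated rule: {duplicated_rule:.2f}")
```

Real rule sets are rarely perfectly correlated, but the same inflation occurs, to a lesser degree, whenever rules overlap.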

Finally, in the context of online advertising ROI optimization, one big challenge is assigning attribution. If you buy a product 2 months after having seen a TV ad twice, one month after checking organic search results on Google for the product in question, one week after clicking on a Google paid ad and 3 days after clicking on a Bing paid ad, how do you determine the cause of your purchase? It could be 25% due to the TV ad, 20% due to the Bing ad, and so on. This is a rather complicated advertising mix optimization problem, and being able to accurately track users over several months helps solve the statistical challenge. Yet, with more user tracking regulations preventing the use of IP addresses in databases for targeting purposes, the problem will become more complicated, and more advanced statistics will be required. Companies working with the best statisticians will be able to provide great targeting and high ROI without “stalking” users in corporate databases and data warehouses.
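
One simple baseline for splitting credit among touchpoints is time-decay attribution, where each touchpoint's weight halves every fixed number of days before the purchase. The sketch below applies it to the scenario above. It is a heuristic, not a causal model and not the method advocated here; the 14-day half-life and the channel names are assumptions, and the two TV ad exposures are collapsed into a single touchpoint for simplicity.

```python
def time_decay_attribution(touchpoints, half_life_days=14.0):
    """Split purchase credit across touchpoints with exponential time decay.

    touchpoints: dict mapping channel -> days elapsed between the
    touchpoint and the purchase. Returns channel -> share of credit,
    with shares summing to 1.
    """
    weights = {
        channel: 0.5 ** (days / half_life_days)
        for channel, days in touchpoints.items()
    }
    total = sum(weights.values())
    return {channel: weight / total for channel, weight in weights.items()}


# The purchase path from the example, in days before the purchase.
path = {
    "tv_ad": 60,
    "organic_search": 30,
    "google_paid": 7,
    "bing_paid": 3,
}

for channel, share in time_decay_attribution(path).items():
    print(f"{channel:15s} {share:.1%}")
```

Under this rule most of the credit goes to the two most recent clicks, which shows the limitation: the shares are driven entirely by recency, whereas estimating how much each channel actually caused the purchase requires experiments or statistical models fitted across many users.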
