Jan 31, 2017: Maurice van Keulen: Managing uncertainty in data: The key to effective management of data quality problems

Room: HB 2F
Maurice van Keulen

Business analytics and data science are significantly impaired by a wide variety of 'data handling' issues, especially when data from different sources are combined and when unstructured data is involved. The root cause of many such problems lies in data semantics and data quality. We have developed a generic method based on modeling such problems as uncertainty *in* the data, i.e., as probabilistic data. Probabilistic databases can store and query such data, scaling not only with a growing volume of data but also with a growing volume of uncertainty. This allows one, for example, to postpone the resolution of certain data problems by storing the unresolved data as is and assessing later how they influence analytical results. We furthermore develop technology for data cleansing, web harvesting, and natural language processing that uses this method to deal with the ambiguity of natural language and many other problems encountered when using unstructured data. Besides explaining probabilistic database technology, this talk pays significant attention to its application in NLP, because probabilistic data inspired a fundamentally different formalization of aspects of meaning in natural language sentences, one that enables other approaches to dealing with ambiguity. We call it "Sherlock Holmes-style" because of the principle "when you have eliminated the impossible, whatever remains, however improbable, must be the truth".
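
As a rough illustration of the idea, and not the speaker's actual system, the sketch below keeps conflicting values for one record as weighted alternatives, answers a query over them, and then applies the "eliminate the impossible" step by discarding alternatives ruled out by new evidence and renormalizing. The record, the probabilities, and the helper names `probability` and `eliminate` are all hypothetical.

```python
from fractions import Fraction

# Hypothetical example: sources disagree on a customer's city.
# Each alternative is stored with a probability instead of being resolved up front.
worlds = [
    ({"name": "J. Smith", "city": "Enschede"}, Fraction(6, 10)),
    ({"name": "J. Smith", "city": "Hengelo"}, Fraction(3, 10)),
    ({"name": "J. Smith", "city": "Almelo"}, Fraction(1, 10)),
]


def probability(worlds, predicate):
    """Probability that the predicate holds, summed over the alternatives."""
    return sum(p for record, p in worlds if predicate(record))


def eliminate(worlds, impossible):
    """Drop alternatives ruled out by new evidence and renormalize the rest."""
    remaining = [(r, p) for r, p in worlds if not impossible(r)]
    total = sum(p for _, p in remaining)
    if total == 0:
        raise ValueError("evidence rules out every alternative")
    return [(r, p / total) for r, p in remaining]


# Query the unresolved data as is.
before = probability(worlds, lambda r: r["city"] == "Hengelo")

# Later evidence (assumed here) shows the customer cannot live in Enschede.
cleaned = eliminate(worlds, lambda r: r["city"] == "Enschede")
after = probability(cleaned, lambda r: r["city"] == "Hengelo")

print(before, after)  # prints: 3/10 3/4
```

Exact fractions are used so the renormalized probabilities sum to exactly 1. A real probabilistic database would represent the alternatives far more compactly than an explicit list.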