Data science researcher builds tools to speed up data preparation
Data preparation is widely regarded as the most time-consuming part of data science, taking up much of a data scientist's time. SFU computing science professor Jiannan Wang's mission is to speed up data science by greatly reducing the time spent on data preparation. To do this, he develops innovative technologies and open-source tools for data scientists to use.
Data preparation refers to the process of collecting, exploring, cleaning, transforming and integrating data into a form for downstream analysis and modeling. By 2025, it is estimated that the market for data preparation will be more than .
"Data preparation is not a single problem," says Wang.
"It consists of many challenging problems such as discovering, understanding, cleaning and integrating the data."
These problems are often easier to solve with crowdsourcing and human intelligence than with full automation. For example, entity resolution is the task of identifying records that refer to the same real-world entity. It is central to data cleaning and integration, but algorithmic solutions are far from perfect. Wang built , the first crowdsourced entity resolution system able to outperform the best human-only and machine-only systems. To reduce the human cost, he also developed the first quality-aware task assignment system for various data preparation tasks.
Wang's project was proposed to scale the expensive data cleaning process. The main idea of this project is to have a human clean a small sample of the data, and then have the machine use these results to learn the cleaning process and lessen the impact of unclean data on query results. This system was incorporated in the , one of the world's most popular big data stacks at the time.
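The sample-then-estimate idea can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions, not the project's actual algorithm: the data is synthetic, the `human_clean` function is a hypothetical stand-in for a person correcting a record, and the query is a plain average.

```python
import random
import statistics


def human_clean(value):
    """Hypothetical stand-in for a person fixing one record.

    Assumption for this sketch: a 'dirty' value is a price that was
    mistakenly entered in cents instead of dollars.
    """
    return value / 100 if value > 1000 else value


def estimate_clean_mean(dirty_values, sample_size, seed=0):
    """Estimate the mean of the cleaned data from a small cleaned sample."""
    rng = random.Random(seed)
    sample = rng.sample(dirty_values, sample_size)      # pick records to clean
    cleaned_sample = [human_clean(v) for v in sample]   # only the sample is cleaned
    return statistics.mean(cleaned_sample)


# Synthetic data: 1,000 prices between 20 and 30 dollars,
# roughly 10% of which were entered in cents.
rng = random.Random(42)
true_prices = [rng.uniform(20, 30) for _ in range(1000)]
dirty_prices = [p * 100 if rng.random() < 0.1 else p for p in true_prices]

dirty_mean = statistics.mean(dirty_prices)          # query on unclean data
estimated_mean = estimate_clean_mean(dirty_prices, sample_size=100)
true_mean = statistics.mean(true_prices)            # unknown in practice

print(f"dirty: {dirty_mean:.2f}  estimated: {estimated_mean:.2f}  true: {true_mean:.2f}")
```

In this toy setting, cleaning only 100 of the 1,000 records gives an average much closer to the true value than the average computed over the unclean data.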
His mission to speed up data science, however, can perhaps best be seen in his string similarity join work. String similarity join is defined as finding all pairs of similar strings whose similarity values are above a user-specified threshold. Wang's proposed algorithms made several major breakthroughs and ran 10 to 100 times faster than all other algorithms at the String Similarity Join/Search Competition hosted by in 2013, reducing the algorithm run time from hours to minutes.
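To make the definition concrete, here is a minimal Python sketch of a string similarity join. It is a naive baseline that compares every pair of strings, not Wang's algorithms; the choice of word-level Jaccard similarity, the function names and the sample records are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def similarity_join(strings, threshold):
    """Return every pair (s, t, score) whose similarity is above the threshold."""
    results = []
    for s, t in combinations(strings, 2):  # all unordered pairs
        score = jaccard(s, t)
        if score > threshold:
            results.append((s, t, score))
    return results


# Illustrative records only.
records = [
    "apple iphone 12 pro",
    "Apple iPhone 12 Pro Max",
    "samsung galaxy s21",
]
pairs = similarity_join(records, threshold=0.7)
print(pairs)
```

Because this baseline checks every pair, its cost grows quadratically with the number of strings, which is why faster algorithms matter on large datasets.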
In recognition of these research breakthroughs, Wang recently received a from CS-Can|Info-Can. The organization's mission is "to foster excellence in Computer Science research and higher education in Canada, drive innovation and benefit society." This award comes after Wang also received the in 2018.
"This award recognizes my past achievements, but in my opinion, research is a marathon," says Wang, who also serves as the program director for the Professional Master's Program in Computer Science at SFU.
His long-term research goal can be found in his new project, DataPrep, an all-in-one data preparation system that provides the easiest way for data scientists to prepare data in Python. Since the project began in May 2019, DataPrep has already been downloaded over and has received positive feedback on forums such as
Through his research, Wang hopes to build a community that is equal, diverse and inclusive while saving data scientists time during the crucial stage of data preparation.
"This award gives me more motivation to do excellent research and to focus on the impact of this research," says Wang.
"There are millions of data scientists in the world who spend a lot of time on data preparation, so solving this problem could have a huge impact on society."