1. Invited Keynote Title: Mosaics in Big Data: Stratosphere, Apache Flink, and Beyond
Invited Keynote Abstract: The global database research community has greatly impacted the functionality and performance of data storage and processing systems along the dimensions that define “big data,” i.e., volume, velocity, variety, and veracity. Locally, over the past five years, we have also been working on several fronts. Among our contributions are: (1) establishing a vision for a database-inspired big data analytics system, which unifies the best of database and distributed systems technologies and augments them with concepts drawn from compilers (e.g., iterations) and data stream processing, and (2) forming a community of researchers and institutions to create the Stratosphere platform that realizes this vision. One major result of these activities was Apache Flink, an open-source big data analytics platform with a thriving global community of developers and production users. Although much progress has been made, when looking at the overall big data stack, a major challenge for the database research community remains: how to maintain ease of use despite the increasing heterogeneity and complexity of data analytics, which involves specialized engines for various aspects of an end-to-end analytics pipeline (including, among others, graph-based, linear-algebra-based, and relational algorithms) running on increasingly heterogeneous hardware and computing infrastructure. At TU Berlin, DFKI, and the Berlin Big Data Center (BBDC), we aim to advance research in this field via the Mosaics project. Our goal is to remedy some of the heterogeneity challenges that hamper developer productivity and limit the use of data science technologies to just the privileged few, who are coveted experts.
Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA) Group at the Technische Universität Berlin (TU Berlin) and an Adjunct Full Professor at the University of Toronto. At the German Research Center for Artificial Intelligence (DFKI), he is both a Chief Scientist and Head of the Intelligent Analytics for Massive Data Research Group. In addition, he is Director of the Berlin Big Data Center (BBDC). Earlier in his career, he was a Research Staff Member and Project Leader at the IBM Almaden Research Center in San Jose, California, USA and a Research Group Leader at FORWISS, the Bavarian Research Center for Knowledge-based Systems located in Munich, Germany. Dr. Markl has published numerous research papers on indexing, query optimization, lightweight information integration, and scalable data processing. He holds 20 patents, has transferred technology into several commercial products, and advises several companies and startups. He has been both the Speaker and Principal Investigator for the Stratosphere Project, which resulted in a Humboldt Innovation Award as well as Apache Flink, the open-source big data analytics system. He serves as the President-Elect of the VLDB Endowment and was elected as one of Germany's leading Digital Minds (Digitale Köpfe) by the German Informatics (GI) Society. Most recently, Volker and his team earned an ACM SIGMOD Research Highlight Award 2016 for their work on “Implicit Parallelism Through Deep Language Embedding.”
2. Invited Keynote Title: Medical Treatment Support by Data Engineering Technologies
Invited Keynote Abstract: Our daily lives have been greatly impacted by information technology, which has become one of our most important infrastructures. For example, information technology has introduced significant changes in the medical field, such as medical image recognition, medical sensor data processing, computational drug design, and electronic medical record (EMR) systems. Focusing on EMR systems, data engineering technologies have high potential to support them. EMR systems not only reduce the cost of managing medical treatment histories, but can also improve medical processes through the secondary use of these records. To expedite such secondary use, the Japanese government has started a project to collect EMRs from a large number of hospitals in Japan. The clinical pathway service is a good instance of the secondary use of EMRs. Medical workers, including doctors, nurses, and technicians, generally use clinical pathways as guidelines for typical sequences of medical treatments. Clinical pathways have traditionally been created, with great effort, by the medical workers themselves based on their experience. Candidate clinical pathways can instead be extracted by applying sequential pattern mining techniques to the medical orders in the EMR. Comparing the extracted frequent sequential patterns with existing clinical pathways helps medical workers verify their correctness or modify them. To provide proper patterns as useful information to medical workers, a number of technical issues must be considered. First, the time intervals between medical treatments are essential to take into account. Moreover, the frequent sequential patterns extracted from the EMR contain many branches; visualizing these branches is important for choosing appropriate patterns, and the issues of cost, safety, and reasoning related to these branches must also be considered.
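The pattern extraction step described above can be sketched minimally. The code below counts ordered pairs of medical orders that co-occur, in that order, across patient histories — a simplified stand-in for full sequential pattern mining. The order codes, histories, and support threshold are hypothetical illustrations, not from the talk, and a real clinical pathway service would also account for time intervals between treatments and branch structure.

```python
from collections import Counter
from itertools import combinations

def frequent_order_pairs(patient_sequences, min_support):
    """Count ordered pairs (a -> b) of medical orders that occur,
    in that order, in at least min_support patient histories."""
    counts = Counter()
    for seq in patient_sequences:
        seen = set()  # count each pair at most once per patient
        # combinations() preserves the order within each history
        for a, b in combinations(seq, 2):
            if a != b and (a, b) not in seen:
                seen.add((a, b))
                counts[(a, b)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}

# Hypothetical medical-order histories (codes are illustrative only)
histories = [
    ["X-ray", "blood_test", "antibiotic", "discharge"],
    ["blood_test", "antibiotic", "discharge"],
    ["X-ray", "blood_test", "discharge"],
]
patterns = frequent_order_pairs(histories, min_support=2)
```

Here the pair ("blood_test", "discharge") is supported by all three histories, while ("X-ray", "antibiotic") appears in only one and is filtered out; longer frequent sequences would be grown from such pairs in an Apriori-style fashion.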
Haruo Yokota received his B.E., M.E., and Dr.Eng. degrees from Tokyo Institute of Technology in 1980, 1982, and 1991, respectively. He joined Fujitsu Ltd. in 1982, and was a researcher at ICOT for the 5th Generation Computer Project from 1982 to 1986, and at Fujitsu Laboratories Ltd. from 1986 to 1992. From 1992 to 1998, he was an Associate Professor at the Japan Advanced Institute of Science and Technology (JAIST). He moved to Tokyo Institute of Technology in 1998, and has been a Full Professor in the Department of Computer Science since 2001. He is currently the Dean of the School of Computing at Tokyo Institute of Technology. His research interests include the general areas of data engineering, information storage systems, and dependable computing. He has served as a vice president of DBSJ, chair of the ACM SIGMOD Japan Chapter, a trustee board member of IPSJ, Editor-in-Chief of the Journal of Information Processing, and an associate editor of the VLDB Journal. He is currently a board member of DBSJ, a fellow of IEICE and IPSJ, a senior member of IEEE, and a member of IFIP WG 10.4, JSAI, ACM, and ACM SIGMOD.
3. Invited Keynote Title: Data Analytics as a Service for Data Scientists
Invited Keynote Abstract: Data scientists and domain experts often face challenges when dealing with large amounts of data, especially due to the scale of the data and their limited IT knowledge and infrastructure-maintenance skills. In this talk, I will present several software solutions we are developing to offer data analytics as a service to these users: Apache AsterixDB, an open-source parallel database; Cloudberry, a middleware system to support data visualization; and Texera, a system that enables browser-based text analytics using declarative workflows. These solutions can be integrated to support data ingestion, storage, indexing, querying, visualization, and analytics. As an example, I will report our experiences using these solutions to support the management of large-scale social media data (e.g., billions of tweets amounting to terabytes) as a service to researchers in disciplines such as social science and public health from several schools and universities.
Chen Li is a professor in the Department of Computer Science at UC Irvine. He received his Ph.D. degree in Computer Science from Stanford University, and his M.S. and B.S. degrees in Computer Science from Tsinghua University, China. His research interests are in the field of data management, including data-intensive computing, query processing and optimization, visualization, and text analytics. His current focus is building open-source systems for data management and analytics. He is a recipient of an NSF CAREER Award, several test-of-time publication awards, and many grants and industry gifts. He was once a part-time Visiting Research Scientist at Google, and he founded a company to commercialize university research.
4. Invited Keynote Title: Building Scalable Machine Learning Solutions for Data Curation
Invited Keynote Abstract: Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details of configuring and deploying ML techniques are the biggest hurdle. In this talk, I discuss why leveraging data semantics and domain-specific knowledge is key to delivering the optimizations necessary for truly scalable ML curation solutions. The talk focuses on two main problems: (1) entity consolidation, arguably the most difficult data curation challenge because it is notoriously complex and hard to scale; and (2) using probabilistic inference to suggest repairs for identified data errors and anomalies, using our new system called HoloClean. Both problems have challenged researchers and practitioners for decades, due to the fundamentally combinatorial explosion in the space of solutions and the lack of ground truth. There is a large body of work on these problems in both academia and industry; techniques have included human curation, rule-based systems, and automatic discovery of clusters using predefined thresholds on record similarity. Unfortunately, none of these techniques alone has been able to provide sufficient accuracy and scalability. The talk aims to provide deeper insight into the entity consolidation and data repair problems, and discusses how machine learning, human expertise, and problem semantics can collectively deliver a scalable, high-accuracy solution.
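As a baseline for the threshold-based clustering technique mentioned in the abstract (this is a generic sketch, not HoloClean's or Tamr's actual method), records can be compared pairwise with a string-similarity measure and merged transitively via union-find whenever similarity exceeds a predefined threshold. The names and threshold below are illustrative only, and the quadratic pairwise loop is exactly the kind of scaling bottleneck the talk addresses.

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def consolidate(records, threshold):
    """Group records whose pairwise similarity exceeds the threshold,
    merging transitively connected records with union-find."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # O(n^2) pairwise comparison: the classic scalability bottleneck
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, rec in enumerate(records):
        clusters.setdefault(find(i), []).append(rec)
    return list(clusters.values())

# Hypothetical dirty records referring to two real-world entities
names = ["Ihab Ilyas", "I. Ilyas", "Volker Markl", "V. Markl", "Ihab  Ilyas"]
clusters = consolidate(names, threshold=0.65)
```

With these inputs the five records collapse into two clusters, one per entity. The fixed global threshold is the weakness the abstract points at: no single value is right for all domains, which is where learned similarity and problem semantics come in.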
Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo, where his main research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. Ihab is also a co-founder of Tamr, a startup focusing on large-scale data integration and cleaning. He is a recipient of the Ontario Early Researcher Award (2009), a Cheriton Faculty Fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees, an elected SIGMOD vice chair, and an associate editor of ACM Transactions on Database Systems (TODS). He holds a PhD in computer science from Purdue University, West Lafayette.