"Knowledge on the Web: Robust and Scalable Harvesting of Entity-Relationship Facts"

Gerhard Weikum (Max-Planck Institute for Informatics)



The proliferation of knowledge-sharing communities like Wikipedia and the advances in automatic information extraction from semistructured and textual Web data have enabled the construction of very large knowledge bases. These knowledge collections contain facts about many millions of entities and relationships between them, and can be conveniently represented in the RDF data model. Prominent examples are DBpedia, YAGO, Freebase, Trueknowledge, and others.

These structured knowledge collections can be viewed as "Semantic Wikipedia Databases", and they can answer many advanced questions through SPARQL-like query languages and appropriate ranking models. In addition, the knowledge bases can boost the semantic capabilities and precision of entity-oriented Web search, and they are enablers for value-added knowledge services and applications in enterprises and online communities.
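To make the idea of entity-relationship facts and SPARQL-like queries concrete, here is a minimal sketch in Python. It is not how DBpedia, YAGO, or any SPARQL engine works internally; the facts, the entity and predicate names, and the helper functions `match` and `query` are all hypothetical and chosen only for illustration. Facts are stored as (subject, predicate, object) triples, and a query is a list of triple patterns whose shared variables (terms starting with `?`) are joined.

```python
# Hypothetical facts as (subject, predicate, object) triples.
# The names are made up for this sketch and come from no real knowledge base.
FACTS = [
    ("Alice", "bornIn", "Paris"),
    ("Bob", "bornIn", "Berlin"),
    ("Paris", "locatedIn", "France"),
    ("Berlin", "locatedIn", "Germany"),
]


def match(pattern, facts):
    """Yield variable bindings for a single triple pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    for fact in facts:
        binding = {}
        for term, value in zip(pattern, fact):
            if term.startswith("?"):
                # A variable repeated within one pattern must bind consistently.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


def query(patterns, facts):
    """Evaluate a conjunction of triple patterns by joining on shared variables."""
    results = [{}]
    for pattern in patterns:
        next_results = []
        for partial in results:
            # Replace variables that are already bound before matching.
            grounded = tuple(partial.get(term, term) for term in pattern)
            for binding in match(grounded, facts):
                next_results.append({**partial, **binding})
        results = next_results
    return results


# "Who was born in a city located in France?"
answers = query([("?person", "bornIn", "?city"),
                 ("?city", "locatedIn", "France")], FACTS)
```

Under these assumptions, `answers` holds one binding, with `?person` bound to `Alice` and `?city` bound to `Paris`. A real system would add indexing, ranking, and a proper query language on top of this basic pattern-matching idea.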

The talk discusses recent advances in the large-scale harvesting of entity-relationship facts from Web sources, and it points out the next frontiers in building comprehensive knowledge bases and enabling semantic search services. In particular, it discusses the benefits and problems in extending the prior work along the following dimensions: temporal knowledge to capture the time-context and evolution of facts, multilingual knowledge to interconnect the plurality of languages and cultures, and multimodal knowledge to also include photo and video footage of entities. All these dimensions pose grand challenges for robustness and scalability of knowledge harvesting.


Gerhard Weikum is a Scientific Director at the Max-Planck Institute for Informatics, where he leads the research group on databases and information systems. Earlier he held positions at Saarland University in Germany, ETH Zurich in Switzerland, and MCC in Austin, and he was a visiting senior researcher at Microsoft Research in Redmond. His recent working areas include peer-to-peer information systems, the integration of database-systems and information-retrieval methods, and information extraction for building and maintaining large-scale knowledge bases. Weikum has co-authored more than 150 publications, including a comprehensive textbook on transactional concurrency control and recovery. He has received several best paper awards, including the VLDB 2002 ten-year award, and he is an ACM Fellow. He has served on the editorial boards of various journals and book series, including ACM TODS, the Springer LNCS series, and the new CACM, and as program committee chair for international conferences like ICDE 2000, ACM SIGMOD 2004, and CIDR 2007. He is currently the president of the VLDB Endowment.


"Cloud Data Management @ Yahoo!"

Raghu Ramakrishnan (Yahoo! Research)



In this talk, I will present an overview of cloud computing at Yahoo!, focusing on the data management aspects. I will discuss two major systems in use at Yahoo!, the Hadoop map-reduce system and the PNUTS/Sherpa storage system, in the broader context of offline and online data management in a cloud setting.

Hadoop is a well-known open source implementation of a distributed file system with a map-reduce interface. Yahoo! has been a major contributor to this open source effort, and Hadoop is widely used internally. Given that the map-reduce paradigm is widely known, I will cover it briefly and focus on describing how Hadoop is used at Yahoo!. I will also discuss our approach to open source software, with Hadoop as an example.
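For readers less familiar with the map-reduce paradigm, the following is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to word counting. It does not use Hadoop or its API, and it says nothing about how Hadoop is deployed at Yahoo!; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence, all in one process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


counts = word_count(["the cloud stores data", "the cloud serves data"])
```

In a distributed setting, the map and reduce calls would run in parallel on many machines and the shuffle would move data across the network; this sketch only shows the data flow between the phases.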

Yahoo! has also developed a data serving storage system called Sherpa (sometimes referred to as PNUTS) to support data-backed web applications. These applications have stringent availability, performance and partition tolerance requirements that are difficult, sometimes even impossible, to meet using conventional database management systems. On the other hand, they typically are able to trade off consistency to achieve their goals. This has led to the development of specialized key-value stores, which are now used widely in virtually every large-scale web service.

Since most web services also require capabilities such as indexing, we are witnessing an evolution of data serving stores as systems builders seek to balance these trade-offs. In addition to presenting PNUTS/Sherpa, I will survey some of the other solutions that have been developed, including Amazon's S3 and SimpleDB, Microsoft's Azure, Google's Megastore, and the open source systems Cassandra and HBase. I will also discuss the challenges in building such systems as "cloud services" that provide elastic data serving capacity to developers, along with appropriately balanced consistency, availability, performance and partition tolerance.


Raghu Ramakrishnan is Chief Scientist for Audience & Cloud Computing, and a Fellow at Yahoo!, where he heads the Community Systems group. He has been Professor of Computer Sciences at the University of Wisconsin-Madison, and was founder and CTO of QUIQ, a company that pioneered question-answering communities, powering Ask Jeeves' AnswerPoint as well as customer support for companies such as Compaq.

His research is in the area of database systems, with a focus on data mining, online communities, and web-scale data management. He has developed scalable algorithms for clustering, decision-tree construction, and itemset counting, and was among the first to investigate mining of continuously evolving stream data.

His work on query optimization and deductive databases has found its way into commercial database systems, and his work on extending SQL to deal with queries over sequences has influenced the design of window functions in SQL:1999.

His paper on the BIRCH clustering algorithm received the SIGMOD 10-Year Test-of-Time award, and he has written the widely used text "Database Management Systems" (WCB / McGraw-Hill, with J. Gehrke), now in its third edition.

He is Chair of ACM SIGMOD, on the Board of Directors of ACM SIGKDD and the Board of Trustees of the VLDB Endowment, and has served as editor-in-chief of the Journal of Data Mining and Knowledge Discovery, associate editor of ACM Transactions on Database Systems, and the Database area editor of the Journal of Logic Programming.

Raghu is a Fellow of the Association for Computing Machinery (ACM), and has received several awards, including a Distinguished Alumnus Award from IIT Madras, a Packard Foundation Fellowship, an NSF Presidential Young Investigator Award, and an ACM SIGMOD Contributions Award.