Apache Innovation Bolsters IBM’s ‘Smartest Machine on Earth’ in First-ever Man vs. Machine Competition on Jeopardy! Quiz Show

Apache UIMA and Apache Hadoop Advance Data Intelligence and Semantics Capabilities of Watson Supercomputer

FOREST HILL, Md., Feb. 14, 2011

Performing 80 trillion operations per second (80 teraflops), Watson will access 200 million pages of content and apply 6 million logic rules to "understand" the nuances, meanings, and patterns in spoken human language, and compete in the trivia game show Jeopardy!. Contestants are presented with clues in the form of answers, and must phrase their responses as questions within 5 seconds.

Hundreds of Apache UIMA Annotators and thousands of algorithms help Watson, which runs disconnected from the Internet, access vast databases to simultaneously comprehend clues and formulate answers. Watson then analyzes 500 gigabytes of preprocessed information to match potential meanings of the question with a potential answer. Two Apache projects help Watson do this:

  • Apache UIMA: standards-based frameworks, infrastructure, and components that facilitate the analysis and annotation of an array of unstructured content (such as text, audio, and video). Watson uses Apache UIMA for real-time content analytics and natural language processing: to comprehend clues, find possible answers, gather supporting evidence, score each answer, compute its confidence in each answer, and improve contextual understanding (machine learning), all in under 3 seconds.
  • Apache Hadoop: a software framework that enables data-intensive distributed applications to work with thousands of nodes and petabytes of data. A foundation of cloud computing, Apache Hadoop enables Watson to access, sort, and process data in a massively parallel system (a cluster of 90+ servers with 2,880 processor cores, 16 terabytes of RAM, and 4 terabytes of disk storage).
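
As a rough illustration of the find-candidates, score, and compute-confidence flow described in the list above, the following Python sketch ranks candidate answers by a confidence value derived from evidence scores. It is not Watson's code and does not use the Apache UIMA API; the names `Candidate`, `confidence`, and `best_answer`, the averaging rule, and the 0.5 threshold are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    """A hypothetical candidate answer with per-evidence scores in [0, 1]."""
    text: str
    evidence_scores: List[float] = field(default_factory=list)


def confidence(candidate: Candidate) -> float:
    """Assumed rule: confidence is the mean of the evidence scores (0.0 if none)."""
    if not candidate.evidence_scores:
        return 0.0
    return sum(candidate.evidence_scores) / len(candidate.evidence_scores)


def best_answer(candidates: List[Candidate],
                threshold: float = 0.5) -> Optional[Candidate]:
    """Return the highest-confidence candidate, or None if it misses the threshold."""
    if not candidates:
        return None
    top = max(candidates, key=confidence)
    return top if confidence(top) >= threshold else None
```

Returning `None` below the threshold mirrors the idea that a contestant should only answer when sufficiently confident; a real system would likely combine evidence with a learned model rather than a simple mean.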

The Watson system uses UIMA as its principal infrastructure for component interoperability and makes extensive use of the UIMA-AS scale-out capabilities that can exploit modern, highly parallel hardware architectures. UIMA manages all workflow and communication between processes, which are spread across the cluster. Apache Hadoop handles the preprocessing of Watson’s enormous information sources by deploying UIMA pipelines as Hadoop mappers, which run the UIMA analytics.
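
The idea of running an analysis pipeline inside a mapper can be sketched in plain Python. This is a conceptual stand-in only: it uses neither the Hadoop nor the UIMA API, and the toy annotators, `make_mapper`, and `run_map` are hypothetical names introduced for this example. Each "annotator" is simply a function from document text to a list of annotations, and the mapper emits one key-value pair per annotation.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical annotator type: document text in, list of annotation strings out.
Annotator = Callable[[str], List[str]]
Record = Tuple[str, Tuple[str, str]]  # (doc_id, (annotator_name, value))


def capitalized_words(text: str) -> List[str]:
    """Toy annotator: capitalized words as a crude stand-in for entities."""
    return re.findall(r"\b[A-Z][a-z]+\b", text)


def four_digit_years(text: str) -> List[str]:
    """Toy annotator: four-digit numbers as a crude stand-in for years."""
    return re.findall(r"\b\d{4}\b", text)


def make_mapper(pipeline: Dict[str, Annotator]) -> Callable[[str, str], Iterator[Record]]:
    """Wrap a pipeline of annotators as a map function over (doc_id, text)."""
    def mapper(doc_id: str, text: str) -> Iterator[Record]:
        for name, annotate in pipeline.items():
            for value in annotate(text):
                yield (doc_id, (name, value))
    return mapper


def run_map(mapper: Callable[[str, str], Iterator[Record]],
            documents: Iterable[Tuple[str, str]]) -> List[Record]:
    """Apply the mapper to every document sequentially (a cluster would parallelize this)."""
    results: List[Record] = []
    for doc_id, text in documents:
        results.extend(mapper(doc_id, text))
    return results
```

Because each document is processed independently, the map step can be spread across many nodes without coordination, which is what makes this style of preprocessing suitable for a large cluster.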

