Hypothesis Generation

The overarching goal of our hypothesis generation research is to develop artificially intelligent or decision-support systems that can take a piece of data (e.g., a symptom) and use that datum as the basis for inferring its underlying cause. We view this process as iterative: the agent (i.e., a person or machine) first generates a set of possible hypotheses and then successively whittles that set down to a single hypothesized cause. We assume that the hypotheses themselves inform the agent about which new data would be most diagnostic for distinguishing among the hypotheses under consideration. Thus, we postulate a set of information search algorithms that operate over the set of generated hypotheses, as sketched below.
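To make the generate-and-test loop concrete, here is a minimal Python sketch of one way such an information search could work: pick the test with the highest expected reduction in uncertainty over the currently generated hypotheses, then update by Bayes' rule. The disease names, probability tables, and the entropy-based diagnosticity criterion are all illustrative assumptions on our part, not details of any published model.

```python
import math

# Illustrative hypothesis set and likelihood tables; every number and name
# here is invented for the sketch, not drawn from any real model or data.
priors = {"flu": 0.6, "cold": 0.3, "pneumonia": 0.1}
likelihoods = {  # P(test comes back positive | hypothesis)
    "chest_xray":  {"flu": 0.10, "cold": 0.05, "pneumonia": 0.90},
    "throat_swab": {"flu": 0.70, "cold": 0.60, "pneumonia": 0.30},
}

def posterior(priors, like, positive):
    """Bayes update of the hypothesis set given one binary test result."""
    un = {h: p * (like[h] if positive else 1.0 - like[h])
          for h, p in priors.items()}
    z = sum(un.values())
    return {h: v / z for h, v in un.items()}

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_info_gain(priors, like):
    """Expected entropy reduction from running a test: one way to
    quantify that test's diagnosticity for the current hypotheses."""
    p_pos = sum(priors[h] * like[h] for h in priors)
    exp_post = (p_pos * entropy(posterior(priors, like, True)) +
                (1.0 - p_pos) * entropy(posterior(priors, like, False)))
    return entropy(priors) - exp_post

# Information search: query the test expected to best discriminate the
# currently generated hypotheses, then update and repeat.
best_test = max(likelihoods, key=lambda t: expected_info_gain(priors, likelihoods[t]))
print(best_test)  # -> 'chest_xray' for these particular numbers
print(posterior(priors, likelihoods[best_test], positive=True))
```

With these numbers the chest X-ray is chosen even though a positive throat swab is more likely overall, because the X-ray result does far more to separate pneumonia from the other candidates.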

Our current model of hypothesis generation is called HyGene. HyGene is best described as a Bayesian inference engine that operates over a semantic space. In this way, HyGene can capitalize on relative frequency information (i.e., generate the hypotheses with the highest frequency of occurrence) while also exploiting the information inherent in semantics.
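As one hedged illustration of the idea, the sketch below scores hypotheses by the product of their past frequency and their semantic match to a probe, loosely in the spirit of MINERVA 2-style activation. The feature vectors, counts, cubing rule, and one-prototype-per-hypothesis simplification are our assumptions for the sketch; they are not the published HyGene equations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy semantic representations: one prototype vector per hypothesis,
# weighted by how often it occurred (a simplification of storing one
# episodic trace per occurrence). All values are illustrative.
hypothesis_vecs = {h: rng.choice([-1, 0, 1], 20)
                   for h in ("flu", "cold", "pneumonia")}
trace_counts = {"flu": 60, "cold": 30, "pneumonia": 10}

def similarity(probe, trace):
    """Dot product normalized by the number of informative features."""
    nonzero = (probe != 0) | (trace != 0)
    return (probe @ trace) / max(nonzero.sum(), 1)

def activation(probe, trace):
    # Cubing preserves sign while sharply down-weighting weak matches,
    # in the spirit of MINERVA 2-style activation.
    return similarity(probe, trace) ** 3

def generation_probs(probe):
    """P(hypothesis) proportional to frequency x semantic activation."""
    strength = {h: trace_counts[h] * max(activation(probe, v), 0.0)
                for h, v in hypothesis_vecs.items()}
    z = sum(strength.values()) or 1.0
    return {h: s / z for h, s in strength.items()}

# A probe resembling 'pneumonia' with its first five features perturbed.
probe = hypothesis_vecs["pneumonia"].copy()
probe[:5] = rng.choice([-1, 0, 1], 5)
print(generation_probs(probe))
```

The point of the sketch is the interaction of the two terms: a rare cause can still dominate generation when its semantic match to the probe is strong, and a common cause can dominate when matches are weak across the board.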

The fact that HyGene operates on a semantic space allows the model to deal with the so-called reference class problem, wherein the probability of any particular event depends on the reference class to which it is ascribed. Given that most objects belong to multiple reference classes, it is difficult, if not impossible, to compute a single well-defined probability for an event. For example, what does it mean to estimate the likelihood that a 67-year-old white male named Tom has prostate cancer? Is this probability defined across the entire population? Across people similar in age to Tom? Across white males? White males in the US? White males in the US with no other health problems? Obviously, the perceived probability that Tom has cancer depends on how we define Tom's reference class.
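A toy computation makes the point concrete. Every rate below is invented purely for illustration; the only claim being made is that the estimate moves as the reference class changes.

```python
# Hypothetical rates, chosen only to show the reference class dependence.
rates = {
    "entire population":              0.006,
    "all males":                      0.012,
    "white males":                    0.013,
    "males aged 65-70":               0.060,
    "healthy white US males, 65-70":  0.045,
}
for ref_class, p in rates.items():
    print(f"P(cancer | Tom is in '{ref_class}') = {p:.3f}")
```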

John Venn proposed that reference classes correspond to 'natural kinds' or sets. We borrow from Venn, but further explicate the definition of 'natural kinds' in terms of semantic similarity. The use of semantic similarity in our model allows us to isolate relevant 'clusters' of hypotheses in semantic memory over which to compute probability.
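One way to cash this out computationally is sketched below: rather than being chosen by hand, the reference class is induced by thresholding semantic similarity to the probe, and the probability is then computed over that cluster. The cosine measure, the threshold value, and the synthetic memory are our illustrative assumptions rather than details of HyGene.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy semantic memory: stored exemplars with feature vectors and an
# outcome flag (True = had the disease). All values are synthetic.
n, dim = 500, 16
exemplars = rng.choice([-1, 0, 1], size=(n, dim))
outcomes = rng.random(n) < 0.1

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def similarity_weighted_probability(probe, exemplars, outcomes, threshold=0.25):
    """Estimate P(outcome) over the semantic cluster around the probe.

    The reference class is not stipulated in advance; it is the set of
    stored exemplars whose similarity to the probe clears the threshold.
    """
    sims = np.array([cosine(probe, e) for e in exemplars])
    cluster = sims > threshold           # the induced reference class
    if not cluster.any():
        return float(outcomes.mean())    # fall back to the base rate
    w = sims[cluster]                    # weight by closeness to the probe
    return float((w * outcomes[cluster]).sum() / w.sum())

probe = rng.choice([-1, 0, 1], dim)      # e.g., a description of Tom
print(similarity_weighted_probability(probe, exemplars, outcomes))
```

On this view, asking "which reference class does Tom belong to?" is replaced by asking "which stored experiences are semantically close to Tom?", and the probability is computed over exactly that cluster.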