Towards a research corpus

In my thesis proposal, I outline an approach to the federation of structured conversations. On the surface, federation means combining representations that are about the same topic. The term for this, borrowed from topic maps, is merging.

The trivial example is seen in these two conversation assertions in answer to the same question:

  • co2 causes climate change
  • climate change is caused by carbon dioxide

On inspection, humans recognize those two assertions as saying the same thing. Not so for most computer programs; my task is to write a program that notices the sameness of the two assertions. One approach is to transform each assertion into some canonical form and compare the results. Many tricks (a term exploited by the climategate crowd) are available. One is to notice that "causes" and "is caused by" both express the same notion of causality, a root relation. A transform based on that root relation yields these two triples:

  • {co2, cause, climate change}
  • {carbon dioxide, cause, climate change}

The next trick is to notice that co2 and carbon dioxide are both names for the same topic. That reduces both assertions to a single triple; both say the same thing, and we can merge the two statements into one.
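The two tricks above can be sketched in a few lines of code. This is only an illustration: the root-relation table and synonym table here are tiny stand-ins for whatever vocabulary the real system would use.

```python
# Map surface relation phrases to a (root relation, swap-arguments) pair.
# "is caused by" is passive voice, so its subject and object are swapped.
ROOT_RELATIONS = {
    "causes": ("cause", False),
    "is caused by": ("cause", True),
}

# Map alternate names to one canonical topic label.
SYNONYMS = {
    "co2": "carbon dioxide",
}

def canonicalize(assertion):
    """Reduce an assertion to a (subject, relation, object) triple."""
    for phrase, (root, swap) in ROOT_RELATIONS.items():
        marker = f" {phrase} "
        if marker in assertion:
            left, right = assertion.split(marker, 1)
            subj, obj = (right, left) if swap else (left, right)
            # Collapse alternate names onto the canonical topic label.
            subj = SYNONYMS.get(subj, subj)
            obj = SYNONYMS.get(obj, obj)
            return (subj, root, obj)
    return None

a = canonicalize("co2 causes climate change")
b = canonicalize("climate change is caused by carbon dioxide")
print(a == b)  # both reduce to ('carbon dioxide', 'cause', 'climate change')
```

Once both assertions reduce to the same triple, merging is just equality: the two statements collapse into one.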

To do that at a large scale, we need a corpus of conversations for training and testing. Our mission was thus to harvest numerous such conversations from the web. We could use search engines to find blog entries, Wikipedia articles, op-eds, and so forth; we will eventually do lots of that. But good fortune delivered 126 climate change arguments to our laptop, and the corpus described in the last post appeared. Getting that corpus into shape requires further processing.

That further processing happens through an online web service, AlchemyAPI, one of several we are testing. One signs up for an account, downloads some software utilities, writes a program that uses those utilities, and begins to harvest each of the pages linked in the 126-argument issue map from our last post. The utilities fetch each page and return several XML files: one contains clean text ready for further processing, one the named entities discovered in that text, and others the key terms and concepts. We are well on our way to a corpus sufficient to conduct this research.
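For the curious, the shape of one harvesting pass looks roughly like this. The endpoint names follow AlchemyAPI's URL-oriented REST calls as I understand them (URLGetText, URLGetRankedNamedEntities, and so on); the API key and page URL are placeholders, and your setup may differ.

```python
# A sketch of harvesting one page of the issue map through AlchemyAPI's
# REST interface. Each call returns an XML document.
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://access.alchemyapi.com/calls/url/"
ENDPOINTS = [
    "URLGetText",                 # clean text, ready for further processing
    "URLGetRankedNamedEntities",  # named entities discovered in the text
    "URLGetRankedKeywords",       # key terms
    "URLGetRankedConcepts",       # concepts
]

def build_call(endpoint, api_key, page_url):
    """Construct the REST URL for one AlchemyAPI call."""
    return BASE + endpoint + "?" + urlencode({"apikey": api_key, "url": page_url})

def harvest(api_key, page_url):
    """Fetch all four XML documents for one linked page."""
    return {ep: urlopen(build_call(ep, api_key, page_url)).read()
            for ep in ENDPOINTS}

# No network traffic here; just show the call we would make.
print(build_call("URLGetText", "YOUR_KEY", "http://example.org/argument"))
```

Running harvest() over each of the 126 linked pages yields the raw XML the rest of the pipeline consumes.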