Automating the Identification of Web APIs

Although this is somewhat old news, I would still like to write a few lines about this work, since it can still bear further fruit. We have been working for some time now on better supporting the use of Web APIs. So far, we have analysed the current state of affairs and provided a set of technologies, including conceptual models and tools, for supporting the life-cycle of Web APIs and of the applications based on them. In the past we worked on SWEET, a tool and conceptual model for creating semantic annotations for Web APIs; on iServe, which offers advanced discovery support for APIs annotated in this manner; and on OmniVoke, a single generic engine for invoking any of those APIs.

Although the solutions we devised considerably improve the level of automation one can benefit from while building applications based on Web APIs, they are all predicated on the existence of these semantic annotations, and not many are available. Indeed, better user assistance and good incentives play an increasingly important role, and we are devoting effort to both. In this post, however, I’d like to talk about another path we have been working on lately, one that takes Web APIs as they currently are as its starting point. We have presented this work both at the AAAI Symposium “Intelligent Web Services Meet Social Computing” and at ISWC.

As a first step in this endeavour we have revisited Web API discovery. Nowadays, the solutions in this area are rather limited. The best option is ProgrammableWeb, which has become the de facto registry for Web APIs. ProgrammableWeb’s data depends on input provided by users, which the editorial team eventually cleans up and enriches through a titanic manual effort. So far, so good, but how far can this scale? I believe it cannot scale very far; in fact, we can already see that even this effort is not enough, as search results are not always accurate. ProgrammableWeb is a great resource, and the team behind it is very much aware of this limitation and is trying to improve the situation.

Rather than relying on manual input, as part of our research on iServe we have been working on automating the identification of Web APIs for their eventual automated processing. Our approach approximates the problem of locating Web APIs as that of identifying Web pages that provide technical documentation about Web APIs. After all, in order to be usable, every Web API out there provides technical documentation that developers can read and interpret while creating a client or mashup. To solve this problem we have used machine learning algorithms to automatically classify a given Web page as either a normal page or one that provides technical documentation. Our experiments are quite promising. Notably, by analysing the text of the Web pages we have achieved an accuracy slightly over 80% using Support Vector Machines. We have further improved these results to 82% using our own extension of Latent Dirichlet Allocation.
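
To make the classification step a bit more concrete, here is a minimal sketch of the general idea: turn the text of a page into bag-of-words features and train a linear Support Vector Machine to separate documentation pages from everything else. This is not our actual pipeline. The toy pages, the stopword list, the function names and the hyperparameters are all assumptions made up for illustration, and the trainer is a deliberately simple bias-free subgradient method rather than a proper SVM library.

```python
import re
from collections import Counter

# Hypothetical stopword list, kept tiny for the sake of the example.
STOPWORDS = {"a", "an", "and", "the", "to", "with", "your",
             "is", "for", "on", "in", "of", "as"}


def featurize(text):
    """Turn page text into bag-of-words term frequencies normalised by length."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    total = sum(counts.values()) or 1
    return {word: n / total for word, n in counts.items()}


def score(weights, features):
    """Linear score; a positive value means 'looks like API documentation'."""
    return sum(weights.get(word, 0.0) * value for word, value in features.items())


def train_linear_svm(pages, labels, epochs=200, eta=0.1, lam=0.01):
    """Fit a bias-free linear SVM by subgradient descent on the L2-regularised
    hinge loss. Labels are +1 (documentation) or -1 (any other page)."""
    weights = {}
    examples = [(featurize(page), label) for page, label in zip(pages, labels)]
    for _ in range(epochs):
        for x, y in examples:
            violated = y * score(weights, x) < 1   # inside the margin or misclassified
            for word in weights:                   # regularisation shrinks every weight
                weights[word] *= 1 - eta * lam
            if violated:                           # hinge-loss subgradient step
                for word, value in x.items():
                    weights[word] = weights.get(word, 0.0) + eta * y * value
    return weights


def classify(weights, text):
    """Label a page as 'documentation' or 'other' from the sign of its score."""
    return "documentation" if score(weights, featurize(text)) > 0 else "other"


# Made-up training pages; a real experiment needs a large labelled crawl.
DOC_PAGES = [
    "API reference: send a GET request to the endpoint with your API key and parse the JSON response",
    "Authentication uses OAuth tokens; each endpoint returns a JSON response and an HTTP status code",
    "REST API documentation: request parameters, response format, rate limits and error codes",
]
OTHER_PAGES = [
    "Our family recipe for tomato soup needs fresh basil, garlic and a slow simmer",
    "The football team won the match on Saturday after a late goal in extra time",
    "Book your summer holiday now: sunny beaches, cheap flights and great hotels",
]

weights = train_linear_svm(DOC_PAGES + OTHER_PAGES, [1] * 3 + [-1] * 3)

print(classify(weights, "This endpoint accepts a POST request and returns a JSON response"))
print(classify(weights, "Fresh tomato and basil salad recipe for a summer picnic"))
```

With these six toy pages the first call prints "documentation" and the second prints "other". That says nothing about real-world accuracy, of course; the vocabulary of the two classes is cleanly separated here, which is exactly what real Web pages do not give you. For anything serious one would reach for an established implementation such as LIBSVM or scikit-learn and a properly labelled corpus.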

In the context of the COMPOSE project we have teamed up with the Barcelona Supercomputing Center to run these algorithms at scale over a Web crawl, CommonCrawl. We should have the first results really soon… Let’s see what we get; I’ll certainly report back.