Petr Knoth
Dr. Petr Knoth
Research Fellow
Knowledge Media Institute
The Open University
Walton Hall
Milton Keynes, MK7 6AA
United Kingdom

Direct: +44 (0)1908 654548
email:petr.knoth (** at **)
web: my KMI page

Research interests

I lead an R&D team working in the domains of text mining, digital libraries and open access/open science. I am the founder and the product and team leader of CORE, a service that aggregates millions of open access articles from around the world and makes them available for people to search and for machines to text-mine. I have been involved as a principal investigator in a number of European Commission funded projects on Open Science and Text Mining. Previously, I worked as a Senior Data Scientist at Mendeley on information extraction and content recommendation for research. I have a deep interest in the use of AI to improve research workflows. I have co-founded Semantometrics, which aims to go beyond bibliometrics and altmetrics to produce new research evaluation methods that make use of publication full texts in research assessment.

  • Natural Language Processing/Text and data mining
  • Open Access, Open Science, Scholarly communication
  • Information Retrieval, Information Extraction, Recommendation systems
  • Scientometrics


Publications

Pontika, N., Knoth, P., Cancellieri, M. and Pearce, S. (2016) Developing Infrastructure to Support Closer Collaboration of Aggregators with Open Repositories, LIBER Quarterly, 25, 4

Herrmannova, D. and Knoth, P. (2016) Simple Yet Effective Methods for Large-Scale Scholarly Publication Ranking: KMi and Mendeley (team BletchleyPark) at WSDM Cup 2016, Workshop: WSDM Cup 2016 - Entity Ranking Challenge Workshop at International Conference on Web Search and Data Mining (WSDM), San Francisco, CA, USA

Pontika, N., Cancellieri, M., Knoth, P. and Pearce, S. (2015) Fostering Open Science to Research using a Taxonomy and eLearning Portal, i-Know - 15th International Conference on Knowledge Technologies and Data Driven Business, 21 - 22 October, Graz, Austria

Herrmannova, D. and Knoth, P. (2015) Semantometrics: Fulltext-based Measures for Analysing Research Collaboration, Poster at ISSI 2015, Istanbul, Turkey

Herrmannova, D. and Knoth, P. (2015) Semantometrics in Coauthorship Networks: Fulltext-based Approach for Analysing Patterns of Research Collaboration, D-Lib Magazine, 21, 11/12, Corporation for National Research Initiatives

Knoth, P. and Herrmannova, D. (2014) Towards Semantometrics: A New Semantic Similarity Based Measure for Assessing a Research Publication's Contribution, D-Lib Magazine, 20, 11/12, Corporation for National Research Initiatives

Kats, P., Knoth, P., Mamakis, G., Mielnicki, M., Muhr, M. and Werla, M. (2014) Design of Europeana Cloud Technical Infrastructure, Poster at Digital Libraries (DL 2014), London, United Kingdom

Knoth, P., Anastasiou, L. and Pearce, S. (2014) My repository is being aggregated: a blessing or a curse?, Open Repositories 2014 (OR 2014), Helsinki, Finland

Knoth, P. (2013) CORE: Aggregation Use Cases for Open Access, Demo at Joint Conference on Digital Libraries (JCDL 2013), Indianapolis, Indiana, United States 

Knoth, P. (2013) From Open Access Metadata to Open Access Content: Two Principles for Increased Visibility of Open Access Content, Open Repositories 2013 (OR 2013), Charlottetown, Prince Edward Island, Canada 

Knoth, P. and Herrmannova, D. (2013) Simple Yet Effective Methods for Cross-Lingual Link Discovery (CLLD) - KMI @ NTCIR-10 CrossLink-2, NTCIR-10 Evaluation of Information Access Technologies, Tokyo, Japan 

Knoth, P. and Zdrahal, Z. (2012) CORE: Three Access Levels to Underpin Open Access, D-Lib Magazine, 18, 11/12, Corporation for National Research Initiatives 

Knoth, P., Zdrahal, Z. and Juffinger, A. (2012) Special Issue on Mining Scientific Publications, D-Lib Magazine, 18, 7/8, Corporation for National Research Initiatives  

Herrmannova, D. and Knoth, P. (2012) Visual Search for Supporting Content Exploration in Large Document Collections, D-Lib Magazine, 18, 7/8, Corporation for National Research Initiatives  

Knoth, P., Zilka, L. and Zdrahal, Z. (2011) KMI, The Open University at NTCIR-9 CrossLink: Cross-Lingual Link Discovery in Wikipedia Using Explicit Semantic Analysis, The 9th NTCIR Workshop Meeting Evaluation of Information Access Technologies: Information Retrieval, Question Answering, and Cross-Lingual Information Access, Tokyo, Japan

Knoth, P., Zilka, L. and Zdrahal, Z. (2011) Using Explicit Semantic Analysis for Cross-Lingual Link Discovery, 5th International Workshop on Cross Lingual Information Access: Computational Linguistics and the Information Need of Multilingual Societies (CLIA) at The 5th International Joint Conference on Natural Language Processing (IJC-NLP 2011), Chiang Mai, Thailand

Maleshkova, M., Zilka, L., Knoth, P. and Pedrinaci, C. (2011) Cross-Lingual Web API Classification and Annotation, 2nd Workshop on the Multilingual Semantic Web at The 10th International Semantic Web Conference, Bonn, Germany

Knoth, P. and Zdrahal, Z. (2011) Mining Cross-document Relationships from Text, The First International Conference on Advances in Information Mining and Management (IMMM 2011), Barcelona, Spain

Knoth, P., Robotka, V. and Zdrahal, Z. (2011) Connecting Repositories in the Open Access Domain using Text Mining and Semantic Data, International Conference on Theory and Practice of Digital Libraries 2011 (TPDL 2011), Berlin, Germany (Best Poster/Demo Award)

Knoth, P. and Zdrahal, Z. (2011) CORE: Connecting Repositories in the Open Access Domain, CERN workshop on Innovations in Scholarly Communication (OAI7), Geneva, Switzerland

Knoth, P., Novotny, J., and Zdrahal, Z. (2010) Automatic generation of inter-passage links based on semantic similarity, In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Beijing, China

Fernandez, M., Sabou, M., Knoth, P., Motta, E. (2010) Predicting the quality of semantic relations by applying Machine Learning classifiers, In Proceedings of the 17th International Conference on Knowledge Engineering and Knowledge Management, Poster session. Lisbon, Portugal (Best Poster Award)

Knoth, P., Collins, T., Sklavounou, E., and Zdrahal, Z. (2010) Facilitating cross-language retrieval and machine translation by multilingual domain ontologies, In Workshop on Supporting eLearning with Language Resources and Semantic Data at LREC 2010, Valletta, Malta

Knoth, P., Collins, T., Sklavounou, E., and Zdrahal, Z. (2010) EUROGENE: Multilingual Retrieval and Machine Translation applied to Human Genetics, In 32nd European Conference on IR Research (ECIR 2010), Demo session, Milton Keynes, United Kingdom

Knoth, P., Sova, J., and Zdrahal, Z. (2010) Eurogene - The First Pan-European Learning Service in the Field of Genetics, Znalosti (Knowledge) 2010, Jindrichuv Hradec, Czech Republic

Knoth, P. (2009) Semantic Annotation of Multilingual Learning Objects Based on a Domain Ontology, In Doctoral consortium at EC-TEL 2009, Nice, France

Zdrahal, Z., Knoth, P., Collins, T., and Mulholland, P. (2009) Reasoning across Multilingual Learning Resources in Human Genetics, In International Conference on Interactive Computer Aided Learning (ICL 2009), Villach, Austria

Knoth, P., Schmidt, M., Smrz, P., and Zdrahal, Z. (2009) Towards a Framework for Comparing Automatic Term Recognition Methods, In Znalosti (Knowledge) 2009, Brno, Czech Republic

Schmidt, M., Knoth, P., and Smrz, P. (2009) Information Extraction in the KiWi Project, In Znalosti (Knowledge) 2009, Brno, Czech Republic

Opsomer, R., Knoth, P., Polen, F., Trapman, J., and Wiering, M. (2008) Categorizing Children: Automated Text Classification of CHILDES files, BNAIC 2008, Enschede, The Netherlands

Knoth, P. (2008) Extraction of Semantic Relations from Texts, In Student EEICT 2008, Brno, Czech Republic


Projects

Text and Data Mining (TDM) of scholarly literature has the potential to revolutionise the way we do research. It can improve the ways in which we discover, access, read, disseminate and evaluate research. However, current TDM applications are hindered by a number of barriers to machine access to scientific literature as well as the lack of scalable standardised interfaces for text and data mining of research papers. The OpenMinTeD project aims at providing an open and sustainable TDM infrastructure in order to make primary content accessible through standardised interfaces, to process, analyse and annotate scientific text by well-documented services and workflows that better facilitate identifying and extracting entities, patterns and relationships.

FOSTER is an international European Commission funded project of 13 partners. The project aims to raise the awareness, practical knowledge and skills of the different European Research Area stakeholders related to open access and open research data issues and promote the open access culture. The Open University is responsible in this project for the design, implementation and delivery of an eLearning platform – the FOSTER portal – populated with open access training content.

Europeana Cloud aims to establish a cloud-based system for Europeana and its aggregators (including the CORE system I designed at the OU). Europeana Cloud will provide new content, new metadata, a new linked storage system, new tools and services for researchers and a new platform - Europeana Research.

The goal of DiggiCORE is to analyse a vast set of research publications from the Open Access domain using natural language processing and social network analysis methods in order to identify patterns in the behaviour of research communities, to recognise trends in research disciplines, and to gain new insights into the citation behaviours of researchers.

The ServiceCORE project aims to develop a new nation-wide aggregation service that will improve the discovery of research publications stored across British Open Access repositories. The ServiceCORE project will extend the solution provided by the CORE system, developed in the first stage of the Resource Discovery programme.

CORE - The COnnecting REpositories (CORE) project aims to facilitate access to, and navigation across, scientific papers stored in British Open Access repositories, using Natural Language Processing and Linked Data.

RETAIN - The goal of the RETAIN project is to extend the existing Business Intelligence (BI) functionality that is currently in use at the Open University. The focus will be on using BI to improve student retention. Several initiatives have been instigated with a view to finding ways to improve retention figures, to identify why the problem exists and the different approaches for dealing with these issues.

DECIPHER - DECIPHER is a three-year European Commission supported project which aims to support the discovery and exploration of cultural heritage through story and narrative. To do this we are developing new solutions to the whole range of narrative construction, knowledge visualisation and display problems for museums. The outcome will change the way people access digital heritage by combining rich visualisations, event-based metadata and causal reasoning models.

The TECH-IT-EASY project develops an information system, based on analytical and knowledge-based tools, able to support electromechanical European SMEs in structuring and systematising the internal product innovation process based on the combined application of QFD (Quality Function Deployment) and technology potentials of TRIZ (Theory of Inventive Problem Solving).

EuroGene is a European Commission supported e-ContentPlus project concerned with providing high quality semantically enriched educational content in genetics. The primary role of KMI within EuroGene is to apply tools and methods for automatic content annotation, cross-language retrieval and the navigation through the available content.

Knowledge in a WiKi - The main objectives of KIWI are to investigate how knowledge management in highly dynamic environments can be supported using Semantic Wiki technologies, and how Semantic Wikis can be improved to satisfy the requirements of knowledge management. For this purpose, KIWI will implement an advanced knowledge management system based on the Semantic Wiki IkeWiki and extend it by improved, rule-based reasoning support, information extraction, personalisation, and advanced visualisations and editors; and verify the system on two use cases in the areas of project knowledge management and software knowledge management, with flexible workflow models and specific support for the respective application areas.


CORE tools - CORE is a system that allows accessing, navigating and downloading content stored across a number of Open Access repositories. CORE provides three tools: CORE Portal, CORE Mobile and CORE Plugin. To find out more, visit the CORE project website. CORE Portal is accessible online, and CORE Mobile is freely available from the Android Market.

Eurogene portal - Eurogene is an e-learning system in the domain of genetics that provides free multimedia learning resources in nine languages for statistical, medical and molecular genetics and delivers them to students and professionals. The Eurogene content includes presentations, reviewed research articles, images, videos and learning packages submitted by world-leading geneticists.

Jajatr automatic term recognition framework


Information Extraction from Biomedical Texts - master's thesis

Annotating Knowledge Resources to Support Learning - probation report