Anita Khadka

Anita Khadka is a PhD candidate at the Knowledge Media Institute at The Open University. Her research focuses on finding semantic relationships between research publications in large digital-library corpora. She currently works on recommender systems, specifically recommender systems for academic literature.

Before starting her PhD, she worked for a few years as a software engineer at a financial institution. She holds a Master's degree in Intelligent Systems and Robotics from the University of Essex.

Twitter

C2D Dataset

We have released the first version of C2D, a citation-context based dataset created during an experiment for work that will be published as a short paper at RecSys 2018.

The dataset contains 53 million unique records of citation information (explained below), constructed from 2 million full-text open-access research publications obtained from CORE.

We extracted citation information from each publication: the cited document's title, author(s), publication date, and citation context. The assumptions behind the citation-context extraction are described below.

First, we located each position where a citation is mentioned and extracted its citation context, that is, the text surrounding the mention of the cited document. For our purposes, the citation context consists of three sentences: the sentence in which the reference is cited, the preceding sentence, and the following sentence. When the citing sentence is at the start or end of a paragraph, the preceding or following sentence, respectively, is not extracted.
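The three-sentence window described above can be sketched as follows. This is a minimal illustration, not the pipeline used to build C2D: the regex-based sentence splitter and the names `split_sentences` and `citation_context` are assumptions introduced here, and the original extraction may segment sentences differently.

```python
import re
from typing import Dict, List, Optional


def split_sentences(paragraph: str) -> List[str]:
    # Naive splitter (assumption): break on whitespace that follows ., ! or ?.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def citation_context(paragraph: str, sentence_index: int) -> Dict[str, Optional[str]]:
    """Return the citing sentence plus its neighbours within one paragraph.

    The preceding (or following) sentence is None when the citing sentence
    is the first (or last) sentence of the paragraph.
    """
    sentences = split_sentences(paragraph)
    before = sentences[sentence_index - 1] if sentence_index > 0 else None
    after = (
        sentences[sentence_index + 1]
        if sentence_index + 1 < len(sentences)
        else None
    )
    return {"before": before, "citing": sentences[sentence_index], "after": after}


paragraph = (
    "Recommender systems are widely used. "
    "Prior work relies on citations [1]. "
    "We extend this idea."
)
context = citation_context(paragraph, 1)
```

Here `context["citing"]` is the sentence containing `[1]`, with the first and third sentences as its neighbours. Calling `citation_context(paragraph, 0)` leaves `before` as `None`, which matches the paragraph-boundary rule.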

The dataset therefore has the following attributes:

  • ReferenceID - Unique identifier of the cited reference within a citing document.
  • SourceID - Unique identifier of the citing document.
  • ChapterNumber - Chapter number in the citing document where the reference ReferenceID is mentioned.
  • ParagraphNumber - Paragraph number in the citing document where the reference ReferenceID is mentioned.
  • SentenceNumber - Sentence number in the citing document where the reference ReferenceID is mentioned.
  • Title - Title of the reference ReferenceID.
  • PublishedDate - Publication date of the reference ReferenceID.
  • Authors - Author(s) of the reference ReferenceID.
  • TextBeforeRefMention - Sentence immediately before the sentence in which the reference ReferenceID is cited.
  • TextWhereRefMention - Sentence in which the reference ReferenceID is cited.
  • TextAfterRefMention - Sentence immediately after the sentence in which the reference ReferenceID is cited.
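For readers who want to work with these attributes in code, one record could be modelled roughly as below. This is a sketch under assumptions: the field types, the snake_case names, the `C2DRecord` class, and the example values are all illustrative, since the page does not specify the on-disk format or the data types of the release.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class C2DRecord:
    # Field names mirror the attribute list; the types are assumptions.
    reference_id: str
    source_id: str
    chapter_number: int
    paragraph_number: int
    sentence_number: int
    title: str
    published_date: Optional[str]
    authors: List[str]
    text_before_ref_mention: Optional[str]  # None at the start of a paragraph
    text_where_ref_mention: str
    text_after_ref_mention: Optional[str]  # None at the end of a paragraph

    def context(self) -> str:
        """Join the available context sentences into one string."""
        parts = [
            self.text_before_ref_mention,
            self.text_where_ref_mention,
            self.text_after_ref_mention,
        ]
        return " ".join(p for p in parts if p)


# Hypothetical example values, not taken from the dataset.
record = C2DRecord(
    reference_id="ref-1",
    source_id="doc-1",
    chapter_number=1,
    paragraph_number=2,
    sentence_number=1,
    title="An Example Cited Paper",
    published_date=None,
    authors=["A. Author"],
    text_before_ref_mention=None,
    text_where_ref_mention="Prior work relies on citations [1].",
    text_after_ref_mention="We extend this idea.",
)
full_context = record.context()
```

Keeping the three context sentences as separate fields, as the dataset does, lets users decide whether to use only the citing sentence or the full window.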

Finally, the dataset can be downloaded from the link.

Note:

  • The uncompressed dataset is ~40 GB; the compressed download is ~6.7 GB.
  • Because different users have different requirements, we have released the raw version of the dataset. Please note that no data cleansing (such as special-character or stop-word removal) has been performed.
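Since the data is released raw, users may want to apply their own cleansing. The sketch below shows one possible approach to the two steps mentioned in the note, special-character and stop-word removal. The `clean_text` function and the tiny `STOP_WORDS` set are illustrative assumptions; a real project would likely use a fuller stop-word list and a pattern that preserves non-ASCII letters.

```python
import re

# Tiny illustrative stop-word list (assumption); replace with a fuller one.
STOP_WORDS = frozenset(
    {"a", "an", "the", "of", "in", "on", "and", "or", "is", "are", "to", "we", "this"}
)


def clean_text(text: str) -> list:
    """Lowercase, strip special characters, and drop stop words."""
    lowered = text.lower()
    # Replace anything that is not an ASCII letter, digit, or whitespace.
    # Note: this also removes accented and other non-ASCII letters.
    no_special = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return [token for token in no_special.split() if token not in STOP_WORDS]


tokens = clean_text("Prior work relies on citations [1].")
```

For this input, `tokens` is `["prior", "work", "relies", "citations", "1"]`: the brackets and full stop are removed, and "on" is dropped as a stop word.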

Knowledge Media Institute
The Open University
Milton Keynes
MK7 6AA
United Kingdom

+44 (0)1908 652790