We have released the first version of C2D, a citation-context based dataset created during an experiment for work that will be published as a short paper at RecSys 2018.
The dataset contains 53 million unique records of citation information (explained in the section below) and was constructed from 2 million full-text open-access research publications obtained from CORE.
From each publication we extracted citation information: the cited document's title, author(s), publication date, and citation context. The assumptions behind citation-context extraction are described in more detail below:
First, we located each position in the text where a reference is mentioned, together with its citation context, that is, the text surrounding the mention. For our purposes, the citation context consists of three sentences: the sentence in which the reference is cited, the preceding sentence, and the following sentence. When the citing sentence is at the start or end of a paragraph, the preceding or following sentence, respectively, is not extracted.
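The three-sentence rule above can be sketched as follows. This is a minimal illustration, not the code used to build C2D: it assumes the paragraph has already been split into sentences, and the names `CitationContext` and `citation_context` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class CitationContext(NamedTuple):
    """Three-sentence context around a citation (hypothetical helper type)."""
    before: Optional[str]  # preceding sentence, None at the start of a paragraph
    citing: str            # sentence in which the reference is cited
    after: Optional[str]   # following sentence, None at the end of a paragraph


def citation_context(paragraph: List[str], index: int) -> CitationContext:
    """Return the context for the citing sentence at `index`.

    `paragraph` is a list of sentences from a single paragraph, so the
    context never crosses a paragraph boundary.
    """
    if not 0 <= index < len(paragraph):
        raise IndexError("index does not point at a sentence in the paragraph")
    before = paragraph[index - 1] if index > 0 else None
    after = paragraph[index + 1] if index + 1 < len(paragraph) else None
    return CitationContext(before, paragraph[index], after)
```

For a citing sentence in the middle of a paragraph all three fields are filled; for the first or last sentence the missing neighbour is `None`.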
The dataset therefore has the following attributes:
- ReferenceID - unique identifier of the cited reference within a citing document.
- SourceID - unique identifier of the citing document.
- ChapterNumber - chapter number of the citing document where the reference ReferenceID is mentioned.
- ParagraphNumber - paragraph number of the citing document where the reference ReferenceID is mentioned.
- SentenceNumber - sentence number of the citing document where the reference ReferenceID is mentioned.
- Title - title of the reference ReferenceID.
- PublishedDate - publication date of the reference ReferenceID.
- Authors - author(s) of the reference ReferenceID.
- TextBeforeRefMention - sentence just before the sentence where the reference ReferenceID is cited.
- TextWhereRefMention - sentence where the reference ReferenceID is cited.
- TextAfterRefMention - sentence just after the sentence where the reference ReferenceID is cited.
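As a rough illustration of how one record might be represented in code, the attributes above can be mapped onto a Python dataclass. The field names follow the list; the field types, the use of `None` for a missing neighbouring sentence, and the `context` helper are assumptions, since the storage format is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CitationRecord:
    """One C2D record; field types are assumed, not taken from the dataset."""
    ReferenceID: str
    SourceID: str
    ChapterNumber: int
    ParagraphNumber: int
    SentenceNumber: int
    Title: str
    PublishedDate: str
    Authors: str
    TextBeforeRefMention: Optional[str]  # assumed empty/None at paragraph start
    TextWhereRefMention: str
    TextAfterRefMention: Optional[str]   # assumed empty/None at paragraph end

    def context(self) -> str:
        """Join the available context sentences into a single string."""
        parts = [
            self.TextBeforeRefMention,
            self.TextWhereRefMention,
            self.TextAfterRefMention,
        ]
        return " ".join(p for p in parts if p)
```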
Finally, the dataset can be downloaded from the link.
- The uncompressed dataset is ~40 GB; the compressed download is ~6.7 GB.
- Because users' requirements differ, we have released the raw version of the dataset. Please note that no data cleansing (such as special-character and stop-word removal) has been performed.
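Since the data is released raw, users may want to apply their own cleansing. The sketch below shows one possible approach to special-character and stop-word removal using only the Python standard library. The `STOP_WORDS` set is a tiny illustrative sample and `clean_text` is a hypothetical helper; neither is part of the dataset or its tooling.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "and", "is", "to"})


def clean_text(text: str) -> str:
    """Lowercase, replace non-alphanumeric characters, drop stop words."""
    # Replace every character that is not an ASCII letter, digit, or
    # whitespace with a space so adjacent tokens are not merged.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)
```

For example, `clean_text("The results, shown in [12], agree!")` yields `"results shown 12 agree"`. Note that this pattern also strips accented and other non-ASCII letters, which may not be desirable for author names or non-English text.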