Software/Data

Yahoo! JAPAN Research is making a portion of its software and data available for extensive use by researchers at universities and public research institutions in a variety of fields, such as information science, social science and interdisciplinary areas.

Software

NGT – Neighborhood Graph and Tree for Indexing

Description and Details
This software performs high-speed nearest-neighbor searches: given a query vector, it finds the data closest to that vector in a large volume of high-dimensional vector data.
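
To make the search task concrete, the following is a minimal brute-force sketch in plain Python. It is not NGT's API or its indexing method; the function name nearest_neighbors, the sample vectors and the choice of Euclidean distance are assumptions made only for illustration. It compares the query against every stored vector, which is the cost that a graph and tree index is meant to avoid on large collections.

```python
import math


def nearest_neighbors(query, vectors, k=1):
    """Return the indices of the k vectors closest to `query`.

    Brute force: measures the Euclidean distance from the query to every
    stored vector and sorts by that distance. Illustration only; this is
    not how NGT works internally.
    """
    ranked = sorted(range(len(vectors)),
                    key=lambda i: math.dist(query, vectors[i]))
    return ranked[:k]


# Hypothetical 3-dimensional data; real workloads are far larger and
# have many more dimensions.
data = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.9, 1.1, 1.0],
    [5.0, 5.0, 5.0],
]

result = nearest_neighbors([1.0, 1.0, 0.9], data, k=2)
print(result)
```

The returned values are positions in the data list, ordered from nearest to farthest.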

big3store - large-scale distributed RDF storage manager

Description and Details
This is a prototype system for storing and searching large-scale knowledge graph data efficiently. Its storage and query processing systems scale toward peta-triples by distributing the data at large scale across shared-nothing clusters.
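
As a rough illustration of what distributing triples across a shared-nothing cluster can look like, the sketch below hash-partitions RDF triples by subject so that each node holds its own shard and shares nothing with the others. This is an assumed scheme for illustration, not big3store's actual distribution or query algorithm; NUM_NODES, node_for, partition, match and the ex: identifiers are all hypothetical.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size


def node_for(subject):
    """Map a subject to a node using a stable hash (same result on every run)."""
    return zlib.crc32(subject.encode("utf-8")) % NUM_NODES


def partition(triples):
    """Assign each (subject, predicate, object) triple to exactly one node."""
    shards = defaultdict(list)
    for s, p, o in triples:
        shards[node_for(s)].append((s, p, o))
    return shards


def match(shards, subject=None, predicate=None, obj=None):
    """Evaluate one triple pattern; None acts as a wildcard.

    A bound subject is routed to a single node. Any other pattern is
    sent to every node and the partial results are concatenated.
    """
    nodes = [node_for(subject)] if subject is not None else list(shards)
    hits = []
    for n in nodes:
        for s, p, o in shards.get(n, []):
            if subject in (None, s) and predicate in (None, p) and obj in (None, o):
                hits.append((s, p, o))
    return hits


# Hypothetical sample data.
triples = [
    ("ex:Tokyo", "ex:locatedIn", "ex:Japan"),
    ("ex:Tokyo", "ex:type", "ex:City"),
    ("ex:Osaka", "ex:locatedIn", "ex:Japan"),
    ("ex:Japan", "ex:type", "ex:Country"),
]

shards = partition(triples)
print(match(shards, subject="ex:Tokyo"))
print(sorted(match(shards, predicate="ex:locatedIn")))
```

In this sketch, partitioning by subject keeps all triples about one entity on the same node, so subject-bound lookups touch one shard while other patterns fan out to all of them.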

Data

Yahoo! Chiebukuro Data (Ver. 2)

Description

Yahoo! Chiebukuro is the largest community-driven question answering service in Japan. It connects users who have questions with users who may have the answers, enabling people to share information and knowledge with each other. The data being provided consists of resolved questions and their answers, extracted from the Chiebukuro database for the period shown below.

Period: April 2004 – April 2009

Number of Questions: about 16 million

Number of Answers: about 50 million

Obtaining the Data
This data is available for download through the website of the National Institute of Informatics (NII) (external site). Please refer to NII's Yahoo! Chiebukuro Data (Ver. 2) Usage Procedures page (external site) for details on applying for and using the data.

Yahoo! Search Query Data

Description

The data consists of queries related to the topic queries of the 12th NTCIR (NTCIR-12) tasks. The related queries were extracted from Yahoo! Search logs for the period shown below, using three different techniques. The data does not contain any personal information such as operation history, personal identifiers or context.

Period: July 2009 – June 2013

Provision Method

This data is provided to participants in the NTCIR (NII Testbeds and Community for Information access Research) (external site) Evaluation of Information Access Technologies Workshop, and research groups taking part in the workshop can use it free of charge.

For details, please check the NTCIR (external site) web page.

Note: Applications to participate in the task that will use the data provided by Yahoo! JAPAN are no longer being accepted.

YJ Captions Dataset

Description

We have developed a Japanese version of the MS COCO caption dataset (external site), which we call the YJ Captions 26k Dataset. It was created to facilitate the development of image captioning in Japanese. Each Japanese caption describes an image from the MS COCO dataset, and each image has five captions.