Marios Papachristou Personal Homepage

Datasets

Here you will find a collection of datasets in the public domain for various tasks. I strive to make the data I use for my research open. If anything is not working, please contact me at papachristoumarios@cs.cornell.edu.

Graph Datasets

Opinion Dynamics

Data in this category consist of unweighted directed networks in which each node carries a multi-dimensional, 0-1 valued label indicating whether or not the node endorses each opinion. The file X.edges contains the directed edges between the nodes of the network, and the file X.feat contains a dense feature matrix in which the first entry of each row is the node id.

  • pokec. Derived from soc-pokec. The data contains users of the pokec social network, with users whose information is private filtered out. The attributes of each user are derived from the interests listed on their profile (described in the original network).
  • github. Contains data gathered from GHTorrent with queries described in this gist, where nodes are GitHub users and attributes are the programming languages each user has programmed in as the owner of a project.
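As a rough illustration of how the X.edges and X.feat files described above might be read, here is a minimal Python sketch. It assumes whitespace-separated fields, one edge ("source target") or one feature row per line, and lines starting with "#" treated as comments; none of these details are stated above, so check them against the actual files. The function names read_edges and read_features are hypothetical helpers, not part of any released code.

```python
import io


def read_edges(handle):
    """Parse directed edges, assuming one 'source target' pair per line."""
    edges = []
    for line in handle:
        parts = line.split()
        # Skip blank lines and (assumed) '#' comment lines.
        if len(parts) < 2 or parts[0].startswith("#"):
            continue
        edges.append((parts[0], parts[1]))
    return edges


def read_features(handle):
    """Parse a dense feature matrix whose first entry per row is the node id."""
    features = {}
    for line in handle:
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue
        # Remaining entries are the 0-1 valued opinion labels.
        features[parts[0]] = [int(float(v)) for v in parts[1:]]
    return features


# Tiny in-memory stand-ins for X.edges and X.feat (made-up values).
sample_edges = io.StringIO("1 2\n2 3\n3 1\n")
sample_feat = io.StringIO("1 1 0 1\n2 0 0 1\n3 1 1 0\n")

edges = read_edges(sample_edges)
features = read_features(sample_feat)
print(len(edges), "edges;", len(features), "labelled nodes")
```

For the real data, replace the in-memory samples with open("pokec.edges") and open("pokec.feat") (or the github equivalents), adjusting the delimiter if the files turn out to be comma- or tab-separated.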

Call Graphs

These datasets contain call graphs derived using the cscout tool. Each file represents a directed call graph where each line corresponds to a directed edge between two entities (files, functions etc.).
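A minimal Python sketch of how such an edge list could be turned into an adjacency structure and queried is given below. It assumes each line holds exactly two whitespace-separated entity names ("caller callee"), which may not match the exact cscout output, and the helpers build_call_graph and reachable_from are hypothetical names introduced only for illustration.

```python
from collections import defaultdict, deque


def build_call_graph(lines):
    """Build caller -> set of callees from 'caller callee' lines (assumed format)."""
    graph = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        caller, callee = parts
        graph[caller].add(callee)
    return graph


def reachable_from(graph, start):
    """Return every entity reachable from `start` via BFS, excluding `start` itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for callee in graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    seen.discard(start)
    return seen


# Made-up edges standing in for one call-graph file.
sample = ["main parse", "main run", "run log", "parse log", "unused helper"]
graph = build_call_graph(sample)
print(sorted(reachable_from(graph, "main")))
```

Storing callees in sets collapses duplicate edges; if call multiplicity matters for an experiment, a list or a counter per caller would be the more faithful choice.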

Hypergraph Datasets

GHTorrent Datasets

In these datasets, the nodes of a hypergraph represent users, and hyperedges represent repositories, org members etc. that these users belong to. Each user comes with features (such as number of commits, number of followers etc.) used in the experiments for this paper. We provide the SQL queries to create the datasets based on the GHTorrent MySQL schema. The post-processed datasets follow the convention of these hypergraph datasets.

Other resources