Collaboration gaps in citation-based networks: a community detection and hypergraph approach

Social Network Analysis · University of Pisa · 2024–2025

Research Question

Researchers at the same institution often work on closely related problems without knowing it. Can citation network analysis surface these hidden overlaps?

We build two graphs from University of Pisa publications (OpenAIRE, post-2020): G_Cit (direct citations only) and G_BC (expanded with bibliographic-coupling edges weighted by Jaccard similarity). Three community detection algorithms and a hypergraph representation reveal structure invisible to pairwise methods.
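The bibliographic-coupling weight can be illustrated with a minimal sketch. It assumes each paper is represented by the set of works it cites; the names `jaccard`, `coupling_edges`, and the toy `refs` dictionary are hypothetical and not taken from the project notebooks.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coupling_edges(references: dict, min_weight: float = 0.0):
    """Yield (paper_u, paper_v, weight) for pairs whose similarity exceeds min_weight."""
    # Brute-force pairwise comparison: fine for a toy example,
    # too slow for a full institutional corpus.
    for u, v in combinations(sorted(references), 2):
        w = jaccard(references[u], references[v])
        if w > min_weight:
            yield u, v, w


# Toy data (illustrative only): paper id -> set of cited work ids.
refs = {
    "p1": {"r1", "r2", "r3"},
    "p2": {"r2", "r3", "r4"},
    "p3": {"r5"},
}
edges = list(coupling_edges(refs))
print(edges)
```

Here `p1` and `p2` share two of four distinct references, so they receive a coupling edge of weight 0.5, while `p3` stays unconnected. Whether the notebooks apply a minimum-weight threshold is not stated in this README; `min_weight` is only a placeholder for such a choice.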

Key Results

Adding "shared-reference" links to the citation graph reveals 18 research groups that looked separate under citations alone but are actually working on related topics.
The hypergraph surfaces ~2 million paper pairs that share a common reference pool without ever citing each other — potential collaboration gaps.
Of the 100 top-ranked candidate gaps, 95 remain after removing 5 duplicates. In all 95, both papers fall inside the same research cluster under each of three independent algorithms, against a random baseline near zero.
Only 7 pairs involve two recent papers (post-2020) — those are the ones where a missed collaboration could still happen today.
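The idea of a "gap" in the results above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: it assumes one hyperedge per cited work, containing every paper that cites it, and the function name `candidate_gaps` and the toy edge list are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_gaps(citations):
    """Return unordered paper pairs that share a cited work but never cite each other.

    citations: iterable of (citing_paper, cited_work) tuples.
    """
    hyperedges = defaultdict(set)  # cited work -> set of papers citing it
    direct = set()                 # unordered pairs linked by a direct citation
    for citing, cited in citations:
        hyperedges[cited].add(citing)
        direct.add(frozenset((citing, cited)))

    gaps = set()
    for members in hyperedges.values():
        for u, v in combinations(sorted(members), 2):
            pair = frozenset((u, v))
            if pair not in direct:
                gaps.add(pair)
    return gaps


# Toy data: a, b and c all cite r1; b also cites a directly.
toy = [("a", "r1"), ("b", "r1"), ("c", "r1"), ("b", "a")]
gaps = candidate_gaps(toy)
print(sorted(tuple(sorted(p)) for p in gaps))
```

In the toy data the pair (a, b) is excluded because of the direct citation, leaving (a, c) and (b, c) as candidates. Ranking the candidates and filtering by publication year are separate steps that this sketch does not cover.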

Open questions. The analysis is structural: shared references indicate thematic proximity, not a confirmed missed collaboration. Turning these candidates into actual recommendations would require resolving papers to individual researchers (e.g. via ORCID) and validating retrospectively whether the authors of pairs flagged in 2020 had co-authored by 2025.

Data

Source: OpenAIRE Research Graph

Pipeline

The analysis runs in five stages. Each notebook folder maps directly to a section of the report.

Folder	Report section
`notebooks/A_graph_creation/`	§2 Data Collection and Graph Creation
`notebooks/B_graph_analysis/`	§3 Network Characterization
`notebooks/C_community_detection/`	§4 Community Detection
`notebooks/D_hypergraph/`	§5 Hypergraph Analysis
`notebooks/E_open_question/`	§6 Latent Intellectual Communities in a Research Institution

Project Structure

├── data/          # graphs, hypergraph, intermediate and raw OpenAIRE data
├── notebooks/
│   ├── A_graph_creation/
│   ├── B_graph_analysis/
│   ├── C_community_detection/
│   ├── D_hypergraph/
│   └── E_open_question/
├── tex/           # LaTeX report source
└── report.pdf

Data folder

The data folder is too large to be hosted on GitHub. To reproduce and verify the results, run the whole pipeline locally.

Note: raw dumps can be downloaded from Zenodo (Dataset version: 10.6).

Reproducibility

Set up the environment by installing the dependencies listed in ./notebooks/requirements.txt (for example with pip install -r ./notebooks/requirements.txt).

Most time- and memory-intensive steps are cached in each notebook subfolder's output/ directory and load automatically. To force recomputation, delete the cached file or set force = True (community detection notebooks only).
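The load-or-recompute behaviour described above follows a common pattern, sketched here under assumptions: the helper name `load_or_compute`, the use of pickle, and the file layout are hypothetical and may differ from what the notebooks actually do.

```python
import pickle
from pathlib import Path


def load_or_compute(cache_path, compute, force=False):
    """Return the cached result if present, otherwise call compute() and cache it.

    cache_path: file under the notebook's output/ directory (hypothetical layout).
    compute:    zero-argument callable producing the expensive result.
    force:      when True, ignore any existing cache and recompute.
    """
    path = Path(cache_path)
    if path.exists() and not force:
        with path.open("rb") as fh:
            return pickle.load(fh)

    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result
```

A call such as `load_or_compute("output/partition.pkl", run_detection)` (both names illustrative) would compute once and reuse the file afterwards; deleting the file or passing `force=True` triggers recomputation, matching the two options listed above.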

Known Issues

`BrokenProcessPool` on Windows with Jupyter

Running any notebook that uses ProcessPoolExecutor raises:

BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

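Root cause. Windows starts child processes with spawn instead of fork. A spawned worker begins in a fresh interpreter and has to import the function it is asked to run, but functions defined in notebook cells exist only in the kernel's __main__ namespace. The worker cannot load them, terminates immediately, and the pool is reported as broken.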

Workarounds:

Rely on the cache — pre-computed results load automatically. Recommended for Windows users.
Replace ProcessPoolExecutor with ThreadPoolExecutor — avoids the spawn issue, but the GIL limits CPU-bound parallelism for community detection.
Remove all multiprocessing (not recommended) — makes community detection prohibitively slow.
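The second workaround can be sketched as below. `run_in_threads` and `toy_task` are hypothetical stand-ins, since the actual worker functions in the notebooks are not shown here; the point is only that `ThreadPoolExecutor` offers the same `map` interface without spawning processes.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_threads(func, items, max_workers=4):
    """Apply func to each item in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


def toy_task(seed):
    # Placeholder for one community detection run with a given seed.
    return seed * seed


results = run_in_threads(toy_task, [1, 2, 3])
print(results)
```

Threads share the notebook's namespace, so functions defined in cells are usable directly. The trade-off is the one noted in the list: pure-Python CPU-bound work gains little from threads because of the GIL.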

Authors

Francesco Secoli · Elisa Calabrese · Tommaso Agostini
University of Pisa — Social Network Analysis, 2024–2025
