Building a Research Intelligence System
How I automated ArXiv monitoring into a four-layer pipeline that reads papers, detects research trends, and delivers daily briefings
Keeping up with the literature is one of the most time-consuming parts of research. Between ArXiv, conference proceedings, and citation trails, it is easy to spend hours every week just deciding what to read — before actually reading anything. I built a system to solve this problem for myself, and it turned out general enough to work for any research domain.
The Research Intelligence System is an open-source, four-layer pipeline that transforms ArXiv paper aggregation into structured research intelligence. Each layer builds on the previous one: fetching and scoring papers, deeply analyzing their content, detecting emerging research fronts through citation networks, and finally producing living literature reviews and daily email briefings. The whole thing runs automatically via GitHub Actions.
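For orientation, the GitHub Actions automation can be a single cron-triggered workflow. The sketch below shows the general shape; the file name, entry-point script, and commit step are assumptions for illustration, not the repository's actual workflow:

```yaml
# .github/workflows/daily.yml (hypothetical names; adapt to your repo layout)
name: daily-pipeline
on:
  schedule:
    - cron: "0 6 * * *"    # run every day at 06:00 UTC
  workflow_dispatch:        # allow manual runs
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python run_pipeline.py   # hypothetical entry point
      - run: |                        # commit results back for the audit trail
          git config user.name "pipeline-bot"
          git config user.email "bot@users.noreply.github.com"
          git add -A && git commit -m "daily update" || true
          git push
```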
The system is designed for any research topic. If you work in AI for healthcare, climate modeling, or anything else, you configure two files — a domain definition and a researcher profile — and the pipeline adapts without any code changes.
Layer 0: Fetch & Score
The first layer is responsible for acquiring papers and deciding which ones matter. It uses two complementary search strategies: keyword-based search against the ArXiv API, and citation graph crawling via Semantic Scholar starting from a set of seed papers you define. This dual strategy catches both new work in your area and papers that cite work you already care about.
Each paper is then scored 0–10 by an LLM using your category description and researcher context as the prompt. Scores and PDFs are cached to avoid redundant API calls across runs. Code links are automatically discovered via Papers with Code and GitHub search. The output is a set of JSON files and a human-readable README that summarizes what was found.
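The caching behavior can be sketched as follows. This is a minimal illustration, not the system's actual code: the cache layout is an assumption, and `fake_llm_score` stands in for the real LLM call.

```python
import hashlib
import json
from pathlib import Path

CACHE = Path("score_cache.json")

def load_cache() -> dict:
    return json.loads(CACHE.read_text()) if CACHE.exists() else {}

def save_cache(cache: dict) -> None:
    CACHE.write_text(json.dumps(cache, indent=2))

def relevance_score(paper: dict, score_fn, cache: dict) -> float:
    """Return a 0-10 score, consulting the cache first so repeated
    runs over the same paper pool make no new LLM calls."""
    # Key on the abstract hash too, so a revised abstract invalidates stale scores.
    key = paper["arxiv_id"] + ":" + hashlib.sha256(paper["abstract"].encode()).hexdigest()[:12]
    if key not in cache:
        cache[key] = score_fn(paper)  # one LLM call per unseen paper
    return cache[key]

# Stand-in for the LLM: any callable mapping a paper dict to a 0-10 float.
def fake_llm_score(paper: dict) -> float:
    return 8.5 if "optimization" in paper["abstract"].lower() else 2.0

cache = load_cache()
paper = {"arxiv_id": "2401.00001", "abstract": "A new optimization method for routing."}
score = relevance_score(paper, fake_llm_score, cache)
save_cache(cache)
print(score)
```

The same pattern applies to PDFs: key the download by arXiv ID and skip the fetch when the file already exists.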
Layer 1: Deep Analysis
For papers that pass the relevance threshold, Layer 1 runs a four-agent pipeline against the full PDF. The agents are specialized:
- Reader — extracts the problem, methodology, experiments, and key results
- Methods Extractor — tags the techniques used and traces methodological lineage ("builds on X", "extends Y")
- Positioning Agent — scores the paper's relevance to your active projects specifically, and explains why
- Synthesis Controller — merges the three outputs and writes to the database
All outputs are structured via Pydantic schemas using Gemini's structured output mode, so downstream layers get reliable JSON rather than free-form text. The cost is roughly $0.01 per paper using Gemini Flash.
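A minimal sketch of what such schemas might look like. The field names are illustrative, not the repository's actual schemas, and the commented Gemini call follows the google-genai SDK's structured-output interface:

```python
from pydantic import BaseModel, Field

class ReaderOutput(BaseModel):
    """Illustrative schema for the Reader agent."""
    problem: str = Field(description="What problem does the paper address?")
    methodology: str
    key_results: list[str]

class PositioningOutput(BaseModel):
    """Illustrative schema for the Positioning agent."""
    project_relevance: float = Field(ge=0, le=10)
    rationale: str

# With the google-genai SDK, a Pydantic model can be passed as the
# response schema so the model returns validated JSON, e.g.:
#
#   response = client.models.generate_content(
#       model="gemini-2.0-flash",
#       contents=prompt,
#       config={"response_mime_type": "application/json",
#               "response_schema": ReaderOutput},
#   )
#   result = ReaderOutput.model_validate_json(response.text)

# Without an API key, validation can still be exercised locally:
sample = ReaderOutput.model_validate_json(
    '{"problem": "slow solvers", "methodology": "learned heuristics",'
    ' "key_results": ["2x speedup"]}'
)
print(sample.problem)
```

Because validation happens at the boundary, a malformed model response fails loudly here instead of corrupting the database downstream.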
Layer 2: Bibliometric Fronts
This is the most technically interesting layer. It builds a directed citation graph from Semantic Scholar, then constructs a co-citation network using Salton's cosine similarity between paper pairs. Louvain community detection finds clusters — these are the research fronts, groups of papers that the community collectively cites together.
Each detected front is summarized by an LLM into a trend description. The system also identifies bridge papers — papers that connect two or more otherwise separate fronts — which are often the most strategically interesting to read. Results are stored in the database and can be visualized as citation graphs.
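On a toy citation graph, the front-detection step might look like this. This is a sketch using networkx on made-up data; the real pipeline operates on Semantic Scholar citations.

```python
import itertools
from collections import defaultdict

import networkx as nx

# Toy data: citing paper -> set of referenced papers.
citations = {
    "A": {"R1", "R2"},
    "B": {"R1", "R2"},
    "C": {"R2", "R3"},
    "D": {"R3", "R4"},
    "E": {"R3", "R4"},
}

# Co-citation counts: two references are co-cited when one paper cites both.
cocite = defaultdict(int)
cited_by = defaultdict(int)
for refs in citations.values():
    for r in refs:
        cited_by[r] += 1
    for u, v in itertools.combinations(sorted(refs), 2):
        cocite[(u, v)] += 1

# Salton's cosine: co-citation count normalized by the geometric mean
# of the two papers' citation counts.
G = nx.Graph()
for (u, v), c in cocite.items():
    G.add_edge(u, v, weight=c / (cited_by[u] * cited_by[v]) ** 0.5)

# Louvain community detection on the weighted co-citation network:
# each community is one research front.
fronts = nx.community.louvain_communities(G, weight="weight", seed=42)
print(fronts)
```

A bridge paper then falls out naturally: any node with edges into two or more of the detected communities.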
Layer 3: Living Reviews & Email Briefings
The final layer maintains a continuously updated literature review for each research category. It operates on three cycles:
- Daily — new papers are appended to the existing review structure
- Weekly — sections are restructured, narrative is added, research fronts are highlighted
- Monthly — a full rewrite with new organization reflecting how the field has evolved
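One simple way to dispatch the three cycles from a single daily job is date-based routing. This is a hypothetical sketch; the system's actual trigger logic may differ.

```python
import datetime as dt

def review_mode(today: dt.date) -> str:
    """Pick which review cycle to run: full rewrite on the 1st of the
    month, restructure on Mondays, plain append otherwise."""
    if today.day == 1:
        return "monthly_rewrite"
    if today.weekday() == 0:  # Monday
        return "weekly_restructure"
    return "daily_append"

print(review_mode(dt.date(2024, 7, 1)))   # monthly_rewrite
print(review_mode(dt.date(2024, 7, 8)))   # a Monday -> weekly_restructure
print(review_mode(dt.date(2024, 7, 9)))   # daily_append
```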
On the email side, a unified daily briefing covers all categories, while weekly deep-dive emails go category by category. Citation graph visualizations are embedded directly in the emails. Everything is delivered via Gmail SMTP and committed back to the repository, giving you an audit trail of how your understanding of the field evolved.
Adapting to Any Domain
The system was designed from the start to be domain-agnostic. To use it for a new research area, you edit two configuration files:
- research_domain.yaml — defines your categories, ArXiv filters, and keywords
- researcher_profile.md — describes your active projects and research interests, used to personalize the relevance scoring in Layers 0 and 1
No code changes are needed. The same pipeline that monitors AI for optimization can be pointed at computer vision, drug discovery, or climate science by changing these two files.
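To give a feel for the shape, a domain definition might look roughly like this. The key names below are assumptions chosen to illustrate the idea, not the repository's actual schema:

```yaml
# research_domain.yaml (illustrative; follow the schema shipped with the repo)
categories:
  - name: "LLMs for Combinatorial Optimization"
    description: >
      Papers using large language models to generate, repair,
      or guide solutions to combinatorial optimization problems.
    arxiv_categories: [cs.AI, math.OC]
    keywords: ["large language model", "combinatorial optimization"]
    seed_papers: []        # Semantic Scholar citation crawl starts here
relevance_threshold: 6.0   # papers scoring below this skip deep analysis
```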
Cost
Running the full pipeline on 100 papers costs roughly $1.35 using Gemini Flash: $0.05 for relevance scoring in Layer 0, $1.00 for the four-agent deep analysis in Layer 1, $0.20 for front summarization in Layer 2, and $0.10 for email generation in Layer 3. The caching strategy means subsequent runs on the same paper pool cost almost nothing. In practice, running the full automated workflow on a regular schedule comes to around $2 per week.
The code, configuration schema, and an example setup for AI & Optimization research are available on GitHub. The associated paper collection for OR and AI is also open.