Research: Project Structure Detection for Architecture Analysis

This document provides comprehensive research on approaches to detecting and validating project architecture structure. It covers existing tools, academic research, algorithms, and industry best practices that inform Guardian's architecture detection strategy.


Table of Contents

  1. Executive Summary
  2. Existing Tools Analysis
  3. Academic Approaches to Architecture Recovery
  4. Graph Analysis Algorithms
  5. Configuration Patterns and Best Practices
  6. Industry Consensus
  7. Recommendations for Guardian
  8. Additional Resources

1. Executive Summary

Key Finding

Industry consensus: Automatic architecture detection is unreliable. All of the major tools surveyed (ArchUnit, eslint-plugin-boundaries, Nx, dependency-cruiser, SonarQube) rely on explicit user configuration for layer rules rather than attempting to infer them. SonarQube is the only one that also offers basic configuration-free checks such as cycle detection.

Why Automatic Detection Fails

  1. Too Many Variations: Project structures vary wildly across teams, frameworks, and domains
  2. False Positives: Algorithms may "detect" non-existent architectural patterns
  3. Performance: Graph analysis is slow for large codebases (>2000 files)
  4. Ambiguity: Same folder names can mean different things in different contexts
  5. Legacy Code: Poorly structured code produces meaningless analysis results

Recommended Approach

| Priority | Approach | Description |
|----------|----------|-------------|
| P0 | Pattern-based detection | Glob/regex patterns for layer identification |
| P0 | Configuration file | .guardianrc.json for explicit rules |
| P1 | Presets | Pre-configured patterns for common architectures |
| P1 | Generic mode | Fallback with minimal checks |
| P2 | Interactive setup | CLI wizard for configuration generation |
| P2 | Graph visualization | Visual dependency analysis (informational only) |
|  | Auto-detection | NOT recommended as primary strategy |

2. Existing Tools Analysis

2.1 ArchUnit (Java)

Approach: Fully declarative - user defines all layers explicitly.

Official Website: https://www.archunit.org/

User Guide: https://www.archunit.org/userguide/html/000_Index.html

GitHub Repository: https://github.com/TNG/ArchUnit

Key Characteristics:

  • Does NOT detect architecture automatically
  • User explicitly defines layers via package patterns
  • Fluent API for rule definition
  • Supports Layered, Onion, and Hexagonal architectures out-of-box
  • Integrates with JUnit/TestNG test frameworks

Example Configuration:

layeredArchitecture()
    .consideringAllDependencies()  // required since ArchUnit 1.0
    .layer("Controller").definedBy("..controller..")
    .layer("Service").definedBy("..service..")
    .layer("Persistence").definedBy("..persistence..")
    .whereLayer("Controller").mayNotBeAccessedByAnyLayer()
    .whereLayer("Service").mayOnlyBeAccessedByLayers("Controller")
    .whereLayer("Persistence").mayOnlyBeAccessedByLayers("Service")


2.2 eslint-plugin-boundaries (TypeScript/JavaScript)

Approach: Pattern-based element definition with dependency rules.

NPM Package: https://www.npmjs.com/package/eslint-plugin-boundaries

GitHub Repository: https://github.com/javierbrea/eslint-plugin-boundaries

Key Characteristics:

  • Does NOT detect architecture automatically
  • Uses micromatch/glob patterns for element identification
  • Supports capture groups for dynamic element naming
  • TypeScript import type awareness (value vs type imports)
  • Works with monorepos

Example Configuration:

settings: {
    "boundaries/elements": [
        {
            type: "domain",
            pattern: "src/domain/*",
            mode: "folder",
            capture: ["elementName"]
        },
        {
            type: "application",
            pattern: "src/application/*",
            mode: "folder"
        },
        {
            type: "infrastructure",
            pattern: "src/infrastructure/*",
            mode: "folder"
        }
    ]
},
rules: {
    "boundaries/element-types": [2, {
        default: "disallow",
        rules: [
            { from: "infrastructure", allow: ["application", "domain"] },
            { from: "application", allow: ["domain"] },
            { from: "domain", disallow: ["*"] }
        ]
    }]
}


2.3 SonarQube Architecture as Code

Approach: YAML/JSON configuration with automatic code structure analysis.

Official Documentation: https://docs.sonarsource.com/sonarqube-server/design-and-architecture/overview/

Configuration Guide: https://docs.sonarsource.com/sonarqube-server/design-and-architecture/configuring-the-architecture-analysis/

Key Characteristics:

  • Introduced in SonarQube 2025 Release 2
  • Automatic code structure analysis (basic)
  • YAML/JSON configuration for custom rules
  • Supports "Perspectives" (multiple views of architecture)
  • Hierarchical "Groups" for organization
  • Glob and regex pattern support
  • Works without configuration for basic checks (cycle detection)

Supported Languages:

  • Java (SonarQube Server)
  • Java, JavaScript, TypeScript (SonarQube Cloud)
  • Python, C# (coming soon)
  • C++ (under consideration)

Example Configuration:

# architecture.yaml
perspectives:
  - name: "Clean Architecture"
    groups:
      - name: "Domain"
        patterns:
          - "src/domain/**"
          - "src/core/**"
      - name: "Application"
        patterns:
          - "src/application/**"
          - "src/use-cases/**"
      - name: "Infrastructure"
        patterns:
          - "src/infrastructure/**"
          - "src/adapters/**"
    constraints:
      - from: "Domain"
        deny: ["Application", "Infrastructure"]
      - from: "Application"
        deny: ["Infrastructure"]


2.4 Nx Enforce Module Boundaries

Approach: Tag-based system with ESLint integration.

Official Documentation: https://nx.dev/docs/features/enforce-module-boundaries

ESLint Rule Guide: https://nx.dev/docs/technologies/eslint/eslint-plugin/guides/enforce-module-boundaries

Key Characteristics:

  • Tag-based constraint system (scope, type)
  • Projects tagged in project.json or package.json
  • Supports regex patterns in tags
  • Two-dimensional constraints (scope + type)
  • External dependency blocking
  • Integration with Nx project graph

Example Configuration:

// project.json
{
    "name": "user-domain",
    "tags": ["scope:user", "type:domain"]
}

// ESLint config
{
    "@nx/enforce-module-boundaries": ["error", {
        "depConstraints": [
            {
                "sourceTag": "type:domain",
                "onlyDependOnLibsWithTags": ["type:domain"]
            },
            {
                "sourceTag": "type:application",
                "onlyDependOnLibsWithTags": ["type:domain", "type:application"]
            },
            {
                "sourceTag": "scope:user",
                "onlyDependOnLibsWithTags": ["scope:user", "scope:shared"]
            }
        ]
    }]
}


2.5 dependency-cruiser

Approach: Rule-based validation with visualization capabilities.

NPM Package: https://www.npmjs.com/package/dependency-cruiser

GitHub Repository: https://github.com/sverweij/dependency-cruiser

Key Characteristics:

  • Regex patterns for from/to rules
  • Multiple output formats (SVG, DOT, Mermaid, JSON, HTML)
  • CI/CD integration support
  • TypeScript pre-compilation dependency support
  • Does NOT detect architecture automatically

Example Configuration:

// .dependency-cruiser.js
module.exports = {
    forbidden: [
        {
            name: "no-domain-to-infrastructure",
            severity: "error",
            from: { path: "^src/domain" },
            to: { path: "^src/infrastructure" }
        },
        {
            name: "no-circular",
            severity: "error",
            from: {},
            to: { circular: true }
        }
    ],
    options: {
        doNotFollow: { path: "node_modules" },
        tsPreCompilationDeps: true
    }
}


2.6 ts-arch / ArchUnitTS (TypeScript)

Approach: ArchUnit-like fluent API for TypeScript.

ts-arch GitHub: https://github.com/ts-arch/ts-arch

ts-arch Documentation: https://ts-arch.github.io/ts-arch/

ArchUnitTS GitHub: https://github.com/LukasNiessen/ArchUnitTS

Key Characteristics:

  • Fluent API similar to ArchUnit
  • PlantUML diagram validation support
  • Jest/Vitest integration
  • Nx monorepo support
  • Does NOT detect architecture automatically

Example Usage:

import "tsarch/dist/jest"  // registers the toPassAsync matcher
import { filesOfProject, slicesOfProject } from "tsarch"

// Folder-based dependency check
const folderRule = filesOfProject()
    .inFolder("domain")
    .shouldNot()
    .dependOnFiles()
    .inFolder("infrastructure")

await expect(folderRule).toPassAsync()

// PlantUML diagram validation
const diagramRule = slicesOfProject()
    .definedBy("src/(**/)")
    .should()
    .adhereToDiagramInFile("architecture.puml")

await expect(diagramRule).toPassAsync()


2.7 Madge

Approach: Visualization and circular dependency detection.

NPM Package: https://www.npmjs.com/package/madge

GitHub Repository: https://github.com/pahen/madge

Key Characteristics:

  • Dependency graph visualization
  • Circular dependency detection
  • Multiple layout algorithms (dot, neato, fdp, circo)
  • Simple CLI interface
  • Does NOT define or enforce layers

Usage:

# Find circular dependencies
npx madge --circular src/

# Generate dependency graph
npx madge src/ --image deps.svg

# TypeScript support
npx madge src/main.ts --ts-config tsconfig.json --image ./deps.png

Alternative: Skott


3. Academic Approaches to Architecture Recovery

3.1 Software Architecture Recovery Overview

Wikipedia Definition: https://en.wikipedia.org/wiki/Software_architecture_recovery

Software architecture recovery is a set of methods for extracting architectural information from lower-level representations of a software system, such as source code. The abstraction process frequently involves clustering source code entities (files, classes, functions) into subsystems according to application-dependent or independent criteria.

Motivation:

  • Legacy systems often lack architectural documentation
  • Existing documentation is frequently out of sync with implementation
  • Understanding architecture is essential for maintenance and evolution

3.2 Machine Learning Approaches

Research Paper: "Automatic software architecture recovery: A machine learning approach"

Source: ResearchGate - https://www.researchgate.net/publication/261309157_Automatic_software_architecture_recovery_A_machine_learning_approach

Key Points:

  • Current architecture recovery techniques require heavy human intervention or fail to recover quality components
  • Machine learning techniques use multiple feature types:
    • Structural features (dependencies, coupling)
    • Runtime behavioral features
    • Domain/textual features
    • Contextual features (code authorship, line co-change)
  • Automatically recovering functional architecture facilitates developer understanding

Limitation: Requires training data and may not generalize across project types.


3.3 Genetic Algorithms for Architecture Recovery

Research Paper: "Parallelization of genetic algorithms for software architecture recovery"

Source: Springer - https://link.springer.com/content/pdf/10.1007/s10515-024-00479-0.pdf

Key Points:

  • Software Architecture Recovery (SAR) techniques analyze dependencies between modules
  • Automatically cluster modules to achieve high modularity
  • Many approaches employ Genetic Algorithms (GAs)
  • Major drawback: lack of scalability
  • Solution: parallel execution of GA subroutines

Finding: Computing an optimal software clustering is NP-hard, which is why heuristic and evolutionary searches dominate the literature.


3.4 Clustering Algorithms Comparison

Research Paper: "A comparative analysis of software architecture recovery techniques"

Source: IEEE Xplore - https://ieeexplore.ieee.org/document/6693106/

Algorithms Compared:

| Algorithm | Description | Strengths | Weaknesses |
|-----------|-------------|-----------|------------|
| ACDC | Comprehension-Driven Clustering | Finds natural subsystems | Requires parameter tuning |
| LIMBO | Information-Theoretic Clustering | Scalable | May miss domain patterns |
| WCA | Weighted Combined Algorithm | Balances multiple factors | Complex configuration |
| K-means | Baseline clustering | Simple, fast | Poor for code structure |

Key Finding: Even the best techniques have surprisingly low accuracy when compared against verified ground truths.


3.5 ACDC Algorithm (Algorithm for Comprehension-Driven Clustering)

Original Paper: "ACDC: An Algorithm for Comprehension-Driven Clustering"

Source: ResearchGate - https://www.researchgate.net/publication/221200422_ACDC_An_Algorithm_for_Comprehension-Driven_Clustering

York University Wiki: https://wiki.eecs.yorku.ca/project/cluster/protected:acdc

Algorithm Steps:

  1. Build dependency graph
  2. Find "dominator" nodes (subsystem patterns)
  3. Group nodes with common dominators
  4. Apply orphan adoption for ungrouped nodes
  5. Iteratively improve clusters

Advantages:

  • Considers human comprehension patterns
  • Finds natural subsystems
  • Works without prior knowledge

Disadvantages:

  • Requires parameter tuning
  • Does not guarantee optimality
  • May not work well on poorly structured code

3.6 LLM-Based Architecture Recovery (Recent Research)

Research Paper: "Automated Software Architecture Design Recovery from Source Code Using LLMs"

Source: Springer - https://link.springer.com/chapter/10.1007/978-3-032-02138-0_5

Key Findings:

  • LLMs show promise for automating software architecture recovery
  • Effective at identifying:
    • Architectural styles
    • Structural elements
    • Basic design patterns
  • Struggle with:
    • Complex abstractions
    • Class relationships
    • Fine-grained design patterns

Conclusion: "LLMs can support SAR activities, particularly in identifying structural and stylistic elements, but they struggle with complex abstractions"

Additional Reference: arXiv paper on design principles - https://arxiv.org/html/2508.11717


4. Graph Analysis Algorithms

4.1 Louvain Algorithm for Community Detection

Wikipedia: https://en.wikipedia.org/wiki/Louvain_method

Original Paper: "Fast unfolding of communities in large networks" (2008)

Algorithm Description:

  1. Initialize each node as its own community
  2. For each node, try moving to neighboring communities
  3. Select move with maximum modularity gain
  4. Merge communities into "super-nodes"
  5. Repeat from step 2

Modularity Formula:

Q = (1/2m) * Σ[Aij - (ki*kj)/(2m)] * δ(ci, cj)

Where:
- Aij = edge weight between nodes i and j
- ki, kj = weighted degrees of nodes i and j
- m = sum of all edge weights
- δ(ci, cj) = 1 if i and j are in the same community, 0 otherwise
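
The formula can be evaluated directly on a small graph. The following is a minimal sketch, assuming an undirected weighted edge list and a precomputed node-to-community map; the `Edge` type, the `modularity` function, and the sample node names are illustrative and not part of Guardian.

```typescript
// Undirected weighted edge between two nodes (names are illustrative).
type Edge = { a: string; b: string; w: number };

// Computes Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j).
// Assumes every node that appears in `edges` has an entry in `community`.
function modularity(edges: Edge[], community: Map<string, string>): number {
    const degree = new Map<string, number>();
    let m = 0; // total edge weight
    for (const e of edges) {
        degree.set(e.a, (degree.get(e.a) ?? 0) + e.w);
        degree.set(e.b, (degree.get(e.b) ?? 0) + e.w);
        m += e.w;
    }
    if (m === 0) return 0;

    // Sum of A_ij over ordered pairs in the same community:
    // every intra-community edge is counted once per direction.
    let intra = 0;
    for (const e of edges) {
        if (community.get(e.a) === community.get(e.b)) intra += 2 * e.w;
    }

    // Sum of k_i*k_j over ordered pairs in the same community
    // equals the squared degree total of each community.
    const degreeTotal = new Map<string, number>();
    degree.forEach((k, node) => {
        const c = community.get(node) ?? "";
        degreeTotal.set(c, (degreeTotal.get(c) ?? 0) + k);
    });
    let expected = 0;
    degreeTotal.forEach((total) => { expected += total * total; });

    return intra / (2 * m) - expected / (4 * m * m);
}

// Two triangles joined by a single edge (c-d).
const twoTriangles: Edge[] = [
    { a: "a", b: "b", w: 1 }, { a: "b", b: "c", w: 1 }, { a: "a", b: "c", w: 1 },
    { a: "d", b: "e", w: 1 }, { a: "e", b: "f", w: 1 }, { a: "d", b: "f", w: 1 },
    { a: "c", b: "d", w: 1 },
];
const split = new Map<string, string>([
    ["a", "left"], ["b", "left"], ["c", "left"],
    ["d", "right"], ["e", "right"], ["f", "right"],
]);

console.log(modularity(twoTriangles, split).toFixed(3)); // 0.357
```

Splitting the graph into its two triangles gives Q = 5/14 ≈ 0.357, while placing all six nodes in one community gives Q = 0.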

Characteristics:

| Parameter | Value |
|-----------|-------|
| Time Complexity | O(n log n) |
| Modularity Range | -0.5 to 1 |
| Good Result | Q > 0.3 |
| Resolution Limit | Yes (may hide small communities) |

Application to Code Analysis:

Dependency Graph:
User.ts → Email.ts, UserId.ts
Order.ts → OrderId.ts, Money.ts
UserController.ts → User.ts, CreateUser.ts

Louvain detects communities:
Community 1: [User.ts, Email.ts, UserId.ts]        // User aggregate
Community 2: [Order.ts, OrderId.ts, Money.ts]      // Order aggregate
Community 3: [UserController.ts, CreateUser.ts]   // User feature

4.2 Modularity as Quality Metric

Wikipedia: https://en.wikipedia.org/wiki/Modularity_(networks)

Definition: Modularity measures the strength of division of a network into modules (groups, clusters, communities). Networks with high modularity have dense connections within modules but sparse connections between modules.

Interpretation:

| Modularity Value | Interpretation |
|------------------|----------------|
| Q < 0 | Non-modular (worse than random) |
| 0 ≤ Q < 0.3 | Weak community structure |
| 0.3 ≤ Q ≤ 0.5 | Moderate community structure |
| Q > 0.5 | Strong community structure |
| Q → 1 | Perfect modularity |

Research Reference: "Fast Algorithm for Modularity-Based Graph Clustering" - https://cdn.aaai.org/ojs/8455/8455-13-11983-1-2-20201228.pdf


4.3 Graph-Based Software Modularization

Research Paper: "A graph-based clustering algorithm for software systems modularization"

Source: ScienceDirect - https://www.sciencedirect.com/science/article/abs/pii/S0950584920302147

Key Points:

  • Clustering algorithms partition source code into manageable modules
  • Resulting decomposition is called software system structure
  • Due to NP-hardness, evolutionary approaches are commonly used
  • Objectives:
    • Minimize inter-cluster connections
    • Maximize intra-cluster connections
    • Maximize overall clustering quality

4.4 Topological Sorting for Layer Detection

Algorithm Description:

Layers can be inferred from dependency graph topology:

  • Layer 0 (Domain): Nodes with no outgoing dependencies to other layers
  • Layer 1 (Application): Nodes depending only on Layer 0
  • Layer 2+ (Infrastructure): Nodes depending on lower layers

Pseudocode:

function detectLayers(graph):
    layers = Map()
    visited = Set()

    function dfs(node):
        if layers.has(node): return layers.get(node)
        if visited.has(node): return 0  // Cycle detected

        visited.add(node)
        deps = graph.getDependencies(node)

        if deps.isEmpty():
            layers.set(node, 0)  // Leaf node = Domain
            return 0

        maxDepth = max(deps.map(dfs))
        layers.set(node, maxDepth + 1)
        return maxDepth + 1

    graph.nodes.forEach(dfs)
    return layers

Limitation: Assumes acyclic graph; circular dependencies break this approach.
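
A runnable variant of the pseudocode, as a minimal sketch: it assigns each file its longest-path depth and, instead of silently returning 0 on a cycle, records the back edges it had to skip. The `LayerGraph` type, the `detectLayers` signature, and the sample file names are assumptions made for illustration.

```typescript
// Adjacency list: file -> files it imports (names are illustrative).
type LayerGraph = Map<string, string[]>;

interface LayerResult {
    layers: Map<string, number>;   // file -> inferred depth (0 = no dependencies)
    cycleEdges: [string, string][]; // back edges that were skipped
}

function detectLayers(graph: LayerGraph): LayerResult {
    const layers = new Map<string, number>();
    const inProgress = new Set<string>();
    const cycleEdges: [string, string][] = [];

    function depth(node: string): number {
        const known = layers.get(node);
        if (known !== undefined) return known;

        inProgress.add(node);
        let maxDependencyDepth = -1; // stays -1 for leaf nodes, giving depth 0
        for (const dep of graph.get(node) ?? []) {
            if (inProgress.has(dep)) {
                // Back edge: dep is an ancestor on the current DFS path.
                cycleEdges.push([node, dep]);
                continue;
            }
            maxDependencyDepth = Math.max(maxDependencyDepth, depth(dep));
        }
        inProgress.delete(node);

        layers.set(node, maxDependencyDepth + 1);
        return maxDependencyDepth + 1;
    }

    graph.forEach((_deps, node) => { depth(node); });
    return { layers, cycleEdges };
}

const sampleGraph: LayerGraph = new Map<string, string[]>([
    ["UserId.ts", []],
    ["Email.ts", []],
    ["User.ts", ["Email.ts", "UserId.ts"]],
    ["CreateUser.ts", ["User.ts"]],
    ["UserController.ts", ["User.ts", "CreateUser.ts"]],
]);

const sampleResult = detectLayers(sampleGraph);
sampleResult.layers.forEach((layer, file) => console.log(layer, file));
```

On this sample, `User.ts` lands at depth 1 and `UserController.ts` at depth 3, even though `User.ts` is a domain entity. Depth in the import graph does not map cleanly onto named layers, and when cycles are present the numbers also depend on traversal order, so the result is best treated as a suggestion.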


4.5 Graph Metrics for Code Quality Assessment

Useful Metrics:

| Metric | Description | Good Value |
|--------|-------------|------------|
| Modularity | Clustering quality | > 0.3 |
| Density | Ratio of actual to possible edges | Low for good separation |
| Clustering Coefficient | Local clustering | Domain-dependent |
| Cyclic Rate | % of circular deps | < 0.1 (10%) |
| Average Path Length | Mean dependency distance | Lower = more coupled |

Code Quality Interpretation:

if cyclicRate > 0.5:
    return "SPAGHETTI"  // Cannot determine architecture
if modularity < 0.2:
    return "MONOLITH"   // No clear separation
if modularity > 0.5:
    return "WELL_STRUCTURED"  // Can determine layers
return "MODERATE"
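
The cyclic rate used above is not defined precisely in this document. The sketch below assumes one possible definition: the share of files that belong to a strongly connected component with more than one member (or that import themselves). It uses Tarjan's algorithm; `CycleGraph`, `cyclicRate`, and `classifyStructure` are hypothetical names, and the thresholds are copied from the interpretation rules above.

```typescript
// Adjacency list: file -> files it imports (names are illustrative).
type CycleGraph = Map<string, string[]>;

// Tarjan's strongly connected components algorithm (recursive form).
function stronglyConnectedComponents(graph: CycleGraph): string[][] {
    let counter = 0;
    const index = new Map<string, number>();
    const lowlink = new Map<string, number>();
    const onStack = new Set<string>();
    const stack: string[] = [];
    const components: string[][] = [];

    function visit(node: string): void {
        index.set(node, counter);
        lowlink.set(node, counter);
        counter++;
        stack.push(node);
        onStack.add(node);

        for (const next of graph.get(node) ?? []) {
            if (!index.has(next)) {
                visit(next);
                lowlink.set(node, Math.min(lowlink.get(node)!, lowlink.get(next)!));
            } else if (onStack.has(next)) {
                lowlink.set(node, Math.min(lowlink.get(node)!, index.get(next)!));
            }
        }

        // node is the root of a component: pop the stack down to it.
        if (lowlink.get(node) === index.get(node)) {
            const component: string[] = [];
            let member: string;
            do {
                member = stack.pop()!;
                onStack.delete(member);
                component.push(member);
            } while (member !== node);
            components.push(component);
        }
    }

    graph.forEach((_deps, node) => { if (!index.has(node)) visit(node); });
    return components;
}

// Assumed definition: fraction of files that sit on at least one import cycle.
function cyclicRate(graph: CycleGraph): number {
    let total = 0;
    let cyclic = 0;
    for (const component of stronglyConnectedComponents(graph)) {
        total += component.length;
        const only = component[0];
        const selfLoop = component.length === 1 && (graph.get(only) ?? []).indexOf(only) !== -1;
        if (component.length > 1 || selfLoop) cyclic += component.length;
    }
    return total === 0 ? 0 : cyclic / total;
}

// Same thresholds as the interpretation rules above.
function classifyStructure(cyclic: number, modularityScore: number): string {
    if (cyclic > 0.5) return "SPAGHETTI";
    if (modularityScore < 0.2) return "MONOLITH";
    if (modularityScore > 0.5) return "WELL_STRUCTURED";
    return "MODERATE";
}

const tangled: CycleGraph = new Map<string, string[]>([
    ["A.ts", ["B.ts"]],
    ["B.ts", ["C.ts"]],
    ["C.ts", ["A.ts"]],
    ["D.ts", ["A.ts"]],
    ["E.ts", []],
]);

console.log(cyclicRate(tangled)); // 0.6 (A, B and C form one cycle)
console.log(classifyStructure(cyclicRate(tangled), 0.4)); // SPAGHETTI
```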

5. Configuration Patterns and Best Practices

5.1 Pattern Hierarchy

Level 1: Minimal Configuration

{
    "architecture": "clean-architecture"
}

Level 2: Custom Paths

{
    "architecture": "clean-architecture",
    "layers": {
        "domain": ["src/core", "src/domain"],
        "application": ["src/app", "src/use-cases"],
        "infrastructure": ["src/infra", "src/adapters"]
    }
}

Level 3: Full Control

{
    "layers": [
        {
            "name": "domain",
            "patterns": ["src/domain/**", "**/*.entity.ts"],
            "allowDependOn": []
        },
        {
            "name": "application",
            "patterns": ["src/application/**", "**/*.use-case.ts"],
            "allowDependOn": ["domain"]
        },
        {
            "name": "infrastructure",
            "patterns": ["src/infrastructure/**", "**/*.controller.ts"],
            "allowDependOn": ["domain", "application"]
        }
    ]
}
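
To show how a Level 3 configuration could drive a check, here is a minimal sketch that flags imports whose target layer is not listed in the source layer's `allowDependOn`. It assumes files have already been assigned to layers; the `LayerConfig` and `ImportEdge` types and the `findViolations` function are hypothetical, and treating same-layer imports as always allowed is an assumption rather than documented Guardian behavior.

```typescript
// Mirrors one entry of the "layers" array in the Level 3 configuration.
interface LayerConfig {
    name: string;
    patterns: string[];
    allowDependOn: string[];
}

// An import between two files whose layers are already known (hypothetical shape).
interface ImportEdge {
    from: string;
    to: string;
    fromLayer: string;
    toLayer: string;
}

function findViolations(layers: LayerConfig[], imports: ImportEdge[]): ImportEdge[] {
    const allowed = new Map<string, string[]>();
    for (const layer of layers) allowed.set(layer.name, layer.allowDependOn);

    return imports.filter((edge) => {
        // Assumption: imports within the same layer are always allowed.
        if (edge.fromLayer === edge.toLayer) return false;
        const targets = allowed.get(edge.fromLayer);
        // Files in layers the configuration does not know are not checked.
        if (targets === undefined) return false;
        return targets.indexOf(edge.toLayer) === -1;
    });
}

const level3Layers: LayerConfig[] = [
    { name: "domain", patterns: ["src/domain/**", "**/*.entity.ts"], allowDependOn: [] },
    { name: "application", patterns: ["src/application/**", "**/*.use-case.ts"], allowDependOn: ["domain"] },
    { name: "infrastructure", patterns: ["src/infrastructure/**", "**/*.controller.ts"], allowDependOn: ["domain", "application"] },
];

const sampleImports: ImportEdge[] = [
    { from: "src/application/CreateUser.ts", to: "src/domain/User.ts", fromLayer: "application", toLayer: "domain" },
    { from: "src/domain/User.ts", to: "src/infrastructure/Db.ts", fromLayer: "domain", toLayer: "infrastructure" },
];

for (const v of findViolations(level3Layers, sampleImports)) {
    console.log(`${v.fromLayer} -> ${v.toLayer}: ${v.from} imports ${v.to}`);
}
```

Only the second import is reported, because `domain` declares an empty `allowDependOn` list.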

5.2 Architecture Drift Detection in CI/CD

Best Practices from Industry:

Source: Firefly Academy - https://www.firefly.ai/academy/implementing-continuous-drift-detection-in-ci-cd-pipelines-with-github-actions-workflow

Source: Brainboard Blog - https://blog.brainboard.co/drift-detection-best-practices/

Key Recommendations:

  1. Integrate into Pipeline: Validate architecture on every code update
  2. Continuous Monitoring: Run automated scans daily minimum, hourly for active projects
  3. Enforce IaC-Only Changes: All changes through automated workflows
  4. Automated Reconciliation: Regular drift detection and correction
  5. Proper Alerting: Slack for minor drift, PagerDuty for critical
  6. Least Privilege: Limit who can bypass architecture checks
  7. Emergency Process: Document process for urgent manual changes
  8. Environment Refresh: Reset after each pipeline run

Example GitHub Actions Integration:

name: Architecture Check

on: [push, pull_request]

jobs:
    architecture:
        runs-on: ubuntu-latest
        steps:
            - uses: actions/checkout@v4

            - name: Check Architecture
              run: npx guardian check --strict

            - name: Generate Report
              if: failure()
              run: npx guardian report --format html

            - name: Upload Report
              if: failure()
              uses: actions/upload-artifact@v4
              with:
                  name: architecture-report
                  path: architecture-report.html

5.3 Presets for Common Architectures

Clean Architecture Preset:

{
    "preset": "clean-architecture",
    "layers": {
        "domain": {
            "patterns": ["**/domain/**", "**/entities/**", "**/core/**"],
            "allowDependOn": []
        },
        "application": {
            "patterns": ["**/application/**", "**/use-cases/**", "**/services/**"],
            "allowDependOn": ["domain"]
        },
        "infrastructure": {
            "patterns": ["**/infrastructure/**", "**/adapters/**", "**/api/**"],
            "allowDependOn": ["domain", "application"]
        }
    }
}

Hexagonal Architecture Preset:

{
    "preset": "hexagonal",
    "layers": {
        "core": {
            "patterns": ["**/core/**", "**/domain/**"],
            "allowDependOn": []
        },
        "ports": {
            "patterns": ["**/ports/**"],
            "allowDependOn": ["core"]
        },
        "adapters": {
            "patterns": ["**/adapters/**", "**/infrastructure/**"],
            "allowDependOn": ["core", "ports"]
        }
    }
}

NestJS Preset:

{
    "preset": "nestjs",
    "layers": {
        "domain": {
            "patterns": ["**/*.entity.ts", "**/entities/**"],
            "allowDependOn": []
        },
        "application": {
            "patterns": ["**/*.service.ts", "**/*.use-case.ts"],
            "allowDependOn": ["domain"]
        },
        "infrastructure": {
            "patterns": ["**/*.controller.ts", "**/*.module.ts", "**/*.resolver.ts"],
            "allowDependOn": ["domain", "application"]
        }
    }
}
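
The sketch below shows how preset patterns like these could be used to assign a file to a layer. It is illustrative only: `globToRegExp` is a deliberately small converter that understands just `**/`, `**`, and `*`, whereas a real implementation would more likely rely on a glob library such as micromatch. The `Preset` type and `classifyFile` function are hypothetical, and "first matching layer wins" is an assumption.

```typescript
// Layer definitions keyed by layer name, as in the "layers" object of a preset.
type Preset = Record<string, { patterns: string[]; allowDependOn: string[] }>;

// Simplified glob conversion: "**/" matches zero or more directories,
// "**" matches anything, "*" matches anything except a path separator.
function globToRegExp(glob: string): RegExp {
    let source = "";
    let i = 0;
    while (i < glob.length) {
        if (glob.startsWith("**/", i)) {
            source += "(?:.*/)?";
            i += 3;
        } else if (glob.startsWith("**", i)) {
            source += ".*";
            i += 2;
        } else if (glob[i] === "*") {
            source += "[^/]*";
            i += 1;
        } else {
            // Escape characters that are special in regular expressions.
            source += glob[i].replace(/[.+?^${}()|[\]\\]/g, "\\$&");
            i += 1;
        }
    }
    return new RegExp("^" + source + "$");
}

// Returns the first layer with a matching pattern, or undefined if none match.
function classifyFile(filePath: string, preset: Preset): string | undefined {
    for (const name of Object.keys(preset)) {
        const matches = preset[name].patterns.some((p) => globToRegExp(p).test(filePath));
        if (matches) return name;
    }
    return undefined;
}

const nestjsLayers: Preset = {
    domain: { patterns: ["**/*.entity.ts", "**/entities/**"], allowDependOn: [] },
    application: { patterns: ["**/*.service.ts", "**/*.use-case.ts"], allowDependOn: ["domain"] },
    infrastructure: {
        patterns: ["**/*.controller.ts", "**/*.module.ts", "**/*.resolver.ts"],
        allowDependOn: ["domain", "application"],
    },
};

console.log(classifyFile("src/users/user.entity.ts", nestjsLayers));      // domain
console.log(classifyFile("src/users/users.service.ts", nestjsLayers));    // application
console.log(classifyFile("src/users/users.controller.ts", nestjsLayers)); // infrastructure
console.log(classifyFile("README.md", nestjsLayers));                     // undefined
```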

6. Industry Consensus

6.1 Why Major Tools Don't Auto-Detect

| Tool | Auto-Detection | Reasoning |
|------|----------------|-----------|
| ArchUnit | No | "User knows their architecture best" |
| eslint-plugin-boundaries | No | "Too many structure variations" |
| Nx | No | "Tag-based approach is more flexible" |
| dependency-cruiser | No | "Regex patterns cover all cases" |
| SonarQube | ⚠️ Partial | "Basic analysis + config for accuracy" |

6.2 Common Themes Across Tools

  1. Explicit Configuration: All tools require user-defined rules
  2. Pattern Matching: Glob/regex patterns are universal
  3. Layered Rules: Allow/deny dependencies between layers
  4. CI/CD Integration: All support pipeline integration
  5. Visualization: Optional but valuable for understanding

6.3 Graph Analysis Position

Graph analysis is used for:

  • Circular dependency detection
  • Visualization
  • Metrics calculation
  • Suggestion generation

Graph analysis is NOT used for:

  • Primary layer detection
  • Automatic architecture classification
  • Rule enforcement

7. Recommendations for Guardian

┌─────────────────────────────────────────────────────────────┐
│                    Configuration Layer                       │
├─────────────────────────────────────────────────────────────┤
│  .guardianrc.json │ package.json │ CLI args │ Interactive   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Strategy Resolver                         │
├─────────────────────────────────────────────────────────────┤
│  1. Explicit Config (if .guardianrc.json exists)            │
│  2. Preset Detection (if preset specified)                  │
│  3. Smart Defaults (standard patterns)                      │
│  4. Generic Mode (fallback - minimal checks)                │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Analysis Engine                           │
├─────────────────────────────────────────────────────────────┤
│  Pattern Matcher │ Layer Detector │ Dependency Analyzer     │
└─────────────────────────────────────────────────────────────┘
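
The Strategy Resolver box describes a fixed order of precedence. A minimal sketch of that order follows; the `ResolverInput` fields and the `resolveStrategy` function are hypothetical names chosen for illustration, not an existing Guardian API.

```typescript
type Strategy = "explicit-config" | "preset" | "smart-defaults" | "generic";

// Facts the resolver needs about the project (hypothetical shape).
interface ResolverInput {
    configFileExists: boolean;      // a .guardianrc.json was found
    presetName?: string;            // preset requested by the user, if any
    standardFoldersFound: string[]; // e.g. ["src/domain", "src/application"]
}

// Applies the four strategies in the order shown in the diagram.
function resolveStrategy(input: ResolverInput): Strategy {
    if (input.configFileExists) return "explicit-config";
    if (input.presetName !== undefined) return "preset";
    if (input.standardFoldersFound.length > 0) return "smart-defaults";
    return "generic";
}

console.log(resolveStrategy({ configFileExists: false, presetName: "nestjs", standardFoldersFound: [] })); // preset
console.log(resolveStrategy({ configFileExists: false, standardFoldersFound: [] }));                       // generic
```

Keeping the resolver a pure function of a few facts makes the precedence easy to test and keeps file-system access out of the decision logic.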

7.2 Implementation Priorities

Phase 1: Configuration File Support

  • Add .guardianrc.json parser
  • Support custom layer patterns
  • Support custom DDD folder names
  • Validate configuration on load

Phase 2: Presets System

  • Clean Architecture preset
  • Hexagonal Architecture preset
  • NestJS preset
  • Feature-based preset

Phase 3: Smart Defaults

  • Try standard folder names first
  • Fall back to file naming patterns
  • Support common conventions

Phase 4: Interactive Setup

  • guardian init command
  • Project structure scanning
  • Configuration file generation
  • Preset recommendations

Phase 5: Generic Mode

  • Minimal checks without layer knowledge
  • Hardcode detection
  • Secret detection
  • Circular dependency detection
  • Basic naming conventions

7.3 Graph Analysis - Optional Feature Only

Graph analysis should be:

  • Optional: Not required for basic functionality
  • Informational: For visualization and metrics
  • Suggestive: Can propose configuration, not enforce it

CLI Commands:

guardian analyze --graph --output deps.svg    # Visualization
guardian metrics                               # Quality metrics
guardian suggest                               # Configuration suggestions

8. Additional Resources

Tutorials and Guides

  • Clean Architecture by Robert C. Martin (2017) - ISBN: 978-0134494166
  • Domain-Driven Design by Eric Evans (2003) - ISBN: 978-0321125217
  • Implementing Domain-Driven Design by Vaughn Vernon (2013) - ISBN: 978-0321834577

Conclusion

The research indicates that automatic architecture detection is unreliable and that none of the surveyed tools use it as their primary strategy. The recommended approach for Guardian is:

  1. Configuration-first: Support explicit layer definitions via .guardianrc.json
  2. Pattern-based: Use glob/regex patterns for flexible matching
  3. Presets: Provide pre-configured patterns for common architectures
  4. Smart defaults: Try standard conventions when no config exists
  5. Generic fallback: Provide useful checks even without architecture knowledge
  6. Graph analysis as optional: Use for visualization and suggestions only

This approach aligns with industry best practices from ArchUnit, eslint-plugin-boundaries, SonarQube, Nx, and dependency-cruiser.


Document Version: 1.0
Last Updated: 2025-11-27
Author: Guardian Research Team