File Discovery & Indexing

How CodeDD discovers, categorizes, and prepares files for analysis

Overview

Once your repository is securely cloned, CodeDD performs comprehensive file discovery and indexing. This stage maps your entire codebase, calculates metrics, and prepares files for AI-powered analysis—all while maintaining security through encryption.

Discovery Process

Intelligent File Scanning

CodeDD recursively scans your repository to identify all relevant files:

What Gets Scanned:

  • All source code files (100+ language support)
  • Configuration files (Docker, Kubernetes, CI/CD)
  • Documentation files (README, API docs)
  • Infrastructure-as-Code (Terraform, CloudFormation)
  • Database schemas and migrations
  • Security configurations

What Gets Excluded:

  • Binary files and compiled artifacts
  • Version control directories (.git)
  • Dependencies (node_modules, vendor, venv)
  • Build outputs (dist, build, target)
  • Symlinked directories (to avoid duplicates)
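
The inclusion/exclusion rules above can be sketched as a pruning directory walk. This is a minimal illustration, not CodeDD's implementation; the exclusion set is a hypothetical subset of the real one.

```python
import os

# Illustrative exclusion set; CodeDD's actual list is more extensive.
EXCLUDED_DIRS = {".git", "node_modules", "vendor", "venv", "dist", "build", "target"}

def discover_files(root: str):
    """Yield candidate file paths, pruning excluded directories in place."""
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Mutating dirnames in place stops os.walk from descending
        # into excluded directories at all.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Pruning during traversal (rather than filtering afterwards) is what keeps large dependency trees like `node_modules` from being read at all.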

Performance Optimization

Parallel Processing:

  • Multi-threaded directory traversal (up to 14 concurrent workers)
  • Adaptive batch sizing based on directory depth
  • Queue-based architecture for efficient processing
  • Typical performance: 5,000+ directories/second
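
A queue-based parallel traversal along these lines might look as follows; the worker count mirrors the figure above, but the code itself is a simplified sketch, not CodeDD internals.

```python
import os
import queue
import threading

def parallel_scan(root: str, workers: int = 14):
    """Queue-based traversal: each worker pops a directory, enqueues its
    sub-directories, and records its files."""
    work = queue.Queue()
    work.put(root)
    files, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                path = work.get(timeout=0.1)
            except queue.Empty:
                return  # queue drained; worker exits
            try:
                for entry in os.scandir(path):
                    if entry.is_dir(follow_symlinks=False):
                        work.put(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        with lock:
                            files.append(entry.path)
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    work.join()   # blocks until every queued directory is fully processed
    for t in threads:
        t.join()
    return files
```

Because workers enqueue sub-directories as they go, deep and wide trees are both consumed in parallel without any worker ever holding the full tree in memory.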

Scalability:

  • Handles repositories of any size (tested up to 500k+ files)
  • Memory-efficient streaming for large file sets
  • Progress tracking for large repositories

File Categorization

Automatic Type Detection

Every file is automatically categorized:

Source Code:

  • Application code (Python, JavaScript, Java, Go, etc.)
  • Test files
  • Language-specific scripts

Configuration:

  • Application configuration (JSON, YAML, TOML)
  • Environment files
  • Build configurations

Infrastructure:

  • Container definitions (Dockerfile, docker-compose)
  • Orchestration manifests (Kubernetes YAML)
  • Infrastructure-as-Code templates

Security:

  • Authentication configurations
  • Secrets management files
  • Security policy definitions

Documentation:

  • README files
  • API documentation
  • Architecture diagrams (as code)

Extension Mapping

CodeDD maintains a comprehensive extension database:

  • 100+ programming languages
  • Framework-specific file patterns
  • Custom configuration formats
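
Conceptually, such a database is a lookup from extension (or special filename) to language and category. The entries below are a hypothetical slice for illustration; the real map covers 100+ languages.

```python
import os

# Hypothetical excerpt of an extension-to-(language, category) map.
EXTENSION_MAP = {
    ".py":  ("Python",     "source"),
    ".ts":  ("TypeScript", "source"),
    ".tf":  ("Terraform",  "infrastructure"),
    ".yml": ("YAML",       "configuration"),
}

# Some files are identified by name rather than extension.
SPECIAL_FILENAMES = {
    "Dockerfile": ("Dockerfile", "infrastructure"),
    "README.md":  ("Markdown",   "documentation"),
}

def categorize(path: str):
    """Return (language, category) for a path, falling back to Unknown."""
    name = os.path.basename(path)
    if name in SPECIAL_FILENAMES:
        return SPECIAL_FILENAMES[name]
    return EXTENSION_MAP.get(os.path.splitext(name)[1].lower(), ("Unknown", "other"))
```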

Metrics Calculation

Lines of Code Analysis

For each file, CodeDD calculates:

Code Lines:

  • Non-empty lines of actual code
  • Excluding comments and whitespace
  • Language-aware parsing

Documentation Lines:

  • Comments and docstrings
  • README and documentation files
  • Inline documentation

Complexity Indicators:

  • File size and structure
  • Nesting depth
  • Import/dependency patterns

Tools Used

Primary: LineCounter

  • Fast, accurate line counting
  • Language-agnostic
  • Handles edge cases (mixed content, embedded code)

Fallback: Python Parser

  • Used when LineCounter unavailable
  • Basic comment detection
  • Ensures analysis continues even with tool failures
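
A fallback counter of this kind can be sketched in a few lines. This version only recognizes single-line comments via a prefix; real tools additionally handle block comments, docstrings, and mixed content.

```python
def count_lines(source: str, comment_prefix: str = "#"):
    """Split source lines into (code, comment, blank) counts.
    Simplified sketch: no block-comment or docstring handling."""
    code = comments = blank = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            blank += 1
        elif stripped.startswith(comment_prefix):
            comments += 1
        else:
            code += 1
    return code, comments, blank
```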

Git History Analysis

Commit Metadata

For each file, CodeDD extracts:

Temporal Data:

  • Last modified date
  • Commit frequency
  • Age of code

Developer Activity:

  • File ownership
  • Number of contributors
  • Commit patterns

Change Velocity:

  • Recent modifications
  • Code churn indicators
  • Stability metrics

Batch Processing

  • Efficient batch Git operations (500 files per query)
  • Minimizes repository access overhead
  • Handles large histories gracefully
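
One way to batch this work is to walk the history once, newest commit first, instead of spawning a Git process per file: the first commit in which a path appears is its last modification. A sketch under that assumption (not CodeDD's exact commands):

```python
import subprocess

def parse_log(log_text: str) -> dict:
    """Parse `git log --format=@%cI --name-only` output (newest first):
    the first commit a path appears in gives its last-modified date."""
    dates, current = {}, None
    for line in log_text.splitlines():
        if line.startswith("@"):
            current = line[1:]          # ISO 8601 committer date
        elif line and line not in dates:
            dates[line] = current
    return dates

def last_modified_dates(repo_dir: str) -> dict:
    """One git process for the whole repository instead of one per file."""
    out = subprocess.run(
        ["git", "log", "--format=@%cI", "--name-only"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return parse_log(out)
```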

Encryption at Rest

Immediate Encryption

Critical Security Feature:

As soon as files are analyzed, they are encrypted in place:

  1. File Read: Content read into memory for analysis
  2. Metrics Calculated: LOC, complexity, type detection
  3. Immediate Encryption: Original file overwritten with encrypted version
  4. Key Management: Unique encryption key per audit
  5. Memory Clearing: Original content purged from memory

Encryption Standard:

  • AES-256-GCM encryption
  • Unique initialization vector per file
  • Authenticated encryption (prevents tampering)
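
The encrypt-in-place step can be sketched with the `cryptography` package's AES-GCM primitive. Key handling here is deliberately simplified; the key-per-audit lifecycle and memory clearing are outside this illustration.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_file_in_place(path: str, key: bytes) -> None:
    """Read, encrypt, and overwrite the original file. A fresh 12-byte
    nonce per file is prepended to the ciphertext; GCM's authentication
    tag means any tampering is detected at decrypt time."""
    with open(path, "rb") as f:
        plaintext = f.read()
    nonce = os.urandom(12)                        # unique IV per file
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    with open(path, "wb") as f:
        f.write(nonce + ciphertext)

def decrypt_file(path: str, key: bytes) -> bytes:
    with open(path, "rb") as f:
        blob = f.read()
    return AESGCM(key).decrypt(blob[:12], blob[12:], None)
```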

Storage Architecture

Encrypted File System:

/cache/audit-uuid/
  ├── repo-name/
  │   ├── file1.py (encrypted)
  │   ├── file2.js (encrypted)
  │   └── folder/
  │       └── file3.java (encrypted)

Metadata Storage:

  • File paths and metrics in graph database (TypeDB)
  • No file content in database
  • Database queries by metadata only

Folder Structure Mapping

Hierarchical Organization

CodeDD builds a complete folder hierarchy:

Folder Metrics:

  • Aggregated lines of code per folder
  • File count and distribution
  • Language breakdown per folder

Relationship Mapping:

  • Parent-child folder relations
  • File-to-folder associations
  • Root directory linkage
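
Rolling per-file metrics up the hierarchy is essentially a prefix aggregation over folder paths. A minimal sketch, assuming repo-relative POSIX paths and per-file LOC as input:

```python
import posixpath
from collections import defaultdict

def aggregate_folder_loc(file_loc: dict) -> dict:
    """Roll per-file LOC up to every ancestor folder; '.' is the root."""
    totals = defaultdict(int)
    for path, loc in file_loc.items():
        folder = posixpath.dirname(path)
        while True:
            totals[folder or "."] += loc
            if not folder:
                break
            folder = posixpath.dirname(folder)
    return dict(totals)
```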

Domain Identification

Folders are later grouped into logical domains:

  • Frontend (UI components)
  • Backend (APIs, business logic)
  • Database (schemas, migrations)
  • Infrastructure (DevOps, configuration)
  • Tests (test suites)

Database Schema

TypeDB Graph Structure

CodeDD uses a graph database to represent relationships:

Entities:

- Audit (root entity)
- Root Directory (repository clone)
- Folders (hierarchy)
- Files (individual source files)

Relationships:

- Audit → Root Directory (audit_directory)
- Root Directory → Folders (directory_content)
- Folders → Sub-Folders (directory_content)
- Folders → Files (directory_content)

Query Performance

Optimizations:

  • Batched writes (25 operations per sub-transaction)
  • Concurrent transaction processing
  • Thread pooling (up to 16 workers)
  • Retry logic with exponential backoff
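
Retry with exponential backoff, as applied to transient write failures, follows a standard pattern; attempt count and base delay below are illustrative parameters, not CodeDD's configuration.

```python
import random
import time

def with_retries(op, attempts: int = 5, base_delay: float = 0.2):
    """Run `op` (e.g. a batch database write), retrying on failure with
    exponentially growing delays plus a little jitter."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise                     # exhausted retries: surface the error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

The jitter term keeps concurrent workers from retrying in lockstep against the same contended resource.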

Typical Performance:

  • 50-100 files/second database writes
  • Parallel folder and file relation creation
  • Progress tracking via status updates

Status Reporting

Real-Time Progress

During file import, you see live updates:

"Repository imported | 1,247/10,523 | Processing data: 1,247/10,523 files (11.9%) at 87.2 files/s"

Information Provided:

  • Files processed / Total files
  • Percentage complete
  • Processing speed (files per second)
  • Estimated time remaining
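
Deriving the status line above is straightforward arithmetic; this sketch (with a hypothetical `progress_line` helper) shows how percentage and ETA follow from the raw counts and rate:

```python
def progress_line(done: int, total: int, rate: float) -> str:
    """Format a live status line; ETA is remaining files over rate."""
    pct = done / total * 100
    eta = (total - done) / rate if rate else float("inf")
    return (f"Processing data: {done:,}/{total:,} files "
            f"({pct:.1f}%) at {rate:.1f} files/s, ~{eta:.0f}s remaining")
```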

Error Handling

Resilient Processing

File-Level Errors:

  • Individual file failures don't stop the audit
  • Detailed error logging for troubleshooting
  • Fallback mechanisms for metrics calculation

Common Issues:

  • Permission Errors: Skipped, logged, audit continues
  • Encoding Issues: Fallback to binary mode
  • Large Files: Streaming processing for files >100MB
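
The encoding fallback can be illustrated like this (a sketch, not the real error-handling path): try strict UTF-8 first, fall back to latin-1 (which accepts any byte sequence), and log-and-skip on permission errors so the audit continues.

```python
import logging

def read_text_resilient(path: str):
    """UTF-8 first; latin-1 on decode errors; None on unreadable files."""
    try:
        try:
            with open(path, encoding="utf-8") as f:
                return f.read()
        except UnicodeDecodeError:
            with open(path, encoding="latin-1") as f:
                return f.read()
    except OSError as exc:
        logging.warning("Skipping %s: %s", path, exc)
        return None
```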

Audit Trail

All operations logged:

  • Files successfully processed
  • Files skipped (with reason)
  • Errors encountered (without exposing content)
  • Performance metrics

Symlink Handling

Duplicate Prevention

Challenge: Symlinks can create duplicate entries

Solution:

  • Symlinks are detected and resolved to real paths
  • Only real files are counted and analyzed
  • Prevents inflated LOC metrics
  • Avoids redundant AI analysis
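
Resolving symlinks to real paths and keeping one entry per target is a small deduplication step; a minimal sketch:

```python
import os

def dedupe_paths(paths):
    """Resolve each path to its real location and keep one entry per
    real file, so symlinked copies are not counted twice."""
    seen, unique = set(), []
    for p in paths:
        real = os.path.realpath(p)
        if real not in seen:
            seen.add(real)
            unique.append(real)
    return unique
```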

Resource Management

Memory Efficiency

Streaming Architecture:

  • Files processed in batches
  • Results written to database incrementally
  • Memory cleared after each batch
  • No full repository in memory

Concurrency Control

Thread Pools:

  • File Processing: 4-16 threads (CPU-bound)
  • Database Writes: 2-4 threads (I/O-bound)
  • Prevents over-subscription
  • Adaptive to system resources

What Happens Next

After file discovery and indexing:

  1. Selection: Files marked for detailed AI analysis
  2. Priority: Critical files (security, config) prioritized
  3. Batching: Files grouped for efficient processing
  4. Handoff: Encrypted files ready for AI audit stage

Key Takeaways

For Investors:

  • Complete repository visibility—no files missed
  • Accurate LOC metrics for valuation models
  • Historical commit data for risk assessment
  • All data encrypted throughout process

For CTOs:

  • Language-agnostic analysis
  • Handles monorepos and complex structures
  • Git history integrated into analysis
  • No manual file selection required

Next Steps