File Discovery & Indexing
How CodeDD discovers, categorizes, and prepares files for analysis
Overview
Once your repository is securely cloned, CodeDD performs comprehensive file discovery and indexing. This stage maps your entire codebase, calculates metrics, and prepares files for AI-powered analysis—all while maintaining security through encryption.
Discovery Process
Intelligent File Scanning
CodeDD recursively scans your repository to identify all relevant files:
What Gets Scanned:
- All source code files (100+ language support)
- Configuration files (Docker, Kubernetes, CI/CD)
- Documentation files (README, API docs)
- Infrastructure-as-Code (Terraform, CloudFormation)
- Database schemas and migrations
- Security configurations
What Gets Excluded:
- Binary files and compiled artifacts
- Version control directories (.git)
- Dependencies (node_modules, vendor, venv)
- Build outputs (dist, build, target)
- Symlinked directories (to avoid duplicates)
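The exclusion rules above can be sketched as a simple path filter. This is an illustrative minimal version, not CodeDD's actual rule set; the directory names are taken from the list above:

```python
from pathlib import Path

# Illustrative exclusion set mirroring the list above (the real
# scanner's rules may be richer, e.g. glob patterns per ecosystem).
EXCLUDED_DIRS = {".git", "node_modules", "vendor", "venv", "dist", "build", "target"}

def should_skip(path: Path) -> bool:
    """Return True if any component of the path is an excluded directory."""
    return any(part in EXCLUDED_DIRS for part in path.parts)
```

A file under node_modules is skipped no matter how deeply it is nested, because every path component is checked.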
Performance Optimization
Parallel Processing:
- Multi-threaded directory traversal (up to 14 concurrent workers)
- Adaptive batch sizing based on directory depth
- Queue-based architecture for efficient processing
- Typical performance: 5,000+ directories/second
Scalability:
- Handles repositories of any size (tested up to 500k+ files)
- Memory-efficient streaming for large file sets
- Progress tracking for large repositories
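A queue-based parallel traversal of this kind can be sketched with a thread pool that scans one frontier of directories per round. This is a simplified model of the approach, not CodeDD's implementation; the worker count and function name are illustrative:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def scan_repository(root: str, workers: int = 14) -> list[str]:
    """Breadth-first directory scan: each round, a thread pool scans
    the current frontier of directories and collects the next one."""
    files: list[str] = []
    frontier = [root]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            def scan_one(directory: str) -> list[str]:
                subdirs: list[str] = []
                try:
                    with os.scandir(directory) as entries:
                        for entry in entries:
                            if entry.is_dir(follow_symlinks=False):
                                subdirs.append(entry.path)
                            elif entry.is_file(follow_symlinks=False):
                                files.append(entry.path)  # list.append is thread-safe
                except PermissionError:
                    pass  # unreadable directories are skipped, scan continues
                return subdirs
            next_frontier: list[str] = []
            for subdirs in pool.map(scan_one, frontier):
                next_frontier.extend(subdirs)
            frontier = next_frontier
    return files
```

Batching by frontier keeps memory bounded: only one level of the directory tree is held at a time, regardless of repository size.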
File Categorization
Automatic Type Detection
Every file is automatically categorized:
Source Code:
- Application code (Python, JavaScript, Java, Go, etc.)
- Test files
- Language-specific scripts
Configuration:
- Application configuration (JSON, YAML, TOML)
- Environment files
- Build configurations
Infrastructure:
- Container definitions (Dockerfile, docker-compose)
- Orchestration manifests (Kubernetes YAML)
- Infrastructure-as-Code templates
Security:
- Authentication configurations
- Secrets management files
- Security policy definitions
Documentation:
- README files
- API documentation
- Architecture diagrams (as code)
Extension Mapping
CodeDD maintains a comprehensive extension database:
- 100+ programming languages
- Framework-specific file patterns
- Custom configuration formats
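An extension-to-category lookup of this kind can be sketched as follows. The mapping below is a tiny illustrative excerpt, not the real database, which covers 100+ languages:

```python
from pathlib import Path

# Tiny illustrative excerpt of an extension/filename -> category map.
EXTENSION_CATEGORIES = {
    ".py": "source", ".js": "source", ".java": "source", ".go": "source",
    ".json": "configuration", ".yaml": "configuration", ".toml": "configuration",
    "Dockerfile": "infrastructure", ".tf": "infrastructure",
    ".md": "documentation",
}

def categorize(filename: str) -> str:
    """Categorize by exact filename first (e.g. Dockerfile), then by extension."""
    name = Path(filename)
    if name.name in EXTENSION_CATEGORIES:
        return EXTENSION_CATEGORIES[name.name]
    return EXTENSION_CATEGORIES.get(name.suffix, "other")
```

Checking the exact filename before the extension is what lets extensionless patterns like Dockerfile be matched alongside ordinary suffixes.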
Metrics Calculation
Lines of Code Analysis
For each file, CodeDD calculates:
Code Lines:
- Non-empty lines of actual code
- Excluding comments and whitespace
- Language-aware parsing
Documentation Lines:
- Comments and docstrings
- README and documentation files
- Inline documentation
Complexity Indicators:
- File size and structure
- Nesting depth
- Import/dependency patterns
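The code/comment/blank split can be sketched with a naive single-line classifier. This is far simpler than language-aware parsing (it ignores block comments and docstrings) and is only meant to show the shape of the calculation:

```python
def count_lines(text: str, comment_prefix: str = "#") -> dict[str, int]:
    """Naive line classifier: code vs. comment vs. blank.
    Only handles single-line comments; real parsing is language-aware."""
    counts = {"code": 0, "comment": 0, "blank": 0}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            counts["blank"] += 1
        elif stripped.startswith(comment_prefix):
            counts["comment"] += 1
        else:
            counts["code"] += 1
    return counts
```

For a file containing `x = 1`, a `# note` comment, a blank line, and `y = 2`, this yields two code lines, one comment line, and one blank line.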
Tools Used
Primary: LineCounter
- Fast, accurate line counting
- Language-agnostic
- Handles edge cases (mixed content, embedded code)
Fallback: Python Parser
- Used when LineCounter unavailable
- Basic comment detection
- Ensures analysis continues even with tool failures
Git History Analysis
Commit Metadata
For each file, CodeDD extracts:
Temporal Data:
- Last modified date
- Commit frequency
- Age of code
Developer Activity:
- File ownership
- Number of contributors
- Commit patterns
Change Velocity:
- Recent modifications
- Code churn indicators
- Stability metrics
Batch Processing
- Efficient batch Git operations (500 files per query)
- Minimizes repository access overhead
- Handles large histories gracefully
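Batched Git metadata extraction can be sketched by chunking the file list and running one `git log` walk per chunk instead of one per file. This is an illustrative approach, not CodeDD's actual query; the function name and output format are assumptions:

```python
import subprocess

def last_modified_batch(repo: str, paths: list[str],
                        batch_size: int = 500) -> dict[str, str]:
    """Map repo-relative paths to their last-commit author date (ISO 8601),
    using one `git log` invocation per batch of pathspecs."""
    dates: dict[str, str] = {}
    for i in range(0, len(paths), batch_size):
        batch = paths[i:i + batch_size]
        wanted = set(batch)
        # --format=%x00%aI prints a NUL-prefixed ISO date per commit;
        # --name-only then lists the files that commit touched.
        out = subprocess.run(
            ["git", "-C", repo, "log", "--format=%x00%aI", "--name-only", "--", *batch],
            capture_output=True, text=True, check=True,
        ).stdout
        current = ""
        for line in out.splitlines():
            if line.startswith("\x00"):
                current = line[1:]
            elif line in wanted and line not in dates:
                dates[line] = current  # log is newest-first, so the first hit wins
    return dates
```

One log walk per 500 files is what keeps repository access overhead low compared with a per-file `git log` call.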
Encryption at Rest
Immediate Encryption
Critical Security Feature:
As soon as files are analyzed, they are encrypted in place:
- File Read: Content read into memory for analysis
- Metrics Calculated: LOC, complexity, type detection
- Immediate Encryption: Original file overwritten with encrypted version
- Key Management: Unique encryption key per audit
- Memory Clearing: Original content purged from memory
Encryption Standard:
- AES-256-GCM encryption
- Unique initialization vector per file
- Authenticated encryption (prevents tampering)
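The encrypt-in-place step can be sketched with AES-256-GCM and a fresh nonce per file. This minimal version assumes the third-party `cryptography` package and prepends the 12-byte nonce to the ciphertext; the function name and file layout are illustrative, not CodeDD's actual format:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # third-party: cryptography

def encrypt_in_place(path: str, key: bytes) -> None:
    """Overwrite a file with its AES-256-GCM ciphertext.
    Layout (illustrative): 12-byte nonce || ciphertext+auth tag."""
    nonce = os.urandom(12)  # unique initialization vector per file
    with open(path, "rb") as f:
        plaintext = f.read()
    # GCM is authenticated encryption: decryption fails if the file is tampered with.
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    with open(path, "wb") as f:
        f.write(nonce + ciphertext)
```

A 256-bit key (e.g. `AESGCM.generate_key(bit_length=256)`) would be generated once per audit; the per-file nonce is what makes reusing that key across files safe.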
Storage Architecture
Encrypted File System:
/cache/audit-uuid/
├── repo-name/
│ ├── file1.py (encrypted)
│ ├── file2.js (encrypted)
│ └── folder/
│ └── file3.java (encrypted)
Metadata Storage:
- File paths and metrics in graph database (TypeDB)
- No file content in database
- Database queries by metadata only
Folder Structure Mapping
Hierarchical Organization
CodeDD builds a complete folder hierarchy:
Folder Metrics:
- Aggregated lines of code per folder
- File count and distribution
- Language breakdown per folder
Relationship Mapping:
- Parent-child folder relations
- File-to-folder associations
- Root directory linkage
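Rolling file metrics up into every ancestor folder can be sketched in a few lines. This is an illustrative aggregation over a path-to-LOC mapping, not the actual pipeline code:

```python
from collections import defaultdict
from pathlib import PurePosixPath

def aggregate_loc(file_loc: dict[str, int]) -> dict[str, int]:
    """Sum file-level lines of code into every ancestor folder."""
    totals: dict[str, int] = defaultdict(int)
    for path, loc in file_loc.items():
        # PurePosixPath("src/app/main.py").parents -> src/app, src, .
        for parent in PurePosixPath(path).parents:
            totals[str(parent)] += loc
    return dict(totals)
```

Each file contributes to all of its ancestors, so a folder's total is the sum over its entire subtree.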
Domain Identification
Folders are later grouped into logical domains:
- Frontend (UI components)
- Backend (APIs, business logic)
- Database (schemas, migrations)
- Infrastructure (DevOps, configuration)
- Tests (test suites)
Database Schema
TypeDB Graph Structure
CodeDD uses a graph database to represent relationships:
Entities:
- Audit (root entity)
- Root Directory (repository clone)
- Folders (hierarchy)
- Files (individual source files)
Relationships:
- Audit → Root Directory (audit_directory)
- Root Directory → Folders (directory_content)
- Folders → Sub-Folders (directory_content)
- Folders → Files (directory_content)
Query Performance
Optimizations:
- Batched writes (25 operations per sub-transaction)
- Concurrent transaction processing
- Thread pooling (up to 16 workers)
- Retry logic with exponential backoff
Typical Performance:
- 50-100 files/second database writes
- Parallel folder and file relation creation
- Progress tracking via status updates
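The batching and retry behavior described above can be sketched generically. The helper names are illustrative, and `write_batch` stands in for whatever driver call commits a sub-transaction:

```python
import time

def write_with_retry(write_batch, batch, retries: int = 5,
                     base_delay: float = 0.05) -> None:
    """Retry a failed batch write with exponential backoff (0.05s, 0.1s, 0.2s, ...)."""
    for attempt in range(retries):
        try:
            write_batch(batch)
            return
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))

def flush_in_batches(ops: list, write_batch, batch_size: int = 25) -> None:
    """Split a stream of operations into sub-transactions of 25 writes each."""
    for i in range(0, len(ops), batch_size):
        write_with_retry(write_batch, ops[i:i + batch_size])
```

Small sub-transactions bound the cost of a retry: only the failed batch of 25 operations is replayed, not the whole import.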
Status Reporting
Real-Time Progress
During file import, you see live updates:
"Repository imported | 1,247/10,523 |
Processing data: 1,247/10,523 files (11.9%) at 87.2 files/s"
Information Provided:
- Files processed / Total files
- Percentage complete
- Processing speed (files per second)
- Estimated time remaining
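The status string above can be reproduced with a small formatter. This is an illustrative sketch of the formatting, not the actual reporting code:

```python
def progress_line(done: int, total: int, rate: float) -> str:
    """Format the live import status: counts, percentage, and throughput."""
    pct = 100.0 * done / total if total else 100.0
    return f"Processing data: {done:,}/{total:,} files ({pct:.1f}%) at {rate:.1f} files/s"
```

With the remaining-file count and the current rate, the estimated time remaining follows directly as `(total - done) / rate` seconds.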
Error Handling
Resilient Processing
File-Level Errors:
- Individual file failures don't stop the audit
- Detailed error logging for troubleshooting
- Fallback mechanisms for metrics calculation
Common Issues:
- Permission Errors: Skipped, logged, audit continues
- Encoding Issues: Fallback to binary mode
- Large Files: Streaming processing for files >100MB
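The resilient, file-at-a-time error handling can be sketched as a loop that logs and skips rather than aborting. The function names are illustrative, and `process_one` stands in for the per-file analysis step:

```python
def process_all(paths, process_one, log):
    """Process every file; individual failures are logged and skipped,
    so one bad file never stops the audit."""
    processed, skipped = [], []
    for path in paths:
        try:
            process_one(path)
            processed.append(path)
        except PermissionError as exc:
            skipped.append((path, "permission"))
            log(f"skipped {path}: {exc}")
        except UnicodeDecodeError:
            skipped.append((path, "encoding"))  # a real pipeline could retry in binary mode
            log(f"skipped {path}: encoding")
        except Exception as exc:
            skipped.append((path, "error"))
            log(f"error on {path}: {type(exc).__name__}")  # log the type, not the content
    return processed, skipped
```

Returning the skipped list with a reason per file is what feeds the audit trail described below.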
Audit Trail
All operations logged:
- Files successfully processed
- Files skipped (with reason)
- Errors encountered (without exposing content)
- Performance metrics
Symlink Handling
Duplicate Prevention
Challenge: Symlinks can create duplicate entries
Solution:
- Symlinks are detected and resolved to real paths
- Only real files are counted and analyzed
- Prevents inflated LOC metrics
- Avoids redundant AI analysis
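Symlink deduplication can be sketched by resolving every path to its real location and keeping each real file once. This is an illustrative helper, not the scanner's actual code:

```python
import os

def dedupe_real_paths(paths: list[str]) -> list[str]:
    """Resolve symlinks to real paths and keep each real file once,
    preserving first-seen order."""
    seen: set[str] = set()
    unique: list[str] = []
    for p in paths:
        real = os.path.realpath(p)  # follows symlinks to the canonical path
        if real not in seen:
            seen.add(real)
            unique.append(real)
    return unique
```

A file and a symlink pointing at it resolve to the same real path, so the pair counts once toward LOC and is analyzed once.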
Resource Management
Memory Efficiency
Streaming Architecture:
- Files processed in batches
- Results written to database incrementally
- Memory cleared after each batch
- No full repository in memory
Concurrency Control
Thread Pools:
- File Processing: 4-16 threads (CPU-bound)
- Database Writes: 2-4 threads (I/O-bound)
- Prevents over-subscription
- Adaptive to system resources
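Adaptive pool sizing within those bounds can be sketched from the host's CPU count. The exact sizing formula here is an assumption; only the 4-16 and 2-4 ranges come from the description above:

```python
import os

def pool_sizes() -> tuple[int, int]:
    """Pick thread-pool sizes from the machine's CPU count:
    CPU-bound file processing gets 4-16 threads, I/O-bound DB writes 2-4."""
    cpus = os.cpu_count() or 4
    file_workers = min(16, max(4, cpus))        # clamp to the 4-16 range
    db_workers = min(4, max(2, cpus // 4))      # clamp to the 2-4 range
    return file_workers, db_workers
```

Keeping the two pools separate and small prevents over-subscription: CPU-heavy parsing and blocking database writes never compete for the same saturated pool.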
What Happens Next
After file discovery and indexing:
- Selection: Files marked for detailed AI analysis
- Priority: Critical files (security, config) prioritized
- Batching: Files grouped for efficient processing
- Handoff: Encrypted files ready for AI audit stage
Key Takeaways
For Investors:
- Complete repository visibility—no files missed
- Accurate LOC metrics for valuation models
- Historical commit data for risk assessment
- All data encrypted throughout process
For CTOs:
- Language-agnostic analysis
- Handles monorepos and complex structures
- Git history integrated into analysis
- No manual file selection required
Next Steps
- Understand AI-Powered File Analysis
- Review Data Encryption
- Learn about Data Deletion

