Machine Learning Overview

DefenSys uses machine learning for anomaly-based threat detection. The ML pipeline complements rule-based detection by identifying behavioral anomalies that may indicate novel or zero-day attacks.

How It Works

When a packet is captured, the system extracts 20 numerical features from it (and from the flow it belongs to). These features are passed to an anomaly detection model. The model outputs an anomaly score and confidence. If the score indicates an anomaly and confidence is high, the result is combined with rule-based detection to decide whether to raise an alert.

Pipeline (Code)

The CombinedDetectionEngine orchestrates the flow. For each packet:

// backend/services/combinedDetectionEngine.js
async analyzePacket(packet) {
  // 1. Extract 20-D feature vector
  const features = this.featureExtractor.extractFeatures(packet);

  // 2. Run rule-based detection (port scan, SYN flood, etc.)
  const ruleResult = this.runRuleBasedDetection(packet);

  // 3. Run ML detection
  const mlResult = await this.runMLDetection(features, packet);

  // 4. Combine results and decide if alert
  const combinedResult = this.combineResults(ruleResult, mlResult, packet);

  if (combinedResult.alert) {
    this.storeAlert(combinedResult, packet);
    this.emit("alert", combinedResult);
  }
  return combinedResult;
}

Architecture

  • Feature Extractor – Builds a 20-dimensional feature vector from each packet/flow. See Feature Extraction.
  • ML Inference Service – Runs predictions using either a Python subprocess (sklearn/PyOD) or a JavaScript fallback. See Inference Pipeline.
  • Combined Detection Engine – Merges rule-based and ML results into a single decision.
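
The merge step itself is not shown in this doc. A minimal sketch of how rule-based and ML results could be combined (illustrative only; the actual `combineResults()` in combinedDetectionEngine.js may weigh results differently):

```javascript
// Hypothetical sketch of result merging -- the real combineResults()
// may use different severity labels and weighting.
function combineResults(ruleResult, mlResult) {
  // Either detector alone can raise an alert; agreement raises severity.
  const alert = ruleResult.alert || mlResult.alert;
  const severity =
    ruleResult.alert && mlResult.alert ? "high"
    : ruleResult.alert ? "medium"
    : mlResult.alert ? "low"
    : "none";
  return {
    alert,
    severity,
    methods: [ruleResult.alert && "rules", mlResult.alert && "ml"].filter(Boolean),
  };
}
```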

Model Types

  • Isolation Forest – Primary model. Treats anomalies as points that are easy to isolate. Negative scores indicate more anomalous behavior.
  • JavaScript Fallback – Used when Python is not available. Uses heuristic rules on scaled features.

Key Concepts

  • Anomaly Score – Lower (more negative) = more anomalous. Isolation Forest uses negative scores; threshold is typically around -0.1 to -0.5.
  • Confidence – How certain the model is. Higher confidence is required for alerts (e.g. 0.9).
  • Flow – A bidirectional connection (src:port ↔ dst:port, protocol). Features are computed per flow.

ML Alert Decision (Code)

An ML result becomes an alert only when all three conditions hold:

// backend/services/combinedDetectionEngine.js - runMLDetection()
const isAlert =
  prediction.is_anomaly &&
  prediction.anomaly_score < this.thresholds.mlAnomalyScore &&  // e.g. -0.5
  prediction.confidence > this.thresholds.mlConfidence;        // e.g. 0.9

return {
  alert: isAlert,
  anomalyScore: prediction.anomaly_score,
  confidence: prediction.confidence,
  method: "ml",
  modelVersion: prediction.model_version,
  features: features.slice(0, 5),
};

Learn More

Feature Extraction

The feature extractor converts raw packet and flow data into a 20-dimensional vector used by the ML model. Features are computed in real time as packets arrive.

Flow Tracking

A flow is a bidirectional connection identified by source IP:port, destination IP:port, and protocol. The extractor maintains a flow cache and updates it with each packet. Flows older than 5 minutes are removed.
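
The five-minute expiry can be pictured as a periodic sweep over the flow cache. A simplified sketch (the actual cleanup code in featureExtractor.js may differ):

```javascript
// Simplified flow-cache expiry sketch (not the actual implementation).
const FLOW_TIMEOUT_MS = 5 * 60 * 1000; // 5 minutes

function expireFlows(flowCache, now = Date.now()) {
  let removed = 0;
  for (const [key, flow] of flowCache) {
    if (now - flow.lastSeen > FLOW_TIMEOUT_MS) {
      flowCache.delete(key); // safe to delete during Map iteration in JS
      removed++;
    }
  }
  return removed;
}
```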

Flow Key (Code)

Flows are bidirectional: A→B and B→A are the same flow. The key uses the lexicographically smaller combination:

// backend/services/featureExtractor.js
getFlowKey(packet) {
  const srcIP = packet.srcIP || "unknown";
  const dstIP = packet.dstIP || "unknown";
  const srcPort = packet.srcPort || 0;
  const dstPort = packet.dstPort || 0;
  const protocol = packet.protocol || "unknown";

  const key1 = `${srcIP}:${srcPort}-${dstIP}:${dstPort}-${protocol}`;
  const key2 = `${dstIP}:${dstPort}-${srcIP}:${srcPort}-${protocol}`;

  return key1 < key2 ? key1 : key2;
}

Feature Vector Construction (Code)

The extractor builds an object of features, then converts to a fixed-order array for the model:

// backend/services/featureExtractor.js - extractFeatures()
const featureVector = [
  features.packetSize,
  features.flowDuration,
  features.packetsPerFlow,
  features.bytesPerFlow,
  features.srcPort,
  features.dstPort,
  features.protocol,
  features.bytesPerSecond,
  features.packetsPerSecond,
  features.srcIPActivity,
  features.dstIPActivity,
  features.interPacketInterval,
  features.timeOfDay,
  features.dayOfWeek,
  features.protocolDiversity,
  features.flagDiversity,
  features.unusualPortCombination ? 1 : 0,
  features.highVolumeFlow ? 1 : 0,
  features.srcIPRisk,
  features.dstIPRisk,
];

The 20 Features

Basic Packet & Flow (1–4)

  • packet_size – Size of the current packet in bytes
  • flow_duration – Time since the flow started (seconds)
  • packets_per_flow – Number of packets in this flow
  • bytes_per_flow – Total bytes in this flow

Network (5–7)

  • src_port – Normalized source port (well-known, registered, or dynamic ranges)
  • dst_port – Normalized destination port
  • protocol – Encoded protocol (TCP=1, UDP=2, ICMP=3, HTTP=4, HTTPS=5, DNS=6, FTP=7, SSH=8, SMTP=9; unknown=0)

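<!-- placeholder: not used -->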
// backend/services/featureExtractor.js
normalizePort(port) {
  if (!port || port === 0) return 0;
  if (port <= 1023) return port;           // Well-known
  if (port <= 49151) return 1024 + (port % 1000);  // Registered
  return 2024 + (port % 1000);             // Dynamic
}

encodeProtocol(protocol) {
  const protocolMap = {
    TCP: 1, UDP: 2, ICMP: 3, HTTP: 4, HTTPS: 5,
    DNS: 6, FTP: 7, SSH: 8, SMTP: 9, unknown: 0,
  };
  return protocolMap[protocol] || 0;
}

Flow Rate (8–9)

  • bytes_per_second – Flow throughput (bytes/sec)
  • packets_per_second – Packet rate (packets/sec)

IP Behavior (10–11)

  • src_ip_activity – Activity rate of source IP (packets/min, normalized 0–100)
  • dst_ip_activity – Activity rate of destination IP
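
One plausible way to compute a capped packets-per-minute activity rate (illustrative only; the actual per-IP tracking in featureExtractor.js may differ):

```javascript
// Illustrative sketch: per-IP activity as packets/min, capped at 100.
function ipActivity(packetCount, firstSeenMs, nowMs) {
  // Clamp the window to at least one second to avoid division by zero.
  const minutes = Math.max((nowMs - firstSeenMs) / 60000, 1 / 60);
  return Math.min(packetCount / minutes, 100);
}
```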

Temporal (12–14)

  • inter_packet_interval – Time since last packet in flow (seconds)
  • time_of_day – Hour (0–23)
  • day_of_week – Day (0–6, Sunday=0)

Diversity (15–16)

  • protocol_diversity – Number of distinct protocols in the flow
  • flag_diversity – Number of distinct TCP flags seen
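
Both diversity counts can be maintained naturally as Sets on the flow record. A sketch (field names here are assumptions, not the actual code):

```javascript
// Sketch: diversity features tracked as Sets on each flow record.
// (Illustrative; "protocols"/"tcpFlags" field names are assumptions.)
function updateDiversity(flow, packet) {
  flow.protocols = flow.protocols || new Set();
  flow.tcpFlags = flow.tcpFlags || new Set();
  flow.protocols.add(packet.protocol);
  if (packet.flags) packet.flags.forEach((f) => flow.tcpFlags.add(f));
  return {
    protocolDiversity: flow.protocols.size, // distinct protocols seen
    flagDiversity: flow.tcpFlags.size,      // distinct TCP flags seen
  };
}
```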

Anomaly Indicators (17–18)

  • unusual_port_combination – 0 or 1. Set if src/dst ports match known suspicious pairs (e.g. SSH-SSH, RPC-NetBIOS)
  • high_volume_flow – 0 or 1. Set if throughput exceeds 1 MB/s

Implementation (Code)

// backend/services/featureExtractor.js
isUnusualPortCombination(srcPort, dstPort) {
  const unusualCombinations = [
    [22, 22], [80, 443], [53, 53], [135, 139], [445, 445]
  ];
  return unusualCombinations.some(
    ([src, dst]) =>
      (srcPort === src && dstPort === dst) ||
      (srcPort === dst && dstPort === src)
  );
}

isHighVolumeFlow(byteCount, duration) {
  if (duration <= 0) return false;
  const bytesPerSecond = byteCount / duration;
  return bytesPerSecond > 1000000; // 1MB/s
}

Risk (19–20)

  • src_ip_risk – Risk score for source IP (e.g. private vs public)
  • dst_ip_risk – Risk score for destination IP
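
A minimal sketch of a private-vs-public heuristic (the scores and rules here are assumptions; the actual scoring may use more factors):

```javascript
// Sketch: assign a higher baseline risk to public IPs than to
// private/loopback ranges. Scores are illustrative assumptions.
function ipRisk(ip) {
  if (/^(10\.|192\.168\.|127\.)/.test(ip)) return 0.1;   // private / loopback
  if (/^172\.(1[6-9]|2\d|3[01])\./.test(ip)) return 0.1; // 172.16.0.0/12
  return 0.5;                                            // public: higher baseline
}
```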

Training Data Export

The extractor stores recent feature vectors (up to 10,000) for training. You can export them to CSV via Settings or the API for use with the training script.
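
The export can be pictured as serializing one row per stored vector, with columns matching the feature list plus source/destination/timestamp. A sketch (helper name and record shape are assumptions):

```javascript
// Sketch: serialize stored feature vectors to CSV rows.
// Record shape and header names are assumptions based on this doc.
function toCsv(vectors, featureNames) {
  const header = [...featureNames, "src_ip", "dst_ip", "timestamp"].join(",");
  const rows = vectors.map((v) =>
    [...v.features, v.srcIP, v.dstIP, v.timestamp].join(",")
  );
  return [header, ...rows].join("\n");
}
```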

Inference Pipeline

The ML inference service runs predictions on feature vectors in real time. It supports a Python subprocess for full sklearn/PyOD models and a JavaScript fallback when Python is unavailable.

Python Subprocess (Preferred)

When Python is installed and ml/predict.py exists, the service spawns a long-running Python process. Feature vectors are sent via stdin as JSON; predictions are returned on stdout.

Python Communication (Code)

// backend/services/mlInferenceService.js - predictWithPython()
const predictionId = `pred_${++this.predictionIdCounter}`;
this.predictionQueue.set(predictionId, { resolve, reject, timestamp });

const input = JSON.stringify({
  prediction_id: predictionId,
  features: featureVector,
});
this.pythonProcess.stdin.write(input + "\n");

// 5 second timeout
setTimeout(() => {
  if (this.predictionQueue.has(predictionId)) {
    this.predictionQueue.delete(predictionId);
    reject(new Error("Python prediction timeout"));
  }
}, 5000);

  • Uses sklearn Isolation Forest or PyOD models if available
  • Supports .pkl (joblib) and JSON model formats
  • Each prediction gets a unique ID for async request/response matching
  • 5-second timeout per prediction
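
The receiving side of this protocol presumably parses newline-delimited JSON from the Python process's stdout and resolves the queued promise by ID. A sketch (the actual handler in mlInferenceService.js may differ):

```javascript
// Sketch of the response side: match a stdout line back to its
// queued prediction by prediction_id. (Illustrative, not actual code.)
function handleStdoutLine(line, predictionQueue) {
  let msg;
  try {
    msg = JSON.parse(line);
  } catch {
    return false; // ignore non-JSON output (e.g. Python warnings)
  }
  const pending = predictionQueue.get(msg.prediction_id);
  if (!pending) return false; // unknown or already timed-out prediction
  predictionQueue.delete(msg.prediction_id);
  pending.resolve(msg);
  return true;
}
```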

JavaScript Fallback

If Python is not found or fails to start, the service loads JSON models from ml/models/ and runs a mock predictor in Node.js.

Mock Model Logic (Code)

The JS fallback scales features, then checks heuristic indicators. More suspicious indicators → lower (more anomalous) score:

// backend/services/mlInferenceService.js - callMockModel()
const scaledFeatures = featureVector.map((value, index) => {
  const mean = this.mockScaler.means[index] || 0;
  const scale = this.mockScaler.scales[index] || 1;
  return (value - mean) / scale;
});

const suspiciousIndicators = [
  scaledFeatures[16] > 2.0, // unusual_port_combination
  scaledFeatures[17] > 2.0, // high_volume_flow
  scaledFeatures[9] > 5,    // src_ip_activity
  scaledFeatures[10] > 5,   // dst_ip_activity
  scaledFeatures[0] > 5,    // packet_size
  scaledFeatures[1] > 5,    // flow_duration
];
const suspiciousCount = suspiciousIndicators.filter(Boolean).length;

// More suspicious → lower anomaly_score (more anomalous)
if (suspiciousCount >= 5) { anomalyScore = -0.8; confidence = 0.95; }
else if (suspiciousCount >= 4) { anomalyScore = -0.6; confidence = 0.9; }
else if (suspiciousCount >= 3) { anomalyScore = -0.3; confidence = 0.8; }
else { anomalyScore = 0.1; confidence = 0.9; }

const isAnomaly = anomalyScore < thresholds.anomalyScore &&
                  confidence > thresholds.confidence;

Prediction Flow

  1. Packet arrives → Feature extractor builds 20-D vector
  2. Vector is validated (length, no NaN/Inf)
  3. Sent to ML service (Python or JS)
  4. Model returns: is_anomaly, anomaly_score, confidence, model_version
  5. Combined engine checks thresholds and merges with rule-based result
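
Step 2 can be sketched as a simple guard before the vector is handed to the model (illustrative; the actual validation may differ):

```javascript
// Sketch: reject vectors with the wrong length or non-finite values.
function isValidFeatureVector(vector, expectedLength = 20) {
  return (
    Array.isArray(vector) &&
    vector.length === expectedLength &&
    vector.every((v) => typeof v === "number" && Number.isFinite(v))
  );
}
```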

Thresholds

An ML result becomes an alert when:

  • is_anomaly is true
  • anomaly_score < mlAnomalyScore (e.g. -0.5)
  • confidence > mlConfidence (e.g. 0.9)

See Configuration for how to adjust these.

Statistics

The service tracks:

  • Total predictions
  • Anomalies detected
  • Average latency (ms)
  • Last prediction time

These are exposed via the API and the Dashboard ML status panel.

Disabling Python ML

Set USE_PYTHON_ML=false in your environment to force the JavaScript fallback (e.g. when Python is not installed).

ML Model Training

DefenSys includes a Python training script that builds anomaly detection models from collected traffic data. Training produces Isolation Forest (and optionally PyOD) models plus a scaler and metadata.

Prerequisites

  • Python 3 with scikit-learn, numpy, pandas
  • Optional: PyOD for LOF, OCSVM, PyOD IForest
  • Training data in CSV format (see below)

Collecting Training Data

Run DefenSys with packet capture enabled. The feature extractor stores feature vectors in memory. Export them to CSV via:

  • Settings → Detection → Export Training Data
  • Or the exportTrainingData IPC/API method

The CSV is written to ml/datasets/training_data.csv with columns matching the 20 feature names plus src_ip, dst_ip, timestamp.

Training Pipeline

  1. Load data – Read CSV, validate columns
  2. Preprocess – Handle missing/infinite values, scale with StandardScaler
  3. Train Isolation Forest – contamination=0.1, n_estimators=200
  4. Train PyOD models (optional) – LOF, OCSVM, PyOD IForest
  5. Evaluate – Anomaly rate, score ranges
  6. Save – model_latest.pkl (or .json), scaler, metadata

Isolation Forest Training (Code)

# ml/train_model.py
model = IsolationForest(
    contamination=0.1,   # Expected proportion of outliers
    n_estimators=200,
    max_samples='auto',
    max_features=1.0,
    random_state=42,
    n_jobs=-1
)
model.fit(X)

predictions = model.predict(X)
anomaly_scores = model.decision_function(X)
# score < 0 → anomaly (Isolation Forest convention)

Running the Trainer

cd ml
python train_model.py

By default it looks for datasets/training_data.csv and writes to models/. You can override paths in the script.

Output Files

  • model_latest.json or model_latest.pkl – Model
  • scaler_*.json or scaler_*.pkl – Scaler
  • metadata_*.json – Training date, feature names, metrics

Retraining

The inference service can report when the model is older than 30 days via needsRetraining(). Collect new data periodically and retrain to adapt to changing traffic patterns.
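
The 30-day staleness check can be sketched from the training date stored in the model metadata (the `training_date` field name is an assumption):

```javascript
// Sketch: flag a model as stale when it was trained over 30 days ago.
// The metadata field name is an assumption, not the actual schema.
function needsRetraining(metadata, nowMs = Date.now()) {
  const trainedAt = new Date(metadata.training_date).getTime();
  const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;
  return nowMs - trainedAt > THIRTY_DAYS_MS;
}
```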

ML Configuration

ML behavior is controlled by environment variables, model metadata, and detection engine thresholds.

Environment Variables

  • USE_PYTHON_ML – Set to false to disable Python and use the JavaScript fallback only

Model Thresholds (in model_latest.json)

  • anomalyScore – Alert when score is below this (e.g. -0.1). More negative = more anomalous.
  • confidence – Alert only when confidence is above this (e.g. 0.7).

Detection Engine Thresholds

The combined detection engine uses its own thresholds (in code):

  • mlAnomalyScore – e.g. -0.5 (stricter than model default)
  • mlConfidence – e.g. 0.9 (higher confidence required)

These are tuned to reduce false positives. Adjust in combinedDetectionEngine.js if needed.

Settings UI

In the Desktop App, Settings → Detection lets you:

  • Enable/disable ML anomaly detection
  • Choose Python ML vs JavaScript (when both available)
  • Export training data
  • View ML status (initialized, Python active, stats)

Model Paths

The service looks for models in ml/models/ relative to the app path, with fallbacks for development and packaged builds.