Machine Learning Overview
DefenSys uses machine learning for anomaly-based threat detection. The ML pipeline complements rule-based detection by identifying behavioral anomalies that may indicate novel or zero-day attacks.
How It Works
When a packet is captured, the system extracts 20 numerical features from it (and from the flow it belongs to). These features are passed to an anomaly detection model. The model outputs an anomaly score and confidence. If the score indicates an anomaly and confidence is high, the result is combined with rule-based detection to decide whether to raise an alert.
Pipeline (Code)
The CombinedDetectionEngine orchestrates the flow. For each packet:
// backend/services/combinedDetectionEngine.js
async analyzePacket(packet) {
  // 1. Extract 20-D feature vector
  const features = this.featureExtractor.extractFeatures(packet);
  // 2. Run rule-based detection (port scan, SYN flood, etc.)
  const ruleResult = this.runRuleBasedDetection(packet);
  // 3. Run ML detection
  const mlResult = await this.runMLDetection(features, packet);
  // 4. Combine results and decide whether to alert
  const combinedResult = this.combineResults(ruleResult, mlResult, packet);
  if (combinedResult.alert) {
    this.storeAlert(combinedResult, packet);
    this.emit("alert", combinedResult);
  }
  return combinedResult;
}
Architecture
- Feature Extractor – Builds a 20-dimensional feature vector from each packet/flow. See Feature Extraction.
- ML Inference Service – Runs predictions using either a Python subprocess (sklearn/PyOD) or a JavaScript fallback. See Inference Pipeline.
- Combined Detection Engine – Merges rule-based and ML results into a single decision.
Model Types
- Isolation Forest – Primary model. Treats anomalies as points that are easy to isolate. Negative scores indicate more anomalous behavior.
- JavaScript Fallback – Used when Python is not available. Uses heuristic rules on scaled features.
Key Concepts
- Anomaly Score – Lower (more negative) = more anomalous. Isolation Forest uses negative scores; threshold is typically around -0.1 to -0.5.
- Confidence – How certain the model is. Higher confidence is required for alerts (e.g. 0.9).
- Flow – A bidirectional connection (src:port ↔ dst:port, protocol). Features are computed per flow.
ML Alert Decision (Code)
An ML result becomes an alert only when all three conditions hold:
// backend/services/combinedDetectionEngine.js - runMLDetection()
const isAlert =
  prediction.is_anomaly &&
  prediction.anomaly_score < this.thresholds.mlAnomalyScore && // e.g. -0.5
  prediction.confidence > this.thresholds.mlConfidence; // e.g. 0.9
return {
  alert: isAlert,
  anomalyScore: prediction.anomaly_score,
  confidence: prediction.confidence,
  method: "ml",
  modelVersion: prediction.model_version,
  features: features.slice(0, 5),
};
Learn More
- Feature Extraction – All 20 features explained
- Inference Pipeline – Python vs JavaScript, latency, batching
- Training – How to train and retrain models
- Configuration – Thresholds and settings
Feature Extraction
The feature extractor converts raw packet and flow data into a 20-dimensional vector used by the ML model. Features are computed in real time as packets arrive.
Flow Tracking
A flow is a bidirectional connection identified by source IP:port, destination IP:port, and protocol. The extractor maintains a flow cache and updates it with each packet. Flows older than 5 minutes are removed.
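The cache bookkeeping described above can be sketched as follows. This is a hypothetical `FlowCache` class, not the actual featureExtractor internals; only the 5-minute timeout comes from the source:

```javascript
// Hypothetical sketch of a flow cache with 5-minute expiry.
const FLOW_TIMEOUT_MS = 5 * 60 * 1000;

class FlowCache {
  constructor() {
    // flowKey -> { firstSeen, lastSeen, packetCount, byteCount }
    this.flows = new Map();
  }

  // Update (or create) the flow entry for an incoming packet
  update(flowKey, packetSize, now = Date.now()) {
    const flow = this.flows.get(flowKey) || {
      firstSeen: now, lastSeen: now, packetCount: 0, byteCount: 0,
    };
    flow.lastSeen = now;
    flow.packetCount += 1;
    flow.byteCount += packetSize;
    this.flows.set(flowKey, flow);
    return flow;
  }

  // Remove flows that have been idle longer than the timeout
  evictStale(now = Date.now()) {
    for (const [key, flow] of this.flows) {
      if (now - flow.lastSeen > FLOW_TIMEOUT_MS) this.flows.delete(key);
    }
  }
}
```

Per-flow features such as flow_duration and packets_per_flow fall directly out of this bookkeeping.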
Flow Key (Code)
Flows are bidirectional: A→B and B→A are the same flow. The key uses the lexicographically smaller combination:
// backend/services/featureExtractor.js
getFlowKey(packet) {
  const srcIP = packet.srcIP || "unknown";
  const dstIP = packet.dstIP || "unknown";
  const srcPort = packet.srcPort || 0;
  const dstPort = packet.dstPort || 0;
  const protocol = packet.protocol || "unknown";
  const key1 = `${srcIP}:${srcPort}-${dstIP}:${dstPort}-${protocol}`;
  const key2 = `${dstIP}:${dstPort}-${srcIP}:${srcPort}-${protocol}`;
  return key1 < key2 ? key1 : key2;
}
Feature Vector Construction (Code)
The extractor builds an object of features, then converts to a fixed-order array for the model:
// backend/services/featureExtractor.js - extractFeatures()
const featureVector = [
  features.packetSize,
  features.flowDuration,
  features.packetsPerFlow,
  features.bytesPerFlow,
  features.srcPort,
  features.dstPort,
  features.protocol,
  features.bytesPerSecond,
  features.packetsPerSecond,
  features.srcIPActivity,
  features.dstIPActivity,
  features.interPacketInterval,
  features.timeOfDay,
  features.dayOfWeek,
  features.protocolDiversity,
  features.flagDiversity,
  features.unusualPortCombination ? 1 : 0,
  features.highVolumeFlow ? 1 : 0,
  features.srcIPRisk,
  features.dstIPRisk,
];
The 20 Features
Basic Packet & Flow (1–4)
- packet_size – Size of the current packet in bytes
- flow_duration – Time since the flow started (seconds)
- packets_per_flow – Number of packets in this flow
- bytes_per_flow – Total bytes in this flow
Network (5–7)
- src_port – Normalized source port (well-known, registered, or dynamic ranges)
- dst_port – Normalized destination port
- protocol – Encoded protocol (TCP=1, UDP=2, ICMP=3, HTTP=4, HTTPS=5, DNS=6, etc.)
// backend/services/featureExtractor.js
normalizePort(port) {
  if (!port || port === 0) return 0;
  if (port <= 1023) return port; // Well-known
  if (port <= 49151) return 1024 + (port % 1000); // Registered
  return 2024 + (port % 1000); // Dynamic
}
encodeProtocol(protocol) {
  const protocolMap = {
    TCP: 1, UDP: 2, ICMP: 3, HTTP: 4, HTTPS: 5,
    DNS: 6, FTP: 7, SSH: 8, SMTP: 9, unknown: 0,
  };
  return protocolMap[protocol] || 0;
}
Flow Rate (8–9)
- bytes_per_second – Flow throughput (bytes/sec)
- packets_per_second – Packet rate (packets/sec)
IP Behavior (10–11)
- src_ip_activity – Activity rate of source IP (packets/min, normalized 0–100)
- dst_ip_activity – Activity rate of destination IP
Temporal (12–14)
- inter_packet_interval – Time since last packet in flow (seconds)
- time_of_day – Hour (0–23)
- day_of_week – Day (0–6, Sunday=0)
Diversity (15–16)
- protocol_diversity – Number of distinct protocols in the flow
- flag_diversity – Number of distinct TCP flags seen
Anomaly Indicators (17–18)
- unusual_port_combination – 0 or 1. Set if src/dst ports match known suspicious pairs (e.g. SSH-SSH, RPC-NetBIOS)
- high_volume_flow – 0 or 1. Set if throughput exceeds 1 MB/s
Implementation (Code)
// backend/services/featureExtractor.js
isUnusualPortCombination(srcPort, dstPort) {
  const unusualCombinations = [
    [22, 22], [80, 443], [53, 53], [135, 139], [445, 445]
  ];
  return unusualCombinations.some(
    ([src, dst]) =>
      (srcPort === src && dstPort === dst) ||
      (srcPort === dst && dstPort === src)
  );
}
isHighVolumeFlow(byteCount, duration) {
  if (duration <= 0) return false;
  const bytesPerSecond = byteCount / duration;
  return bytesPerSecond > 1000000; // 1 MB/s
}
Risk (19–20)
- src_ip_risk – Risk score for source IP (e.g. private vs public)
- dst_ip_risk – Risk score for destination IP
Training Data Export
The extractor stores recent feature vectors (up to 10,000) for training. You can export them to CSV via the Settings or API for use with the training script.
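The export can be sketched as a simple CSV serializer over the stored vectors. The helper name `toTrainingCSV` is hypothetical, and the metadata columns (src_ip, dst_ip, timestamp) mentioned in Training are omitted here for brevity; the 20 feature names match the list above:

```javascript
// The 20 feature names, in vector order
const FEATURE_NAMES = [
  "packet_size", "flow_duration", "packets_per_flow", "bytes_per_flow",
  "src_port", "dst_port", "protocol", "bytes_per_second",
  "packets_per_second", "src_ip_activity", "dst_ip_activity",
  "inter_packet_interval", "time_of_day", "day_of_week",
  "protocol_diversity", "flag_diversity", "unusual_port_combination",
  "high_volume_flow", "src_ip_risk", "dst_ip_risk",
];

// Hypothetical helper: serialize stored 20-D vectors to CSV text
function toTrainingCSV(vectors) {
  const header = FEATURE_NAMES.join(",");
  const rows = vectors.map((v) => v.join(","));
  return [header, ...rows].join("\n");
}
```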
Inference Pipeline
The ML inference service runs predictions on feature vectors in real time. It supports a Python subprocess for full sklearn/PyOD models and a JavaScript fallback when Python is unavailable.
Python Subprocess (Preferred)
When Python is installed and ml/predict.py exists, the service spawns a long-running Python process. Feature vectors are sent via stdin as JSON; predictions are returned on stdout.
Python Communication (Code)
// backend/services/mlInferenceService.js - predictWithPython()
const predictionId = `pred_${++this.predictionIdCounter}`;
this.predictionQueue.set(predictionId, { resolve, reject, timestamp });
const input = JSON.stringify({
  prediction_id: predictionId,
  features: featureVector,
});
this.pythonProcess.stdin.write(input + "\n");
// 5-second timeout
setTimeout(() => {
  if (this.predictionQueue.has(predictionId)) {
    this.predictionQueue.delete(predictionId);
    reject(new Error("Python prediction timeout"));
  }
}, 5000);
- Uses sklearn Isolation Forest or PyOD models if available
- Supports .pkl (joblib) and JSON model formats
- Each prediction gets a unique ID for async request/response matching
- 5-second timeout per prediction
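The response side of this exchange is not shown in the source. A sketch consistent with the request code might parse newline-delimited JSON from stdout and resolve the matching queued promise; the helper name and exact error handling here are assumptions:

```javascript
// Hypothetical sketch: handle one line of Python stdout and resolve
// the pending prediction it answers. Returns true if a queued
// prediction was resolved.
function handlePythonOutput(line, predictionQueue) {
  let result;
  try {
    result = JSON.parse(line);
  } catch {
    return false; // ignore non-JSON output (warnings, stray logs)
  }
  const pending = predictionQueue.get(result.prediction_id);
  if (!pending) return false; // already timed out, or unknown ID
  predictionQueue.delete(result.prediction_id);
  pending.resolve(result);
  return true;
}
```

Matching by prediction_id is what lets many predictions be in flight over a single long-running subprocess.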
JavaScript Fallback
If Python is not found or fails to start, the service loads JSON models from ml/models/ and runs a mock predictor in Node.js.
Mock Model Logic (Code)
The JS fallback scales features, then checks heuristic indicators. More suspicious indicators → lower (more anomalous) score:
// backend/services/mlInferenceService.js - callMockModel()
const scaledFeatures = featureVector.map((value, index) => {
  const mean = this.mockScaler.means[index] || 0;
  const scale = this.mockScaler.scales[index] || 1;
  return (value - mean) / scale;
});
const suspiciousIndicators = [
  scaledFeatures[16] > 2.0, // unusual_port_combination
  scaledFeatures[17] > 2.0, // high_volume_flow
  scaledFeatures[9] > 5, // src_ip_activity
  scaledFeatures[10] > 5, // dst_ip_activity
  scaledFeatures[0] > 5, // packet_size
  scaledFeatures[1] > 5, // flow_duration
];
const suspiciousCount = suspiciousIndicators.filter(Boolean).length;
// More suspicious → lower anomaly_score (more anomalous)
if (suspiciousCount >= 5) { anomalyScore = -0.8; confidence = 0.95; }
else if (suspiciousCount >= 4) { anomalyScore = -0.6; confidence = 0.9; }
else if (suspiciousCount >= 3) { anomalyScore = -0.3; confidence = 0.8; }
else { anomalyScore = 0.1; confidence = 0.9; }
const isAnomaly = anomalyScore < thresholds.anomalyScore &&
  confidence > thresholds.confidence;
Prediction Flow
- Packet arrives → Feature extractor builds 20-D vector
- Vector is validated (length, no NaN/Inf)
- Sent to ML service (Python or JS)
- Model returns:
- Model returns: is_anomaly, anomaly_score, confidence, model_version
- Combined engine checks thresholds and merges with rule-based result
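The validation step in this flow can be sketched as a single predicate. The helper name is an assumption; the checks (exactly 20 entries, all finite numbers) come from the source:

```javascript
// Hypothetical sketch of feature-vector validation:
// must be exactly 20 finite numbers (no NaN/Infinity).
function isValidFeatureVector(vector) {
  return (
    Array.isArray(vector) &&
    vector.length === 20 &&
    vector.every((v) => typeof v === "number" && Number.isFinite(v))
  );
}
```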
Thresholds
An ML result becomes an alert when:
- is_anomaly is true
- anomaly_score < mlAnomalyScore (e.g. -0.5)
- confidence > mlConfidence (e.g. 0.9)
See Configuration for how to adjust these.
Statistics
The service tracks:
- Total predictions
- Anomalies detected
- Average latency (ms)
- Last prediction time
These are exposed via the API and the Dashboard ML status panel.
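A minimal sketch of how these counters could be maintained is below. The class and field names are assumptions, not the service's actual implementation:

```javascript
// Hypothetical stats tracker mirroring the metrics listed above.
class MLStats {
  constructor() {
    this.totalPredictions = 0;
    this.anomaliesDetected = 0;
    this.totalLatencyMs = 0;
    this.lastPredictionAt = null;
  }

  // Record one completed prediction
  record(isAnomaly, latencyMs, now = Date.now()) {
    this.totalPredictions += 1;
    if (isAnomaly) this.anomaliesDetected += 1;
    this.totalLatencyMs += latencyMs;
    this.lastPredictionAt = now;
  }

  averageLatencyMs() {
    return this.totalPredictions === 0
      ? 0
      : this.totalLatencyMs / this.totalPredictions;
  }
}
```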
Disabling Python ML
Set USE_PYTHON_ML=false in your environment to force the JavaScript fallback (e.g. when Python is not installed).
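A sketch of how such a flag might be read at startup; the exact parsing rule (treating only the literal string "false" as disabled) is an assumption:

```javascript
// Hypothetical: decide whether to attempt the Python backend.
// Environment variables are strings, so compare against "false".
function usePythonML(env = process.env) {
  return env.USE_PYTHON_ML !== "false";
}
```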
ML Model Training
DefenSys includes a Python training script that builds anomaly detection models from collected traffic data. Training produces Isolation Forest (and optionally PyOD) models plus a scaler and metadata.
Prerequisites
- Python 3 with scikit-learn, numpy, pandas
- Optional: PyOD for LOF, OCSVM, PyOD IForest
- Training data in CSV format (see below)
Collecting Training Data
Run DefenSys with packet capture enabled. The feature extractor stores feature vectors in memory. Export them to CSV via:
- Settings → Detection → Export Training Data
- Or the exportTrainingData IPC/API method
The CSV is written to ml/datasets/training_data.csv with columns matching the 20 feature names plus src_ip, dst_ip, timestamp.
Training Pipeline
- Load data – Read CSV, validate columns
- Preprocess – Handle missing/infinite values, scale with StandardScaler
- Train Isolation Forest – contamination=0.1, n_estimators=200
- Train PyOD models (optional) – LOF, OCSVM, PyOD IForest
- Evaluate – Anomaly rate, score ranges
- Save – model_latest.pkl (or .json), scaler, metadata
Isolation Forest Training (Code)
# ml/train_model.py
model = IsolationForest(
    contamination=0.1,   # Expected proportion of outliers
    n_estimators=200,
    max_samples='auto',
    max_features=1.0,
    random_state=42,
    n_jobs=-1
)
model.fit(X)
predictions = model.predict(X)
anomaly_scores = model.decision_function(X)
# score < 0 → anomaly (Isolation Forest convention)
Running the Trainer
cd ml
python train_model.py
By default it looks for datasets/training_data.csv and writes to models/. You can override paths in the script.
Output Files
- model_latest.json or model_latest.pkl – Model
- scaler_*.json or scaler_*.pkl – Scaler
- metadata_*.json – Training date, feature names, metrics
Retraining
The inference service can report when the model is older than 30 days via needsRetraining(). Collect new data periodically and retrain to adapt to changing traffic patterns.
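The 30-day staleness check can be sketched as follows. needsRetraining() is named in the source, but this body is an assumption about how it might work:

```javascript
// Hypothetical sketch of the 30-day staleness check.
const RETRAIN_AFTER_DAYS = 30;

function needsRetraining(trainedAtMs, now = Date.now()) {
  const ageDays = (now - trainedAtMs) / (24 * 60 * 60 * 1000);
  return ageDays > RETRAIN_AFTER_DAYS;
}
```

The training timestamp would come from the metadata file written alongside the model.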
ML Configuration
ML behavior is controlled by environment variables, model metadata, and detection engine thresholds.
Environment Variables
- USE_PYTHON_ML – Set to false to disable Python and use the JavaScript fallback only
Model Thresholds (in model_latest.json)
- anomalyScore – Alert when score is below this (e.g. -0.1). More negative = more anomalous.
- confidence – Alert only when confidence is above this (e.g. 0.7).
Detection Engine Thresholds
The combined detection engine uses its own thresholds (in code):
- mlAnomalyScore – e.g. -0.5 (stricter than model default)
- mlConfidence – e.g. 0.9 (higher confidence required)
These are tuned to reduce false positives. Adjust in combinedDetectionEngine.js if needed.
Settings UI
In the Desktop App, Settings → Detection lets you:
- Enable/disable ML anomaly detection
- Choose Python ML vs JavaScript (when both available)
- Export training data
- View ML status (initialized, Python active, stats)
Model Paths
The service looks for models in ml/models/ relative to the app path, with fallbacks for development and packaged builds.