This document explains how TrustVector evaluations are conducted, scored, and maintained.
TrustVector evaluates AI systems across 5 dimensions using evidence-based scoring. Every claim is backed by verifiable sources with confidence levels.
What we measure:
- Task accuracy on standard benchmarks
- Output consistency across runs
- Latency (p50, p95, p99)
- Uptime and availability
- Context window size
- Rate limits and throughput
Why it matters: Performance determines if an AI system can meet user needs reliably. A high-performing model that's frequently down or inconsistent is not trustworthy.
Scoring criteria:
- Benchmark results normalized to 0-100 scale
- Latency measured in real-world conditions
- Uptime from official status pages
- Consistency from repeated testing
What we measure:
- Prompt injection resistance (OWASP LLM01)
- Jailbreak defenses
- Data leakage prevention
- Output safety filtering
- API security features
Why it matters: Security breaches can expose sensitive data, produce harmful outputs, or enable misuse. Security is critical for production deployments.
Scoring criteria:
- Testing against known attack vectors
- Review of security documentation
- Historical incident analysis
- API security best practices compliance
What we measure:
- Data residency options
- Data retention policies
- Training data opt-out availability
- PII handling capabilities
- Compliance certifications (SOC 2, HIPAA, GDPR, ISO 27001)
- Zero data retention options
Why it matters: Privacy violations can violate laws (GDPR, HIPAA), break user trust, and expose organizations to liability.
Scoring criteria:
- Policy review (terms of service, privacy policy)
- Certification verification
- Data flow analysis
- Opt-out mechanism testing
What we measure:
- Explainability features
- Hallucination rates
- Bias and fairness testing
- Uncertainty quantification
- Model card quality
- Training data transparency
- Built-in guardrails
Why it matters: Users need to understand how AI systems work, when they're uncertain, and where biases exist to make informed decisions.
Scoring criteria:
- Factual accuracy testing
- Bias benchmarks (BBQ, etc.)
- Documentation completeness review
- Explanation quality assessment
What we measure:
- API design quality
- SDK availability and quality
- Versioning policy clarity
- Breaking change management
- Monitoring/observability tools
- Support quality and SLAs
- Ecosystem maturity
- License terms
Why it matters: Operational issues cause integration failures, unexpected costs, and vendor lock-in. Excellent operations enable smooth production deployments.
Scoring criteria:
- API design principles compliance
- SDK feature completeness
- Documentation quality
- Community health metrics
- Support responsiveness
Each dimension receives an overall score (0-100) calculated as:
overall_score = weighted_average(criteria_scores)
Weights are assigned based on criterion importance within that dimension.
Individual criteria are scored based on:
- Quantitative metrics: Benchmark results, measured latency
- Qualitative assessment: Documentation quality, policy comprehensiveness
- Binary checks: Certification present (yes/no)
Normalization:
- Benchmarks: Map performance to 0-100 (relative to state-of-the-art)
- Latency: Lower is better, normalize to 0-100
- Binary: Yes = 100, No = 0, Partial = 50
overall_trust_score = (
performance_reliability +
security +
privacy_compliance +
trust_transparency +
operational_excellence
) / 5
This is a simple average by default. Users can customize weights via the custom score calculator.
- Peer-reviewed papers (highest confidence)
- Official benchmarks (LMSYS, SWE-bench, etc.)
- Official documentation (API docs, model cards)
- Third-party testing (security firms, independent researchers)
- Status pages (uptime data)
- Community testing (GitHub, blogs)
- Vendor claims (lowest confidence, requires verification)
Every scored criterion requires:
- At least 1 evidence source
- URL to the source
- Date when evidence was published (YYYY-MM-DD)
- Value that the evidence shows
- Methodology explaining how score was derived
- Last verified date
Preferred:
- 2+ evidence sources (cross-verification)
- Evidence from last 6 months
- Mix of quantitative and qualitative sources
- Multiple independent sources (2+)
- Evidence published within last 6 months
- Direct testing or benchmark results
- Authoritative sources (official, peer-reviewed)
- Single authoritative source
- Evidence 6-12 months old
- Indirect measurement (e.g., inferred from similar models)
- Reputable third-party testing
- Vendor claims only (not independently verified)
- Evidence older than 12 months
- Anecdotal reports
- Significant gaps in data
- Extrapolated from limited information
Important: Low confidence doesn't mean wrong, it means "needs more verification."
If no evidence exists for a criterion:
- Don't guess - mark as low confidence
- Explain in notes why evidence is unavailable
- Seek vendor input or community testing
- Consider not scoring (use
valuefield for qualitative data)
If sources disagree:
- Document all sources in evidence array
- Note the conflict in criterion notes
- Use most recent and authoritative source for score
- Mark as medium confidence at best
Some criteria are qualitative (e.g., "API design quality"):
- Use structured rubrics (see Rubrics below)
- Explain reasoning in methodology
- Compare to peers for relative scoring
- Mark confidence appropriately (usually medium)
- 90-100: RESTful, OpenAPI spec, streaming, webhooks, comprehensive
- 75-89: Well-designed, consistent, good features, minor gaps
- 60-74: Functional, some inconsistencies, missing features
- 40-59: Poor design, frequent breaking changes, limited features
- 0-39: Inconsistent, unreliable, poorly documented
- 90-100: Comprehensive, clear examples, API reference, guides, up-to-date
- 75-89: Good coverage, some gaps, mostly clear
- 60-74: Basic docs, outdated sections, limited examples
- 40-59: Minimal documentation, confusing, many gaps
- 0-39: Severely lacking or non-existent
Based on:
- Number of integrations with major frameworks
- Community library availability
- Tutorial and course availability
- Active community size
- GitHub stars/forks (for open-source)
Like CVSS (Common Vulnerability Scoring System), TrustVector allows custom weights:
{
performance_reliability: 0.20, // Must sum to 1.0
security: 0.30,
privacy_compliance: 0.25,
trust_transparency: 0.15,
operational_excellence: 0.10
}- Balanced: 20% each dimension
- Security First: 35% security, 30% privacy
- Performance Focused: 45% performance, 20% operations
- Enterprise: 25% security, 25% privacy, 20% operations
- Healthcare: 35% privacy, 30% security (HIPAA focus)
- Financial: 30% security, 25% privacy, 20% operations
- Startup: 35% performance, 30% operations (cost-focused)
Different use cases prioritize different dimensions:
| Use Case | Critical Dimensions | Min Scores |
|---|---|---|
| Healthcare | Privacy, Security | 85, 80 |
| Finance | Security, Compliance | 85, 85 |
| Code Generation | Performance, Trust | 75, 70 |
| Research | Trust, Transparency | 85, 80 |
| Customer Support | Performance, Security | 70, 75 |
- Major version release: New model version
- Significant benchmark update: New SWE-bench, GPQA results
- Policy changes: Privacy policy, terms of service changes
- Security incidents: Disclosed vulnerabilities
- Every 6 months: Routine refresh
- Contributor opens issue or PR
- Update relevant criteria
- Add new evidence
- Update
last_evaluatedandlast_verifieddates - Note what changed in PR description
- Recency bias: Newer models may have more recent benchmarks
- Benchmark gaming: Vendors may optimize specifically for tested benchmarks
- Closed-source opacity: Limited transparency for proprietary models
- Resource constraints: Can't test everything ourselves
- Temporal decay: Evaluations become less accurate over time
- Multiple evidence sources required
- Confidence levels acknowledge uncertainty
- Community review process
- Regular re-evaluation (6 months)
- Transparent methodology
- Cost-benefit analysis (though we note pricing)
- Subjective quality (writing style preference)
- Specific use case performance (beyond 10 common ones)
- Future capabilities (roadmap promises)
Every evaluation must pass:
- JSON schema validation
- Evidence URL accessibility
- Score consistency (overall ≈ average of criteria)
- Required fields present
- Evidence freshness warnings (if > 1 year old)
- 2 maintainer approvals required
- Community can comment on PRs
- Conflicts of interest must be disclosed
- Bi-annual review of all evaluations
- Issue tracking for needed updates
- Community can flag outdated evaluations
| Feature | TrustVector | LMSYS Arena | RiskRubric | Simple Benchmarks |
|---|---|---|---|---|
| Multi-dimensional | ✅ 5 dimensions | ❌ Single | ❌ Single metric | |
| Security evaluation | ✅ Comprehensive | ❌ No | ❌ No | |
| Privacy evaluation | ✅ Detailed | ❌ No | ❌ No | |
| Evidence-based | ✅ Every score | ✅ Yes | ✅ Yes | |
| Confidence levels | ✅ High/Med/Low | ❌ No | ❌ No | ❌ No |
| Custom weighting | ✅ CVSS-style | ❌ No | ❌ No | ❌ No |
| Open-source | ✅ MIT | ✅ Open | ❌ No | |
| Use case specific | ✅ 10+ | ❌ No | ❌ No |
Complementary, not competitive: TrustVector incorporates data from LMSYS, benchmarks, etc., providing a holistic view.
- v1.0 (2025-11-07): Initial methodology
- Framework covers models, MCPs, agents
- 5 dimensions, evidence-based, confidence levels
- Custom score calculator
This methodology will evolve based on community feedback.
- Suggest improvements: GitHub Discussions
- Report issues: GitHub Issues
- Ask questions: Comment on relevant PRs
Methodology inspired by:
- CVSS (vulnerability scoring)
- OWASP LLM Top 10 (security framework)
- Model Card framework (documentation standards)
- ISO/IEC 25010 (software quality model)