Project

Candidate Ranking system

AI-powered candidate ranking system that combines semantic embeddings and structured signals to surface the best-fit candidates from a 100K pool - without keyword matching.

Problem

Traditional recruiting tools rank by keyword overlap, rewarding profile stuffing over genuine fit.
This system reads career narratives for meaning instead of matching tags.
It weighs behavioral signals like availability and engagement, and penalizes implausible profiles.
Built for the Redrob Data & AI Challenge, it ranks a 100K pool against a job description in under 5 minutes on CPU — no GPU, no external APIs.

Approach

Two-phase architecture: precompute once, rank many

Phase 1 does the heavy, query-independent work: structured scores plus the BGE embedding of every candidate.
Phase 1 writes six parallel artifacts to artifacts/: candidate_ids.npy, semantic_scores.npy, three Track-1 score arrays, and track1_details.pkl.
Phase 2 (phase2/rank.py) loads those artifacts and ranks with no model load, no network, and no GPU.
The expensive embedding pass is paid once per population; re-ranking is pure NumPy.
The demo (demo_pipeline.py) collapses both phases into one in-memory pass since uploaded pools have no precomputed artifacts.

Candidate text representation (Track 2)

build_candidate_text serializes each candidate as: current role → career history → summary → skills → education.
Career history is recency-weighted via CAREER_TOKEN_BUDGET: 125 tokens for the most recent job, then 94, 62, and title-and-company-only for the 4th+ role.
Skills trail the narrative as a subordinate, noisy signal, and beginner proficiency is dropped (_PROFICIENCY_RANK keeps only intermediate/advanced/expert).
Certifications are excluded as low-signal.
Both candidate and JD text get BGE instruction prefixes for asymmetric retrieval.

Embedding model

The model is BAAI/bge-base-en-v1.5, a bi-encoder.
Each candidate and the JD (build_jd_text) are encoded independently.
Semantic score is cosine similarity of L2-normalized vectors, a dot product once normalized (cosine_similarity_matrix).
The bi-encoder lets candidate vectors be precomputed and reused across queries, matching the two-phase split.

Structured signals as multipliers (Track 1)

All three scores are in [0, 1] and read fields only through the field_map.py accessor layer.
Hard filter (track1_hard_filter.py) is the product location × yoe × work_mode × consulting_penalty × tenure.
Hard filter scores Pune/Noida 1.0, other tier-1 cities 0.9, the JD's 5–9 YOE band 1.0 (0.85 outer band), all-consulting careers (TCS/Infosys/Wipro/etc.) 0.3, and sub-12-month average tenure 0.50.
Availability (track1_availability.py) is a weighted average: open_to_work (0.30), last_active (0.25), response_time (0.20), notice_period (0.15), recruiter-response (0.05), applications-30d (0.05).
Availability excludes interview_completion_rate as a weak discriminator.
Credibility (track1_credibility.py) weights profile completeness (0.25), log-scaled endorsements (0.20), education tier (0.20), GitHub (0.20), LinkedIn (0.10), verified contact (0.05).
A -1 GitHub sentinel maps to 0.0, distinguishing "no account" from "low activity."

Final score and ranking

Phase 2 min-max normalizes semantic scores across the population (normalize_semantic).
final_score = semantic_norm × hard_filter × availability × credibility.
The multiplicative form lets any single disqualifying signal (e.g. consulting-only career → 0.3) collapse the whole score.
rank_top_n stable-sorts via np.lexsort on (final, semantic_norm, hard_filter) descending and slices the top 100.

Reasoning generation

build_reasoning emits a per-candidate header with role, YOE, and location.
It adds a semantic-match phrase tuned to the normalized score and real top skills only.
_career_clause fingerprints the previous row so consecutive current-role snippets never duplicate.
A Concerns section is driven by Track-1 sub-signals: long notice period, no GitHub, consulting background, short tenure.
_jd_hits surfaces JD-relevant areas only from skills/text the candidate actually has.

Honeypot avoidance

There is no separate honeypot detector; resistance is emergent.
Keyword-stuffed profiles gain little because skills are subordinate and the bi-encoder scores narrative meaning.
Implausible combinations (consulting-only history, sub-12-month tenure, zero applications) hit the multiplicative penalties and collapse the final score.

Architecture

Screenshots

Limitations

The demo caps at MAX_CANDIDATES = 100 and runs in memory, skipping the 100K path, artifact caching, and streaming top-N loader.
Demo semantic normalization is relative to the uploaded set, so scores are not comparable across runs.
Uploads with missing fields fall back to neutral defaults (avg_response_time → 280.0, notice_period → 90, github → -1), which can flatter or penalize partial profiles.
field_map.py hardcodes EDA-derived ranges (GITHUB_SCORE_MAX = 96.9, LAST_ACTIVE 23–263 days, RESPONSE_TIME 2.1–280h, ENDORSEMENTS_MAX = 242) that won't transfer to other distributions.
_JD_OFFICE_CITIES = {'pune', 'noida'} is hardcoded because the parser can't separate "office location" from "welcome to apply."
interview_completion_rate and standalone skill_assessment_scores are excluded for sparsity; current_company_size, India-based willing_to_relocate, and certifications carry little weight.
A missing/unparseable last_active_date scores 0.0 (fully stale, not neutral).
consulting_penalty falls back to current-company matching when career history is empty.
Consulting detection is substring-based, so a non-consulting firm whose name contains a flagged token can be mismatched.

Future Improvements

Add a cross-encoder second pass re-scoring only the top-K (~200), slotting between rank_top_n and reasoning.
Fine-tune bge-base-en-v1.5 (LoRA/QLoRA) on recruiter relevance labels for sharper JD-to-candidate discrimination.
Promote skill_assessment_scores to a weighted credibility component once coverage is higher.
Use company_size, current_industry, and the parsed seniority_level (currently unused in scoring) as additional fit signals.
Learn Track-1 weights and the combination from ground-truth labels (e.g. learning-to-rank) instead of hand-tuning.
Add explicit honeypot checks: YOE-vs-career-duration consistency, boilerplate-description detection beyond _career_clause, and outlier endorsement/GitHub combinations.