Talent Atlas Inc.
Case Study
Problem Statement
As Talent Atlas Inc. scaled rapidly across continents, its workforce tripled within two years, but its HR and Finance teams were still working from disconnected spreadsheets.
This created major operational challenges: leadership could not quickly answer basic questions like "How much are we spending by department?" or "What is our average employee tenure?"
Employee and departmental data existed but was fragmented, unstandardized, and manually analyzed.
The lack of a unified data engine hindered strategic decisions in budgeting, talent planning, and performance management.
Solution
A Workforce Intelligence Engine was developed using PySpark to unify and analyze the company’s HR data.
Two core datasets, departments.csv and employees.csv, were ingested and transformed into a scalable analytics model capable of answering key business questions from HR, Finance, and Leadership.
Using PySpark’s distributed processing capabilities, the solution was built in five stages:
1. Data Exploration and Preparation
- Imported and inspected the Employees and Departments datasets using PySpark DataFrames.
- Validated schema and data types, identified nulls, and removed duplicate or invalid records.
- Applied text cleaning operations such as trimming spaces and standardizing capitalization for consistent formatting.
- Filtered irrelevant records to focus only on active employees and valid departments.
- Established a clean, structured base for downstream transformations and analytics.
2. Aggregation and Descriptive Analytics
- Used groupBy() and agg() functions to calculate total, average, and maximum salaries per department.
- Determined overall employee count and salary distribution across departments and roles.
- Highlighted departments with the highest and lowest salary budgets to reveal pay disparities.
- Computed organization-wide salary averages to benchmark departmental performance.
- Built foundational descriptive metrics for subsequent financial and HR insights.
3. Data Integration and Analysis
- Performed a join between Employees and Departments datasets to create a unified analytical view.
- Mapped each employee to their department, manager, and location for enriched context.
- Derived insights such as departmental headcount, location-wise salary spend, and hiring distribution.
- Enabled cross-functional analysis of workforce cost and structure using the integrated data model.
- Provided leadership-ready views for strategic workforce planning.
4. Temporal and Text-Based Insights
- Leveraged PySpark date functions to extract year, month, and day components from hire dates.
- Calculated employee tenure by comparing hire dates to the current date.
- Identified seasonal hiring patterns and workforce growth trends over time.
- Used string functions to standardize department names and format employee identifiers.
- Enhanced time-series and textual consistency for reporting and visualization.
5. Advanced Workforce Analytics with Window Functions
- Applied window functions to rank employees within departments by salary and experience.
- Calculated running totals and cumulative averages to monitor departmental spending over time.
- Used partition-based analysis to isolate performance trends by role or function.
- Identified top earners and high-performing departments using rank and dense-rank logic.
- Delivered advanced, dynamic insights that would be difficult to compute with traditional aggregations alone.
Business Impact
- Unified Workforce View: Delivered a centralized, consistent view of all employee and departmental data across global operations.
- Informed Financial Planning: Provided clear visibility into salary budgets and departmental spending, improving budget accuracy.
- Faster HR Decision-Making: Enabled HR teams to track hiring trends, tenure, and workforce distribution without manual data pulls.
- Data-Driven Leadership: Empowered executives with on-demand answers to critical workforce questions in seconds instead of days.
- Scalable Analytics Framework: Established a PySpark-based foundation that can scale to handle larger datasets and additional HR metrics as the organization grows.