Data platform development, productionizing personalized recommendations
eBay Classifieds Group
Data Engineering and MLOps in the Benelux team (marktplaats.nl, 2dehands.be).
Tech: Python, Scala, Spark, Hadoop, SQL, Hive, Cassandra, Airflow, MLflow, Nomad, Docker, Linux, CI/CD (Jenkins), Google Cloud Platform
Detecting and localizing infected tulips
H2L Robotics (part time @ 0.2 FTE)
We build a robot that drives through tulip fields autonomously, detecting and localizing sick tulips and applying treatments.
I work on the neural network that detects the tulips using cameras. That means building tools around data processing (e.g. for annotations), writing Keras/TensorFlow code, and running many experiments to figure out what works and what doesn’t.
Tech: Python, Keras, TensorFlow, TensorRT, CNNs / ConvNets, Object Detection, Keypoint Detection, MLflow, Linux, Docker, Amazon Web Services
Financial asset management data hub
Helped a financial asset management firm transition to a cloud-native data hub on AWS.
Tech: Python, Airflow, Spark, Linux, Docker, Amazon Web Services
This was an initiative to detect events during the aircraft handling process with Deep Learning-based Computer Vision. I built the initial prototype and evangelized it internally. When this turned into an actual project, my main responsibility was high-accuracy detection: initially by implementing Object Detection models using the TensorFlow Object Detection API, Kalman-filter-based tracking and rule-based event detection. Later on, this transitioned into end-to-end learning on small video clips, using a custom Action Recognition approach based on a DeepMind paper.
Beyond modeling, I also took on a significant share of the engineering: ETL pipelines (from videos to de-duplicated JPEGs with pre-processing such as region-of-interest masks), low-level TensorFlow dataset batching code, Airflow DAGs for hyperparameter searches and experimentation, annotation tooling, and maintenance of Linux systems (on-prem and Azure VMs).
In addition, I played an important role in getting the project off the ground: many meetings just to get permission to use the data, convincing stakeholders and legal, discussions about anonymization, camera position meetings and site surveys, helping to build the team, defining roadmaps, etc.
Tech: Python (pandas, numpy, keras, matplotlib, seaborn, click, OpenCV), TensorFlow, TensorFlow Object Detection API, TensorBoard, CNNs / ConvNets, Object Detection, YOLO, SSD, Faster R-CNN, ResNets, video activity recognition, Inception-based architectures, multi-task learning, Locality-Sensitive Hashing, Kalman filter tracking, Airflow, MLflow, Spark, Databricks, PostgreSQL, Linux, Azure
Predicted Off-Block Time (Departure Delay Prediction)
Schiphol has experienced tremendous growth during the last couple of years, and infrastructure has struggled to keep up. The inevitable result is increased delays. I set out to develop a model to predict delays and, in the process, to better understand the factors that drive them.
The final model that maximized predictive accuracy was a boosted tree (XGBoost) model with a lot of feature engineering. The model improved existing estimates of departure time by 15% to 50%. I built an async flight API client, which refreshes on a timer and shows predictions in a simple Flask UI. Later on, I built the first version of a low-latency streaming implementation using Spark Structured Streaming on Databricks.
Tech: Python (pandas, numpy, scikit-learn, matplotlib, Flask), Jupyter, SQL, Hive, Spark, Spark Structured Streaming, Databricks, random forest regressor, boosted trees (XGBoost), LIME and SHAP
Wi-Fi-Sensor Based Location Analytics
I was part of a team that developed a system to measure the presence of Wi-Fi radios using custom Wi-Fi sensors. In essence, this enables insight into the approximate number of people in an area, how long they remain there, which route they take, how often they come back, etc. This system was deployed at various clients in retail, public transport, facility management and a football stadium.
I started out on the Data Engineering side and gradually transitioned into Data Science.
My engineering contributions:
- Designing, implementing and maintaining a lambda-architecture based big data platform.
- Developing streaming data processing code and real-time summary statistics in Apache Storm (Java).
- Building a framework that greatly simplifies PySpark-based batch jobs, and scheduling and monitoring those jobs.
- Privacy by design: developed the anonymization pipeline involving a Trusted Third Party, and developed an opt-out system.
- Built a Python/Flask/MongoDB/jQuery/Bootstrap based configuration management tool to simplify the administration of sensor locations, regions, maps, geometry, etc. This saved many hours of work.
- Supervising Junior Data Engineers.
- Monitoring with Prometheus.
My data science contributions:
- Developing a real-time Crowd Monitor for the KPMG Restaurant, with a short-term prediction (30 mins ahead). This was beneficial for internal marketing and helped our colleagues avoid crowds and queues.
- Analyzing data from 120 sensors in a large furniture store, working together with stakeholders to extract useful insights into shopper behavior.
- Pivoting the product into a version for workplace utilization and occupancy monitoring, to allocate teams to areas more efficiently and possibly close down an entire section of the building (saving a lot of money on operating costs).
Fun side project:
- Prototyped an indoor navigation app for Android using iBeacons. Dijkstra-based routing, sensor fusion, proximity-triggered messaging managed by a Python backend.
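To illustrate the Dijkstra-based routing mentioned above, here is a minimal sketch in Python. It is not the code from the app: the function name `shortest_route`, the `floor_plan` graph and its distances are hypothetical, and the graph is assumed to map each location to its neighbours with walking distances in metres.

```python
import heapq


def shortest_route(graph, start, goal):
    """Return (total_distance, path) for the shortest route.

    Returns (inf, []) when the goal cannot be reached.
    """
    # Priority queue of (distance so far, current node, path taken).
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in visited:
                heapq.heappush(queue, (dist + weight, neighbour, path + [neighbour]))
    return float("inf"), []


# Hypothetical floor plan: nodes are beacon locations, weights are metres.
floor_plan = {
    "entrance": {"hall": 5.0, "stairs": 12.0},
    "hall": {"entrance": 5.0, "stairs": 4.0, "meeting_room": 10.0},
    "stairs": {"entrance": 12.0, "hall": 4.0, "meeting_room": 3.0},
    "meeting_room": {"hall": 10.0, "stairs": 3.0},
}

distance, route = shortest_route(floor_plan, "entrance", "meeting_room")
print(distance, route)
```

In this made-up floor plan the route via the hall and the stairs is shorter than either direct alternative, which is the kind of case where a shortest-path search pays off over a greedy choice.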
Tram & Metro Vehicle Maintenance Analysis
Public transport provider (via KPMG)
Public transport providers maintain expensive assets, and malfunctions on the track can be quite disruptive to their travelers and to society as a whole. If maintenance can be done earlier (preventing breakdowns) or more efficiently, this can translate into substantial savings.
I investigated how patterns in vehicle (sensor) data relate to vehicle maintenance records. This uncovered interesting insights, such as specific areas of the rail that cause significantly more wheel damage.
Tech: Python (pandas, numpy, scikit-learn, matplotlib), Jupyter notebook, Java, SQL, Hive, Hadoop, Spark, Hortonworks big data cluster, Linux, association rule mining, random forest classifier
Public Transport Traveler Clustering
Public transport provider (via KPMG)
Since the switch to electronic payment cards for public transport, a lot of data has been collected on traveler behavior. This raises the question: can this data be used to create better products, more in line with travelers’ wishes?
I investigated how (anonymized) travelers can be grouped into several clusters based on their behavior. I used Hive on a Hadoop cluster to calculate various normalized behavior indicators, applied the K-means clustering algorithm, visualized the results with matplotlib and Gephi, and facilitated the interpretation and validation of the results with business stakeholders.
Tech: Python (pandas, numpy, scikit-learn, matplotlib), Jupyter notebook, SQL, Hive, Hadoop, Spark, Hortonworks big data cluster, Linux, Gephi, k-means clustering, dimensionality reduction (PCA)
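As a rough illustration of clustering on normalized indicators, the sketch below applies min-max scaling followed by a plain K-means loop. It is not the project code (which used Hive and scikit-learn): the indicators (trips per week, share of peak-hour trips), the data values and the helper names `min_max_normalize` and `k_means` are all made up for the example, and the initial centroids are fixed by hand to keep it deterministic.

```python
def min_max_normalize(rows):
    """Scale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def k_means(points, centroids, iterations=10):
    """Plain K-means: alternate between assigning points and updating centroids."""
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid (squared Euclidean distance).
        labels = [
            min(
                range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
            )
            for point in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        centroids = new_centroids
    return labels, centroids


# Hypothetical indicators per traveler: (trips per week, share of peak-hour trips).
travelers = [(10, 0.9), (9, 0.8), (11, 0.85), (2, 0.1), (1, 0.2), (3, 0.15)]
scaled = min_max_normalize(travelers)
labels, centroids = k_means(scaled, centroids=[scaled[0], scaled[3]])
print(labels)
```

Normalizing first matters here because trips per week and peak-hour share live on very different scales; without it, the distance would be dominated by the trip count.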
Highway Vehicle Intensity Prediction
Ministry of Infrastructure and Environment (Rijkswaterstaat; via KPMG)
The Dutch road administration has many terabytes of data from measurements of vehicles on the highways, made using induction loops embedded in the road. This results in noisy measurements of the number of vehicles, their length and velocity. The goal of this project was to investigate the possibilities of applying big data techniques to induction loop sensor data.
I developed a predictive model for the intensity on the road at any given time, based on historic intensity and weather data. The model was able to predict the standard weekly pattern quite accurately, including holiday effects and rush hour traffic. Adding precipitation data reduced the error by 3%. Predicting traffic jams due to collisions and rare “black swan” events remains elusive, though. This was just a Proof-of-Concept, but the results were featured in a newspaper article in the NRC (Dutch).
Tech: Python (pandas, numpy, matplotlib, scikit-learn), Jupyter notebook, random forest regressor, gradient descent, time series prediction, autoregressive feature extraction
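To give an idea of what autoregressive feature extraction means, the sketch below turns a series of counts into rows of lagged values that a regressor (such as a random forest) could train on. It is an illustration only, not the project code: the function name `lag_features`, the chosen lags and the counts are hypothetical.

```python
def lag_features(series, lags):
    """Build (features, target) rows; each feature is the value `lag` steps earlier."""
    max_lag = max(lags)
    rows = []
    # Start at max_lag so every requested lag has a value to look back to.
    for t in range(max_lag, len(series)):
        features = [series[t - lag] for lag in lags]
        rows.append((features, series[t]))
    return rows


# Made-up vehicle counts per time step with a repeating pattern of length 4.
counts = [120, 340, 560, 410, 130, 350, 570, 400]

# Use the previous step and the same position in the previous cycle as features.
rows = lag_features(counts, lags=[1, 4])
for features, target in rows:
    print(features, "->", target)
```

With real data the lags would typically include the same hour on the previous day and week, and the feature rows could be extended with calendar and weather columns.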