Portfolio

Forecasting for Megaprojects

CONTEXT. A megaproject is an extremely large-scale, complex investment venture that takes many years to develop and build, involves multiple public and private stakeholders, is transformational, and impacts millions of people. For example, a new train station qualifies as a megaproject. Management typically wants to monitor progress and forecast delays and costs in order to guide budgeting and make the best adjustments.

CONTRIBUTIONS. As a Data Scientist, I completed the exploratory data analysis and resolved data quality issues in concert with stakeholders. I designed, implemented, and evaluated predictive models to forecast cost and time overruns & underruns. Further, I delivered a technical report and presentations to stakeholders. Technologies: Jupyter Lab, Scikit-learn, ML algorithms (regression, multi-class classification).

“This is excellent.”

Head of Data Science, Consulting agency

Demand forecasting for the textile industry

CONTEXT. In manufacturing, such as in the textile industry, it is of paramount importance to forecast demand to optimize production. Textile production is a complex process that depends on staffing, logistics, factory time for fabric production, pre-treatments, dyeing, printing and finishing treatments. The planning and execution might require months. Manufacturers want to be competitive by minimizing delivery times and need to be cost-effective by minimizing unsold stock. This means producing the right quantity at the right time.

CONTRIBUTIONS. As a Data Scientist, I performed an exploratory data analysis on historical data points. I highlighted data quality issues and opportunities to integrate additional data for increased forecasting accuracy. I trained and evaluated the demand forecasting model and produced a final report. Technologies: AWS Forecast, Jupyter Lab, Pandas, Matplotlib.

“Thank you very much from the entire team for helping us.”

CEO, Consulting agency

Workshop: Big Data & Machine Learning for Mobility

CONTEXT. Mobility as a service (MaaS) is a range of digital solutions designed to make transportation more efficient and simple for passengers (trip planning, booking, ticketing, payment, and updates) and transport operators (fleet management, demand forecasting, predictive maintenance, optimized fixed and demand-responsive transit). As a first step toward a MaaS transformation of the Company (a solution provider for transport operators), I offered this workshop.

CONTRIBUTIONS. I trained 20 attendees (sales, management) in a three-day remote workshop with engaging interactions, introducing Big Data (motivations, architectures, integration strategies with legacy systems) and how it unfolds for the Company. We covered Machine Learning, the challenges of productionalizing ML services, and opportunities for mobility, with deep dives into specific use cases of interest to the Company. We concluded with best practices for data product management.

“Clear overview of ML methods with examples applied to our business.”

Pre-Sales Analyst

Financial Predictive Models & APIs for Tax Filing Services

CONTEXT. In certain financial services, including tax filing services, refunds are offered to customers based on their answers to questionnaires. Answers might be imprecise and inconsistent, affecting the refund calculations. Mitigation strategies include bounding uncertainty and providing estimates with quality guarantees; auto-correcting inconsistencies; requesting manual corrections.

CONTRIBUTIONS. As a Data Scientist and Machine Learning Engineer, I helped refine the business problem, source and consolidate the training data, build prediction models, run validation experiments with domain experts, and deploy the prediction APIs in production systems. Technologies: Google AI Platform, Google App Engine, Flask, Kubernetes (k8s), Snowflake, Jupyter Lab, Scikit-learn, ML algorithms (regression, binary classification, multi-class classification, quantile prediction).

“Would you like to join us full-time as Lead Data Scientist?”

Head of Data

Multi-Touch Marketing Attribution Modeling & Productionalization

CONTEXT. An attribution model is a set of rules that determines how to credit touchpoints for conversions in conversion paths. Touchpoints include clicking ads, visiting blog posts, using referral codes provided by influencers or affiliate partners, and organic search. Models differ in how they weigh the importance of different touchpoints: first-click, last-click, multi-touch. Visitors’ behavior prior to conversion is of paramount importance for marketing departments to measure and optimize their operations and budget allocation.
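
To make the difference between these models concrete, here is a minimal Python sketch. It is an illustration only, not the model delivered in this project: the function name `attribute`, the touchpoint labels, and the choice of an equal-weight ("linear") split as the multi-touch rule are all assumptions made for the example.

```python
from collections import defaultdict


def attribute(path, model="linear"):
    """Split the credit for one conversion (1.0) across the touchpoints in `path`.

    `path` is the ordered list of touchpoints that preceded the conversion.
    The model names and the equal-weight multi-touch rule are hypothetical.
    """
    if not path:
        return {}
    credit = defaultdict(float)
    if model == "first_click":
        # All credit goes to the first touchpoint of the path.
        credit[path[0]] += 1.0
    elif model == "last_click":
        # All credit goes to the last touchpoint before the conversion.
        credit[path[-1]] += 1.0
    elif model == "linear":
        # Multi-touch: every touchpoint receives an equal share.
        share = 1.0 / len(path)
        for touchpoint in path:
            credit[touchpoint] += share
    else:
        raise ValueError(f"unknown model: {model}")
    return dict(credit)


# Hypothetical conversion path with four touchpoints.
path = ["ad_click", "blog_post", "referral_code", "organic_search"]
first = attribute(path, "first_click")
last = attribute(path, "last_click")
linear = attribute(path, "linear")
```

With this path, first-click credits only `ad_click`, last-click credits only `organic_search`, and the linear rule gives each of the four touchpoints a quarter of the credit. Real multi-touch schemes often weight touchpoints unequally, for example by recency or position.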

CONTRIBUTIONS. As a Data Scientist, I implemented a multi-touch attribution model, including data cleansing and consolidation; a tracking model (extracting traces of converted visitors); an attribution model (attribution of visits, attribution of users, multi-attribution schema); and marketing analytics (answering analytical questions). Technologies: Piwik/Matomo data model, MySQL, PostgreSQL, SQLite, customized visit-user matching with data provenance, and handling of missing attributions.

“Thanks, Michele – what are we going to do when you finish.”

Senior Performance Marketing Manager

Optimization in Public Transportation

CONTEXT. In urban public transport, bus routes and schedules are regularly updated to match service demand, fleet, and drivers’ availability. These adjustments are often the result of manual processes and tacit knowledge, which lead to suboptimal choices.

CONTRIBUTIONS. As a Data Scientist and Transport Engineer, I architected a comprehensive solution to estimate demand, run fleet simulations to evaluate what-if scenarios, and identify optimal service adjustments. Target business KPIs include network service costs and profits, drivers’ mileage, passenger waiting times, and CO2 emissions. The grant proposal based on the resulting report was successful. Technologies: SUMO, Spark, origin-destination estimation, simulation-aided optimization.

“We are delighted to work with Michele, which is helping us with the preparation of grant proposals and prototyping.”

CEO

Visual Analysis of Business Processes

CONTEXT. Pharmaceutical companies must comply with strict requirements on their business processes, covering how confidential information is treated, how products are developed, and how medical trials are managed. These processes grow in complexity, are highly connected, and tend to be difficult to understand. Analyzing the recorded business process logs can reveal these complex dynamics, empowering companies with analytical services to drive automation and optimization.

CONTRIBUTIONS. As a Data Scientist, I built an application component enabling the visual exploration of business processes, highlighting the different aspects of interest. The result has been integrated into a comprehensive Business Intelligence dashboard. Technologies: Node.js, Vue, TypeScript, yFiles (diagramming).

“That’s it for me, well done you.”

Freelance Data Scientist with 25+ years of experience in consulting

Contact Tracing for COVID-19

CONTEXT. When systematically applied, contact tracing can break the transmission chains of infectious disease and is thus an essential public health tool for controlling infectious disease outbreaks. By combining mobile phone usage data with other information, public health organizations can monitor the situation at the country scale, identify potential hazards such as overcrowded areas, and track contacts of possibly infected persons.

CONTRIBUTIONS. As a Data Scientist and Research Engineer, I reviewed the methodology and algorithms of an existing system for contact tracing. The findings resulted in a series of recommendations that have been implemented, leading to higher performance and more accurate results. Technologies: Elasticsearch, Python, Pandas, data cleansing, trajectory mining algorithms.

“Thank You, Michele!”

CEO

Adaptive Traffic Signal Analysis & Optimization

CONTEXT. Urban traffic modeling and analysis is a key component of traffic management and control. Its purpose is to predict congestion states and propose improvements in the traffic network. Traffic signal control is one of these improvements, which aims to minimize the travel time of vehicles by coordinating their movements at the road intersections.

CONTRIBUTIONS. As a Data Scientist and Research Engineer, I built an urban traffic simulation model starting from noisy sensor measurements and applied reinforcement learning methods to identify the optimal policy for the traffic lights plan. Technologies: Flink, Kafka, SUMO microscopic agent-based traffic simulator, Java, Python, Jupyter Lab, NetworkX, probabilistic modeling methods, simulator-in-the-loop optimization strategies.

“We truly appreciate the effort that Michele put into our collaboration. He is a skilled data scientist that gets things done. We will definitely consider him for other projects in the future.”

Principal Research Engineer and Team Lead

City-Scale People’s Movement Analytics

CONTEXT. Mobile devices connect regularly to the cellular network to receive messages, initiate calls, and transfer data. Network logs provide a comprehensive picture of how the population moves in metropolitan areas, enabling use cases such as optimization of out-of-home advertising.

CONTRIBUTIONS. As a Data Engineer and Data Scientist, I architected and implemented a GDPR-compliant analytics service delivering a population demographics API for urban areas by modeling people’s location and movements using cellular network data, ingesting billions of records per day. Technologies: AWS EMR clusters, AWS S3, AWS Lambda, AWS CloudWatch, HDFS, Parquet, Hadoop, Spark, Scala, Python, Zeppelin notebooks, Docker.

“Michele is a very talented data scientist with excellent data engineering skills. He contributed several fundamental components of our location intelligence platform. I highly recommend Michele and would love to get the chance to work with him again.”

Matthew Lehar, Data Science Capability Manager for ITS Data Lab, Siemens Mobility

Location Intelligence for Retail Analytics

CONTEXT. WiFi radio signals can be used to locate the emitting device, such as a smartphone, in indoor environments. In retail analytics, this data helps define areas of high visibility, evaluate the effectiveness of specific promotions, and inform product placement for more effective selling.

CONTRIBUTIONS. As a Senior Data Engineer and then Lead Data Scientist, I redesigned the existing WiFi analytics pipeline as a series of Spark jobs, resulting in simpler data/ML pipelines with a double-digit percentage accuracy increase. Technologies: PostgreSQL, Airflow, Celery, Cassandra, AWS S3, Python, Docker, Jupyter Lab, Scikit-learn, Pandas, Fabric, Flask.

“Michele is a passionate, hardworking, and reflected data scientist. It was a blast to see how Michele picked up the task and was able to deliver excellent, reliable, and reproducible results. Michele is definitely an asset for every company! I am very happy that Michele succeeded in my position as Lead Data Scientist.”

Alexander Müller, Founder & CTO workist.com

Cloud Robotics & Drone Charging Stations

CONTEXT. Fully autonomous robots require remote management services to operate: fleet management, predictive maintenance, and teleoperation. Autonomous charging capabilities must be integrated to enable extended, unattended operation.

CONTRIBUTIONS. As founder and CTO, I supervised the design, development, and commercialization of charging stations and protective hangars for commercial drones with complementary management and connectivity services. My contributions led to contracts with more than thirty international clients, including CIA, NASA, NIH, Google X, Parrot, and Stanford University. Technologies: ROS, Python, C++, Flask, Docker, Fabric, OpenCV, Jupyter Lab.

“Michele perfectly fits dynamic managing positions, planning the work, coaching and coordinating the team. I highly recommend him.”

Francesco Mosconi, CEO & Chief Data Scientist @ Catalit

Modeling & Querying Data with Uncertainty

CONTEXT. As a Ph.D. student, I co-authored ten papers at top-tier international conferences and in journals, including SIGMOD, VLDB, EDBT, KAIS, and DKE. I worked on problems related to the management and analysis of streaming and uncertain data. Moreover, I established collaborations with prominent research groups in the area of data management, namely the data management and data analytics groups at IBM T.J. Watson Research Center (USA), where I spent six months (over two visits) as a visiting researcher, and the Qatar Computing Research Institute (Qatar), where I spent another three months.

CONTRIBUTIONS (available on Google Scholar)

  1. “Improving Classification Quality in Uncertain Graphs”. Michele Dallachiesa, Charu Aggarwal, Themis Palpanas. ACM Journal of Data and Information Quality (JDIQ), 2018.
  2. “Similarity Using Correlation-Aware Measures”. Katsiaryna Mirylenka, Michele Dallachiesa, Themis Palpanas. International Conference on Scientific and Statistical Database Management (SSDBM), Chicago, USA, 2017.
  3. “Correlation-Aware Distance Measures for Data Series”. Katsiaryna Mirylenka, Michele Dallachiesa, Themis Palpanas. International Conference on Extending Database Technology (EDBT), Italy, 2017.
  4. “Top-k Nearest Neighbor Search In Uncertain Data Series”. Michele Dallachiesa, Themis Palpanas, Ihab F. Ilyas. Proceedings of the VLDB Endowment (PVLDB) Journal 8(1), 2015.
  5. “Sliding windows over uncertain data streams”. Michele Dallachiesa, Gabriela Jacques-Silva, Bugra Gedik, Kun-Lung Wu, Themis Palpanas. Knowledge and Information Systems (KAIS) Journal 45, 2015.
  6. “Node classification in uncertain graphs”. Michele Dallachiesa, Charu Aggarwal, Themis Palpanas. International Conference on Scientific and Statistical Database Management (SSDBM), Aalborg, Denmark, 2014.
  7. “NADEEF: a commodity data cleaning system”. Michele Dallachiesa, Amr Ebaid, Ahmed Eldawy, Ahmed Elmagarmid, Ihab Ilyas, Mourad Ouzzani, Nan Tang. ACM SIGMOD International Conference on Management of Data (SIGMOD), New York City, NY, USA, 2013.
  8. “Identifying Streaming Frequent Items In Ad-hoc Recent Time Windows”. Michele Dallachiesa, Themis Palpanas. Data & Knowledge Engineering (DKE) Journal, 2013.
  9. “Uncertain Time-Series Similarity: Return to the Basics”. Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, Themis Palpanas. Proceedings of the VLDB Endowment (PVLDB) Journal, Turkey, 2012.
  10. “Similarity Matching for Uncertain Time Series: Analytical and Experimental Comparison”. Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, Themis Palpanas. ACM SIGSPATIAL International Workshop on Querying and Mining Uncertain Spatio-Temporal Data (QUeST), Chicago, USA, 2011.

“Michele is able to grasp new ideas and concepts, and in general, learn fast. The leading abilities of Michele are evident in his work: he is capable of independent thinking, and of delivering novel and effective solutions to hard problems. He combines a very solid theoretical background with excellent practical skills.”

Prof. Themis Palpanas, Paris Descartes University

Visual Network Graph Exploration & Analytics

CONTEXT. Network visualization helps us understand complex relationships between entities and discover interesting patterns in very large data sets, using the superpowers of our visual cortex. Networks emerge naturally from social and physical interactions.

CONTRIBUTIONS. As an M.Sc. student, I implemented a commercial visualization tool to explore very large networks with millions of nodes and connections. The application has been downloaded by more than 600 users working on a myriad of use cases, from social network analysis to cybercrime analytics. The tool is also cited in several scientific publications, and its unique features have inspired other visualization products. Technologies: Graph layout methods, OpenGL, Qt, C++, Python, social network analysis methods.

“Michele is literally a unicorn. He has managed to merge business intuition, rigorous scientific approach, and highly technical skills in a single person. Working with him? Illuminating experience.”

Carlo Nicolini, Ph.D., Computational Scientist at Prometeia

IT Security Research

CONTEXT. As a hobbyist and member of the Italian hacker community, I enjoyed IT security for nearly a decade (2000-2010).

CONTRIBUTIONS. I have primarily investigated methods to conceal unauthorized access to networks. My efforts resulted in a series of public tools and some articles published in underground magazines (BfI, newbies) that have been cited in several research publications. Technologies: C/C++, Bash scripting, Python, Linux kernel programming, networking protocols.

REFERENCES (Open-source software)

  • hopfake: monitor and fake network traceroute requests
  • libimbw: full-duplex covert channels proxied to TCP/IP
  • mikhail: connect hosts behind NATs
  • cookie-tools: steal web sessions from nearby Wi-Fi connected devices
  • ptrace-tools: inject commands on ptraced telnet/ssh sessions
  • rtpbreak: capture and reconstruct voice calls of IP-based phones
  • icopy: datalink packet bridge between network interfaces