- Forecasting for Megaprojects [Infrastructure, Financial services]
- Demand forecasting for the textile industry [Manufacturing, Financial services]
- Workshop: Big Data & Machine Learning for Mobility [Transport, Location intelligence]
- Financial Predictive Models & APIs for Tax Filing Services [Financial services]
- Multi-Touch Marketing Attribution Modeling & Productionalization [Online advertising]
- Optimization in Public Transportation [Transport, Location intelligence]
- Visual Analysis of Business Processes [Data Visualization]
- Contact Tracing for COVID-19 [Location intelligence]
- Adaptive Traffic Signal Analysis & Optimization [Transport, Location intelligence]
- City-Scale People’s Movement Analytics [Telecom, Location intelligence]
- Location Intelligence for Retail Analytics [Retail, Location intelligence]
- Cloud Robotics & Drone Charging Stations [Robotics]
- Modeling & Querying Data with Uncertainty [Research, Time series, Uncertain data]
- Visual Network Graph Exploration & Analytics [Data Visualization]
- IT Security Research [IT Security]
Forecasting for Megaprojects
CONTEXT. Megaprojects are extremely large-scale, complex investment ventures that take many years to develop and build, involve multiple public and private stakeholders, are transformational, and impact millions of people. For example, a new train station qualifies as a megaproject. Management typically wants to monitor progress and forecast delays and costs in order to guide budgeting and choose the best adjustments.
CONTRIBUTIONS. As a Data Scientist, I completed the exploratory data analysis and resolved data quality issues in concert with stakeholders. I designed, implemented, and evaluated predictive models to forecast cost and time overruns & underruns. Further, I delivered a technical report and presentations to stakeholders. Technologies: Jupyter Lab, Scikit-learn, ML algorithms (regression, multi-class classification).
Demand forecasting for the textile industry
CONTEXT. In manufacturing, such as in the textile industry, it is of paramount importance to forecast demand to optimize production. Textile production is a complex process that depends on staffing, logistics, factory time for fabric production, pre-treatments, dyeing, printing and finishing treatments. The planning and execution might require months. Manufacturers want to be competitive by minimizing delivery times and need to be cost-effective by minimizing unsold stock. This means producing the right quantity at the right time.
CONTRIBUTIONS. As a Data Scientist, I performed an exploratory data analysis on historical data points. I highlighted data quality issues and opportunities to integrate additional data for increased forecasting accuracy. I trained and evaluated the demand forecasting model and summarized the results in a final report. Technologies: AWS Forecast, Jupyter Lab, Pandas, Matplotlib.
Workshop: Big Data & Machine Learning for Mobility
CONTEXT. Mobility as a service (MaaS) is a range of digital solutions designed to make transportation more efficient and simple for passengers (trip planning, booking, ticketing, payment, and updates) and transport operators (fleet management, demand forecasting, predictive maintenance, optimized fixed and demand-responsive transit). As a first step toward a MaaS transformation of the Company (a solution provider for transport operators), I offered this workshop.
CONTRIBUTIONS. I trained 20 attendees (sales, management) in a three-day, interactive remote workshop introducing Big Data (motivations, architectures, integration strategies with legacy systems) and how it applies to the Company. We covered Machine Learning, the challenges of productionalizing ML services, and opportunities for mobility, with deep dives into specific use cases of interest to the Company. We concluded with best practices for data product management.
Financial Predictive Models & APIs for Tax Filing Services
CONTEXT. In certain financial services, including tax filing services, refunds are offered to customers based on their answers to questionnaires. Answers might be imprecise and inconsistent, affecting the refund calculations. Mitigation strategies include bounding uncertainty and providing estimates with quality guarantees; auto-correcting inconsistencies; requesting manual corrections.
CONTRIBUTIONS. As a Data Scientist and Machine Learning Engineer, I helped refine the business problem, source and consolidate the training data, build prediction models, run validation experiments with domain experts, and deploy the prediction APIs in production systems. Technologies: Google AI Platform, Google App Engine, Flask, Kubernetes, Snowflake, Jupyter Lab, Scikit-learn, ML algorithms (regression, binary classification, multi-class classification, quantile prediction).
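To illustrate how quantile prediction can bound uncertainty, here is a minimal sketch of the pinball (quantile) loss, the standard objective for quantile models. It is not the production model described above. The refund amounts, the function names `pinball_loss` and `best_constant`, and the choice of constant predictions are all assumptions made for the example.

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss for a quantile level q in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


def best_constant(y_true, q):
    """Return the observed value that minimizes the pinball loss
    when used as a constant prediction for every sample."""
    candidates = sorted(set(y_true))
    return min(
        candidates,
        key=lambda c: pinball_loss(y_true, [c] * len(y_true), q),
    )


# Hypothetical refund amounts, for illustration only.
refunds = [120.0, 150.0, 180.0, 200.0, 210.0, 250.0,
           300.0, 320.0, 400.0, 450.0, 900.0]

low = best_constant(refunds, 0.1)   # lower bound of an 80% interval
high = best_constant(refunds, 0.9)  # upper bound of an 80% interval
print(f"80% interval from constant quantile predictions: [{low}, {high}]")
```

A real quantile model would predict the bounds from questionnaire features rather than use a constant, but the loss that drives the bounds is the same.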
Multi-Touch Marketing Attribution Modeling & Productionalization
CONTEXT. An attribution model is a set of rules that determines how to credit touchpoints for conversions in conversion paths. Touchpoints include clicking ads, visiting blog posts, using referral codes provided by influencers or affiliate partners, and organic search. Models differ in how they weigh the importance of different touchpoints: first-click, last-click, or multi-touch. Visitors’ behavior prior to conversion is of paramount importance for marketing departments to measure and optimize their operations and budget allocation.
CONTRIBUTIONS. As a Data Scientist, I implemented a multi-touch attribution model, including data cleansing and consolidation; a tracking model (extracting traces of converted visitors); an attribution model (attribution of visits, attribution of users, multi-attribution schema); and marketing analytics (answering analytical questions). Technologies: Piwik/Matomo data model, MySQL, PostgreSQL, SQLite, customized visit-user matching with data provenance, and handling of missing attributions.
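The difference between first-click, last-click, and multi-touch attribution can be sketched in a few lines. This is an illustrative example, not the model built in this project. The function name `attribute`, the touchpoint labels, and the equal-weight ("linear") rule used for the multi-touch case are assumptions chosen for simplicity.

```python
from collections import defaultdict


def attribute(path, model="linear"):
    """Split the credit of one conversion (1.0) across the touchpoints
    of a conversion path, according to the chosen attribution rule."""
    if not path:
        return {}
    credit = defaultdict(float)
    if model == "first_click":
        credit[path[0]] += 1.0       # all credit to the first touchpoint
    elif model == "last_click":
        credit[path[-1]] += 1.0      # all credit to the last touchpoint
    elif model == "linear":
        share = 1.0 / len(path)      # equal credit to every touchpoint
        for touchpoint in path:
            credit[touchpoint] += share
    else:
        raise ValueError(f"unknown attribution model: {model}")
    return dict(credit)


# Hypothetical conversion path, ordered from first to last touchpoint.
path = ["ad_click", "blog_post", "referral_code", "organic_search"]

for model in ("first_click", "last_click", "linear"):
    print(model, attribute(path, model))
```

Summing these per-conversion credits over all conversion paths gives the per-channel totals that marketing teams compare when allocating budget.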
Optimization in Public Transportation
CONTEXT. In urban public transport, bus routes and schedules are regularly updated to match service demand, fleet, and drivers’ availability. These adjustments are often the result of manual processes and tacit knowledge, which lead to suboptimal choices.
CONTRIBUTIONS. As a Data Scientist and Transport Engineer, I architected a comprehensive solution to estimate demand, run fleet simulations to evaluate what-if scenarios, and identify optimal service adjustments. Target business KPIs include network service costs and profits, drivers’ mileage, passenger waiting times, and CO2 emissions. The grant proposal based on the resulting report was successful. Technologies: SUMO, Spark, origin-destination estimation, simulation-aided optimization.
Visual Analysis of Business Processes
CONTEXT. Pharmaceutical companies must comply with strict requirements on business processes: how confidential information is treated, how products are developed, and how medical trials are managed. These processes grow in complexity, are highly connected, and tend to be difficult to understand. Analyzing the recorded business process logs can reveal these complex dynamics and support analytical services that drive automation and optimization.
CONTRIBUTIONS. As a Data Scientist, I built an application component enabling the visual exploration of business processes, highlighting the different aspects of interest. The result has been integrated into a comprehensive Business Intelligence dashboard. Technologies: Node.js, Vue, TypeScript, yFiles (diagramming).
Contact Tracing for COVID-19
CONTEXT. When systematically applied, contact tracing can break the transmission chains of infectious disease and is thus an essential public health tool for controlling infectious disease outbreaks. By combining mobile phone usage data with other information, public health organizations can monitor the situation at the country scale, identify potential hazards such as overcrowded areas, and track contacts of possibly infected persons.
CONTRIBUTIONS. As a Data Scientist and Research Engineer, I reviewed the methodology and algorithms of an existing system for contact tracing. The findings resulted in a series of recommendations that have been implemented, leading to higher performance and more accurate results. Technologies: ElasticSearch, Python, Pandas, data cleansing, trajectory mining algorithms.
Adaptive Traffic Signal Analysis & Optimization
CONTEXT. Urban traffic modeling and analysis is a key component of traffic management and control. Its purpose is to predict congestion states and propose improvements in the traffic network. Traffic signal control is one of these improvements, which aims to minimize the travel time of vehicles by coordinating their movements at the road intersections.
CONTRIBUTIONS. As a Data Scientist and Research Engineer, I built an urban traffic simulation model starting from noisy sensor measurements and applied reinforcement learning methods to identify the optimal policy for the traffic light plan. Technologies: Flink, Kafka, SUMO microscopic agent-based traffic simulator, Java, Python, Jupyter Lab, NetworkX, probabilistic modeling methods, simulator-in-the-loop optimization strategies.
City-Scale People’s Movement Analytics
CONTEXT. Mobile devices connect regularly to the cellular network to receive messages, initiate calls, and transfer data. Network logs provide a comprehensive picture of how the population moves in metropolitan areas, enabling use cases such as optimization of out-of-home advertising.
CONTRIBUTIONS. As a Data Engineer and Data Scientist, I architected and implemented a GDPR-compliant analytics service delivering a population demographics API for urban areas. The service models people’s locations and movements from cellular network data, ingesting billions of records per day. Technologies: AWS EMR clusters, AWS S3, AWS Lambda, AWS CloudWatch, HDFS, Parquet, Hadoop, Spark, Scala, Python, Zeppelin notebooks, Docker.
Location Intelligence for Retail Analytics
CONTEXT. WiFi radio signals can be used to locate the emitting device, such as a smartphone, in indoor environments. In retail analytics, this data helps define areas of high visibility, evaluate the effectiveness of specific promotions, and inform product placement for more effective selling.
CONTRIBUTIONS. As a Senior Data Engineer and then Lead Data Scientist, I redesigned the existing WiFi analytics pipeline as a series of Spark jobs, resulting in simpler data/ML pipelines with a double-digit percentage accuracy increase. Technologies: PostgreSQL, Airflow, Celery, Cassandra, AWS S3, Python, Docker, Jupyter Lab, Scikit-learn, Pandas, Fabric, Flask.
Cloud Robotics & Drone Charging Stations
CONTEXT. Fully autonomous robots require remote management services to operate: fleet management, predictive maintenance, and teleoperation. Autonomous charging capabilities must be integrated to enable extended and unattended operation.
CONTRIBUTIONS. As founder and CTO, I supervised the design, development, and commercialization of charging stations and protective hangars for commercial drones with complementary management and connectivity services. My contributions led to contracts with more than thirty international clients, including the CIA, NASA, NIH, Google X, Parrot, and Stanford University. Technologies: ROS, Python, C++, Flask, Docker, Fabric, OpenCV, Jupyter Lab.
Modeling & Querying Data with Uncertainty
CONTEXT. As a Ph.D. student, I co-authored ten papers in top-tier international conferences and journals, including SIGMOD, VLDB, EDBT, KAIS, and DKE. I worked on problems related to the management and analysis of streaming and uncertain data. Moreover, I established collaborations with prominent research groups in the area of data management, namely, with the data management and data analytics groups at IBM T.J. Watson Research Center (USA), where I spent six months (during two visits) as a visiting researcher, and the Qatar Computing Research Institute (Qatar), where I spent another three months.
CONTRIBUTIONS (available on Google Scholar)
- “Improving Classification Quality in Uncertain Graphs”. Michele Dallachiesa, Charu Aggarwal, Themis Palpanas. ACM Journal of Data and Information Quality (JDIQ), 2018.
- “Similarity Using Correlation-Aware Measures”. Katsiaryna Mirylenka, Michele Dallachiesa, Themis Palpanas. International Conference on Scientific and Statistical Database Management (SSDBM), Chicago, USA, 2017.
- “Correlation-Aware Distance Measures for Data Series”. Katsiaryna Mirylenka, Michele Dallachiesa, Themis Palpanas. International Conference on Extending Database Technology (EDBT), Italy, 2017.
- “Top-k Nearest Neighbor Search In Uncertain Data Series”. Michele Dallachiesa, Themis Palpanas, Ihab F. Ilyas. Proceedings of the VLDB Endowment (PVLDB) Journal 8(1), 2015.
- “Sliding windows over uncertain data streams”. Michele Dallachiesa, Gabriela Jacques-Silva, Bugra Gedik, Kun-Lung Wu, Themis Palpanas. Knowledge and Information Systems (KAIS) Journal 45, 2015.
- “Node classification in uncertain graphs”. Michele Dallachiesa, Charu Aggarwal, Themis Palpanas. International Conference on Scientific and Statistical Database Management (SSDBM), Aalborg, Denmark, 2014.
- “NADEEF: a commodity data cleaning system”. Michele Dallachiesa, Amr Ebaid, Ahmed Eldawy, Ahmed Elmagarmid, Ihab Ilyas, Mourad Ouzzani, Nan Tang. ACM SIGMOD International Conference on Management of Data (SIGMOD), New York City, NY, USA, 2013.
- “Identifying Streaming Frequent Items In Ad-hoc Recent Time Windows”. Michele Dallachiesa, Themis Palpanas. Data & Knowledge Engineering (DKE) Journal, 2013.
- “Uncertain Time-Series Similarity: Return to the Basics”. Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, Themis Palpanas. Proceedings of the VLDB Endowment (PVLDB) Journal, Turkey, 2012.
- “Similarity Matching for Uncertain Time Series: Analytical and Experimental Comparison”. Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, Themis Palpanas. ACM SIGSPATIAL International Workshop on Querying and Mining Uncertain Spatio-Temporal Data (QUeST), Chicago, USA, 2011.
Visual Network Graph Exploration & Analytics
CONTEXT. Network visualization helps us understand complex relationships between entities and discover interesting patterns in very large data sets, using the superpowers of our visual cortex. Networks emerge naturally from social and physical interactions.
CONTRIBUTIONS. As an M.Sc. student, I implemented a commercial visualization tool to explore very large networks with millions of nodes and connections. The application has been downloaded by more than 600 users working on a myriad of use cases, from social network analysis to cybercrime analytics. The tool is also cited in several scientific publications, and its unique features have inspired other visualization products. Technologies: Graph layout methods, OpenGL, Qt, C++, Python, social network analysis methods.
IT Security Research
CONTEXT. As a hobbyist and member of the Italian hacker community, I enjoyed IT security for nearly a decade (2000-2010).
CONTRIBUTIONS. I have primarily investigated methods to conceal unauthorized access to networks. My efforts resulted in a series of public tools and some articles published in underground magazines (BfI, newbies) that have been cited in several research publications. Technologies: C/C++, Bash scripting, Python, Linux kernel programming, networking protocols.
REFERENCES (Open-source software)
- hopfake: monitor and fake network traceroute requests
- libimbw: full-duplex covert channels proxied to TCP/IP
- mikhail: connect hosts behind NATs
- cookie-tools: steal web sessions from nearby Wi-Fi connected devices
- ptrace-tools: inject commands on ptraced telnet/ssh sessions
- rtpbreak: capture and reconstruct voice calls of IP-based phones
- icopy: datalink packet bridge between network interfaces