Data Scientist / AI Engineer

Email: boulaich.mohamed970@gmail.com
Alternative Email: mboulaich@insea.ac.ma
LinkedIn: Mohamed Boulaich
GitHub: BlcMed

Summary

I am a Data Science student passionate about leveraging data-driven approaches to solve real-world problems. My academic background has equipped me with skills in statistical analysis, machine learning, and data visualization. I’m eager to apply my knowledge to contribute meaningfully to the field of data science.

Internship

OCR-Based Data Extraction
Direction Générale des Impôts, Rabat, Morocco Jun. 2024 – Aug. 2024
- Developed textify-docs, a Python library for text extraction from diverse document formats, focusing on modularity and reusability.
- Built a pipeline using OpenCV for image preprocessing, followed by Pytesseract for efficient text extraction.
- Applied Table Transformer (TATR), based on DETR, to enhance table extraction accuracy.
- Utilized LLMs to identify and structure relevant information.

The main pipeline for extracting information from the “portail marocain des marchés publics”

The core algorithm behind the textify_docs library

Engineered RAG Chatbot
Maroc Telecom, Rabat, Morocco Jun. 2024 – Aug. 2024
- Leveraged LlamaIndex framework with vector indexing for efficient information retrieval.
- Integrated ChromaDB for optimized storage and loading of embeddings.
- Optimized chatbot responses through prompt engineering, hyperparameter tuning, and tools abstraction.
- Created a user-friendly interface using Streamlit to showcase the chatbot’s functionality.

Streamlit app featuring an interactive chatbot

Data Analyst Intern
Higher Planning Commission (HCP), Tangier, Morocco
Jun. 2023 – Jul. 2023
- Retrieved demographic data from HCP BDS (Base de Données Statistiques) for various regions and provinces, covering 2015 to 2023.
- Employed linear regression to forecast the 2024 population from the gathered demographic data.
- Used the Folium library to create interactive maps visualizing population distributions across regions.
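
The forecasting step can be sketched as a least-squares line fitted to yearly totals and extrapolated one year ahead. This is a minimal illustration only: the population figures below are invented placeholders (not HCP data), and `fit_line` and `forecast` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(xs, ys, target_x):
    """Extrapolate the fitted line to target_x."""
    slope, intercept = fit_line(xs, ys)
    return slope * target_x + intercept


# Invented yearly population figures (in thousands), for illustration only.
years = [2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023]
population = [1000.0, 1012.0, 1024.0, 1036.0, 1048.0, 1060.0, 1072.0, 1084.0, 1096.0]

forecast_2024 = forecast(years, population, 2024)
print(forecast_2024)
```

A straight-line fit assumes roughly constant yearly growth, which is a reasonable first approximation over a short horizon such as one year.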

Projects

Accent Detection DL model – PyTorch, Pandas, Scikit-Learn, Librosa (Apr. 2024 - Present)
GitHub Repo

Reviewed literature on accent recognition methods and deep learning architectures.
Collected audio recordings from the Speech Accent Archive.
Preprocessed the dataset using silence trimming and noise reduction.
Implemented and fine-tuned an ANN to predict accents using MFCCs, achieving 84.15% accuracy, 0.79 F1 score, and 0.80 precision.

Artistic Neural Style Transfer – PyTorch, Jupyter notebooks, Streamlit (Oct. 2024 - Present)
GitHub Repo

Utilized Neural Style Transfer with VGG19 CNN for feature extraction and gram matrix computations.
Developed a Streamlit app allowing users to generate stylized images.
Fine-tuned hyperparameters to optimize style transfer quality.
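
The Gram matrix computation at the heart of the style representation can be sketched without any framework. This is a dependency-free illustration, not the project's PyTorch code: `gram_matrix` and `style_loss` are hypothetical names, and normalizing by channels times positions is an assumed convention.

```python
def gram_matrix(features):
    """features: list of C channels, each a flat list of H*W activations.

    Returns the C x C matrix of channel-to-channel inner products,
    normalized by C * H * W (an assumed convention).
    """
    n_channels = len(features)
    n_positions = len(features[0])
    gram = [[0.0] * n_channels for _ in range(n_channels)]
    for i in range(n_channels):
        for j in range(i, n_channels):
            dot = sum(a * b for a, b in zip(features[i], features[j]))
            gram[i][j] = gram[j][i] = dot / (n_channels * n_positions)
    return gram


def style_loss(generated, style):
    """Mean squared difference between the two Gram matrices."""
    g, s = gram_matrix(generated), gram_matrix(style)
    diffs = [(gv - sv) ** 2 for g_row, s_row in zip(g, s) for gv, sv in zip(g_row, s_row)]
    return sum(diffs) / len(diffs)


# Toy feature map: 2 channels, 4 spatial positions each.
toy_features = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
toy_gram = gram_matrix(toy_features)
print(toy_gram)
```

Because the Gram matrix discards spatial positions and keeps only channel correlations, it captures texture rather than layout, which is why it serves as the style signal.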

Diagram illustrating the Neural Style Transfer process. The content image (bottom left) and style image (top left) are passed through a convolutional neural network. Content and style representations are extracted at different layers. These representations are then used to guide the transformation of a generated image (typically a white noise image), resulting in style reconstructions (top) and content reconstructions (bottom) at various levels of the network. (Gatys et al., 2015)

Streamlit interface

Sentiment Analysis on Movie Reviews – spaCy, PySpark, Pandas, Scikit-Learn (May 2024 - Jun. 2024)
GitHub Repo

Preprocessed text data using tokenization, lemmatization, and vectorization (BoW, TF-IDF).
Trained classifiers (Logistic Regression, SVM, Naive Bayes), achieving 86% accuracy with SVM.
Used PySpark to process large-scale data, improving computational efficiency.
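
As an illustration of the TF-IDF vectorization step, the weighting can be computed by hand on tokenized reviews. This sketch uses the textbook formula (term frequency times log of inverse document frequency), which differs from Scikit-Learn's smoothed variant. The `tf_idf` function and the two toy reviews are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of non-empty token lists.

    Returns one {term: weight} dict per document, where
    weight = (count / doc_length) * log(n_docs / doc_frequency).
    """
    n_docs = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Two toy tokenized reviews, invented for illustration.
reviews = [["great", "movie"], ["boring", "movie"]]
review_weights = tf_idf(reviews)
print(review_weights)
```

A term that appears in every document, such as "movie" here, gets weight zero, so the classifier sees only the words that distinguish one review from another.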

Financial Cointegration Analysis | Time Series Analysis, Python, Jupyter Notebooks
Dec. 2023
GitHub Repo

Applied the Engle-Granger two-step method and cointegration concept to distinguish between spurious correlations and genuine long-term relationships.
Analyzed historical price data from Yahoo Finance using Python’s yfinance library.

Testing cointegration between two financial time series

Student Accommodation Clustering | Python, Scikit-Learn, Folium
Nov. 2022 – Dec. 2022
GitHub Repo

Developed a student accommodation clustering system to suggest optimal housing options based on proximity to preferred locations.
Conducted data collection and cleaning tasks, including API integration and handling missing values.
Implemented machine learning techniques, particularly the K-Means clustering algorithm from scikit-learn, to group accommodation options based on their similarity.
Performed exploratory data analysis (EDA) with visualizations and maps to gain insights into student preferences and housing patterns.
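
The clustering idea can be sketched with a bare-bones version of Lloyd's algorithm, the procedure behind K-Means. The project itself used Scikit-Learn's implementation. The coordinates below are invented, `k_means` is a hypothetical helper, and plain Euclidean distance on coordinate pairs is a simplifying assumption.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, n_iterations=10):
    """Lloyd's algorithm with a naive initialization (first k points)."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, labels


# Invented accommodation coordinates forming two obvious groups.
listings = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
centroids, labels = k_means(listings, k=2)
print(labels)
```

In practice, Scikit-Learn's `KMeans` adds better initialization (k-means++) and multiple restarts, which matter when the groups are less clearly separated than in this toy data.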

Optimal housing based on students' preferred locations.

Education

Engineering, Data Science
The National Institute of Statistics and Applied Economics, Rabat, Morocco
2022 - 2024
SPE, MP*
Higher School Preparatory Classes (CPGE) Moulay Idriss, Fes, Morocco
2021 - 2022

Skills

Statistics: Statistical Inference, Descriptive Statistics, Statistical Learning, Machine Learning Methodology, Traditional Modeling, Generalized Linear Models, Time Series Analysis, Hidden Markov Models, Stochastic Processes, Queuing Theory
Libraries: NumPy, Pandas, Scikit-Learn, PyTorch, Transformers, spaCy, Seaborn
DevOps Practices: Code Versioning, CI/CD (GitHub Actions), Automated Testing (tox, pytest), Containerization (Docker), Code Quality Tools (linters), Agile/Scrum
Data Tools: Apache Airflow, Apache Superset
Languages: English (fluent), French (fluent), Arabic (native)