Projects
RAG Microservice: Research-mate chatbot

Technologies: Python, FastAPI, PyTorch, Transformers, GCP, Pinecone, Prompt Engineering, Llama 3.2, chat UI frontend
LLMs brought AI models with strong general reasoning and intelligence, but they are trained on public data, so they have no knowledge of one's private data. What if we could teach an LLM about private data? Augmenting LLMs with private data opens up a wide range of possibilities. For instance, this project lets an LLM learn from our collection of research papers, answer queries about them, and find the relevant papers.
The project is a context-aware RAG chatbot built on the Pinecone vector database, enabling semantic search across 2,700 research papers with 95% query relevance and response times measured in seconds. It applies Anthropic's recent Contextual Retrieval approach to RAG and is optimized with Binary Quantization, achieving a 7x speedup in inference time and an 85% reduction in memory. As the data grows, the embeddings outgrow available memory, and disk I/O for matching is orders of magnitude slower than in-memory operations. Binary Quantization shrinks the stored embeddings while keeping them relevant, so they fit in memory for fast matching and low retrieval latency. The project plans to improve upon Elasticsearch's beta Better Binary Quantization (BBQ) feature, which provides greater savings but at a harsher memory-recall trade-off.
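A minimal sketch of the binary quantization idea, using NumPy purely for illustration; the real service relies on Pinecone, and the corpus size and embedding dimensions below are made up:

```python
import numpy as np

def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Quantize float embeddings to 1 bit per dimension (sign-based)."""
    bits = (embeddings > 0).astype(np.uint8)   # 32x smaller than float32
    return np.packbits(bits, axis=-1)          # pack 8 dimensions per byte

def hamming_scores(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """Rank corpus vectors by Hamming distance to the query (lower is closer)."""
    xor = np.bitwise_xor(corpus, query)        # bits that differ
    return np.unpackbits(xor, axis=-1).sum(axis=-1)

# Toy usage: 2,700 documents with 1,024-dimensional embeddings
corpus = binarize(np.random.randn(2700, 1024))
query = binarize(np.random.randn(1, 1024))
top10 = np.argsort(hamming_scores(query, corpus))[:10]
```

Because the quantized vectors live entirely in memory, matching reduces to cheap bitwise operations instead of disk-bound float comparisons.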

Automated Essay Evaluation with Transformers and LLMs

Technologies: Python, PyTorch, ONNX, TensorRT, FastAPI, AWS, MLflow, Evidently, Frontend
Essays represent a person's critical thinking and offer a glimpse into their mind. They serve both as a way to cultivate thoughts and ideas and as a way to evaluate a person, which is why academia and educational institutions rely on them. Essay evaluation has traditionally been performed manually by humans, which is subject to many factors and can lead to inconsistency.
So, the project explores ways to automate essay evaluation with Transformers such as BERT and DeBERTa-v3 and Large Language Models (LLMs) such as GPT-2. Automating this task saves many hours of human labour, reduces mental fatigue, shortens evaluation time, and cuts down on grading errors. We leveraged multiple Parameter-Efficient Fine-Tuning (PEFT) techniques to fine-tune LLMs on a small amount of data, along with other transfer-learning techniques to find a good set of parameters for our dataset. The project also used a dynamic learning rate with cosine annealing and warm-up, and Cohen's Kappa as the evaluation metric. With these techniques, the model achieved a Kappa score of 81.7.
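A minimal sketch of that recipe, assuming the Hugging Face transformers and peft libraries; the base model, LoRA settings, label count, and step counts are illustrative rather than the project's exact configuration:

```python
import torch
from transformers import AutoModelForSequenceClassification, get_cosine_schedule_with_warmup
from peft import LoraConfig, get_peft_model
from sklearn.metrics import cohen_kappa_score

# Base model with a classification head for discrete essay scores (e.g., 6 score bands)
model = AutoModelForSequenceClassification.from_pretrained("microsoft/deberta-v3-base", num_labels=6)

# Parameter-Efficient Fine-Tuning: train small LoRA adapters instead of all weights
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1, target_modules=["query_proj", "value_proj"])
model = get_peft_model(model, lora)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
# Dynamic learning rate: linear warm-up followed by cosine annealing
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=100, num_training_steps=1000)

# ... training loop: loss.backward(); optimizer.step(); scheduler.step() ...

# Evaluation on held-out essays with (quadratic-weighted) Cohen's Kappa:
# kappa = cohen_kappa_score(y_true, y_pred, weights="quadratic")
```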
Open Source Project: mAIgic

Technologies: Python, OpenAI Function Calling, CircleCI, Pytest, MyPy, Ruff, uv
LLMs brought AI models with strong general reasoning and intelligence, but they are trained on public data, so they have no knowledge of one's private data. What if we could teach an LLM about private data? Augmenting LLMs with private data opens up a wide range of possibilities. For instance, this project lets an LLM read emails, extract meaningful tasks from them, and add those tasks to a Trello board, which also creates a private, queryable knowledge base. The project plans to extend this with more tools and functionality.
mAIgic is a smart AI assistant with general knowledge and reasoning capabilities. The project leverages OpenAI's function calling, achieving 95% accuracy in task extraction and automated Trello board updates and reducing manual email processing time by 70%. It is engineered as a production-grade Python API with 100% test coverage enforced through a CircleCI pipeline, comprehensive static type checking with mypy, and a SQLite-based conversation tracking system. The project uses uv, a modern Rust-based Python package manager, for efficient dependency and package management.
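A minimal sketch of how a task-extraction tool can be exposed through OpenAI function calling; the tool name, schema, and email text are illustrative, not mAIgic's actual interface:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Describe the tool so the model can decide when (and with what arguments) to call it
tools = [{
    "type": "function",
    "function": {
        "name": "create_trello_card",
        "description": "Create a Trello card for a task found in an email.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "due_date": {"type": "string", "description": "ISO 8601 date, if mentioned"},
            },
            "required": ["title"],
        },
    },
}]

email = "Hi, please send the quarterly report to finance by Friday."
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Extract actionable tasks from this email:\n{email}"}],
    tools=tools,
)

# The model returns structured arguments instead of free text
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```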
Open Source Project: Migrating NYU Vida Lab’s Wildlife Trafficking Prevention Project Pipeline onto Apache PySpark.

Technologies: Python, PySpark, AWS, MinIO, Databricks.Koalas
The global wildlife trafficking industry is valued between $7 billion and $23 billion annually. Wildlife crime is estimated to cause a loss of $1 trillion to the global economy annually, considering the environmental damage, loss of biodiversity, and impact on local economies. In order to prevent the sale of illegally trafficked animal parts and products, NYU's Vida Lab is working on a project to identify such animal products being sold on the internet, so that the necessary actions can be taken against their sellers.
The project uses machine-learning-based and rule-based crawling and extraction to find relevant data on the internet, then applies a zero-shot, multi-modal AI model to identify the products, and also uses AI to generate clean data. This data ETL pipeline was not scalable, so we migrated it onto Apache Spark with PySpark, making the pipeline more robust and faster and achieving a 160% speedup.
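A minimal sketch of the migration pattern: a single-machine cleaning step rewritten as distributed PySpark DataFrame operations. The bucket paths, column names, and filter logic below are made up for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wildlife-etl").getOrCreate()

# Before: pandas loaded the whole extract into one machine's memory.
# After: Spark reads it as a distributed DataFrame and plans the work lazily.
listings = spark.read.parquet("s3a://bucket/raw_listings/")  # illustrative path

cleaned = (
    listings
    .filter(F.col("description").isNotNull())
    .withColumn("price_usd", F.col("price").cast("double"))
    .dropDuplicates(["listing_id"])
)

# Partitioned output lets the downstream model-inference stages run in parallel
cleaned.write.mode("overwrite").partitionBy("source_site").parquet("s3a://bucket/clean_listings/")
```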
End-to-End web app: Personal Health Assistant for Diabetics, hosted with auto-deployment on Heroku.

Technologies: Python, TensorFlow, Flask, Heroku, Frontend, HTML, CSS, JavaScript, Bootstrap
As of 2021, approximately 537 million adults (20-79 years) worldwide have diabetes. This number is expected to rise to 643 million by 2030 and 783 million by 2045. Type 2 Diabetes accounts for about 90-95% of all diabetes cases. It is primarily related to lifestyle factors and genetic predisposition. The global economic cost of diabetes in 2021 was estimated at $966 billion, a 316% increase over the past 15 years. Diabetes is a leading cause of death, responsible for approximately 6.7 million deaths in 2021, equating to 1 death every 5 seconds.
Type-2 diabetes is preventable with lifestyle changes, so the project estimates a person's chances of developing it on the basis of multiple factors. The project is an end-to-end web app whose model was trained on the Behavioral Risk Factor Surveillance System (BRFSS) dataset. The web app also calculates a person's BMI and informs them about their lifestyle, provides physician-recommended food and lifestyle tips, and lists good-quality medicines for diabetic people.
The project uses an ensemble of multiple machine learning and deep learning models, aggregating their decisions into a final, conclusive estimate. For medical applications, accuracy alone is not a good metric; the ideal is a combination of high precision, recall, and specificity. The model achieves 95.8% precision and recall and 99.4% specificity.
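A minimal sketch of how such a web app can serve the model behind a Flask endpoint; the feature names, model file, and loading mechanism are hypothetical, not the project's actual code:

```python
from flask import Flask, request, jsonify
import joblib  # hypothetical: the trained ensemble saved to disk

app = Flask(__name__)
model = joblib.load("diabetes_model.joblib")  # illustrative artifact name

FEATURES = ["bmi", "age", "high_bp", "phys_activity", "smoker"]  # illustrative subset

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    row = [[payload[f] for f in FEATURES]]
    risk = model.predict_proba(row)[0][1]   # probability of developing type-2 diabetes
    return jsonify({"risk": round(float(risk), 3)})

if __name__ == "__main__":
    app.run()
```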
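Specificity is not a built-in helper in most libraries, but it falls out of the confusion matrix; a small scikit-learn sketch with placeholder predictions:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 1, 1, 0, 1, 0, 0]   # placeholder labels: 1 = likely to develop diabetes
y_pred = [0, 0, 1, 0, 0, 1, 0, 1]   # placeholder ensemble predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)                 # how well healthy people are ruled out
precision = precision_score(y_true, y_pred)  # of flagged people, how many truly at risk
recall = recall_score(y_true, y_pred)        # of at-risk people, how many get flagged
print(f"precision={precision:.3f} recall={recall:.3f} specificity={specificity:.3f}")
```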
AI-based classification of New York City open-source noise data into 10 categories.

Technologies: Python, PyTorch, librosa, Convolutional Neural Network (CNN)
New York City is a metropolis with numerous sources of noise pollution, and these noise levels are harmful to people's hearing health. For instance, it has been estimated that 9 out of 10 adults in New York City (NYC) are exposed to excessive noise levels, i.e., beyond the limit of what the EPA considers harmful. When applied to U.S. cities of more than 4 million inhabitants, such estimates extend to over 72 million urban residents. So, it is important to identify noise sources in order to mitigate them. The noise data was sourced by NYU's Music and Audio Research Lab, funded by NYU's Centre for Urban Science.
The project leverages multiple machine learning and deep learning techniques to identify noise categories with 86% accuracy, using a deep convolutional network to understand and classify audio data. With it, one can identify the noise sources in the city. The project can be extended to real-time identification by placing noise-sensing devices across the city, whose data could be used to re-position the appropriate officials and minimize noise pollution.
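A minimal sketch of the audio front end, assuming the common pattern of turning each clip into a log-mel spectrogram that a CNN then classifies; the file name, mel parameters, and tiny network are illustrative, not the project's exact architecture:

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

# Turn a raw audio clip into a log-mel spectrogram "image" the CNN can consume
y, sr = librosa.load("jackhammer.wav", sr=22050)             # illustrative file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)
x = torch.tensor(log_mel).unsqueeze(0).unsqueeze(0).float()  # (batch, channel, mels, frames)

# A tiny convolutional classifier over the spectrogram; the real model is deeper
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),                                        # 10 noise categories
)
logits = model(x)
print(logits.argmax(dim=1))  # predicted category index
```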
Leveraging a custom Deep Residual Convolutional Network under 5M parameters to classify images into 10 categories.

Technologies: Python, PyTorch, Convolutional Neural Network (CNN), ResNet50, Fine-tuning
Conventional deep neural networks stack many layers, and beyond a certain depth they start suffering from issues like vanishing or exploding gradients and overfitting. By leveraging residual (skip) connections, we can build deeper networks without these problems, since the improved gradient flow and regularization help training.
The project uses a deep residual convolutional neural network to identify 10 classes of RGB images at a low resolution of 32x32 pixels, without using any pre-trained weights such as those from ImageNet. We employed multiple data augmentation and feature-enhancing transformations on the images. With around 4.7M parameters, the model strikes a good balance between deep and shallow networks, so it generalizes to new images while still fitting the training data. Dropout and L2 regularization also help with overfitting. The model achieves 81% accuracy.
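A minimal sketch of the basic residual block such a network is built from; the channel sizes and layout are illustrative, not the project's exact configuration:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection: out = F(x) + x."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # skip connection keeps gradients flowing

# 32x32 input, as in the project (64 channels after an initial stem convolution)
x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```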
Training a Generative Adversarial Network (GAN) to generate images of clothes.

Technologies: Python, PyTorch, Convolutional Generative Adversarial Networks
A Generative Adversarial Network pits two models against each other: the Generator tries to outperform the Discriminator, and the Discriminator tries to outperform the Generator, with the Generator learning to create images from random noise. To train a GAN, it is imperative that the pair reach a saddle point of the adversarial objective, which provides an equilibrium between Discriminator and Generator.
The project uses a deep convolutional generative adversarial network with 2 convolutional layers and activation functions such as LeakyReLU and Tanh. Over the course of training, the Generator and Discriminator losses converge smoothly, indicating that the model has found the saddle point.
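A minimal sketch of what a generator and discriminator of that shape could look like in PyTorch, for 28x28 grayscale clothing images; the layer sizes are illustrative, not the project's exact configuration:

```python
import torch
import torch.nn as nn

latent_dim = 100

# Generator: project random noise up to a 28x28 image with two transposed convolutions
generator = nn.Sequential(
    nn.ConvTranspose2d(latent_dim, 64, kernel_size=7, stride=1),   # 1x1 -> 7x7
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 1, kernel_size=4, stride=4),            # 7x7 -> 28x28
    nn.Tanh(),                                                      # pixel values in [-1, 1]
)

# Discriminator: two convolutions that downsample and score how real an image looks
discriminator = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1),          # 28x28 -> 14x14
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, kernel_size=14),                               # 14x14 -> 1x1 score
    nn.Flatten(), nn.Sigmoid(),
)

z = torch.randn(8, latent_dim, 1, 1)
fake = generator(z)                       # (8, 1, 28, 28) generated images
print(discriminator(fake).shape)          # (8, 1) realness probabilities
```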