Cal Poly Personal Assistant
Authors: Kai Chin Huang, Sarbo Roy, Ryan Ozawa, John Tran
Abstract
Virtual assistants (VAs) are becoming an ever larger part of daily life. VAs are programs that help us find information quickly and in a more natural way. These assistants have arguably come a long way since the first generation of commercial VAs and are now almost ubiquitous. Systems such as Alexa, Siri, Google Assistant, and Cortana give driving directions, call people with important information, find important destinations and items, answer general questions, and even handle tech support calls.
This project is an intelligent assistant that knows all about the Cal Poly Statistics and Computer Science departments. It is a chatbot that answers questions entered as text (which can easily be extended to voice): questions about classes, offerings, instructors, and basically anything that can be answered from https://schedules.calpoly.edu for the Fall 2021 quarter (must be on the VPN).
Software Architecture
From a high level, our Calpass chatbot consists of the following elements:
- A CLI interface that retrieves queries from the user and returns the appropriate response
- A decision tree classifier that classifies the query into one of 5 classes: Professor, Course, Building, Other, End
- A rule based extraction pipeline for querying the database with the relevant information from a classified query
- A web scraping agent that stores data inside an in-memory database
Our strategy with this project was to create highly modular components that could be optimized for their purpose independently of interactions with other components. As a result, dividing the labor and gluing the pieces together was trivial.
Although our modular approach allowed an efficient division of labor, there were still many tiers of group decision making to complete, mainly involving the parts of the project that required manual work:
Parsing through the training data
- Determining which classes the data can be clustered into
- Aggregating the data from groups and deciding on a standardized format
Entity extraction from queries
- Determining which keywords would map to a specific database lookup
Deciding the architecture of the system
- How to retrieve data, how to train the model, and what the workflow for the prediction pipeline is
Detailed Model Architecture:
Model Architecture:
- Bagged Decision Tree Classifier
- Uses DecisionTreeClassifier with max depth 22
- Total of 25 trees
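The bagged classifier above can be sketched with scikit-learn as follows; the toy queries and labels here are illustrative stand-ins, not the project's actual training data:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy labeled queries (illustrative only, not the real dataset)
queries = [
    "who teaches csc 480", "what is the email of professor smith",
    "when is csc 365 offered", "how many units is stat 312",
    "where is building 14", "what rooms are in the engineering building",
    "thanks goodbye", "bye",
]
labels = ["Professor", "Professor", "Course", "Course",
          "Building", "Building", "End", "End"]

# Vectorize with unigrams through trigrams, as in the project
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(queries)

# Bag of 25 decision trees, each limited to depth 22
model = BaggingClassifier(DecisionTreeClassifier(max_depth=22),
                          n_estimators=25)
model.fit(X, labels)

pred = model.predict(vectorizer.transform(["who teaches csc 480"]))
```

Bagging trains each tree on a bootstrap sample of the data, so the ensemble vote is less prone to a single deep tree's overfitting.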
Query Preprocessing
- TfIdf Vectorizer, ngram range (1,3)
- NLTK English Stemmer
- NLTK Word Tokenizer
- Remove words whose POS type is not Noun, Verb, or Wh-Determiner (who, what, when, where, why)
- Remove Words that are less than 2 characters
- Remove NLTK Stop Words
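A sketch of the normalization steps before TF-IDF vectorization. To keep it self-contained it uses NLTK's English Snowball stemmer with a simple regex tokenizer and a small illustrative stopword list; the real pipeline uses NLTK's word tokenizer, full stopword corpus, and POS tagger (which require downloaded NLTK data):

```python
import re

from nltk.stem.snowball import SnowballStemmer

# Small illustrative stopword list; the project uses NLTK's full English list
STOP_WORDS = {"the", "a", "an", "is", "this", "of", "in", "to", "for"}

stemmer = SnowballStemmer("english")


def preprocess(query: str) -> list[str]:
    """Normalize a query into a list of stems for TF-IDF."""
    # Lowercase and tokenize (a regex stands in for NLTK's word tokenizer)
    tokens = re.findall(r"[a-z]+", query.lower())
    # In the real pipeline, tokens whose POS tag is not a noun, verb,
    # or wh-determiner are also dropped here via nltk.pos_tag
    tokens = [t for t in tokens if len(t) >= 2 and t not in STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]
```

Stemming collapses inflected forms ("teaches", "teaching") onto one stem so the vectorizer treats them as the same feature.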
Signals:
When the “--show_signals” flag is enabled, any or all of the following signals may precede the chatbot response:
- Types: describes the type of response. Professor/Course/Building: normal response; Unknown: the chatbot could not determine the type of question; End: end of the conversation
- Error: Error message
- Target: Intended subject (professor/course/building) of the response
Entity Extraction Pipeline:
- Pipeline runs after query is classified by decision tree model
- Query text is normalized: lowercased, lemmatized, with punctuation and stop words removed
- The logic for the respective query class is then applied; this logic maps keywords present in the query to the relevant data inside the database (a pandas dataframe)
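As a sketch, the per-class logic can be a keyword-to-column mapping over the scraped dataframe. The column names, keyword table, and data below are assumptions for illustration, not the project's actual schema:

```python
import pandas as pd

# Toy stand-in for the scraped schedule data (column names are assumed)
db = pd.DataFrame({
    "professor": ["jane doe", "jane doe", "alex smith"],
    "course": ["csc 466", "csc 480", "stat 312"],
    "office": ["014-210", "014-210", "025-107"],
})

# Hypothetical keyword table: a keyword in the normalized query selects
# which column of the dataframe answers the question
PROFESSOR_KEYWORDS = {"office": "office", "teach": "course"}


def answer_professor_query(tokens, name):
    """Map keywords in a normalized Professor query to a database lookup."""
    rows = db[db["professor"] == name]
    if rows.empty:
        return None
    for token in tokens:
        column = PROFESSOR_KEYWORDS.get(token)
        if column is not None:
            return ", ".join(rows[column].unique())
    return None  # no keyword matched: fall through to the Unknown signal
```

Each query class (Professor, Course, Building) would carry its own keyword table and lookup column, which is why this logic is hand-engineered per class.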
Packages and Tools:
The entire project is written in Python 3.
Web Scraping Agent:
- Requests package to get HTML data from the Cal Poly schedule website
- BeautifulSoup4 package to extract relevant information from Cal Poly schedule HTML data
- Pandas package to save extracted data as a dataframe, serves as a pseudo database
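A minimal sketch of the scrape-and-store step, run here on an inline HTML snippet since the real pages at https://schedules.calpoly.edu sit behind the Cal Poly VPN; the table layout below is an assumption, not the site's actual markup:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline stand-in for a page that would be fetched with requests.get(...).text
html = """
<table>
  <tr><th>Course</th><th>Instructor</th><th>Room</th></tr>
  <tr><td>CSC 466</td><td>Doe, Jane</td><td>014-246</td></tr>
  <tr><td>STAT 312</td><td>Smith, Alex</td><td>038-121</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [
    [td.get_text(strip=True) for td in row.find_all("td")]
    for row in rows[1:]
]

# The dataframe serves as the in-memory pseudo database
df = pd.DataFrame(records, columns=header)
```

Keeping the scraped data in a dataframe avoids a separate database server and makes the lookup logic a matter of pandas filtering.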
Command Line Interface
- Click package for building command line wrappers and functions
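A minimal Click wrapper in the spirit of the interface described, with a hypothetical `ask` command and the --show_signals flag; the response logic is stubbed out:

```python
import click
from click.testing import CliRunner


@click.command()
@click.argument("query")
@click.option("--show_signals", is_flag=True,
              help="Print signals before the response.")
def ask(query, show_signals):
    """Answer a single QUERY (stubbed: a real classifier would run here)."""
    if show_signals:
        click.echo("Type: Unknown")
    click.echo(f"Sorry, I could not understand: {query}")


# Exercise the command in-process with Click's test runner
runner = CliRunner()
result = runner.invoke(ask, ["--show_signals", "when is csc 365 offered"])
```

Click handles argument parsing, help text, and flag wiring, so the chatbot loop only needs to supply the response function.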
Entity Extraction Pipeline
- Difflib package for gestalt pattern matching similarity algorithm
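Difflib's gestalt pattern matching lets lookups tolerate misspelled professor or building names; a sketch with toy names standing in for the database:

```python
import difflib

# Toy name list standing in for the professors column of the database
professors = ["jane doe", "alex smith", "maria garcia"]


def closest_professor(name, cutoff=0.6):
    """Return the best fuzzy match for a possibly misspelled name."""
    matches = difflib.get_close_matches(name.lower(), professors,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None


# The underlying similarity is SequenceMatcher's gestalt ratio:
# twice the number of matching characters over the total length
score = difflib.SequenceMatcher(None, "jane do", "jane doe").ratio()
```

The cutoff keeps wildly different names from matching, so a garbled query falls through to the Unknown signal instead of returning a wrong professor.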
Machine Learning Query Classifier
- Scikit-learn: TfidfVectorizer, train_test_split, metrics, KNeighborsClassifier, RandomForestClassifier, GaussianProcessClassifier
- NLTK: English Stemmer, POS Tagging
- Matplotlib: Plotting metrics for model
Testing and Results:
Web Scraping Agent
- Manually inspecting webpages and checking that the data was retrieved properly into the pandas dataframe
Command Line Interface
- Manually testing hand-selected queries and checking whether the output is correct
Machine Learning Query Classifier
- Created train and test sets from our normalized dataset
- Compared the best results from testing different models
Entity Extraction Pipeline
- Separated dataset into buckets by query class
- For each class, ran appropriate query extraction logic and calculated the percentage of queries that returned a response
- 75% of Professor queries answered
- 68% of Course queries answered
- 64% of Building queries answered
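The coverage numbers above come from counting how many queries in each bucket produced a non-empty response; a sketch of that calculation with hypothetical extraction results:

```python
def answered_percentage(responses):
    """Percentage of queries whose extraction logic returned a response."""
    answered = sum(1 for r in responses if r is not None)
    return 100 * answered / len(responses)


# Hypothetical extraction results for one query class (None = no response)
building_responses = ["014-246", None, "180-101", None]
coverage = answered_percentage(building_responses)
```

Note that this measures whether a response was produced at all, not whether it was correct, a limitation discussed below.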
Problems/Weaknesses:
- User sessions not implemented
- Accuracy in classifying a query as Professor/Building/Course is only 48%
- The entity extraction logic for each class is hand-engineered and does not scale with the number of classes
- Cannot give responses for “non-standard” queries, e.g. “Who teaches PL this quarter?”
- Response accuracy is currently measured by the percentage of queries answered, not by whether the returned response is correct. Ideally we would randomly sample a subset of queries in the dataset and create train/test sets to validate accuracy, instead of running on the entire dataset and only measuring whether the question was answered
Thanks for reading, and feel free to check out the source code!