Cal Poly Personal Assistant
Authors: Kai Chin Huang, Sarbo Roy, Ryan Ozawa, John Tran
Abstract
Virtual assistants (VAs) are becoming an ever larger part of daily life. VAs are programs that help us find information quickly and in a more natural way. These assistants have arguably come a long way since the first generation of commercial VAs and are now almost ubiquitous. Systems such as Alexa, Siri, Google Assistant, and Cortana give driving directions, call people with important information, find important destinations and items, answer general questions, and even handle tech support calls.
This project is an intelligent assistant that knows all about the Cal Poly Statistics and Computer Science departments. It is a chatbot that answers questions entered as text (which can easily be extended to voice): questions about classes, offerings, instructors, and basically anything that can be answered from https://schedules.calpoly.edu for the Fall 2021 quarter (must be on the VPN).
Software Architecture
From a high level, our Calpass chatbot consists of the following elements:
- A CLI interface that retrieves queries from the user and returns the appropriate response
- A decision tree classifier that classifies the query into one of 5 classes: Professor, Course, Building, Other, End
- A rule based extraction pipeline for querying the database with the relevant information from a classified query
- A web scraping agent that stores data inside an in-memory database
Our strategy with this project was to create highly modular components that could be optimized for their purpose independently of interactions with other components. As a result, dividing the labor and gluing the pieces together was trivial.
Although our modular approach allowed an efficient division of labor, there were still many tiers of group decision making to complete, mainly involving the parts of the project that required manual work:
Parsing through the training data
- Determining which classes the data can be clustered into
- Aggregating the data from groups and deciding on a standardized format
Entity extraction from queries
- Determining which keywords would map to a specific database lookup
Deciding the architecture of the system
- How to retrieve data, how to train the model, and what the workflow for the prediction pipeline is
Detailed Model Architecture:
Model Architecture:
- Bagged Decision Tree Classifier
- Uses DecisionTreeClassifier with max depth 22
- Total of 25 trees
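The bagged classifier above can be sketched with scikit-learn as follows; the toy queries and labels here are illustrative stand-ins, not the project's actual training data:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy labeled queries (illustrative only, not the real dataset)
queries = [
    "who teaches csc 480", "what is the email of professor smith",
    "when is csc 365 offered", "how many units is stat 312",
    "where is building 14", "what rooms are in the engineering building",
    "thanks goodbye", "bye",
]
labels = ["Professor", "Professor", "Course", "Course",
          "Building", "Building", "End", "End"]

# Vectorize with unigrams through trigrams, as in the project
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(queries)

# Bag of 25 decision trees, each limited to depth 22
model = BaggingClassifier(DecisionTreeClassifier(max_depth=22),
                          n_estimators=25)
model.fit(X, labels)

pred = model.predict(vectorizer.transform(["who teaches csc 480"]))
```

Bagging trains each tree on a bootstrap sample of the data, so the ensemble vote is less prone to a single deep tree's overfitting.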
Query Preprocessing
- TfIdf Vectorizer, ngram range (1,3)
- NLTK English Stemmer
- NLTK Word Tokenizer
- Remove words whose POS type is not Noun, Verb, or Wh-Determiner (who, what, when, where, why)
- Remove Words that are less than 2 characters
- Remove NLTK Stop Words
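A sketch of the normalization steps before TF-IDF vectorization. To keep it self-contained it uses NLTK's English Snowball stemmer with a simple regex tokenizer and a small illustrative stopword list; the real pipeline uses NLTK's word tokenizer, full stopword corpus, and POS tagger (which require downloaded NLTK data):

```python
import re

from nltk.stem.snowball import SnowballStemmer

# Small illustrative stopword list; the project uses NLTK's full English list
STOP_WORDS = {"the", "a", "an", "is", "this", "of", "in", "to", "for"}

stemmer = SnowballStemmer("english")


def preprocess(query: str) -> list[str]:
    """Normalize a query into a list of stems for TF-IDF."""
    # Lowercase and tokenize (a regex stands in for NLTK's word tokenizer)
    tokens = re.findall(r"[a-z]+", query.lower())
    # In the real pipeline, tokens whose POS tag is not a noun, verb,
    # or wh-determiner are also dropped here via nltk.pos_tag
    tokens = [t for t in tokens if len(t) >= 2 and t not in STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]
```

Stemming collapses inflected forms ("teaches", "teaching") onto one stem so the vectorizer treats them as the same feature.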
Signals:
When the “--show_signals” flag is enabled, any or all of the following signals may precede the chatbot response:
- Types: describes the type of response. Professor/Course/Building: normal response; Unknown: the chatbot could not determine the type of question; End: end of the conversation
- Error: Error message
- Target: Intended subject (professor/course/building) of the response
Entity Extraction Pipeline:
- Pipeline runs after query is classified by decision tree model
- Query text is normalized: lowercased, lemmatized, with punctuation and stop words removed
- The logic for the respective query class is then applied; this logic maps keywords present in the query to the relevant data inside the database (a pandas dataframe)
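As a sketch, the per-class logic can be a keyword-to-column mapping over the scraped dataframe. The column names, keyword table, and data below are assumptions for illustration, not the project's actual schema:

```python
import pandas as pd

# Toy stand-in for the scraped schedule data (column names are assumed)
db = pd.DataFrame({
    "professor": ["jane doe", "jane doe", "alex smith"],
    "course": ["csc 466", "csc 480", "stat 312"],
    "office": ["014-210", "014-210", "025-107"],
})

# Hypothetical keyword table: a keyword in the normalized query selects
# which column of the dataframe answers the question
PROFESSOR_KEYWORDS = {"office": "office", "teach": "course"}


def answer_professor_query(tokens, name):
    """Map keywords in a normalized Professor query to a database lookup."""
    rows = db[db["professor"] == name]
    if rows.empty:
        return None
    for token in tokens:
        column = PROFESSOR_KEYWORDS.get(token)
        if column is not None:
            return ", ".join(rows[column].unique())
    return None  # no keyword matched: fall through to the Unknown signal
```

Each query class (Professor, Course, Building) would carry its own keyword table and lookup column, which is why this logic is hand-engineered per class.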
Packages and Tools:
The entire project is written in Python 3.
Web Scraping Agent:
- Requests package to get HTML data from the Cal Poly schedule website
- BeautifulSoup4 package to extract relevant information from Cal Poly schedule HTML data
- Pandas package to save extracted data as a dataframe, serves as a pseudo database
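A minimal sketch of the scrape-and-store step, run here on an inline HTML snippet since the real pages at https://schedules.calpoly.edu sit behind the Cal Poly VPN; the table layout below is an assumption, not the site's actual markup:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline stand-in for a page that would be fetched with requests.get(...).text
html = """
<table>
  <tr><th>Course</th><th>Instructor</th><th>Room</th></tr>
  <tr><td>CSC 466</td><td>Doe, Jane</td><td>014-246</td></tr>
  <tr><td>STAT 312</td><td>Smith, Alex</td><td>038-121</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [
    [td.get_text(strip=True) for td in row.find_all("td")]
    for row in rows[1:]
]

# The dataframe serves as the in-memory pseudo database
df = pd.DataFrame(records, columns=header)
```

Keeping the scraped data in a dataframe avoids a separate database server and makes the lookup logic a matter of pandas filtering.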
Command Line Interface
- Click package for building command line wrappers and functions
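A minimal Click wrapper in the spirit of the interface described, with a hypothetical `ask` command and the --show_signals flag; the response logic is stubbed out:

```python
import click
from click.testing import CliRunner


@click.command()
@click.argument("query")
@click.option("--show_signals", is_flag=True,
              help="Print signals before the response.")
def ask(query, show_signals):
    """Answer a single QUERY (stubbed: a real classifier would run here)."""
    if show_signals:
        click.echo("Type: Unknown")
    click.echo(f"Sorry, I could not understand: {query}")


# Exercise the command in-process with Click's test runner
runner = CliRunner()
result = runner.invoke(ask, ["--show_signals", "when is csc 365 offered"])
```

Click handles argument parsing, help text, and flag wiring, so the chatbot loop only needs to supply the response function.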
Entity Extraction Pipeline
- Difflib package for gestalt pattern matching similarity algorithm
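Difflib's gestalt pattern matching lets lookups tolerate misspelled professor or building names; a sketch with toy names standing in for the database:

```python
import difflib

# Toy name list standing in for the professors column of the database
professors = ["jane doe", "alex smith", "maria garcia"]


def closest_professor(name, cutoff=0.6):
    """Return the best fuzzy match for a possibly misspelled name."""
    matches = difflib.get_close_matches(name.lower(), professors,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None


# The underlying similarity is SequenceMatcher's gestalt ratio:
# twice the number of matching characters over the total length
score = difflib.SequenceMatcher(None, "jane do", "jane doe").ratio()
```

The cutoff keeps wildly different names from matching, so a garbled query falls through to the Unknown signal instead of returning a wrong professor.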
Machine Learning Query Classifier
- Scikit-learn: TfidfVectorizer, train_test_split, metrics, KNeighborsClassifier, RandomForestClassifier, GaussianProcessClassifier
- NLTK: English Stemmer, POS Tagging
- Matplotlib: Plotting metrics for model
Testing and Results:
Web Scraping Agent
- Manually inspecting webpages and checking that the data was retrieved properly into the pandas dataframe
Command Line Interface
- Manually testing hand-selected queries and checking whether the output is correct
Machine Learning Query Classifier
- Created train and test sets from our normalized dataset
- Compared the best results from testing different models
Entity Extraction Pipeline
- Separated dataset into buckets by query class
- For each class, ran appropriate query extraction logic and calculated the percentage of queries that returned a response
- 75% of Professor queries answered
- 68% of Course queries answered
- 64% of Building queries answered
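The coverage numbers above come from counting how many queries in each bucket produced a non-empty response; a sketch of that calculation with hypothetical extraction results:

```python
def answered_percentage(responses):
    """Percentage of queries whose extraction logic returned a response."""
    answered = sum(1 for r in responses if r is not None)
    return 100 * answered / len(responses)


# Hypothetical extraction results for one query class (None = no response)
building_responses = ["014-246", None, "180-101", None]
coverage = answered_percentage(building_responses)
```

Note that this measures whether a response was produced at all, not whether it was correct, a limitation discussed below.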
Problems/Weaknesses:
- User sessions not implemented
- Accuracy in classifying a query as Professor/Building/Course is only 48%
- The entity extraction logic for each class is hand-engineered and does not scale with the number of classes
- Cannot give responses for “non-standard” queries, e.g. “Who teaches PL this quarter?”
- Response accuracy is currently measured by the percentage of queries answered, not by whether the returned response is correct. Ideally we would randomly sample a subset of queries in the dataset and create train/test sets to validate accuracy, instead of running on the entire dataset and only measuring whether the question was answered
Thanks for reading, and feel free to check out the source code!