Cal Poly Personal Assistant

Kai-Chin Huang
4 min read · Jun 12, 2021


Authors: Kai Chin Huang, Sarbo Roy, Ryan Ozawa, John Tran

Photo by Alex Knight on Unsplash

Abstract

Virtual assistants are becoming a larger and larger part of everyday life. Virtual assistants (VAs) are programs that help us find information quickly and in a more natural way. These assistants have arguably come a long way since the first generation of commercialized VAs and are now almost ubiquitous. These systems give driving directions, call people with important information, find important destinations and items, answer general questions (Alexa, Siri, Google Assistant, Cortana, etc.), and handle tech support calls.

Our project is an intelligent assistant that knows all about the Cal Poly Statistics and Computer Science departments. It is a chatbot that answers questions entered as text (and could easily be extended to voice): questions about classes, offerings, instructors, and basically anything that can be answered from https://schedules.calpoly.edu for the Fall 2021 quarter (the site must be accessed over the Cal Poly VPN).

Software Architecture

At a high level, our Calpass chatbot consists of the following elements:

  • A CLI interface that retrieves queries from the user and returns the appropriate response
  • A decision tree classifier that classifies the query into 5 classes: Professor, Course, Building, Other, End
  • A rule based extraction pipeline for querying the database with the relevant information from a classified query
  • A web scraping agent that stores data inside an in-memory database
[Figure: High Level Project Architecture]
[Figure: Detailed Architecture Components]

Our strategy with this project was to create highly modular components that could be optimized for their purpose independently of their interactions with other components. As a result, dividing the labor and gluing the pieces together was straightforward.

Although our modular approach allowed an efficient division of labor, there were still several tiers of group decision making to complete, mainly involving the parts of the project that required manual work, such as:

Parsing through the training data

  • Determining which classes the data can be clustered as
  • Aggregating the data from groups and deciding on a standardized format

Entity extraction from query

  • Determining which keywords would map to a specific database lookup

Deciding the architecture of the system

  • How to retrieve the data, how to train the model, and what the workflow for the prediction pipeline should be

Detailed Model Architecture:

Model Architecture:

  • Bagged Decision Tree Classifier
  • Uses DecisionTreeClassifier with max depth 22 (a sketch of this setup follows the list)
  • Total of 25 trees
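
As a rough illustration, the bagged tree setup above could be built with scikit-learn along the following lines; the max depth and number of trees mirror the bullets, while the variable names and random seed are ours.

```python
# Minimal sketch of the bagged decision tree classifier described above.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

query_classifier = BaggingClassifier(
    DecisionTreeClassifier(max_depth=22),  # each tree capped at depth 22
    n_estimators=25,                       # total of 25 trees
    random_state=42,
)
# query_classifier.fit(X_train, y_train)  # X_train comes from the TF-IDF preprocessing below
```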

Query Preprocessing

  • TfIdf Vectorizer, ngram range (1,3)
  • NLTK English Stemmer
  • NLTK Word Tokenizer
  • Remove words that are not of POS type: Noun, Verb, Wh-determiner (who, what, when, where, why)
  • Remove words that are fewer than 2 characters long
  • Remove NLTK stop words (a sketch of this preprocessing follows)
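
A minimal sketch of this preprocessing, assuming the standard NLTK data (punkt, stopwords, averaged_perceptron_tagger) is downloaded; the exact filtering rules and function names in the project may differ.

```python
# Sketch of the query preprocessing bullets above; details may differ from the project.
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import EnglishStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = EnglishStemmer()
stop_words = set(stopwords.words("english"))
KEEP_POS = ("NN", "VB", "W")  # nouns, verbs, and wh-words (WDT/WP/WRB)

def preprocess(query: str) -> str:
    tokens = nltk.word_tokenize(query.lower())
    tagged = nltk.pos_tag(tokens)
    kept = [
        stemmer.stem(tok)
        for tok, tag in tagged
        if tag.startswith(KEEP_POS)      # drop words outside the allowed POS types
        and len(tok) >= 2                # drop words shorter than 2 characters
        and tok not in stop_words        # drop NLTK stop words
    ]
    return " ".join(kept)

# TF-IDF over unigrams through trigrams, fit on the cleaned training queries
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
# X_train = vectorizer.fit_transform([preprocess(q) for q in training_queries])
```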

Signals:

When the “--show_signals” flag is enabled, any or all of the following signals may precede the chatbot response (a brief illustration follows this list):

  • Types: describes the type of response. Professor/Course/Building: normal response; Unknown: the chatbot could not determine the type of question; End: end of the conversation
  • Error: Error message
  • Target: Intended subject (professor/course/building) of the response
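
For illustration only, printing the signals ahead of a response might look roughly like the sketch below; the placeholder strings and formatting are ours, not the project's actual output.

```python
# Hypothetical illustration of --show_signals output; formatting is ours, not the project's.
signals = {"Types": "Professor", "Target": "<professor the query is about>"}
response = "<normal chatbot response>"

for name, value in signals.items():
    print(f"{name}: {value}")  # e.g. "Types: Professor", then "Target: ..."
print(response)                # the response itself follows the signals
```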

Entity Extraction Pipeline:

  • Pipeline runs after query is classified by decision tree model
  • Query text is normalized: lowercased, lemmatized, punctuation removed, stop words removed
  • The logic for the predicted query class is then applied; this logic maps keywords present in the query to the relevant data inside the database (a pandas dataframe), as sketched below
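
Here is a hypothetical sketch of the keyword-to-lookup idea for Professor-class queries; the keyword lists, dataframe column names, and response wording are made up for illustration and are not the project's actual rules.

```python
# Hypothetical sketch of the rule-based lookup for "Professor" queries;
# keywords, column names, and wording are illustrative only.
import pandas as pd

PROFESSOR_RULES = {
    ("office", "room"): "office",               # "Where is Professor X's office?"
    ("email", "contact"): "email",              # "How do I contact Professor X?"
    ("teach", "teaching", "class"): "courses",  # "What does Professor X teach?"
}

def answer_professor_query(query: str, prof_df: pd.DataFrame, prof_name: str):
    tokens = set(query.lower().split())
    row = prof_df[prof_df["name"].str.lower() == prof_name.lower()]
    if row.empty:
        return None
    for keywords, column in PROFESSOR_RULES.items():
        if tokens & set(keywords):              # any rule keyword present in the query
            return f"{prof_name}: {row.iloc[0][column]}"
    return None                                 # no rule matched; caller reports "Unknown"
```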

Packages and Tools:

The entire project is written in Python 3

Web Scraping Agent:

  • Requests package to get HTML data from the Cal Poly schedule website
  • BeautifulSoup4 package to extract relevant information from the Cal Poly schedule HTML data
  • Pandas package to save the extracted data as a dataframe, which serves as a pseudo database (see the sketch below)
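
A minimal sketch of the scraping step, assuming the schedule pages render their data as HTML tables; the actual page structure and parsing logic in the project may differ.

```python
# Sketch of the scraping agent; the table-based page structure assumed here is a guess.
import requests
from bs4 import BeautifulSoup
import pandas as pd

SCHEDULE_URL = "https://schedules.calpoly.edu"  # requires the Cal Poly VPN

def scrape_schedule(url: str = SCHEDULE_URL) -> pd.DataFrame:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):                                    # assumed: data lives in <tr> rows
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return pd.DataFrame(rows)                                         # the in-memory "pseudo database"
```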

Command Line Interface

  • Click package for building command line wrappers and functions
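
A hypothetical sketch of the Click wrapper follows; the command name, the answer() entry point, and the end-of-conversation convention are placeholders, with only the --show_signals flag taken from the post.

```python
# Hypothetical Click-based CLI loop; answer() stands in for the project's real pipeline.
import click

@click.command()
@click.option("--show_signals", is_flag=True, help="Print Types/Error/Target signals before each response.")
def chat(show_signals: bool) -> None:
    while True:
        query = click.prompt("You")
        signals, response, query_class = answer(query)  # placeholder for the real prediction pipeline
        if show_signals:
            for line in signals:
                click.echo(line)
        click.echo(response)
        if query_class == "End":                        # stop when the classifier predicts "End"
            break

if __name__ == "__main__":
    chat()
```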

Entity Extraction Pipeline

  • Difflib package for gestalt pattern matching similarity algorithm
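
For reference, difflib's gestalt pattern matching yields a similarity ratio between two strings, which is handy for matching misspelled names against the scraped schedule data; the strings below are just examples.

```python
# Gestalt pattern matching with difflib; the example strings are made up.
import difflib

# Similarity between a misspelled word and the intended one
print(difflib.SequenceMatcher(None, "statistcs", "statistics").ratio())  # ~0.95

# Pick the closest known candidate directly
candidates = ["statistics", "computer science"]
print(difflib.get_close_matches("statistcs", candidates, n=1))  # ['statistics']
```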

Machine Learning Query Classifier

  • Scikit-learn: TfidfVectorizer, train_test_split, metrics, KNeighborsClassifier, RandomForestClassifier, GaussianProcessClassifier
  • NLTK: English Stemmer, POS Tagging
  • Matplotlib: Plotting metrics for model

Testing and Results:

Web Scraping Agent

  • Manually glancing at webpages and checking whether the data was retrieved properly into the pandas dataframe

Command Line Interface

  • Manually testing hand-selected queries and checking whether the output is correct

Machine Learning Query Classifier

  • Created train and test sets from our normalized dataset
  • Selected the best results from testing different models (a rough comparison sketch follows this list)
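
A hypothetical sketch of how such a comparison might be run with the models named earlier; the scoring loop, split parameters, and variable names are ours, not the project's.

```python
# Hypothetical model comparison; only the candidate models come from the post.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier

# X is the TF-IDF matrix from preprocessing, y the Professor/Course/Building/Other/End labels.
# GaussianProcessClassifier needs dense input, so the sparse TF-IDF matrix is densified first.
X_train, X_test, y_train, y_test = train_test_split(X.toarray(), y, test_size=0.2, random_state=42)

candidates = {
    "bagged_tree": BaggingClassifier(DecisionTreeClassifier(max_depth=22), n_estimators=25),
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(),
    "gaussian_process": GaussianProcessClassifier(),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```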

Entity Extraction Pipeline

  • Separated dataset into buckets by query class
  • For each class, ran the appropriate query extraction logic and calculated the percentage of queries that returned a response (a sketch of this check follows the list)
  • 75% of Professor queries answered
  • 68% of Course queries answered
  • 64% of Building queries answered
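
A hypothetical sketch of that answered-rate calculation; the dataset variable and pipeline function below are placeholders.

```python
# Hypothetical answered-rate check per query class; names are placeholders.
from collections import defaultdict

answered = defaultdict(int)
total = defaultdict(int)

for query, query_class in labeled_queries:                  # e.g. ("Where is Prof. X's office?", "Professor")
    total[query_class] += 1
    response = run_extraction_pipeline(query, query_class)  # class-specific rule logic
    if response is not None:
        answered[query_class] += 1

for query_class, n in total.items():
    print(query_class, f"{100 * answered[query_class] / n:.0f}% answered")
```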

Problems/Weaknesses:

  • User sessions not implemented
  • Accuracy in classifying queries into Professor/Building/Course is only 48%
  • The entity extraction logic for each class is hand engineered and does not scale with the number of classes
  • Cannot give responses for “non-standard” queries, e.g. “Who teaches PL this quarter?”
  • Currently, response accuracy is measured by what percentage of queries are answered, not by whether the returned response is correct. Ideally we would randomly sample a subset of queries in the dataset and create train/test sets to validate accuracy, instead of running on the entire dataset and only measuring whether the question is answered

Thanks for reading, and feel free to check out the source code!
