Finding suitable library space — A Data Science Project

For my final thesis at UQ, I was tasked with completing a project based on a research topic. Of the many topics available, one stood out: “Designing Citizen Centric Smart Cities”.

The Problem

Right now, cities are undergoing radical transformation as IoT and other technologies become ubiquitous. A key consideration, however, is ensuring that these new “smart spaces” are developed with the needs of the citizen at the forefront. My work focused on how the university campus could benefit from this consideration.

Starting point — engaging citizens

To understand what citizens needed from a smart campus, I conducted interviews and elicited their thoughts on topics such as privacy, smart cities and smart campuses, and on what needs they had on campus.

One problem that users faced was knowing whether a space on campus was suitable for study. One interviewee noted:

“How do I know if the library is busy or quiet before I walk there?”

The data

As part of my research, I was given access to a 20 GB .json data file containing details about Wi-Fi connections on campus. This data would allow me to understand how people move around the campus and, in particular, when they typically go to the library.

My first challenge was to “wrangle” this into a suitable file format. Having this data in .csv format would make it easier to visualise and explore.

The key features I needed to extract were the time of day, the longitude and latitude as well as other features such as building and user type (student, staff, etc).
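A minimal sketch of this wrangling step, assuming the export is newline-delimited JSON (one record per line), which lets a file of this size be streamed rather than loaded into memory at once. The field names below are hypothetical, since the real schema isn't shown here, and a single large JSON array would need an incremental parser instead.

```python
import csv
import json

# Hypothetical field names; the real export's schema may differ.
FIELDS = ["timestamp", "latitude", "longitude", "building", "user_type"]


def json_lines_to_csv(src, dst, fields=FIELDS):
    """Stream records from a JSON Lines file object into a CSV file object.

    Returns the number of records written.
    """
    writer = csv.DictWriter(dst, fieldnames=fields)
    writer.writeheader()
    written = 0
    for line in src:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep only the wanted fields; missing keys become empty cells.
        writer.writerow({name: record.get(name, "") for name in fields})
        written += 1
    return written
```

In use, `src` and `dst` would be two open files, for example `open("wifi.json")` and `open("wifi.csv", "w", newline="")` (both file names are placeholders).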

Visualising movement on campus

After wrangling the data, I now had a .csv file that was ready to explore. My first port of call was to throw this file into Tableau. I like using Tableau to get a quick visual overview of the data. The key question I wanted to address was how the number of Wi-Fi connections changes across the day.

Across all the days, and across all the buildings, there was a clear trend. Wi-Fi connections would increase and peak at midday before decreasing towards the end of the day. This is in line with what we might anecdotally expect. Morning classes aren’t popular with students!
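The same hourly view can be approximated outside Tableau. The sketch below counts connections per hour of day, assuming each row carries an ISO 8601 timestamp in a column called `timestamp` (a hypothetical name).

```python
from collections import Counter
from datetime import datetime


def connections_per_hour(rows, time_field="timestamp"):
    """Count rows per hour of day (0-23) from an iterable of dict rows."""
    counts = Counter()
    for row in rows:
        # Assumes ISO 8601 timestamps such as "2019-03-04T12:30:00".
        hour = datetime.fromisoformat(row[time_field]).hour
        counts[hour] += 1
    return dict(sorted(counts.items()))
```

The rows could come straight from `csv.DictReader` over the wrangled file, so nothing needs to be held in memory beyond the 24 counters.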

Building a Machine Learning Model

With this data in hand, the next stage was to implement a machine learning model to classify whether a library would be busy or quiet at a specific time.

From my observations in Tableau, this problem looked suitable for linear regression. This involves fitting a polynomial function that captures the underlying relationship between the time of day and the number of Wi-Fi connections.

To build this machine learning model, I used the Python libraries SciPy and SymPy to build linear regression equations of different polynomial degrees (1, 2, 3 and 4).

For each day, each library and each polynomial degree, I calculated the r² value. This value represents how well the fitted curve matches the data. For example, an r² value of 1.0 would mean the regression equation fits the data perfectly.
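To make the fitting step concrete, here is a small sketch that fits polynomials of degree 1 to 4 and reports r² for each. It uses NumPy's `polyfit` for brevity rather than the SciPy and SymPy code from the project, and the hourly counts are made up purely for illustration.

```python
import numpy as np


def fit_polynomial(hours, counts, degree):
    """Least-squares polynomial fit; returns (coefficients, r_squared)."""
    hours = np.asarray(hours, dtype=float)
    counts = np.asarray(counts, dtype=float)
    coeffs = np.polyfit(hours, counts, degree)  # highest power first
    predicted = np.polyval(coeffs, hours)
    ss_res = np.sum((counts - predicted) ** 2)      # residual sum of squares
    ss_tot = np.sum((counts - counts.mean()) ** 2)  # total sum of squares
    return coeffs, 1.0 - ss_res / ss_tot


# Synthetic, made-up counts that peak around midday (not the real data).
hours = np.arange(8, 19)
counts = np.array([60, 110, 160, 190, 210, 205, 195, 170, 130, 95, 50])

for degree in (1, 2, 3, 4):
    _, r2 = fit_polynomial(hours, counts, degree)
    print(f"degree {degree}: r-squared = {r2:.3f}")
```

On the data a model was fitted to, r² can only stay the same or rise as the degree increases, which is why the highest value alone is not a good reason to pick a model.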

Model Selection

The larger goal of a machine learning model is to capture the underlying relationship in the data without overfitting it. This concept is best illustrated below:

Based on this principle, I selected the polynomial regression equation of degree 2 as it had the best compromise between overfitting the data (a very high r² value) and underfitting the data (very low r² value).

Querying the model

The next step would be to query the model and inform the user if the library is getting busier or quieter.

Essentially, we need to know the rate of change at any given time. Calculating the derivative of the fitted equation gives us this result. If the derivative is positive, the library is getting busier. If it is close to zero, the library is near the peak of its occupancy. If it is negative, the library is getting quieter.
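The rule above can be sketched in a few lines of plain Python. The coefficients are made up, and the `tolerance` used to decide what counts as "close to zero" is a hypothetical threshold that would need tuning against real data.

```python
def derivative(coeffs):
    """Differentiate a polynomial given as coefficients, highest power first."""
    n = len(coeffs) - 1  # polynomial degree
    return [c * (n - i) for i, c in enumerate(coeffs[:-1])]


def evaluate(coeffs, t):
    """Evaluate the polynomial at t using Horner's method."""
    result = 0.0
    for c in coeffs:
        result = result * t + c
    return result


def trend(coeffs, hour, tolerance=1.0):
    """Classify occupancy at `hour` from the slope of the fitted curve.

    `tolerance` (connections per hour) is a hypothetical threshold for
    treating the slope as "close to zero".
    """
    slope = evaluate(derivative(coeffs), hour)
    if abs(slope) <= tolerance:
        return "near peak"
    return "getting busier" if slope > 0 else "getting quieter"


# Made-up degree-2 model: counts = -5*t**2 + 130*t - 600, peaking at t = 13.
model = [-5.0, 130.0, -600.0]
```

For this made-up model the derivative is -10t + 130, so `trend(model, 10)` sees a slope of 30 and reports that the library is getting busier, while `trend(model, 16)` sees a slope of -30 and reports that it is getting quieter.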

Deploying the model

For the model to be accessible, it needed to be deployed on a server. My first choice was the Flask framework because it is easy to set up and get running straight away.

The code for this can be seen below.
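As a rough idea of what such a service involves, here is a minimal Flask sketch. It is an illustration, not the project's actual code: the `/trend` route, the `hour` query parameter, the tolerance and the coefficients are all assumptions.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Made-up degree-2 coefficients (highest power first), standing in for a fitted model.
MODEL = [-5.0, 130.0, -600.0]


def slope_at(coeffs, hour):
    """Slope of a*t**2 + b*t + c at `hour`, i.e. 2*a*t + b."""
    a, b, _ = coeffs
    return 2 * a * hour + b


@app.route("/trend")
def trend():
    # Hypothetical query parameter, e.g. /trend?hour=14.5
    hour = request.args.get("hour", type=float)
    if hour is None:
        return jsonify(error="hour query parameter is required"), 400
    slope = slope_at(MODEL, hour)
    if abs(slope) <= 1.0:  # arbitrary "close to zero" tolerance
        status = "near peak"
    elif slope > 0:
        status = "getting busier"
    else:
        status = "getting quieter"
    return jsonify(hour=hour, slope=slope, status=status)


if __name__ == "__main__":
    app.run()  # development server only
```

A client would then request something like `/trend?hour=16` and read the `status` field from the JSON response.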

Building the application

Now that the model was deployed, I needed some way for users of a smart campus to interact with it. The interviews showed that users had universal access to smartphones, so building an app seemed the most obvious choice.

To develop the app, I chose Flutter, as it allows you to build for both iOS and Android.

The full repo for this can be found here:


There were a number of ways that this model could have been improved.

Using historical data to train a machine learning model that predicts whether a library is busy or quiet has a number of challenges. For one, it doesn’t account for random events that may disrupt the number of people who attend campus. For example, during the Covid-19 pandemic, the campus has been much quieter. This model doesn’t account for that.

Another challenge is that Wi-Fi connection data isn’t a perfect 1-to-1 measurement of occupancy. For example, students in the Engineering Library may use multiple devices to study, such as a tablet, a phone and a laptop. These 3 devices would register 3 different Wi-Fi connections for a single person.

To improve this model, some form of live data streaming would be the ideal way of determining whether a library space is full or not.

I’m a Software Engineering grad interested in Data Engineering opportunities.