British Airways Data Science Virtual Experience
I recently completed the British Airways Data Science Virtual Experience, where I had the opportunity to explore the world of data science. My responsibilities included web scraping customer reviews and building predictive models to understand customer buying behaviour. Throughout this experience, I learned how to gather and analyse data effectively, extracting valuable insights that can drive business decisions. I also developed my skills in Python, utilising libraries for data manipulation and machine learning. If you're interested in knowing how I accomplished these tasks and the insights I gained along the way, read on!

Web Scraping
First things first, what is web scraping? Web scraping is a powerful technique that allows us to extract data from websites automatically. It's a way to gather large datasets that would otherwise be time-consuming to compile manually. In this project, my goal was to scrape customer reviews to gain insights into how passengers perceive British Airways. It is important to understand these sentiments as they can help the airline improve its services and enhance customer satisfaction.
​
​
Setting Up the Scraping Process
​
To kick off the project, I utilised Python along with several key libraries: Requests for fetching web pages, BeautifulSoup for parsing and navigating web page content, and pandas for data manipulation. Here’s a brief overview of how I structured my web scraping process:
​
1. Importing Libraries: I began by importing the necessary libraries to set up my environment for web scraping and data analysis.
2. Defining the Target URL: I specified the URL for British Airways reviews on Skytrax. This would be my starting point for gathering data.
3. Creating Functions: I developed two primary functions:
- get_soup(): This function sends a request to the specified URL and returns a BeautifulSoup object, which allows me to navigate and extract HTML elements easily.
- get_reviews(): This function extracts relevant information from each review, including the title, rating, and body text.
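Here’s a sketch of how those two functions might look. The tag names and itemprop attributes are assumptions based on typical Skytrax review markup, so they may need adjusting against the live page:

```python
import requests
from bs4 import BeautifulSoup

def get_soup(url):
    """Fetch a page and return a BeautifulSoup object for navigating its HTML."""
    response = requests.get(url)
    response.raise_for_status()  # fail loudly if the page didn't load
    return BeautifulSoup(response.text, "html.parser")

def get_reviews(soup):
    """Pull the title, rating, and body text out of each review on a page."""
    reviews = []
    for article in soup.find_all("article", itemprop="review"):
        title = article.find("h2", class_="text_header")
        rating = article.find("span", itemprop="ratingValue")
        body = article.find("div", itemprop="reviewBody")
        reviews.append({
            "title": title.get_text(strip=True) if title else None,
            "rating": int(rating.get_text(strip=True)) if rating else None,
            "body": body.get_text(strip=True) if body else None,
        })
    return reviews

# Usage (fetches one page of reviews):
# reviews = get_reviews(get_soup("https://www.airlinequality.com/airline-reviews/british-airways/"))
```

Looping this over the paginated review URLs and appending each page’s results builds up the full dataset.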
​
​
Cleaning and Preparing Data
​
Once I had scraped the reviews, it was time to clean and prepare the data for analysis. The raw dataset was unstructured and required several pre-processing steps:
- Normalisation: I converted ratings into numerical values and standardised the text by converting everything to lowercase (just to keep things consistent).
- Tokenisation: I broke the text down into individual words, or tokens, for easier analysis.
- Stopword Removal: Common words that carry little meaning on their own (like "and" or "the") were removed to focus on more impactful terms.
- Lemmatisation: Words were reduced to their base forms (e.g., "flying" became "fly"), which helped consolidate similar terms.
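Pieced together, the pipeline looks roughly like this. In the actual project NLTK’s stopword corpus and WordNetLemmatizer did the heavy lifting; here a tiny stopword set and a trivial suffix rule stand in for them so the snippet runs without any corpus downloads:

```python
import re

# Tiny stand-in for NLTK's English stopword list
STOPWORDS = {"a", "an", "and", "i", "the", "to", "was", "with"}

def lemmatise(token):
    """Trivial stand-in for NLTK's WordNetLemmatizer: strips common suffixes."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = text.lower()                                  # normalisation
    tokens = re.findall(r"[a-z]+", text)                 # tokenisation
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [lemmatise(t) for t in tokens]                # lemmatisation

print(preprocess("Flying with BA was great and the food was tasty"))
# → ['fly', 'ba', 'great', 'food', 'tasty']
```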
​
​
Extracting Insights Through Analysis
​
With my cleaned dataset ready, it was time to see what those reviews actually meant. Here are some techniques I employed:
​
1. Topic Modelling: I used a technique called Latent Dirichlet Allocation (LDA) to group similar themes in the reviews. This helped me understand what aspects of British Airways stood out—be it their service quality, food offerings, or booking experience.
​
2. Sentiment Analysis: I then applied NLTK's VADER sentiment analyser, which scores a piece of text as positive, negative or neutral. This measure provided insights into overall customer satisfaction levels.
​
3. Word Cloud Visualisation: To visually represent frequently mentioned terms in the reviews, I created a word cloud. This figure highlighted what customers were talking about most often—offering an intuitive way to grasp key sentiments at a glance.
Predicting Customer Buying Behaviour
After gathering valuable insights from British Airways customer reviews through web scraping, I was ready to tackle the next challenge: predicting customer buying behaviour. For my predictive modelling task, I chose a Random Forest Classifier. You might wonder what that is: an ensemble learning method that, instead of relying on a single decision tree (which might make mistakes), combines the predictions of many different trees to reach a more accurate result.
​
1. Preparing the Data: I split my dataset into features (the information we use for predictions) and labels (the outcomes we want to predict). In this case, I wanted to predict whether a booking would be completed or not.
2. One-Hot Encoding: To handle categorical variables (like flight routes), I used one-hot encoding. This transforms categories into binary values, making them easier for the model to process.
3. Training and Validation Split: I divided my data into training and validation sets. The training set helps the model learn, while the validation set tests how well it performs on unseen data.
4. Fitting the Model: With everything set up, I trained my Random Forest model using the training data. It learned from past customer behaviours, identifying patterns that could indicate whether someone would complete a booking.
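Those four steps can be sketched end to end as follows. The column names (purchase_lead, route, and so on) mirror the kind of fields in the task’s dataset but are assumptions here, and the data itself is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Synthetic stand-in for the booking dataset
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "purchase_lead": rng.integers(1, 300, n),    # days booked in advance
    "flight_hour": rng.integers(0, 24, n),
    "length_of_stay": rng.integers(1, 30, n),
    "route": rng.choice(["LHR-JFK", "LHR-DEL", "LHR-SYD"], n),
})
# Outcome is wired to lead time here so the model has a pattern to learn
df["booking_complete"] = (df["purchase_lead"] < 100).astype(int)

# 2. One-hot encode the categorical `route` column
X = pd.get_dummies(df.drop(columns="booking_complete"))
y = df["booking_complete"]

# 3. Hold out a validation set of unseen data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Fit the Random Forest and check it on the held-out data
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print(f"Validation accuracy: {model.score(X_val, y_val):.2f}")
```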
​
Evaluating the Model
After training, it was time to see how well our model performed:
1. Making Predictions: I used the validation set to see if our model could accurately predict customer bookings.
​
2. Calculating Accuracy: The model achieved an accuracy of about 85%, meaning it predicted the correct outcome for roughly 85 out of every 100 bookings.
3. Assessing Feature Importance: One of the coolest parts was discovering which factors were most influential in predicting bookings. For example:
- Purchase Lead Time topped the list, showing that how far in advance someone books matters significantly.
- Other important factors included flight route, flight hour, and length of stay.
I created a bar chart to visualise these important features, making it easy to see what influences customer decisions at a glance.
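The ranking itself comes straight off the fitted model’s feature_importances_ attribute. A small self-contained sketch, again on synthetic data with the outcome deliberately driven by lead time so that feature should dominate:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in features
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "purchase_lead": rng.integers(1, 300, n),
    "flight_hour": rng.integers(0, 24, n),
    "length_of_stay": rng.integers(1, 30, n),
})
# Outcome depends only on lead time, so it should top the ranking
y = (X["purchase_lead"] < 100).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)

# Importances sum to 1 and rank how much each feature drove the splits
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
# importances.plot.bar() renders the ranking as a bar chart
```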
Conclusion
As I wrap up this project, I'm truly amazed by the capabilities of Python. This project, from web scraping to predictive modelling, has opened my eyes to the power and versatility of the language. Starting with web scraping, I was blown away by how Python, with libraries like BeautifulSoup and Requests, could effortlessly extract vast amounts of customer review data from the Skytrax website. It felt like having a superpower: being able to gather and process information at a scale that would be impossible manually. Then, transitioning to machine learning, Python continued to impress. I built a predictive model that could forecast customer booking behaviour with around 85% accuracy, and preprocessing data, engineering features, training models, and evaluating results all felt remarkably straightforward. It's exciting to think about how businesses can leverage these tools to make data-driven decisions and improve customer experiences.

You can find my Python code here.

