EveThan/IBM-Applied-Data-Science-Capstone-Project


IBM Applied Data Science Capstone Project

The PowerPoint slides for this project can be found in Capstone_Presentation.pptx or Capstone_Presentation.pdf.

Executive summary

In this capstone project, we will predict if the SpaceX Falcon 9 first stage will land successfully using several machine learning classification algorithms. The main steps in this project include:

  • Data collection, wrangling, and formatting
  • Exploratory data analysis
  • Interactive data visualization
  • Machine learning prediction

Our graphs show that some features of the rocket launches are correlated with the outcome of the launches, i.e., success or failure. We also conclude that a decision tree may be the best machine learning algorithm for predicting whether the Falcon 9 first stage will land successfully.

Introduction

In this capstone, we will predict if the Falcon 9 first stage will land successfully. SpaceX advertises Falcon 9 rocket launches on its website at a cost of 62 million dollars, whereas other providers cost upward of 165 million dollars each; much of the savings comes from SpaceX being able to reuse the first stage. Therefore, if we can determine whether the first stage will land, we can estimate the cost of a launch. This information can be used if an alternate company wants to bid against SpaceX for a rocket launch.

Most unsuccessful landings are planned. Sometimes, SpaceX will perform a controlled landing in the ocean. The main question that we are trying to answer is: for a given set of features about a Falcon 9 rocket launch, which include its payload mass, orbit type, launch site, and so on, will the first stage of the rocket land successfully?

Methodology

The overall methodology includes:

  • Data collection, wrangling, and formatting, using:
      • SpaceX REST API
      • Web scraping
  • Exploratory data analysis (EDA), using:
      • Pandas and NumPy
      • SQL
  • Data visualization, using:
      • Matplotlib and Seaborn
      • Folium
      • Dash
  • Machine learning prediction, using:
      • Logistic regression
      • Support vector machine (SVM)
      • Decision tree
      • K-nearest neighbors (KNN)

Data collection using SpaceX API

1_Data Collection API.ipynb

Libraries or modules used: requests, pandas, numpy, datetime

  • The data is collected from the public SpaceX REST API.
  • The API provides data about many types of rocket launches done by SpaceX, so the data is filtered to include only Falcon 9 launches.
  • The API is accessed using requests.get().
  • The JSON result is converted to a dataframe using the json_normalize() function from pandas (see the sketch after this list).
  • Every missing value in the data is replaced with the mean of the column that the missing value belongs to.
  • We end up with 90 rows or instances and 17 columns or features.
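A rough illustration of these steps is shown below; the endpoint URL and the booster-version column name are assumptions for illustration, not copied from the notebook.

```python
# Rough sketch of the API collection step; endpoint and column names are assumptions.
import requests
import pandas as pd

SPACEX_API = "https://api.spacexdata.com/v4/launches/past"   # assumed public SpaceX API endpoint

launches = pd.json_normalize(requests.get(SPACEX_API).json())

# The notebook derives a booster-version column from the rocket id and keeps
# only Falcon 9 launches; the column name used here is an assumption.
if "BoosterVersion" in launches.columns:
    launches = launches[launches["BoosterVersion"] == "Falcon 9"].copy()

# Replace missing numeric values with the mean of their column
numeric_cols = launches.select_dtypes("number").columns
launches[numeric_cols] = launches[numeric_cols].fillna(launches[numeric_cols].mean())
```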

Data Collection with Web Scraping

2_Data Collection with Web Scraping.ipynb

Libraries or modules used: sys, requests, BeautifulSoup from bs4, re, unicodedata, pandas

  • The data is scraped from List of Falcon 9 and Falcon Heavy launches.
  • The website contains only the data about Falcon 9 launches.
  • First, the Falcon 9 launch Wiki page is requested from its URL, and a BeautifulSoup object is created from the response of requests.get() (sketched after this list).
  • Next, all column/variable names are extracted from the HTML table header by using the find_all() function from BeautifulSoup.
  • A dataframe is then created with the extracted column names and entries filled with launch records extracted from table rows.
  • We end up with 121 rows or instances and 11 columns or features.
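A rough sketch of this scraping flow, under simplified assumptions (the real notebook does considerably more clean-up of headers and rows):

```python
# Simplified sketch of the scraping flow; table selection and row handling are assumptions.
import requests
import pandas as pd
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches"
soup = BeautifulSoup(requests.get(URL).text, "html.parser")

# Column names come from the header cells of the first launch table
table = soup.find_all("table", class_="wikitable")[0]
columns = [th.get_text(strip=True) for th in table.find_all("th")]

# Launch records come from the table rows
rows = []
for tr in table.find_all("tr")[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

df = pd.DataFrame(rows)   # aligning rows with the header columns needs more care in practice
print(columns[:5], df.shape)
```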

EDA with Pandas and Numpy

3_EDA.ipynb

Libraries or modules used: pandas, numpy

Functions from the Pandas and NumPy libraries such as value_counts() are used to derive basic information about the data collected, which includes:

  • The number of launches on each launch site
  • The number of occurrences of each orbit type
  • The number of occurrences of each mission outcome (a brief sketch follows this list)
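A minimal sketch of this kind of summary, using a toy dataframe; the column names (LaunchSite, Orbit, Outcome) are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the collected launch data; column names are assumptions.
df = pd.DataFrame({
    "LaunchSite": ["CCAFS SLC 40", "KSC LC 39A", "CCAFS SLC 40"],
    "Orbit":      ["LEO", "GTO", "ISS"],
    "Outcome":    ["True ASDS", "False Ocean", "True RTLS"],
})

print(df["LaunchSite"].value_counts())   # number of launches per launch site
print(df["Orbit"].value_counts())        # occurrences of each orbit type
print(df["Outcome"].value_counts())      # occurrences of each mission/landing outcome
```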

EDA with SQL

4_EDA with SQL.ipynb

Framework used: IBM DB2

Libraries or modules used: ibm_db

The data is queried using SQL to answer several questions about the data such as:

  • The names of the unique launch sites in the space mission
  • The total payload mass carried by boosters launched by NASA (CRS)
  • The average payload mass carried by booster version F9 v1.1

The SQL statements or functions used include SELECT, DISTINCT, AS, FROM, WHERE, LIMIT, LIKE, SUM(), AVG(), MIN(), BETWEEN, COUNT(), and YEAR().
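The notebook issues these queries against IBM DB2 through ibm_db. Purely to illustrate the flavour of the SQL involved, here is a sketch that uses sqlite3 as a stand-in; the table name (SPACEXTBL) and column names are assumptions.

```python
# sqlite3 stands in for IBM DB2 here; table and column names are assumptions.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE SPACEXTBL (
    Launch_Site TEXT, Customer TEXT, Booster_Version TEXT, PAYLOAD_MASS__KG_ REAL)""")
con.executemany("INSERT INTO SPACEXTBL VALUES (?, ?, ?, ?)",
                [("CCAFS LC-40", "NASA (CRS)", "F9 v1.1", 2296),
                 ("KSC LC-39A", "SES", "F9 FT", 5300)])

# Names of the unique launch sites
print(con.execute("SELECT DISTINCT Launch_Site FROM SPACEXTBL").fetchall())

# Total payload mass carried by boosters launched for NASA (CRS)
print(con.execute("SELECT SUM(PAYLOAD_MASS__KG_) FROM SPACEXTBL "
                  "WHERE Customer LIKE 'NASA (CRS)'").fetchall())

# Average payload mass carried by booster version F9 v1.1
print(con.execute("SELECT AVG(PAYLOAD_MASS__KG_) FROM SPACEXTBL "
                  "WHERE Booster_Version = 'F9 v1.1'").fetchall())
```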

Data Visualization using Matplotlib and Seaborn

5_EDA Visualization.ipynb

Libraries or modules used: pandas, numpy, matplotlib.pyplot, seaborn

Functions from the Matplotlib and Seaborn libraries are used to visualize the data through scatterplots, bar charts, and line charts. The plots and charts are used to understand more about the relationships between several features, such as:

  • The relationship between flight number and launch site
  • The relationship between payload mass and launch site
  • The relationship between success rate and orbit type

Examples of functions from seaborn that are used here are scatterplot(), barplot(), catplot(), and lineplot().
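A minimal sketch of these plots, using toy stand-in data with assumed column names:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy stand-in for the launch data; column names are assumptions.
df = pd.DataFrame({
    "FlightNumber": [1, 2, 3, 4, 5, 6],
    "PayloadMass":  [500, 3000, 4500, 6000, 9600, 15600],
    "LaunchSite":   ["CCAFS SLC 40", "VAFB SLC 4E", "KSC LC 39A"] * 2,
    "Class":        [0, 0, 1, 1, 0, 1],   # 1 = first stage landed successfully
})

# Flight number vs. launch site, coloured by landing outcome
sns.scatterplot(data=df, x="FlightNumber", y="LaunchSite", hue="Class")
plt.show()

# Payload mass vs. launch site
sns.catplot(data=df, x="PayloadMass", y="LaunchSite", hue="Class")
plt.show()
```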


Data Visualization using Folium

6_Interactive Visual Analytics with Folium lab.ipynb

Libraries or modules used: folium, wget, pandas, math

Functions from the Folium library are used to visualize the data through interactive maps. The Folium library is used to:

  • Mark all launch sites on a map
  • Mark the succeeded launches and failed launches for each site on the map
  • Mark the distances from a launch site to its proximities, such as the nearest city, railway, or highway

These are done using Folium functions such as add_child() and Folium plugins including MarkerCluster, MousePosition, and DivIcon.
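A minimal sketch of such a map; the coordinates and marker colours below are illustrative assumptions rather than values from the notebook.

```python
# Minimal Folium sketch; coordinates and colours are illustrative assumptions.
import folium
from folium.plugins import MarkerCluster

sites = {"CCAFS SLC 40": (28.563197, -80.576820),
         "KSC LC 39A":   (28.573255, -80.646895)}

site_map = folium.Map(location=[28.57, -80.6], zoom_start=9)
cluster = MarkerCluster()
site_map.add_child(cluster)

for name, (lat, lon) in sites.items():
    # In the notebook, one marker is added per launch, coloured green for a
    # successful landing and red for a failed one.
    cluster.add_child(folium.Marker([lat, lon], popup=name,
                                    icon=folium.Icon(color="green")))

site_map.save("launch_sites.html")   # open in a browser to view the interactive map
```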


Data Visualization using Dash

7_spacex_dash_app.py

Libraries or modules used: pandas, dash, dash_html_components, dash_core_components, Input and Output from dash.dependencies, plotly.express

Functions from Dash are used to generate an interactive site where we can toggle the input using a dropdown menu and a range slider. Using a pie chart and a scatterplot, the interactive site shows:

  • The total number of successful launches from each launch site
  • The correlation between payload mass and mission outcome (success or failure) for each launch site

The application is launched on a terminal on the IBM Skills Network website.
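A minimal sketch of such a Dash app, assuming a recent Dash release (dcc and html are bundled in the dash package) and illustrative column names; the real app also wires a payload range slider to a scatterplot.

```python
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html, Input, Output

# Toy stand-in for the launch data; 1 in "Class" means a successful landing.
df = pd.DataFrame({
    "LaunchSite": ["CCAFS SLC 40", "KSC LC 39A", "CCAFS SLC 40", "KSC LC 39A"],
    "Class":      [1, 1, 0, 1],
})

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(id="site-dropdown",
                 options=[{"label": s, "value": s} for s in df["LaunchSite"].unique()],
                 value=df["LaunchSite"].unique()[0]),
    dcc.Graph(id="success-pie"),
])

@app.callback(Output("success-pie", "figure"), Input("site-dropdown", "value"))
def update_pie(site):
    # Pie chart of successful vs. failed landings for the selected launch site
    counts = (df[df["LaunchSite"] == site]["Class"]
              .value_counts().rename_axis("Outcome").reset_index(name="count"))
    return px.pie(counts, names="Outcome", values="count",
                  title=f"Landing outcomes for {site}")

if __name__ == "__main__":
    app.run(debug=True)
```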


Machine Learning Prediction

8_Machine Learning Prediction.ipynb

Libraries or modules used: pandas, numpy, matplotlib.pyplot, seaborn, sklearn

Functions from the Scikit-learn library are used to create our machine learning models. The machine learning prediction phase includes the following steps; a brief code sketch follows the list:

  • Standardizing the data using the preprocessing.StandardScaler() function from sklearn
  • Splitting the data into training and test sets using the train_test_split function from sklearn.model_selection
  • Creating machine learning models, which include:
      • Logistic regression using LogisticRegression from sklearn.linear_model
      • Support vector machine (SVM) using SVC from sklearn.svm
      • Decision tree using DecisionTreeClassifier from sklearn.tree
      • K-nearest neighbors (KNN) using KNeighborsClassifier from sklearn.neighbors
  • Fitting the models on the training set
  • Finding the best combination of hyperparameters for each model using GridSearchCV from sklearn.model_selection
  • Evaluating the models based on their accuracy scores and confusion matrices, using the score() method and confusion_matrix from sklearn.metrics
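A hedged sketch of the tuning and evaluation steps for one of the models (decision tree), using toy data in place of the real launch features built in the notebook:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(2)
X = rng.normal(size=(90, 5))          # stand-in for the 90 launches and their features
y = rng.integers(0, 2, size=90)       # stand-in for the landing outcome (1 = landed)

X = StandardScaler().fit_transform(X)                     # standardize the features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

# Grid-search the hyperparameters with 10-fold cross-validation
params = {"criterion": ["gini", "entropy"], "max_depth": [2, 4, 6, 8]}
search = GridSearchCV(DecisionTreeClassifier(random_state=2), params, cv=10)
search.fit(X_train, y_train)

print("GridSearchCV best score:", search.best_score_)
print("Test accuracy:", search.score(X_test, y_test))
print(confusion_matrix(y_test, search.predict(X_test)))
```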

Putting the results of all 4 models side by side, we can see that they all share the same accuracy score and confusion matrix when tested on the test set. Therefore, their GridSearchCV best scores are used to rank them instead. Based on the GridSearchCV best scores, the models are ranked in the following order with the first being the best and the last one being the worst:

  • Decision tree (GridSearchCV best score: 0.8892857142857142)
  • K nearest neighbors, KNN (GridSearchCV best score: 0.8482142857142858)
  • Support vector machine, SVM (GridSearchCV best score: 0.8482142857142856)
  • Logistic regression (GridSearchCV best score: 0.8464285714285713)


From the data visualization section, we can see that some features may correlate with the mission outcome in several ways. For example, with heavy payloads, the successful landing rate is higher for the Polar, LEO, and ISS orbit types. For GTO, however, we cannot distinguish this well, as both successful and unsuccessful landings are present across payload masses.

Therefore, each feature may have a certain impact on the final mission outcome. The exact way in which each of these features affects the mission outcome is difficult to decipher. However, we can use machine learning algorithms to learn the patterns in past data and predict whether a mission will be successful based on the given features.

In this project, we try to predict if the first stage of a given Falcon 9 launch will land in order to determine the cost of a launch. Each feature of a Falcon 9 launch, such as its payload mass or orbit type, may affect the mission outcome in a certain way.

Several machine learning algorithms are employed to learn the patterns in past Falcon 9 launch data and produce predictive models that can be used to predict the outcome of a Falcon 9 launch. The predictive model produced by the decision tree algorithm performed the best among the 4 machine learning algorithms employed.

~ Project created in January 2022 ~


Capstone Projects

M.S. in Data Science students are required to complete a capstone project. Capstone projects challenge students to acquire and analyze data to solve real-world problems. Project teams consist of two to four students and a faculty advisor. Teams select their capstone project at the beginning of the year and work on the project over the course of two semesters. 

Most projects are sponsored by an organization—academic, commercial, non-profit, and government—seeking valuable recommendations to address strategic and operational issues. Depending on the needs of the sponsor, teams may develop web-based applications that can support ongoing decision-making. The capstone project concludes with a paper and presentation.

Key takeaways:

  • Synthesizing the concepts you have learned throughout the program in various courses (this requires that the question posed by the project be complex enough to require the application of appropriate analytical approaches learned in the program and that the available data be of sufficient size to qualify as ‘big’)
  • Experience working with ‘raw’ data exposing you to the data pipeline process you are likely to encounter in the ‘real world’  
  • Demonstrating oral and written communication skills through a formal paper and presentation of project outcomes  
  • Acquisition of team building skills on a long-term, complex, data science project 
  • Addressing an actual client’s need by building a data product that can be shared with the client

Capstone projects have been sponsored by a variety of organizations and industries, including: Capital One, City of Charlottesville, Deloitte Consulting LLP, Metropolitan Museum of Art, MITRE Corporation, a multinational banking firm, The Public Library of Science, S&P Global Market Intelligence, UVA Brain Institute, UVA Center for Diabetes Technology, UVA Health System, U.S. Army Research Laboratory, Virginia Department of Health, Virginia Department of Motor Vehicles, Virginia Office of the Governor, Wikipedia, and more. 

Sponsor a Capstone Project  

View previous examples of capstone projects  and check out answers to frequently asked questions. 

What does the process look like?

  • The School of Data Science periodically puts out a Call for Proposals. Prospective project sponsors submit official proposals, vetted by the Associate Director for Research Development, Capstone Director, and faculty.
  • Sponsors present their projects to students at “Pitch Day” near the start of the Fall term, where students have the opportunity to ask questions.
  • Students individually rank their top project choices. An algorithm sorts students into capstone groups of approximately 3 to 4 students per group.
  • Adjustments are made by hand as necessary to finalize groups.
  • Each group is assigned a faculty mentor, who will meet groups each week in a seminar-style format.

What is the seminar approach to mentoring capstones?

We utilize a seminar approach to managing capstones to provide faculty mentorship and streamlined logistics. This approach involves one mentor supervising three to four loosely related projects and meeting with these groups on a regular basis. Project teams often encounter similar roadblocks and issues so meeting together to share information and report on progress toward key milestones is highly beneficial.

Do all capstone projects have corporate sponsors?

Not necessarily. Generally, each group works with a sponsor from outside the School of Data Science. Some sponsors are corporations, some are from nonprofit and governmental organizations, and some are from other departments at UVA.

One of the challenges we continue to encounter when curating capstone projects with external sponsors is appropriately scoping and defining a question that is of sufficient depth for our students, obtaining data of sufficient size, obtaining access to the data in sufficient time for adequate analysis to be performed and navigating a myriad of legal issues (including conflicts of interest). While we continue to strive to use sponsored projects and work to solve these issues, we also look for ways to leverage openly available data to solve interesting societal problems which allow students to apply the skills learned throughout the program. While not all capstones have sponsors, all capstones have clients. That is, the work is being done for someone who cares and has investment in the outcome. 

Why do we have to work in groups?

Because data science is a team sport!

All capstone projects are completed by group work. While this requires additional coordination, this collaborative component of the program reflects the way companies expect their employees to work. Building this skill is one of our core learning objectives for the program.

I didn’t get my first choice of capstone project from the algorithm matching. What can I do?

Remember that the point of the capstone projects isn’t the subject matter; it’s the data science. Professional data scientists may find themselves in positions in which they work on topics assigned to them, but they use methods they enjoy and still learn much through the process. That said, there are many ways to tackle a subject, and we are more than happy to work with you to find an approach to the work that most aligns with your interests.

Your ability to influence which project you work on is in the ranking process after “pitch day” and in encouraging your company or department to submit a proposal during the Call for Proposal process. At a minimum it takes several months to work with a sponsor to adequately scope a project, confirm access to the data and put the appropriate legal agreements into place. Before you ever see a project presented on pitch day, a lot of work has taken place to get it to that point!

Can I work on a project for my current employer?

Each spring, we put forward a public call for capstone projects. You are encouraged to share this call widely with your community, including your employer, non-profit organizations, or any entity that might have a big data problem that we can help solve. As a reminder, capstone projects are group projects so the project would require sufficient student interest after ‘pitch day’. In addition, you (the student) cannot serve as the project sponsor (someone else within your employer organization must serve in that capacity).

If my project doesn’t have a corporate sponsor, am I losing out on a career opportunity?

The capstone project will provide you with the opportunity to do relevant, high-quality work which can be included on a resume and discussed during job interviews. The project paper and your code on Github will provide more career opportunities than the sponsor of the project. Although it does happen from time to time, it is rare that capstones lead to a direct job offer with the capstone sponsor's company. Capstone projects are just one networking opportunity available to you in the program.

Capstone Project Reflections From Alumni  

Theo Braimoh, MSDS Online Graduate and Admissions Student Ambassador

"For my Capstone project, I used Python to train machine learning models for visual analysis – also known as computer vision. Computer vision helped my Capstone team analyze the ergonomic posture of workers at risk of developing musculoskeletal injuries. We automated the process, and hope our work further protects the health and safety of people working in the United States.” — Theophilus Braimoh, MSDS Online Program 2023, Admissions Student Ambassador

Haley Egan, MSDS Online 2023 and Admissions Student Ambassador

“My Capstone experience with the ALMA Observatory and NRAO was a pivotal chapter in my UVA Master’s in Data Science journey. It fostered profound growth in my data science expertise and instilled a confidence that I'm ready to make meaningful contributions in the professional realm.” — Haley Egan, MSDS Online Program 2023, Admissions Student Ambassador

Mina Kim, MSDS/PhD 2023

“Our Capstone projects gave us the opportunity to gain new domain knowledge and answer big data questions beyond the classroom setting.” — Mina Kim, MSDS Residential Program 2023, Ph.D. in Psychology Candidate

Capstone Project Reflections From Sponsors  

“For us, the level of expertise, and special expertise, of the capstone students gives us ‘extra legs’ and an extra push to move a project forward. The team was asked to provide a replicable prototype air quality sensor that connected to the Cville Things Network, a free and community supported IoT network in Charlottesville. Their final product was a fantastic example that included clear circuit diagrams for replication by citizen scientists.” — Lucas Ames, Founder, Smart Cville
“Working with students on an exploratory project allowed us to focus on the data part of the problem rather than the business part, while testing with little risk. If our hypothesis falls flat, we gain valuable information; if it is validated or exceeded, we gain valuable information and are a few steps closer to a new product offering than when we started.” — Ellen Loeshelle, Senior Director of Product Management, Clarabridge


Data Science Capstone Project: Milestone Report

Alexey Serdyuk

Table of contents

  • Prerequisites
  • Obtaining the data
  • Splitting the data
  • First glance on the data and general plan
  • Cleaning up and preprocessing the corpus
  • Analyzing words (1-grams)
  • Analyzing bigrams
  • Pruning bigrams
  • 3-grams to 6-grams
  • Conclusions and next steps

This is a milestone report for Week 2 of the capstone project for the cycle of courses Data Science Specialization offered on Coursera by Johns Hopkins University.

The purpose of the capstone project is to build a Natural Language Processing (NLP) application that, given a chunk of text, predicts the next most probable word. The application may be used, for example, on mobile devices to provide suggestions as the user types in some text.

In this report we provide an initial analysis of the data, as well as discuss our approach to building the application.

An important question is which library to use for processing and analyzing the corpora, as R provides several alternatives. Initially we attempted to use the library tm, but quickly found that it is very memory-hungry, and an attempt to build bi- or trigrams for a large corpus is not practical. After some googling we decided to use the library quanteda instead.

We start by loading required libraries.

To speed up processing of large data sets, we will apply the parallel version of the lapply function from the parallel library. To use all available resources, we detect the number of CPU cores and configure the library to use them all.
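As a rough Python analogue of this set-up (the report itself uses R's parallel package), per-chunk work can be distributed across all CPU cores like this:

```python
# Illustrative analogue of the report's parallel set-up, not the report's R code.
import os
from multiprocessing import Pool

def process_chunk(lines):
    return len(lines)   # placeholder for the real per-chunk work

if __name__ == "__main__":
    chunks = [["a", "b"], ["c"], ["d", "e", "f"]]
    with Pool(processes=os.cpu_count()) as pool:   # use every available CPU core
        print(pool.map(process_chunk, chunks))
```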

Here and at some points later we use caching to speed up rendering of this document. Results of long-running operations are stored and reused during the next run. If you wish to re-run all operations, just remove the cache directory.

We download the data from the URL provided in the course description, and unzip it.

The downloaded zip file contains corpora in several languages: English, German, Russian and Finnish. In our project we will use only English corpora.

The corpora in each language, including English, contain 3 files with content obtained from different sources: news, blogs, and Twitter.

As the first step, we will split each relevant file into 3 parts:

  • Training set (60%) will be used to build and train the algorithm.
  • Testing set (20%) will be used to test the algorithm during its development. This set may be used more than once.
  • Validation set (20%) will be used for a final validation and estimation of out-of-sample performance. This set will be used only once.

We define a function which splits the specified file into the parts described above:
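The report's R implementation is not reproduced here; purely as an illustrative sketch of the same 60/20/20 split in Python (the file path and seed are placeholders):

```python
# Illustrative Python equivalent of the report's R split function.
import random

def split_file(path, train_frac=0.6, test_frac=0.2, seed=42):
    """Split the lines of `path` into training/testing/validation files."""
    random.seed(seed)   # placeholder seed for reproducibility
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()

    buckets = {"training": [], "testing": [], "validation": []}
    for line in lines:
        r = random.random()
        if r < train_frac:
            buckets["training"].append(line)
        elif r < train_frac + test_frac:
            buckets["testing"].append(line)
        else:
            buckets["validation"].append(line)

    for name, subset in buckets.items():
        with open(f"{path}.{name}", "w", encoding="utf-8") as out:
            out.writelines(subset)
```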

To make the results reproducible, we set the seed of the random number generator.

Finally, we split each of the data files.

As a sanity check, we count the number of lines in each source file, as well as in the partial files produced by the split.

                              Blogs                  News                  Twitter
                              Rows        %          Rows        %         Rows         %
Training                      539572   59.99991      606145   59.99998     1416088   59.99997
Testing                       179858   20.00004      202048   19.99996      472030   20.00002
Validation                    179858   20.00004      202049   20.00006      472030   20.00002
Total                         899288  100.00000     1010242  100.00000     2360148  100.00000
Control (expected to be 0)         0      NA               0      NA             0      NA

As the table shows, we have split the data into subsets as intended.

In the section above we have already counted the number of lines. Let us load the training data sets and take a look at the first 3 lines of each data set.

We can see that the data contains not only words, but also numbers and punctuation. The punctuation may be non-ASCII (Unicode), as the first example in the blogs sample shows: it contains the character “…”, which is different from 3 ASCII point characters “...”. Some lines may contain multiple sentences, and we probably have to take this into account.

Here is our plan:

  • Split the text into sentences.
  • Clean up the corpus: remove non-language parts such as e-mail addresses and URLs, etc.
  • Preprocess the corpus: remove punctuation and numbers, change all words to lower-case.
  • Analyze the distribution of words to decide if we should base our prediction on the full dictionary, or just on some subset of it.
  • Analyze n-grams for small n.

We decided to split the text into sentences and not attempt to predict words across sentence borders. We may still use information about sentences to improve prediction of the first word, because the frequency of the first word in a sentence may be very different from its average frequency.

The libraries contain some functions for cleaning up and pre-processing, but for some steps we have to write the functions ourselves.

Now we pre-process the data.
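The report performs these steps with quanteda in R; a simplified Python illustration of the same clean-up ideas (the regular expressions below are rough assumptions) might look like this:

```python
import re

def preprocess(text):
    """Simplified clean-up: drop e-mail addresses and URLs, remove numbers and
    punctuation, and lower-case everything (the report does this in R/quanteda)."""
    text = re.sub(r"\S+@\S+", " ", text)                 # e-mail addresses
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"[0-9]+", " ", text)                  # numbers
    text = re.sub(r"[^\w\s']", " ", text)                # punctuation (keep apostrophes)
    return re.sub(r"\s+", " ", text).strip().lower()

print(preprocess("Check http://example.com… or mail me at a@b.com, it's 100% great!"))
```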

In this section we will study the distribution of words in the corpora, ignoring for the moment interactions between words (n-grams).

We define two helper functions. The first one creates a Document Feature Matrix (DFM) for n-grams in documents and aggregates it over all documents into a Feature Vector. The second helper function enriches the Feature Vector with additional values useful for our analysis, such as the cumulative coverage of the text.

Now we may calculate the frequency of words in each source, as well as in all sources together (aggregated).
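The report computes these frequencies with a quanteda document-feature matrix. As a rough Python illustration of the idea, the following sketch counts n-gram frequencies and the cumulative text coverage on toy sentences (it is not the report's code):

```python
from collections import Counter

def ngram_frequencies(sentences, n=1):
    """Count n-grams over whitespace-tokenized sentences, sorted by frequency."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common()

def coverage(freqs):
    """Cumulative fraction of the text covered by the k most frequent n-grams."""
    total = sum(c for _, c in freqs)
    covered, out = 0, []
    for gram, c in freqs:
        covered += c
        out.append((gram, c, covered / total))
    return out

sentences = ["the cat sat on the mat", "the dog sat on the log"]
print(coverage(ngram_frequencies(sentences, n=1))[:5])   # top words with coverage
print(coverage(ngram_frequencies(sentences, n=2))[:5])   # top bigrams with coverage
```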

The following chart displays 20 most-frequent words in each source, as well as in the aggregated corpora.

As we see from the chart, the top 20 most-frequent words differ between sources. For example, the most frequent word in news is “said”, but this word is not included in the top-20 list for blogs or Twitter at all. At the same time, some words are shared between the lists: the word “can” is the 2nd most frequent in blogs, the 3rd most frequent in Twitter, and the 5th in news.

Our next step is to analyze the intersection, that is, to find how many words are common to all sources and how many are unique to a particular source. Not only the number of words is important, but also the source coverage, that is, what percentage of the whole text of a particular source is covered by a particular subset of all words.

The following Venn diagram shows a number of unique words (stems) used in each source, as well as a percentage of the aggregated corpora covered by those words.

As we may see, 46686 words are shared by all 3 corpora, but those words cover 97.46% of the aggregated corpora. On the other hand, there are 83185 words unique to blogs, but these words appear very infrequently, covering just 0.43% of the aggregated corpora.

The Venn diagram indicates that we may get a high coverage of all corpora by choosing common words. Coverage by words specific to a particular corpus is negligible.

The next step in our analysis is to find out how many common words we should choose to achieve a decent coverage of the text. From the Venn diagram we already know that by choosing 46686 words we will cover 97.46% of the aggregated corpora, but we may be able to reduce the number of words without significantly reducing the coverage.

The following chart shows the number of unique words in each source which cover a particular percentage of the text. For example, the 1000 most-frequent words cover 68.09% of the Twitter corpus. An interesting observation is that Twitter requires fewer words to cover a particular percentage of the text, whereas news requires more words.

Corpora coverage        Blogs       News    Twitter   Aggregated
75%                     2,004      2,171      1,539        2,136
90%                     6,395      6,718      5,325        6,941
95%                    13,369     13,689     11,922       15,002
99%                    63,110     53,294     71,575       88,267
99.9%                 149,650    126,585    161,873      302,693

The table shows that in order to cover 95% of blogs, we require 13,369 words. The same coverage of news requires 13,689 words, and the same coverage of Twitter requires 11,922 words. To cover 95% of the aggregated corpora, we require 15,002 unique words. We may use this fact later to reduce the number of n-grams required for predictions.

In this section we will study the distribution of bigrams, that is, combinations of two words.

Using the previously defined functions, we may calculate the frequency of bigrams in each source, as well as in all sources together (aggregated).

The following chart displays 20 most-frequent bigrams in each source, as well as in the aggregated corpora.

We immediately see a difference from the lists of top 20 words: many more words were shared between sources than bigrams are. There are still some common bigrams, but the intersection is smaller.

Similar to how we proceeded with words, we will now analyze intersections, that is, we will find how many bigrams are common to all sources and how many are unique to a particular source. We also calculate the percentage of each source covered by a particular subset of all bigrams.

The following Venn diagram shows a number of unique bigrams used in each source, as well as a percentage of the aggregated corpora covered by those bigrams.
