To request solutions to the exercises within the Case Studies, please complete this form and indicate in the space provided below which case(s) you would like to request. Solutions are provided to qualified instructors only, and all requests, including academic standing, will be verified before solutions are sent.
Medical Malpractice
Explore claim payment amounts for medical malpractice lawsuits and identify factors that appear to influence the amount of the payment using descriptive statistics and data visualizations.
Key words: Summary statistics, frequency distribution, histogram, box plot, bar chart, Pareto plot, and pie chart
Download the case study (PDF)
Download the data set
Baggage Complaints
Analyze and compare baggage complaints for three different airlines using descriptive statistics and time series plots. Explore differences between the airlines, whether complaints are getting better or worse over time, and if there are other factors, such as destinations, seasonal effects or the volume of travelers that might affect baggage performance.
Key words: Time series plots, summary statistics
Defect Sampling
Explore the effectiveness of different sampling plans in detecting changes in the occurrence of manufacturing defects.
Key words: Tabulation, histogram, summary statistics, and time series plots
Film on the Rocks
Use survey results from a summer movie series to answer questions regarding customer satisfaction, demographic profiles of patrons, and the use of media outlets in advertising.
Key words: Bar charts, frequency distribution, summary statistics, mosaic plot, contingency table (cross-tabulations), and chi-squared test
Improving Patient Satisfaction
Analyze patient complaint data at a medical clinic to identify the issues resulting in customer dissatisfaction and determine potential causes of decreased patient volume.
Key words: Frequency distribution, summary statistics, Pareto plot, tabulation, scatterplot, run chart, correlation
Download the data set 1
Download the data set 2
Price Quotes
Evaluate the price quoting processes of two different sales associates to determine whether there is inconsistency between them, and decide whether a new, more consistent pricing process should be developed.
Key words: Histograms, summary statistics, confidence interval for the mean, one sample t-Test
Treatment Facility
Determine what effect a reengineering effort had on the incidence of behavioral problems and turnover at a treatment facility for teenagers.
Key words: Summary statistics, time series plots, normal quantile plots, two sample t-Test, unequal variance test, Welch's test
Use data from a survey of students to perform exploratory data analysis and to evaluate the performance of different approaches to a statistical analysis.
Use the DASL Fish Prices data to investigate whether there is evidence that overfishing occurred from 1970 to 1980.
Key words: Histograms, normal quantile plots, log transformations, inverse transformation, paired t-test, Wilcoxon signed rank test
Subliminal Messages
Determine whether subliminal messages were effective in increasing math test scores, and if so, by how much.
Key words: Histograms, summary statistics, box plots, t-Test and pooled t-Test, normal quantile plot, Wilcoxon Rank Sums test, Cohen's d
Priority Assessment
Determine whether a software development project prioritization system was effective in speeding the time to completion for high priority jobs.
Key words: Summary statistics, histograms, normal quantile plot, ANOVA, pairwise comparison, unequal variance test, and Welch's test
Determine if a backgammon program has been upgraded by comparing the performance of a player against the computer across different time periods.
Key words: Histograms, confidence intervals, stacking data, one-way ANOVA, unequal variances test, one-sample t-Test, ANOVA table and calculations, F Distribution, F ratios
Per Capita Income
Use data from the World Factbook to explore wealth disparities between different regions of the world and identify those with the highest and lowest wealth.
Using outcomes for 10,000 flips of a coin, use descriptive statistics, confidence intervals and hypothesis tests to determine whether the coin is fair.
Key words: Bar charts, confidence intervals for proportions, hypothesis testing for proportions, likelihood ratio, simulating random data, scatterplot, fitting a regression line
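For readers who want a feel for the computation, here is a minimal sketch of the proportion test and confidence interval this case calls for, using invented counts rather than the case study's actual data:

```python
import math

# Hedged sketch: a z-test and Wald CI for coin fairness. The counts below
# are invented for illustration; they are not the case study's data.
n, heads = 10_000, 5_067
p_hat = heads / n

# 95% Wald confidence interval for the proportion of heads
se = math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Two-sided z-test of H0: p = 0.5 (normal approximation to the binomial)
z = (p_hat - 0.5) / math.sqrt(0.5 * 0.5 / n)
print(f"p_hat={p_hat:.4f}, 95% CI=({ci[0]:.4f}, {ci[1]:.4f}), z={z:.2f}")
# Here |z| < 1.96, so these invented flips give no evidence of unfairness.
```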
Lister and Germ Theory
Use results from an 1860s sterilization study to determine whether there is evidence that the sterilization process reduces deaths when amputations are performed.
Key words: Mosaic plots, contingency tables, Pearson and likelihood ratio tests, Fisher's exact test, two-sample proportions test, one- and two-sided tests, confidence interval for the difference, relative risk
Salk Vaccine
Using data from a 1950s study, determine whether the polio vaccine was effective in a cohort study and, if it was, quantify the degree of effectiveness.
Key words: Bar charts, two-sample proportions test, relative risk, two-sided Pearson and likelihood ratio tests, Fisher's exact test, and the Gamma measure of association
Smoking and Lung Cancer
Use the results of a retrospective study to determine if there is a positive association between smoking and lung cancer, and estimate the risk of lung cancer for smokers relative to non-smokers.
Key words: Mosaic plots, two-by-two contingency tables, odds ratios and confidence intervals, conditional probability, hypothesis tests for proportions (likelihood ratio, Pearson's, Fisher's Exact, two sample tests for proportions)
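As a rough illustration of the odds ratio and confidence interval named above — computed from an invented two-by-two table, not the study's data — the calculation looks like this:

```python
import math

# Hypothetical 2x2 table for a retrospective (case-control) study;
# all counts are invented for illustration.
#                 lung cancer   no lung cancer
# smokers             a=200          b=800
# non-smokers         c=50           d=950
a, b, c, d = 200, 800, 50, 950

odds_ratio = (a * d) / (b * c)

# 95% CI computed on the log-odds scale, then back-transformed
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
print(f"OR = {odds_ratio:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```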
Mendel's Laws of Inheritance
Use the data sets provided to explore Mendel’s Laws of Inheritance for dominant and recessive traits.
Key words: Bar charts, frequency distributions, goodness-of-fit tests, mosaic plot, hypothesis tests for proportions
Contributions
Predict year-end contributions in an employee fund-raising drive.
Key words: Summary statistics, time series plots, simple linear regression, predicted values, prediction intervals
Direct Mail
Evaluate different regression models to determine whether sales at a small retail shop are influenced by a direct mail campaign, and use the resulting models to predict sales based upon the amount of marketing.
Key words: Time series plots, simple linear regression, lagged variables, predicted values, prediction intervals
Cost Leadership
Assess the effectiveness of a cost leadership strategy in increasing market share, and assess the potential for additional gains in market share under the current strategy.
Archosaur: The Relationship Between Body Size and Brain Size
Analyze data on the brain and body weight of different dinosaur species to determine if a proposed statistical model performs well at describing the relationship and use the model to predict brain weight based on body weight.
Key words: Histogram and summary statistics, fitting a regression line, log transformations, residual plots, interpreting regression output and parameter estimates, inverse transformations
Cell Phone Service
Determine whether wind speed and barometric pressure are related to phone call performance (percentage of dropped or failed calls) and use the resulting model to predict the percentage of bad calls based upon the weather conditions.
After determining which factors relate to the selling prices of homes located in and around a ski resort, develop a model to predict housing prices.
Key words: Scatterplot matrix, correlations, multiple regression, stepwise regression, multicollinearity, model building, model diagnostics
Bank Revenues
A bank wants to understand how customer banking habits contribute to revenues and profitability. Build a model that allows the bank to predict profitability for a given customer. The resulting model will be used to forecast bank revenues and guide the bank in future marketing campaigns.
Determine whether certain conditions make it more likely that a customer order will be won or lost.
Key words: Bar charts, frequency distribution, mosaic plots, contingency table, chi-squared test, logistic regression, predicted values, confusion matrix
Titanic Passengers
Use passenger data related to the sinking of the RMS Titanic to explore questions of interest about survival rates. For example, were there key characteristics of the survivors? Were some passenger groups more likely to survive than others? Can we accurately predict survival?
A bank would like to understand the demographics and other characteristics associated with whether a customer accepts a credit card offer. Build a classification model that will provide insight into why some bank customers accept credit card offers.
The scenario relates to the handling of customer queries via an IT call center, whose performance is well below best in class. Identify potential process changes to allow the call center to achieve best-in-class performance.
Key words: Interactive data visualization, graphs, distribution, tabulate, recursive partitioning, process capability, control chart, multiple regression, prediction profiler
Customer Churn
Analyze the factors related to customer churn of a mobile phone service provider. The company would like to build a model to predict which customers are most likely to move their service to a competitor. This knowledge will be used to identify customers for targeted interventions, with the ultimate goal of reducing churn.
Build a variety of prediction models (multiple regression, partition tree, and a neural network) to determine the one that performs the best at predicting house prices based upon various characteristics of the house and its location.
Key words: Stepwise regression, regression trees, neural networks, model validation, model comparison
Durability of Mobile Phone Screen - Part 1
Evaluate the durability of mobile phone screens in a drop test. Determine if a desired level of durability is achieved for each of two types of screens and compare performance.
Key words: Confidence Intervals, Hypothesis Tests for One and Two Population Proportions, Chi-square, Relative Risk
Durability of Mobile Phone Screen - Part 2
Evaluate the durability of mobile phone screens in a drop test at various drop heights. Determine if a desired level of durability is achieved for each of three types of screens and compare performance.
Key words: Contingency analysis, comparing proportions via difference, relative risk and odds ratio
Durability of Mobile Phone Screen - Part 3
Evaluate the durability of mobile phone screens in a drop test across various heights by building individual simple logistic regression models. Use the models to estimate the probability of a screen being damaged across any drop height.
Key words: Single variable logistic regression, inverse prediction
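A minimal sketch of the idea, assuming invented drop-test data and scikit-learn rather than whatever software the case study uses:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented drop-test data: drop height (cm) and whether the screen was damaged
heights = np.array([[20], [40], [60], [80], [100], [120], [140], [160]])
damaged = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression(max_iter=1000).fit(heights, damaged)
b0, b1 = model.intercept_[0], model.coef_[0][0]

# Probability of damage at an arbitrary drop height
print(model.predict_proba([[90]])[0, 1])

# Inverse prediction: the height at which P(damage) = 0.5
print(-b0 / b1)
```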
Durability of Mobile Phone Screen - Part 4
Evaluate the durability of mobile phone screens in a drop test across various heights by building a single multiple logistic regression model. Use the model to estimate the probability of a screen being damaged across any drop height.
Key words: Multivariate logistic regression, inverse prediction, odds ratio
Online Mortgage Application
Evaluate the potential improvement to the UI design of an online mortgage application process by examining the usability ratings from a sample of 50 customers and comparing their performance using the new design against a large collection of historical data on customers’ performance with the current design.
Key words: Distribution, normality, normal quantile plot, Shapiro-Wilk and Anderson-Darling tests, t-Test
Performance of Food Manufacturing Process - Part 1
Evaluate the performance to specifications of a food manufacturing process using graphical analyses and numerical summarizations of the data.
Key words: Distribution, summary statistics, time series plots
Performance of Food Manufacturing Process - Part 2
Evaluate the performance to specifications of a food manufacturing process using confidence intervals and hypothesis testing.
Key words: Distribution, normality, normal quantile plot, Shapiro-Wilk and Anderson-Darling tests, test of mean and test of standard deviation
Detergent Cleaning Effectiveness
Analyze the results of an experiment to determine if there is statistical evidence demonstrating an improvement in a new laundry detergent formulation. Explore and describe the effect that multiple factors have on a response, and identify the conditions with the most and least impact.
Key words: Analysis of variance (ANOVA), t-Test, pairwise comparison, model diagnostics, model performance
Manufacturing Systems Variation
Study the use of a nested variability chart to understand and analyze the different components of variance, and explore ways to minimize variability by applying various operational rules.
Key words: Variability gauge, nested design, component analysis of variance
Text Exploration of Patents
Use unstructured data analysis to explore and understand the text of patents filed by different companies.
Key words: Word cloud, data visualization, term selection
US Stock Indices
Understand the basic concepts of time series data analysis and explore practical ways to assess the risk and rate of return associated with financial index data.
Study the application of regression and choice modeling (also called conjoint analysis) to understand and analyze the importance of product attributes, and of their levels, in influencing preferences.
Key words: Part Worth, regression, prediction profiler
Pricing Spectacles
Design and analyze discrete choice experiments (also called conjoint analysis) to discover which product or service attributes are preferred by potential customers.
Key words: Discrete choice design, regression, utility and probability profiler, willingness to pay
Modeling Gold Prices
Learn univariate time series modeling using US gold prices. Build AR, MA, ARMA, and ARIMA models to analyze the characteristics of the time series data and produce forecasts.
Key words: Stationarity, AR, MA, ARMA, ARIMA, model comparison and diagnostics
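For orientation, a minimal sketch of this workflow using statsmodels and a simulated stand-in series, not the case study's gold price data:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Simulated stand-in for a price series: a random walk with drift,
# which is non-stationary, as raw gold prices typically are.
rng = np.random.default_rng(0)
prices = 1800 + np.cumsum(rng.normal(0.5, 5.0, size=200))

# ARIMA(1, 1, 1): one AR term, first differencing to induce stationarity,
# and one MA term. Compare candidate orders via the AIC/BIC in the summary.
fit = ARIMA(prices, order=(1, 1, 1)).fit()
print(fit.summary())
print(fit.forecast(steps=10))  # 10-step-ahead forecast
```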
Explore statistical evidence demonstrating an association between saguaro size and the number of flowers a plant produces.
Apply time series forecasting and generalized linear mixed models (GLMMs) to evaluate how butterfly populations are affected by climate and land-use changes.
Key words: Time series forecasting, Generalized linear mixed model
Exploratory Factor Analysis of Trust in Online Sellers
Apply exploratory factor analysis to uncover latent factor structure in an online shopping questionnaire.
Key words: Exploratory Factor Analysis (EFA), Bartlett’s Test, KMO Test
Modeling Online Shopping Perceptions
Apply measurement and structural models to survey responses from online shoppers to build and evaluate competing models.
Key words: Confirmatory Factor Analysis (CFA), Structural Equation Modeling (SEM), Measurement and Structural Regression Models, Model Comparison
Functional Data Analysis for HPLC Optimization
Apply functional data analysis and functional design of experiments (FDOE) for the optimization of an analytical method to allow for the accurate quantification of two biological components.
Key words: Functional Data Analysis, Functional PCA, Functional DOE
Nonlinear Regression Modeling for Cell Growth Optimization
Apply nonlinear models to understand the impact of factors on cell growth.
Key words: Nonlinear Modeling, Logistic 3P, Curve DOE
Quantifying Sentiment in Economic Reports
Apply sentiment analysis to quantify the emotion in unstructured text.
Key words: Word Cloud, Sentiment Analysis
Monitoring Fish Abundance in the Mesoamerican Reef
Apply exploratory data analysis in the context of wildlife monitoring and nature conservation.
Key words: Summary statistics, Crosstabulation, Data visualization
A Dataset Exploration Case Study with Know Your Data
August 9, 2021
Posted by Mark Díaz and Emily Denton, Research Scientists, Google Research, Ethical AI Team
Data underlies much of machine learning (ML) research and development, helping to structure what a machine learning algorithm learns and how models are evaluated and benchmarked. However, data collection and labeling can be complicated by unconscious biases, data access limitations and privacy concerns, among other challenges. As a result, machine learning datasets can reflect unfair social biases along dimensions of race, gender, age, and more.
Methods of examining datasets that can surface information about how different social groups are represented within them are a key component of ensuring that the development of ML models and datasets is aligned with our AI Principles. Such methods can inform the responsible use of ML datasets and point toward potential mitigations of unfair outcomes. For example, prior research has demonstrated that some object recognition datasets are biased toward images sourced from North America and Western Europe, prompting Google’s Crowdsource effort to balance out image representations in other parts of the world.
Today, we demonstrate some of the functionality of a dataset exploration tool, Know Your Data (KYD), recently introduced at Google I/O, using the COCO Captions dataset as a case study. Using this tool, we find a range of gender and age biases in COCO Captions — biases that can be traced to both dataset collection and annotation practices. KYD is a dataset analysis tool that complements the growing suite of responsible AI tools being developed across Google and the broader research community. Currently, KYD only supports analysis of a small set of image datasets, but we’re working hard to make the tool accessible beyond this set.
Introducing Know Your Data
Know Your Data helps ML research, product and compliance teams understand datasets, with the goal of improving data quality, and thus helping to mitigate fairness and bias issues. KYD offers a range of features that allow users to explore and examine machine learning datasets — users can filter, group, and study correlations based on annotations already present in a given dataset. KYD also presents automatically computed labels from Google’s Cloud Vision API, providing users with a simple way to explore their data based on signals that weren’t originally present in the dataset.
A KYD Case Study
As a case study, we explore some of these features using the COCO Captions dataset, an image dataset that contains five human-generated captions for each of over 300k images. Given the rich annotations provided by free-form text, we focus our analysis on signals already present within the dataset.
Exploring Gender Bias
Previous research has demonstrated undesirable gender biases within computer vision datasets, including pornographic imagery of women and image label correlations that align with harmful gender stereotypes. We use KYD to explore gender biases within COCO Captions by examining gendered correlations within the image captions. We find a gender bias in the depiction of different activities across the images in the dataset, as well as biases relating to how people of different genders are described by annotators.
The first part of our analysis aimed to surface gender biases with respect to different activities depicted in the dataset. We examined images captioned with words describing different activities and analyzed their relation to gendered caption words, such as “man” or “woman”. Building upon recent work that leverages the PMI metric to measure associations learned by a model, the KYD relations tab makes it easy to examine associations between different signals in a dataset. This tab visualizes the extent to which two signals in the dataset co-occur more (or less) than would be expected by chance. Each cell indicates either a positive (blue color) or negative (orange color) correlation between two specific signal values along with the strength of that correlation.
KYD also allows users to filter rows of a relations table based on substring matching. Using this functionality, we initially probed for caption words containing “-ing”, as a simple way to filter by verbs. We immediately saw strong gendered correlations:
A screenshot of the KYD relations tab, used to analyze the relationship between any word and gendered words. Each cell shows whether the two respective words co-occur in the same caption more (up arrow) or less often (down arrow) than pure chance.
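To make the statistic concrete, here is a toy PMI computation over invented captions — an illustration of the co-occurrence idea, not KYD's actual implementation (KYD uses a normalized variant, nPMI):

```python
import math
from typing import List

# Toy PMI computation over invented captions. PMI compares how often two
# words co-occur against how often they would co-occur by chance.
captions = [
    "a woman cooking in a kitchen",
    "a man skateboarding down a rail",
    "two women shopping at a market",
    "a man surfing a large wave",
]

def pmi(word_a: str, word_b: str, captions: List[str]) -> float:
    n = len(captions)
    has_a = [word_a in c.split() for c in captions]
    has_b = [word_b in c.split() for c in captions]
    p_a, p_b = sum(has_a) / n, sum(has_b) / n
    p_ab = sum(a and b for a, b in zip(has_a, has_b)) / n
    if p_ab == 0:
        return float("-inf")  # the words never co-occur
    # ratio > 1 (PMI > 0): co-occur more than chance; < 1: less than chance
    return math.log2(p_ab / (p_a * p_b))

print(pmi("woman", "cooking", captions))  # positive in this toy sample
```

The observed-to-expected ratios quoted in the captions below (for example, 0.2x and 2.62x) correspond to the p_ab / (p_a · p_b) ratio before taking the log.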
Digging further into these correlations, we found that several activities stereotypically associated with women, such as “shopping” and “cooking”, co-occur with images captioned with “women” or “woman” at a higher rate than with images captioned with “men” or “man”. In contrast, captions describing many physically intensive activities, such as “skateboarding”, “surfing”, and “snowboarding”, co-occur with images captioned with “man” or “men” at higher rates.
While individual image captions may not use stereotypical or derogatory language, such as with the example below, if certain gender groups are over (or under) represented within a particular activity across the whole dataset, models developed from the dataset risk learning stereotypical associations. KYD makes it easy to surface, quantify, and make plans to mitigate this risk.
An image with one of the captions: “Two women cooking in a beige and white kitchen.” Image licensed under CC-BY 2.0.
In addition to examining biases with respect to the social groups depicted with different activities, we also explored biases in how annotators described the appearance of people they perceived as male or female. Inspired by media scholars who have examined the “male gaze” embedded in other forms of visual media, we examined the frequency with which individuals perceived as women in COCO are described using adjectives that position them as an object of desire. KYD allowed us to easily examine co-occurrences between words associated with binary gender (e.g. "female/girl/woman" vs. "male/man/boy") and words associated with evaluating physical attractiveness. Importantly, these are captions written by human annotators, who are making subjective assessments about the gender of people in the image and choosing a descriptor for attractiveness. We see that the words "attractive", "beautiful", "pretty", and "sexy" are overrepresented in describing people perceived as women as compared to those perceived as men, confirming what prior work has said about how gender is viewed in visual media.
A screenshot showing the relationship between words that describe attractiveness and gendered words. For example, “attractive” and “male/man/boy” co-occur 12 times, but we expect ~60 times by chance (the ratio is 0.2x). On the other hand, “attractive” and “female/woman/girl” co-occur 2.62 times more than chance.
KYD also allows us to manually inspect images for each relation by clicking on the relation in question. For example, we can see images whose captions include female terms (e.g. “woman”) and the word “beautiful”.
Exploring Age Bias
Adults older than 65 have been shown to be underrepresented in datasets relative to their presence in the general population — a first step toward improving age representation is to allow developers to assess it in their datasets. By looking at caption words describing different activities and analyzing their relation to caption words describing age, KYD helped us to assess the range of example captions depicting older adults. Having example captions of adults in a range of environments and activities is important for a variety of tasks, such as image captioning or pedestrian detection.
The first trend that KYD made clear is how rarely annotators described people as older adults in captions detailing different activities. The relations tab also shows a trend wherein “elderly”, “old”, and “older” tend not to occur with verbs that describe a variety of physical activities that might be important for a system to be able to detect. Important to note is that, relative to “young”, “old” is more often used to describe things other than people, such as belongings or clothing, so these relations are also capturing some uses that don’t describe people.
A screenshot of KYD showing the relationship between age-related words and activity words.
The underrepresentation of captions containing references to older adults that we examined here could be rooted in a relative lack of images depicting older adults, as well as in a tendency for annotators to omit older age-related terms when describing people in images. While manual inspection of the intersection of “old” and “running” shows a negative relation, we notice that it shows no older people and a number of locomotives. KYD makes it easy to quantitatively and qualitatively inspect relations to identify dataset strengths and areas for improvement.
Understanding the contents of ML datasets is a critical first step to developing suitable strategies to mitigate the downstream impact of unfair dataset bias. The above analysis points towards several potential mitigations. For example, correlations between certain activities and social groups, which can lead trained models to reproduce social stereotypes, can be potentially mitigated by “dataset balancing” — increasing the representation of under-represented group/activity combinations. However, mitigations focused exclusively on dataset balancing are not sufficient, as our analysis of how different genders are described by annotators demonstrated. We found that annotators’ subjective judgements of people portrayed in images were reflected within the final dataset, suggesting that a deeper look at methods of image annotation is needed. One solution for data practitioners who are developing image captioning datasets is to consider integrating guidelines that have been developed for writing image descriptions that are sensitive to race, gender, and other identity categories.
The above case studies highlight only some of the KYD features. For example, Cloud Vision API signals are also integrated into KYD and can be used to infer signals that annotators haven't labeled directly. We encourage the broader ML community to perform their own KYD case studies and share their findings.
KYD complements other dataset analysis tools being developed across the ML community, including Google's growing Responsible AI toolkit. We look forward to ML practitioners using KYD to better understand their datasets and mitigate potential bias and fairness concerns. If you have feedback on KYD, please write to [email protected].
Acknowledgements
The analysis and write-up in this post were conducted with equal contribution by Emily Denton, Mark Díaz, and Alex Hanna. We thank Marie Pellat, Ludovic Peran, Daniel Smilkov, Nikhil Thorat and Tsung-Yi for their contributions to and reviews of this post. We also thank the researchers and teams that have developed the signals and metrics used in KYD and particularly the team that has helped us implement nPMI.
Data Analysis Case Study: Learn From Humana’s Automated Data Analysis Project
Lillian Pierson, P.E.
Got data? Great! Looking for that perfect data analysis case study to help you get started using it? You’re in the right place.
If you’ve ever struggled to decide what to do next with your data projects, to actually find meaning in the data, or even to decide what kind of data to collect, then KEEP READING…
Deep down, you know what needs to happen. You need to initiate and execute a data strategy that really moves the needle for your organization. One that produces seriously awesome business results.
But how? You’re in the right place to find out.
As a data strategist who has worked with 10 percent of Fortune 100 companies, today I’m sharing with you a case study that demonstrates just how real businesses are making real wins with data analysis.
In the post below, we’ll look at:
A shining data success story;
What went on ‘under-the-hood’ to support that successful data project; and
The exact data technologies used by the vendor to take this project from pure strategy to pure success
If you prefer to watch this information rather than read it, it’s captured in the video below:
Here’s the URL too: https://youtu.be/xMwZObIqvLQ
3 Action Items You Need To Take
To actually use the data analysis case study you’re about to get – you need to take 3 main steps. Those are:
Reflect upon your organization as it is today (I left you some prompts below – to help you get started)
Review winning data case collections (starting with the one I’m sharing here) and identify 5 that seem the most promising for your organization given its current set-up
Assess your organization AND those 5 winning case collections. Based on that assessment, select the “QUICK WIN” data use case that offers your organization the most bang for its buck
Step 1: Reflect Upon Your Organization
Whenever you evaluate data case collections to decide if they’re a good fit for your organization, the first thing you need to do is organize your thoughts with respect to your organization as it is today.
Before moving into the data analysis case study, STOP and ANSWER THE FOLLOWING QUESTIONS – just to remind yourself:
What is the business vision for our organization?
What industries do we primarily support?
What data technologies do we already have up and running, that we could use to generate even more value?
What team members do we have to support a new data project? And what are their data skillsets like?
What type of data are we mostly looking to generate value from? Structured? Semi-Structured? Un-structured? Real-time data? Huge data sets? What are our data resources like?
Jot down some notes while you’re here. Then keep them in mind as you read on to find out how one company, Humana, used its data to achieve a 28 percent increase in customer satisfaction and a 63 percent increase in employee engagement! (That’s such a seriously impressive outcome, right?!)
Step 2: Review Data Case Studies
Here we are, already at step 2. It’s time for you to start reviewing data analysis case studies (starting with the one I’m sharing below). Identify 5 that seem the most promising for your organization given its current set-up.
Humana’s Automated Data Analysis Case Study
The key thing to note here is that the approach to creating a successful data program varies from industry to industry.
Let’s start with one to demonstrate the kind of value you can glean from these kinds of success stories.
Humana has provided health insurance to Americans for over 50 years. It is a service company focused on fulfilling the needs of its customers. A great deal of Humana’s success as a company rides on customer satisfaction, and the frontline of that battle for customers’ hearts and minds is Humana’s customer service center.
Call centers are hard to get right. A lot of emotions can arise during a customer service call, especially one relating to health and health insurance. Sometimes people are frustrated. At times, they’re upset. Also, there are times the customer service representative becomes aggravated, and the overall tone and progression of the phone call goes downhill. This is of course very bad for customer satisfaction.
Humana wanted to find a way to use artificial intelligence to monitor their phone calls and help their agents do a better job connecting with their customers in order to improve customer satisfaction (and thus, customer retention rates & profits per customer ).
In light of their business need, Humana worked with a company called Cogito, which specializes in voice analytics technology.
Cogito offers a piece of AI technology called Cogito Dialogue. It’s been trained to identify certain conversational cues as a way of helping call center representatives and supervisors stay actively engaged in a call with a customer.
The AI listens to cues like the customer’s voice pitch.
If it’s rising, or if the call representative and the customer talk over each other, then the dialogue tool will send out electronic alerts to the agent during the call.
Humana fed the dialogue tool customer service data from 10,000 calls and allowed it to analyze cues such as keywords, interruptions, and pauses, and these cues were then linked with specific outcomes. For example, if a representative received a particular pattern of cues, the call was likely to end with a specific customer satisfaction result.
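To illustrate the shape of such a rule-based alerting layer, here is a hypothetical sketch; the feature names and thresholds are invented and are not Cogito's actual logic:

```python
from dataclasses import dataclass

# Hypothetical sketch of rule-based in-call alerts; features and
# thresholds are invented for illustration only.
@dataclass
class CallWindow:
    pitch_slope: float    # how fast the customer's voice pitch is rising
    overlap_ratio: float  # fraction of the window where both parties speak
    speech_rate: float    # representative's words per minute

def alerts_for(w: CallWindow) -> list:
    alerts = []
    if w.pitch_slope > 2.0:
        alerts.append("The tone of voice is too tense")
    if w.speech_rate > 180:
        alerts.append("The speed of speaking is high")
    if w.overlap_ratio > 0.15:
        alerts.append("The customer representative and customer "
                      "are speaking at the same time")
    return alerts  # delivered to the representative during the call

print(alerts_for(CallWindow(pitch_slope=3.1, overlap_ratio=0.2, speech_rate=190)))
```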
The Outcome
Customers were happier, and customer service representatives were more engaged.
This automated solution for data analysis has now been deployed in 200 Humana call centers and the company plans to roll it out to 100 percent of its centers in the future.
The initiative was so successful, Humana has been able to focus on next steps in its data program. The company now plans to begin predicting the type of calls that are likely to go unresolved, so they can send those calls over to management before they become frustrating to the customer and customer service representative alike.
What does this mean for you and your business?
Well, if you’re looking for new ways to generate value by improving the quantity and quality of the decision support that you’re providing to your customer service personnel, then this may be a perfect example of how you can do so.
Humana’s Business Use Cases
Humana’s data analysis case study includes two key business use cases:
Analyzing customer sentiment; and
Suggesting actions to customer service representatives.
Analyzing Customer Sentiment
First things first, before you go ahead and collect data, you need to ask yourself who and what is involved in making things happen within the business.
In the case of Humana, the actors were:
The health insurance system itself
The customer, and
The customer service representative
As you can see in the use case diagram above, the relational aspect is pretty simple. You have a customer service representative and a customer. They are both producing audio data, and that audio data is being fed into the system.
Humana focused on collecting the key data points, shown in the image below, from their customer service operations.
By collecting data about speech style, pitch, silence, stress in customers’ voices, length of call, speed of customers’ speech, intonation, articulation, and representatives’ manner of speaking, Humana was able to analyze customer sentiment and introduce techniques for improved customer satisfaction.
Having strategically defined these data points, the Cogito technology was able to generate reports about customer sentiment during the calls.
Suggesting Actions to Customer Service Representatives
The second use case for the Humana data program follows on from the data gathered in the first case.
In Humana’s case, Cogito generated a host of call analyses and reports about key call issues.
In the second business use case, Cogito was able to suggest actions to customer service representatives, in real time, to make use of incoming data and help improve customer satisfaction on the spot.
The technology Humana used provided suggestions via text message to the customer service representative, offering the following types of feedback:
The tone of voice is too tense
The speed of speaking is high
The customer representative and customer are speaking at the same time
These alerts allowed the Humana customer service representatives to alter their approach immediately, improving the quality of the interaction and, subsequently, the customer satisfaction.
The preconditions for success in this use case were:
The call-related data must be collected and stored
The AI models must be in place to generate analysis on the data points that are recorded during the calls
Evidence of success can subsequently be found in a system that offers real-time suggestions for courses of action that the customer service representative can take to improve customer satisfaction.
Thanks to this data-intensive business use case, Humana was able to increase customer satisfaction, improve customer retention rates, and drive profits per customer.
The Technology That Supports This Data Analysis Case Study
I promised to dip into the tech side of things. This is especially for those of you who are interested in the ins and outs of how projects like this one are actually rolled out.
Here’s a little rundown of the main technologies we discovered when we investigated how Cogito runs in support of its clients like Humana.
For cloud data management, Cogito uses AWS, specifically the Athena product
For on-premise big data management, the company uses Apache HDFS – the distributed file system for storing big data
They utilize MapReduce for processing their data
And Cogito also has traditional systems and relational database management systems such as PostgreSQL
In terms of analytics and data visualization tools, Cogito makes use of Tableau
And for its machine learning technology, these use cases required people with knowledge in Python, R, and SQL, as well as deep learning (Cogito uses the PyTorch library and the TensorFlow library)
These data science skill sets support the effective computing, deep learning, and natural language processing applications employed by Humana for this use case.
If you’re looking to hire people to help with your own data initiative, then people with those skills listed above, and with experience in these specific technologies, would be a huge help.
Step 3: Select The “Quick Win” Data Use Case
Still there? Great!
It’s time to close the loop.
Remember those notes you took before you reviewed the study? I want you to STOP here and assess. Does this Humana case study seem applicable and promising as a solution, given your organization’s current set-up?
YES ▶ Excellent!
Earmark it and continue exploring other winning data use cases until you’ve identified 5 that seem like great fits for your business’s needs. Evaluate those against your organization’s needs, and select the very best fit to be your “quick win” data use case. Develop your data strategy around that.
NO , Lillian – It’s not applicable. ▶ No problem.
Discard the information and continue exploring the winning data use cases we’ve categorized for you according to business function and industry. Save time by dialing down into the business function you know your business really needs help with now. Identify 5 winning data use cases that seem like great fits for your business’s needs. Evaluate those against your organization’s needs, and select the very best fit to be your “quick win” data use case. Develop your data strategy around that data use case.
The Ultimate Guide to Qualitative Research - Part 1: The Basics
Case studies
Case studies are essential to qualitative research, offering a lens through which researchers can investigate complex phenomena within their real-life contexts. This chapter explores the concept, purpose, applications, examples, and types of case studies and provides guidance on how to conduct case study research effectively.
Whereas quantitative methods look at phenomena at scale, case study research looks at a concept or phenomenon in considerable detail. While analyzing a single case can help understand one perspective regarding the object of research inquiry, analyzing multiple cases can help obtain a more holistic sense of the topic or issue. Let's provide a basic definition of a case study, then explore its characteristics and role in the qualitative research process.
Definition of a case study
A case study in qualitative research is a strategy of inquiry that involves an in-depth investigation of a phenomenon within its real-world context. It provides researchers with the opportunity to acquire an in-depth understanding of intricate details that might not be as apparent or accessible through other methods of research. The specific case or cases being studied can be a single person, group, or organization – demarcating what constitutes a relevant case worth studying depends on the researcher and their research question.
Among qualitative research methods, a case study relies on multiple sources of evidence, such as documents, artifacts, interviews, or observations, to present a complete and nuanced understanding of the phenomenon under investigation. The objective is to illuminate the readers' understanding of the phenomenon beyond its abstract statistical or theoretical explanations.
Characteristics of case studies
Case studies typically possess a number of distinct characteristics that set them apart from other research methods. These characteristics include a focus on holistic description and explanation, flexibility in the design and data collection methods, reliance on multiple sources of evidence, and emphasis on the context in which the phenomenon occurs.
Furthermore, case studies can often involve a longitudinal examination of the case, meaning they study the case over a period of time. These characteristics allow case studies to yield comprehensive, in-depth, and richly contextualized insights about the phenomenon of interest.
The role of case studies in research
Case studies hold a unique position in the broader landscape of research methods aimed at theory development. They are instrumental when the primary research interest is to gain an intensive, detailed understanding of a phenomenon in its real-life context.
In addition, case studies can serve different purposes within research - they can be used for exploratory, descriptive, or explanatory purposes, depending on the research question and objectives. This flexibility and depth make case studies a valuable tool in the toolkit of qualitative researchers.
Remember, a well-conducted case study can offer a rich, insightful contribution to both academic and practical knowledge through theory development or theory verification, thus enhancing our understanding of complex phenomena in their real-world contexts.
What is the purpose of a case study?
Case study research aims for a more comprehensive understanding of phenomena, requiring various research methods to gather information for qualitative analysis. Ultimately, a case study can allow the researcher to gain insight into a particular object of inquiry and develop a theoretical framework relevant to the research inquiry.
Why use case studies in qualitative research?
Using case studies as a research strategy depends mainly on the nature of the research question and the researcher's access to the data.
Conducting case study research provides a level of detail and contextual richness that other research methods might not offer. Case studies are beneficial when there is a need to understand complex social phenomena within their natural contexts.
The explanatory, exploratory, and descriptive roles of case studies
Case studies can take on various roles depending on the research objectives. They can be exploratory when the research aims to discover new phenomena or define new research questions; they are descriptive when the objective is to depict a phenomenon within its context in a detailed manner; and they can be explanatory if the goal is to understand specific relationships within the studied context. Thus, the versatility of case studies allows researchers to approach their topic from different angles, offering multiple ways to uncover and interpret the data.
The impact of case studies on knowledge development
Case studies play a significant role in knowledge development across various disciplines. Analysis of cases provides an avenue for researchers to explore phenomena within their context based on the collected data.
This can result in the production of rich, practical insights that can be instrumental in both theory-building and practice. Case studies allow researchers to delve into the intricacies and complexities of real-life situations, uncovering insights that might otherwise remain hidden.
Types of case studies
In qualitative research, a case study is not a one-size-fits-all approach. Depending on the nature of the research question and the specific objectives of the study, researchers might choose to use different types of case studies. These types differ in their focus, methodology, and the level of detail they provide about the phenomenon under investigation.
Understanding these types is crucial for selecting the most appropriate approach for your research project and effectively achieving your research goals. Let's briefly look at the main types of case studies.
Exploratory case studies
Exploratory case studies are typically conducted to develop a theory or framework around an understudied phenomenon. They can also serve as a precursor to a larger-scale research project. Exploratory case studies are useful when a researcher wants to identify the key issues or questions which can spur more extensive study or be used to develop propositions for further research. These case studies are characterized by flexibility, allowing researchers to explore various aspects of a phenomenon as they emerge, which can also form the foundation for subsequent studies.
Descriptive case studies
Descriptive case studies aim to provide a complete and accurate representation of a phenomenon or event within its context. These case studies are often based on an established theoretical framework, which guides how data is collected and analyzed. The researcher is concerned with describing the phenomenon in detail, as it occurs naturally, without trying to influence or manipulate it.
Explanatory case studies
Explanatory case studies are focused on explanation - they seek to clarify how or why certain phenomena occur. Often used in complex, real-life situations, they can be particularly valuable in clarifying causal relationships among concepts and understanding the interplay between different factors within a specific context.
Intrinsic, instrumental, and collective case studies
These three categories of case studies focus on the nature and purpose of the study. An intrinsic case study is conducted when a researcher has an inherent interest in the case itself. Instrumental case studies are employed when the case is used to provide insight into a particular issue or phenomenon. A collective case study, on the other hand, involves studying multiple cases simultaneously to investigate some general phenomena.
Each type of case study serves a different purpose and has its own strengths and challenges. The selection of the type should be guided by the research question and objectives, as well as the context and constraints of the research.
The flexibility, depth, and contextual richness offered by case studies make this approach an excellent research method for various fields of study. They enable researchers to investigate real-world phenomena within their specific contexts, capturing nuances that other research methods might miss. Across numerous fields, case studies provide valuable insights into complex issues.
Critical information systems research
Case studies provide a detailed understanding of the role and impact of information systems in different contexts. They offer a platform to explore how information systems are designed, implemented, and used and how they interact with various social, economic, and political factors. Case studies in this field often focus on examining the intricate relationship between technology, organizational processes, and user behavior, helping to uncover insights that can inform better system design and implementation.
Health research
Health research is another field where case studies are highly valuable. They offer a way to explore patient experiences, healthcare delivery processes, and the impact of various interventions in a real-world context.
Case studies can provide a deep understanding of a patient's journey, giving insights into the intricacies of disease progression, treatment effects, and the psychosocial aspects of health and illness.
Asthma research studies
Specifically within medical research, studies on asthma often employ case studies to explore the individual and environmental factors that influence asthma development, management, and outcomes. A case study can provide rich, detailed data about individual patients' experiences, from the triggers and symptoms they experience to the effectiveness of various management strategies. This can be crucial for developing patient-centered asthma care approaches.
Other fields
Apart from the fields mentioned, case studies are also extensively used in business and management research, education research, and political sciences, among many others. They provide an opportunity to delve into the intricacies of real-world situations, allowing for a comprehensive understanding of various phenomena.
Case studies, with their depth and contextual focus, offer unique insights across these varied fields. They allow researchers to illuminate the complexities of real-life situations, contributing to both theory and practice.
Understanding the key elements of case study design is crucial for conducting rigorous and impactful case study research. A well-structured design guides the researcher through the process, ensuring that the study is methodologically sound and its findings are reliable and valid. The main elements of case study design include the research question, propositions, units of analysis, and the logic linking the data to the propositions.
The research question is the foundation of any research study. A good research question guides the direction of the study and informs the selection of the case, the methods of collecting data, and the analysis techniques. A well-formulated research question in case study research is typically clear, focused, and complex enough to merit further detailed examination of the relevant case(s).
Propositions
Propositions, though not necessary in every case study, provide a direction by stating what we might expect to find in the data collected. They guide how data is collected and analyzed by helping researchers focus on specific aspects of the case. They are particularly important in explanatory case studies, which seek to understand the relationships among concepts within the studied phenomenon.
Units of analysis
The unit of analysis refers to the case, or the main entity or entities that are being analyzed in the study. In case study research, the unit of analysis can be an individual, a group, an organization, a decision, an event, or even a time period. It's crucial to clearly define the unit of analysis, as it shapes the qualitative data analysis process by allowing the researcher to analyze a particular case and synthesize analysis across multiple case studies to draw conclusions.
Argumentation
This refers to the inferential model that allows researchers to draw conclusions from the data. The researcher needs to ensure that there is a clear link between the data, the propositions (if any), and the conclusions drawn. This argumentation is what enables the researcher to make valid and credible inferences about the phenomenon under study.
Understanding and carefully considering these elements in the design phase of a case study can significantly enhance the quality of the research. It can help ensure that the study is methodologically sound and its findings contribute meaningful insights about the case.
Conducting a case study involves several steps, from defining the research question and selecting the case to collecting and analyzing data. This section outlines these key stages, providing a practical guide on how to conduct case study research.
Defining the research question
The first step in case study research is defining a clear, focused research question. This question should guide the entire research process, from case selection to analysis. It's crucial to ensure that the research question is suitable for a case study approach. Typically, such questions are exploratory or descriptive in nature and focus on understanding a phenomenon within its real-life context.
Selecting and defining the case
The selection of the case should be based on the research question and the objectives of the study. It involves choosing a unique example or a set of examples that provide rich, in-depth data about the phenomenon under investigation. After selecting the case, it's crucial to define it clearly, setting the boundaries of the case, including the time period and the specific context.
Previous research can help guide the case study design: an example of a case could be taken from previous case study research and used to define cases in a new research inquiry. Considering recently published examples can help you understand how to select and define cases effectively.
Developing a detailed case study protocol
A case study protocol outlines the procedures and general rules to be followed during the case study. This includes the data collection methods to be used, the sources of data, and the procedures for analysis. Having a detailed case study protocol ensures consistency and reliability in the study.
The protocol should also consider how to work with the people involved in the research context to grant the research team access to collecting data. As mentioned in previous sections of this guide, establishing rapport is an essential component of qualitative research as it shapes the overall potential for collecting and analyzing data.
Collecting data
Gathering data in case study research often involves multiple sources of evidence, including documents, archival records, interviews, observations, and physical artifacts. This allows for a comprehensive understanding of the case. The process for gathering data should be systematic and carefully documented to ensure the reliability and validity of the study.
Analyzing and interpreting data
The next step is analyzing the data. This involves organizing the data, categorizing it into themes or patterns, and interpreting these patterns to answer the research question. The analysis might also involve comparing the findings with prior research or theoretical propositions.
Writing the case study report
The final step is writing the case study report. This should provide a detailed description of the case, the data, the analysis process, and the findings. The report should be clear, organized, and carefully written to ensure that the reader can understand the case and the conclusions drawn from it.
Each of these steps is crucial in ensuring that the case study research is rigorous, reliable, and provides valuable insights about the case.
The type, depth, and quality of data in your study can significantly influence the validity and utility of the study. In case study research, data is usually collected from multiple sources to provide a comprehensive and nuanced understanding of the case. This section will outline the various methods of collecting data used in case study research and discuss considerations for ensuring the quality of the data.
Interviews are a common method of gathering data in case study research. They can provide rich, in-depth data about the perspectives, experiences, and interpretations of the individuals involved in the case. Interviews can be structured, semi-structured, or unstructured, depending on the research question and the degree of flexibility needed.
Observations
Observations involve the researcher observing the case in its natural setting, providing first-hand information about the case and its context. Observations can provide data that might not be revealed in interviews or documents, such as non-verbal cues or contextual information.
Documents and artifacts
Documents and archival records provide a valuable source of data in case study research. They can include reports, letters, memos, meeting minutes, email correspondence, and various public and private documents related to the case.
These records can provide historical context, corroborate evidence from other sources, and offer insights into the case that might not be apparent from interviews or observations.
Physical artifacts refer to any physical evidence related to the case, such as tools, products, or physical environments. These artifacts can provide tangible insights into the case, complementing the data gathered from other sources.
Ensuring the quality of data collection
Ensuring the quality of data in case study research requires careful planning and execution. It's crucial to ensure that the data is reliable, accurate, and relevant to the research question. This involves selecting appropriate methods of collecting data, properly training interviewers or observers, and systematically recording and storing the data. It also includes considering ethical issues related to collecting and handling data, such as obtaining informed consent and ensuring the privacy and confidentiality of the participants.
Data analysis
Analyzing case study research involves making sense of the rich, detailed data to answer the research question. This process can be challenging due to the volume and complexity of case study data. However, a systematic and rigorous approach to analysis can ensure that the findings are credible and meaningful. This section outlines the main steps and considerations in analyzing data in case study research.
Organizing the data
The first step in the analysis is organizing the data. This involves sorting the data into manageable sections, often according to the data source or the theme. This step can also involve transcribing interviews, digitizing physical artifacts, or organizing observational data.
Categorizing and coding the data
Once the data is organized, the next step is to categorize or code the data. This involves identifying common themes, patterns, or concepts in the data and assigning codes to relevant data segments. Coding can be done manually or with the help of qualitative data analysis software, which can greatly facilitate the process. Coding helps to reduce the data to a set of themes or categories that can be more easily analyzed.
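As a toy illustration of how coded segments might be organized once codes are assigned (the segments and code labels below are hypothetical, not drawn from any real study):

from collections import defaultdict

# Hypothetical coded segments: (data segment, assigned code)
segments = [
    ("Staff felt the new schedule reduced stress.", "wellbeing"),
    ("Patients waited longer on Mondays.", "wait times"),
    ("The clinic added a second receptionist.", "staffing"),
    ("Longer waits were linked to staff shortages.", "wait times"),
]

# Group segments by code to surface candidate themes for the next step.
by_code = defaultdict(list)
for text, code in segments:
    by_code[code].append(text)

for code, quotes in sorted(by_code.items(), key=lambda kv: -len(kv[1])):
    print(f"{code} ({len(quotes)} segments)")
    for quote in quotes:
        print("  -", quote)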
Identifying patterns and themes
After coding the data, the researcher looks for patterns or themes in the coded data. This involves comparing and contrasting the codes and looking for relationships or patterns among them. The identified patterns and themes should help answer the research question.
Interpreting the data
Once patterns and themes have been identified, the next step is to interpret these findings. This involves explaining what the patterns or themes mean in the context of the research question and the case. This interpretation should be grounded in the data, but it can also involve drawing on theoretical concepts or prior research.
Verification of the data
The last step in the analysis is verification. This involves checking the accuracy and consistency of the analysis process and confirming that the findings are supported by the data. This can involve re-checking the original data, checking the consistency of codes, or seeking feedback from research participants or peers.
Like any research method, case study research has its strengths and limitations. Researchers must be aware of these, as they can influence the design, conduct, and interpretation of the study.
Understanding the strengths and limitations of case study research can also guide researchers in deciding whether this approach is suitable for their research question. This section outlines some of the key strengths and limitations of case study research.
Benefits include the following:
Rich, detailed data: One of the main strengths of case study research is that it can generate rich, detailed data about the case. This can provide a deep understanding of the case and its context, which can be valuable in exploring complex phenomena.
Flexibility: Case study research is flexible in terms of design, data collection, and analysis. A sufficient degree of flexibility allows the researcher to adapt the study according to the case and the emerging findings.
Real-world context: Case study research involves studying the case in its real-world context, which can provide valuable insights into the interplay between the case and its context.
Multiple sources of evidence: Case study research often involves collecting data from multiple sources, which can enhance the robustness and validity of the findings.
On the other hand, researchers should consider the following limitations:
Generalizability: A common criticism of case study research is that its findings might not be generalizable to other cases due to the specificity and uniqueness of each case.
Time and resource intensive: Case study research can be time and resource intensive due to the depth of the investigation and the amount of collected data.
Complexity of analysis: The rich, detailed data generated in case study research can make analyzing the data challenging.
Subjectivity: Given the nature of case study research, there may be a higher degree of subjectivity in interpreting the data, so researchers need to reflect on this and transparently convey to audiences how the research was conducted.
Being aware of these strengths and limitations can help researchers design and conduct case study research effectively and interpret and report the findings appropriately.
Power BI Case Study – CFI Capital Partners
In this case study, you will experience real-world data scenarios, create a professional-looking Power BI report, and practice modern user experience techniques for BI reporting.
Power BI Case Study – CFI Capital Partners Overview
In this case study, you’ll take on the role of a business intelligence analyst in an investment bank. The sales and trading team at CFI Capital Partners needs you to develop a customized Power BI report to help with bespoke market analysis.
You will need to connect to a variety of data sources for information about an investment portfolio, the securities in the portfolio, and the exchange on which the securities are traded. You will be required to transform and model data, create DAX measures, and build report visuals to satisfy the report requirements.
Power BI Case Study – CFI Capital Partners Learning Objectives
Transform data in Power Query and create a data model and DAX measures
Analyze and visualize data by creating report visuals
Build in better user experiences with functionality like Page Drillthrough, Bookmarks, and Conditional Formatting
Who should take this course?
The course takes approximately 3 hours to complete and is 100% online and self-paced. It covers a case study introduction, transforming and modeling data, analyzing and visualizing data, user experience, and a qualified assessment. This course is part of the following programs:
Business Intelligence & Data Analyst (BIDA®) Certification
Skills learned: Data visualization, data warehousing and transformation, data modeling and analysis
Career prep: Business intelligence analyst, data scientist, data visualization specialist
Business Intelligence Analyst Specialization
Skills learned: Data transformation & automation, data visualization, coding, data modeling
Career prep: Data analyst, business intelligence specialist, finance analyst, data scientist
Datasets for Credit Risk Modeling
Three quantities underpin most credit risk modeling projects:
Probability of Default (PD) tells us the likelihood that a borrower will default on the debt (loan or credit card). In simple words, it returns the expected probability that a customer will fail to repay the loan.
Loss Given Default (LGD) is the proportion of the total exposure that is lost when a borrower defaults. It is calculated as (1 − Recovery Rate). For example, someone takes a $200,000 loan from a bank to purchase a flat and pays some installments before stopping. At default, the loan has an outstanding balance of $100,000. The bank takes possession of the flat and sells it for $90,000. The net loss to the bank is $10,000 ($100,000 − $90,000), so the LGD is 10% ($10,000/$100,000).
Exposure at Default (EAD) is the amount that the borrower owes the bank at the time of default. In the LGD example above, the outstanding balance of $100,000 is the EAD.
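These three quantities combine into the expected loss on an exposure, EL = PD × LGD × EAD. A minimal sketch using the flat-purchase example above (the 5% PD is an assumed, purely illustrative value):

def expected_loss(pd_: float, lgd: float, ead: float) -> float:
    # EL = PD x LGD x EAD for a single exposure
    return pd_ * lgd * ead

ead = 100_000        # outstanding balance at default, from the example above
lgd = 10_000 / ead   # $10,000 net loss on a $100,000 exposure -> 0.10
pd_ = 0.05           # assumed probability of default (illustrative only)

print(f"Expected loss: ${expected_loss(pd_, lgd, ead):,.0f}")  # Expected loss: $500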
Datasets for Credit Risk Modeling Projects
UCI Machine Learning Repository
Econometric Analysis, book by William H. Greene
Credit Scoring and Its Applications, book by Lyn C. Thomas
Credit Risk Analytics, book by Harald, Daniel and Bart
Lending Club
PAKDD 2009 Data Mining Competition, organized by NeuroTech Ltd. and Center for Informatics of the Federal University of Pernambuco
The Home Credit data includes: credit bureau variables, which contain details about the borrower's previous credits provided by other banks; previous loans that the applicant had with Home Credit; previous point-of-sale and cash loans that the applicant had with Home Credit; and previous credit cards that the applicant had with Home Credit.
Variable Name | Description
SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse
RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit (excluding real estate and installment debt like car loans) divided by the sum of credit limits
age | Age of borrower in years
NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years
DebtRatio | Monthly debt payments, alimony, and living costs divided by monthly gross income
MonthlyIncome | Monthly income
NumberOfOpenCreditLinesAndLoans | Number of open loans (installment, like a car loan or mortgage) and lines of credit (e.g., credit cards)
NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due
NumberRealEstateLoansOrLines | Number of mortgage and real estate loans, including home equity lines of credit
NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years
NumberOfDependents | Number of dependents in family excluding themselves (spouse, children, etc.)
You can download data and its description from this link
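As a starting point for a project on this dataset, the sketch below fits a simple probability-of-default model. It assumes the Kaggle training file is named cs-training.csv and uses the column names listed above; adjust the path and preprocessing to the copy you download.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("cs-training.csv", index_col=0)
df = df.fillna(df.median())  # crude imputation for MonthlyIncome and NumberOfDependents

X = df.drop(columns="SeriousDlqin2yrs")
y = df["SeriousDlqin2yrs"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pd_scores = model.predict_proba(X_test)[:, 1]  # estimated PD per borrower
print("Test AUC:", roc_auc_score(y_test, pd_scores))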
The dataset about credit card defaults in Taiwan contains several attributes that can be leveraged to test various machine learning algorithms for building a credit scorecard. Note: the Poland dataset contains information about attributes of companies rather than retail customers.
To download the datasets below, visit the link and fill in the required details in the form. Once the form is submitted, you can download the datasets.
The data set HMEQ reports characteristics and delinquency information for 5,960 home equity loans. A home equity loan is a loan where the obligor uses the equity of his or her home as the underlying collateral.
The data set mortgage is in panel form and reports origination and performance observations for 50,000 residential U.S. mortgage borrowers over 60 periods. The periods have been deidentified. As in the real world, loans may originate before the start of the observation period (this is an issue where loans are transferred between banks and investors as in securitization). The loan observations may thus be censored as the loans mature or borrowers refinance. The data set is a randomized selection of mortgage-loan-level data collected from the portfolios underlying U.S. residential mortgage-backed securities (RMBS) securitization portfolios and provided by International Financial Research (www.internationalfinancialresearch.org).
The data set has been kindly provided by a European bank and has been slightly modified and anonymized. It includes 2,545 observations on loans and LGDs.
The ratings data set is an anonymized data set with corporate ratings where the ratings have been numerically encoded (1 = AAA, etc.).
University Libraries purchases Sage Research Methods package
Ohio University Libraries has purchased Sage Research Methods, a platform that includes textbooks, foundation research guidelines, data sets, code books, peer-reviewed case studies and more with updates through 2029.
If members of the OHIO community are looking to explore a new research methodology, hoping to reduce textbook costs or needing a case study for a course, Sage Research Methods can help.
The platform boasts more than 500 downloadable datasets, with code books and instructional materials that provide step-by-step walk-throughs of the data analysis. The platform also includes quantitative data sets that come with software guides to assist in understanding tools like SPSS, R, Stata, and Python.
Ohio University now has access to more than 1,000 book titles (including the quantitative social sciences "little green books") that support a variety of research methodologies and approaches from beginner to expert. The OHIO package also includes peer-reviewed case studies, with accompanying discussion questions and multiple-choice quiz questions, that can be embedded into Canvas courses. Further, the collection includes a Diversifying and Decolonizing Research subcollection that highlights the importance of inclusive research, perspectives from marginalized populations and cultures, and minimizing bias in data analysis.
Highlighted features within Sage Research Methods include:
Ability to browse content, including datasets, by disciplines and/or methodology.
Informational and instructional videos that cover topics like market research, data visualization, ethics and integrity, and Big Data. The videos are easy to embed, too.
Interactive research tools that help with research plan development: Methods Map visualizer, Project Planner for outlining, and Reading Lists.
Permalinks are easy to access by just copying and pasting the URL into Canvas or your syllabus.
Learn more about Sage Research Methods.
University Libraries strives to support the OHIO community in and out of the classroom by supporting varying pedagogic approaches and finding ways to make learning more affordable for our students. Further, the Libraries aims to provide access and discoverability to research materials to support Ohio University’s innovative research enterprise. Purchasing Sage Research Methods supports both initiatives as this resource can be used by all students, faculty and staff at Ohio University for research support and instructors for course materials.
Students, faculty and staff interested in learning more about any of the resources mentioned above are encouraged to reach out to Head of Learning Services and Education Librarian Dr. Chris Guder, Head of Research Services and Health Sciences Librarian Hanna Schmillen, or a subject librarian.
Be sure to explore Sage Research Methods on your own; the platform can be accessed through Ohio University Libraries . In addition, there are training sessions and videos from Sage on its training website.
Implementation of the World Health Organization Minimum Dataset for Emergency Medical Teams to Create Disaster Profiles for the Indonesian SATUSEHAT Platform Using Fast Healthcare Interoperability Resources: Development and Validation Study
Affiliations:
1 Department of Medical Informatics, Tohoku University Graduate School of Medicine, Sendai, Japan.
2 Department of Physiology, Faculty of Medicine, UIN Syarif Hidayatullah Jakarta, Tangerang Selatan, Indonesia.
PMID: 39196270
DOI: 10.2196/59651
Background: The National Disaster Management Agency (Badan Nasional Penanggulangan Bencana) handles disaster management in Indonesia as a health cluster by collecting, storing, and reporting information on the state of survivors and their health from various sources during disasters. Data were collected on paper and transferred to Microsoft Excel spreadsheets. These activities are challenging because there are no standards for data collection. The World Health Organization (WHO) introduced a standard for health data collection during disasters for emergency medical teams (EMTs) in the form of a minimum dataset (MDS). Meanwhile, the Ministry of Health of Indonesia launched the SATUSEHAT platform to integrate all electronic medical records in Indonesia based on Fast Healthcare Interoperability Resources (FHIR).
Objective: This study aims to implement the WHO EMT MDS to create a disaster profile for the SATUSEHAT platform using FHIR.
Methods: We extracted variables from two EMT MDS medical records (the WHO and the Association of Southeast Asian Nations (ASEAN) versions) and the daily reporting form. We then performed a mapping process to match these variables with FHIR resources and analyzed the gaps between the variables and the base resources. Next, we conducted profiling to see whether there were any changes in the selected resources and created extensions to fill the gaps using the Forge application. Subsequently, the profile was implemented using an open-source FHIR server.
Results: The total numbers of variables extracted from the WHO EMT MDS, ASEAN EMT MDS, and daily reporting forms were 30, 32, and 46, with the percentage of variables matching FHIR resources being 100% (30/30), 97% (31/32), and 85% (39/46), respectively. From the 40 resources available in the FHIR ID core, we used 10, 14, and 9 for the WHO EMT MDS, ASEAN EMT MDS, and daily reporting form, respectively. Based on the gap analysis, we found 4 variables in the daily reporting form that were not covered by the resources. Thus, we created extensions to address this gap.
Conclusions: We successfully created a disaster profile that can be used as a disaster case for the SATUSEHAT platform. This profile may standardize health data collection during disasters.
Keywords: EMR; EMT; FHIR; Fast Healthcare Interoperability Resources; Indonesia; MDS; SATUSEHAT; WHO; WHO EMT MDS; World Health Organization; development; disaster; disaster management; disaster profile; electronic medical records; emergency medical team; health data; health data collection; implementation; interoperability; minimum dataset; reporting; resources; validation.
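As a rough, hypothetical illustration of the implementation step, the snippet below posts a minimal FHIR resource to a FHIR server's REST API. The endpoint URL is a placeholder, and the Patient payload is generic; the study's actual disaster profile adds EMT-MDS-specific profiles and extensions that are not shown here.

import json
import urllib.request

FHIR_BASE = "http://localhost:8080/fhir"  # placeholder for an open-source FHIR server

patient = {
    "resourceType": "Patient",
    "name": [{"family": "Example", "given": ["Survivor"]}],
    "gender": "female",
    "birthDate": "1990-01-01",
}

req = urllib.request.Request(
    f"{FHIR_BASE}/Patient",
    data=json.dumps(patient).encode("utf-8"),
    headers={"Content-Type": "application/fhir+json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.headers.get("Location"))  # created resource's location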
Effects of tuning decision trees in random forest regression on predicting porosity of a hydrocarbon reservoir. A case study: Volve oil field, North Sea
First published on 8th August 2024
Machine learning (ML) has emerged as a powerful tool in petroleum engineering for automatically interpreting well logs and characterizing reservoir properties such as porosity. As a result, researchers are trying to enhance the performance of ML models further to widen their applicability in the real world. Random forest regression (RFR) is one such widely used ML technique that was developed by combining multiple decision trees. To improve its performance, one of its hyperparameters, the number of trees in the forest (n_estimators), is tuned during model optimization. However, the existing literature lacks in-depth studies on the influence of n_estimators on the RFR model when used for predicting porosity, given that n_estimators is one of the most influential hyperparameters that can be tuned to optimize the RFR algorithm. In this study, the effects of n_estimators on the RFR model in porosity prediction were investigated. Furthermore, the interactions of n_estimators with two other key hyperparameters, namely the number of features considered for the best split (max_features) and the minimum number of samples required to be at a leaf node (min_samples_leaf), were explored. The RFR models were developed using 4 input features, namely, resistivity log (RES), neutron porosity log (NPHI), gamma ray log (GR), and the corresponding depths obtained from the Volve oil field in the North Sea, and calculated porosity was used as the target data. The methodology consisted of 4 approaches. In the first approach, only n_estimators was changed; in the second approach, n_estimators was changed along with max_features; in the third approach, n_estimators was changed along with min_samples_leaf; and in the final approach, all three hyperparameters were tuned. Altogether 24 RFR models were developed, and the models were evaluated using the adjusted R² (adj. R²), root mean squared error (RMSE), and their computational times. The obtained results showed that the highest performance, with an adj. R² value of 0.8505, was achieved when n_estimators was 81, max_features was 2, and min_samples_leaf was 1. In approach 2, when the n_estimators upper limit was increased from 10 to 100, there was a test model performance growth of more than 1.60%, whereas increasing the n_estimators upper limit from 100 to 1000 showed a performance drop of around 0.4%. Models developed by tuning n_estimators from 1 to 100 in intervals of 10 had healthy test model adj. R² values and lower computational times, making this the best n_estimators range and interval when both performance and computational time were taken into consideration to predict the porosity of the Volve oil field in the North Sea. Thus, it was concluded that by tuning only n_estimators and max_features, the performance of RFR models can be increased significantly.
1. Introduction
ML application in reservoir characterization has significantly increased over the last couple of decades due to its ability to tackle regression and classification-type problems. 5–7 With the evolution of ML, a notable number of algorithms have been introduced. The artificial neural network (ANN), which uses a parallel processing approach and was developed based on the function of a neuron of a human brain, has been utilized in petrophysical parameter prediction. 8,9 Support vector regression (SVR) is another algorithm developed in the initial stages of the ML timeline, and it can handle non-linear relationships between a set of inputs and an output. Moreover, SVR has been utilized widely in reservoir characterization. 10–13 The least absolute shrinkage and selection operator (LASSO) regression and Bayesian model averaging (BMA) have also been extensively used in ML-related studies in the literature. 14 BMA uses Bayes theorem and LASSO uses residual sums of squares to build a linear relationship between the inputs and the output. BMA and LASSO regressions have been used in permeability modelling in recent studies. 5 Apart from petrophysical parameter predictions, ML models have also been used in lithofacies classification. 15 Generally, these studies utilized ML approaches to model lithofacies sequences as a function of well-logging data to predict discrete lithofacies distribution at missing intervals. 16–18 Besides permeability prediction, water saturation estimation, and lithofacies classification, ML models have been used in reservoir porosity estimation, which is the parameter of focus in this study. ML algorithms, such as ANN, deep learning, and SVR, have been used to predict porosity using logging data, seismic attributes, and drilling parameters. 19–21
Apart from the mentioned ML models, the ML approach known as ensemble learning has been applied in many recent studies. Here, ML base models (weaker models) are strategically combined to produce a high-performing and efficient model, as shown in Fig. 1. Ensemble ML models have become a popular tool among researchers to predict petrophysical properties due to their ability to reduce overfitting and underfitting. 22–26 RFR is one such popular ensemble ML model that was developed by amalgamating multiple decision trees. 27
Fig. 1: Representation of the ensemble model.
Hyperparameter tuning is a process that is implemented to fine-tune ML algorithms to obtain optimal models. 28–30 Several hyperparameters can be controlled in an RFR model, such as n_estimators, max_features, min_samples_leaf, the maximum depth of the tree (max_depth), the fraction of the original dataset assigned to any individual tree (max_samples), the minimum number of samples required to split an internal node (min_samples_split), and the maximum leaf nodes to restrict the growth of the tree (max_leaf_nodes).
Hyperparameter optimization has been utilized in recent studies related to reservoir characterization. Wang et al. developed an RFR model to predict permeability in the Xishan Coalfield, China. 24 Five hyperparameters, n_estimators, max_features, max_depth, min_samples_leaf and min_samples_split, were tuned during hyperparameter optimization. Zou et al. estimated reservoir porosity using a random forest algorithm. 31 During the hyperparameter optimization stage, n_estimators, max_features, min_samples_leaf, min_samples_split and max_depth were tuned. Rezaee and Ekundayo tuned n_estimators, min_samples_leaf, min_samples_split, and max_depth during the development of the RFR model used to predict the permeability of precipice sandstone in the Surat Basin, Australia. 32
Even though hyperparameters have been tuned during the hyperparameter optimization phase of ensemble ML model development, the literature lacks studies that specifically focus on the effects of hyperparameter tuning in ensemble learning when predicting petrophysical properties in reservoir characterization. Addressing this research gap, in this study, the authors investigated the influence of one of the most utilized hyperparameters in the literature, namely, the n_estimators of RFR, when predicting the porosity of a hydrocarbon reservoir. Also, the effects of n_estimators were studied along with another two widely used hyperparameters, max_features and min_samples_leaf, when predicting the porosity of the Volve oil field in the North Sea. The study considered a supervised learning regression approach. The workflow of the study consisted of data preprocessing, RFR model development, and model analysis. Several RFR models were developed, including tuning n_estimators, tuning n_estimators along with max_features, tuning n_estimators along with min_samples_leaf, and tuning all three hyperparameters at once, under four approaches, by integrating grid search optimization and K-fold cross-validation. The models' performances were evaluated based on the adjusted coefficient of determination (adj. R²), root mean squared error (RMSE), and computational time. Only the three aforementioned hyperparameters were considered due to processing capacity limitations; however, this study is expected to be a solid initiation towards the development of future studies on the effects of hyperparameters in ML algorithms in reservoir characterization.
2. Methodology
2.1 Geological setting and dataset
Fig. 2: Study area – Volve oil field's location in the North Sea. Adapted from Mapchart.
The Hugin Formation is 153 m thick and oil-bearing and was penetrated at 3796.5 m, approximately 60 m deeper than expected. The total oil column in the well was 80 m, but no clear oil–water contact was observed. 38,40 The reservoir section was made up of highly variable fine to coarse-grained, well to poorly-sorted subarkosic arenite sandstones with good to excellent reservoir properties. The Hugin Formation of the area consists of a shallow marine shoreface, coastal plain/lagoonal, channel, and possibly mouth bar deposits. The underlying Skagerrak Formation was completely tight due to extensive kaolinite and dolomite cementation. The current study used data from well 15/9-19A. The well was drilled through the Skagerrak Formation and terminated approximately 30 m into the Triassic Smith Bank Formation. To fully utilize the available data, the study considered data from the 3666.59 to 3907.08 m depth interval. This depth interval ran through three formations, namely, Draupne, Heather, and Hugin. The stratigraphic column and description of the vertical facies distribution of the section are shown in Fig. 3.
Fig. 3: Stratigraphic column and facies description of the considered subsurface section. Adapted from Statoil.
PHIF = PHID + A × (NPHI − PHID) + B    (1)
PHID = (ρma − ρb)/(ρma − ρfl)    (2)
In eqn (1), PHIF is the total porosity, PHID is the porosity from the density log, NPHI is the neutron porosity log, and A and B are regression coefficients. In eqn (2), ρma is the matrix density, ρb is the measured bulk density, and ρfl is the pore fluid density.
2.2 Data preprocessing
Feature scaling is also a common practice implemented during data preprocessing. There are two widely used feature scaling approaches in the literature, namely, normalization and standardization. However, in this study, feature scaling was omitted, since RFR is a tree-based ML model whose splits do not change under any monotonic transformation. 52
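A quick check of that claim, with synthetic features standing in for the well logs: fitting the same forest on raw and min-max-scaled inputs should yield identical predictions, because the scaling is a monotonic transformation of each feature.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(300, 4))  # stand-ins for RES, NPHI, GR, and depth
y = 0.2 * X[:, 1] - 0.05 * X[:, 2] + rng.normal(0, 1, 300)

# Min-max scaling: a strictly monotonic transformation of each feature.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

rf_raw = RandomForestRegressor(random_state=42).fit(X, y)
rf_scaled = RandomForestRegressor(random_state=42).fit(X_scaled, y)

# Same seed, same split ordering -> predictions should match (prints True).
print(np.allclose(rf_raw.predict(X), rf_scaled.predict(X_scaled)))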
2.3 Machine learning model development
Fig. 4: Random forest architecture (left) and the base model architecture (right).
In an RFR, each tree is grown using an independent and identically distributed random vector θ_k, and the forest prediction h̄(x) is the average of the K individual tree predictions h(x; θ_k):
h̄(x) = (1/K) Σ_k h(x; θ_k)    (3)
As the number of trees in the forest increases, the mean squared generalization error of the forest converges almost surely:
E_{X,Y}(Y − h̄(X))² → E_{X,Y}(Y − E_θ h(X; θ))²    (4)
The average generalization error of an individual tree is given by
PE*(tree) = E_θ E_{X,Y}(Y − h(X; θ))²    (5)
and the generalization error of the forest is bounded by
PE*(forest) ≤ ρ̄ × PE*(tree)    (6)
where ρ̄ is the mean correlation between the residuals of pairs of trees in the forest.
The inequality shown by eqn (6) highlights what is required for accurate RFR, which is having a low correlation between residuals of differing tree members of the forest and low prediction error for the individual trees. The model's performance can be further enhanced by tuning its hyperparameters.
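A rough illustrative look at eqn (6) with scikit-learn (synthetic data, not the Volve logs): fit a small forest, then estimate the mean correlation between the residuals of its individual trees.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = X[:, 0] ** 2 + X[:, 1] - X[:, 2] + rng.normal(0, 0.5, 500)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Residuals of each individual tree (rows) over the sample (columns).
residuals = np.stack([y - tree.predict(X) for tree in rf.estimators_])

# Average off-diagonal correlation between pairs of trees' residuals.
corr = np.corrcoef(residuals)
k = len(corr)
mean_corr = (corr.sum() - k) / (k * k - k)
print(f"Mean inter-tree residual correlation: {mean_corr:.3f}")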
During the study, RFR models were developed using the Python programming language. The cleaned dataset obtained during the data preprocessing stage was loaded into Python, then split into training and testing sets. The Python-based scikit-learn library's RandomForestRegressor was used to develop the RFR algorithm. The RandomForestRegressor comes with default hyperparameters built into it. The default values assigned to some of the main hyperparameters of RFR in scikit-learn are given in Table 1.
Table 1: Default values of the main RFR hyperparameters in scikit-learn.
Hyperparameter | Default value
n_estimators | 100
max_features | 1.0
min_samples_leaf | 1
max_depth | None
max_samples | None
min_samples_split | 2
max_leaf_nodes | None
However, rather than using the default hyperparameters assigned by the scikit-learn library, to achieve the primary objectives of the study, hyperparameter optimization was implemented. Hyperparameter optimization is a commonly used practice to build robust ML models. 56,57 The hyperparameters of RFR were tuned using the grid search optimization (GSO) approach. For this, the GridSearchCV optimization algorithm in the scikit-learn library was used. GSO was considered since it runs through all the possible combinations in the hyperparameter space, thus selecting the best combination of the space. 57,58 The hyperparameter space was predefined by including the possible values, and it was fed into the GSO algorithm.
GSO was implemented along with random subsampling cross-validation, using an approach known as K-fold cross-validation. During K-fold cross-validation, the training dataset is divided into K same-sized portions (folds); K − 1 of the portions are used for training and the remainder is used for validation. 59,60 This is repeated until each fold gets the chance to be the validation set. For this study, a 5-fold cross-validation was implemented, as shown in Fig. 5. Therefore, the training set was divided into five portions, and during each split, four portions were used for training and one portion was used for validation.
Fig. 5: Demonstration of the K-fold cross-validation.
Tuning was done under 4 approaches, as shown in Fig. 6, to investigate the effects of the considered hyperparameters. In the first approach, n_estimators was changed from 1 to 10, 1 to 100, and 1 to 1000 in different intervals. The notation used to demonstrate the n_estimators change is shown in Table 2.
Fig. 6: Workflow of the methodology.
Table 2: n_estimators change notation.
Starting value | Ending value | Increment
1 | 10 | 1
1 | 100 | 1
1 | 100 | 10
1 | 1000 | 1
1 | 1000 | 10
1 | 1000 | 100
In the second approach, n_estimators was changed from 1 to 1000 in the same way as in approach 1, along with max_features. Here, max_features was changed from 10% to 100% of the total features in increments of 10%. In the third approach, n_estimators was changed in the same way along with min_samples_leaf. In this case, min_samples_leaf was changed from 1 to 20 in intervals of 1. In the fourth approach, all 3 hyperparameters, i.e., n_estimators, max_features and min_samples_leaf, were varied at the same time in the above-mentioned intervals. In each approach, the values of all the other hyperparameters of RFR were kept at the default values assigned by the scikit-learn library. The link to the GitHub folder with the developed codes is given in the appendix.
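A minimal sketch of this tuning workflow, combining scikit-learn's GridSearchCV with 5-fold cross-validation over the three hyperparameters (as in approach 4). The synthetic features are stand-ins for the Volve well-log inputs, and the grid mirrors the ranges above; shrink it for a quick run, since the full grid is large.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))  # stand-ins for RES, NPHI, GR, and depth
y = X @ np.array([0.5, 1.0, -0.8, 0.3]) + rng.normal(0, 0.5, 1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_grid = {
    "n_estimators": list(range(1, 101, 10)),         # 1 to 100 in intervals of 10
    "max_features": [f / 10 for f in range(1, 11)],  # 10% to 100% of the features
    "min_samples_leaf": list(range(1, 21)),          # 1 to 20 in intervals of 1
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,            # 5-fold cross-validation, as in the study
    scoring="r2",
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)
print("Test R2:", search.score(X_test, y_test))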
2.4 Results analysis
R² = 1 − [Σ(yi − ŷ)²] / [Σ(yi − ȳ)²]    (7)
adj. R² = 1 − (1 − R²) × (n − 1)/(n − m − 1)    (8)
In eqn (7) and (8), yi is the actual value, ŷ is the predicted value, ȳ is the mean value of the distribution, n is the number of data points and m is the number of input features.
RMSE = √[Σ(yi − ŷ)²/n]    (9)
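For reference, eqn (7)-(9) are straightforward to compute directly; a minimal sketch with made-up porosity values:

import numpy as np

def r2(y, y_hat):
    # eqn (7): coefficient of determination
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def adj_r2(y, y_hat, m):
    # eqn (8): penalizes R2 for the number of input features m
    n = len(y)
    return 1 - (1 - r2(y, y_hat)) * (n - 1) / (n - m - 1)

def rmse(y, y_hat):
    # eqn (9): root mean squared error
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

y_true = [10.2, 12.1, 9.8, 14.0, 11.5, 13.3]   # illustrative porosities (%)
y_pred = [10.0, 12.5, 10.1, 13.6, 11.9, 13.0]
print(adj_r2(y_true, y_pred, m=4), rmse(y_true, y_pred))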
3. Results and discussion
Table 3: Adj. R² values and computational times of the models developed in approach 1 (tuning n_estimators only).
Model no. | n_estimators change | Best n_estimators | Adj. R² (training) | Adj. R² (validation) | Adj. R² (testing) | Computational time (s)
M11 | 1 to 10, interval 1 | 8 | 0.9650 | 0.8188 | 0.8024 | 0.81
M12 | 1 to 100, interval 1 | 51 | 0.9760 | 0.8367 | 0.8202 | 70.25
M13 | 1 to 100, interval 10 | 51 | 0.9760 | 0.8367 | 0.8202 | 6.88
M14 | 1 to 1000, interval 1 | 51 | 0.9760 | 0.8367 | 0.8202 | 6932.55
M15 | 1 to 1000, interval 10 | 51 | 0.9760 | 0.8367 | 0.8202 | 707.56
M16 | 1 to 1000, interval 100 | 801 | 0.9799 | 0.8352 | 0.8218 | 65.73
Fig. 7: Adjusted coefficient of determination values of each approach for different changes in n_estimators.
Interestingly, when the upper limit of the n_estimators range was pushed beyond 100, the models showed no noticeable increase in the training, validation, or testing adj. R² values. When n_estimators was changed from 1 to 100 in intervals of 1 and 10 (models M12 and M13) and from 1 to 1000 in intervals of 1 and 10 (models M14 and M15), the models showed the same performance, i.e., a training score of 0.9760, a validation score of 0.8367, and a testing score of 0.8202. However, when n_estimators was changed from 1 to 1000 in intervals of 100, the training and testing scores of the M16 model showed a slight increase in performance, yielding adj. R² values of 0.9799 and 0.8218, respectively, while the validation score showed a slight, negligible decrease.
The highest computational time of 6932.55 seconds was shown by model M14, where n_estimators was changed from 1 to 1000 in increments of 1. The results from approach 1 showed that after a certain n_estimators value the models' performance increased sharply and then remained constant over a wide n_estimators range, indicating that tuning n_estimators is only efficient within a certain range. Since the range and interval over which n_estimators is tuned affect the computational time, an effective range and interval for n_estimators should be chosen with computational time taken into account.
In approach 2, max_features was also tuned along with n_estimators. The results obtained using approach 2 of the methodology are tabulated in Table 4. As observed in approach 1, clear spikes in the training, validation, and testing adj. R² values were observed when the upper limit of n_estimators was increased from 10 to 100. The training score had an increase of 1.36%, the validation score had an increase of 1.92%, and the test score had an increase of 1.60%. This clear jump in performance is noticeable in Fig. 7. Interestingly, the performances of the models developed in approach 2 were significantly higher than those of the corresponding n_estimators changes in approach 1, as is visible in Fig. 8. Further, going from approach 1 to 2, the average validation score increased by 2.24% and the testing score increased by 3.52%, which was significant. This increase in adj. R² values is an indication that tuning max_features has a major impact on predicting porosity using RFR. Model M21, where n_estimators was changed from 1 to 10 in intervals of 1 and max_features was changed from 0.1 to 1 in intervals of 0.1, showed the lowest performance, with a training score of 0.9672, a validation score of 0.8381, and a testing score of 0.8366. On the other hand, model M23 showed the highest testing performance, with an adj. R² of 0.8505, where n_estimators was changed from 1 to 100 in intervals of 10 and max_features was changed from 0.1 to 1 in intervals of 0.1. Model M23 yielded its best test model when n_estimators was 81 and max_features was 0.5. It should be noted that even though model M23 had the highest testing score, its training and validation scores were not the best of all the models developed in approach 2. The highest training score of 0.9823 was shown by models M24, M25, and M26, and the highest validation scores were shown by models M24 and M25. However, it is more meaningful to select model M23 as the best-performing model, since the testing set represents an independent dataset that had never been seen by the model before.
Table 4: Adj. R² values and computational times of the models developed in approach 2 (tuning n_estimators and max_features).
Model no. | n_estimators change | Best n_estimators | Best max_features | Adj. R² (training) | Adj. R² (validation) | Adj. R² (testing) | Computational time (s)
M21 | 1 to 10, interval 1 | 9 | 0.1 | 0.9672 | 0.8381 | 0.8366 | 3.69
M22 | 1 to 100, interval 1 | 79 | 0.5 | 0.9804 | 0.8542 | 0.8500 | 326.56
M23 | 1 to 100, interval 10 | 81 | 0.5 | 0.9806 | 0.8541 | 0.8505 | 30.20
M24 | 1 to 1000, interval 1 | 520 | 0.5 | 0.9823 | 0.8556 | 0.8467 | 32
M25 | 1 to 1000, interval 10 | 521 | 0.5 | 0.9823 | 0.8556 | 0.8467 | 3045.27
M26 | 1 to 1000, interval 100 | 801 | 0.5 | 0.9823 | 0.8554 | 0.8471 | 284.29
Fig. 8: Adjusted coefficient of determination values for each change in n_estimators for different approaches.
The anomaly in the validation score observed when n_estimators was changed from 1 to 1000 in intervals of 100 in approach 1 was also observable in approach 2. The difference between the train and test scores provides an idea about the generalizability of the model: the smaller the train–test difference, the higher the generalizability. Overall, the train–test difference in approach 2 was noticeably smaller than that of approach 1; the average train–test difference decreased by 15.51% on going from approach 1 to 2. This showed that the generalizability of the models improved when max_features was introduced into the hyperparameter space. Similar to approach 1, the highest runtime occurred when n_estimators was changed from 1 to 1000 in increments of 1.
In approach 3, n_estimators was investigated with the alteration of min_samples_leaf, and the results obtained are tabulated in Table 5. Notably, all the performance results obtained for the RFR models, except the runtimes, were the same as those of approach 1, as seen in Fig. 7 and 8. This was because the optimum value of min_samples_leaf selected by the grid search optimization was the same as the default value assigned by the scikit-learn library for the RFR algorithm; hence, the best testing adj. R² was again shown by the model where n_estimators was changed from 1 to 1000 in intervals of 100. Computational times were longer than those obtained in approach 1, since the models developed in approach 3 had a larger hyperparameter space compared to approach 1.
Table 5: Adj. R² values and computational times of the models developed in approach 3 (tuning n_estimators and min_samples_leaf).
Model no. | n_estimators change | Best n_estimators | Best min_samples_leaf | Adj. R² (training) | Adj. R² (validation) | Adj. R² (testing) | Computational time (s)
M31 | 1 to 10, interval 1 | 8 | 1 | 0.9650 | 0.8188 | 0.8024 | 7.79
M32 | 1 to 100, interval 1 | 51 | 1 | 0.9760 | 0.8367 | 0.8202 | 674.81
M33 | 1 to 100, interval 10 | 51 | 1 | 0.9760 | 0.8367 | 0.8202 | 64.96
M34 | 1 to 1000, interval 1 | 51 | 1 | 0.9760 | 0.8367 | 0.8202 | 70
M35 | 1 to 1000, interval 10 | 51 | 1 | 0.9760 | 0.8367 | 0.8202 | 6525.18
M36 | 1 to 1000, interval 100 | 801 | 1 | 0.9799 | 0.8352 | 0.8218 | 606.28
Table 6: Adj. R² values and computational times of the models developed in approach 4 (tuning n_estimators, max_features, and min_samples_leaf).
Model no. | n_estimators change | Best n_estimators | Best max_features | Best min_samples_leaf | Adj. R² (training) | Adj. R² (validation) | Adj. R² (testing) | Computational time (s)
M41 | 1 to 10, interval 1 | 9 | 0.1 | 1 | 0.9672 | 0.8381 | 0.8366 | 56.22
M42 | 1 to 100, interval 1 | 79 | 0.5 | 1 | 0.9804 | 0.8542 | 0.8500 | 4242.86
M43 | 1 to 100, interval 10 | 81 | 0.5 | 1 | 0.9806 | 0.8541 | 0.8505 | 425.65
M44 | 1 to 1000, interval 1 | 520 | 0.5 | 1 | 0.9823 | 0.8556 | 0.8467 | 82
M45 | 1 to 1000, interval 10 | 521 | 0.5 | 1 | 0.9823 | 0.8556 | 0.8467 | 51
M46 | 1 to 1000, interval 100 | 801 | 0.5 | 1 | 0.9823 | 0.8554 | 0.8471 | 3796.99
Table 7 shows the RMSE values of approaches 1, 2, 3, and 4. While the adj. R² values give an idea about the correlation between the actual porosities and the predicted porosities, the RMSE values provide an idea about the difference (or the error) between the two. Therefore, RMSE is also an important parameter in ML model performance evaluation. The pattern in which the RMSE values fluctuated across the 4 approaches was similar to that of the adj. R² values. In approach 1, the smallest RMSEs were shown by model M16, with a training model RMSE of 0.9988 and a testing model RMSE of 2.8312. The improvement in the results when max_features was introduced into the hyperparameter space was also evident from the RMSE values obtained in approach 2: there was a clear decrease in the RMSE values of both the training and testing models in approaches 2 and 4, where max_features was tuned.
Table 7: Training and testing RMSE values of the models developed in the four approaches (training / testing).
Approach 1 | Approach 2 | Approach 3 | Approach 4
M11: 1.2894 / 2.9967 | M21: 1.2516 / 2.7218 | M31: 1.2894 / 2.9967 | M41: 1.2516 / 2.7218
M12: 1.0817 / 2.8499 | M22: 0.9835 / 2.5917 | M32: 1.0817 / 2.8499 | M42: 0.9835 / 2.5917
M13: 1.0817 / 2.8499 | M23: 0.9798 / 2.5875 | M33: 1.0817 / 2.8499 | M43: 0.9798 / 2.5880
M14: 1.0817 / 2.8499 | M24: 0.9399 / 2.6190 | M34: 1.0817 / 2.8499 | M44: 0.9399 / 2.6190
M15: 1.0817 / 2.8499 | M25: 0.9396 / 2.6187 | M35: 1.0817 / 2.8499 | M45: 0.9396 / 2.6187
M16: 0.9988 / 2.8312 | M26: 0.9396 / 2.6148 | M36: 0.9988 / 2.8312 | M46: 0.9396 / 2.6148
Runtime and the number of grid search combinations had a positive relationship, i.e., when the number of combinations in the grid search space was the largest, the runtime of the model was the highest, and vice versa. Further, it was observed that from approach 1 to approach 3, the increases in computational times were roughly proportional to each other, as seen in Fig. 9. However, in approach 4, where n_estimators was changed along with the tuning of max_features and min_samples_leaf, an anomaly was observed when n_estimators was changed from 1 to 1000 in intervals of 10.
Fig. 9: Runtimes of the models of each n_estimators change for different approaches.
Even though the primary objective of the study was to investigate the influences of n_estimators along with max_features and min_samples_leaf on the performance of RFR, having an overall picture of the variation of the actual and predicted porosity and their relationship is important to understand the model's applicability in porosity prediction. To achieve this, depth-porosity graphs and correlation plots were plotted. Fig. 10 shows one such depth-porosity graph and a correlation plot developed for the best-performing RFR test model (model M23) of the study. The depth-porosity plot indicated that most of the time, the predicted porosity followed the pattern of the actual porosity. The correlation plot showed that the majority of the points were scattered around the perfect correlation line, which is an indication of a high correlation between the actual values and the predicted values.
Fig. 10: Depth-porosity and correlation plots obtained from the predictions of the best-performing RFR testing model.
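A minimal sketch of how such diagnostic plots can be produced with matplotlib; the depth, actual, and predicted arrays below are synthetic placeholders for the test-set values.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
depth = np.linspace(3666.59, 3907.08, 100)  # m, the studied interval
actual = 15 + 5 * np.sin(depth / 20) + rng.normal(0, 1, 100)
predicted = actual + rng.normal(0, 1.5, 100)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Depth-porosity plot: depth on the y-axis, increasing downwards.
ax1.plot(actual, depth, label="Actual")
ax1.plot(predicted, depth, label="Predicted")
ax1.invert_yaxis()
ax1.set_xlabel("Porosity (%)")
ax1.set_ylabel("Depth (m)")
ax1.legend()

# Correlation (parity) plot with the perfect-correlation line.
ax2.scatter(actual, predicted, s=10)
lims = [actual.min(), actual.max()]
ax2.plot(lims, lims, "k--", label="Perfect correlation")
ax2.set_xlabel("Actual porosity (%)")
ax2.set_ylabel("Predicted porosity (%)")
ax2.legend()

plt.tight_layout()
plt.show()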
4. Conclusions
• Overall, based on both performance and computational time, the RFR model with n_estimators at 81 and max_features at 2 (while keeping all the other hyperparameters at their default values), which was developed in approach 2, produced the most effective model for predicting the porosity of the Volve oil field in the North Sea, with a testing model adj. R² of 0.8505, a testing model RMSE of 2.5875, and a computational time of 30.2 seconds.
• There was a notable increase in performance when the upper limit of n_estimators was increased from 10 to 100. On the other hand, the performance of the models did not increase significantly when the upper limit of n_estimators was increased from 100 to 1000. This indicated that identifying an effective n_estimators range, one that is neither too low (which significantly lowers performance) nor too high (which increases the computational time), is important to produce an efficient RFR model for porosity prediction.
• A range of 1 to 100 changed in intervals of 10 can be suggested for n_estimators when developing an RFR model to predict the porosity of the Volve oil field, since these models showed higher performance and lower computational times in all four approaches. When the n_estimators range of 1 to 100 was changed in intervals of 10, it always yielded a high adj. R² value (in approaches 2 and 4, it yielded the highest testing model adj. R² value) and had the second-lowest computational time.
• When n_estimators was tuned along with max_features in approach 2, the results improved drastically compared to approach 1, where only n_estimators was tuned. There was an average validation score increase of 2.24% and a testing score increase of 3.52% on going from approach 1 to 2. This improvement in the scores (adj. R²) showed that max_features has a significant influence on the RFR model's performance.
• It was observed that computational time was largely affected by the number of hyperparameters altered, their range, and their interval. Of all the approaches, the longest computational time occurred when n_estimators was tuned from 1 to 1000 in intervals of 1 along with max_features and min_samples_leaf.
Based on these results, by adjusting only n_estimators and max_features, an RFR model with robust predictive power can be developed to estimate the porosity of the Volve oil field.
Recommendations
Abbreviations
AI: Artificial intelligence
ML: Machine learning
RFR: Random forest regression
ANN: Artificial neural network
SVR: Support vector regression
LASSO: Least absolute shrinkage and selection operator
BMA: Bayesian model averaging
GSO: Grid search optimization
RMSE: Root mean squared error
R²: Coefficient of determination
adj. R²: Adjusted coefficient of determination
RES: Resistivity log
NPHI: Neutron porosity log
GR: Gamma ray log
PHIF: Total porosity
PHID: Porosity from density log
n_estimators: Number of trees in the forest
max_features: Number of features considered for the best split
min_samples_leaf: Minimum number of samples required to be at a leaf node
max_depth: Maximum depth of the tree
max_samples: Fraction of the original dataset assigned to any individual tree
min_samples_split: Minimum number of samples required to split an internal node
max_leaf_nodes: Maximum leaf nodes to restrict the growth of the tree
A: A regression coefficient
B: A regression coefficient
ρma: Matrix density
ρb: Measured bulk density
ρfl: Pore fluid density
n: Number of data points
m: Number of input features
X: Independent and identically distributed random vector
θ: Independent and identically distributed random vector
x: Observed input vector associated with vector X
Y: A vector with numerical outcomes
yi: Actual value
ŷ: Predicted value
ȳ: Mean value of the distribution
Author contributions
Data availability
Conflicts of interest
Acknowledgements
C. Kavuri and S. L. Kokjohn, Exploring the potential of machine learning in reducing the computational time/expense and improving the reliability of engine optimization studies, Int. J. Engine Res. , 2020, 21 (7), 1251–1270 Search PubMed .
N. Zhan and J. R. Kitchin, Uncertainty quantification in machine learning and nonlinear least squares regression models, AIChE J. , 2022, 68 (6), e17516 Search PubMed .
X. Zhang, Y. Tian, L. Chen, X. Hu and Z. Zhou, Machine learning: a new paradigm in computational electrocatalysis, J. Phys. Chem. Lett. , 2022, 13 (34), 7920–7930 Search PubMed .
A. M. Turing, Computing machinery and intelligence , Springer, Netherlands, 2009 Search PubMed .
W. J. Al-Mudhafar, Bayesian and LASSO regressions for comparative permeability modeling of sandstone reservoirs, Nat. Resour. Res. , 2019, 28 (1), 47–62 Search PubMed .
C. Ojukwu, K. Smith, N. Kadkhodayan, M. Leung and K. Baldwin, Reservoir Characterization, Machine Learning and Big Data–An Offshore California Case Study. InSPE Nigeria Annual International Conference and Exhibition 2020 Aug 11 (p. D013S002R005). SPE.
A. A. Silva, M. W. Tavares, A. Carrasquilla, R. Misságia and M. Ceia, Petrofacies classification using machine learning algorithms, Geophysics, 2020, 85(4), WA101–WA113.
M. Amiri, J. Ghiasi-Freez, B. Golkar and A. Hatampour, Improving water saturation estimation in a tight shaly sandstone reservoir using artificial neural network optimized by imperialist competitive algorithm – a case study, J. Pet. Sci. Eng., 2015, 127, 347–358.
S. Elkatatny, M. Mahmoud, Z. Tariq and A. Abdulraheem, New insights into the prediction of heterogeneous carbonate reservoir permeability from well logs using artificial intelligence network, Neural Comput. Appl., 2018, 30, 2673–2683.
K. O. Akande, T. O. Owolabi, S. O. Olatunji and A. AbdulRaheem, A hybrid particle swarm optimization and support vector regression model for modelling permeability prediction of hydrocarbon reservoir, J. Pet. Sci. Eng., 2017, 150, 43–53.
S. Baziar, H. B. Shahripour, M. Tadayoni and M. Nabi-Bidhendi, Prediction of water saturation in a tight gas sandstone reservoir by using four intelligent methods: a comparative study, Neural Comput. Appl., 2018, 30, 1171–1185.
F. Anifowose, A. Abdulraheem and A. Al-Shuhail, A parametric study of machine learning techniques in petroleum reservoir permeability prediction by integrating seismic attributes and wireline data, J. Pet. Sci. Eng., 2019, 176, 762–774.
M. Z. Kamali, S. Davoodi, H. Ghorbani, D. A. Wood, N. Mohamadian, S. Lajmorak, V. S. Rukavishnikov, F. Taherizade and S. S. Band, Permeability prediction of heterogeneous carbonate gas condensate reservoirs applying group method of data handling, Mar. Pet. Geol., 2022, 139, 105597.
W. Al-Mudhafar, Integrating Bayesian model averaging for uncertainty reduction in permeability modeling, in Offshore Technology Conference, OTC, 2015, OTC-25646.
G. Wang, Y. Ju, C. Li, T. R. Carr and G. Cheng, Application of artificial intelligence on black shale lithofacies prediction in Marcellus Shale, Appalachian Basin, in Unconventional Resources Technology Conference, Denver, Colorado, 25–27 August 2014, Society of Exploration Geophysicists, American Association of Petroleum Geologists, Society of Petroleum Engineers, 2014, pp. 1970–1980.
W. J. Al-Mudhafar, Integrating well log interpretations for lithofacies classification and permeability modeling through advanced machine learning algorithms, J. Pet. Explor. Prod. Technol., 2017, 7(4), 1023–1033.
W. J. Al-Mudhafar, Integrating lithofacies and well logging data into smooth generalized additive model for improved permeability estimation: Zubair formation, South Rumaila oil field, Mar. Geophys. Res., 2019, 40, 315–332.
J. Kim, Lithofacies classification integrating conventional approaches and machine learning technique, J. Nat. Gas Sci. Eng., 2022, 100, 104500.
S. R. Na’imi, S. R. Shadizadeh, M. A. Riahi and M. Mirzakhanian, Estimation of reservoir porosity and water saturation based on seismic attributes using support vector regression approach, J. Appl. Geophys., 2014, 107, 93–101.
A. Al-AbdulJabbar, K. Al-Azani and S. Elkatatny, Estimation of reservoir porosity from drilling parameters using artificial neural networks, Petrophysics, 2020, 61(3), 318–330.
W. Chen, L. Yang, B. Zha, M. Zhang and Y. Chen, Deep learning reservoir porosity prediction based on multilayer long short-term memory network, Geophysics, 2020, 85(4), WA213–WA225.
F. A. Anifowose, Ensemble machine learning: the latest development in computational intelligence for petroleum reservoir characterization, in SPE Kingdom of Saudi Arabia Annual Technical Symposium and Exhibition, SPE, 2013, SPE-168111.
A. Subasi, M. F. El-Amin, T. Darwich and M. Dossary, Permeability prediction of petroleum reservoirs using stochastic gradient boosting regression, J. Ambient Intell. Humaniz. Comput., 2022, 1.
J. Wang, W. Yan, Z. Wan, Y. Wang, J. Lv and A. Zhou, Prediction of permeability using random forest and genetic algorithm model, Comput. Model. Eng. Sci., 2020, 125(3), 1135–1157.
D. A. Otchere, T. O. Ganat, R. Gholami and M. Lawal, A novel custom ensemble learning model for an improved reservoir permeability and water saturation prediction, J. Nat. Gas Sci. Eng., 2021, 91, 103962.
Z. Zhang and Z. Cai, Permeability prediction of carbonate rocks based on digital image analysis and rock typing using random forest algorithm, Energy Fuels, 2021, 35(14), 11271–11284.
T. H. Lee, A. Ullah and R. Wang, Bootstrap aggregating and random forest, in Macroeconomic Forecasting in the Era of Big Data: Theory and Practice, 2020, pp. 389–429.
M. M. Maher and S. Sakr, SmartML: a meta learning-based framework for automated selection and hyperparameter tuning for machine learning algorithms, in EDBT: 22nd International Conference on Extending Database Technology, 2019.
L. Yang and A. Shami, On hyperparameter optimization of machine learning algorithms: theory and practice, Neurocomputing, 2020, 415, 295–316.
J. Isabona, A. L. Imoize and Y. Kim, Machine learning-based boosted regression ensemble combined with hyperparameter tuning for optimal adaptive learning, Sensors, 2022, 22(10), 3776.
C. Zou, L. Zhao, M. Xu, Y. Chen and J. Geng, Porosity prediction with uncertainty quantification from multiple seismic attributes using random forest, J. Geophys. Res.: Solid Earth, 2021, 126(7), e2021JB021826.
R. Rezaee and J. Ekundayo, Permeability prediction using machine learning methods for the CO2 injectivity of the Precipice Sandstone in Surat Basin, Australia, Energies, 2022, 15(6), 2053.
S. García, J. Luengo and F. Herrera, Introduction to data preprocessing, in Data Preprocessing in Data Mining, ed. J. Kacprzyk and L. C. Jain, Springer International Publishing, Cham, Switzerland, 2015, pp. 10–13.
V. Gudivada, A. Apon and J. Ding, Data quality considerations for big data and machine learning: going beyond data cleaning and transformations, Int. J. Adv. Softw., 2017, 10(1), 1–20.
K. Maharana, S. Mondal and B. Nemade, A review: data pre-processing and data augmentation techniques, Global Transit. Proceedings, 2022, 3(1), 91–99.
A. Al Ghaithi and M. Prasad, Machine learning with artificial neural networks for shear log predictions in the Volve field, Norwegian North Sea, in SEG Technical Program Expanded Abstracts 2020, Society of Exploration Geophysicists, 2020, pp. 450–454.
C. S. Ng, A. J. Ghahfarokhi and M. N. Amar, Well production forecast in Volve field: application of rigorous machine learning techniques and metaheuristic algorithm, J. Pet. Sci. Eng., 2022, 208, 109468.
N. O. Nikitin, I. Revin, A. Hvatov, P. Vychuzhanin and A. V. Kalyuzhnaya, Hybrid and automated machine learning approaches for oil fields development: the case study of Volve field, North Sea, Comput. Geosci., 2022, 161, 105061.
MapChart, World map: simple, 2024, cited 2024 Jul 22, available from: https://www.mapchart.net/world.html.
S. Sen and S. S. Ganguli, Estimation of pore pressure and fracture gradient in Volve field, Norwegian North Sea, in SPE Oil and Gas India Conference and Exhibition, SPE, 2019, D022S027R002.
Statoil, 15/9-19A Well Composite Log, Sleipner, Theta Vest Prospect Structure, 1998, cited 2023 Mar 1, available from: https://discovervolve.com/citation-non-commerciality-clause/.
I. F. Ilyas and X. Chu, Data Cleaning, Morgan & Claypool, 2019.
A. Jain, H. Patel, L. Nagalapatti, N. Gupta, S. Mehta, S. Guttula, S. Mujumdar, S. Afzal, R. Sharma Mittal and V. Munigala, Overview and importance of data quality for machine learning tasks, in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 3561–3562.
S. Rawat, A. Rawat, D. Kumar and A. S. Sabitha, Application of machine learning and data visualization techniques for decision support in the insurance sector, Int. J. Inf. Manag. Data Insights, 2021, 1(2), 100012.
M. M. Ahamad, S. Aktar, M. Rashed-Al-Mahfuz, S. Uddin, P. Liò, H. Xu, M. A. Summers, J. M. Quinn and M. A. Moni, A machine learning model to identify early stage symptoms of SARS-CoV-2 infected patients, Expert Syst. Appl., 2020, 160, 113661.
I. H. Sarker, Y. B. Abushark, F. Alsolami and A. I. Khan, IntruDTree: a machine learning based cyber security intrusion detection model, Symmetry, 2020, 12(5), 754.
H. Feizi, H. Apaydin, M. T. Sattari, M. S. Colak and M. Sibtain, Improving reservoir inflow prediction via rolling window and deep learning-based multi-model approach: case study from Ermenek Dam, Turkey, Stoch. Environ. Res. Risk Assess., 2022, 36(10), 3149–3169.
J. J. Salazar, L. Garland, J. Ochoa and M. J. Pyrcz, Fair train-test split in machine learning: mitigating spatial autocorrelation for improved prediction accuracy, J. Pet. Sci. Eng., 2022, 209, 109885.
G. M. Mask and X. Wu, Deriving new type curves through machine learning in the Wolfcamp Formation, in SPE Reservoir Characterisation and Simulation Conference and Exhibition, SPE, 2023, D011S001R007.
T. Emmanuel, T. Maupong, D. Mpoeleng, T. Semong, B. Mphago and O. Tabona, A survey on missing data in machine learning, J. Big Data, 2021, 8, 1–37.
M. M. Seliem, Handling outlier data as missing values by imputation methods: application of machine learning algorithms, Turk. J. Comput. Math. Educ., 2022, 13(1), 273–286.
R. Garcia-Carretero, R. Holgado-Cuadrado and Ó. Barquero-Pérez, Assessment of classification models and relevant features on nonalcoholic steatohepatitis using random forest, Entropy, 2021, 23(6), 763.
I. C. Suherman and R. Sarno, Implementation of random forest regression for COCOMO II effort estimation, in 2020 International Seminar on Application for Technology of Information and Communication (iSemantic), IEEE, 2020, pp. 476–481.
S. Yilmazer and S. Kocaman, A mass appraisal assessment study using machine learning based on multiple regression and random forest, Land Use Policy, 2020, 99, 104889.
M. R. Segal, Machine learning benchmarks and random forest regression.
M. Abbaszadeh, S. Soltani-Mohammadi and A. N. Ahmed, Optimization of support vector machine parameters in modeling of Iju deposit mineralization and alteration zones using particle swarm optimization algorithm and grid search method, Comput. Geosci., 2022, 165, 105140.
M. A. Abbas, W. J. Al-Mudhafar and D. A. Wood, Improving permeability prediction in carbonate reservoirs through gradient boosting hyperparameter tuning, Earth Sci. Inform., 2023, 16(4), 3417–3432.
K. Sandunil, Z. Bennour, H. Ben Mahmud and A. Giwelli, Effects of tuning hyperparameters in random forest regression on reservoir's porosity prediction, case study: Volve oil field, North Sea, in ARMA US Rock Mechanics/Geomechanics Symposium, ARMA, 2023.
W. J. Al-Mudhafar, Incorporation of bootstrapping and cross-validation for efficient multivariate facies and petrophysical modeling, in SPE Rocky Mountain Petroleum Technology Conference/Low-Permeability Reservoirs Symposium, SPE, 2016, SPE-180277.
M. Rahimi and M. A. Riahi, Reservoir facies classification based on random forest and geostatistics methods in an offshore oilfield, J. Appl. Geophys., 2022, 201, 104640.
A. A. Mahmoud, S. Elkatatny, W. Chen and A. Abdulraheem, Estimation of oil recovery factor for water drive sandy reservoirs through applications of artificial intelligence, Energies, 2019, 12(19), 3671.
H. Al Khalifah, P. W. Glover and P. Lorinczi, Permeability prediction and diagenesis in tight carbonates using machine learning techniques, Mar. Pet. Geol., 2020, 112, 104096.
W. M. Ridwan, M. Sapitang, A. Aziz, K. F. Kushiar, A. N. Ahmed and A. El-Shafie, Rainfall forecasting model using machine learning methods: case study Terengganu, Malaysia, Ain Shams Eng. J., 2021, 12(2), 1651–1663.
H. Wang, Z. Lei, X. Zhang, B. Zhou and J. Peng, Machine learning basics, in Deep Learning, 2016, pp. 98–164.
P. Mehta, M. Bukov, C. H. Wang, A. G. Day, C. Richardson, C. K. Fisher and D. J. Schwab, A high-bias, low-variance introduction to machine learning for physicists, Phys. Rep., 2019, 810, 1–24.
GCSE results day 2024: Everything you need to know including the number grading system
Thousands of students across the country will soon be finding out their GCSE results and thinking about the next steps in their education.
Here we explain everything you need to know about the big day, from when results day is, to the current 9-1 grading scale, to what your options are if your results aren’t what you’re expecting.
When is GCSE results day 2024?
GCSE results day takes place on Thursday 22 August.
The results will be made available to schools on Wednesday and can be picked up from your school from 8am on Thursday morning.
Schools will issue their own instructions on how and when to collect your results.
When did we change to a number grading scale?
The numerical grading system was introduced in England in 2017, first in English language, English literature, and maths.
By 2020, all subjects had moved to number grades, which means anyone with GCSE results from 2017 to 2020 will have a combination of both letters and numbers.
The numerical grading system was intended to signal more challenging GCSEs and to differentiate better between students' abilities, particularly at the higher end of the old A*-C range. There used to be only 4 grades between A* and C; with the numerical scale there are 6.
What do the number grades mean?
The grades are ranked from 1, the lowest, to 9, the highest.
The grades don’t exactly translate, but the two grading scales meet at three points as illustrated below.
The bottom of grade 7 is aligned with the bottom of grade A, while the bottom of grade 4 is aligned to the bottom of grade C.
Meanwhile, the bottom of grade 1 is aligned to the bottom of grade G.
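Putting those three anchor points together gives a rough alignment (derived only from the anchors above; it is not an official conversion chart): grades 9, 8 and 7 cover the old A* and A; grades 6, 5 and 4 cover the old B and C; and grades 3, 2 and 1 cover the old D to G.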
What to do if your results weren’t what you were expecting?
If your results weren't what you were expecting, don't panic. You have options.
First things first, speak to your school or college – they could be flexible on entry requirements if you’ve just missed your grades.
They’ll also be able to give you the best tailored advice on whether re-sitting while studying for your next qualifications is a possibility.
If you’re really unhappy with your results you can enter to resit all GCSE subjects in summer 2025. You can also take autumn exams in GCSE English language and maths.
Speak to your sixth form or college to decide when it’s the best time for you to resit a GCSE exam.
Look for other courses with different grade requirements
Entry requirements vary depending on the college and course. Ask your school for advice, and call your college or another one in your area to see if there’s a space on a course you’re interested in.
Consider an apprenticeship
Apprenticeships combine a practical training job with study. They're open to you if you're 16 or over, living in England, and not in full-time education.
As an apprentice you’ll be a paid employee, have the opportunity to work alongside experienced staff, gain job-specific skills, and get time set aside for training and study related to your role.
You can find out more about how to apply here.
Talk to a National Careers Service (NCS) adviser
The National Careers Service is a free resource that can help you with your career planning. Give them a call to discuss potential routes into higher education, further education, or the workplace.
Whatever your results, if you want to find out more about all your education and training options, as well as get practical advice about your exam results, visit the National Careers Service page and Skills for Careers to explore your study and work choices.
WIC Participant and Program Characteristics 2018
In 1986, the Congress enacted Public Laws 99-500 and 99-591, requiring a biennial report on the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC). In response to these requirements, FNS developed a prototype system that allowed for the routine acquisition of information on WIC participants from WIC State Agencies. Since 1992, State Agencies have provided electronic copies of these data to FNS on a biennial basis.
FNS and the National WIC Association (formerly National Association of WIC Directors) agreed on a set of data elements for the transfer of information. In addition, FNS established a minimum standard dataset for reporting participation data. For each biennial reporting cycle, each State Agency is required to submit a participant-level dataset containing standardized information on persons enrolled at local agencies for the reference month of April.
The 2018 Participant and Program Characteristics (PC2018) is the fourteenth data submission to be completed using the WIC PC reporting system. In April 2018, there were 90 State agencies: the 50 States, American Samoa, the District of Columbia, Guam, the Northern Mariana Islands, Puerto Rico, the American Virgin Islands, and 34 Indian tribal organizations.
Processing methods and equipment used: Specifications on formats (“Guidance for States Providing Participant Data”) were provided to all State agencies in January 2018. This guide specified 20 minimum dataset (MDS) elements and 11 supplemental dataset (SDS) elements to be reported on each WIC participant. Each State agency was required to submit all 20 MDS items and any SDS items collected by the State agency.
Study date(s) and duration: The information for each participant is from the participant's most current WIC certification as of April 2018.
Study spatial scale (size of replicates and spatial scale of study area): In April 2018, there were 90 State agencies: the 50 States, American Samoa, the District of Columbia, Guam, the Northern Mariana Islands, Puerto Rico, the American Virgin Islands, and 34 Indian tribal organizations.
Level of true replication: Unknown.
Sampling precision (within-replicate sampling or pseudoreplication): State Agency Data Submissions. PC2018 is a participant dataset consisting of 7,837,672 active records. The records, submitted to USDA by the State Agencies, comprise a census of all WIC enrollees, so no sampling is involved in the collection of these data.
PII Analytic Datasets. State agency files were combined to create a national census participant file of approximately 7.8 million records. The census dataset contains potentially personally identifiable information (PII) and is therefore not made available to the public.
National Sample Dataset. The public use SAS analytic dataset was constructed from a nationally representative sample drawn from the census of WIC participants, selected by participant category. The national sample consists of 1 percent of the total number of participants, or 78,365 records. The distribution by category is 6,825 pregnant women, 6,189 breastfeeding women, 5,134 postpartum women, 18,552 infants, and 41,665 children.
Level of subsampling (number and repeat or within-replicate sampling): The proportionate (or self-weighting) sample was drawn by WIC participant category: pregnant women, breastfeeding women, postpartum women, infants, and children. In this type of sample design, each WIC participant has the same probability of selection across all strata, so sampling weights are not needed when the data are analyzed. In a proportionate stratified sample, the largest stratum accounts for the highest percentage of the analytic sample. A minimal code sketch of this sampling scheme follows below.
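As an illustration only, a proportionate (self-weighting) stratified draw of this kind takes a few lines of pandas; the file and column names below are hypothetical, not those of the released dataset.

# Sketch of a 1 percent proportionate stratified sample by participant
# category; "wic_census.csv" and "participant_category" are placeholder names.
import pandas as pd

census = pd.read_csv("wic_census.csv")  # one row per enrolled participant

# Sampling the same fraction within each stratum gives every participant
# the same selection probability, so no sampling weights are needed later.
sample = (
    census.groupby("participant_category", group_keys=False)
          .sample(frac=0.01, random_state=42)
)
print(sample["participant_category"].value_counts())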
Study design (before–after, control–impacts, time series, before–after–control–impacts): None – non-experimental.
Description of any data manipulation, modeling, or statistical analysis undertaken: Each entry in the dataset contains all MDS and SDS information submitted by the State agency on the sampled WIC participant. In addition, the file contains constructed variables used for analytic purposes. To protect individual privacy, the public use file does not include State agency, local agency, or case identification numbers.
Description of any gaps in the data or other limiting factors: All State agencies except New Mexico provided data on a census of their WIC participants.
Resource Title: WIC Participant and Program Characteristics 2018 Data.
File Name: wicpc.wicpc2018_public_use.csv
Resource Title: WIC Participant and Program Characteristics 2018 Dataset Codebook.
File Name: PC2018 National Sample File Public Use Codebook updated.docx
Resource Description: The 2018 Participant and Program Characteristics (PC2018) is the fourteenth data submission to be completed using the WIC PC reporting system. In April 2018, there were 90 State agencies: the 50 States, American Samoa, the District of Columbia, Guam, the Northern Mariana Islands, Puerto Rico, the American Virgin Islands, and 34 Indian tribal organizations.
Resource Title: WIC Participant and Program Characteristics 2018 Datasets SAS STATA SPSS.
File Name: wicpc2018_agdatacoomonsupload.zip
USDA-FNS: Contract No. AG-3198-C-11-0010
Data contact name, data contact email, intended use, use limitations, temporal extent start date, temporal extent end date: Not specified
OMB Bureau Code: 005:84 - Food and Nutrition Service
OMB Program Code: 005:040 - National Research
Preferred dataset citation: Pending
Key words: Food sciences, food nutritional balance, agricultural economics
Strategies for Improving Sustainable Rice Seed Supply Chain Performance in Indonesia: A Case Study in Bali Province
Description
The sustainability of the rice seed supply chain still needs to be improved to ensure the availability of rice seeds; achieving food security in rice cannot be separated from seed availability. Data on sustainability attributes gathered from farmer groups, farmers implementing seed multiplication (cooperators), seed producers, and key informants are used to analyze the sustainability level of the rice seed supply chain. The data were obtained through surveys and in-depth discussions with the research subjects, tabulated, and then analyzed; the results were compared against the criteria and findings of previous research.
Steps to reproduce
The data used in this study are primary and secondary data. Primary data were obtained through interviews and field observations, and include: preferences for VUB, partnerships, respondent characteristics, descriptive data on supply chain activities (qualitative and quantitative), the relationships between Key Performance Indicators (KPIs) from the ANP software, indicators of rice seed supply chain sustainability, and the influences between sustainability variables. The data span a six-year period, from 2017 to 2022. Secondary data include: producers and production of rice seeds, profiles of seed producers, farmer groups (Subak) of paddy fields, rice area and production, agricultural labor, production costs, and other data related to rice seed supply chain activities in Bali Province. Primary data were gathered through in-depth interviews and discussions with key informants, including experts, practitioners, and regulators. The study employs the Multi-Dimensional Scaling and Rapid Appraisal for Sustainability (MDS-RAPS) approach to analyze the sustainability of the rice seed supply chain, followed by a prospective analysis to generate the expected sustainability strategies. The Analytic Network Process (ANP) was implemented with Super Decisions software, sustainability data were processed in Microsoft Excel with the Rapfish application, and the prospective analysis used the exsimpro software. A rough sketch of the MDS ordination step appears below.
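As a sketch of the MDS ordination step alone (not the authors' actual Rapfish or ANP workflow), attribute scores can be ordinated between "good" and "bad" reference profiles and rescaled to a 0-100 index; all scores below are invented for illustration.

# Illustrative MDS ordination of sustainability attribute scores.
import numpy as np
from sklearn.manifold import MDS

# Rows: two supply chain cases plus "good" and "bad" reference profiles,
# as Rapfish-style ordination uses; columns: ordinal attribute scores (0-3).
scores = np.array([
    [2, 3, 1, 2, 3],   # case A (hypothetical)
    [1, 2, 2, 1, 2],   # case B (hypothetical)
    [3, 3, 3, 3, 3],   # "good" reference point
    [0, 0, 0, 0, 0],   # "bad" reference point
], dtype=float)

coords = MDS(n_components=2, dissimilarity="euclidean",
             random_state=0).fit_transform(scores)

# Project each case onto the bad-to-good axis and rescale to a 0-100 index,
# mimicking the sustainability score that Rapfish reports.
axis = coords[2] - coords[3]
proj = (coords - coords[3]) @ axis / (axis @ axis)
print("Sustainability indices (A, B):", (100 * proj[:2]).round(1))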
10 Real World Data Science Case Studies Projects with Example
Data Analytics Case Study Examples in Travel Industry. Below you will find case studies for data analytics in the travel and tourism industry. 5) Airbnb. ... Airbnb uses predictive analytics to predict the prices of the listings and help the hosts set a competitive and optimal price. The overall profitability of the Airbnb host depends on ...
Statistics Case Study and Dataset Resources
The data is contextualized, provided for download in multiple formats, and includes questions to consider as well as references for each data set. The case studies for the current year can be found by clicking on the "Meetings" tab in the navigation sidebar, or by searching for "case study" in the search bar. Journal of Statistics Education
Top 25 Data Science Case Studies [2024]
Data-Driven Decision-Making: Enables better farming decisions through timely and accurate data. Case Study 14 - Streamlining Drug Discovery (Pfizer) ... As we look to the future, the role of data science is set to grow, promising even more innovative solutions and smarter strategies across all sectors. These case studies inspire and serve as ...
12 Data Science Case Studies: Across Various Industries
Top 12 Data Science Case Studies. 1. Data Science in Hospitality Industry. In the hospitality sector, data analytics assists hotels in better pricing strategies, customer analysis, brand marketing, tracking market trends, and many more. Airbnb focuses on growth by analyzing customer voice using data science. A famous example in this sector is ...
Free Public Data Sets For Analysis
Public data sets are ideal resources to tap into to create data visualizations. With the information provided below, you can explore a number of free, accessible data sets and begin to create your own analyses. The following COVID-19 data visualization is representative of the types of visualizations that can be created using free public ...
10 Real-World Data Science Case Studies Worth Reading
Real-world data science case studies differ significantly from academic examples. While academic exercises often feature clean, well-structured data and simplified scenarios, real-world projects tackle messy, diverse data sources with practical constraints and genuine business objectives. These case studies reflect the complexities data ...
Case Study Library
Mendel's Laws of Inheritance. Use the data sets provided to explore Mendel's Laws of Inheritance for dominant and recessive traits. Key words: Bar charts, frequency distributions, goodness-of-fit tests, mosaic plot, hypothesis tests for proportions. Download the case study (PDF) Download the data set 1. Download the data set 2.
Data in Action: 7 Data Science Case Studies Worth Reading
7 Top Data Science Case Studies . Here are 7 top case studies that show how companies and organizations have approached common challenges with some seriously inventive data science solutions: Geosciences. Data science is a powerful tool that can help us to understand better and predict geoscience phenomena.
A Dataset Exploration Case Study with Know Your Data
A KYD Case Study. As a case study, we explore some of these features using the COCO Captions dataset, an image dataset that contains five human-generated captions for each of over 300k images. Given the rich annotations provided by free-form text, we focus our analysis on signals already present within the dataset.
Data Analysis Case Study: Learn From These Winning Data Projects
Step 2: Review Data Case Studies. Here we are, already at step 2. It's time for you to start reviewing data analysis case studies (starting with the one I'm sharing below). Identify 5 that seem the most promising for your organization given its current set-up.
15 Free Data Sets for Your Next Project or Portfolio
Data.gov. Data.gov is where all of the American government's public data sets live. You can access all kinds of data that is a matter of public record in the country. The main categories of data available are agriculture, climate, energy, local government, maritime, ocean, and older adult health.
Practice take-home case study (datasets/code included)
Going through several of these ourselves, and getting tips from friends, we've compiled a practice take-home case study. Let us know what you think and we look forward to your feedback!
Case Study Method: A Step-by-Step Guide for Business Researchers
Qualitative case study methodology enables researchers to conduct an in-depth exploration of intricate phenomena within some specific context. ... Villiers and Fouché (2015) depicted a paradigm as a set framework making various assumptions about the social world, about how ... The authors interpreted the raw data for case studies with the help ...
Google Data Analytics Capstone: Complete a Case Study
There are 4 modules in this course. This course is the eighth and final course in the Google Data Analytics Certificate. You'll have the opportunity to complete a case study, which will help prepare you for your data analytics job hunt. Case studies are commonly used by employers to assess analytical skills. For your case study, you'll ...
Data Science Use Cases Guide
Data science use case in transport and logistics: Identifying the optimal positioning of taxi vehicles. Uber Technologies Inc., or Uber, is an American company that provides various logistics and transport services. In this case study, we're going to cluster Uber ride-sharing GPS data to identify the optimal positioning of taxi vehicles.
Data Exploration
Each environment included only the hardware each firm required, alongside premier software and data. FactSet Data Exploration provided a turnkey solution, and granted users across Firm A and Firm B access to industry-standard tools such as Microsoft SQL Server, MATLAB, Python, R Studio, and Tableau. In addition, all of FactSet's Standard ...
What is a Case Study?
A case study protocol outlines the procedures and general rules to be followed during the case study. This includes the data collection methods to be used, the sources of data, and the procedures for analysis. Having a detailed case study protocol ensures consistency and reliability in the study.
What are Cases in Statistics? (Definition & Examples)
For example, the following dataset contains 10 cases and 3 variables that we measure for each case: Notice that each case has multiple variables or "attributes." For example, each player has a value for points, assists, and rebounds. Note that cases are also sometimes called experimental units. These terms are used interchangeably.
Power BI Case Study with Practice Datasets
Power BI Case Study - CFI Capital Partners Learning Objectives. Upon completing this course, you will be able to: Transform data in Power Query and create a data model and DAX measures. Analyze and visualize data by creating report visuals. Build in better user experiences with functionality like Page Drillthrough, Bookmarks, and Conditional ...
Datasets for Credit Risk Modeling
Data Set Mortgage. The data set mortgage is in panel form and reports origination and performance observations for 50,000 residential U.S. mortgage borrowers over 60 periods. The periods have been deidentified. As in the real world, loans may originate before the start of the observation period (this is an issue where loans are transferred ...
How to Analyze a Dataset: 6 Steps
6 Steps to Analyze a Dataset. 1. Clean Up Your Data. Data wrangling, also called data cleaning, is the process of uncovering and correcting, or eliminating, inaccurate or repeat records from your dataset. During the data wrangling process, you'll transform the raw data into a more useful format, preparing it for analysis.
University Libraries purchases Sage Research Methods package
Ohio University Libraries has purchased Sage Research Methods, a platform that includes textbooks, foundation research guidelines, data sets, code books, peer-reviewed case studies and more with updates through 2029. If members of the OHIO community are looking to explore a new research methodology ...
Data Sets for Cases
Data Sets for Cases. (See related pages) Donald, Bowersox 6e, Supply Chain Logistics Management, 2024 - 1265072604. Case 6 Western Pharmaceutical A Data. Case 7 Western Pharmaceutical B Data. Case 8 Figure 1 Woodson Chemical Company North America Division organization structure. Case 10 - Cooper Processing Solution. Case 11 - Dream Beauty Solution.
Implementation of the World Health Organization Minimum ...
Background: The National Disaster Management Agency (Badan Nasional Penanggulangan Bencana) handles disaster management in Indonesia as a health cluster by collecting, storing, and reporting information on the state of survivors and their health from various sources during disasters. Data were collected on paper and transferred to Microsoft Excel spreadsheets.
Effects of tuning decision trees in random forest regression on
A case study: Volve oil field, North Sea ... and the remainder are used for validation. 59,60 This is repeated until each fold gets the chance to serve as the validation set (see the k-fold sketch after this list). For this study, ... K. Smith, N. Kadkhodayan, M. Leung and K. Baldwin, Reservoir Characterization, Machine Learning and Big Data - An Offshore California Case Study. In SPE ...
User-centered design in brain-computer interfaces—A case study
Objective: The array of available brain-computer interface (BCI) paradigms has continued to grow, and so has the corresponding set of machine learning methods which are at the core of BCI systems. The latter have evolved to provide more robust data analysis solutions, and as a consequence the proportion of healthy BCI users who can use a BCI successfully is growing. With this development the ...
GCSE results day 2024: Everything you need to know including the number grading system
Apprenticeships combine a practical training job with study too. They're open to you if you're 16 or over, living in England, and not in full time education. As an apprentice you'll be a paid employee, have the opportunity to work alongside experienced staff, gain job-specific skills, and get time set aside for training and study related ...
Environment-Specific Stable Carbon Isotope Fractionation of
Raw data for all figures and models in the manuscript by Cong-cong Guo to be published by Geochimica et Cosmochimica Acta as "Environment-Specific Stable Carbon Isotope Fractionation of Phytoplankton As the Basis in Better Constraining Marine Bulk Particulate Organic Carbon Dynamics and Budgets: The case study in a Temperate Coastal Ocean (the Yellow Sea)"
WIC Participant and Program Characteristics 2018
In 1986, the Congress enacted Public Laws 99-500 and 99-591, requiring a biennial report on the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC). In response to these requirements, FNS developed a prototype system that allowed for the routine acquisition of information on WIC participants from WIC State Agencies. Since 1992, State Agencies have provided electronic ...
Strategies for Improving Sustainable Rice Seed ...
The sustainability of the rice seed supply chain still needs to be improved to ensure the availability of rice seeds. To achieve food security (rice) cannot be separated from the availability of seeds. Data on sustainability attributes according to farmer groups, farmers implementing multiplication (cooperators), seed producers and key informants are used in analyzing the level of ...
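Several entries above, notably the Volve random forest study, describe k-fold cross-validation, in which each fold serves once as the validation set. A minimal sketch of that scheme with invented data (five folds, a small random forest) might look like this:

# Illustrative k-fold cross-validation; features, target and k are invented.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.random((200, 5))   # e.g., well-log attributes
y = rng.random(200)        # e.g., porosity values

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    # Each iteration trains on four folds and validates on the held-out fold.
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[val_idx], model.predict(X[val_idx])))

print("Mean validation R^2 across folds:", np.mean(scores).round(3))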