Statistical Intuitions and Applications: Assignment #1

This assignment uses an open dataset from the Dubai Statistics Center. You will investigate the main variables in depth using the statistical tools from this course, and prepare a report by completing the tasks and answering the questions that follow.

Main Setup:

For this assignment, you will select a random sample from the dataset using the code given below. To select your random sample and save it on your computer, follow these instructions:

Run the code below to find the sample size ‘n’ on which you will work.

Go to the 3rd line of the next code block and enter your sample size: n = ???

Now go to the 4th line: df.to_csv(r'**Path where you want to store the exported CSV file\File Name.csv**')

Change **Path where you want to store the exported CSV file** to where you want to store your data.

Change **File Name** to your first name.

Run the code.

Use this data set to complete your assignment. **Also include this CSV file in your assignment submission!**

import pandas

original_data = pandas.read_csv("https://raw.githubusercontent.com/zu-math/SIA-Fall-2023-Dataset/main/mod_mea_f.csv")

df = original_data.sample(n=???)  # replace ??? with your sample size

df.to_csv(r'Path where you want to store the exported CSV file\File Name.csv')

print(df)
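As a rough illustration of the sample-and-save steps above, the sketch below runs the same `sample` and `to_csv` calls on a small made-up table and writes to a temporary folder, so the steps can be tried before touching the real data. The table values, the sample size of 4, and the file name `FirstName.csv` are placeholders chosen for this example and are not part of the assignment.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the course dataset (values are illustrative only).
toy_data = pd.DataFrame({
    "Academic_Year": list(range(2010, 2020)),
    "Masters": [5, 7, 6, 9, 12, 11, 15, 14, 18, 20],
})

n = 4  # hypothetical sample size; use the n you were given
sample_df = toy_data.sample(n=n)  # n rows drawn at random, without replacement

# Build the output path from a folder and a file name.
folder = Path(tempfile.mkdtemp())    # replace with your own folder
csv_path = folder / "FirstName.csv"  # replace with your first name
sample_df.to_csv(csv_path)

# Read the file back to confirm that it holds exactly n rows.
check = pd.read_csv(csv_path, index_col=0)
print(csv_path)
print(check.shape)
```

Building the path with `pathlib.Path` avoids typing the folder separator by hand, which is a common source of errors when the path is written as a single string.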

Once the above is done, work on the tasks stated below. Be clear in explaining your analyses and your findings. Show that you have been thorough and careful by explaining and discussing your findings, not by presenting large amounts of computer output without interpretation. Your report should be clear and concise. (Consider how tables might help to summarize many results.) Please use normal margins and a readable font size.

Question 1.

Your first task is to briefly introduce the study and its main variables in a short, clearly worded report. Identify all the variables in the dataset, and explain to the reader what you will be analyzing in this report.

Question 2.

In this task, you will generate descriptive statistics for all the quantitative variables in the dataset, plot their histograms, and describe their distributions in terms of shape, center, spread, and presence of outliers. The code below gives the histogram and five-number summary of the relevant column. Replace ‘???’ with the column name.
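A minimal sketch of how a histogram and five-number summary could be produced for one column is shown here. It assumes the column is numeric, uses pandas' default (linearly interpolated) quartiles, and runs on made-up values rather than the course data; the helper name `five_number_summary` is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd


def five_number_summary(values: pd.Series) -> pd.Series:
    """Return the minimum, Q1, median, Q3, and maximum of a numeric column."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    return pd.Series(
        {"min": values.min(), "Q1": q1, "median": median, "Q3": q3, "max": values.max()}
    )


# Made-up values standing in for one quantitative column of your sample.
df = pd.DataFrame({"Masters": [2, 4, 4, 5, 7, 9, 30]})
column = "Masters"  # replace with the column you are describing

summary = five_number_summary(df[column])
print(summary)

# Histogram of the same column, for judging shape and spotting outliers.
ax = df[column].plot(kind="hist", bins=10, edgecolor="black")
ax.set_xlabel(column)
ax.set_title(f"Histogram of {column}")
plt.show()
```

In this toy column the value 30 sits far from the rest, so it pulls the maximum well above Q3; that kind of gap is one thing to look for when commenting on outliers.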

Question 3.

Your next task is to choose three quantitative variables and two categorical variables.

Suppose you have chosen the column ‘Masters’ as a quantitative variable and ‘Gender_EN’ as a categorical variable. Carry out the task below for this pair.

3a. Generate a grouped box plot to compare the distribution of Master's degree holders among male and female students. Describe your observations, referring to the five-number summaries of both genders.

Repeat this for each of the other combinations of quantitative and categorical variables. This should give six cases in total.

3b. Discuss any patterns you observe when you compare male and female students.
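One possible way to produce the grouped box plot and the per-group five-number summaries is sketched below. It runs on made-up numbers, and the category labels "Male" and "Female" are an assumption about how ‘Gender_EN’ is coded in the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample; the "Male"/"Female" labels are assumed category values.
df = pd.DataFrame({
    "Gender_EN": ["Male"] * 5 + ["Female"] * 5,
    "Masters": [1, 2, 3, 4, 5, 2, 4, 6, 8, 10],
})

quantitative = "Masters"   # replace with your quantitative column
categorical = "Gender_EN"  # replace with your categorical column

# Five-number summary per group (describe() also reports count, mean, and std).
by_group = df.groupby(categorical)[quantitative].describe()
five_numbers = by_group[["min", "25%", "50%", "75%", "max"]]
print(five_numbers)

# One box per category, drawn on a shared axis for direct comparison.
fig, ax = plt.subplots()
df.boxplot(column=quantitative, by=categorical, ax=ax)
ax.set_ylabel(quantitative)
ax.set_title(f"{quantitative} by {categorical}")
fig.suptitle("")  # drop the automatic "Boxplot grouped by ..." super-title
plt.show()
```

Changing the two column names at the top and rerunning covers each of the remaining combinations.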

Question 4.

In this task, you will use scatterplots to examine the relationship between dependent and independent variables. Treat ‘Academic_Year’ as the independent variable and use any two of the quantitative variables you chose in Question 3 as the dependent variables.

**4a.** Create separate scatterplots to examine the relationship between each dependent variable and the independent variable. Describe the scatterplots in terms of the form, strength, and direction of the relationships. Then examine whether the relationship between the independent variable and each dependent variable varies by gender (you will need to create a separate scatterplot for each gender to answer this question).

Repeat this for the other quantitative variables. Then change the categorical variable and repeat the analysis. This should give rise to ten cases.

**4b.** Report your findings, explaining in simple words what you observed.
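The per-gender scatterplots could be drawn along the following lines. This is only a sketch on made-up data: it assumes ‘Academic_Year’ is stored as a plain number (if it is text in the real dataset, it would need converting first) and that ‘Gender_EN’ takes the values "Male" and "Female". Pearson's r is added as a rough guide to direction and linear strength; the form of the relationship still has to be judged from the plot itself.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample; years are assumed to be stored as plain numbers.
df = pd.DataFrame({
    "Academic_Year": [2015, 2016, 2017, 2018] * 2,
    "Gender_EN": ["Male"] * 4 + ["Female"] * 4,
    "Masters": [10, 12, 14, 16, 20, 18, 21, 19],
})

x_col, y_col, group_col = "Academic_Year", "Masters", "Gender_EN"

# One panel per category, sharing the y-axis so the panels are comparable.
groups = list(df.groupby(group_col))
fig, axes = plt.subplots(
    1, len(groups), figsize=(5 * len(groups), 4), sharey=True, squeeze=False
)

correlations = {}
for ax, (label, part) in zip(axes[0], groups):
    ax.scatter(part[x_col], part[y_col])
    # Pearson's r: the sign gives direction, the size gives linear strength.
    r = part[x_col].corr(part[y_col])
    correlations[label] = r
    ax.set_title(f"{label} (r = {r:.2f})")
    ax.set_xlabel(x_col)
axes[0][0].set_ylabel(y_col)
plt.show()

print(correlations)
```

In the toy data the male group lies exactly on a rising line while the female group shows no linear trend, which is the kind of contrast between groups that 4a asks you to look for.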

Question 5.

You will now fit a simple linear regression model that predicts the dependent variable. Treat ‘Academic_Year’ as the independent variable and use any two of the quantitative variables you chose in Question 3 as the dependent variables.

**5a.** Fit a simple linear regression model between each dependent variable and the independent variable. Generate the residual plot and use it, together with the standard error and R^2, to assess the fit of each linear model. If the model is a good fit, interpret the slope and the intercept.

Repeat this for the other quantitative variables. Then change the categorical variable and repeat the analysis. This should give rise to ten cases.

**5b.** Summarize and present your findings in precise statistical terms.
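The quantities named in 5a can be computed directly from the least-squares formulas, as in the sketch below. It uses made-up data and assumes numeric years; "standard error" is taken here to mean the standard error of the regression, sqrt(SSE / (n - 2)), which is an interpretation rather than something the task states.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up sample; replace with your own columns.
df = pd.DataFrame({
    "Academic_Year": [2015, 2016, 2017, 2018, 2019, 2020],
    "Masters": [10, 12, 15, 15, 19, 20],
})

x = df["Academic_Year"].to_numpy(dtype=float)
y = df["Masters"].to_numpy(dtype=float)

# Least-squares estimates for the line y = intercept + slope * x.
sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
slope = sxy / sxx
intercept = y.mean() - slope * x.mean()

fitted = intercept + slope * x
residuals = y - fitted

sse = np.sum(residuals ** 2)         # unexplained variation
sst = np.sum((y - y.mean()) ** 2)    # total variation
r_squared = 1 - sse / sst
std_error = np.sqrt(sse / (len(x) - 2))  # assumed meaning of "standard error"

print(f"slope = {slope:.3f}, intercept = {intercept:.1f}")
print(f"R^2 = {r_squared:.3f}, standard error = {std_error:.3f}")

# Residual plot: a patternless band around zero supports a linear model.
fig, ax = plt.subplots()
ax.scatter(x, residuals)
ax.axhline(0, color="gray", linestyle="--")
ax.set_xlabel("Academic_Year")
ax.set_ylabel("Residual")
plt.show()
```

Note that with calendar years on the x-axis the intercept is the predicted value at year 0, far outside the data, so it rarely has a practical interpretation.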

Question 6.

The conservation and rehabilitation of local flora and natural habitats is part of the biodiversity program of the Environment Agency, Abu Dhabi (EAD). A high-tech monitoring system provides the following annual production of plants in the native plant nursery. The number of plants produced in a particular year is written on top of that year's column and may be used in calculations.

Answer the following questions.

**6a.** Calculate the percentage increase in plant production from year Y1 to year Y2, where both years Y1 and Y2 are obtained by running the code below.

**6b.** Between which two years was the percentage increase in the annual production of plants equal to P%? The value of P is found by running the code below and is given to one decimal place.
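The arithmetic behind 6a and 6b is the usual percentage-change formula, (new - old) / old × 100. The sketch below applies it to made-up production figures, since the real ones must be read from the chart. The helper names are hypothetical, and checking every earlier/later pair of years is an assumption; the question may intend consecutive years only.

```python
from itertools import combinations

# Made-up production figures; read the real ones from the chart.
production = {2018: 200, 2019: 250, 2020: 300}


def percent_increase(old: float, new: float) -> float:
    """Percentage change from old to new, relative to old."""
    return (new - old) / old * 100


def years_matching(figures: dict, target: float, decimals: int = 1) -> list:
    """Return (earlier, later) year pairs whose increase rounds to target."""
    matches = []
    for y1, y2 in combinations(sorted(figures), 2):
        change = percent_increase(figures[y1], figures[y2])
        if round(change, decimals) == round(target, decimals):
            matches.append((y1, y2))
    return matches


# 6a-style calculation for one pair of years.
print(percent_increase(production[2018], production[2019]))

# 6b-style search for the pair of years matching a given P.
print(years_matching(production, 20.0))
```

Rounding both sides to one decimal place mirrors the statement that P is given to one decimal place.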

Question 7.

Under the long-term marine water quality monitoring program of the Environment Agency, Abu Dhabi (EAD), a red tide monitoring project was launched to look for harmful algal blooms (HAB) in marine water. The project recorded the number of HAB incidents in Abu Dhabi from 2002 to 2022. Using the statistical skills you have learned in this course, IDS-103, make two strong (non-trivial) observations from the graph below to be reported to EAD.