https://github.com/johnny-godoy/laboratorios-mds/blob/main/lab%203/laboratorio_3.ipynb
¶pandas
.pandas
para manejar datos tabulares.Nota: El laboratorio deberá ser desarrollado sin el uso indiscriminado de iteradores nativos de python (aka "for", "while"). La idea es que aprendan a exprimir al máximo las funciones optimizadas que nos entrega
pandas
, las cuales vale mencionar, son bastante más eficientes que los iteradores nativos sobre DataFrames.
# Libreria Core del lab.
import numpy as np
import pandas as pd
from IPython.display import display
#Libreria para plotear
# !pip install --upgrade plotly
import plotly.express as px
# Librerias utiles
from sklearn.preprocessing import StandardScaler
Para este laboratorio deberán trabajar con el dataset students_grades.json
el cual contiene información sobre el rendimiento de alumnos en distintos ramos.
El objetivo principal será el de realizar un trabajo de limpieza de prepración y exploración de datos utilizando exclusivamente DataFrames
y sus métodos asociados.
Don Caguayo, rector de la Universidad de la Cachaña, les comenta que actualmente la Universidad no tiene una subdirección de análisis de datos, pero que tiene un gran interés en formar una. Es por esto, que se contacta con ud. y su equipo para iniciar de forma exploratoria diversas tareas de análisis y consolidación de datos respectivos al rendimiento de sus alumnos.
Previo a su primera reunión con el rector, el rector le envía una copia de la base de datos, la cuál a usted y su equipo le encantaría examinar. Para esto, la primera tarea que debe realizar es abrir y cargar el dataset.
Notar que es sumamente importante el formato que posee cada columna para su manipulación y que posiblemente ustedes deberán especificar algunos tipos de datos a mano. Ver la documentación del método
read_json
para mayor información.
A continuación se muestra un ejemplo de los dos primeros registros del dataset:
[
{
"names": "Rita Courtney",
"gender": "female",
"race/ethnicity": "group B",
"parental level of education": "some high school",
"lunch": "standard",
"test preparation course": "none",
"math score": 37,
"reading score": 46,
"writing score": 46
},
{
"names": "Madeline Fuller",
"gender": "female",
"race/ethnicity": "group C",
"parental level of education": "some high school",
"lunch": "standard",
"test preparation course": "none",
"math score": 124,
"reading score": 142,
"writing score": 142
}
]
df_grades = pd.read_json("students_grades.json")
# Convirtiendo a tipos apropiados
columnas_notas = [f"{subject} score" for subject in ("math", "reading", "writing")]
# Reemplazando los datos vacíos con np.nan, y transfdormando los datos a numéricos
df_grades[columnas_notas] = df_grades[columnas_notas].replace("", np.nan).astype(float)
Posteriormente, el rector les comenta que la sección de TI posee una base de datos con más de 875 filas, en donde cada fila representa un alumno matriculado. Sin embargo, cree fuertemente que existen errores en esta: solo tiene registro de 875 matriculas. Por este motivo, especula que se han realizado malas prácticas al registrar dichos datos.
En consecuencia, el rector les solicita a ustedes como tarea inicial que exploren la base de datos y encuentre posibles falencias dentro de sus registros. Para esto, usted debe estar al tanto que las notas están en una escala que van del 0 al 100 (porcentaje de logro) y no es posible que un alumno haya dado más de una vez el mismo examen. Cualquier alumno que posea un valor diferente a esta escala, debería ser eliminado de la base de datos.
Instrucciones
Cree una función que tome un DataFrame
y que le permita revisar si las diferentes variables que lo componen son coherentes y si es posible observar anomalías en estas.
To-Do
isna()
y sum()
de los DataFrames).nunique()
) .duplicated()
y su parámetro subset
).duplicated()
y su parámetro subset
)..plot.box()
plotear el boxplot de las notas. De ser necesario, invetigue como funcionan los gráficos de caja (esta sección es provista).¿Que anomalias es posible observar de los datos que componen el DataFrame? ¿Existen valores nulos o duplicados? ¿Qué puede interpretar de la distribución de las notas en el boxplot? (), comente esto en la celda ubicada debajo de justificación.
Nota: Para imprmir utilice
print(...)
. Para mostrarDataFrames
, utilicedisplay(...)
ya que este permite mostrar estos elementos como tablas.Nota 2: Utilice la documentación de pandas como también StackOverflow para resolver cualquier problema o inconveniente.
Nota 3: Si no ha trabajado con boxplot antes, el siguiente link puede serle de mucha ayuda: boxplot
Respuesta:
def exploratory_data_analysis(dataframe):
# Imprimir un mensaje con las dimensiones del Dataframe.
filas, columnas = dataframe.shape
print(f"1.- Los datos tienen {filas} filas y {columnas} columnas")
# Imprimir un mensaje con los nombres de las columnas que conforman el DataFrame.
print(f"2.- Los datos tienen las siguientes columnas: {list(dataframe.columns)}")
# Mostrar los primeros 5, los últimos 5 elementos del DataFrame y realice un muestreo de 5 elementos de forma aleatoria.
print("3.- Ejemplos de filas")
print("Los 5 primeros datos son:")
display(dataframe.head(5))
print("Los 5 últimos datos son:")
display(dataframe.tail(5))
print("Muestra de 5 datos aleatorios:")
display(dataframe.sample(5))
# Mostrar una descripción rápida de las variables numéricas del DataFrame (utilice un método visto en clases).
print("4.- Descripción de variables numéricas")
display(dataframe.describe())
# Imprimir un mensaje con la cantidad de valores nulos (investigue el método `isna()` y `sum()` de los DataFrames).
print(f"5.- La cantidad de valores nulos para cada atributo son:\n{dataframe.isna().sum()}")
# Imprimir un mensaje que detalle la cantidad de valores unicos (investigue `nunique()`) .
print(f"6.- La cantidad de valores únicos para cada atributo son:\n{dataframe.nunique()}")
# Imprimir un mensaje con el total de filas que contengan nombres duplicados (investigue `duplicated()` y su parámetro `subset`).
mascara_duplicados = dataframe.duplicated(subset="names")
print(f"7.- La cantidad de valores duplicados es {mascara_duplicados.sum()}")
# Mostrar las filas que contengan nombres sean duplicados (investigue `duplicated()` y su parámetro `subset`).
print(f"8.- Los datos de los nombres duplicados son:")
display(dataframe[mascara_duplicados])
# A traves de `.plot.box()` plotear el boxplot de las notas. De ser necesario, invetigue como funcionan los gráficos de caja (esta sección es provista).
print(f"9.- Gráfico de caja:")
dataframe.plot.box()
exploratory_data_analysis(df_grades)
1.- Los datos tienen 1400 filas y 9 columnas 2.- Los datos tienen las siguientes columnas: ['names', 'gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course', 'math score', 'reading score', 'writing score'] 3.- Ejemplos de filas Los 5 primeros datos son:
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|---|
0 | Rita Courtney | female | group B | some high school | standard | none | 37.0 | 46.0 | 46.0 |
1 | Madeline Fuller | female | group C | some high school | standard | none | 124.0 | 142.0 | 142.0 |
2 | Daniel Rester | male | group E | associate's degree | standard | none | 156.0 | 196.0 | 193.0 |
3 | Charles Linstrom | male | group A | bachelor's degree | standard | completed | 80.0 | 78.0 | 81.0 |
4 | Geraldine Hale | female | group A | associate's degree | standard | none | 131.0 | 166.0 | 166.0 |
Los 5 últimos datos son:
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|---|
1395 | Alexander Dillon | male | group D | associate's degree | standard | none | 193.0 | 142.0 | 193.0 |
1396 | Marina Zeigler | female | group C | bachelor's degree | free/reduced | completed | 66.0 | 74.0 | 81.0 |
1397 | Laurie Carter | female | group B | some high school | standard | completed | 54.0 | 61.0 | 62.0 |
1398 | Joseph Mccoy | male | group D | some college | free/reduced | none | 193.0 | 196.0 | 193.0 |
1399 | Amanda Perez | female | group A | high school | standard | completed | 68.0 | 80.0 | 76.0 |
Muestra de 5 datos aleatorios:
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|---|
123 | Milton Davidson | male | group C | associate's degree | free/reduced | completed | 124.0 | NaN | 193.0 |
91 | William Austin | male | group C | associate's degree | standard | completed | 57.0 | 54.0 | 56.0 |
112 | John Salter | male | group D | some college | standard | none | 44.0 | 54.0 | 53.0 |
465 | Alfonzo Kellar | male | group C | some high school | standard | none | 75.0 | 72.0 | 62.0 |
695 | Donna Medlin | female | group B | high school | free/reduced | none | 38.0 | 142.0 | 193.0 |
4.- Descripción de variables numéricas
math score | reading score | writing score | |
---|---|---|---|
count | 1360.000000 | 1360.000000 | 1353.000000 |
mean | 89.538971 | 93.086029 | 91.852919 |
std | 47.152422 | 46.385775 | 47.043570 |
min | 0.000000 | 17.000000 | 10.000000 |
25% | 59.000000 | 63.000000 | 62.000000 |
50% | 72.000000 | 76.000000 | 75.000000 |
75% | 97.250000 | 100.000000 | 100.000000 |
max | 196.000000 | 196.000000 | 196.000000 |
5.- La cantidad de valores nulos para cada atributo son: names 0 gender 0 race/ethnicity 0 parental level of education 0 lunch 0 test preparation course 0 math score 40 reading score 40 writing score 47 dtype: int64 6.- La cantidad de valores únicos para cada atributo son: names 1153 gender 2 race/ethnicity 5 parental level of education 6 lunch 2 test preparation course 2 math score 96 reading score 88 writing score 92 dtype: int64 7.- La cantidad de valores duplicados es 247 8.- Los datos de los nombres duplicados son:
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|---|
12 | Rita Courtney | female | group B | some high school | standard | none | 193.0 | 193.0 | 193.0 |
81 | Elizabeth Quintero | female | group B | some college | standard | none | 82.0 | 85.0 | 87.0 |
160 | Michelle Gonzales | female | group B | some high school | standard | completed | 60.0 | 70.0 | 74.0 |
170 | Carla Mendoza | female | group D | associate's degree | free/reduced | none | 43.0 | 60.0 | 58.0 |
176 | Darlene Parker | female | group C | associate's degree | standard | none | 54.0 | 61.0 | 58.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1385 | James Crawford | male | group D | master's degree | standard | none | 193.0 | 147.0 | 131.0 |
1390 | Gloria Bradford | female | group B | high school | standard | none | 124.0 | 64.0 | 193.0 |
1393 | Adam Romero | male | group C | high school | standard | completed | 124.0 | 193.0 | 49.0 |
1395 | Alexander Dillon | male | group D | associate's degree | standard | none | 193.0 | 142.0 | 193.0 |
1398 | Joseph Mccoy | male | group D | some college | free/reduced | none | 193.0 | 196.0 | 193.0 |
247 rows × 9 columns
9.- Gráfico de caja:
Justificación:
¿Que anomalias es posible observar de los datos que componen el DataFrame? ¿Existen valores nulos o duplicados? ¿Qué puede interpretar de la distribución de las notas en el boxplot?
Las anomalías observables son:
Respuesta Esperada:
exploratory_data_analysis(df_grades)
1.- El DataFrame tiene 1400 filas y 9 columnas 2.- El DataFrame esta compuesto por las siguientes columnas: ['names', 'gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course', 'math score', 'reading score', 'writing score'] 3.- Ejemplos de filas del DataFrame: Primeras 5 filas:
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|---|
0 | Rita Courtney | female | group B | some high school | standard | none | 37.0 | 46.0 | 46.0 |
1 | Madeline Fuller | female | group C | some high school | standard | none | 124.0 | 142.0 | 142.0 |
2 | Daniel Rester | male | group E | associate's degree | standard | none | 156.0 | 196.0 | 193.0 |
3 | Charles Linstrom | male | group A | bachelor's degree | standard | completed | 80.0 | 78.0 | 81.0 |
4 | Geraldine Hale | female | group A | associate's degree | standard | none | 131.0 | 166.0 | 166.0 |
Últimas 5 filas:
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|---|
1395 | Alexander Dillon | male | group D | associate's degree | standard | none | 193.0 | 142.0 | 193.0 |
1396 | Marina Zeigler | female | group C | bachelor's degree | free/reduced | completed | 66.0 | 74.0 | 81.0 |
1397 | Laurie Carter | female | group B | some high school | standard | completed | 54.0 | 61.0 | 62.0 |
1398 | Joseph Mccoy | male | group D | some college | free/reduced | none | 193.0 | 196.0 | 193.0 |
1399 | Amanda Perez | female | group A | high school | standard | completed | 68.0 | 80.0 | 76.0 |
Muestreo aleatorio de 5 filas:
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|---|
1282 | Alberto Roberts | male | group D | associate's degree | standard | none | 52.0 | 55.0 | 49.0 |
604 | Ronald Wilson | male | group C | some high school | standard | none | 64.0 | 58.0 | 51.0 |
559 | Christine Reed | female | group E | bachelor's degree | standard | completed | 188.0 | 142.0 | 131.0 |
380 | Edna Johnson | female | group E | some college | standard | completed | 193.0 | 124.0 | 196.0 |
1392 | Richard Young | male | group D | high school | standard | none | 69.0 | 75.0 | 71.0 |
4.- Descripción numérica del Dataframe:
math score | reading score | writing score | |
---|---|---|---|
count | 1360.000000 | 1360.000000 | 1353.000000 |
mean | 89.538971 | 93.086029 | 91.852919 |
std | 47.152422 | 46.385775 | 47.043570 |
min | 0.000000 | 17.000000 | 10.000000 |
25% | 59.000000 | 63.000000 | 62.000000 |
50% | 72.000000 | 76.000000 | 75.000000 |
75% | 97.250000 | 100.000000 | 100.000000 |
max | 196.000000 | 196.000000 | 196.000000 |
5.- Cantidad de valores nulos por columna:
names 0 gender 0 race/ethnicity 0 parental level of education 0 lunch 0 test preparation course 0 math score 40 reading score 40 writing score 47 dtype: int64
6.- Cantidad de valores únicos por columna:
names 1153 gender 2 race/ethnicity 5 parental level of education 6 lunch 2 test preparation course 2 math score 96 reading score 88 writing score 92 dtype: int64
7.- Presenta un total de 247 filas con nombres duplicados. 8.- Las filas con nombres duplicados son:
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|---|
12 | Rita Courtney | female | group B | some high school | standard | none | 193.0 | 193.0 | 193.0 |
81 | Elizabeth Quintero | female | group B | some college | standard | none | 82.0 | 85.0 | 87.0 |
160 | Michelle Gonzales | female | group B | some high school | standard | completed | 60.0 | 70.0 | 74.0 |
170 | Carla Mendoza | female | group D | associate's degree | free/reduced | none | 43.0 | 60.0 | 58.0 |
176 | Darlene Parker | female | group C | associate's degree | standard | none | 54.0 | 61.0 | 58.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1385 | James Crawford | male | group D | master's degree | standard | none | 193.0 | 147.0 | 131.0 |
1390 | Gloria Bradford | female | group B | high school | standard | none | 124.0 | 64.0 | 193.0 |
1393 | Adam Romero | male | group C | high school | standard | completed | 124.0 | 193.0 | 49.0 |
1395 | Alexander Dillon | male | group D | associate's degree | standard | none | 193.0 | 142.0 | 193.0 |
1398 | Joseph Mccoy | male | group D | some college | free/reduced | none | 193.0 | 196.0 | 193.0 |
247 rows × 9 columns
9.- Boxplot de notas:
En base a lo realizado anteriormente, realice una limpieza de los dataset. Asuma los supuestos señalados en la pregunta anterior para limpiar el DataFrame
y cree una función que le permita limpiar los datos. Una vez limpiado cada uno de los DataFrames
compruebe nuevamente los datos con la función creada en la sección 1.1.
To-Do:
Hint
Puede ser útil los métodos .drop_duplicates()
y .dropna()
como también utilizar queries con máscaras booleanas.
def clean_data(dataframe):
dropped = dataframe.drop_duplicates(subset="names").dropna()
valid_score = dropped.loc[dropped[columnas_notas].max(axis=1) <= 100]
return valid_score.set_index(np.arange(len(valid_score)))
df_grades = clean_data(df_grades)
df_grades
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|---|
0 | Rita Courtney | female | group B | some high school | standard | none | 37.0 | 46.0 | 46.0 |
1 | Charles Linstrom | male | group A | bachelor's degree | standard | completed | 80.0 | 78.0 | 81.0 |
2 | Brian Young | male | group C | some high school | standard | none | 73.0 | 66.0 | 63.0 |
3 | Howard Jimenez | male | group E | some high school | standard | completed | 81.0 | 75.0 | 76.0 |
4 | Wayne Wilson | male | group B | some high school | standard | completed | 94.0 | 86.0 | 87.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
870 | Richard Young | male | group D | high school | standard | none | 69.0 | 75.0 | 71.0 |
871 | Wanda Russell | female | group B | high school | free/reduced | completed | 23.0 | 44.0 | 36.0 |
872 | Marina Zeigler | female | group C | bachelor's degree | free/reduced | completed | 66.0 | 74.0 | 81.0 |
873 | Laurie Carter | female | group B | some high school | standard | completed | 54.0 | 61.0 | 62.0 |
874 | Amanda Perez | female | group A | high school | standard | completed | 68.0 | 80.0 | 76.0 |
875 rows × 9 columns
Respuesta:
df_grades = clean_data(df_grades)
df_grades
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|---|
0 | Rita Courtney | female | group B | some high school | standard | none | 37.0 | 46.0 | 46.0 |
1 | Charles Linstrom | male | group A | bachelor's degree | standard | completed | 80.0 | 78.0 | 81.0 |
2 | Brian Young | male | group C | some high school | standard | none | 73.0 | 66.0 | 63.0 |
3 | Howard Jimenez | male | group E | some high school | standard | completed | 81.0 | 75.0 | 76.0 |
4 | Wayne Wilson | male | group B | some high school | standard | completed | 94.0 | 86.0 | 87.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
870 | Richard Young | male | group D | high school | standard | none | 69.0 | 75.0 | 71.0 |
871 | Wanda Russell | female | group B | high school | free/reduced | completed | 23.0 | 44.0 | 36.0 |
872 | Marina Zeigler | female | group C | bachelor's degree | free/reduced | completed | 66.0 | 74.0 | 81.0 |
873 | Laurie Carter | female | group B | some high school | standard | completed | 54.0 | 61.0 | 62.0 |
874 | Amanda Perez | female | group A | high school | standard | completed | 68.0 | 80.0 | 76.0 |
875 rows × 9 columns
Utilice la función creada en el punto 1.1 para verificar sus resultados.
exploratory_data_analysis(df_grades)
1.- Los datos tienen 875 filas y 9 columnas 2.- Los datos tienen las siguientes columnas: ['names', 'gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course', 'math score', 'reading score', 'writing score'] 3.- Ejemplos de filas Los 5 primeros datos son:
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|---|
0 | Rita Courtney | female | group B | some high school | standard | none | 37.0 | 46.0 | 46.0 |
1 | Charles Linstrom | male | group A | bachelor's degree | standard | completed | 80.0 | 78.0 | 81.0 |
2 | Brian Young | male | group C | some high school | standard | none | 73.0 | 66.0 | 63.0 |
3 | Howard Jimenez | male | group E | some high school | standard | completed | 81.0 | 75.0 | 76.0 |
4 | Wayne Wilson | male | group B | some high school | standard | completed | 94.0 | 86.0 | 87.0 |
Los 5 últimos datos son:
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|---|
870 | Richard Young | male | group D | high school | standard | none | 69.0 | 75.0 | 71.0 |
871 | Wanda Russell | female | group B | high school | free/reduced | completed | 23.0 | 44.0 | 36.0 |
872 | Marina Zeigler | female | group C | bachelor's degree | free/reduced | completed | 66.0 | 74.0 | 81.0 |
873 | Laurie Carter | female | group B | some high school | standard | completed | 54.0 | 61.0 | 62.0 |
874 | Amanda Perez | female | group A | high school | standard | completed | 68.0 | 80.0 | 76.0 |
Muestra de 5 datos aleatorios:
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|---|
454 | Emily Devins | female | group B | associate's degree | standard | none | 47.0 | 49.0 | 50.0 |
481 | Micki Banks | female | group A | high school | standard | none | 55.0 | 73.0 | 73.0 |
160 | Michael Reel | male | group D | high school | standard | none | 45.0 | 48.0 | 46.0 |
869 | Amy Sieving | female | group C | high school | standard | none | 29.0 | 29.0 | 30.0 |
409 | Hubert Mitchell | male | group D | high school | standard | none | 66.0 | 69.0 | 63.0 |
4.- Descripción de variables numéricas
math score | reading score | writing score | |
---|---|---|---|
count | 875.000000 | 875.000000 | 875.000000 |
mean | 66.225143 | 69.195429 | 68.114286 |
std | 15.076716 | 14.629270 | 15.174220 |
min | 0.000000 | 17.000000 | 10.000000 |
25% | 57.000000 | 59.000000 | 57.500000 |
50% | 66.000000 | 70.000000 | 69.000000 |
75% | 76.500000 | 79.000000 | 79.000000 |
max | 100.000000 | 100.000000 | 100.000000 |
5.- La cantidad de valores nulos para cada atributo son: names 0 gender 0 race/ethnicity 0 parental level of education 0 lunch 0 test preparation course 0 math score 0 reading score 0 writing score 0 dtype: int64 6.- La cantidad de valores únicos para cada atributo son: names 875 gender 2 race/ethnicity 5 parental level of education 6 lunch 2 test preparation course 2 math score 79 reading score 72 writing score 77 dtype: int64 7.- La cantidad de valores duplicados es 0 8.- Los datos de los nombres duplicados son:
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
---|
9.- Gráfico de caja:
Justificación:
Escriba aquí su justificación
En una de las reuniones con Don Caguayo este les comenta su intención de pasar todas sus notas a la "Escala Chilena". Para esto les facilita la siguiente ecuación:
$$Nota \, Chilena= \% \, de \, logro*0.06 + 1$$Utilizando la ecuación entregada, cree una función que le permita pasar a la escala Chilena las notas que se escapan del rango [1,7] y entregue las notas redondeadas al segundo decimal.
To-Do
Respuesta:
def nota_chilena(dataframe_in, columna):
"""Retorna la columna transformada a escala chilena."""
return (dataframe_in[columna]*0.06 + 1).round(2)
def con_notas_chilenas(dataframe_in):
"""Retorna un dataframe copiado con las columnas de nota en escala chilena."""
dataframe_out = dataframe_in.copy()
dataframe_out[columnas_notas] = nota_chilena(dataframe_in, columnas_notas)
return dataframe_out
Con todas las notas adjuntadas, la prestigiosa Universidad De la Cachaña necesita obtener el promedio general de cada uno de los alumnos. Para esto se le solicita que calcule el promedio con cada una de las asignaturas y el promedio se anexe como una nueva columna al final del DataFrame original. Cabe señalar que las notas deben ser redondeadas al segundo decimal.
To-Do:
Respuesta:
def alumnos_promedio(dataframe_in):
dataframe_out = dataframe_in.copy()
dataframe_out["GPA"] = con_notas_chilenas(dataframe_in)[columnas_notas].mean(axis=1).round(2)
return dataframe_out
df_grades = alumnos_promedio(df_grades)
df_grades.head()
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | GPA | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Rita Courtney | female | group B | some high school | standard | none | 37.0 | 46.0 | 46.0 | 3.58 |
1 | Charles Linstrom | male | group A | bachelor's degree | standard | completed | 80.0 | 78.0 | 81.0 | 5.78 |
2 | Brian Young | male | group C | some high school | standard | none | 73.0 | 66.0 | 63.0 | 5.04 |
3 | Howard Jimenez | male | group E | some high school | standard | completed | 81.0 | 75.0 | 76.0 | 5.64 |
4 | Wayne Wilson | male | group B | some high school | standard | completed | 94.0 | 86.0 | 87.0 | 6.34 |
Respuesta Esperada:
df_grades = alumnos_promedio(df_grades)
df_grades.head()
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | GPA | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Rita Courtney | female | group B | some high school | standard | none | 3.22 | 3.76 | 3.76 | 3.58 |
1 | Charles Linstrom | male | group A | bachelor's degree | standard | completed | 5.80 | 5.68 | 5.86 | 5.78 |
2 | Brian Young | male | group C | some high school | standard | none | 5.38 | 4.96 | 4.78 | 5.04 |
3 | Howard Jimenez | male | group E | some high school | standard | completed | 5.86 | 5.50 | 5.56 | 5.64 |
4 | Wayne Wilson | male | group B | some high school | standard | completed | 6.64 | 6.16 | 6.22 | 6.34 |
Contentos con su desempeño, la Universidad les solicita que obtengan los alumnos quienes han reprobado el año y los que han tenido desempeño de destacado. Para esto se considerará que un alumno ha reprobado el año cuando el promedio de notas es inferior a 4.0 en la Escala Chilena, y se considerará destacado si posee notas superiores a 5.8 en todas las asignaturas (o sea notas superiores a 5.8 en math score
, reading score
y writing score
).
Cree una función que permita obtener tres Dataframes:
To-Do:
Respuesta:
def reprobados_destacados(dataframe_in):
mascara_reprobados = dataframe_in.GPA < 4.
reprobados = dataframe_in[mascara_reprobados]
modificado = dataframe_in.copy()
modificado.loc[mascara_reprobados, "GPA"] = "R"
destacados = dataframe_in.loc[dataframe_in[columnas_notas].min(axis=1) >= 5.8]
destacados = destacados.sort_values("GPA", ascending=False)
return reprobados, modificado, destacados
dataframe1, dataframe2, dataframe3 = reprobados_destacados(df_grades)
dataframe1
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | GPA | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Rita Courtney | female | group B | some high school | standard | none | 3.22 | 3.76 | 3.76 | 3.58 |
1 | Mikki Amaya | female | group A | high school | free/reduced | completed | 3.04 | 3.88 | 3.46 | 3.46 |
2 | Samuel Dougal | male | group C | master's degree | free/reduced | completed | 3.76 | 3.52 | 3.76 | 3.68 |
3 | Rodger Talavera | male | group B | high school | free/reduced | none | 3.16 | 2.74 | 2.62 | 2.84 |
4 | Morris Johnson | male | group A | associate's degree | free/reduced | completed | 3.40 | 4.30 | 4.18 | 3.96 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
81 | Charles Harian | male | group C | some high school | free/reduced | completed | 3.70 | 4.12 | 3.94 | 3.92 |
82 | Wendy Gunnell | female | group D | some high school | free/reduced | none | 2.62 | 3.04 | 2.92 | 2.86 |
83 | Diana Costa | female | group C | high school | free/reduced | none | 3.46 | 3.76 | 3.58 | 3.60 |
84 | Amy Sieving | female | group C | high school | standard | none | 2.74 | 2.74 | 2.80 | 2.76 |
85 | Wanda Russell | female | group B | high school | free/reduced | completed | 2.38 | 3.64 | 3.16 | 3.06 |
86 rows × 10 columns
dataframe2
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | GPA | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Rita Courtney | female | group B | some high school | standard | none | 3.22 | 3.76 | 3.76 | R |
1 | Charles Linstrom | male | group A | bachelor's degree | standard | completed | 5.80 | 5.68 | 5.86 | 5.78 |
2 | Brian Young | male | group C | some high school | standard | none | 5.38 | 4.96 | 4.78 | 5.04 |
3 | Howard Jimenez | male | group E | some high school | standard | completed | 5.86 | 5.50 | 5.56 | 5.64 |
4 | Wayne Wilson | male | group B | some high school | standard | completed | 6.64 | 6.16 | 6.22 | 6.34 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
870 | Richard Young | male | group D | high school | standard | none | 5.14 | 5.50 | 5.26 | 5.3 |
871 | Wanda Russell | female | group B | high school | free/reduced | completed | 2.38 | 3.64 | 3.16 | R |
872 | Marina Zeigler | female | group C | bachelor's degree | free/reduced | completed | 4.96 | 5.44 | 5.86 | 5.42 |
873 | Laurie Carter | female | group B | some high school | standard | completed | 4.24 | 4.66 | 4.72 | 4.54 |
874 | Amanda Perez | female | group A | high school | standard | completed | 5.08 | 5.80 | 5.56 | 5.48 |
875 rows × 10 columns
dataframe3
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | GPA | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Richard Pryor | male | group E | bachelor's degree | standard | completed | 7.00 | 7.00 | 7.00 | 7.0 |
1 | Sandra Pompey | female | group E | associate's degree | standard | none | 7.00 | 7.00 | 7.00 | 7.0 |
2 | Elizabeth Beasley | female | group E | bachelor's degree | standard | none | 7.00 | 7.00 | 7.00 | 7.0 |
3 | Emma Gray | female | group E | bachelor's degree | standard | completed | 6.94 | 7.00 | 7.00 | 6.98 |
4 | Emma Brasher | female | group D | some high school | standard | completed | 6.82 | 7.00 | 7.00 | 6.94 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
89 | Rolando Widmer | male | group C | high school | standard | completed | 5.92 | 6.04 | 5.92 | 5.96 |
90 | Lucille Stanphill | female | group C | high school | standard | none | 5.86 | 6.04 | 5.92 | 5.94 |
91 | Steven Kelly | male | group D | some college | standard | none | 5.86 | 5.92 | 6.04 | 5.94 |
92 | Peter Perez | male | group D | master's degree | standard | none | 5.86 | 5.86 | 6.04 | 5.92 |
93 | Harry Berry | male | group B | associate's degree | standard | completed | 5.86 | 5.92 | 5.92 | 5.9 |
94 rows × 10 columns
Respuesta:
reprobados, modificado, destacados = reprobados_destacados(df_grades)
print("Reprobados")
display(reprobados)
print("Modificado")
display(modificado)
print("Destacados")
display(destacados)
Reprobados
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | GPA | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Rita Courtney | female | group B | some high school | standard | none | 37.0 | 46.0 | 46.0 | 3.58 |
7 | Mikki Amaya | female | group A | high school | free/reduced | completed | 34.0 | 48.0 | 41.0 | 3.46 |
16 | Samuel Dougal | male | group C | master's degree | free/reduced | completed | 46.0 | 42.0 | 46.0 | 3.68 |
39 | Rodger Talavera | male | group B | high school | free/reduced | none | 36.0 | 29.0 | 27.0 | 2.84 |
43 | Morris Johnson | male | group A | associate's degree | free/reduced | completed | 40.0 | 55.0 | 53.0 | 3.96 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
852 | Charles Harian | male | group C | some high school | free/reduced | completed | 45.0 | 52.0 | 49.0 | 3.92 |
854 | Wendy Gunnell | female | group D | some high school | free/reduced | none | 27.0 | 34.0 | 32.0 | 2.86 |
868 | Diana Costa | female | group C | high school | free/reduced | none | 41.0 | 46.0 | 43.0 | 3.60 |
869 | Amy Sieving | female | group C | high school | standard | none | 29.0 | 29.0 | 30.0 | 2.76 |
871 | Wanda Russell | female | group B | high school | free/reduced | completed | 23.0 | 44.0 | 36.0 | 3.06 |
86 rows × 10 columns
Modificado
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | GPA | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Rita Courtney | female | group B | some high school | standard | none | 37.0 | 46.0 | 46.0 | R |
1 | Charles Linstrom | male | group A | bachelor's degree | standard | completed | 80.0 | 78.0 | 81.0 | 5.78 |
2 | Brian Young | male | group C | some high school | standard | none | 73.0 | 66.0 | 63.0 | 5.04 |
3 | Howard Jimenez | male | group E | some high school | standard | completed | 81.0 | 75.0 | 76.0 | 5.64 |
4 | Wayne Wilson | male | group B | some high school | standard | completed | 94.0 | 86.0 | 87.0 | 6.34 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
870 | Richard Young | male | group D | high school | standard | none | 69.0 | 75.0 | 71.0 | 5.3 |
871 | Wanda Russell | female | group B | high school | free/reduced | completed | 23.0 | 44.0 | 36.0 | R |
872 | Marina Zeigler | female | group C | bachelor's degree | free/reduced | completed | 66.0 | 74.0 | 81.0 | 5.42 |
873 | Laurie Carter | female | group B | some high school | standard | completed | 54.0 | 61.0 | 62.0 | 4.54 |
874 | Amanda Perez | female | group A | high school | standard | completed | 68.0 | 80.0 | 76.0 | 5.48 |
875 rows × 10 columns
Destacados
names | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | GPA | |
---|---|---|---|---|---|---|---|---|---|---|
838 | Sandra Pompey | female | group E | associate's degree | standard | none | 100.0 | 100.0 | 100.0 | 7.00 |
435 | Richard Pryor | male | group E | bachelor's degree | standard | completed | 100.0 | 100.0 | 100.0 | 7.00 |
253 | Elizabeth Beasley | female | group E | bachelor's degree | standard | none | 100.0 | 100.0 | 100.0 | 7.00 |
313 | Emma Gray | female | group E | bachelor's degree | standard | completed | 99.0 | 100.0 | 100.0 | 6.98 |
188 | Emma Brasher | female | group D | some high school | standard | completed | 97.0 | 100.0 | 100.0 | 6.94 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
370 | Charles Jackson | male | group E | some high school | standard | none | 30.0 | 26.0 | 22.0 | 2.56 |
786 | Sonia Brent | female | group B | some high school | free/reduced | none | 18.0 | 32.0 | 28.0 | 2.56 |
247 | John Gunter | male | group A | some college | free/reduced | none | 28.0 | 23.0 | 19.0 | 2.40 |
723 | Richard Guercio | male | group B | high school | free/reduced | none | 30.0 | 24.0 | 15.0 | 2.38 |
436 | Luetta Caponigro | female | group B | high school | free/reduced | none | 8.0 | 24.0 | 23.0 | 2.10 |
874 rows × 10 columns
Algunos padres (extremadamente entrometidos y...) preocupados por el desempeño de sus hijos, desean saber estos han aprobado o no el año. Como la tarea de buscar a mano cada uno de estos alumnos en el DataFrame
resulta demasiado tedioso, el rector le solicita crear una función que tenga como entrada el DataFrame
donde se realizará la busqueda y una lista con los nombres de los alumnos, luego la función deberá retornar un DataFrame
con los nombres de los alumnos y el promedio general de cada uno.
Creada la función busque el promedio general de los siguientes alumnos en el segundo Data frame creado en la sección 1.3.3:
['Lucille Stanphill','Marcus Mcfarland', 'Matthew Freeman', 'Dana Mojica', 'Paul Hyder', 'Jeffrey Korn', 'Robert Strain',
'Ronald Jett', 'Lorena Reed', 'Hazel Posey']
Hint: Puede serle de utilidad utilizar isin()
.
Respuesta:
lista_alumnos = [
"Lucille Stanphill",
"Marcus Mcfarland",
"Matthew Freeman",
"Dana Mojica",
"Paul Hyder",
"Jeffrey Korn",
"Robert Strain",
"Ronald Jett",
"Lorena Reed",
"Hazel Posey",
]
consulta_gpa(dataframe2, lista_alumnos)
names | GPA | |
---|---|---|
198 | Lucille Stanphill | 5.94 |
372 | Marcus Mcfarland | 4.86 |
418 | Matthew Freeman | R |
421 | Dana Mojica | R |
675 | Robert Strain | 4.4 |
679 | Ronald Jett | R |
681 | Hazel Posey | R |
753 | Lorena Reed | R |
def consulta_gpa(dataframe_in, lista_alumnos):
return dataframe_in[dataframe_in.names.isin(lista_alumnos)][["names", "GPA"]]
lista_alumnos = [
"Lucille Stanphill",
"Marcus Mcfarland",
"Matthew Freeman",
"Dana Mojica",
"Paul Hyder",
"Jeffrey Korn",
"Robert Strain",
"Ronald Jett",
"Lorena Reed",
"Hazel Posey",
]
consulta_gpa(modificado, lista_alumnos)
names | GPA | |
---|---|---|
198 | Lucille Stanphill | 5.94 |
372 | Marcus Mcfarland | 4.86 |
418 | Matthew Freeman | R |
421 | Dana Mojica | R |
675 | Robert Strain | 4.4 |
679 | Ronald Jett | R |
681 | Hazel Posey | R |
753 | Lorena Reed | R |
El rector, intrigado por saber si existe una relación entre el nivel educacional de los padres y las notas de sus hijos les plantea estudiar dicha relación. Para esto, les solicita que desde el DataFrame
que posee a todos los alumnos (DataFrame
obtenido tras la limpieza), generen una nueva columna que clasifique en quintiles de rendimiento para sus alumnos: ['Bajísimo', 'Bajo', 'Medio', 'Alto', 'Muy alto']
. (investigar qcut()
)
Hecho esto, grafique a través de un scatter plot el nivel educacional de los padres v/s el promedio de los alumnos, utilice los cuantiles generados anteriormente para mostrar en que cuantil se encuentra cada uno de los datos.
Realizado el grafico, responda la siguiente pregunta: ¿Es posible visualizar alguna tendencia en el gráfico?.
Respuesta:
# Código para quintiles
# La variable que contenga los quintiles debe llamarse GPA cuantiles para que el gráfico funcione.
df_grades["GPA cuantiles"] = pd.qcut(df_grades.GPA, 5,
labels=['Bajísimo', 'Bajo', 'Medio', 'Alto', 'Muy alto'])
# Código de gráfico
fig = px.scatter(
df_grades.sort_values(by="GPA"),
x="GPA",
y="parental level of education",
hover_data=["gender"],
color="GPA cuantiles",
title="GPA por Nivel de educación de los padres",
color_discrete_sequence=["tomato", "orange", "yellow", "greenyellow", "green"]
)
fig
Parece efectivamente existir una correlación entre el GPA y el nivel de educación de los padres. Salvo ciertos outliers (notaas muy bajas en todo grupo, y una nota muy alta en el grupo de alumnos con padres que no terminaron la enseñanza media), se observa que en general mientras menor el nivel educativo de los padres, hay menor nota mínima y nota máxima, con excepción del nivel de magíster que no registra notas 7.
Hay que destacar que el rango de notas tiende a ser más grande cuando el nivel educativo es menor.
También es importante que si bien uno puede ver que pocos alumnos con padres que tienen un magíster tienen GPA bajísimo, eso es porque este grupo de estudiantes es pequeño, y en todo análisis hay que recordar considerar que las muestras no son de igual tamaño.
Eso ha sido todo para el lab de hoy, recuerden que el laboratorio tiene un plazo de entrega de una semana. Cualquier duda del laboratorio, no duden en contactarnos por mail o U-cursos.