glass_half_full_image

Date: November 14, 2021

My Teammates

These are the wonderful people that were apart of this awesome project.

Alexia Garces	Brooke Holyoak

Jason Tellez	Malachi Hale

Project Planning
Executive Summary
1. Goals
2. Findings
Acquire Data
Prepare Data
1. Distributions
2. Prepare Takeaways
Data Exploration
Modeling & Evaluation
Project Delivery

Project Planning

✓ 🟢 Plan ➜ ☐ Acquire ➜ ☐ Prepare ➜ ☐ Explore ➜ ☐ Model ➜ ☐ Deliver

Project Objectives

Utilize American Trends Panel Datasets (downloadable here), with statistical modeling techniques to assess and attempt to predict sentiment toward particular topics.

This will culminate into a well-built well-documented jupyter notebook that contains our process and derivation of these predictions.

Modules will be created that abstract minutiae aspects of the data pipeline process.

Business Goals

Utilize tabulated statistical data aquired from Pew Research American Trends Surveys.

Prepare, explore and formulate hypthoesis about the data.

Build models that can predict future American sentiment toward certain topics, and utilize hyperparameter optimization and feature engineering to improve validation model performance prior to evaluating on test data.

Document all these steps throughly.

Audience

General population and individuals without specific knowledge or understanding of the topic or subject.

Deliverables

A clearly named final notebook. This notebook will contain more detailed processes other than noted within the README and have abstracted scripts to assist on readability.

A README that explains what the project is, how to reproduce the project, and notes about the project.

A Python module and associated modules that automate the data acquisition and preparation process.

Table of Contents

Executive Summary

Our team acquired Pew Research Panel survey data and utilized this data to explore the drivers of pessimism in American Prospective Attitudes.

Being able to understand what most likely drives pessimistic or optimistic thinking about the future will help business leaders clarify strategies for moving foward.

This project will also help guide expectations of future sucess in the customers these business leaders serve, in addition to the products offered, investment, marketing and sales, and other aspects throughout their organization.

Goals

Build a model that can predict future American sentiment toward certain topics, utilizing split survey data as the training dataset.

Findings

Standard demographic features like age, sex, and income are not drivers of overall pessimism. However, features like what will happen to the average family’s standard of living, cost of healthcare, and the future of the public education system are highly correlated to overall pessimism.

Table of Contents

Acquire Data

✓ Plan ➜ 🟢 Acquire ➜ ☐ Prepare ➜ ☐ Explore ➜ ☐ Model ➜ ☐ Deliver

The data is acquired from Pew Research Panel survey data that asks various demographic questions in conjunction with the survey questions themselves.

The questions all ask for categorical responses from the individuals and the questions pertain to various topics of American life, such as politics and economics.

Working with American Trends Panel Data

Demographic Profile Variables

Each ATP dataset comes with a number of variables prefixed by “F_” (for “frame”) that contain demographic profile data. These variables are not measured every wave; instead, they are sourced from panel profile surveys conducted on a less frequent basis. Some profile variables are also occasionally asked on panel waves and are accordingly updated for each panelist. Profile information is based on panelists’ most recent response to the profile questions. Some variables are coarsened to help protect the confidentiality of our panelists. Interviewer instructions in [ ] and voluntary responses in ( ) are included if the source of a profile variable was ever presented in phone (CATI) mode. See Appendix I for the profile variable codebook.

Unique Identifier

The variable QKEY is a unique identifier assigned to each respondent. QKEY can be used to link multiple panel waves together. Note that except in a few instances, WEIGHT_W41 are only provided for single waves. Use caution when analyzing data from multiple waves without weights that are designed for use with multiple waves.

Data Variable Types

American Trends Panel datasets contain single-punch or multi-punch variables. For questions in a ‘Check all that apply’ format, each option has its own variable indicating whether a respondent selected the item or not. For some datasets there is an additional variable indicating whether a respondent did not select any of the options. Open-end string variables are not included in ATP datasets. Coded responses to open-end questions are included when available.

Dataset Format

The dataset is formatted as a .sav file and can be read with the SPSS software program. The dataset can also be read with the R programming language, using the foreign package. R is a free, open-source program for statistical analysis that can be downloaded here. It can also be used to export data in .csv format for use with other software programs.

NOTE: Using other tools to directly convert the .sav file to another format such as .csv may ERASE value labels. For this reason, it is highly recommended that you use either SPSS or R to read the file directly.

DataFrame Dictionary

Data Dictionary can be viewed here

Takeaways from Acquire:

We acquired a DataFrame from a Pew Research Panel survey which contained 2524 observations and 124 columns.

Each row represents an individual American adult and his or her responses to the survey questions.

Of our 124 columns, 2 are continuous and numeric: qkey and weight. The remaining 122 columns are categorical features.

The weight column indicates the corresponding survey weight of each respondent in the sample. The survey weight indicates how representative an observation is of the total population.

The survey results provide us with information regarding each respondents’ views about the future of the United States. In addition,the acquired data contains demographic data for each respondent, including gender, race, income level, and political affiliation.

Table of Contents

Prepare Data

✓ Plan ➜ ✓ Acquire ➜ 🟢 Prepare ➜ ☐ Explore ➜ ☐ Model ➜ ☐ Deliver

We will import our prepare.py file, which performs a series of steps to clean and prepare our data:

First, we convert the categorical features in the DataFrame to objects.

Second, because our target variable will be the respondents’ prospective thinking, we drop rows for which the respondent refused to answer the question about prospective thinking in the column OPTIMISMT_W41.

Third, we rename the columns as indiciated by our data dictionary above.

Fourth, from the column OPTIMIST_W41, we create new columns is_pes, pes_val, is_very_pes, and is_very_opt.

The column is_pes introduces a Boolean value where 1 indicates a pessimistic outlook and is 0 indicates an optimistic outlook.

The column pes_val ranks a respondent’s pessisism, with 0 being the least pessismistic and 3 being the most pessimistic.

The column is_very_pes introduces a Boolean value where 1 indicates a very pessimistic outlook and 0 indicates a somewhat pessimistic, somewhat optimistic, or very optimistic outlook.

The column is_very_opt introduces a Boolean value where 1 indicates a very optimistic outlook and 0 indicates a somewhat optimistic, somewhat pessimistic, or very pessimistic outlook.

Fifth, we create a replace_key which transforms every response in the categorical columns to a corresponding numeric value. We also introduce a revert_key which reverts the numeric values back to the original string responses.

Finally, we convert the column indicating the unique identity of each respondent QKEY to an integer.

Additionally, we split the data into train, validate, and test datasets, stratifying on the target feature is_pes.

Prepare Takeaways

Utilizing the functions in our prepare.py we implemented a series of functions to clean our data.

We eliminated nine respondents from our dataset because these respondents refused to answer the question OPTIMIST_W41 about prospective thinking of the US’ future.

Our newly created target feature is_pes maps the responses to question OPTIMIST_W41 “Somewhat pessimistic” and “Very pessimistic” as the single Boolean value 1 and the responses “Somewhat optimistic” and “Very optimistic” to the single Boolean value 0.

Stratifying on is_pes, we split our data into train, validate, and test, datasets of lengths 1408, 604, and 503, respectively.

Table of Contents

Explore Data

✓ Plan ➜ ✓ Acquire ➜ ✓ Prepare ➜ 🟢 Explore ➜ ☐ Model ➜ ☐ Deliver

We dropped columns that were too closely related to the derivative of our target column (is_pes):

Those columns ended up being avg_family, attitude, pes_val, is_very_pes and is_very_opt.

We also dropped qkey since it is only an id value and will not provide any information since each is a unique value.

We split our train, validate, and test columns to feature dataframes and target series.

Statistical Test Results

column_name	chi2	p_val	deg_free	expected_freq
`happen_general`	309.847	5.21955e-68	2	[[ 65.23082386 51.76917614]
				[299.39275568 237.60724432]
				[420.37642045 333.62357955]]
`happen_pub_ed`	236.57	4.26122e-52	2	[[ 75.26633523 59.73366477]
				[411.45596591 326.54403409]
				[298.27769886 236.72230114]]
`happen_child_f2`	180.023	8.72154e-39	3	[[152.76278409 121.23721591]
				[210.18821023 166.81178977]
				[ 34.56676136 27.43323864]
				[387.48224432 307.51775568]]
`happen_race`	157.809	5.39807e-35	2	[[317.79119318 252.20880682]
				[398.07528409 315.92471591]
				[ 69.13352273 54.86647727]]
`happen_health`	145.05	3.18329e-32	2	[[447.6953125 355.3046875 ]
				[265.94105114 211.05894886]
				[ 71.36363636 56.63636364]]

Univariate Distributions

univariate_exploration_plots

Males vs Females Pessimisim

Male_vs_female_pessimisim_plot

Family Income Pessimisim

family_income_pessimism_plot

Educational Attainment Pessimisim

educational_attainment_pessimisim_plot

Hypotheses & Testing

Hypothesis 1

H₀: Is sex independent of a pessimsitic future outlook?

H_a: Sex is dependent on pessimist future outlook.

ɑ: 0.05

Hypothesis 1 Takeaways

The number of females and males who are overall pessimistic are about the same.

The p-value is above 0.05, so we accept the null hypothesis.

Hypothesis 2

H₀: Is income independent of a pessimistic future outlook?

H_a: Income is dependent on pessimistic future outlook.

ɑ = 0.05

Hypothesis 2 Takeaways

While it appears that people who are middle income earners are more pessimistic, there is no significance for overall income related to overall pessimism.

The p-value is above 0.5, so we accept the null hypothesis.

Explore Takeaways

There a small differences in future outlook when considering sex, income and education.

While the small differences exist, they do not appear to be significant.

Through chi-squared testing, we verify that there is not a significant relationship between theses features and our target.

Even though they are not necessarily drivers of overall future outlook, the findings are still helpful in isolation.

Table of Contents

Modeling & Evaluation

✓ Plan ➜ ✓ Acquire ➜ ✓ Prepare ➜ ✓ Explore ➜ 🟢 Model ➜ ☐ Deliver

We first have to choose a baseline to compare our models against.

The main models we will be using are Decision Tree, Random Forest, and K-Nearest Neighbor.

We will use different variations of our models until we determine one model to have outperformed our baseline and to have avoided overfitting or underfitting on the training data.

We will also be testing feature importance to see what drives an individual’s overall attitude.

Once we choose our best model, we will run this model on our Out-of-sample dataset.

Baseline

With a non-pessimistic attitude as our baseline, we calculated our accuracy by asuming that every respondent was non-pressimistic. This method gave us an accuracy of 55.75%.

Decision Tree

Utilizing the decision_tree_models function from our model.py file, we created a series of Decision Tree models with varying depths. Using our test_a_model function from the model.py, we calculated the accuracies of these models on the train and validate datasets for each of these models.

Random Forest

Utilizing the random_forest_models function from our model.py file, we created a series of Random Forest models with varying depths and min samples leaf. Using our test_a_model function from the model.py, we calculated the accuracies of these models on the train and validate datasets for each of these models.

K Nearest Neighbors

Utilizing the random_forest_models function from our model.py file, we created a series of K Nearest Neighbors models with varying numbers of neighbors. Using our test_a_model function from the model.py, we calculated the accuracies of these models on the train and validate datasets for each of these models.

Other Models

We used the models Linear SVC, Logistic Regression, and Naive Bayes to classifiy pessismistic respondents. We then used the test_a_model function to evaluate the accuracy of these models on the train and validate datasets.

Feature Importance

Of the models mentioned above, our best performing model was the Random Forest Classifier with depth 8, min samples leaf 3. We utilized this model to perform feature importance on the features in our dataset. We found that public education and U.S. economics are major drivers of pessimism.

feature_importance_plot

Modeling with just the top features

Feature importances gave us a ranked order of the features by importance in predicting pessimism. Using these ordered features, we ran a series of Random Forest Classifier models using just the top thirty most important features and just the forty most important features, using varying parameters. None of these models, however, outperformed the the Random Forest Classifier with depth 8, min samples leaf 3 using all features.

Model Comparison

Our best performing model was the Random Forest Classifier which included all features and had min samples leaf 3 and a depth of 8. This model had an accuracy of 80.46% on the validate dataset.

model_comparison

Modeling the Gender Subset

We ran Random Forest Classification models on the subsets of female and male respondents.

Most Important Issues For Women	Most Important Issues for Men

Modeling the Political Party Subsets

We ran Random Forest Classification models on the subsets of Republican and Democrat respondents.

Most Important Issues For Republicans	Most Important Issues for Democrats

Modeling the Income Level Subsets

We ran Random Forest Classification models on on the subsets for income groups less than $30,000, between $30,000 and $75,000, and more than $75,000.

Most Important Issues For Lower Income Level	Most Important Issues for Middle Income Level	Most Important Issues for Upper Income Level

Modeling the Education Level Subsets

We ran Random Forest Classification models on the subsets for the respondents’ highest education level, grouped by: high school or less, some college, or college graduate and above.

Most Important Issues For Highest Education High School or Less	Most Important Issues For Highest Education Some College	Most Important Issues For Highest Education College Degree

Out of Sample

We ran our best performing model, selected above on the out-of-sample test dataset. We achieved a 76.54% accuracy.

Modeling Takeaways

Big drivers of pessimism are public education and economics

Some other main drivers are job benefits and job security, race relations, standards of living, healthcare, and the country’s world status are also very important to adults

We chose the most common result of the target column as our baseline with an accuracy of 55.75%.

We ran over 200 variations of Decision Tree, Random Forest, K-Nearest Neighbor, and other models

Overall, the model with the best performances was the Random Forest with

max_depth = 8

min_samples_leaf = 3

Accuracy:

train (In-sample) = 92.05%

validate (Out-of-sample) = 80.46%

test (Out-of-sample) = 76.54%

Table of Contents

Project Delivery

✓ Plan ➜ ✓ Acquire ➜ ✓ Prepare ➜ ✓ Explore ➜ ✓ Model ➜ 🟢 Deliver

Currently we are achieving an Out-of-sample accuracy of ~76% on our test data and we believe with further feature engineering and hyper-parameter optimization we could achieve a higher accuracy.

Conclusion and Next Steps

While it appeared that there may have been a significant difference between the genders and their pessimisim, it was not result in this instance.

Additionally, our other potential observation, that there would be a significant difference in the pessimisim reletive to income, it was again not the result in this instance.

The next step is to continue finalizing the work and ensuring our work is throughly documented.

With more time we will continue examining multiple different feature combinations and test for significance from these observations.

Project Replication

Statistical data can be downloaded from here.

You can read the SPSS Statistic data file with pandas.read_spss("ATP W41.sav")

Data Use Agreements

The source of the data with express reference to the center in accordance with the following citation: “Pew Research Center’s American Trends Panel”

Any hypothesis, insight and or result within this project in no way implies or suggests as attributing a particular policy or lobbying objective or opinion to the Center, and

“The opinions expressed herein, including any implications for policy, are those of the author and not of Pew Research Center.”

Information on The American Trends Panel (ATP) can be found at The American Trends Panel

More information on these user agreements can be found at Pew Research.

Table of Contents

Citation

“American Trends Panel Wave 41.” Pew Research Center, Washington, D.C. (December 27, 2018).

Discovering Drivers of American Attitudes

Date: November 14, 2021

My Teammates

Alexia Garces

Brooke Holyoak

Jason Tellez

Malachi Hale

Table of Contents

Project Planning

Project Objectives

Business Goals

Audience

Deliverables

Executive Summary

Goals

Findings

Acquire Data

Working with American Trends Panel Data

Demographic Profile Variables

Unique Identifier

Data Variable Types

Dataset Format

DataFrame Dictionary

Takeaways from Acquire:

Prepare Data

Prepare Takeaways

Explore Data

Statistical Test Results

Univariate Distributions

Males vs Females Pessimisim

Family Income Pessimisim

Educational Attainment Pessimisim

Hypotheses & Testing

Hypothesis 1

Hypothesis 1 Takeaways

Hypothesis 2

Hypothesis 2 Takeaways

Explore Takeaways

Modeling & Evaluation

Baseline

Decision Tree

Random Forest

K Nearest Neighbors

Other Models

Feature Importance

Modeling with just the top features

Model Comparison

Modeling the Gender Subset

Modeling the Political Party Subsets

Modeling the Income Level Subsets

Modeling the Education Level Subsets

Out of Sample

Modeling Takeaways

Project Delivery

Conclusion and Next Steps

Project Replication

Data Use Agreements