The relationship between health and wealth is a cornerstone of modern socioeconomic study. While much attention is paid to diet and exercise, sleep is often the invisible metric of productivity. According to research by the RAND Corporation, sleep deprivation costs the U.S. economy up to $411 billion annually, the largest loss among the countries the study examined. This project examines whether this burden is shared equally or whether “sleep debt” is a tax paid primarily by lower-income regions.
Our investigation is driven by three primary inquiries:
Is there a statistically significant correlation between a state’s median household income and its prevalence of short sleep duration?
Which specific states act as outliers, maintaining high income despite high sleep debt, or vice versa?
Do states grouped by “high sleep debt” show a distinct, lower income distribution compared to “low sleep debt” states?
This analysis matters for public policy. If sleep deprivation is geographically and economically clustered, it suggests that exhaustion is not just a personal choice but a systemic symptom of financial instability. Identifying these “sleep debt” zones helps us understand how environmental stressors in lower-income states might hinder economic mobility through chronic fatigue.
With these zones identified, policymakers can better target health and economic interventions and test whether there is a measurable “poverty trap” in which economic status limits the physiological ability to recover.
When handling public health and economic data, privacy is the paramount concern. Both the CDC and Census Bureau datasets are strictly aggregated at the county or state level, meaning no individual records are accessible, thus protecting participant anonymity. However, we must acknowledge coverage bias. The CDC PLACES model relies on telephone surveys (BRFSS), which may systematically underrepresent populations without stable housing or those who primarily use unlisted mobile numbers.
Furthermore, we must consider the human value of this data. Using statistics to label certain regions as “unhealthy” can lead to stigmatization. Our goal is to use this data as a tool to highlight where resources are most needed rather than to judge the lifestyle choices of residents in different economic tiers.
The first dataset comes from the CDC PLACES program, developed by the Centers for Disease Control and Prevention (CDC) to provide local-level estimates of health behaviors across the United States. It is derived from the Behavioral Risk Factor Surveillance System (BRFSS), the nation’s largest ongoing health survey, which conducts more than 400,000 annual telephone interviews with U.S. adults. The 2022 release uses model-based methods to estimate the percentage of adults in each county who sleep fewer than seven hours per night, reported as age-adjusted prevalence to allow fair comparisons across regions with different age distributions. This dataset is well suited for identifying broad geographic patterns—such as where sleep deprivation is concentrated—and comparing them with economic factors, though its model-based approach may smooth over hyper-local variation, particularly in sparsely populated rural counties.
The income data originates from Table H-8 of the Current Population Survey (CPS), a joint effort between the Bureau of Labor Statistics and the U.S. Census Bureau. The survey records data from a sample of approximately 100,000 addresses per year, using a rotating panel design to ensure a representative cross-section of the American population. Our analysis utilizes the 2022 “Current Dollar” estimates, which provide median household income figures for every state and the District of Columbia. While this is considered the gold standard for U.S. economic data, it relies on self-reported income, which can be subject to recall bias or underreporting of non-traditional income streams.
To provide a mathematical answer to our first big-picture question, we calculated the **Pearson correlation coefficient** (\(r\)).
[1] "Correlation Coefficient: -0.463"
A chart alone can be deceptive. By using cor(), an intermediate statistical tool, we gain a standardized score between -1 and 1. This value tells us the direction of a linear relationship and how strong it is, allowing us to support our claims with more than a visual guess.
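As a minimal sketch of this step, the snippet below computes a Pearson coefficient with cor() on a small toy data frame. The state labels and values are hypothetical and are not the project's data. It also calls cor.test(), which the report does not say was used, as one possible way to attach a p-value and confidence interval to the coefficient when asking whether the correlation is statistically significant.

```r
# Toy data (hypothetical values, NOT the project's dataset):
# median household income in dollars and % of adults sleeping < 7 hours.
states <- data.frame(
  state       = c("A", "B", "C", "D", "E", "F"),
  income      = c(52000, 58000, 64000, 71000, 80000, 90000),
  short_sleep = c(39.5, 37.0, 36.2, 33.8, 34.5, 31.0)
)

# Pearson correlation coefficient (the default method of cor()).
r <- cor(states$income, states$short_sleep, method = "pearson")
print(paste("Correlation Coefficient:", round(r, 3)))

# cor.test() tests H0: true correlation = 0 and reports a 95% confidence
# interval for r. This is an assumed extra step, not one the report names.
test <- cor.test(states$income, states$short_sleep)
print(test$p.value)
print(test$conf.int)
```

With only a handful of rows the confidence interval is wide; on roughly 50 state-level observations the same calls would give a much tighter estimate.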
By arranging the data and slicing the top 10 most sleep-deprived states, we can identify geographic clusters. This step is necessary to see if the “sleep debt” is a national issue or concentrated in specific regional economies.
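The arrange-and-slice pattern described here might look like the following sketch. The tibble and its column names (state, short_sleep) are assumptions for illustration, and the example keeps the top 3 rows of a five-row toy table rather than the top 10.

```r
library(dplyr)

# Hypothetical state-level table (NOT the project's data):
# short_sleep is the % of adults sleeping fewer than 7 hours.
sleep <- tibble(
  state       = c("A", "B", "C", "D", "E"),
  short_sleep = c(33.1, 41.2, 36.5, 38.9, 30.4)
)

# Sort from most to least sleep-deprived, then keep the first n rows.
top_states <- sleep %>%
  arrange(desc(short_sleep)) %>%
  slice_head(n = 3)   # the report uses the top 10

print(top_states)
```

dplyr also offers slice_max(short_sleep, n = 3), which combines both steps but keeps ties by default, so the row count can exceed n.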
We utilized case_when() to transform our continuous sleep data into categorical “levels.” This allowed us to create a boxplot to compare the median and spread of income across different tiers of sleep health.
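A rough sketch of this binning step follows. The cut points (33% and 37%), the level names, and the toy values are all illustrative assumptions; the report does not state the thresholds it used.

```r
library(dplyr)
library(ggplot2)

# Hypothetical state-level data (NOT the project's dataset).
states <- tibble(
  state       = c("A", "B", "C", "D", "E", "F"),
  income      = c(52000, 58000, 64000, 71000, 80000, 90000),
  short_sleep = c(39.5, 37.0, 36.2, 33.8, 32.5, 31.0)
)

# case_when() checks conditions top to bottom and uses the first match.
# Thresholds of 37 and 33 are assumed for illustration only.
states <- states %>%
  mutate(
    sleep_debt = case_when(
      short_sleep >= 37 ~ "High",
      short_sleep >= 33 ~ "Medium",
      TRUE              ~ "Low"
    ),
    # An ordered set of levels keeps the boxplot axis in a sensible order.
    sleep_debt = factor(sleep_debt, levels = c("Low", "Medium", "High"))
  )

# Median income per tier, the quantity the boxplot's center line shows.
tier_medians <- states %>%
  group_by(sleep_debt) %>%
  summarise(median_income = median(income))

# Boxplot of income by sleep-debt tier.
p <- ggplot(states, aes(x = sleep_debt, y = income)) +
  geom_boxplot() +
  labs(x = "Sleep debt level", y = "Median household income (USD)")
```

Note that the final TRUE branch also catches missing values, so rows with an NA in short_sleep would be labelled "Low" unless they are filtered out first.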
Moderate Negative Correlation: Our analysis finds a moderate negative relationship (r = -0.463) between income and sleep deprivation. States with lower median incomes tend to report higher percentages of residents sleeping less than 7 hours.
Geographic Disparities: The “High Sleep Debt” states align with regions that historically face economic challenges, suggesting that sleep debt is a regional public health crisis linked to poverty.
Income Stability: Our categorical analysis suggests that while high-income states aren’t immune to sleep issues, low-income states have a much higher “floor” for sleep deprivation, suggesting that short sleep is pervasive across that group rather than confined to a few outliers.
Bias Assessment: We believe our data might underestimate sleep debt in urban centers where “gig economy” workers may be less likely to respond to standard phone surveys.
To better answer our questions, we would like to join this data with “Cost of Living” indices. It is possible that residents in high-income states like California or New York still suffer from high sleep debt because their higher nominal income is offset by extreme housing costs, forcing longer working hours. Incorporating “hours worked per week” would provide the final piece of the productivity puzzle.
Intermediate Tool - the Correlation Coefficient: We used cor() to quantify the relationship between our two primary variables.
Standard Tools: All data wrangling relied on readxl, dplyr, and stringr as covered in class.
stringr::str_extract(): Used to pull 4-digit years from messy Excel headers. as.numeric() alone failed here, returning NA, because the headers contained text like “2022 (Current)”.
ggplot2 Customization: Used reorder() to rank bars and added human-readable labels for all axes and scales.
ggplotly: Used ggplotly() to enhance user engagement and provide exact values without cluttering the visual field with text labels.
Quarto Dashboard Formatting: We used orientation: columns and manual {width=...} tags to ensure the most important insights were prominent.
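To illustrate the str_extract() item above, here is a small sketch of pulling years out of header strings. The header values are hypothetical examples in the style the text describes, and the pattern "\\d{4}" is one plausible choice, not necessarily the one the project used.

```r
library(stringr)

# Hypothetical Excel column headers (assumed for illustration).
headers <- c("State", "2022 (Current)", "2021", "2020 (41)")

# as.numeric() gives NA for any header that carries extra text.
naive <- suppressWarnings(as.numeric(headers))
print(naive)

# str_extract() returns the first run of four digits, or NA if none exists.
years <- as.integer(str_extract(headers, "\\d{4}"))
print(years)
```

A stricter pattern such as "\\b(19|20)\\d{2}\\b" would avoid matching the first four digits of a longer number, at the cost of assuming all years fall in the 1900s or 2000s.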
Hao Yang (University of Washington; hyang643@uw.edu): Interested in exploring how sleep deprivation patterns relate to economic outcomes across U.S. states.
Zehuan Mei (University of Washington; zehuanm@uw.edu): Interested in applied data science and communicating findings through clear visualizations.
Lab Section: BC 7 | TA: Alissa Lau