Suggested Sources for Messy, Unclean Datasets
If you’re looking for “ugly” datasets that are intentionally messy (inconsistent formats, structural issues, missing values, duplicates, or data that’s tough to import into tools like Excel or Access because of bad organization), these resources focus on real-world or simulated dirty data for data cleaning training. I’ve prioritized sources with free, downloadable datasets that match your criteria, mostly in CSV, Excel, or other formats that require wrangling. Many come from community recommendations or curated lists.
1. Kaggle Datasets Tagged for Cleaning Practice
Kaggle has a ton of user-uploaded datasets specifically labeled as “dirty” or “unclean” for data cleaning exercises. These often include scraped data with real-world messiness like inconsistent columns, mixed data types, or import problems (e.g., non-UTF-8 encoding or files too large for Excel to open). Search for terms like “dirty dataset” or “data cleaning practice” on kaggle.com/datasets.
- Examples:
  - Dirty Dataset to Practice Data Cleaning: Scraped from Wikipedia, with issues like irregular formatting and missing entries. Link: kaggle.com/datasets/amruthayenikonda/dirty-dataset-to-practice-data-cleaning
  - Movies Dataset for Feature Extraction: Web-scraped IMDb data on Netflix movies/TV shows; problems include inconsistent ratings, genres lumped together, and null values. Link: kaggle.com/datasets/bharatnatrayn/movies-dataset-for-feature-extracion-prediction
  - Food Choices: Survey on college students’ food preferences; messy with coded responses, missing nutrition data, and inconsistent scales. Link: kaggle.com/datasets/borapajo/food-choices
  - Data Science Job Postings on Glassdoor: Unclean scraped job data; issues like salary ranges in text, duplicate postings, and varying company formats. Link: kaggle.com/datasets/rashikrahmanpritom/data-science-job-posting-on-glassdoor
  - Audible Dataset: Scraped audiobook data; challenges include lumped author/narrator fields, inconsistent durations, and encoding errors. Link: kaggle.com/datasets/snehangsude/audible-dataset
  - Mexican Federal Government Salaries: Raw salary data; messy with varying position titles, currency inconsistencies, and structural hierarchies. Link: kaggle.com/datasets/ivansabik/mexican-federal-government-salaries
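Encoding errors are usually the first wall you hit with scraped files like these. Here’s a minimal sketch of one way to get past it, assuming the file is either UTF-8 or Windows-1252 (a guess on my part, not something these datasets document). The function name `decode_with_fallback` is made up for illustration.

```python
def decode_with_fallback(raw, encodings=("utf-8-sig", "cp1252")):
    """Decode bytes by trying each encoding in order.

    Returns (text, encoding_used). The candidate list is an assumption;
    adjust it to whatever the data source is likely to have produced.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to some character, so this never raises,
    # but the characters may be wrong. Treat it as a last resort.
    return raw.decode("latin-1"), "latin-1"
```

To use it on a download, read the file as bytes first (for example `pathlib.Path("some_file.csv").read_bytes()`, where the file name is a placeholder) and pass the decoded text to `csv.reader` via `io.StringIO`. Checking which encoding was actually used is worth logging, since a silent fallback can hide garbled characters.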
2. GitHub Repositories with Messy Data
GitHub hosts repos dedicated to unclean datasets, often with before/after examples. These can be tricky to import due to non-standard structures or multiple files.
- eyowhite/Messy-dataset: Contains unclean data science job postings in CSV; issues like missing headers, inconsistent locations, and salary parsing. Includes cleaned versions for comparison. Link: github.com/eyowhite/Messy-dataset
- Ask a Manager Salary Survey: Real survey data with free-text fields (e.g., job titles, industries) leading to duplicates, misspellings, and import headaches from wide formats. Note that this one is hosted as a Google Sheet rather than on GitHub. Link to dataset: docs.google.com/spreadsheets/d/1IPS5dBSGtwYVbjsfbaMCYIWnOuRmJcbequohNxCyGVw
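Salary parsing comes up in both the job-posting and survey data above, so here’s a rough sketch of the idea. It assumes salaries appear as text like `$53K-$91K (Glassdoor est.)`; that format is my assumption, and the real columns may differ. The name `parse_salary_range` is hypothetical.

```python
import re

# Matches a dollar amount given in thousands, e.g. "$53K" or "$ 91k".
_AMOUNT = re.compile(r"\$\s*(\d+(?:\.\d+)?)\s*[Kk]")

def parse_salary_range(text):
    """Return (low, high) in dollars, or None if fewer than two amounts are found."""
    amounts = [float(m) * 1000 for m in _AMOUNT.findall(text)]
    if len(amounts) < 2:
        return None
    # min/max keeps the result ordered even if the text lists the high end first.
    return int(min(amounts)), int(max(amounts))
```

Returning `None` for anything that doesn’t match (say, “Competitive”) keeps the unparseable rows visible, so you can count them instead of silently dropping them.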
3. Foresight BI Dirty Data Samples
This site offers Excel files simulating common real-life messes, well suited to practicing import problems (e.g., merged cells, or lumped data that breaks tabular import into Access). Each includes a raw dirty sheet and a target clean one.
- Badly Structured Sales Data 1: Mixed rows/columns needing rearrangement; watch for totals. Link: foresightbi.com.ng/wp-content/uploads/2020/05/1.-Badly-Structured-Sales-Data-1.xlsx
- Badly Structured Sales Data 2: Similar, with dates and no totals. Link: foresightbi.com.ng/wp-content/uploads/2020/05/2.-Badly-Structured-Sales-Data-2.xlsx
- Badly Structured Sales Data 3: Rearrange into five columns, with totals. Link: foresightbi.com.ng/wp-content/uploads/2020/05/3.-Badly-Structured-Sales-Data-3.xlsx
- Badly Structured Sales Data 4: Variation of the above. Link: foresightbi.com.ng/wp-content/uploads/2020/05/4.-Badly-Structured-Sales-Data-4.xlsx
- Jumbled Customer Details: Copied web data needing column separation. Link: foresightbi.com.ng/wp-content/uploads/2020/05/5.-Jumbled-up-Customers-Details.xlsx
- Hospital Data With Mixed Numbers and Characters: Numbers swapped with letters. Link: foresightbi.com.ng/wp-content/uploads/2020/05/6.-Hospital-Data-with-Mixed-Numbers-and-Characters.xlsx
- Medicine Data With Lumped Quantity and Measure: Split the lumped fields. Link: foresightbi.com.ng/wp-content/uploads/2020/05/7.-Medicine-Data-with-lumped-Quantity-and-Measure.xlsx
- Invoices With Merged Categories and Amounts: Lumped items needing row expansion. Link: foresightbi.com.ng/wp-content/uploads/2020/05/8.-Invoices-with-Merged-Categories-and-Merged-Amounts.xlsx
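Splitting a lumped quantity-and-measure field, like the medicine file above asks for, might look something like this. It’s a sketch under the assumption that values look like `500mg` or `2.5 ml`; I haven’t confirmed the exact cell contents, and `split_quantity` is a made-up name.

```python
import re

# A number, optional whitespace, then a unit made of letters or "%".
_QTY_UNIT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z%]+)\s*$")

def split_quantity(value):
    """Split '500mg' into (500.0, 'mg'); return None if the value doesn't fit the pattern."""
    m = _QTY_UNIT.match(value)
    if m is None:
        return None
    # Lowercasing the unit folds 'ML' and 'ml' into one spelling.
    return float(m.group(1)), m.group(2).lower()
```

The same split can be done in Excel with Flash Fill or Power Query; the Python version is just easier to rerun when the raw file changes.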
4. Government and Open Data Portals
Sites like these often provide raw, unpolished data from real sources, which can be hard to import due to large sizes, inconsistent schemas, or non-tabular formats.
- Data.gov: U.S. government open data; search for raw datasets in domains like health or environment. These are often messy, with gaps, varying units, or multi-file zips. Link: data.gov
- Google Dataset Search: Aggregates datasets from various sources; search for terms like “unclean” or for domain-specific raw data. Good for finding messy ones like survey responses. Link: datasetsearch.research.google.com
5. Curated Lists from Blogs and Communities
- Datasets for Data Cleaning Practice (Blog Post): A compilation of 14 datasets with specific cleaning challenges, like gaps in weather data or structural issues in surveys. Full list: makingnoiseandhearingthings.com/2018/04/19/datasets-for-data-cleaning-practice. Examples include:
  - Hourly Weather Surface – Brazil: Gaps and inconsistencies across stations (source: INMET Brazil).
  - ICOADS (Ocean-Atmosphere Data): Bad values, duplicates, and missing timestamps over centuries.
  - London Air Quality: Outliers, nulls, and date pivots.
  - WikiPlots: Incomplete plots and irrelevant text in 112K entries. Download: plots.zip (updated 2017).
- Maven Analytics Data Playground: Free real-world datasets like e-commerce orders or LEGO sets; some are semi-structured and need cleaning for import. Link: mavenanalytics.io/data-playground
- Inside Airbnb: Scraped rental data; messy with varying property details and geodata inconsistencies. Link: insideairbnb.com
Additional Tips
- For even more, check Reddit (e.g., r/learnSQL or r/datasets) or X (formerly Twitter) for user-shared “dirty datasets”; recent posts often link to custom ones. Start with the Foresight BI Excel files if you want quick, import-challenging files. Datasets that exceed Excel’s worksheet limit of 1,048,576 rows can’t be opened in full there, which naturally forces alternatives like Python or SQL and aligns with your training goals.
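If you want to know up front whether a CSV will fit in Excel, you can count its records without loading the whole thing. This is a minimal sketch using only Python’s standard library; the helper names `count_rows` and `fits_in_excel` are mine, not from any of the sources above.

```python
import csv

EXCEL_MAX_ROWS = 1_048_576  # rows per worksheet in .xlsx, header row included

def count_rows(lines):
    """Count CSV records in an iterable of lines (such as an open file).

    Streams one record at a time, so memory use stays flat. Quoted fields
    that contain line breaks are counted as a single record.
    """
    return sum(1 for _ in csv.reader(lines))

def fits_in_excel(lines):
    """True if every record, header included, fits on one worksheet."""
    return count_rows(lines) <= EXCEL_MAX_ROWS
```

Open the file with `open(path, newline="", encoding="utf-8")` and pass the file object straight in; `newline=""` is what the `csv` module expects so that embedded line breaks are handled correctly.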