I’d like to start by saying wow, data is hard. There is so much to think about when it comes to creating and managing a data set. I have so much to think about for the future of my project. What kind of information will I provide in a data set? What will I deem irrelevant? Should I deem any data I can think of irrelevant? I don’t think I should. Dan Cohen mentions in “Eliminating the Power Cord” how we can never anticipate exactly how someone will use our product or data. I may find the exact page or word count of each letter meaningless, but already having that data prepared could be very useful to a researcher. If I can imagine it, I will include it. At least for now.
I created a Google Spreadsheet that can be viewed here.
Originally, I misread our goal for the activity this week. I created my own data, linked above, from 41 of 117 possible letters. I created a tab of identifier information that includes each letter’s file name on my computer so it is easy to find, and the main data set in the first sheet refers back to it. The main data has the first and last name of the letter’s writer in separate fields, as well as the date it was written and the receiver of the letter. I made sure to separate variables into their own fields as advised by Hadley Wickham. I also used the ID structure to separate some data, as he does when he discusses the weekly top hits chart. I found a lot of this article hard to wrap my head around, especially once he started talking about plyr and R. All the code confused me and I had no idea what I was looking at.
I thought the talk about concatenating data by Groot was interesting. I loved the idea of being able to take values from my rows and combine them into a name or date (or, in the case of his examples, a street name). I did find myself wondering why anyone would go through the trouble of separating the data only to concatenate it all again in the spreadsheet. However, I see now that the concatenation is not for the original data set, but for what gets pulled from the data set. I feel silly having had that misconception, but live and learn.
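To make the separate-then-concatenate idea concrete, here is a minimal sketch in Python. It assumes a stored row keeps each variable in its own field and only builds a combined label when something is pulled out for display. The field names and the placeholder values are made up for illustration; they are not taken from my actual spreadsheet.

```python
# Hypothetical stored row: every variable lives in its own field.
# The names and values are placeholders, not real letter data.
letters = [
    {
        "id": "L001",
        "writer_first": "Jane",
        "writer_last": "Doe",
        "receiver": "John Doe",
        "date": "1918-03-07",
    },
]


def writer_name(row):
    """Concatenate the separate name fields into one display name."""
    return f"{row['writer_first']} {row['writer_last']}"


def display_label(row):
    """Build a readable label without changing the stored row."""
    return f"{writer_name(row)} to {row['receiver']}, {row['date']}"


labels = [display_label(row) for row in letters]
print(labels[0])
```

The stored row never changes, so it can still be sorted or filtered by last name alone, while the combined label exists only in the output.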
Realizing my mistake with the assignment after hours of going through my own data and working on transcriptions, I attempted to find WWI-focused data repositories that I could use. I did a Google search and found what I thought were very promising sites. Unfortunately, I could not figure out a way to extract the data I had found, so that was a dead end. I looked at IPUMS-USA for their 1850 to present records. I was excited to see that they had WWI Veteran records. Usable data! And I can extract it? Perfect! I selected the WWI records and focused my samples on the 1910-1930 census. I was discouraged once I was brought to the log-in screen and saw that I needed to request an account. I feared I would have to wait hours or days for that request. Imagine my relief when I discovered it was actually a registration, not a request per se. Then the waiting. I was told I would receive an email when my data was ready to extract. Roadblock. I was not sure how long that would be, so I went on another fruitless search for usable data repositories. Reusable data is hard to find. Luckily, my data was ready in approximately five minutes.
Unfortunately, I had no idea how to use or even access it. I followed the guide given by IPUMS, but that led me nowhere. The guide uses language discussing command files and I had no idea what to do with that. Do I need to download a separate program to view the data? I think so. I tried looking up a tutorial on uploading a .dat file into Excel and found this simple guide, which gave me these results:
Very interesting data it gave me on WWI Veterans. Okay, I did something wrong. What is that? I’m currently trying to figure that out. Next I will try downloading Stata (one of the programs referenced in the IPUMS guide). I should be able to figure something out, right?
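A likely explanation for the jumble, offered as a guess: IPUMS extracts are typically delivered as fixed-width text, where every record is one unbroken string of characters and the codebook or command file says which character positions belong to which variable. Excel has no way to know those positions on its own. The sketch below shows the idea in Python. The three fields, their positions, and the sample records are all made up for illustration; the real ones would have to come from the codebook that accompanies the extract.

```python
import csv
import io

# Hypothetical layout: (field name, first column, last column), 1-indexed
# and inclusive, the way a codebook usually lists positions.
LAYOUT = [
    ("YEAR", 1, 4),
    ("AGE", 5, 7),
    ("WWI_VET", 8, 8),
]


def parse_line(line, layout=LAYOUT):
    """Slice one fixed-width record into named string fields."""
    return {name: line[start - 1:end].strip() for name, start, end in layout}


def parse_fixed_width(lines, layout=LAYOUT):
    """Parse every non-blank line of a fixed-width file or list of lines."""
    return [parse_line(line.rstrip("\n"), layout) for line in lines if line.strip()]


def to_csv_text(records):
    """Turn parsed records into CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up records standing in for lines of a .dat file.
sample_lines = ["19300342\n", "19200281\n"]
records = parse_fixed_width(sample_lines)
print(to_csv_text(records))
```

With a real extract, `parse_fixed_width` could be given an open file instead of `sample_lines`, and the CSV text could be saved to a file and opened in Excel with one variable per column.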
Okay, I gave it another go! I found a 14-day trial for SPSS that took an hour to download (not exaggerating, unfortunately) and was itself a confusing process. Once I restarted my computer like it asked, I could not find the program, only a statistics analyzer. I tried opening the files from IPUMS anyway, but no luck. Hoping to ease some frustration, I looked through what some others did and saw a trend of using data from National Historical GIS. I gave that a go, hoping I could also have some form of success. The first data set I downloaded seemed far too small column-wise to get any real practice in; it only had 5-7 columns. I created a new extract and had more than enough to work with! I was not able to work with data related to my project, but I got as close as I could. I used census information from Chicago for 1920.
There were not many fields I needed to separate, really none at all. I thought about separating the area name, but that somehow did not seem appropriate. There were some columns I completely deleted as those fields were completely blank for the set of data I was working with. Although, thinking back now, perhaps I should have inserted NULL for each entry to indicate that an entry could appear there. Looks like I created a silence in my “tidy” data.
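As a rough sketch of that alternative, the blank columns could be kept and filled with an explicit marker instead of being deleted. The column names and values below are hypothetical stand-ins, not the real NHGIS headers or counts.

```python
# Made-up rows: "foreign_born" stands in for a column that was blank
# for every tract in the extract.
rows = [
    {"tract": "T1", "population": "100", "foreign_born": ""},
    {"tract": "T2", "population": "250", "foreign_born": "  "},
]


def mark_missing(rows, placeholder="NULL"):
    """Return new rows where blank cells carry an explicit placeholder."""
    return [
        {key: (value if value.strip() else placeholder) for key, value in row.items()}
        for row in rows
    ]


marked = mark_missing(rows)
print(marked[0])
```

The column survives, so a later reader can see that the variable exists and simply has no value here.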
I created a separate sheet for the nationality information and age information. I found the way the data was originally organized for nationality counts overwhelming and hard to read. I decided to show all the nationalities in one column so someone can see across a row how many of each nationality were counted in total or for each tract #. I also put male and female information next to each other by tract #. I thought it was foolish to have the male and female completely separated in the original.
For the age information, I did not change much. I noticed that the headers changed pattern. At first they went over 21 and then under 21 for each main heading (example: Male_over21, Male_under21, Female_over21, Female_under21). After the initial four, they separated the information by over 21 and under 21. I put all main headings next to each other for comparison. However, I feel as though this was not sufficient. I think there is a much tidier way to clean this section of the data. Since each column has over 21 and under 21 in common, it may have been wise to make those the two columns or rows. However, how would I include the identifier information for reference?
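One possible answer, sketched in Python under some assumptions: reshape the sheet from wide to long, so that each row holds a single count, and repeat the tract identifier on every row. The header pattern follows the example above; the tract label and the counts are invented for illustration.

```python
# One made-up wide row: a tract identifier plus one column per
# group/age combination, following the Male_over21 header pattern.
wide_rows = [
    {
        "tract": "T1",
        "Male_over21": 10,
        "Male_under21": 4,
        "Female_over21": 12,
        "Female_under21": 5,
    },
]


def melt_age_columns(rows, id_field="tract"):
    """Turn each group/age column into its own row, keeping the identifier."""
    long_rows = []
    for row in rows:
        for column, count in row.items():
            if column == id_field:
                continue
            # "Male_over21" splits into group "Male" and age band "over21".
            group, age_band = column.rsplit("_", 1)
            long_rows.append(
                {id_field: row[id_field], "group": group, "age_band": age_band, "count": count}
            )
    return long_rows


long_rows = melt_age_columns(wide_rows)
for long_row in long_rows:
    print(long_row)
```

The identifier is not lost; it is repeated once per observation, which is what lets each row stand on its own while still pointing back to its tract.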
Data is still very much a difficulty for me.
View my Excel spreadsheets: Clio Tidy Data Assignment