Firsts in American baby-naming

Even more practice with text filters, this time to find when baby names first became known.

Due: Tuesday, February 10
Points: 3 (Extra Credit)

This is yet another assignment using the U.S. Social Security Administration’s data. Using the nationwide set of baby names, find the first year in which each combination of baby name and gender had at least 100 babies. For example, Pat,F first had at least 100 babies in 1923, while Pat,M made its 100-baby mark in 1914.

This is a continuation of More analysis of trends in American baby-naming, though considerably less complicated. It is meant mostly as a review of grep and of regular expressions, which will continue to be important for any kind of programming you do, including data-filtering and data-visualization tasks.

Deliverables

  • A folder named `homework/ssa-baby-name-fun`

    If you did the More analysis of trends in American baby-naming extra-credit, this folder will already exist, and your data-hold should already have the nationwide baby names. If you didn’t do that assignment, read the Hints section below to get the data bootstrapped.

      |-compciv/
        |-homework/
           |--ssa-baby-name-fun/
              |--first-year.sh
              |--data-hold/
                 |--names-nationwide/
                    |--yob1880.txt
                    |--(yob1881.txt etc. etc.)
    
  • The `first-year.sh` script

    With the assumption that the SSA nationwide baby name data exists in data-hold/names-nationwide, the first-year.sh script will:

    • Find every combination of name and gender that has had at least 1,000 babies in a single year.
    • For each combination above, find the first year in which that combo had at least 100 babies.

    The output should be sorted alphabetically and look like this:

                  Aaden,M,2007
                  Aaliyah,F,1994
                  Aaron,M,1880
                  Abby,F,1954
                  Abel,M,1924
    

    See the Hints section below for examples of what your output should look like.

  • No repetition in your regexes

    Time to use regular expressions (more) like a pro. If you use a regex that looks like this:

        '[0-9][0-9][0-9]'
    

    you will lose points. Stop copy-paste-repeating yourself and use the proper repetition syntax.
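As a small illustration of the repetition syntax, here is a sketch using made-up sample lines (not real SSA figures). In extended regular expressions, `{3}` means "exactly three of the preceding item", so it can replace the copy-pasted character class:

```shell
# Made-up sample lines in name,gender,count form (not real SSA figures)
sample='Ann,F,7
Bob,M,42
Cat,F,123'

# Repetitive version: spells out the digit class three times
repeated=$(printf '%s\n' "$sample" | grep -E '[0-9][0-9][0-9]')

# Same match, written with the {3} repetition quantifier
concise=$(printf '%s\n' "$sample" | grep -E '[0-9]{3}')

echo "$concise"
```

Both patterns select the same line here; the second one just says it once.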

  • Hints

    Bootstrapping the data

    If you already did the extra-credit, More analysis of trends in American baby-naming, then you'll already have the data as needed. If not, run the following from inside your ssa-baby-name-fun folder:

    mkdir -p ./data-hold/names-nationwide 
    cd data-hold/names-nationwide
    
    curl http://stash.compciv.org/ssa_baby_names/names.zip \
      -o names.zip
    
    unzip -o names.zip
    rm names.zip && cd ../..
    

    The code above will save the data in this structure:

       |--ssa-baby-name-fun/
          |--first-year.sh
          |--data-hold/
             |--names-nationwide/
                |--yob1880.txt
                |--(yob1881.txt etc. etc.)
    

    Your toolset

    For a not-fancy-but-at-least-it-works solution, you should not need anything more than cat, grep, cut, sort, uniq, head, and a for-loop, which is what the Solution below uses.

    Check the Unix tools page, as it contains all the tools and relevant options you'll need. And definitely brush up on regular expressions.

    Thinking in numerical patterns

    To satisfy the requirement of finding baby names that have had "at least 1,000 babies in a single year", you might be tempted to use math and if-statements. You could. Or you could think of it another way: What does the number 1000 have that 999, 100, 42, and 6 do not have?
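One way to read that hint, sketched on made-up lines: every count of 1,000 or more has at least four digits, so a digit-count pattern can stand in for a numeric comparison. The sample data below is invented for illustration, and the sketch assumes counts never have leading zeros.

```shell
# Made-up sample lines in name,gender,count form (not real SSA figures)
sample='Ann,F,6
Bob,M,42
Cat,F,100
Dan,M,999
Eve,F,1000
Fay,F,25000'

# A comma followed by four digits can only occur in the count field,
# and only when the count has four or more digits (i.e. is >= 1000)
big=$(printf '%s\n' "$sample" | grep -E ',[0-9]{4}')

echo "$big"
```

Only the lines for Eve and Fay survive the filter; no arithmetic or if-statements are involved.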

    Sample output

    The output of first-year.sh should have 1,592 lines.

    The first 25 lines of output

    Aaden,M,2007
    Aaliyah,F,1994
    Aaron,M,1880
    Abby,F,1954
    Abel,M,1924
    Abigail,F,1949
    Abraham,M,1893
    Ada,F,1880
    Adalyn,F,2005
    Adalynn,F,2007
    Adam,M,1880
    Adan,M,1969
    Addison,F,1991
    Addyson,F,2001
    Adelaide,F,1887
    Adele,F,1888
    Adeline,F,1884
    Adelyn,F,2003
    Aden,M,1999
    Adrian,M,1912
    Adriana,F,1959
    Adrianna,F,1975
    Adrienne,F,1917
    Agnes,F,1880
    Aidan,M,1990
    

    The last 25 lines of output

    Willow,F,1996
    Wilma,F,1896
    Wilson,M,1909
    Winifred,F,1883
    Woodrow,M,1911
    Wyatt,M,1955
    Xander,M,1999
    Xavier,M,1953
    Ximena,F,2000
    Yahir,M,2002
    Yaretzi,F,2005
    Yasmin,F,1974
    Yesenia,F,1971
    Yolanda,F,1913
    Yvette,F,1917
    Yvonne,F,1903
    Zachary,M,1949
    Zachery,M,1976
    Zackary,M,1979
    Zander,M,1999
    Zane,M,1926
    Zayden,M,2004
    Zion,M,1998
    Zoe,F,1952
    Zoey,F,1992
    

    Solution

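    # Every name,gender combo with a four-digit count (at least 1,000 babies) in some year
    names=$(cat data-hold/names-nationwide/*.txt | grep -E '[0-9]{4}' | cut -d ',' -f 1,2 | sort | uniq)
    
    for name in $names; do 
        # List the files (one per year) where this combo has a count of three or
        # more digits, then pull the year out of the earliest file name.
        # The ^ anchor keeps the name from matching in the middle of a line.
        year=$(grep -lE "^$name,[0-9]{3}" data-hold/names-nationwide/*.txt | \
        sort | grep -oE '[0-9]{4}' | head -n 1)
    
        echo "$name,$year"
    done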
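
To see how the inner pipeline of the solution picks the earliest year, here is a minimal sketch that runs it on a throwaway directory. The file contents and the `tmpdir` variable are invented for illustration; only the `yobYYYY.txt` naming follows the real data.

```shell
# Build a tiny, made-up dataset in a temporary directory
tmpdir=$(mktemp -d)
printf 'Pat,F,50\nPat,M,120\n'  > "$tmpdir/yob1914.txt"
printf 'Pat,F,150\nPat,M,300\n' > "$tmpdir/yob1923.txt"
printf 'Pat,F,900\nPat,M,80\n'  > "$tmpdir/yob1930.txt"

# grep -l lists the files that contain a matching line; sorting the
# file names puts the earliest year first, and head keeps only that one
first_file=$(grep -lE '^Pat,F,[0-9]{3}' "$tmpdir"/*.txt | sort | head -n 1)

# Take the four-digit year from the file name alone, since a
# temporary directory path could itself contain digits
year=$(basename "$first_file" | grep -oE '[0-9]{4}')

echo "Pat,F,$year"
rm -r "$tmpdir"
```

In this toy dataset the 1914 file is skipped for Pat,F because its count has only two digits, so the first matching file is the 1923 one.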