
Collecting and analyzing job listings from the USAJobs.gov API

Ask what you can do for your country, and what your country can pay you.

Due: Friday, February 6
Points: 10

Using the JSON API from USAJobs.gov, we’ll write a scraper that collects all the current job openings as raw data, stores them into a time-stamped directory, and does a quick analysis of the highs and lows for salaried positions.

This exercise is a repeat of the scraping-to-analysis exercises we’ve done before, including more parsing of JSON with jq, and you’ll have to write some logic to handle job categories whose openings span multiple pages.

Because the USAJobs.gov API is free, and relatively forgiving, this is a good time to practice the concepts involved with interacting with a remote API. But try to avoid writing an infinite loop in your program.

The USAJobs.gov site search is actually pretty robust. However, being able to collect and parse the data as we please gives us a lot more speed and flexibility in performing queries to find interesting or specific jobs. If we wanted to study job posting trends over time, or perhaps create a job-search site of our own, then having a script that can be set to automatically run and collect data every day would be an extremely effective tool.

You can download a zipped snapshot of the job openings here.

Deliverables

  • A project folder named `~/compciv/homework/usajobsgov`

    The directory structure will look something like this:

      |-compciv
        |-homework
           |--usajobsgov
              |--scraper.sh
              |--analyzer.sh
              |--data-hold
                 |--OccupationalSeries.xml
                 |--scrapes
                    |----2015-01-20_1500
                      |--0000-1.json
                      |--0000-2.json
                      |--0100-1.json
    
  • A script named `scraper.sh`

    Creating a time-stamped directory

    Upon launch, the scraper.sh script will create a timestamped directory in ./data-hold/scrapes, in this format:

      YYYY-MM-DD_HH00
    

    For example, if scraper.sh is run on Jan 20th, 2015, at 3:45PM, it should create this directory:

      ./data-hold/scrapes/2015-01-20_1500
    

    (this particular arrangement means that scrapes run at 3:10PM and at 3:45PM on 1/20/2015 will both save to the same directory, 2015-01-20_1500, which is fine for our purposes)

    Collecting the JobFamily values

    Then, scraper.sh will parse the data-hold/OccupationalSeries.xml file to find all of the JobFamily values, e.g. 0000, 2200, 9900.

    Collecting data from data.usajobs.gov/api/jobs

    For each of the JobFamily values, make the appropriate curl to the JSON API from USAJobs.gov and retrieve all the current job openings for that “JobFamily”. Check out the USAJobs.gov documentation on API Query Parameters for more information on what you need to curl.

    Paginate when necessary

    In cases where there are more job postings than can fit in a single response, the scraper.sh script should loop and collect each page.

    After each visit to the https://data.usajobs.gov/api/jobs endpoint, the scraper.sh should save a file in the timestamped directory, one for each page of each “JobFamily”.

    So, if the JobFamily of 2200 had 4 pages of job openings, scraper.sh would save these files:

          |--scrapes
             |----2015-01-20_1500
                    |--2200-1.json
                    |--2200-2.json
                    |--2200-3.json
                    |--2200-4.json
    
  • A script named `analyzer.sh`

    When the analyzer.sh script is executed, it expects one argument to be passed in: a timestamped sub-directory within data-hold/scrapes, e.g.:

          bash analyzer.sh 2015-01-20_1500
    

    The analyzer.sh script will then:

    1. Collect just the job postings that have a salary-basis as “Per Year”
    2. Collect and count the unique job titles
    3. Select the 25 most frequently occurring job titles
    4. For each of these job titles, print pipe-delimited output that includes the job title, the minimum salary, and the maximum salary among the collected job records.

    In other words, each job title will presumably have more than one job listing. The analyzer.sh script prints a simple report showing the range of possible salaries.

    Here’s what the output looked like for jobs posted on January 26:

        Transportation Security Officer (TSO)|31203.00|52184.00
        Physician (Psychiatrist)|97987.00|250000.00
        Physician (Primary Care)|97987.00|215000.00
        Physical Therapist|39179.00|104306.00
        Contract Specialist|40336.00|172443.00
        Social Worker|49285.00|126949.00
        Program Analyst|47684.00|158700.00
        Medical Technologist|36379.00|107434.00
        Physician Assistant|57798.00|119443.00
        Physician (Hospitalist)|97987.00|240000.00
        Supply Technician|31315.00|61994.00
        Medical Support Assistant|25434.00|49166.00
        Clinical Psychologist|57408.00|118515.00
        Advanced Medical Support Assistant|35256.00|54806.00
        Physician|97987.00|325000.00
        Auditor|34576.00|116901.00
        Civil Engineer|36379.00|143152.00
        Physician (Gastroenterology)|97987.00|320000.00
        Interdisciplinary|31944.00|158700.00
        Dental Assistant|25434.00|50374.00
        Budget Analyst|39179.00|149333.00
        Public Affairs Specialist|50073.00|139523.00
        Psychiatrist|97987.00|260000.00
        Occupational Therapist|48403.00|91255.00
        Physician (Psychiatry)|96539.00|260000.00
    
  • Hints

    I'll be brief in directions here, as this exercise follows all the patterns and strategies you've practiced before.

    Here's a sample endpoint and response for a query on Series 2210 (part of the 2200 JobFamily):

    https://data.usajobs.gov/api/jobs?series=2210

    {
      "TotalJobs": "183",
      "JobData": [
        {
          "DocumentID": "391383700",
          "JobTitle": "Information Technology Specialist",
          "OrganizationName": "Department Of Health And Human Services",
          "AgencySubElement": "Centers for Medicare & Medicaid Services",
          "SalaryMin": "$76,378.00",
          "SalaryMax": "$99,296.00",
          "SalaryBasis": "Per Year",
          "StartDate": "1/16/2015",
          "EndDate": "1/28/2015",
          "WhoMayApplyText": "United States Citizens",
          "PayPlan": "GS",
          "Series": "2210",
          "Grade": "12/12",
          "WorkSchedule": "Full Time",
          "WorkType": "Permanent",
          "Locations": "Woodlawn, Maryland",
          "AnnouncementNumber": "CMS-OTS-DE-15-1301322",
          "JobSummary": "CMS' effectiveness depends on the capabilities of a dedicated, professional staff that is committed to supporting these objectives. A career with CMS offers the opportunity to get involved on important national health care issues and be part of a dynamic, fast-paced, and highly visible organization. For more information on CMS, please visit: http://www.cms.gov/ . This position is located in the Department of Health & Human Services (HHS), Centers for Medicare & Medicaid Services (CMS), Office of Technology Solutions (OTS), Woodlawn, MD. WHO MAY APPLY: This is a competitive vacancy, open to all United States Citizens or Nationals, advertised under Delegated Examining Authority....",
          "ApplyOnlineURL": "https://www.usajobs.gov/GetJob/ViewDetails/391383700?PostingChannelID=RESTAPI"
        }
      ],
      "Pages": "8"
    }
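
    For a quick sanity check, you can pipe a response straight into jq and pull out just the top-level pagination fields, e.g.:

    curl -s "https://data.usajobs.gov/api/jobs?series=2210" | jq '{TotalJobs, Pages}'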
    

    Setup

    While the OccupationalSeries.xml file is available from a USAJobs.gov page, the site blocks a direct curl of it. So I've made a copy, which you can download like this (create ./data-hold first if it doesn't already exist):

    mkdir -p ./data-hold
    curl -o ./data-hold/OccupationalSeries.xml http://stash.compciv.org/usajobs.gov/OccupationalSeries.xml
    

    Parsing OccupationalSeries.xml

    So the first thing you need to do is get a list of job categories, which, in the parlance of USAJobs.gov, are referred to as JobFamily values.

    The OccupationalSeries.xml contains a list of JobFamily values. You'll iterate through each of these to get all of the job openings on data.usajobs.gov.

    Although XML looks a lot like HTML, you can't parse the OccupationalSeries.xml with pup, which is designed for HTML.

    However, corn.stanford.edu has the hxselect program which can be used to parse XML in very much the same manner as pup. Check out the hxselect documentation to see how it is used.

    Hint: The OccupationalSeries.xml file has a confusing layout. There are effectively two lists. Only one of those lists contains just the unique JobFamily values.
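
    As a rough starting point, the parsing command might look like the sketch below; the JobFamily selector is a guess on my part, so inspect the XML yourself to find the element that actually wraps each code:

    # NOTE: 'JobFamily' is a placeholder selector -- check OccupationalSeries.xml
    # to see what the element containing each four-digit code is really named
    cat ./data-hold/OccupationalSeries.xml | hxselect -c -s '\n' 'JobFamily' | sort -u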

    Creating a time-stamped directory

    Check out the tools page to see how date can be used to create a date-formatted string, which you can use for the directory's name.
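
    As a minimal sketch (the solution at the bottom of this page uses the same format string):

    # run at 3:45PM on Jan. 20, 2015, this creates ./data-hold/scrapes/2015-01-20_1500
    timestamp=$(date +%Y-%m-%d_%H00)
    mkdir -p "./data-hold/scrapes/$timestamp"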

    Read the documentation

    Read the documentation, especially the part about API Query Parameters. Besides the Series parameter, the only other parameter to really care about is the Page parameter. You need to be able to paginate through all of the jobs.

    Hint: There is one other parameter that is useful for reducing the number of pages you have to traverse; all of the other parameters basically narrow the search field, which you don't want.
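
    And here's a rough sketch of the pagination logic, assuming the first page for a JobFamily has already been saved (the filename is just an example):

    # .Pages comes back as a string, e.g. "8"; seq handles that fine
    total_pages=$(jq -r '.Pages' 2200-1.json)
    # seq 2 1 produces nothing, so single-page JobFamilies are skipped automatically
    for p in $(seq 2 "$total_pages"); do
      echo "would fetch page $p here"
    done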

    Do a simple scrape first

    Before worrying about getting all the pages, worry about correctly iterating through all the different JobFamily codes first, as if you only had to collect one page each.

    And use an echo statement to see what's happening:

    # assume $jobfamilies holds the JobFamily codes parsed from OccupationalSeries.xml
    for jobfamily in $jobfamilies; do 
      page_count=1
      echo "Fetching jobs in $jobfamily, page $page_count"
      # etc etc
    done
    

    You should get output that looks like this:

    Fetching jobs in 0000, page 1
    Fetching jobs in 0100, page 1
    Fetching jobs in 0200, page 1
    Fetching jobs in 0400, page 1
    Fetching jobs in 0500, page 1
    

    Then, when you adjust your code to do multi-page downloading (it will likely require a for-loop within a for-loop), your echo output should look like this:

    Fetching jobs in 0000, page 1
    Fetching jobs in 0100, page 1
    Fetching jobs in 0100, page 2
    Fetching jobs in 0100, page 3
    Fetching jobs in 0200, page 1
    Fetching jobs in 0300, page 1
    Fetching jobs in 0300, page 2
    Fetching jobs in 0300, page 3
    Fetching jobs in 0400, page 1
    Fetching jobs in 0400, page 2
    Fetching jobs in 0500, page 1
    

    Basically, you want to avoid hammering the data.usajobs.gov site with the same call, over and over and over and over.

    Hint: don't use the word jobs as a variable name. The word jobs is already the name of a Unix command.

    Another big hint: When doing a curl of the URL, enclose it in double-quotes, e.g. curl "http://data.usajobs.gov/etc&etc". Left unquoted, the ampersands in the URL will cause problems for you, because the shell treats & as its background-job operator.
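
    For instance, a single page-fetch looks roughly like this (the series, page, and output path here are just illustrative):

    # the double-quotes keep the shell from splitting the URL at the &
    curl -s "https://data.usajobs.gov/api/jobs?Series=2200&Page=1" -o ./data-hold/scrapes/2015-01-20_1500/2200-1.json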

    Parsing with jq

    If you want to practice parsing with jq without waiting to finish implementing scraper.sh, you can download a zipped snapshot of the job openings here.

    Getting analyzer.sh to produce the correct output will involve a combination of basic and maybe some fancy jq usage, and good old-fashioned Unix tools like grep and sort.

    Before writing analyzer.sh, try parsing the collected data for individual fields to get an idea of what gets returned:

    cat *.json | jq -r '.JobData[] | .SalaryMax' | sort | uniq -c | sort -r | head -n 10
    
        201 $215,000.00
        161 $76,131.00
        153 $240,000.00
        150 $91,255.00
        120 $51,437.00
        102 $195,000.00
         98 $62,920.00
         97 $108,507.00
         94 $158,700.00
         89 $46,294.00
    
    cat *.json | jq -r '.JobData[] | .SalaryBasis' | sort | uniq -c | sort
    
          1 Bi-weekly
          1 Per Day
          1 Per Month
          1 Student Stipend Paid
          2 School Year
          8 Fee Basis
         11 Without Compensation
       1300 Per Hour
       5600 Per Year
    

    Using jq's select

    Remember that the output should include only jobs that are "Per Year". This requires using jq's select function. Rather than risk you going off the deep end with grep, here's an example you can use and modify:

    yearly_jobs=$(cat *.json | jq '.JobData[] | select(.SalaryBasis == "Per Year")')
    

    That code snippet selects all the items in the .JobData array whose .SalaryBasis attribute has the value "Per Year".

    New hint: Transforming the job data

    For the rest of analyzer.sh, you only care about three fields: JobTitle, SalaryMin, and SalaryMax.

    So, assuming yearly_jobs contains just the "Per Year" jobs, as in the above step, you can use jq to reduce each job record to just its JobTitle, SalaryMin, and SalaryMax:

    simple_rows=$(echo $yearly_jobs | jq '. | {JobTitle, SalaryMin, SalaryMax}') 
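
    Each object in simple_rows will then look something like this (using the sample job record from earlier on this page):

    {
      "JobTitle": "Information Technology Specialist",
      "SalaryMin": "$76,378.00",
      "SalaryMax": "$99,296.00"
    }

    Note that the salary values still contain dollar signs and commas, which is why the later steps strip those characters out with tr before sorting numerically.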
    

    Collecting unique job titles and using read-while

    Regarding the requirement that the output include the 25 most-frequent job titles…

    Here's another potential pitfall, given the format of the data and Bash's handling of whitespace. Using a for-loop here may not be optimal, so here's the code for the proper read-while loop, using input redirection and process substitution, two things I've skimmed over in class.

    simple_rows=$(echo $yearly_jobs | jq '. | {JobTitle, SalaryMin, SalaryMax}')
    
    while read -r line; do 
      # remember that each line contains something like:
      #   50   Some Job Title
      title=$(echo $line | grep -oE '[[:alpha:]].+')
      
      # $filtered_rows filters $simple_rows down to just the rows for this job title
      filtered_rows=$(echo $simple_rows | jq "select(.JobTitle == \"$title\")" )
    
      min=$(echo $filtered_rows | jq -r '.SalaryMin' | tr -d '$' | tr -d ',' | sort -n | head -n 1)
    
      ## Get the max on your own
      echo "Finish the rest of your steps here to get the max, and print out the proper line as in the Deliverables"
      
      ## Echo the proper format as specified in the requirements
    
      # The done < ... is done for you
    done < <(echo $simple_rows | jq -r '.JobTitle' | sort | uniq -c | sort -rn | head -n 25)
    

    You can use this loop, but first run the command encased in the <(...) on its own to make sure you know what $line contains.

    Solution

    scraper.sh

    td="./data-hold/scrapes/$(date +%Y-%m-%d_%H00)"
    mkdir -p "$td"

    # $series is assumed to already hold the JobFamily codes (e.g. 0000, 0100, ... 9900)
    # parsed out of ./data-hold/OccupationalSeries.xml
    for snum in $series; do 
      # We always have to fetch the first page of the series
      echo "Fetching series $snum, page 1"
      curl -s "https://data.usajobs.gov/api/jobs?Series=$snum&NumberOfJobs=250&Page=1" -o "$td/$snum-1.json"
    
      # now parse the first page to find the number of pages
      # remaining
      total_pages=$(cat "$td/$snum-1.json" | jq -r '.Pages')
      # if $total_pages is less than 2, this for-loop doesn't
      # execute
      for p in $(seq 2 $total_pages); do
        echo "Fetching series $snum, page $p"
        curl -s "https://data.usajobs.gov/api/jobs?Series=$snum&NumberOfJobs=250&Page=$p" -o "$td/$snum-$p.json"
      done
    done
    

    analyzer.sh

    datadir="./data-hold/scrapes/$1"
    yearly_jobs=$(cat $datadir/*.json | jq '.JobData[] | select(.SalaryBasis == "Per Year")')
    # trimming the data into simple_rows 
    # is actually probably unnecessary, but it doesn't hurt
    simple_rows=$(echo "$yearly_jobs" | jq '. | {JobTitle, SalaryMin, SalaryMax}')
    
    # easier to read left-to-right pipe notation than what I proposed above...but
    # the effect is the same
    echo "$simple_rows" | jq -r '.JobTitle' | sort | uniq -c | \
     sort -rn | head -n 25 | \
    while read -r line; do 
      title=$(echo "$line" | grep -oE '[[:alpha:]].+')
      filtered_rows=$(echo "$simple_rows" | jq "select(.JobTitle == \"$title\")" )
      min=$(echo "$filtered_rows" | jq -r '.SalaryMin' | tr -d '$' | tr -d ',' | sort -n | head -n 1)
      max=$(echo "$filtered_rows" | jq -r '.SalaryMax' | tr -d '$' | tr -d ',' | sort -rn | head -n 1)
      # print the pipe-delimited line required by the deliverables
      echo "$title|$min|$max"
    done
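
    Putting it all together, a full run looks like this (using the example timestamp from earlier; yours will be whatever directory scraper.sh actually creates):

    cd ~/compciv/homework/usajobsgov
    bash scraper.sh
    bash analyzer.sh 2015-01-20_1500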