Tag Archives: The Data Incubator

A Data Science Career with Kirk Borne, Free Webinar

Once again, The Data Incubator is hosting a Data Science in 30 Minutes webinar. This one features the career of Kirk Borne.

Renowned data scientist Kirk Borne will take viewers on a journey through his career in science and technology, explaining how the industry and he himself have evolved over the last four decades. From skipping lunches in high school to a systematic Twitter obsession, Kirk will shed light on his road to success in the data science industry.

Kirk is widely considered one of the most (if not the most) influential voices in data science. If you are interested in a career in data science, this is a webinar you will not want to miss.

The webinar is at 5:30 Eastern Time on August 29, 2017, and registrations are currently being accepted. It is free.

Netflix Data Scientist on Machine Learning: Free Webinar

The Data Incubator, a data science fellowship program, is currently running a Data Science in 30 Minutes webinar series. Next week features a free webinar with Dr. Becky Tucker of Netflix. Dr. Tucker is a Senior Data Scientist at Netflix, where she specializes in predictive modeling for content demand (think: what do people want to watch?). The full abstract of the webinar is below. The webinar is free; all you need to do is register.

Predicting Content Demand with Machine Learning

Date/Time: March 9, 2017 @ 5:30 PM ET
Location: Online
Register: Click Here

Abstract: Netflix is well known for its data-driven recommendations that seek to customize the user experience for every subscriber. But data science at Netflix extends far beyond that – from optimizing streaming and content caching to informing decisions about the TV shows and films available on the service. The talk will cover work done by Becky and the Content Data Science team at Netflix, which seeks to evaluate where Netflix should spend its next content dollar using machine learning and predictive models.

Update – Below is the Recorded Webinar

How to Kickstart Your Data Science Career

This is a guest post from Michael Li of The Data Incubator. The Data Incubator runs a free eight-week data science fellowship to help transition its Fellows from academia to industry. This post runs through some of the toolsets you’ll need to know to kickstart your data science career.

 

If you’re an aspiring data scientist but still processing your data in Excel, you might want to upgrade your toolset. Why? Firstly, while advanced features like Excel pivot tables can do a lot, they don’t offer nearly the flexibility, control, and power of tools like SQL, or their functional equivalents in Python (Pandas) or R (data frames). Also, Excel has low size limits, making it suitable for “small data”, not “big data.”

In this blog entry we’ll talk about SQL. This should cover your “medium data” needs, which we’ll define as the next level up: data with more rows than Excel’s limit of roughly 1 million. SQL stores data in tables, which you can think of as a spreadsheet layout but with more structure. Each row represents a specific record (e.g. an employee at your company), and each column of a table corresponds to an attribute (e.g. name, department id, salary). Critically, all the values in a column must be of the same “type”. Here is a sample of the table Employees:

EmployeeId  Name   StartYear  Salary  DepartmentId
1           Bob    2001       10.5    10
2           Sally  2004       20      10
3           Alice  2005       25      20
4           Fred   2004       12.5    20
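
If you want to follow along, here is a minimal sketch of how the sample table could be built. It assumes Python's built-in sqlite3 module and an in-memory SQLite database, neither of which is prescribed by this post; the column types are our guess at sensible declarations for the sample data. Note that SQLite is more lenient about types than most databases, so treat the declarations as illustrative.

```python
import sqlite3

# Assumption: an in-memory SQLite database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Each column is declared with a single type; names mirror the sample table.
conn.execute(
    """
    CREATE TABLE Employees (
        EmployeeId   INTEGER PRIMARY KEY,
        Name         TEXT,
        StartYear    INTEGER,
        Salary       REAL,
        DepartmentId INTEGER
    )
    """
)

employees = [
    (1, "Bob", 2001, 10.5, 10),
    (2, "Sally", 2004, 20, 10),
    (3, "Alice", 2005, 25, 20),
    (4, "Fred", 2004, 12.5, 20),
]
conn.executemany("INSERT INTO Employees VALUES (?, ?, ?, ?, ?)", employees)

# typeof() reports how SQLite stored each Salary value. Because the column
# is declared REAL, whole numbers such as 20 are stored as floating point.
salary_types = conn.execute(
    "SELECT DISTINCT typeof(Salary) FROM Employees"
).fetchall()
row_count = conn.execute("SELECT COUNT(*) FROM Employees").fetchone()[0]
print(row_count, salary_types)
```

The point of the last query is that every Salary value ends up with one storage type, which is what "each column has a type" means in practice.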

SQL has many keywords that make up its query language, but the ones most relevant to data scientists are SELECT, WHERE, GROUP BY, and JOIN. We’ll go through each of these individually.

SELECT

SELECT is the foundational keyword in SQL: it retrieves data from a table. SELECT can also filter on columns, returning only the ones you name. For example,

SELECT Name, StartYear FROM Employees

returns

Name   StartYear
Bob    2001
Sally  2004
Alice  2005
Fred   2004
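
To try the query above yourself, one option is the sketch below. It again assumes Python's sqlite3 module with an in-memory database (our choice, not something the post requires). The ORDER BY is our addition so that the row order is predictable; SQL does not guarantee any order without it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeId INTEGER, Name TEXT, "
    "StartYear INTEGER, Salary REAL, DepartmentId INTEGER)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Bob", 2001, 10.5, 10),
        (2, "Sally", 2004, 20, 10),
        (3, "Alice", 2005, 25, 20),
        (4, "Fred", 2004, 12.5, 20),
    ],
)

# Only the two named columns come back; the other three are dropped.
cursor = conn.execute(
    "SELECT Name, StartYear FROM Employees ORDER BY EmployeeId"
)
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(column_names)
for row in rows:
    print(row)
```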

 

WHERE

The WHERE clause filters the rows, keeping only those that match a condition. For example,

SELECT * FROM Employees WHERE StartYear=2004

returns

EmployeeId  Name   StartYear  Salary  DepartmentId
2           Sally  2004       20      10
4           Fred   2004       12.5    20
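
Here is a hedged sketch of the same filter run from Python's sqlite3 module (an assumption on our part; any SQL client would do). The year is passed as a `?` parameter rather than pasted into the query string, which is the usual way to supply values from code. The variable name `start_year` is ours.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeId INTEGER, Name TEXT, "
    "StartYear INTEGER, Salary REAL, DepartmentId INTEGER)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Bob", 2001, 10.5, 10),
        (2, "Sally", 2004, 20, 10),
        (3, "Alice", 2005, 25, 20),
        (4, "Fred", 2004, 12.5, 20),
    ],
)

# The ? placeholder is filled in by sqlite3 from the tuple that follows.
start_year = 2004
rows = conn.execute(
    "SELECT * FROM Employees WHERE StartYear = ? ORDER BY EmployeeId",
    (start_year,),
).fetchall()

for row in rows:
    print(row)
```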

 

GROUP BY

Next, the GROUP BY clause combines rows that share a value into groups and summarizes each group using aggregate functions like COUNT (count) and AVG (average). For example,

SELECT StartYear, COUNT(*) AS Num, AVG(Salary) AS AvgSalary
FROM Employees
GROUP BY StartYear

returns

StartYear  Num  AvgSalary
2001       1    10.5
2004       2    16.25
2005       1    25
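
The aggregation above can be reproduced with the sketch below, again assuming Python's sqlite3 module and an in-memory database. The ORDER BY StartYear is our addition to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeId INTEGER, Name TEXT, "
    "StartYear INTEGER, Salary REAL, DepartmentId INTEGER)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Bob", 2001, 10.5, 10),
        (2, "Sally", 2004, 20, 10),
        (3, "Alice", 2005, 25, 20),
        (4, "Fred", 2004, 12.5, 20),
    ],
)

# One output row per distinct StartYear; COUNT and AVG summarize each group.
rows = conn.execute(
    """
    SELECT StartYear, COUNT(*) AS Num, AVG(Salary) AS AvgSalary
    FROM Employees
    GROUP BY StartYear
    ORDER BY StartYear
    """
).fetchall()

for start_year, num, avg_salary in rows:
    print(start_year, num, avg_salary)
```

Sally and Fred both started in 2004, so their salaries (20 and 12.5) are averaged into a single row.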

 

JOIN

Finally, the JOIN clause allows us to join in other tables. For example, assume we have a table called Departments:

DepartmentId  DepartmentName
10            Sales
20            Engineering

We could use JOIN to combine the Employees and Departments tables based ON the DepartmentId fields:

SELECT Employees.Name AS EmpName, Departments.DepartmentName AS DepName
FROM Employees JOIN Departments
ON Employees.DepartmentId = Departments.DepartmentId;

The results might look like:

EmpName  DepName
Bob      Sales
Sally    Sales
Alice    Engineering
Fred     Engineering
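
As a rough sketch, here is the join run end to end, assuming Python's sqlite3 module and an in-memory database. Only the columns the join needs are created, and the ORDER BY is our addition to fix the row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Departments (DepartmentId INTEGER, DepartmentName TEXT)")
conn.execute("CREATE TABLE Employees (EmployeeId INTEGER, Name TEXT, DepartmentId INTEGER)")
conn.executemany(
    "INSERT INTO Departments VALUES (?, ?)",
    [(10, "Sales"), (20, "Engineering")],
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [(1, "Bob", 10), (2, "Sally", 10), (3, "Alice", 20), (4, "Fred", 20)],
)

# Each employee row is paired with the department row whose id matches.
rows = conn.execute(
    """
    SELECT Employees.Name AS EmpName, Departments.DepartmentName AS DepName
    FROM Employees JOIN Departments
    ON Employees.DepartmentId = Departments.DepartmentId
    ORDER BY Employees.EmployeeId
    """
).fetchall()

for emp_name, dep_name in rows:
    print(emp_name, dep_name)
```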

We’ve ignored a lot of details about joins (e.g. there are actually at least 4 types of joins), but hopefully this gives you a good picture.
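
To give a flavor of how join types differ, the sketch below compares a plain (inner) JOIN with a LEFT JOIN. It assumes Python's sqlite3 module, and it adds a hypothetical fifth employee, Dana, whose department id 30 has no row in Departments; Dana is not part of the sample data in this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Departments (DepartmentId INTEGER, DepartmentName TEXT)")
conn.execute("CREATE TABLE Employees (EmployeeId INTEGER, Name TEXT, DepartmentId INTEGER)")
conn.executemany(
    "INSERT INTO Departments VALUES (?, ?)",
    [(10, "Sales"), (20, "Engineering")],
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        (1, "Bob", 10),
        (2, "Sally", 10),
        (3, "Alice", 20),
        (4, "Fred", 20),
        (5, "Dana", 30),  # hypothetical row: department 30 does not exist
    ],
)

# Inner join: employees without a matching department are dropped.
inner_rows = conn.execute(
    """
    SELECT Employees.Name, Departments.DepartmentName
    FROM Employees JOIN Departments
    ON Employees.DepartmentId = Departments.DepartmentId
    ORDER BY Employees.EmployeeId
    """
).fetchall()

# Left join: every employee is kept; a missing department shows up as NULL,
# which sqlite3 returns as None.
left_rows = conn.execute(
    """
    SELECT Employees.Name, Departments.DepartmentName
    FROM Employees LEFT JOIN Departments
    ON Employees.DepartmentId = Departments.DepartmentId
    ORDER BY Employees.EmployeeId
    """
).fetchall()

print(len(inner_rows), len(left_rows))
print(left_rows[-1])
```

Which one you want depends on whether unmatched rows are noise or exactly the thing you are looking for.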

Conclusion and Further Reading

With these basic commands, you can get a lot of basic data processing done. Don’t forget that you can nest queries and create really complicated joins. It’s a lot more powerful than Excel and gives you much better control of your data. Of course, there’s a lot more to SQL than what we’ve mentioned, and this is only intended to whet your appetite and give you a taste of what you’re missing.
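
As one illustration of nesting, the sketch below finds employees who earn more than the average salary by placing one SELECT inside another. The question itself is our example rather than something from the sections above, and the code assumes Python's sqlite3 module with an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeId INTEGER, Name TEXT, "
    "StartYear INTEGER, Salary REAL, DepartmentId INTEGER)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Bob", 2001, 10.5, 10),
        (2, "Sally", 2004, 20, 10),
        (3, "Alice", 2005, 25, 20),
        (4, "Fred", 2004, 12.5, 20),
    ],
)

# The inner SELECT computes the overall average salary; the outer SELECT
# keeps only the employees whose salary is above it.
rows = conn.execute(
    """
    SELECT Name, Salary
    FROM Employees
    WHERE Salary > (SELECT AVG(Salary) FROM Employees)
    ORDER BY EmployeeId
    """
).fetchall()

for name, salary in rows:
    print(name, salary)
```

The average of the four sample salaries is 17, so only Sally and Alice come back.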

 

And when you’re ready to step it up from “medium data” to “big data”, you should apply for a fellowship at The Data Incubator where we work with current-generation data-processing technologies like MapReduce and Spark!

Data Sources for Cool Data Science Projects: Part 2 – Guest Post


I am excited for the first ever guest posts on the Data Science 101 blog. Dr. Michael Li, Executive Director of The Data Incubator in New York City, is providing 2 great posts (see Part 1) about finding data for your next data science project.

At The Data Incubator, we run a free six-week data science fellowship to help our Fellows land industry jobs. Our hiring partners love considering Fellows who don’t mind getting their hands dirty with data. That’s why our Fellows work on cool capstone projects that showcase those skills. One of the biggest obstacles to successful projects has been getting access to interesting data. Here are some more cool public data sources you can use for your next project:

Data With a Cause:

  1. Environmental Data: Data on household energy usage is available, as well as NASA Climate Data.
  2. Medical and Biological Data: You can get anything from anonymous medical records, to remote sensor readings for individuals, to data on the genomes of 1000 individuals.

Miscellaneous:

  1. Geo Data: Try looking at these Yelp Datasets for venues near major universities and one for major cities in the Southwest. The Foursquare API is another good source. Open Street Map has open data on venues as well.
  2. Twitter Data: You can get access to Twitter data used for sentiment analysis, network Twitter data, and social Twitter data, in addition to the Twitter API.
  3. Games Data: Datasets for games are available, including a large dataset of poker hands, a dataset of online Dominion games, and datasets of chess games.
  4. Web Usage Data: Web usage data is a common dataset that companies look at to understand engagement. Available datasets include Anonymous usage data for MSNBC, Amazon purchase history (also anonymized), and Wikipedia traffic.

Metasources: these are great sources that point to many other datasets.

  1. Stanford Network Data: http://snap.stanford.edu/index.html
  2. Every year, the ACM holds a competition for machine learning called the KDD Cup. Their data is available online.
  3. UCI maintains archives of data for machine learning.
  4. US Census Data
  5. Amazon is hosting Public Datasets on S3
  6. Kaggle hosts machine-learning challenges and many of their datasets are publicly available
  7. The cities of Chicago, New York, Washington DC, and SF maintain public data warehouses.
  8. Yahoo maintains a lot of data on its web properties, which can be obtained by writing to them.
  9. BigML is a blog that maintains a list of public datasets for the machine learning community.
  10. Finally, if there’s a website with data you are interested in, crawl for it!

 

While building your own project cannot replicate the experience of a fellowship at The Data Incubator (our Fellows get amazing access to hiring managers and to nonpublic data sources), we hope this will get you excited about working in data science. And when you are ready, you can apply to be a Fellow!

Got any more data sources? Let us know or leave a comment and we’ll add them to the list!

 

Additional Sources (added via comments since the post was published)

Data Sources for Cool Data Science Projects: Part 1 – Guest Post


I am excited for the first ever guest posts on the Data Science 101 blog. Dr. Michael Li, Executive Director of The Data Incubator in New York City, is providing 2 great posts (see Part 2) about finding data for your next data science project.

At The Data Incubator, we run a free six-week data science fellowship to help our Fellows land industry jobs. Our hiring partners love considering Fellows who don’t mind getting their hands dirty with data. That’s why our Fellows work on cool capstone projects that showcase those skills. One of the biggest obstacles to successful projects has been getting access to interesting data. Here are a few cool public data sources you can use for your next project:

Economic Data:

  1. Publicly Traded Market Data: Quandl is an amazing source of finance data. Google Finance and Yahoo Finance are additional good sources of data. Corporate filings with the SEC are available on Edgar.
  2. Housing Price Data: You can use the Trulia API or the Zillow API.
  3. Lending data: You can find student loan defaults by university and the complete collection of peer-to-peer loans from Lending Club and Prosper, the two largest platforms in the space.
  4. Home mortgage data: There is data made available by the Home Mortgage Disclosure Act and there’s a lot of data from the Federal Housing Finance Agency available here.

Content Data:

  1. Review Content: You can get reviews of restaurant and physical venues from Foursquare and Yelp (see geodata). Amazon has a large repository of Product Reviews. Beer reviews from Beer Advocate can be found here. Rotten Tomatoes Movie Reviews are available from Kaggle.
  2. Web Content: Looking for web content? Wikipedia provides dumps of their articles. Common Crawl has a large corpus of the internet available. ArXiv maintains all their data available via Bulk Download from AWS S3. Want to know which URLs are malicious? There’s a dataset for that. Music data is available from the Million Songs Database. You can analyze the Q&A patterns on sites like Stack Exchange (including Stack Overflow).
  3. Media Data: There are open annotated articles from the New York Times, the Reuters Dataset, and the GDELT project (a consolidation of many different news sources). Google Books has published NGrams for books going back past 1800.
  4. Communications Data: There’s access to public messages of the Apache Software Foundation and communications amongst former Enron execs.

Government Data:

  1. Municipal Data: Crime Data is available for the City of Chicago and Washington DC. Restaurant Inspection Data is available for Chicago and New York City.
  2. Transportation Data: NYC Taxi Trips in 2013 are available courtesy of the Freedom of Information Act. There’s bikesharing data from NYC, Washington DC, and SF. There’s also Flight Delay Data from the FAA.
  3. Census Data: Japanese Census Data. US Census data from 2010, 2000, 1990. From census data, the government has also derived time use data. EU Census Data. Check out popular male / female baby names going back to the 19th Century from the Social Security Administration.
  4. World Bank: they have a lot of data available on their website.
  5. Election Data: Political contribution data for the last few US elections can be downloaded from the FEC here and here. Polling data is available from Real Clear Politics.

 

While building your own project cannot replicate the experience of a fellowship at The Data Incubator (our Fellows get amazing access to hiring managers and to nonpublic data sources), we hope this will get you excited about working in data science. And when you are ready, you can apply to be a Fellow!

Got any more data sources? Let us know or leave a comment and we’ll add them to the list!