Google/Alphabet continues toward Total Person Awareness: tracking every vehicle + person.

Secretive Alphabet division aims to fix public transit in US by shifting control to Google (from The Guardian)

Documents reveal Sidewalk Labs is offering a system it calls Flow to Columbus, Ohio, to upgrade bus and parking services – and bring them under Google’s management.


The emails and documents show that Flow applies Google’s expertise in mapping, machine learning and big data to thorny urban problems such as public parking. Numerous studies have found that 30% of traffic in cities is due to drivers seeking parking.

Sidewalk said in documents that Flow would use camera-equipped vehicles… It would then combine data from drivers using Google Maps with live information from city parking meters to estimate which spaces were still free. Arriving drivers would be directed to empty spots.

Source: Secretive Alphabet division aims to fix public transit in US by shifting control to Google

Notice that this gives Google/Alphabet a legitimate reason to track every car in the downtown area. Flow can be even more helpful if they know the destination of every car AND every traveler for the next hour.
The next logical step, a few years from now, will be to track the plans of every person in the city. For example, Mary Smith normally leaves her house in the suburbs at 8:15 AM to drive to her office in downtown Columbus. Today, however, she has to drop off her daughter Emily (born Dec 1, 2008, social security number 043-xx-xxxx) at school, so she will leave a little early. This perturbation in normal traffic can be used to help other drivers choose the most efficient route. Add together thousands of these, and we can add real-time re-routing of buses and Uber cars.
For now, this sounds like science fiction. It certainly has the potential to improve transit efficiency and speed, and “make everyone better off.” But it comes at a price. Then again, many are already comfortable with Waze tracking their drives in detail.
Tune back in 10 years from now and tell me how I did.

Web site: Data mining with R for MBA level students.

I just completed teaching a 10-week course on data mining for MS-level professional degree students. Most of the material is on a web site, https://irgn452.wordpress.com/chron/. The course assumes good knowledge of OLS regression, but other than that it is self-contained.
Software is R, with a heavy dose of Rattle for the first few weeks. (Rattle is a front end for R.) The main algorithms I emphasize are Random Forests and LASSO, for both classification and regression. I also stress creating new variables that correspond to the physical/economic characteristics of the problem under study. The course requires a major project; some students scrape or mash up their own data. Because we have only 10 weeks, I provide a timetable and a lot of milestones for the projects, and hold frequent one-on-one meetings.
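For readers who want a concrete feel for the two workhorse methods, here is a minimal sketch in R, using the randomForest and glmnet packages and a built-in data set. The data and settings are illustrative only and are not taken from the course materials.

```r
# Minimal sketch: Random Forest and LASSO on a built-in data set.
# Illustrative only; the course itself uses Rattle and students' project data.
library(randomForest)
library(glmnet)

data(mtcars)
x <- as.matrix(mtcars[, -1])   # predictors
y <- mtcars$mpg                # continuous response (regression)

# Random Forest regression
set.seed(1)
rf <- randomForest(x, y, ntree = 500, importance = TRUE)
print(rf)        # includes an out-of-bag error estimate
varImpPlot(rf)   # which variables matter?

# LASSO regression with a cross-validated penalty
cv <- cv.glmnet(x, y, alpha = 1)   # alpha = 1 selects the LASSO penalty
coef(cv, s = "lambda.min")         # coefficients at the best lambda

# For classification, the same calls work with a factor response
# (randomForest) or family = "binomial" (glmnet).
```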
The web site is not designed for public consumption, and is at best in “early beta” status. I am making it available in case anyone wants to mine it for problem sets, discussions of applied issues not covered in most books, etc. Essentially, it is a crude draft of a text for MBAs on data mining using R. This was about the fifth time I taught the course.

By the way, a lot of the lecture notes are modestly modified versions of the excellent lecture material from Matt Taddy. His emphasis is more theoretical than mine, but his explanations and diagrams are great. Readings were generally short sections from either ISLR by James et al. or Data Mining with Rattle and R. Both are available as ebooks at many universities. My TA was Hyeonsu Kang.

 

Econometrics versus statistical analysis

I teach a course on Data Mining, called Big Data Analytics. (See here for the course web site.) As I began to learn the culture and methods of data mining, clear differences from econometrics showed up. Since my students are well trained in standard econometrics, the distinctions are important to help guide them.

One important difference, at least where I teach, is that econometrics formulates statistical problems as hypothesis tests. Students do not learn other tools, and therefore they have trouble recognizing problems where hypothesis tests are not the right approach. Example: given satellite images, distinguish urban areas from non-urban areas. This cannot be solved well in a hypothesis-testing framework.
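A minimal sketch of what I mean, with made-up “pixel” features standing in for real satellite data: the natural formulation is a classifier trained on labeled examples and judged by out-of-sample accuracy, not a null hypothesis and a p-value. The feature names here (brightness, greenness) are invented for illustration.

```r
# Illustrative only: synthetic "pixel" features standing in for satellite data.
library(randomForest)

set.seed(1)
n <- 2000
urban <- factor(rbinom(n, 1, 0.4), labels = c("non-urban", "urban"))
# Fake spectral features whose distributions differ somewhat by class
brightness <- rnorm(n, mean = ifelse(urban == "urban", 0.6, 0.4), sd = 0.15)
greenness  <- rnorm(n, mean = ifelse(urban == "urban", 0.3, 0.7), sd = 0.15)
pixels <- data.frame(urban, brightness, greenness)

train <- sample(n, 0.7 * n)
fit <- randomForest(urban ~ brightness + greenness, data = pixels[train, ])

# The question is "how often do we label a new pixel correctly?",
# not "can we reject a null hypothesis?"
pred <- predict(fit, pixels[-train, ])
mean(pred == pixels$urban[-train])   # out-of-sample accuracy
```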

Another difference is less fundamental, but also important in practice: using out-of-sample methods to validate and test estimators is a religious practice in data mining, but is almost never taught in standard econometrics. (Again, I’m sure PhD courses at UCSD are an exception, but it is still rare to see economics papers that use out-of-sample tests.) Of course, in theory econometrics formulas give good error bounds on fitted equations (I still remember the matrix formulas that Jerry Hausman and others drilled into us in the first year of grad school). But the theory assumes that there are no omitted variables and no measurement errors! Of course all real models have many omitted variables. Doubly so, since “omitted variables” include all nonlinear transforms of included variables.
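Here is a minimal sketch of the practice in R: hold out part of the data, fit on the rest, and compare in-sample and out-of-sample error. The data set and split are just placeholders.

```r
# Hold-out validation sketch: in-sample fit is usually optimistic.
set.seed(1)
data(mtcars)
train <- sample(nrow(mtcars), 22)            # roughly 70% of rows for fitting
fit   <- lm(mpg ~ ., data = mtcars[train, ])

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse(mtcars$mpg[train],  predict(fit))                    # in-sample error
rmse(mtcars$mpg[-train], predict(fit, mtcars[-train, ]))  # out-of-sample error
# With many predictors and few observations, the second number is typically
# much larger -- that gap is what out-of-sample testing exposes.
```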

Here are two recent columns on other differences between economists’ and statisticians’ approaches to problem solving.

I am not an econometrician, by Rob Hyndman.

and

Differences between econometrics and statistics: From varying treatment effects to utilities, economists seem to like models that are fixed in stone, while statisticians tend to be more comfortable with variation, by Andrew Gelman.

Good data mining reference books

The students in my Big Data Analytics course asked for a list of books on the subject they should have in their library. UCSD has an excellent library, including digital versions of many technical books, so my list consists entirely of books that can be downloaded on our campus. Many are from Springer. There are several other books that I have purchased, generally from O’Reilly, that are not listed here because they are not available on campus.

These are intended as reference books for people who have taken one course in R and data mining. Some of them are “cookbooks” for R. Others discuss various machine learning techniques.

BDA16 reference book suggestions

If you have other suggestions, please add them in the comments with a brief description of what is covered.

Death by GPS | Ars Technica

Why do we follow digital maps into dodgy places? Something is happening to us. Anyone who has driven a car through an unfamiliar place can attest to how easy it is to let GPS do all the work. We have come to depend on GPS, a technology that, in theory, makes it impossible to get lost. Not only are we still getting lost, we may actually be losing a part of ourselves. Source: Death by GPS | Ars Technica

As usual, aviation is way “ahead.” Use of automated navigation reduces pilots’ navigation skills; automated flight reduces hand-flying skills. Commercial aviation is starting to grapple with this, but there is no easy solution.

The Tesla Dividend: Better Internet Access — Interesting but Wrong

Elon Musk’s newest car doesn’t just run on electricity — it needs a world class fiber network. Source: The Tesla Dividend: Better Internet Access — Backchannel — Medium

This is an interesting attempt to give still more importance to Tesla and very smart cars. “Tesla cars generate about 1 Gigabyte per minute of [raw] data.”

But the argument is wrong. They generate plenty of data internally – so do today’s other advanced cars with their 100+ processors. But that data is thrown away as fast as it is created. It’s part of what I called “dark data” in my report on Measuring Information. Neither Tesla nor anyone else needs the massive detail. Even for deep learning, only a few seconds are going to be useful, per hour of operation. See my response to the original article, here.
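A rough back-of-envelope version of the argument, in R. The 1 GB per minute figure is the one quoted in the article; the “few seconds per hour” actually worth keeping is my assumption.

```r
# Back-of-envelope: how much of the raw data stream would actually be kept?
raw_rate_gb_per_min <- 1                          # figure quoted in the article
raw_per_hour_gb     <- raw_rate_gb_per_min * 60   # ~60 GB of raw data per hour

useful_seconds_per_hour <- 10                     # my assumption: "a few seconds"
kept_fraction    <- useful_seconds_per_hour / 3600
kept_per_hour_gb <- raw_per_hour_gb * kept_fraction   # well under 1 GB per hour

c(raw = raw_per_hour_gb, kept = kept_per_hour_gb)
```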

Using data mining to ban trolls on League of Legends

Something I just found for my Big Data class.

Riot rolls out automated, instant bans for League of Legends trolls

Machine learning system aims to remove problem players “within 15 minutes.”

An interesting thread of player comments has a good discussion of potential problems with automated bans. Only time will tell how well the company develops the system to get around these issues.
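Riot has not published the details of its system, so purely as a classroom illustration, here is what a minimal version of such a classifier might look like in R: simulated per-message features (count of flagged words, fraction of capital letters) and a logistic-regression “toxicity” score. Every feature, number, and threshold here is invented.

```r
# Classroom illustration only -- Riot's actual system is not public.
# Simulated per-message features and a logistic-regression "troll" score.
set.seed(1)
n <- 1000
troll      <- rbinom(n, 1, 0.2)                                # true label
n_insults  <- rpois(n, lambda = ifelse(troll == 1, 2.0, 0.2))  # flagged words
caps_ratio <- rbeta(n, ifelse(troll == 1, 4, 1), 4)            # SHOUTING
chat <- data.frame(troll, n_insults, caps_ratio)

train <- sample(n, 0.7 * n)
fit <- glm(troll ~ n_insults + caps_ratio, data = chat[train, ],
           family = binomial)

# Flag messages whose estimated probability of being toxic is high
p <- predict(fit, chat[-train, ], type = "response")
table(flagged = p > 0.5, actually_troll = chat$troll[-train])
```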

This company also took an experimental approach to banning players, and hired three PhDs in Cognitive Science to develop it. (Just to be clear, their experiments did not appear to be automated A/B-style experiments.) After the jump is a screen shot from that system.

League of Legends screen shot

But I’m not tempted to play League of Legends to study player behavior and experiment with getting banned! (I don’t think I’ve ever tried an MMO beyond some prototypes 15 years ago.) If any players want to post their observations here, great.