I went over to NYU Poly in Brooklyn on Friday of last week for their Big Data Finance Conference. To get a slightly negative point out of the way early, I guess I would have to pose the question "When is a big data conference not a big data conference?". Answer: "When it is a time series analysis conference" (sorry if you were expecting a funny answer...but as you can see, what I occupy my time with professionally doesn't naturally lend itself to too much comedy). As I like time series analysis this was ok, but it certainly wasn't fully "as advertised" in my view, though I guess other people are experiencing this problem too.
Maybe this slightly skewed agenda was due to the relative newness of the topic, the newness of the event and the temptation for time series database vendors to jump on the "Big Data" marketing bandwagon (what? I hear you say, we vendors jumping on a buzzword marketing bandwagon, never!...). Many of the talks were about statistical time series analysis of market behaviour and less about what I was hoping for, which was new ways in which empirical or data-based approaches to financial problems might be addressed through big data technologies (as an aside, here is a post on a previous PRMIA event on big data in risk management as some additional background). There were some good attempts at getting a cross-discipline fertilization of ideas going at the conference, but given the topic, representatives from the mobile and social media industries were very obviously missing in my view.
So as a complete counterexample to the two paragraphs above, the first speaker at the event (Kevin Atteson of Morgan Stanley) was very much on theme with the application of big data technologies to the mortgage market. Apparently Morgan Stanley had started their "big data" analysis of the mortgage market in 2008 as part of a project to assess and understand more about the potential losses that Fannie Mae and Freddie Mac faced due to the financial crisis.
Echoing some earlier background I had heard on mortgages, one of the biggest problems in trying to understand the market according to Kevin was data, or rather the lack of it. He compared mortgage data analysis to "peeling an onion" and said that, going back to the time of the crisis, mortgage data at an individual loan level was either not available or of such poor quality as to be virtually useless (e.g. it was hard to get accurate ZIP code data for each loan). Kevin described the mortgage data set as "wide" (lots of loans with lots of fields for each loan) rather than "deep" (lots of history), and said that one of the main data problems was trying to match nearest-neighbour loans. He mentioned that only post crisis have Fannie and Freddie been ordered to make individual loan data available, and that there is still no readily available linkage data between individual loans and mortgage pools (some presentations from a recent PRMIA event on mortgage analytics are at the bottom of the page here for interested readers).
Kevin said that Morgan Stanley had rejected the use of Hadoop, primarily due to its write throughput capabilities, which Kevin indicated was a limiting factor in many big data technologies. He indicated that for his problem type he still believed their infrastructure to be superior to even the latest incarnations of Hadoop. He also mentioned the technique of having 2x redundancy or more on the data/jobs being processed, aimed not just at failover but also at using whichever instance of a job finished first. Interestingly, he also added that Morgan Stanley's infrastructure engineers have a policy of rebooting servers in the grid even during the day/use, so fault tolerance was needed for both unexpected and entirely deliberate hardware node unavailability.
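For anyone who wants to picture the "run it twice, take whichever finishes first" idea, here is a minimal sketch in Python. To be clear, this is my own illustration and not Morgan Stanley's infrastructure: the names `run_replica` and `first_successful` are hypothetical, the replicas are just threads that sleep, and a "rebooted node" is simulated by raising an exception.

```python
import concurrent.futures as cf
import time


def run_replica(replica_id, delay, fail=False):
    """Hypothetical stand-in for one redundant copy of a grid job."""
    time.sleep(delay)  # pretend to do some work
    if fail:
        # Simulates the node being rebooted or otherwise lost mid-job.
        raise RuntimeError(f"replica {replica_id} lost its node")
    return replica_id


def first_successful(replicas):
    """Run all replicas concurrently and return the first successful result.

    `replicas` is a list of (delay_seconds, fail) tuples, one per copy.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [
        pool.submit(run_replica, i, delay, fail)
        for i, (delay, fail) in enumerate(replicas)
    ]
    try:
        # as_completed yields futures in the order they finish.
        for fut in cf.as_completed(futures):
            if fut.exception() is None:
                return fut.result()
        raise RuntimeError("all replicas failed")
    finally:
        # Do not block on the slower copies; in this sketch they simply
        # run to completion in the background and their results are ignored.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    # Replica 0 is slow, replica 1 is fast: the fast one wins.
    print(first_successful([(0.2, False), (0.05, False)]))
    # Replica 0 finishes first but fails: we fall back to replica 1.
    print(first_successful([(0.01, True), (0.1, False)]))
```

The same loop covers both of the cases Kevin described: a slow copy is simply beaten by a faster one, and a copy that dies (deliberately or not) is skipped in favour of the next one to finish.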
Other highlights from the day:
- Dennis Shasha had some interesting ideas on using matrix algebra to reduce the data analysis workload needed in some problems - basically he was all for "cleverness" over simply throwing compute power at some data problems. On a humorous note (if you are not a trader?), he also suggested that some traders had "the memory of a fruit-fly".
- Robert Almgren of QuantitativeBrokers was an interesting speaker, talking about how his firm had done a lot of analytical work in trying to characterise possible market responses to information announcements (such as Friday's non-farm payroll announcement). I think Robert was not so much trying to predict the information itself, but rather trying to predict likely market behaviour once the information is announced.
- Scott O'Malia of the CFTC was an interesting speaker during the morning panel. He again acknowledged some of the recent problems the CFTC had experienced in terms of aggregating/analysing the data they are now receiving from the market. I thought his comment on the twitter crash was both funny and brutally pragmatic with him saying "if you want to rely solely upon a single twitter feed to trade then go ahead, knock yourself out."
- Eric Vanden Eijnden gave an interesting talk on "detecting Black Swans in Big Data". Most of the examples were from current detection/movement in oceanography, but seemed quite analogous to "regime shifts" in the statistical behaviour of markets. The main point seemed to be that these seemingly unpredictable and infrequent events were predictable to some degree if you looked deep enough in the data, and in particular that you could detect when the system was on a likely "path" to a Black Swan event.
One of the most interesting talks was by Johan Walden of the Haas Business School, on the subject of "Investor Networks in the Stock Market". Johan explained how they had used big data to construct a network model of all of the participants in the Turkish stock exchange (both institutional and retail) and in particular how "interconnected" each participant was with other members. His findings seemed to support the hypothesis that the more "interconnected" the investor (at the centre of many information flows rather than at the edges), the more likely that investor would be to demonstrate superior return levels to the average. I guess this is a kind of classic transferral of some of the research done in social networking, but it was very interesting to see it applied pragmatically to financial markets, and I would guess it is an area where a much greater understanding of investor behaviour could be gleaned. Maybe Johan could do with a little geographic location data to add to his analysis of how information flows.
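As a rough illustration of what "interconnected" might mean in a network like this, the sketch below scores each investor by degree centrality, i.e. the share of other investors they are directly linked to. This is an assumption on my part and only the simplest possible stand-in: I don't know which centrality measure Johan actually used or how the links between investors were inferred, and the investor labels and edges here are entirely made up.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return {investor: fraction of other investors directly linked to}.

    `edges` is an iterable of (investor_a, investor_b) pairs describing a
    hypothetical undirected information link between two investors.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-links
        neighbours[a].add(b)
        neighbours[b].add(a)

    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    # Normalise by the maximum possible number of links, n - 1.
    return {node: len(peers) / (n - 1) for node, peers in neighbours.items()}


if __name__ == "__main__":
    # Made-up network: A sits in the centre, B hangs off the edge.
    links = [("A", "B"), ("A", "C"), ("A", "D"), ("C", "D")]
    scores = degree_centrality(links)
    for investor in sorted(scores, key=scores.get, reverse=True):
        print(investor, round(scores[investor], 2))
```

With scores like these in hand, the hypothesis from the talk would amount to checking whether investors with higher centrality also show higher returns than those at the edges.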
So overall a good day with some interesting talks - the statistical presentations were challenging to listen to at 4pm on a Friday afternoon but the wine afterwards compensated. I would also recommend taking a read through a paper by Charles S. Tapiero on "The Future of Financial Engineering" for one of the best discussions I have so far read about how big data has the potential to change and improve upon some of the assumptions and models that underpin modern financial theory (haven't found an online copy as yet but will update when I do). Coming back to my starting point in this post on the content of the talks, I liked the description that Charles gives of traditional "statistical" versus "data analytics" approaches, and some of the points he makes about data immediately inferring relationships without the traditional "hypothesize, measure, test and confirm-or-not" were interesting, both in favour of data analytics and in cautioning against unquestioning belief in the findings from data (feels like this post from October 2008 is a timely reminder here). With all of the hype and the hope around the benefits of big data, maybe we would all be wise to remember this quote by a certain well-known physicist: "No amount of experimentation can ever prove me right; a single experiment can prove me wrong."