Next Tech Girls' mission is to improve gender equality in the technology sector by providing 5,000 girls with tech work experience by 2020.
If we want to see change, we must drive it. Next Tech Girls joins hundreds of organisations and companies around the world that are dedicated to gender equality in tech.
They ask technology companies across the UK to offer work experience opportunities to schoolgirls who are passionate about tech. One- or two-week placements will show girls the scope of technology opportunities available to them.
I am very excited to announce that Data! Data! Data! has several discount codes available for Data Events across the UK & Europe.
Having run lots of events across the year with AI Europe, Unicom, the Data! Data! Data! Meetup and the Weekend University, we have managed to network with lots of Data Lovers!
Feel free to click the link, enter the code and enjoy a discount on me! (Or email me for a discount.)
- Data Science, AI and Deep Learning - November 16th - Novotel London West - Email: firstname.lastname@example.org for 30% Discount!
- AI Europe - November 20-21, 2017 - Queen Elizabeth II Conference Centre - Enter Discount Code - AIE17NAKAMAZP for 20% Discount
- Data Visualisation Summit - December 6th - Brussels - Email: email@example.com for 30% Discount! (5 Free Tickets Available)
- Data Visualisation Summit - December 13th - Manchester - Email: firstname.lastname@example.org for 30% Discount! (5 Free Tickets Available)
- And finally, if you fancy being a Podcast Mogul you can join the Weekend University's A Crash Course on Podcasting, held at Birkbeck University on November 25th - Enter Discount Code: nakama10
I'll see you there @TheDataAgent
I'm proud to be acting as a Partner for the upcoming event "Data Analytics and Behavioural Science Applied to Retail and Consumer Markets" that is to be held at the Millennium Hotel London Mayfair on June 28th.
If you're looking for a discounted ticket, drop me a line: I have 10 free tickets, along with 25% off others. Email: email@example.com
Use the voucher code DATA25 to get a 25% discount on the current ticket price.
It's with great pleasure that I've started my new role at the leading Global Digital Recruitment business NAKAMA. They say a change is as good as a rest, so I'm excited to be joining NAKAMA well rested and ready to go. The data market in London is booming, as is New York's, so I'm looking forward to continuing to help talented data professionals secure cool jobs with the best businesses in data globally!
If you are interested in learning more about what NAKAMA do, or how my team can help with your career or team growth, drop me a line at firstname.lastname@example.org or call me on 07814 397 783 / 0203 588 4572.
Imagine teaching a person a new magic trick. It's a simple enough formula: you show them how it's done, let them have a go and correct them when they make a mistake. Practice makes perfect. Now imagine that, instead of being able to show them the trick, you could only show them low-quality photos of it being performed. These photos have no description of how the trick is done and barely show enough detail to see the trick in action.
This is the problem that is plaguing data science. We all know that data science is all the rage - the hottest job on the block. People train in complex mathematical disciplines to apply cutting-edge algorithms to glean new insights from data. The problem is that, before the learning can take place, the low-quality input has to be improved. This is a source of immense frustration to business stakeholders, data scientists and engineers, who spend much of their expert time performing monotonous data-cleansing work.
The new open-source project Salute attempts to address this challenge and give data scientists some of their time back to focus on what's really important.
Salute's goal is to take any type of file (video, audio, image or text) and to recognise, analyse and transform it so that it is ready for consumption by Machine Learning routines. This involves recognising delimiters, data types, data distributions, categorical variables and much, much more. The Salute framework is built in a way that allows new derivations from the data to be added easily, so that experts can contribute in their own area of expertise.
To achieve all this on even the largest files, Salute is based on Apache Spark and so can integrate with existing clusters and use their processing power. Using Java means that contributing is straightforward and open to many coders.
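As a rough illustration of the kind of recognition step described above, here is a minimal Python sketch (not Salute's actual API, and deliberately single-machine rather than Spark-based) that guesses a file's delimiter, infers column types and flags likely categorical variables; the file name and thresholds are hypothetical:

```python
import csv
import pandas as pd

def profile_file(path, sample_bytes=64 * 1024, max_categories=20):
    """Guess the delimiter, column types and categorical candidates of a delimited text file."""
    with open(path, "r", newline="") as f:
        sample = f.read(sample_bytes)

    # Recognise the delimiter from a sample of the raw text.
    dialect = csv.Sniffer().sniff(sample)

    # Let pandas infer column data types from a bounded sample of rows.
    df = pd.read_csv(path, sep=dialect.delimiter, nrows=10_000)

    profile = {}
    for col in df.columns:
        n_unique = df[col].nunique(dropna=True)
        profile[col] = {
            "dtype": str(df[col].dtype),
            "distinct_values": int(n_unique),
            # Low-cardinality columns are flagged as categorical candidates.
            "categorical": n_unique <= max_categories,
        }
    return dialect.delimiter, profile

# Example usage (hypothetical file name):
# delimiter, column_profile = profile_file("transactions.csv")
```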
Salute is a new project and much of it is still immature. If you feel you can contribute, please do so here.
Everyone wants a Data Scientist. It's cool, it's sexy, it's the job of the century - but what do they actually do?
Every day I speak with clients eager to grow their Data Science function, hopeful that in hiring a “Data Scientist” their business will become enlightened and pass into commercial nirvana. It’s not always so easy.
All too often, once the company has been successful in hiring the Ivy League "Unicorn" with cutting-edge Machine Learning algorithm development skills, they hand them a bunch of dirty data and say: "Can you clean that, please?"
When the Angel Round funding lands the Data Scientist is the “go-to” trophy hire, an injection of sexiness that will bring about the “Uberisation” of your disruptive Bricklaying App.
Sometimes it pays to have a horse before you buy a cart. Or is that a cart before the horse? It’s a chicken and egg situation. Perhaps buy lots of Chickens and Horses and somehow the cart will learn to pull itself. That’s Machine Learning in an eggshell.
Our latest blog comes from Dr Mozafar Hajian, currently IBM’s Analytics Consultative Sales leader for the Financial and Banking industries.
Already, many industry front runners in the Retail and Banking industries are adopting cognitive computing to take their investments in cloud, data, analytics, mobile and social to the next level, which enables them to change the way they run their businesses for the better; see http://www.ibm.com/cognitive/
Soon, cognitive computing will become the recognised vehicle for transforming businesses into new operating models such as digital and multi-channel.
The digital model would leverage information and insights - not accessible before - to streamline new business processes, help make complex decisions with the speed and quality that matter, and change the way companies interact with their customers and partners, maximising the value of the full business potential and opportunities for all parties, with no real limits!
What is cognitive computing? Some practitioners believe that Cognitive Computing is the next step in the analytics world.
Using the famous analytics chart described in “Competing on Analytics” by Thomas H. Davenport and Jeanne G. Harris, they place it as the next step in the analytics evolution:
Descriptive -> Predictive -> Prescriptive -> Cognitive.
This definition, in my view, is not accurate. For a start, traditionally most analytics processes worked solely with structured data; Cognitive Computing, as defined by @IBM, makes use of all available data (structured and unstructured). It takes advantage of existing descriptive, predictive and prescriptive processes and even enriches them by providing more information and insight back to them, progressively improving the quality of the information as it grows.
I think Figure 1 gives a more accurate representation of cognitive computing, in which seamless integration of all phases of advanced analytics and the use of all types of data is made possible.
Figure 1: Cognitive computing spans across analytics landscape using all available data
Where to start? Like many emerging technological eras that some of us have witnessed in the past 25 years, cognitive computing too needs to go through a ramp-up period before it is widely adopted within and across enterprises.
It is only during this time that early adopters - market leaders and visionaries - would reap the benefits of the new technology by being the first to take advantage of it, taking the lion's share of the value and, most importantly, influencing the way it shapes commodities in the future. However, gaining an advantage from the new technological era over competitors requires businesses to acquire new skills among their professionals.
What skills are necessary to make this journey? To answer this question, let’s take a look at a typical decision process (a toy worked example follows the list below):
- Define the business problem using its unique domain, attributes and relationships.
- Translate the relationships between these attributes into a logical model that represents the business problem.
- Connect the model to its relevant data in a form that is consumable by its solving algorithm.
- Run the algorithm to solve the model.
- Translate the result into a solution to the business problem.
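To make these steps concrete, here is a toy worked example (not from the original article) that walks through the same five steps for a hypothetical product-mix decision, modelled as a small linear program in Python:

```python
from scipy.optimize import linprog

# 1. Business problem (hypothetical): choose how many units of two products
#    to make next week to maximise profit, given limited machine and labour hours.
profit_per_unit = [40, 30]   # product A, product B
machine_hours   = [2, 1]     # hours per unit
labour_hours    = [1, 2]
available       = [100, 80]  # machine capacity, labour capacity

# 2. Logical model: maximise 40*a + 30*b subject to the resource limits.
#    linprog minimises, so we negate the objective.
c = [-p for p in profit_per_unit]
A_ub = [machine_hours, labour_hours]
b_ub = available

# 3./4. Connect the model to its data and run the solving algorithm.
result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])

# 5. Translate the result back into business terms.
units_a, units_b = result.x
print(f"Make {units_a:.0f} of A and {units_b:.0f} of B "
      f"for a weekly profit of £{-result.fun:.0f}")
```

The point is only to show the shape of the process; real business problems involve far richer domains, models and data.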
To perform the above tasks we typically need business analysts, who understand the business domain; mathematicians and/or statisticians, who are able to build the logical models and design the algorithms to solve them; data analysts, who know where the necessary data is stored, how to get it, how to make the necessary transformations and how to make it available; and software engineers, who know how to install, configure and connect all the software tools. The above description is perhaps an oversimplification of the skill requirements; however, to understand where these skills would come from, let’s take a couple of short steps back into how the science of problem solving has evolved.
Usually, we have two separate sets of disciplines taught at universities: the fields of Operations Research and Computer Science, both of which contribute heavily to advancements in Decision Science, a branch of Management Science. In the 70s and 80s, scientists from these fields were often working to invent new algorithms, or improve existing ones, to better solve the same complex problems arising from management science, without much visibility of each other's work! Most of this work was carried out to overcome the limitations in available storage capacity, processing power and random-access memory, which constrained the size and complexity of the problems that could be solved in a reasonable timeframe.
In the mid-90s, however, thanks to a lot of pioneering joint work by scientists from both disciplines, this started to change and new trends in the science of complex problem solving began to emerge, see for example:
The advancements in algorithms and computer platforms for solving large-scale complex problems have given businesses enough confidence to apply them in their daily business management and decision-making processes, see for example,
The above capabilities, coupled with technology to obtain, analyse and process large volumes of data, have enabled companies to create insight that is consumable by decision makers, be it man, machine or a combination of the two! Data Scientists (a newly created role in the marketplace after the emergence of big data) are key to delivering this. I have come across many different definitions of the role or skills that a Data Scientist must have, but the best that I have seen so far is by my favourite CDO @usamaf, who wrote:
"A Data Scientist is someone who Knows a lot more software engineering than Statisticians & Knows a lot more Statistics than software engineers"
Why are Data Scientists important in deploying cognitive computing? Cognitive computing, as seen on http://www.ibm.com/cognitive/outthink/, is a journey, a journey that takes companies to the next level, a journey that involves planning and executing all the different steps along the way.
As an example, these steps could include, but are not limited to:
- Understanding the business long/mid-term goal and strategy.
- Aligning the strategy with cognitive capabilities to determine a road map for acquiring data and platforms necessary to achieve the goals.
- Map out man and machine engagement models, i.e. determine who learns what from whom, and who does what.
- Identify the business processes that directly benefit from digitization.
- And last but not least, enable business innovation through cognition across all parts of the business.
However, although this journey is different for every company, the vehicle is going to be the same. In my opinion Data Scientists are best positioned to drive this vehicle. They know the passengers and their destinations, they know the road and its turns and bumps, they know the vehicle and how it works, and most importantly they know what fuels this vehicle, Data!
Disclaimer: Unless specified, all thoughts and opinions expressed here are my own! @MoziHajian
By Richard Lewis (Director @ Model Citizen)
Last week I was fortunate enough to be invited to talk at the inaugural Data!Data!Data! networking event. It was a great evening and I’d like to thank everyone for raising some really interesting points, challenges and questions. I’ve not captured all of this below, but have just created a very brief, high-level summary of the main part of the talk.
Data!Data!Data! is a quote from Sherlock Holmes; the concluding part of the quote is “I can’t make bricks without clay”. There’s a seemingly obvious meaning behind it, but to the data scientist it’s more complex. Why should we want bricks anyway? To make houses? (“I can’t make houses without bricks!”) Why should we want houses? To sell; to live in?
In our world of data science there is invariably more than one layer forming the “data to solution” parallel. We sometimes need multiple conclusions from previous insight (treated as new raw data) to solve a new problem. This is perhaps easier to discuss with a more practical example; a problem I call “The Pomelo Problem.”
Imagine a scenario in a supermarket, where we have run a very simple pairwise correlation analysis (or product affinity analysis). The analysis has identified a correlation between vanilla pods (an infrequently purchased item) and eggs. This correlation shows something useful: maybe the customer has a baking related shopping mission. I am able to see beyond the pattern and begin to form my conclusion.
Unfortunately, the analysis also shows exactly the same numbers for pomelos and milk. This is a problem; pomelos are an obscure fruit from South East Asia. They’re only in season for one or two months a year and are likely to be added only as a top-up item to big shops. The trouble is, the analysis so far cannot differentiate between the two examples.
We need to try and improve our algorithm. A second tier of correlation is introduced. Vanilla pods are also found to be correlated with flour. This is good news, as it further supports a shopping mission I can believe in. Yet my analysis also correlates (with exactly the same numbers) pomelos and bread. These two are only likely to appear to be correlated due to the randomness of additional items that are added to big baskets, and I’m back in the same position.
The challenge the Pomelo is posing to my algorithm is that it produces a useless scenario with exactly the same numbers as a useful scenario. The challenge is to be able to improve my algorithm to the point where it can distinguish between the two without needing additional human interaction. I need to provide the algorithm with additional data that’s not fed from a source system. I need to give it my (human) knowledge.
One possible method is to provide the algorithm with additional association data. For example, I can link vanilla pods and baking, and eggs and baking, as metadata. Or, as one delegate suggested on the evening, maintain recipes in the data. This could be great, but it causes additional problems. Who is going to write this data? Who is going to maintain it? Who is going to look after data quality and data consistency? Across tens of thousands of product lines this could be an exceedingly costly task. Would the benefit gained from using this data (and the resulting algorithm improvement) be enough to build a business case for the investment?
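To make the discussion concrete, here is a toy sketch (with made-up baskets, not from the talk) of the pairwise affinity analysis and of the kind of item-to-theme metadata described above; the "lift" measure stands in for whatever correlation statistic the real analysis used:

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction baskets; in practice these would come from till data.
baskets = [
    {"vanilla pods", "eggs", "flour", "milk"},
    {"eggs", "flour", "sugar"},
    {"pomelo", "milk", "bread", "eggs"},
    {"milk", "bread"},
    {"vanilla pods", "eggs", "sugar"},
]

# Count how often each item and each pair of items appears in a basket.
item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

def lift(a, b, n=len(baskets)):
    """Lift > 1 suggests the items co-occur more often than chance alone would predict."""
    pair = pair_counts[tuple(sorted((a, b)))]
    return (pair / n) / ((item_counts[a] / n) * (item_counts[b] / n))

# The raw numbers alone cannot tell a baking mission from a coincidence,
# so we add human knowledge as simple item -> theme metadata.
themes = {"vanilla pods": "baking", "eggs": "baking", "flour": "baking"}

for a, b in [("vanilla pods", "eggs"), ("pomelo", "milk")]:
    shared_theme = themes.get(a) is not None and themes.get(a) == themes.get(b)
    print(f"{a} + {b}: lift={lift(a, b):.2f}, shared theme={shared_theme}")
```

The numbers alone still cannot rescue the pomelo case; it is the human-supplied theme data that separates a believable shopping mission from a coincidence, and someone has to write and maintain that data.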
There are, of course, countless different ways in which the basic algorithm in The Pomelo case could be improved to resolve this issue. The point is not the specific techniques we should be using, but that there comes a point where no algorithm can differentiate between scenarios—they reach their limit.
We can never say the algorithm is complete or the solution optimised (past tense). Partly because of the issue The Pomelo Problem highlights, but also because it must be assumed your competition will also have implemented the basic algorithm, meaning competitive advantage has not yet been attained, only competitive parity. The challenge for the data scientist is to recognise The Pomelo Problem, locate examples and identify solutions to them.
This is the essential human aspect of analytics and more of an art than a science. Introducing the human element also introduces potential bias to the results and needs to be considered very carefully. But we’re not yet in a position where an algorithm can be allowed to run indefinitely with the assumption that it’s the best it can be. There may be a disadvantage to introducing the human element, but doing so will enable better and faster progression. In my view it remains an essential input to any model.
Kevin Schmidt - CTO @ Century Technology gives his views on the Next Generation of Data Architecture
One of the benefits of coming into a greenfield job – like when I joined Mind Candy two years ago – is that you can jump several technological steps ahead, as you don’t have any legacy to deal with. Essentially we could build from scratch, based on lessons learned from traditional data architecture. One of the main ones was to establish a real-time path right away to avoid having to shoehorn it in afterwards. Another was to avoid physical hardware. And the most important one was to hold off on Hadoop for as long as possible.
The last one might seem surprising – isn’t Hadoop the centrepiece of a data architecture? Unfortunately it creates a lot of admin overhead, and maintaining it can be a full person’s workload or more. Not ideal in a small company where people resources are limited. AWS S3 can fulfil most of the storage function, requires no maintenance and is largely fast enough. And while HDFS is important and will probably come back for us soon, MR1 or YARN is just not – there are better and more advanced execution systems that can use HDFS, and we used one of those: Mesos.
Mesos is a universal execution engine for job and resource distribution. Unlike YARN it can run not only Spark but also Cassandra, Kafka, Docker containers and, recently, HDFS. That works because Mesos just offers resources and lets each framework handle the starting and management of its jobs. This finally breaks the link between framework and execution engine: on Mesos you can run not only different frameworks but different versions of the same framework. No more waiting for your infrastructure to upgrade to the latest Hadoop or Spark version – you can run it right now even when all your other jobs run on older versions. Combine that with a robust architecture and simple upgrading and Mesos can easily be seen as the successor to YARN (for more details on why Mesos beats YARN, see Dean Wampler’s talk from Strata).
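As a rough sketch of what this looks like from the application side, the snippet below starts a PySpark session against a Mesos master; the ZooKeeper URL and resource settings are placeholders rather than our actual configuration:

```python
from pyspark.sql import SparkSession

# Hypothetical Mesos master URL (via ZooKeeper) and resource settings.
spark = (
    SparkSession.builder
    .appName("mesos-example")
    .master("mesos://zk://zk1:2181,zk2:2181/mesos")  # Mesos offers the resources
    .config("spark.executor.memory", "4g")
    .config("spark.cores.max", "8")
    .getOrCreate()
)

# From here on, Spark jobs run on executors launched via Mesos resource offers.
df = spark.range(1_000_000)
print(df.selectExpr("sum(id)").first()[0])
```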
For the real-time path the obvious processing solution is Spark Streaming (so we have a simpler code base) running on Mesos, with Kafka to feed data in and Cassandra to store the results. You now have a so-called SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka) for data processing, which the Mesos folks call Mesosphere Infinity for some reason (aka marketing).
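For a feel of the real-time path, here is a minimal sketch of the Kafka-to-Spark-to-Cassandra flow. It uses PySpark Structured Streaming (a newer API than the DStream-based Spark Streaming we ran), the topic, schema, keyspace and table names are made up for illustration, and it assumes the Kafka source and spark-cassandra-connector packages are on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("realtime-path").getOrCreate()

# Hypothetical event schema.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event", StringType()),
    StructField("ts", TimestampType()),
])

# Read the event stream from Kafka.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka1:9092")
    .option("subscribe", "game-events")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Aggregate per minute and write each micro-batch to Cassandra.
counts = events.groupBy(window(col("ts"), "1 minute"), col("event")).count()

def write_to_cassandra(batch_df, batch_id):
    (batch_df
     .withColumn("window_start", col("window.start"))
     .drop("window")
     .write.format("org.apache.spark.sql.cassandra")
     .options(keyspace="analytics", table="event_counts")
     .mode("append")
     .save())

query = counts.writeStream.outputMode("update").foreachBatch(write_to_cassandra).start()
query.awaitTermination()
```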
The last bit of a data architecture is the SQL engine. Traditionally this was Hive, but we all know Hive is slow. While there are several open-source solutions out there that improve on good old Hive (Impala, Spark SQL), in the end we decided on AWS Redshift. It’s a column-oriented, SQL-based data warehouse with a PostgreSQL interface which fulfils most of our data analysis and data science needs while being reasonably fast and relatively easy to maintain with few people.
The resulting architecture looks like the picture above. We have an event receiver and an enricher/validator/cleaner, which were written in-house in Scala/Akka and are relatively simple programs using AWS SQS as a transport channel. The data is then sent to Kafka and S3. Spark uses data straight from S3 to aggregate and puts the processed data back into either Redshift or S3. On the real-time side of things we have Kafka going into Spark Streaming, with output into Cassandra.
What can be improved here? HDFS is still better than S3 for certain large-scale jobs and we want to bring it back, running on Mesos. Redshift could hopefully be replaced with Spark SQL soon. All in all, the switch from tightly coupled Hadoop to an open architecture based on Mesos has given us unprecedented freedom as to which kinds of data jobs we want to run and which frameworks to use, allowing a small team to do data processing in ways previously only possible on a large budget.
The first of our Data! Data! Data! Blog interviews
Maarten Ectors - Vice President of Internet of Things at Canonical Ltd. / Ubuntu - discusses "The state of Big data"
Modern lambda architectures, based on Spark or on Storm and Hadoop, allow both batch and stream processing to be done on huge quantities of data at incredible speeds. If you're fortunate enough to be doing greenfield development involving big data then you should definitely be looking at these.
But what if you have a big legacy batch system? Should you aim to migrate everything to a new system?
Well, maybe not. Or at least not all at once.
I'm Dan Hanley, CTO at ActiveStandards and we use these technologies to provide highly robust and scalable Digital Quality Management solutions to global enterprises.
I'd like to share a few things I learned on the way to getting our first Spark based project running in production.
Don't boil the ocean – yes it's important to have the big vision of where you want to be, but the real trick is in breaking the journey down into small steps that deliver value along the way. In our case this meant building a much scaled down version of our full production system. It delivers only a small subset of the functionality. But it let us develop and test various topologies and get our deployment processes ironed out.
Storage tiers are different. Not just different from relational, but very different within their own groups. The architecture and schema you'll use with a column DB like Cassandra will be very different to what you'll have with a document DB like MongoDB. Or maybe you just need a file system like HDFS or S3? You'll need to figure out which is the best fit for your purpose.
Use empirical evidence and keep asking questions. In our traditional batch system we shard the MySQL data across instances by UUID – this gives a uniform volume and load distribution across the cluster, with performance equal to that of the slowest instance. Initially we partitioned our Cassandra data the same way. After a conversation with Datastax we tried a different partitioning scheme – and the tests showed that for Cassandra it was far more appropriate to partition on a much higher level key – effectively siloing all data for a given client onto one or two boxes in the cluster.
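To make the partitioning change concrete, here is a hedged sketch using the Python Cassandra driver; the keyspace, tables and columns are hypothetical rather than our actual schema:

```python
from cassandra.cluster import Cluster

# Hypothetical contact point and keyspace, for illustration only.
session = Cluster(["cassandra1"]).connect("dqm")

# Original approach: partition by UUID, spreading every client's rows evenly
# across the ring (uniform load, but per-client scans touch many nodes).
session.execute("""
    CREATE TABLE IF NOT EXISTS pages_by_uuid (
        page_id uuid,
        client_id text,
        url text,
        score int,
        PRIMARY KEY (page_id)
    )
""")

# Revised approach: partition on the higher-level client key so that all of a
# client's data lands on one or two replicas and can be read together.
session.execute("""
    CREATE TABLE IF NOT EXISTS pages_by_client (
        client_id text,
        page_id uuid,
        url text,
        score int,
        PRIMARY KEY ((client_id), page_id)
    )
""")
```

In the first table every page lands on an effectively random node; in the second, all of a client's pages share a partition key, so per-client reads stay on one or two replicas - which is what our tests showed worked far better for Cassandra.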
Test your topologies. Should you use a few big machines? Many smaller ones? Co-locate the spark node with the data-storage nodes? Have a symmetric or asymmetric Spark to Data ratio? Secondary indices? Denormalised and redundant data? There is no set easy answer. What works for me will quite likely be different for you. The only way to be sure is to test your options rigorously and optimize for your own use case.
"Data! Data! Data! he Cried Impatiently "You can't make bricks without clay." Sir Arthur Connan Doyle
The words of Sir Arthur Conan Doyle’s Sherlock Holmes are the tag line for this blog and underpin our philosophy on data and the modern world. Data is the building block of modern business and, when used correctly, can drive decision making, improve performance, engage customers, streamline processes and generate profit. Without a coherent data strategy in your business you are reducing your decision-making process to, at best, a punt or a hunch - or, more accurately, a gamble.
According to former Google CEO Eric Schmidt, every two days we create as much information as we did from the dawn of civilization up until 2003. That’s something like five exabytes of data every other day that will hold vital information as to how people engage with the world in the Digital Age. With all of this valuable data available, why would you gamble on your decision making and not make an informed business decision? The answer, of course, is cost: getting the right people in place to actually derive actionable insight from all this data can seem daunting and expensive, but with the right support it is an invaluable investment.
Why invest in Data Science? Elementary my dear Watson!
With the amount of data being created daily, the work of Analysts and Data Scientists has never been so in demand. In 2012 Peter Sondergaard, senior vice president at Gartner and global head of Research, suggested that the world of Big Data would generate 4.4 million IT jobs, but he warned that the increase in demand could not be met by the supply of talent.
As someone who has worked in the field of Analytics recruitment for a number of years, I have seen first-hand how the highly numerate, tech-savvy Analyst with the ability to communicate effectively has become the go-to hire for anyone looking to improve their data proposition. The Holy Grail in this area is a PhD Stats “storyteller” who can number-crunch like a machine and make his or her findings clearly actionable to the business through effective communication (and a pretty visualisation if they can).
Finding this Stephen Hawking meets Barack Obama hybrid can be tricky - some say it’s like finding a unicorn - but they are out there and they are integral to your company’s growth.
If you feel you are in need of making some bricks from all this clay then get in touch - I know quite a few unicorns that can help you out.