First of all, let me wish you all a happy new year ! This post was intended to be published last year, but I lacked RC !
My last post was an introduction to ETL/ELT. I wanted to automate some decision making for my #dCity and I just thought it would be better to introduce the tools I would use as a first step.
So, if you are new to ETL/ELT, have a look a my ETL/ELT introduction, just to know what we're talking about here.
Context :
I'm playing with dCity for a year now. After buying my first cards, I realised I should optimize my decision process. So, I started a little spreadsheet and patiently filled the characteristics of all the cards and computed the ROI in days. It helped me choose my cards wisely.
I wanted to know which card I could grab now for the highest ROI. This is not optimal, as I'm depending on the current bids, not the "real value" of an asset : this would be a totally different analysis.
This meant that I had to check the price of a subset of cards in order to update my spreadsheet. It did not take me a lot of time, but I got a bit tired of it and decided to automate the process.
The results
I know not everyone will be interested in the details of the Talend Job, so I'll get to the point first and detail the job after. It takes around 15 seconds to run, and the main slow down comes from the API call. I'm on vacation at my in-laws at the moment, and the ADSL connection here is just lame. I'll test on my fiber connection back home but expect a really nice speedup.
The Excel file is very basic at the moment, yet we have a winner, and it is the Hotel !
Without taxes, and not taking the technology cards into account, the best 3 investments in terms of SIM return are, at time of writing, Hotel, Ice Cream Shop and Music Store.
Just don't take it as a buying advice. There's a bit more to take into account than just the SIM return. For example :
- do you have enough workers ?
- do you have already too much workers ?
- do you have a social aid office ?
- ...
Tax return is also something, so we should take the current tax level into account.
Technical presentation
I will now dive into the technical aspects.
Which Data do we need
This is quite straightforward. We need the description of every card, especially the "sim income", and their lowest ask on the market.
The market ask is easily accessible through an API : https://api.dcity.io/market_cards
@scriptioner and @gerber gave me the link on the dCity discord. Thanks a lot guys ! By the way, I'm interested in the other available API's. Maybe I could do something interesting with them.
Concerning the card description, I did not wanted to bother them more and found a way to retreive it somewhere in the webpage code. If someone knows about the API address for this piece of information, I'll be happy to update my Talend Job.
The Talend Job
tFileInputJson - dcity_card_definitions
Let's start with the dcity_cards_definitions. It's a "tFileInputJSON" component. It reads a JSON file As mentionned earlier, I prepared this file with some code I found. I declared it as a metadata. Metadata are a representation of a datastructure.
You just explain the mapping from the json fields to a flat structure. Talend uses flat structures as it's mainly intended for relational databases.
Just set the id as not nullable, as it's the unique key. Length is not that important.
The configuration of the component is easy as everything comes from the metadata.
tRestClient -- Call dCity API market_cards
We want to call this API : https://api.dcity.io/market_cards
There is nothing notable about this API, just a GET request and a full text response in JSON format.
Just set "Accept Type" to "any".
This will set the component schema to the following :
You get 3 fields: statusCode, body and string. The response will be in the "string" field.
tReplace -- Reformat JSON Response
A tReplace component is used to replace strings. Simple mode is a simple text comparison and replacement. Advanced Mode allows the use of regexp patterns.
We need this step to change the format of the API JSON response.
As you can see, the card identifier (here com_pola) is a key, not a value. It makes sense when you need to access the data concerning a specific cards, which is probably the way dCity works.
To make use of this, I should parse the file as many time as there are cards, to get the data related to each card. This is not an efficient way to make use of an ETL.
Or I should be able to access the key as it was a value, which is not possible with JSON.
So we will change the layout of the response to something like this :
"cards" becomes an array of the different kind of cards.
Each element of this array is an object containing a card_id, and a card_list.
card_list becomes, in turn, an array containing all the said cards that are available with their price.
That's what took me the longest work. I'm used neither to JSON format nor to regexp, so I had to fiddle a lot before finding the right format and expressions.
https://jsonformatter.curiousconcept.com/# was a life saver to check my final format.
Here's the list without the java escaping backslashes :
Source field | column2 | column3 | description |
---|---|---|---|
string | ("\w+"):[ | {"card_id": $1, "card_list":[ | Gives a name to the keys |
string | "cards":{ | "cards":[ | Rewrite the beginning of the JSON |
string | ], | ]}, | End of each card list, because we wrapped it in an element |
string | }]}} | }]}]} | End of JSON |
So that's it! We only keep the "string" field, as we won't use the response code or body.
tExtractJsonFields -- Extract First Loop
Little tweak here : we could do it in one step using the XPath syntax which allows to point to the parent node, but I read it was slower. I benchmarked it and it was indeed much more efficient to use the JSONPath syntax with 2 steps.
The first step loops on the "cards" array, getting the card identifier ("card_id") and the array "card_list".
We have the "string" which is the re-written API response as input and as output "card_id" and "card_list".
The loop path is "$.cards.[*]" and the JSON field is "string".
Configuration is very basic.
tExtractJsonFields -- Extract Second Loop
We will only retrieve some fields, and only price and token are really usefull. We create the output fields in the schema editor as seen below.
Just set the JSON field to card_list, the loop path as "$.*" and map the columns.
As you can see, the column "card_id" has no Json query. As it is already present in the incoming schema, Talend will automatically transfer the value.
Now we have a line per card sold on the market.
tFilterRow -- Filter - Keep Sim Tokens Only
For the time being, I'm only interested in cards sold in SIM. So, let's filter out anything else.
tAggregateRow - Get lowers ask for each card type
tAggregateRow has the same role as a SQL "group by" instruction. We want to find the lowest price tag for every card type. And that's actually the only information we need.
The output schema only needs card_id and price.
Configuring this component is really easy : select the "group by" fields first. Here we to group by card_id. Then apply the grouping function in "Operations" : output columns is price, samed as input and the function is "min". It is possible to have multiple calculations (and thus multiple output columns) based on the same input column.
Now we have one line per card type, and the lowest sell price known.
tMap -- Map Ask to card definition
tMap is the most powerful component in Talend. In here, you can rename fields, cast fields to another type, compute values, create complex text, "join" multiple flows, etc. We will use it to join the list of every card type along their definition with their price.
The joining key is "id" from cards_definitions and "card_id" from the API. Just drag-drop the id field into the "Expr. key" column in front of the "card_id" field. By default, it will be a "Left Join" behaviour : exactly what we need.
That's where we compute the amount of days we need to cover the buying of a card.
The syntax is in Java :
(row7.price != null && row3.income != null ? row7.price / row3.income : 0)
First we should check that none of the value are null to prevent null pointer exceptions. Then you can compute the ROI days by divinding price per card income. Set 0 if something is null.
tFileOutputExcel - Write into Excel
Final step : writing the results in an Excel file. Once again, nothing fancy, Talend takes care of everything. Set the file path, toogle the "include header" checkbox, the "Write excel2007 file format" checkbox and you're ready to go.
Conclusion
I never really used JSON, JSONPath or API calls before and always struggled with regexp. So It was a very good exercice for me. Once I flattened the JSON to a table structure, it took me 10 minutes to finish the flow and write it to an Excel file.
As you can see on the job picture, just renaming the components gives a very readable flow, which means that anyone can come after me and take over the work.
The job is quite fast as it runs in 3 or 4 seconds once the API responds.
What to expect next
I'll try to get informations about the current tax as well as my current technology cards in order to impact it on the ROI computation. This will be usefull for churches and students.
I found a external component to write to a google sheet, and would like to test it. Writing into a google sheet could allow me to make a litte Datastudio dashboard.
Final Word
Don't hesitate to ask questions or share your personal experience ! I'm here to learn !