Friday, April 30, 2010

How is email spam found out?

Email spam is not just a problem but a menace, given the growing number of active spammers. Detecting spam among a collection of legitimate mails is a very important data classification problem. However, the cost of classifying a legitimate email as spam is higher than the cost of letting a spam mail through. Legitimate mail is known as ham, and ham that is wrongly classified as spam is a false positive. Hence, the threshold is selected in such a way as to avoid false positives.

Various features of the mail are used to judge whether it is spam or not, such as the sender's mail id (blacklisted or not) and the content of the mail in terms of the occurrence of words common in spam, like 'free' and 'buy'.
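The feature-based judgment described above can be sketched as a simple scoring function. This is only an illustration and not how any particular filter works: the word list, the weights, the blacklisted address and the threshold are all made-up assumptions.

```python
import re

# Hypothetical feature weights, chosen only for illustration.
SPAM_WORDS = {"free": 2.0, "buy": 1.5, "winner": 2.5}
BLACKLISTED_SENDERS = {"offers@spam.example"}
BLACKLIST_WEIGHT = 5.0
THRESHOLD = 4.0  # kept high so that legitimate mail is rarely flagged


def spam_score(sender, body):
    """Add up the weights of all spam-like features found in one mail."""
    score = BLACKLIST_WEIGHT if sender.lower() in BLACKLISTED_SENDERS else 0.0
    for word in re.findall(r"[a-z']+", body.lower()):
        score += SPAM_WORDS.get(word, 0.0)
    return score


def is_spam(sender, body, threshold=THRESHOLD):
    """Flag a mail as spam only when its score reaches the threshold."""
    return spam_score(sender, body) >= threshold
```

Raising the threshold trades missed spam for fewer legitimate mails landing in the spam folder, which is usually the safer side to err on.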

Email Spam – detection and anti detection methods

Server-side spam filters such as SpamAssassin use rules that classify mail into spam and legitimate mail based on the occurrence of different types of words and features. It makes use of a neural-network-based method to do the classification. SpamBayes provides desktop tools that work with Outlook, with POP3 and IMAP accounts such as Gmail and Yahoo, and with many other popular email clients. It is based on statistical analysis of the email content.

Methods that rely on content-based classification have been very effective, because the spammer has to deliver a message whose content differs from legitimate mail in many ways.

The empire fights back

The spam empire has fought back with many changes. Tools such as Spam checker will check a mail and suggest synonyms or other changes to make it look less like spam and more like legitimate mail. Although the spammer can make such changes, he can't make the mail ridiculously complex and incomprehensible. New rules, and new synonyms to escape those rules, are an ongoing battle between spammers and spam detectors. Gmail uses the same content-based approach to decide which ads are relevant for a particular mail user.

Using images in place of the words that might give them away has been a popular method among spammers to avoid detection by such content-based methods. Recognizing the characters and words in the images and checking them for spam is the obvious solution. This seems to be a never-ending battle.

Wednesday, April 28, 2010

Would a rose by any other name smell as sweet?

Whenever there is a huge rush to rename anything from a street to a nation, people often wonder what the point of it all is. Wouldn't a rose by any other name smell as sweet? Well, names are important, or else we would not have a rename file option in Windows.

Let's see if it would smell as sweet in the cases below, in which renaming was done:

  1. Hot Springs in New Mexico changed its name to Truth or Consequences to get a popular radio program hosted from its city.
  2. Clark in Texas (America) renamed itself DISH and got itself free satellite television for 10 years in the bargain.
  3. Más a Tierra in Chile renamed itself Robinson Crusoe Island.

The new names rock, at least in these 3 cases. Maybe we should have thought a little more and come up with really catchy names for our Indian cities before renaming them. We didn't even get any commercial benefits from renaming them, just the headache of having to change the names in all the places. Here are a few "cool" names that we could have used instead:

Kolkata - A Coal Cat
Mumbai - Bye Mom
Pune - Puma
Bengaluru - U R Ben's Gal

Do let me know if you come up with something cool to rename to.

Finding patterns and rules in english words

I took a few hundred English words starting with the letter "G" and a few starting with the letter "E", hoping to find some pattern or rule in the way the letters after the first one are arranged. I managed to get as many as 51 rules for the way the letters appeared. The rules were all generated based only on the occurrence of a particular letter at a particular position.

The most effective and pervasive rule was Letter2(A) => D(G): if the second letter is "A", the word belongs to the "G" class. In other words, in my sample many of the words starting with "G" had "A" as their second letter.
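Rules of this kind can be reproduced with a few lines of counting. The sketch below is not the tool used for the 51 rules; the function name, the confidence and support cut-offs and the tiny word sample are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def letter_rules(words, position=1, min_confidence=0.6, min_support=2):
    """Find rules of the form 'letter at position => first letter of the word'.

    position is zero-based, so position=1 means the second letter.
    Returns (letter, first_letter, confidence, support) tuples.
    """
    counts = defaultdict(Counter)  # letter at position -> counts of first letters
    for word in words:
        word = word.lower()
        if len(word) > position:
            counts[word[position]][word[0]] += 1

    rules = []
    for letter, by_class in sorted(counts.items()):
        total = sum(by_class.values())
        first_letter, support = by_class.most_common(1)[0]
        confidence = support / total
        if support >= min_support and confidence >= min_confidence:
            rules.append((letter, first_letter, confidence, support))
    return rules


# Made-up sample, far smaller than the few hundred words used in the post.
sample = ["game", "garden", "gate", "gift", "glow", "east", "echo", "edge", "elbow"]
for letter, first_letter, confidence, support in letter_rules(sample):
    print(f"Letter2({letter.upper()}) => D({first_letter.upper()})  "
          f"confidence={confidence:.2f} support={support}")
```

Support is the number of words that match the rule, and confidence is the share of words with that letter at that position which fall in the predicted class.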

If I made the search more knowledgeable by adding knowledge of vowels and consonants, more rules with better patterns would turn up. It might be even more interesting to try to find patterns in sentences.

Tuesday, April 27, 2010

Biodiesel and Photobioreactors

Biodiesel is a fuel similar to diesel that is obtained from oil-rich plants such as rapeseed, soy, oil palm and sunflower, from used cooking oils, or from phototrophic microorganisms. It has the advantage of producing smaller amounts of greenhouse gases than fossil fuels. The main reason for the success of biodiesel is that it can be used without modification to engines and distribution systems.

Biodiesel has some of the same problems as bioethanol. It can start a competition for land with other agricultural crops, causing a decrease in food supply or an increase in food prices. Hence, the focus is on microalgae and cyanobacteria to produce biodiesel. Since the microbes can grow in saline environments, they are not as much of a threat to food crop cultivation. These methods for producing biodiesel from microbes are still experimental and slow. Developments in bioreactor design and genetic modification of the microbes may make them more viable in the future.

Growing the microbes required for biodiesel production requires photobioreactors as the microbes get their energy by photosynthesis. The photobioreactors can be mainly classified into open and closed systems. Open systems are lakes and natural ponds which can be used to grow the microorganisms. Closed systems are tubular or flat panel shaped bioreactors. The tubular bioreactors can be horizontal or vertical. Closed bioreactors have the advantage of not being contaminated and can be easily controlled. Open bioreactors have cost benefits.

The design of the bioreactor is driven by various factors such as light considerations, gas exchange, nutrient availability, product recovery and contamination. Proper mixing is required to ensure time for both dark and light reactions to occur. Cooling is required to remove the heat due to high irradiation. Too much light is absorbed by the cells at the surface of the culture and lost as heat; this is known as the shading problem. Genetically engineering the cells to have smaller photosynthetic antennae seems to reduce this problem considerably.

Idea: The shading problem can be overcome by having cells of two different types in the reactor. The first type of cells do the light reaction and are positioned at the surface of the culture. Second type of cells do the dark reaction below the surface. The two cell types interact and exchange the products of their respective reactions through the medium.

Monday, April 26, 2010

Accessing Windows drives from the Linux terminal

Many of us use Linux and Windows on the same machine. The new breed of Linux installs does a wonderful job of mounting the Windows partitions for us. We can open the files and work on them from the comfort of the GUI.

Some files and programs need to be run from the command line or terminal, especially on Linux. How do we access the Windows drives on the command line, given that Linux does not have C, D and E drives at the command line?

The drives can be easily accessed from the /media/ folder. Just type cd /media in the terminal and you will land in the media folder. This folder will contain the list of all drives mounted in the recent past. Many of these folders will not be accessible outside the 'root' group of users. To know which folders are your Windows drives, just run ls -lrt in the terminal. The Windows drive folders will have read, write and execute permission for you. Just cd into one of those folders and access all the files on the Windows drives from the Linux terminal.
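The same check can be scripted instead of reading the permission columns by eye. This is a small sketch assuming the drives are mounted directly under /media as described above; the function name is made up, and on some systems the mounts may sit one level deeper, in a folder named after the user.

```python
import os


def accessible_mounts(root="/media"):
    """Return the folders under root that the current user can read, write and enter."""
    try:
        names = sorted(os.listdir(root))
    except OSError:
        # root is missing or cannot be listed by this user
        return []
    found = []
    for name in names:
        path = os.path.join(root, name)
        if os.path.isdir(path) and os.access(path, os.R_OK | os.W_OK | os.X_OK):
            found.append(path)
    return found


if __name__ == "__main__":
    for path in accessible_mounts():
        print(path)
```

Any folder it prints is a candidate Windows drive that you can cd into from the terminal.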

Sunday, April 25, 2010

First contact - fear of the aliens returns

The very first time we humans meet an alien race is a dream/nightmare scenario that has been played out many times in books and movies alike. The very existence of alien life forms has been postulated and hypothesized based on probability calculations. Considering the size and content of the universe, it might be expected that life could originate in other parts of the universe. Whether these life forms will just be microbes or intelligent beings is a totally different question.

The scientific community has seen contact with an alien race as a positive thing, going as far as to send out signals into space with directions to our planet. The Search for Extraterrestrial Intelligence is underway, with many different approaches being tried. Benefits of a friendly alien race aside, the possibility of life outside planet Earth gives us hope of a second home, which we may one day require.

Is it really wise to be doing this? What if the alien race just plunders Earth's resources? Concerns regarding this have been raised by the cosmologist Stephen Hawking, famous for his book "A Brief History of Time". Be it microbial or any other form of life we encounter, we may not be ready to face the complications that might arise. New and deadly diseases of alien origin could wipe out the entire planet. So do we just lie low till we are technologically more advanced, or do we try our luck in finding an alien?

Taking candy from a baby

"Taking candy from a baby" is defined as something that is easy to achieve in Wiktionary and many other sites. However, is our interpretation correct? Come to think of it, there are a lot of things that are easier than taking candy from a baby. Similes such as "easy as falling off a log" describe something that's just easy to achieve, as nothing mean is involved.

A more appropriate usage would be for doing something mean that's easy at the same time. Does it just reflect that being mean is so much a part of our everyday lives that we are not thinking about it at all? Just take a look at this video and you will get an idea of how mean it really is. The baby will surely cry for a while. Taking candy from a baby sure requires a strong will and a stone heart.

It seems there is also an explanation that the origin of this simile has nothing to do with candies or babies. It's interpreted as taking C.A.N.D (an abbreviation for a type of cargo) from Bay B of a ship. Since ships generally use Bay B as the sick bay, it's impossible to locate it, let alone take the cargo out of it. So basically, taking candy from Bay B was impossible. A linguist (April 1st joke) explains how the expression originated on board ships and came to mean something else after coming ashore.

We may never be able to prove if this is just a result of liberal attitudes towards being mean or the result of not having Bay B in ships.

Thursday, April 22, 2010

A Rough Set-Based Model of HIV-1 Reverse Transcriptase Resistome

A huge effort is being put into finding a treatment for AIDS. HIV, the causative agent of AIDS, has been studied in ever-increasing detail to produce effective antiviral therapies. The high rate of replication and mutability of the virus leads to rapid drug resistance. Efforts to overcome the AIDS pandemic will require drugs or drug regimens that can control drug resistance in the virus.

Reverse transcriptase is a viral enzyme that is required for reverse transcribing the viral RNA into DNA. This step is required for the viral genes to get integrated into the host genome. Only after integrating into the host genome can the virus replicate and propagate. Drugs that inactivate this enzyme can be very effective in stopping the replication of the virus. However, the rapid emergence of drug resistance in the enzyme has made it difficult to treat AIDS with any single drug. Of the 25 drugs currently used in HIV therapy, 12 attempt to inhibit the reverse transcriptase enzyme.

Drug resistance generally occurs due to a non-linear combination of mutations. Being able to predict if a drug will be effective against a particular mutant has been a useful tool in treatment. Further research has also given details about the mechanisms of drug resistance development and functionally important regions in the enzyme.

In this study, local physicochemical properties of a protein sequence were used to understand and predict drug resistance. Annotated data from the Stanford HIV database was used to classify the mutants into three groups labeled as susceptible, moderately resistant and resistant. Monte Carlo feature selection was used to select the best features from a total of 7 × 560 properties. The selected features were used to generate rules to classify the sequences into the correct class. The method was tried on data available for different antivirals.

The results were evaluated by 10-fold cross-validation of the data. Performance of the classifier was assessed based on prediction accuracy and area under the curve. The sites identified as responsible for resistance were found to correlate with the known sites. New sites that could lead to resistance have also been predicted. The newly discovered sites seem to cause resistance by disturbing the complex network of hydrophobic and polar interactions responsible for the stability of the tertiary structure.
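To make the evaluation step concrete, here is a minimal sketch of k-fold cross-validation in plain Python. It does not reproduce the rough-set classifier or the feature selection from the study; fit and predict are hypothetical placeholders for whatever classifier is being tested, and the shuffling seed is an arbitrary choice.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists so that every sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal, disjoint parts
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_accuracy(samples, labels, fit, predict, k=10):
    """Average accuracy over k folds; fit and predict are supplied by the caller."""
    accuracies = []
    for train, test in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, samples[i]) == labels[i] for i in test)
        accuracies.append(hits / len(test))
    return sum(accuracies) / len(accuracies)
```

Each sequence is used for testing exactly once and for training in the other nine rounds, so the reported accuracy is never measured on data the rules were generated from.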

Wednesday, April 21, 2010

Bioethanol - fuel of the future?

Most alternative energy sources, such as solar, wind and nuclear energy, have the major drawback of not being usable as automobile fuels. Automobiles are one of the major consumers of crude fuels today. This makes it necessary to have an alternative energy source that can be used in today's automobile engines with little or no modification. Bioethanol is one such alternative which has shown significant potential.

Ethanol is produced by fungi such as Saccharomyces cerevisiae and bacteria such as Zymomonas mobilis. The raw material for this production of ethanol is sugar plants, cereals or lignocellulose. The use of food crops for ethanol production has the disadvantage of having a negative impact on food production. Hence, the use of lignocellulose is a very attractive alternative.

Ethanol production is dependent on having effective production and storage of raw materials, pretreatment, fermentation, the production step itself and transport and use of the final product. Each of these steps has many problems which have to be overcome. Cost benefits and impact on other agricultural products are the main concerns with respect to bio ethanol.

Production of raw materials for ethanol has to take into account the environmental impact of increased use of fertilizers and pesticides. There has also been significant concern regarding the reduction of the rain forest to meet energy needs. However, sugar cane is not grown on rain forest land and does not actually have any impact on the rain forests.

Storage of raw materials has to provide the optimal conditions to maintain the correct water content for later use in fermentation. The raw material should also be protected from contamination and degradation during storage. Improvements in the fermentation and refinement of ethanol are also required to get better yields. Ethanol production has the advantage of being produced locally in most of the regions. However, concerns include over-utilization of land and destruction of rain forests. Integration of the different steps in the production of ethanol will increase efficiency.

Bioethanol is more sustainable than fossil fuels, but it may not be able to solely fulfill the growing need for energy.

Tuesday, April 20, 2010

Solar Energy and Solar cells

With the growing need for energy, alternative energy sources are being developed and refined. Solar energy is an attractive alternative as it is a relatively clean, renewable source of energy. Solar energy has been utilized in various ways such as for direct heating, electricity production and biomass production.

The energy needs by the year 2050 have been projected to be 28 TW, compared with the 11 TW used currently. Although solar energy could probably provide a significant share of the required energy, it needs to be made available at a reasonable cost. The price of a 100 W silicon panel for converting solar to electric energy is 350 to 400 US dollars, which is too expensive to be practical. The exponential growth of solar power, about 40% per year, has mostly been driven by huge government subsidies.
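The figures above are easier to compare after a little arithmetic. The short calculation below uses only the numbers quoted in this post; the doubling time is an extra illustration that assumes the 40% yearly growth stays constant, which is not something the post claims.

```python
import math

# Figures quoted in the post.
panel_watts = 100
price_low_usd, price_high_usd = 350, 400
current_demand_tw = 11
projected_demand_tw = 28

# Cost of panel capacity in US dollars per watt.
cost_per_watt_low = price_low_usd / panel_watts
cost_per_watt_high = price_high_usd / panel_watts

# How many times larger the projected 2050 demand is than the current one.
demand_growth = projected_demand_tw / current_demand_tw

# Assumption for illustration only: at a constant 40% growth per year,
# the doubling time is ln(2) / ln(1.4) years.
doubling_years = math.log(2) / math.log(1.4)

print(f"Panel cost: {cost_per_watt_low:.2f} to {cost_per_watt_high:.2f} USD per watt")
print(f"Projected demand is {demand_growth:.2f} times the current demand")
print(f"Doubling time at 40% yearly growth: {doubling_years:.1f} years")
```

So the quoted panel price works out to 3.5 to 4 US dollars per watt of capacity, while demand is expected to grow to roughly two and a half times the present level.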

A solar cell is a device that converts solar energy directly into electricity[1]. The first generation solar cells transform light energy by using crystalline or amorphous silicon as inorganic solid-state material. The first generation cells are very expensive due to the cost involved in purification and production of the solar cells. The second generation solar cells make use of thin film as the core of the solar cell. The 3rd generation of solar cells is inspired by photosynthesis and has shown the potential to be more cost effective.

Dye-sensitized solar cells have been used to generate a potential gradient to generate electricity. These solar cells have shown good performance in diffuse light and have a low investment cost to initiate production. Continued research has improved the dye stability to up to 15 years in sunlight. Titanium dioxide has emerged as the semiconductor of choice due to its abundance, non-toxicity, cost and compatibility.

Solar cells face a scalability problem, as the third generation cells are not yet cost effective at large scale. Further developments in the field will focus on better conversion efficiency and lower costs of production and maintenance.

My idea:

A biological model, such as a living organism capable of generating the potential gradient, could be an idea worth exploring, as the cost of production could be reduced. Many organisms are known to be capable of maintaining potential gradients. The challenge would probably be to combine the potential gradients of individual cells to get a higher net potential.

Sunday, April 18, 2010

Copy Number Variation - the root of all evil and good?

Human genomes contain repeated segments of DNA through most of the genome. When the number of copies of a repeat varies between different human beings, we have a copy number variation. A few regions of the genome are popular locations for such variations. These regions, which have different numbers of copies, are known as copy number variant regions.

Copy number variable regions (CNVRs) are of particular interest as they can be responsible for diseases and phenotypic differences between individuals. Similar to single nucleotide polymorphisms (SNPs), which are disease markers, these regions have been associated with many conditions. These regions have also been associated with resistance to infection by HIV and malaria.

The importance of these regions becomes apparent as copy number variation maps are being generated and added to genomic variant databases. This type of data will be useful in understanding the relations between CNVRs and specific characters. The evolutionary impact of these regions could also be analysed to get an understanding of how evolution proceeded.

It will require a clearer understanding of the role of CNVRs to really appreciate how much they influence the different characters. Maybe they are the root of all evil and good, but then again they might just be a part of the bigger puzzle.

Tuesday, April 13, 2010

Monday Morning - Swami and Friends

Our hate for Monday mornings may have something to do with the same things as Swaminathan's feelings of unpleasantness about this day. The joy of having enjoyed the freedom of Saturday and Sunday is overshadowed by the dreary nature of Monday mornings. Apart from the obviously long list of things to complete from the past two days, there is always this feeling of foreboding about Monday which is difficult to overcome.

Anything could happen on a Monday. Maybe Mr Ebenezar would take upon himself the duty of teaching us idiots the futility of idol worship. Worse still, we might feel the insane urge to contradict and question him. Even if we do manage to get through all this ruckus, will we have the sense not to tell anybody at home about it during the meal, and so avoid getting a stern letter written to the principal the next day?

Everything is not bad, as we have a few things to look forward to even on Mondays: there is the 12.30 mail to watch from the window. If we manage to solve the 5 arithmetic “puzzles” correctly, life would not be that bad.

Hydrogen from solar energy and water?

Industrialization has been largely driven by the continued discovery of oil reserves. However, the number of new oil finds is decreasing. A future with no oil left to use is a reality we have to face. Apart from the obvious problem of scarcity, fuels such as oil, coal and gas are known to contribute to global warming. The situation is further complicated by the growing need for energy from users who are yet to start using these energy resources.

Many alternative strategies such as solar energy, wind, nuclear, tidal, geothermal etc have been proposed to solve these problems. Although these alternative sources might be able to provide energy, it might not be possible to use them effectively as fuels for transportation systems. Transportation systems being the major consumer of fuels today may need a different approach. Loss of energy during the conversion process has made it necessary to have a direct product which can be used as a fuel.

Use of solar energy to produce fuels such as hydrogen has gained importance in this context. Hydrogen could be used directly as a fuel, and the lack of carbon in the fuel source makes it a rather clean source of energy. The problems of scarcity and global warming can both be tackled with this interesting approach. Two main approaches are being pursued to achieve this goal of using water to produce hydrogen using solar energy. The first approach is the photobiological method, which aims to create or alter a biological system to convert solar energy into hydrogen using water as raw material. The second approach is the chemical method, which uses photosystems, or molecules that imitate photosystems, coupled to other molecules to drive reactions that convert water into hydrogen.

Photosystem II uses solar energy to oxidize water, releasing electrons. This reaction is rather efficient, although the other steps happening in the biological systems are not as efficient. Hence, the aim is to mimic just this step of the process from nature. The chemical approach has used molecules such as ruthenium linked to the photosystems to act as an electron acceptor from manganese. This is used to drive the reaction to produce hydrogen from water. The enzyme hydrogenase, which can catalyse the reaction to produce hydrogen, is used in this second step of the reaction.

Biological systems such as Nostoc produce hydrogen in special cells using the enzyme nitrogenase. Currently, large and small scale reactors are being developed to produce hydrogen from such biological systems and make them as effective as possible.

The direct methods of producing fuels have been found to be much more effective than the indirect methods which require the energy to be converted to electricity which is then used to split the water molecules by electrolysis.

Sunday, April 11, 2010

Brick BAT - iGEM brick biosafety Assessment tool

With the increasing popularity of genetic engineering and synthetic biology, we are on the way to malicious biological content. We have seen programs like spyware and malware come out of the IT revolution. What horrors will come out of the advances in biology?

What if the crops of entire nations are held to ransom by a pathogenic viral strain? Worse still, the human population may be threatened. Organisms that copy our genetic material and take it for analysis without our permission are a distant possibility. Such spyteria (spy + bacteria) are a threat we must get ready to face.

A popular synthetic biology contest called iGEM (International Genetically Engineered Machine) is encouraging an engineering-based approach to synthetic biology. Although they are very serious about biosafety, a categorisation system for the various parts and components was not being used. I have come up with a simple categorisation scheme called Brick BAT. This is a set of questions to classify the components into different levels of biosafety.